A major problem in classification learning is the lack of ground-truth labeled data. It is usually expensive to label new data instances for training a model. To address this problem, domain adaptation in transfer learning has been proposed to classify target-domain data by using data from some other source domain, even when the two may have different distributions. Though effective, such adaptation cannot be applied properly when the difference between the source and target domains is large.
In this paper, we design a novel transfer learning approach, called BIG (Bridging Information Gap), which effectively extracts useful knowledge from a worldwide knowledge base and uses it to link the source and target domains for improving classification performance. BIG applies when the source and target domains share the same feature space but have different underlying data distributions. Our main contribution is to use a large amount of worldwide knowledge to build a bridge linking the two domains, thereby reducing the differences in data distribution between the training and test data, and to demonstrate that this approach significantly outperforms the baseline methods. Using the auxiliary source data, we can extract a "bridge" that allows cross-domain text classification problems to be solved using standard semisupervised learning algorithms.
Keywords: Text Classification, Classification Learning, Semisupervised Learning, Bridging Information Gap
INTRODUCTION
Text classification aims to assign a document to one or more categories based on its content. It is a fundamental task for Web and document data mining applications, ranging from information retrieval and spam detection to online advertisement and Web search. Traditional supervised learning approaches for text classification require sufficient labeled instances in a problem domain in order to train a high-quality model. However, it is not always easy or feasible to obtain new labeled data in the target domain of interest. The lack of labeled data can seriously hurt classification performance in many real-world applications. To address this problem, transfer learning techniques have been introduced: they capture shared knowledge from related domains where labeled data are available, and use that knowledge to improve the performance of data mining tasks in a target domain. In transfer learning terminology, one or more auxiliary domains are identified as the source of knowledge transfer, and the domain of interest is known as the target domain. However, transfer learning may not work well when the difference between the source and target domains is large.
Our observation is that such a gap can potentially be found and bridged using knowledge from other domains. To this end, we introduce a bridge between the two domains by leveraging additional knowledge sources that are readily available and have wide coverage. Such a knowledge source can be a third domain, such as Wikipedia or the Open Directory Project (ODP). For example, the connection between commutative algebra and geometry can be found through a large knowledge base on algebraic geometry topics. Once we find such a knowledge bridge, we can use the auxiliary data and semisupervised learning methods to fill in the information gap.
Domain Adaptation
Domain adaptation has attracted increasing attention in recent years. In general, previous domain adaptation approaches can be classified into two categories: instance-based approaches and feature-based approaches.
Instance-based methods seek reweighting strategies on the source data such that the source distribution can match the target distribution. Feature-based methods try to discover a shared feature space in which the distributions of the different domains are pulled closer. Both types try to discover the relation between the source and target domains within the scope of those two domains alone.
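As a concrete illustration of the instance-based idea, the following is a minimal sketch (our own illustration, not any cited system) that estimates importance weights P_t(x)/P_s(x) for source instances using a probabilistic domain classifier, a standard reweighting trick; the array names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def instance_weights(X_s, X_t):
    """Estimate importance weights P_t(x)/P_s(x) for source instances
    by training a classifier to distinguish source from target data."""
    X = np.vstack([X_s, X_t])
    y = np.concatenate([np.zeros(len(X_s)), np.ones(len(X_t))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # P(domain = target | x), clipped to avoid division by zero
    p = np.clip(clf.predict_proba(X_s)[:, 1], 1e-6, 1 - 1e-6)
    return p / (1 - p)  # density-ratio estimate per source instance
```

Source instances with large weights resemble the target distribution and would be emphasized during training.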
Data Mining with Online Knowledge Repository
A major component of our approach is to use online knowledge repositories as auxiliary information sources to help bridge the gap between the source domain and the target domain. The underlying idea of such a framework is that, for each classification task, a very large external data collection, called a "universal data set," is gathered, and a classification model is then built on both a small set of labeled training data and a rich set of hidden topics discovered from that collection. However, building a semantic kernel from the whole knowledge base is costly. Substantial cost can be saved by considering only the "most useful" concepts to bridge the information gap. Moreover, instance-based transfer, and our method in particular, adds interpretability to the transfer scheme: it is easy to study what kind of instances is useful for bridging the gap, in contrast to a more compact and abstract semantic kernel. In this paper, we propose to incorporate background knowledge efficiently from the instance-based transfer perspective, which also establishes a connection between transfer learning problems and traditional semisupervised learning problems.
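To make the "hidden topics" idea concrete, here is a minimal sketch of discovering topics from a universal data set using scikit-learn's LDA implementation; the toy corpus is hypothetical, and this is not the exact pipeline of the systems discussed above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in for a huge external collection (e.g., Wikipedia texts).
universal_docs = [
    "rockets launch toward the moon",
    "algebraic geometry studies polynomial equations",
    "spam filters classify unwanted email",
]

vectorizer = CountVectorizer(stop_words="english")
X_universal = vectorizer.fit_transform(universal_docs)

# Discover hidden topics; the document-topic proportions can serve as
# additional features alongside the original bag-of-words representation.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(X_universal)
```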
Semisupervised Learning
Domain adaptation could be viewed as transductive transfer learning if the source domain and the target domain had no information gap; in that case, the problem reduces to a semisupervised learning problem. However, when there is an information gap, how to exploit semisupervised learning is not clear. Semisupervised learning addresses the situation where the labeled data are too few to build a good classifier; it makes use of a large amount of unlabeled data, together with the small amount of labeled data, to enhance the classifiers. The transductive SVM (TSVM) builds a connection between the class distribution and the decision boundary by placing the boundary in low-density regions. Its goal is to find a labeling of the unlabeled data such that a linear boundary has the maximum margin on both the original labeled data and the unlabeled data; it can be viewed as an SVM with an additional regularization term on the unlabeled data. Graph-based semisupervised methods define a graph whose nodes are the labeled and unlabeled examples and whose edges reflect the similarities between examples.
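To make the graph-based flavor concrete, here is a minimal sketch using scikit-learn's LabelSpreading (our illustration, not the TSVM used later in the paper; scikit-learn does not ship a transductive SVM). Unlabeled examples are marked with -1.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def propagate_labels(X_labeled, y_labeled, X_unlabeled):
    """Graph-based semisupervised learning: build a similarity graph over
    all examples and propagate labels; unlabeled points are marked -1."""
    X = np.vstack([X_labeled, X_unlabeled])
    y = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X, y)
    return model.transduction_[len(X_labeled):]  # labels inferred for unlabeled data
```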
Features of Domain Adaptation & Semisupervised Learning
Transfer Learning
Cross-domain classification is related to transfer learning, where the knowledge acquired to accomplish a given task is used to tackle another learning task. One representative line of work estimates the covariance between terms from auxiliary data; the resulting term covariance is then applied to the target learning task.
Overall Perspective
For instance, if the covariance between the terms "moon" and "rocket" is high, and "moon" usually appears in documents of a certain category, it can be inferred that "rocket" also supports the same category, even without observing this directly in the training data. Another line of work uses a boosting algorithm to address cross-domain classification problems. Its basic idea is to select useful instances from auxiliary data with a different distribution and use them as additional training data for predicting the labels of test data. However, in order to identify the most helpful additional training instances, the approach relies on the existence of some labeled test data, which in practice may not be available.
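The following toy computation (entirely illustrative; the counts are made up) shows how class evidence attached to "moon" can transfer to "rocket" through term covariance.

```python
import numpy as np

# Toy term-document counts from an auxiliary corpus.
# Columns: "moon", "rocket"; the two terms co-occur in the same documents.
X_aux = np.array([[1, 1],
                  [1, 1],
                  [0, 0],
                  [0, 0]], dtype=float)
cov = np.cov(X_aux, rowvar=False)   # 2x2 term covariance matrix

w = np.array([1.0, 0.0])            # class evidence observed only for "moon"
w_transferred = cov @ w             # "rocket" now receives support via covariance
print(w_transferred)                # second entry is nonzero
```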
METHODOLOGY
The problem of domain adaptation is likely to be encountered in many real-world applications. For example, we may have trained a sentiment classifier for movie reviews but want to use it to classify reviews from other domains such as books or music [12]. As another example, we may have trained a classifier to categorize news into topical categories but want to use it on blogs as well. In these cases, we do not want to relabel the data in the new domains but rather borrow the knowledge from the old domains. When the differences between the source and target domains are large, a model trained on the source domains cannot generalize well to the target domain data. A natural approach is transductive learning, since unlabeled data from the target domain are available. However, previous work has found that even after introducing unlabeled data from the target domain, transductive learning is still not sufficient to improve the performance. The reason may be that transductive or semisupervised learning generally assumes that the decision boundary lies in a low-density region of the feature space. When the distributions of the source and target domains are different, there may exist a low-density region between the domains, a gap that disconnects same-class data across domains. We refer to this gap as the information gap in domain adaptation.
Fig. 1: Information Bridging
To solve the problem of domain adaptation under large information gaps, an intuitive idea is to find the knowledge shared between the domains and ignore the differences. One instantiation of this idea is to make use of the abundant and potentially useful information sources that are readily available, and use them to connect the information separated by the gap. Such an intuition motivates us to think of a different way of solving the domain adaptation problem, i.e., through finding an information bridge.
Margin as Information Gap
An intuitive way to understand the concept of information gap is to consider the separability of the source and target domains. Consider the simplest case, where we want to transfer knowledge from a single source domain to a target domain. Intuitively, the difficulty of separating these domains indicates how large the information gap between them is. If the two domains can be easily separated, then there exists a large information gap between them, which may prevent us from adapting the model learned on the source to the target domain. On the contrary, if the two domains cannot be separated from each other easily, then the information gap is small, in which case we can treat the two domains as data sampled from a single underlying distribution. In other words, the original "domain adaptation problem" is transformed into a classification problem under the supervised setting or a semisupervised (transductive) setting. A similar idea has been used in prior work, where a classifier is trained to distinguish the source and target domains and the classification error is used as an empirical estimate of the domain distance. Although this idea is useful, it does not consider the existence of auxiliary information sources that can be used to bridge the two domains.
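A minimal sketch of this separability test, assuming scikit-learn and hypothetical feature matrices for the two domains: cross-validated accuracy near 1.0 signals a large information gap, while accuracy near 0.5 means the domains are hard to tell apart.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def domain_separability(X_source, X_target):
    """Accuracy of a domain classifier as an empirical gap estimate."""
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()
```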
BIG: A Min-Margin Algorithm to Reduce Information Gap
Its inputs include the source-domain data set Ds, the target-domain data set Dt, and the auxiliary domain data K. The output of the algorithm consists of unlabeled data chosen so that we can apply semisupervised learning algorithms to train a classifier. These unlabeled data carry important information about the distributions bridging the two domains.
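The full algorithm is not reproduced here, so the following is only our minimal sketch of the min-margin intuition: auxiliary instances closest to the hyperplane separating Ds from Dt are taken as bridge candidates. The function and variable names are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def big_select(X_s, X_t, X_aux, k=500):
    """Pick the k auxiliary instances lying closest to the source/target
    separating hyperplane, i.e., inside the information gap."""
    X = np.vstack([X_s, X_t])
    y = np.concatenate([np.zeros(len(X_s)), np.ones(len(X_t))])
    clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
    gap_scores = np.abs(clf.decision_function(X_aux))  # distance to hyperplane
    return np.argsort(gap_scores)[:k]                  # indices of bridge instances
```

The selected instances would then be added as unlabeled data for a semisupervised learner such as TSVM, as described above.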
Experimental Results
Our experiments are designed to answer the following two questions:
1. Can our min-margin-based semisupervised learning approach outperform traditional transfer learning approaches on the domain adaptation tasks?
2. Can our algorithm automatically identify the most important documents connecting the source domain to the target domain?
Question 1: Comparison with Traditional Transfer Learning Methods.
We adopt three classification models for domain adaptation:
a) The first model is Support Vector Machines (SVM), which is usually used for supervised learning;
b) The second model is Transductive SVM (TSVM), which is a semisupervised learning model;
c) The last model is Co-Cluster-based Classification (CoCC), a transfer learning model designed for cross-domain classification.
For all three baseline models, we use only the labeled documents in the source domain and the unlabeled documents in the target domain for training, and we evaluate each model on the test documents in the target domain. Since our BIG algorithm is based on TSVM, to obtain the optimal parameter settings for SVM, TSVM, and BIG, we first tune the parameter C to achieve the optimal accuracy via 10-fold cross-validation on each data set individually.
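For the SVM baseline, the C-tuning protocol could look like the following sketch (scikit-learn; array names hypothetical). Tuning TSVM would require an external transductive SVM implementation such as SVMlight, since scikit-learn does not provide one.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def tune_C(X_source, y_source):
    """10-fold cross-validated grid search over C on the labeled source data."""
    param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
    search = GridSearchCV(LinearSVC(max_iter=5000), param_grid, cv=10)
    search.fit(X_source, y_source)
    return search.best_params_["C"]
```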
Question 2: Comparison with Random Selection Methods
We adopt two random selection methods for comparison. The "Random 1" method randomly selects 500 data instances from the entire knowledge base; the "Random 2" method first uses the CandSelect algorithm to generate a candidate set and then randomly selects 500 instances from that candidate set. Once the 500 instances are selected by either Random method, they are added to the training set as unlabeled data, and we then apply the TSVM model for semisupervised learning. Our method is much better than both Random baselines, indicating that it indeed finds the most important nodes in the drift process through which the distribution shifts from the source domain to the target domain. We can also observe that "Random 2" is better than "Random 1," since the topic-model-based CandSelect algorithm filters out a large amount of irrelevant data from the knowledge base.
Convergence and Stability
We first demonstrate that our algorithm can reduce the information gap between domains as unlabeled data from the related domains are included. We randomly sample three tasks from each of the three data sets and display the performance together with the corresponding margin sizes.
CONCLUSIONS
In this paper, we proposed a novel framework for tackling the problem of domain adaptation under large information gaps. We model the learning problem as a semisupervised learning problem, aided by a method for filling in the information gap between the source and target domains with the help of an auxiliary knowledge base (such as Wikipedia). By conducting experiments on several difficult domain adaptation tasks, we show that our algorithm can significantly outperform several existing domain adaptation approaches when the source and target domains are far from each other. In each case, an auxiliary domain can be used to fill in the information gap efficiently. We make three major contributions in this paper.
1) Instead of viewing the problem of domain adaptation from the traditional instance-based or feature-based perspective, we view it from a new one: we consider transfer learning as the problem of filling in the information gap based on a large document corpus. We show that we can obtain useful information to bridge the source and target domains from auxiliary data sources.
2) Instead of devising new models for tackling the domain adaptation problem, we show that we can successfully bridge the source and target domains using well-developed semisupervised learning algorithms.
3) We propose a min-margin algorithm that can effectively identify and reduce the information gap between two domains.
FUTURE WORK
We plan to continue this line of research by pursuing several avenues. One is to validate our approach with other semisupervised learning algorithms and other relational knowledge bases, to demonstrate its effectiveness more extensively.