National Yunlin University of Science and Technology (N.Y.U.S.T. I. M.)
Intelligent Database Systems Lab

Effective Multi-label Active Learning for Text Classification
Presenter: Wu, Jia-Hao
Authors: Bishan Yang, Jian-Tao Sun, Tengjiao Wang, Zheng Chen
KDD (2009)

Outline
- Motivation
- Objective
- Problem definition
- Methodology
- Experiments
- Conclusion
- Personal comments

Motivation
Multi-label text classification has received considerable attention, since many text classification tasks are inherently multi-label: a single news article may belong to Tourism, Entertainment, and Finance at the same time.

Motivation (cont.)
Multi-label information: with three categories c1, c2, c3, the class probabilities of two instances might be
- x1: [c1: 0.8, c2: 0.5, c3: 0.1]
- x2: [c1: 0.7, c2: 0.1, c3: 0.1]
A single-label classification task keeps only the top class (x1: [c1: 0.8], x2: [c1: 0.7]), whereas a multi-label task keeps the full vectors. Taking this multi-label information into account in the sample-selection strategy is therefore very important.

Objective
The authors propose a novel multi-label active learning approach for text classification.
- The sample-selection strategy aims to label the data that maximizes the reduction rate of the expected model loss.
- They also propose an effective method to predict the label set of multi-label data.

Problem definition
- Denote the training examples as x1, ..., xn and the k classes as 1, ..., k.
- The label set of xi is a binary vector yi = [yi1, ..., yik].
- The set of all possible class combinations has 2^k elements; for k = 2, the combinations of (yi1, yi2) are:

    yi1: +1  +1  -1  -1
    yi2: +1  -1  +1  -1

Problem definition (cont.)
- SVMs are used as binary classifiers: fi denotes the binary classifier associated with target class i. Given a test instance x', if fi(x') > 0, then x' belongs to class i.
- A pool-based active learning approach is used.
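As a concrete illustration, the pool-based loop can be sketched with a toy one-dimensional setup. Everything below (the threshold "classifier", the smallest-|f(x)| query rule, and the `oracle` labeler) is a hypothetical stand-in for the paper's SVM machinery, not its actual implementation:

```python
# Toy pool-based active learning sketch (illustrative only, not the paper's method).

def train(labeled):
    """Toy 1-D 'classifier': threshold halfway between the class means."""
    pos = [x for x, y in labeled if y == +1]
    neg = [x for x, y in labeled if y == -1]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: x - t  # f(x) > 0 -> class +1

def query(f, unlabeled):
    """Select the instance with the smallest |f(x)| (most uncertain)."""
    return min(unlabeled, key=lambda x: abs(f(x)))

def active_learn(labeled, unlabeled, oracle, rounds):
    """Each round: retrain, query the pool, ask the oracle, grow the labeled set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        f = train(labeled)
        x = query(f, unlabeled)
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))  # human annotator supplies the true label
    return train(labeled)
```

The smallest-|f(x)| rule plays the role of the SVM-margin uncertainty criterion: points near the decision boundary are assumed to shrink the version space the most when labeled.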
Denote the labeled data by Dl and the remaining unlabeled data by Du.

Methodology
- Let P(x) be the input distribution and fDl the multi-label prediction function trained on Dl; the predicted label set of x is fDl(x).
- Let y be the true label set of x. The loss on x is L(fDl(x), y), and taking its expectation over P(x) gives the model loss L(fDl).

Methodology (cont.)
- The active learner evaluates each possible set of unlabeled data Ds to find the optimal query set Ds*.
- The new training set is Dl' = Dl ∪ Ds, and the expected loss is measured for the classifier trained on Dl'.
- The optimization problem is to find the query set Ds* that minimizes this expected loss.

Methodology (cont.) - Sample selection strategy with SVM
The optimization problem raises two questions:
- How to measure the loss reduction of the multi-label classifier.
- How to provide a good estimate of the conditional probability p(y|x).
Estimating loss reduction: the SVM margin is used as the measure of the version-space size.

Methodology (cont.)
- The size of the version space of the binary classifier associated with target class i, learnt from the labeled data Dl, serves as the loss measure.
- After adding a new data point (x, yi), where yi is the true label of x on class i, the new model loss is compared against the old one on that binary classifier.

Methodology (cont.) - Label prediction
- Suppose there are k classes, so we have k binary classifiers. Given data x, consider the probability of x belonging to each class i.
- The logistic regression (LR) algorithm is used to predict the number of labels.
- Before LR is applied, the authors transform the SVM decision outputs on the training data into classification probabilities using the sigmoid function.
- Pipeline: SVM classifiers → sigmoid probabilities → sort the probabilities → train a logistic regression classifier on the sorted probabilities to predict the label count → output the labels with the largest probabilities.
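The label-prediction pipeline can be sketched as follows. The sigmoid parameters and the `num_labels_fn` count predictor are illustrative stand-ins: in the paper, the sigmoid mapping is fit on training data and the count predictor is a trained logistic regression classifier.

```python
import math

def sigmoid(z, a=-1.0, b=0.0):
    # Map an SVM decision value to a probability; the slope a and intercept b
    # would normally be fit on training data (defaults here are illustrative).
    return 1.0 / (1.0 + math.exp(a * z + b))

def predict_labels(decision_values, num_labels_fn):
    """decision_values: fi(x) for each of the k binary SVMs.
    num_labels_fn: stand-in for the trained logistic regression that maps the
    sorted probability vector to a label count m."""
    probs = [(sigmoid(z), i) for i, z in enumerate(decision_values)]
    probs.sort(reverse=True)                  # largest probability first
    m = num_labels_fn([p for p, _ in probs])  # predicted number of labels
    m = max(1, m)                             # every instance gets at least one label
    return sorted(i for _, i in probs[:m])    # top-m classes form the label set
```

For example, `predict_labels([2.0, -1.0, 0.5], lambda ps: sum(p > 0.5 for p in ps))` uses a naive thresholding count predictor in place of the trained LR and returns the indices of the two most probable classes.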
Methodology (cont.)
Incorporating the predicted label vector into the expected loss estimation gives the data-selection strategy: Maximum loss reduction with Maximal Confidence (MMC). The unknown true label vector yi is replaced by the predicted label vector.

Experiments
- The Micro-Average F1 score is used as the evaluation measure, where n is the number of test instances, yi is the true label vector of the i-th instance, ŷi is the predicted label vector, and k is the number of classes.
- Label-prediction methods are compared on the RCV1-V2 data set.
- Sensitivity experiments vary the sampling size per run.

Conclusion
- MMC reduces the required size of labeled data in multi-label classification while maintaining favorable accuracy.
- The method outperforms the other active learning techniques on multi-label text classification by a large margin and can significantly reduce the labeling cost.

Comments
- Advantage: the paper includes many experiments demonstrating the method's performance.
- Drawback: …
- Applications: news and e-mail classification; image classification.
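For reference, the Micro-Average F1 measure used in the experiments pools true positives, false positives, and false negatives over all instances and classes before computing precision and recall. A minimal sketch (label vectors here use 0/1 entries; the deck's ±1 vectors map to these directly):

```python
def micro_f1(true_labels, pred_labels):
    """true_labels, pred_labels: lists of binary vectors, one per instance,
    one 0/1 entry per class (the yi / ŷi notation of the deck)."""
    tp = fp = fn = 0
    for y, yhat in zip(true_labels, pred_labels):
        for t, p in zip(y, yhat):
            tp += (t == 1 and p == 1)  # correctly predicted label
            fp += (t == 0 and p == 1)  # spurious predicted label
            fn += (t == 1 and p == 0)  # missed true label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Micro-averaging weights every (instance, class) decision equally, so frequent classes dominate the score; this matches the pooled-count definition rather than a per-class (macro) average.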