國立雲林科技大學
N.Y.U.S.T.
I. M.
National Yunlin University of Science and Technology
Effective multi-label active learning for
text classification
Presenter : Wu, Jia-Hao
Authors : Bishan Yang , Jian-Tao Sun , Tengjiao Wang ,
Zheng Chen
KDD (2009)
Intelligent Database Systems Lab
Outline

Motivation

Objective

Problem definition

Methodology

Experiments

Conclusion

Personal Comments
Motivation

Multi-labeled text classification problems have received
considerable attention, since many text classification tasks
are multi-labeled.
(Example categories: Tourism, Entertainment, Finance)
Motivation (Cont.)

Multi-label information: suppose an instance can belong to three categories c1, c2, c3.

Multi-label classification task:

x1 is [c1: 0.8, c2: 0.5, c3: 0.1]

x2 is [c1: 0.7, c2: 0.1, c3: 0.1]

Single-label classification task:

x1 is [c1: 0.8]

x2 is [c1: 0.7]

Considering multi-label information in the sample selection strategy is therefore very important.
Objective

The authors propose a novel multi-label active learning
approach for text classification.


The sample selection strategy aims to label the data that can help maximize the reduction rate of the expected model loss.

They also propose an effective method to predict labels for multi-label data.
Problem definition

Denote the training examples by x1, …, xn and the k classes by 1, …, k.

The label set of xi is represented by a binary vector yi = [yi1, …, yik], where yij = +1 if xi belongs to class j and yij = -1 otherwise.

The set of all possible class combinations is {+1, -1}^k. For example, with k = 2 the possible label vectors for an instance x1 are:

y11   y12
+1    +1
+1    -1
-1    +1
-1    -1
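The k = 2 example above generalizes directly: for k classes there are 2^k candidate label combinations, which can be enumerated as follows (a minimal illustration, not from the paper).

```python
from itertools import product

# The set of all possible class combinations is {+1, -1}^k;
# for k classes there are 2**k candidate label vectors.
k = 2
combinations = list(product([+1, -1], repeat=k))
print(combinations)       # [(1, 1), (1, -1), (-1, 1), (-1, -1)]
print(len(combinations))  # 2**k = 4
```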
Problem definition (Cont.)


SVM is used as the binary classifier.

fi denotes the binary classifier associated with target class i.

Given a test instance x', if fi(x') > 0, then x' belongs to class i.

The authors use a pool-based active learning approach.

The data with labels are denoted by Dl.

The remaining data without labels are denoted by Du.
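The one-vs-rest decision rule (assign x' to every class i with fi(x') > 0) can be sketched with linear classifiers; the weights below are hand-set for illustration, not trained SVMs.

```python
import numpy as np

# One-vs-rest multi-label prediction: k binary linear classifiers
# f_i(x) = w_i . x + b_i; x' receives every label i with f_i(x') > 0.
# The weights are illustrative stand-ins, not learned from data.
W = np.array([[ 1.0,  0.0],    # class 1
              [ 0.0,  1.0],    # class 2
              [-1.0, -1.0]])   # class 3
b = np.array([0.0, 0.0, 0.5])

def predict_label_set(x):
    scores = W @ x + b                     # f_i(x) for i = 1..k
    return [i + 1 for i, s in enumerate(scores) if s > 0]

x_prime = np.array([0.8, 0.5])
print(predict_label_set(x_prime))  # [1, 2]: f_1 and f_2 are positive
```

Because each class gets an independent threshold, an instance can receive zero, one, or several labels, which is exactly the multi-label setting.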
Methodology

Let P(x) be the input distribution.

Denote the multi-label prediction function learned from the training set Dl by fDl; the predicted label set of x is fDl(x).

If the true label set of x is y, the loss on x is L(fDl(x), y); the expected loss of the classifier, taken over P(x), is denoted L(fDl).
Methodology (Cont.)

The active learner evaluates each possible set of unlabeled data Ds to find the optimal query set Ds*.

Denote the new training set by Dl' = Dl ∪ Ds; the expected loss of the classifier trained on Dl' is L(fDl').

The optimization problem is to find the query set Ds* that minimizes this expected loss over all candidate sets Ds.
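The pool-based loop this implies can be sketched schematically; `select_query_set` below is a placeholder (random choice) standing in for the paper's actual selection strategy, and the index sets are made up.

```python
import random

# Schematic pool-based active-learning loop, not the paper's exact code.
# select_query_set is a stand-in for the real sample selection strategy.
def select_query_set(D_u, s):
    return random.sample(sorted(D_u), s)   # placeholder: random selection

D_l = {0, 1}            # indices of labeled data (Dl)
D_u = {2, 3, 4, 5, 6}   # indices of unlabeled data (Du)
s = 2                   # query-set size per iteration

for _ in range(2):
    D_s = select_query_set(D_u, s)   # pick Ds from the pool
    D_l |= set(D_s)                  # Dl' = Dl ∪ Ds (after querying labels)
    D_u -= set(D_s)
    # retrain the k binary SVMs on D_l here

print(len(D_l), len(D_u))  # 6 1
```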
Methodology (Cont.) – Sample Selection Strategy with SVM


The optimization problem raises two questions:

How to measure the loss reduction of the multi-label classifier.

How to provide a good estimate of the conditional probability p(y|x).

Estimate Loss Reduction

Use the SVM margin as the measure of the version space size.
Methodology (Cont.)

Denote the size of the version space of the binary classifier fi, associated with target class i and learned from the labeled data Dl.

After adding a new data point (x, yi), where yi is the true label of x on class i, the new model loss is compared with the old one on that binary classifier.
Methodology (Cont.)

Label Prediction

Suppose there are k classes; then we have k binary classifiers. Given data x, each binary classifier provides the probability of x belonging to its class i.

Next, the logistic regression (LR) algorithm is used to predict the number of labels.

Before LR is applied, the authors transform the decision outputs on the training data into classification probabilities using the sigmoid function.

Pipeline: SVM classifiers → transform outputs with the sigmoid → sort the probabilities → train the logistic regression classifier → output the labels with the largest probabilities.
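A minimal sketch of this prediction pipeline, with illustrative decision outputs and a hard-coded stand-in for the trained label-number LR model (the real system would learn that predictor from the sorted probability vectors):

```python
import numpy as np

def sigmoid(z):
    # map an SVM decision output to a classification probability
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative decision outputs f_i(x) of k = 3 binary SVMs for one instance
f = np.array([1.2, 0.3, -0.8])
probs = sigmoid(f)            # calibrate margins to probabilities
order = np.argsort(-probs)    # class indices sorted by confidence
sorted_probs = probs[order]   # the feature vector fed to the label-number LR

# Stand-in for the trained logistic-regression predictor of the label count;
# hard-coded here to keep the sketch self-contained.
num_labels = 2
predicted = sorted(int(c) + 1 for c in order[:num_labels])
print(predicted)  # [1, 2]: the top-2 classes by calibrated probability
```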
Methodology (Cont.)

Incorporating the predicted label vector into the expected loss estimation yields the data selection strategy: Maximum loss reduction with Maximal Confidence (MMC).

The true labels yi in the loss-reduction estimate are replaced by the predicted label vector.
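One common reading of the resulting MMC score is the sum over classes of (1 - ŷi·fi(x))/2, with ŷi the predicted {+1, -1} label; the sketch below uses that form with made-up decision outputs (see the paper for the exact expression).

```python
import numpy as np

def mmc_score(f_x, y_hat):
    # Approximate expected loss reduction for one unlabeled instance:
    # sum over classes of (1 - y_hat_i * f_i(x)) / 2, y_hat in {+1, -1}.
    # This is our reading of the slide; the exact form is in the paper.
    return float(np.sum((1.0 - y_hat * f_x) / 2.0))

# Two pool instances with illustrative decision outputs for k = 3 classes
F = np.array([[0.9,  0.8, -0.9],     # confident everywhere -> low score
              [0.1, -0.2,  0.05]])   # near the margins -> high score
Y_hat = np.sign(F)                   # illustrative predicted label vectors
scores = [mmc_score(F[j], Y_hat[j]) for j in range(len(F))]
best = int(np.argmax(scores))        # MMC queries the highest-scoring instance
print(best)  # 1: the instance near the decision boundaries is selected
```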
Experiments

Use the Micro-Average F1 score as the evaluation measure.

n is the number of test instances.

yi is the true label vector of the i-th test instance.

ŷi is the predicted label vector.

k is the number of classes.
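Micro-average F1 pools true/false positives over all n × k instance-class decisions before computing F1. A minimal sketch with toy label matrices (using 1 = label present, 0 = absent, rather than the ±1 convention above):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    # Pool TP/FP/FN over all n*k instance-class decisions, then compute F1.
    tp = int(np.sum((Y_true == 1) & (Y_pred == 1)))
    fp = int(np.sum((Y_true == 0) & (Y_pred == 1)))
    fn = int(np.sum((Y_true == 1) & (Y_pred == 0)))
    return 2 * tp / (2 * tp + fp + fn)

# toy data: n = 2 test instances, k = 3 classes
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])
print(micro_f1(Y_true, Y_pred))  # 2*2 / (2*2 + 1 + 1) = 0.666...
```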
Experiments

Label prediction methods are compared, using the RCV1-V2 data set.
Experiments

Sensitivity experiments: varying the sampling size per run.
Conclusion

The MMC method reduces the amount of labeled data required for multi-label classification while maintaining favorable accuracy.

The method outperforms the other active learning techniques on multi-label text classification by a large margin and can significantly reduce the labeling cost.
Comments

Advantage

The paper includes many experiments demonstrating the method's performance.

Drawback

…

Application

News and e-mail classification

Image classification