An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification
by Carlotta Domeniconi and Hong Chai
Outline
• Introduction to microarray data
• Problem description
• Related work
• Our methods
• Experimental analysis
• Results
• Conclusions and future work
Microarray
• Measures gene expression levels across
different conditions, times or tissue samples
• Gene expression levels inform cell activity
and disease status
• Microarray data distinguish between tumor
types, define new subtypes, predict
prognostic outcome, identify possible drugs,
assess drug toxicity, etc.
Microarray Data
• A matrix of measurements: rows correspond to genes (expression levels); columns correspond to samples/conditions.
Example – Lymphoma Dataset
Microarray data analysis
• Clustering is applied to genes, to identify genes with similar functions or genes that participate in similar biological processes, or to samples, to find potential tumor subclasses.
• Classification builds a model to predict the class of a sample (e.g., diseased vs. normal); it has diagnostic value.
Classification Problem
• Large number of genes (features) - may
contain up to 20,000 features.
• Small number of experiments (samples) – at most a few hundred, usually fewer than 100 samples.
• The need to identify “marker genes” that classify tissue types (e.g., to diagnose cancer) motivates feature selection.
Our Focus
• Binary classification and feature selection methods have been extensively studied; the multi-class case has received little attention.
• In practice, many microarray datasets have more than two categories of samples.
• We focus on multi-class gene ranking and
selection.
Related Work
Some criteria used in feature ranking:
• Correlation coefficient
• Information gain
• Chi-squared
• SVM-RFE
Notation
• Given C classes
• m observations (samples or patients)
• n feature measurements (gene expression levels): $x = (x_1, \ldots, x_n)^t \in \mathbb{R}^n$
• class labels $y \in \{1, \ldots, C\}$
Correlation Coefficient
• Two-class problem: $y \in \{-1, +1\}$
• Ranking criterion defined in Golub:
$w_j = \frac{\mu_j^{+} - \mu_j^{-}}{\sigma_j^{+} + \sigma_j^{-}}$
• where $\mu_j^{\pm}$ is the mean and $\sigma_j^{\pm}$ the standard deviation along dimension j in the + and − classes; a large $|w_j|$ indicates a discriminant feature.
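• A minimal NumPy sketch of this score (not from the original slides; X is assumed to be the m × n expression matrix and y the label vector):

import numpy as np

def golub_score(X, y):
    """Signal-to-noise score |w_j| per feature for a two-class problem (y in {-1, +1})."""
    pos, neg = X[y == +1], X[y == -1]
    mu_diff = pos.mean(axis=0) - neg.mean(axis=0)    # mu_j^+ - mu_j^-
    sigma_sum = pos.std(axis=0) + neg.std(axis=0)    # sigma_j^+ + sigma_j^-
    return np.abs(mu_diff / sigma_sum)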
Fisher’s Score
• Fisher’s criterion score in Pavlidis:
$w_j = \frac{(\mu_j^{+} - \mu_j^{-})^2}{(\sigma_j^{+})^2 + (\sigma_j^{-})^2}$
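• A corresponding sketch for Fisher’s criterion (same assumed X and y as above):

import numpy as np

def fisher_score(X, y):
    """Fisher criterion (mu_j^+ - mu_j^-)^2 / ((sigma_j^+)^2 + (sigma_j^-)^2) per feature."""
    pos, neg = X[y == +1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0) + neg.var(axis=0)
    return num / den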
Assumption of above methods
• Features are analyzed in isolation; correlations are not considered.
• Assumption: features are independent of each other.
• Implication: redundant genes may be selected into the top subset.
Information Gain
• A measure of the effectiveness of a feature in classifying the
training data.
• Expected reduction in entropy caused by partitioning the data
according to this feature.
$I(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} E(S_v)$
• where V(A) is the set of all possible values of feature A, and $S_v$ is the subset of S for which feature A has value v.
Information Gain
• E(S) is the entropy of the entire set S.
$E(S) = -\sum_{i=1}^{C} \frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}$
• where $|C_i|$ is the number of training samples in class $C_i$, and $|S|$ is the cardinality of the entire set S.
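• A small sketch of both quantities for a discrete-valued feature (illustrative only; the feature is assumed to have been discretized beforehand):

import numpy as np

def entropy(labels):
    """E(S) = -sum_i (|C_i|/|S|) log2(|C_i|/|S|)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """I(S, A) = E(S) - sum_v (|S_v|/|S|) E(S_v)."""
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])   # mask.mean() = |S_v|/|S|
    return gain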
Chi-squared
• Measures features individually.
• Continuous-valued features are discretized into intervals.
• Form a matrix A, where $A_{ij}$ is the number of samples of class $C_i$ within the j-th interval.
• Let $CI_j$ be the number of samples in the j-th interval.
Chi-squared
• The expected frequency of $A_{ij}$ is $E_{i,j} = CI_j \cdot |C_i| / m$
• The Chi-squared statistic of a feature is defined as
$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$
• where I is the number of intervals. The larger the statistic, the more informative the feature is.
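• A sketch of the statistic for one feature, using equal-width discretization (the number of intervals and the binning scheme are assumptions, not taken from the slides):

import numpy as np

def chi_squared(feature, labels, n_intervals=10):
    """Chi-squared statistic of one continuous feature after equal-width discretization."""
    edges = np.linspace(feature.min(), feature.max(), n_intervals + 1)
    interval = np.digitize(feature, edges[1:-1])            # interval index in 0..I-1
    m = len(labels)
    stat = 0.0
    for c in np.unique(labels):
        for j in range(n_intervals):
            A_ij = np.sum((labels == c) & (interval == j))  # observed count
            CI_j = np.sum(interval == j)                    # samples in interval j
            E_ij = CI_j * np.sum(labels == c) / m           # expected frequency
            if E_ij > 0:
                stat += (A_ij - E_ij) ** 2 / E_ij
    return stat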
SVM-RFE
• Recursive Feature Elimination using SVM.
• In the linear SVM model on the full feature set, the decision function is $\mathrm{sign}(w \cdot x + b)$, where w is a vector of weights (one per feature), x is an input instance, and b is a threshold.
• If $w_i = 0$, feature $X_i$ does not influence classification and can be eliminated from the set of features.
SVM-RFE
1. Train a linear SVM on the full feature set to obtain w.
2. Sort features in descending order of weight; eliminate a percentage of the lowest-ranked features.
3. Build a new linear SVM using the reduced set of features; repeat the process.
4. Choose the best feature subset.
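• A sketch of the recursion with scikit-learn’s linear SVC (which wraps LIBSVM); the drop fraction, the stopping size, and the summing of |w| over the one-vs-one binary machines in the multi-class case are choices made here, not specified in the slides:

import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, drop_fraction=0.1, n_keep=100):
    """Recursively eliminate the lowest-weight fraction of features with a linear SVM."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        svm = SVC(kernel="linear").fit(X[:, remaining], y)
        weights = np.abs(svm.coef_).sum(axis=0)        # one aggregated weight per feature
        n_drop = max(1, int(len(remaining) * drop_fraction))
        order = np.argsort(weights)                    # ascending: smallest weights first
        remaining = remaining[order[n_drop:]]          # keep everything but the lowest-ranked
    return remaining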
Other criteria
• The Brown-Forsythe, the Cochran, and the
Welch test statistics used in Chen, et al.
(Extensions of the t-statistic used in the two-class
classification problem.)
• PCA
(Disadvantage: new dimensions are formed, and none of the original features can be discarded; therefore marker genes cannot be identified.)
Our Ranking Methods
• BScatter
• MinMax
• bSum
• bMax
• bMin
• Comb (combined score)
Notation
• For each class i and each feature j, we define the mean value of feature j for class $C_i$:
$\mu_{j,i} = \frac{1}{|C_i|} \sum_{x \in C_i} x_j$
• Define the total mean along feature j:
$\mu_j = \frac{1}{m} \sum_{x} x_j$
Notation
• Define the between-class scatter along feature j:
$B_j = \sum_{i=1}^{C} |C_i| (\mu_{j,i} - \mu_j)^2$
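• These per-class and total statistics can be computed once and reused by all six scores; a minimal sketch (X and y are the assumed expression matrix and labels; the later score sketches take the resulting arrays as inputs):

import numpy as np

def class_stats(X, y):
    """Per-class means mu_{j,i} and std devs sigma_{j,i} (C x n), class sizes |C_i|,
    total mean mu_j (n,), and between-class scatter B_j (n,)."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    sigma = np.array([X[y == c].std(axis=0) for c in classes])
    counts = np.array([np.sum(y == c) for c in classes])
    total_mu = X.mean(axis=0)
    B = (counts[:, None] * (mu - total_mu) ** 2).sum(axis=0)
    return mu, sigma, counts, total_mu, B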
Function 1: BScatter
• Fisher discriminant analysis for multiple classes under the feature-independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter:
$\text{BScatter}_j = \frac{B_j}{\sum_{i=1}^{C} \sigma_{j,i}}$
• where $\sigma_{j,i}$ is the standard deviation of class i along feature j.
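• A one-line sketch given B and sigma from the class_stats helper above (an assumed helper, not part of the original slides):

import numpy as np

def bscatter_score(B, sigma):
    """BScatter_j = B_j / sum_i sigma_{j,i}; B is (n,), sigma is (C, n)."""
    return B / sigma.sum(axis=0)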
Function 2: MinMax
• Favors features along which the farthest mean-class difference is large and the within-class variance is small:
$\text{MinMax}_j = \frac{\mu_{j,\max} - \mu_{j,\min}}{\sum_{i=1}^{C} \sigma_{j,i}}$
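• Sketch (mu and sigma are the (C, n) per-class mean and std-dev arrays, e.g. from the class_stats sketch above):

import numpy as np

def minmax_score(mu, sigma):
    """MinMax_j = (mu_{j,max} - mu_{j,min}) / sum_i sigma_{j,i}."""
    return (mu.max(axis=0) - mu.min(axis=0)) / sigma.sum(axis=0)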
Function 3: bSum
• For each feature j, we sort the C values $\mu_{j,i}$ in non-decreasing order: $\mu_{j,(1)} \le \mu_{j,(2)} \le \ldots \le \mu_{j,(C)}$
• Define $b_{j,l} = \mu_{j,(l+1)} - \mu_{j,(l)}$
• bSum rewards features with large distances between adjacent mean class values:
$\text{bSum}_j = \frac{\sum_{l=1}^{C-1} b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$
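• Sketch of the adjacent-gap values $b_{j,l}$ and of bSum (same assumed mu, sigma arrays as above):

import numpy as np

def adjacent_gaps(mu):
    """b_{j,l}: gaps between adjacent sorted class means, shape (C-1, n)."""
    return np.diff(np.sort(mu, axis=0), axis=0)

def bsum_score(mu, sigma):
    """bSum_j = sum_l b_{j,l} / sum_i sigma_{j,i}."""
    return adjacent_gaps(mu).sum(axis=0) / sigma.sum(axis=0)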
Function 4: bMax
• Rewards features j with a large between-neighbor-class mean difference:
$\text{bMax}_j = \frac{\max_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$
Function 5: bMin
• Favors features with a large smallest between-neighbor-class mean difference:
$\text{bMin}_j = \frac{\min_l b_{j,l}}{\sum_{i=1}^{C} \sigma_{j,i}}$
Function 6: Comb
• Considers a score function which combines MinMax and bMin:
$\text{Comb}_j = \frac{\min_l (b_{j,l}) \, (\mu_{j,\max} - \mu_{j,\min})}{\sum_{i=1}^{C} \sigma_{j,i}}$
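• The remaining three scores reuse the same adjacent gaps; a combined sketch (assumed mu, sigma arrays as above):

import numpy as np

def bmax_score(mu, sigma):
    """bMax_j = max_l b_{j,l} / sum_i sigma_{j,i}."""
    gaps = np.diff(np.sort(mu, axis=0), axis=0)
    return gaps.max(axis=0) / sigma.sum(axis=0)

def bmin_score(mu, sigma):
    """bMin_j = min_l b_{j,l} / sum_i sigma_{j,i}."""
    gaps = np.diff(np.sort(mu, axis=0), axis=0)
    return gaps.min(axis=0) / sigma.sum(axis=0)

def comb_score(mu, sigma):
    """Comb_j = min_l(b_{j,l}) * (mu_{j,max} - mu_{j,min}) / sum_i sigma_{j,i}."""
    gaps = np.diff(np.sort(mu, axis=0), axis=0)
    spread = mu.max(axis=0) - mu.min(axis=0)
    return gaps.min(axis=0) * spread / sigma.sum(axis=0)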
Datasets
Dataset     Samples  Genes   Classes  Comment
MLL         72       12582   3        Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma    88       4026    6        Class sizes: 46 DLBCL, 11 CLL, 9 FL (malignant); 11 ABB, 6 RAT, 6 TCL (normal). Available at http://llmpp.nih.gov/lymphoma
Yeast       80       5775    3        Available at http://rana.lbl.gov/
NCI60       61       1155    8
Experiment Design
• Gene expression values scaled to [-1, 1]
• Compared 9 feature selection methods (the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE)
• Subsets of top-ranked genes used to train an SVM classifier (3 kernel functions: linear, degree-2 polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2])
• Leave-one-out cross-validation due to the small sample size
• One-vs-one multi-class classification implemented with LIBSVM
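• A sketch of the evaluation loop with scikit-learn’s SVC (which wraps LIBSVM and uses one-vs-one for multi-class classification); the function and parameter names are assumptions, and for simplicity the gene ranking is computed once outside the loop, which is exactly the selection bias discussed under Limitations:

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def loocv_accuracy(X, y, ranked_genes, k=100, kernel="linear", C=10):
    """Leave-one-out accuracy of an SVM trained on the top-k ranked genes."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        scaler = MinMaxScaler(feature_range=(-1, 1))      # scale expression to [-1, 1]
        X_tr = scaler.fit_transform(X[train_idx][:, ranked_genes[:k]])
        X_te = scaler.transform(X[test_idx][:, ranked_genes[:k]])
        clf = SVC(kernel=kernel, C=C).fit(X_tr, y[train_idx])
        correct += int(clf.predict(X_te)[0] == y[test_idx][0])
    return correct / len(y)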
Results – MLL Dataset
Results – Lymphoma Dataset
Conclusions
• SVM classification benefits from gene selection;
• Gene ranking with correlation coefficients gives
higher accuracy than SVM-RFE in low
dimensions in most data sets. The best performing
correlation score varies from problem to problem;
• Although SVM-RFE shows an excellent
performance in general, there is no clear winner.
The performance of feature selection methods
seems to be problem-dependent;
Conclusions
• For a given classification model, different gene selection
methods reach the best performance for different feature
set sizes;
• Very high accuracy was achieved on all the data sets
studied here. In many cases perfect accuracy (based on
leave-one-out error) was achieved;
• The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).
Limitations & Future Work
• The selection of features over the whole training set induces a bias in the results. In future experiments we will study how to assess and correct for this bias.
• We will also take into consideration the correlation between pairs of selected features: the ranking methods will be modified so that correlations among selected features stay below a given threshold.
• We will evaluate the top-ranked genes in our study against marker genes identified in other studies.