Feature selection and transduction
for prediction of molecular bioactivity
for drug design
Bioinformatics, Vol. 19, No. 6, 2003 (pages 764-771)
Reporter: Yu Lun Kuo (D95922037)
E-mail: sscc6991@gmail.com
Date: April 17, 2008
Abstract
• Drug discovery
– Identify characteristics that separate active
(binding) compounds from inactive ones.
• Two methods for the prediction of bioactivity
– Feature selection method
– Transductive method
• Improvement over using only one of the techniques
Introduction (1/4)
• Discovery of a new drug
– Testing many small molecules for their ability to
bind to the target site
– The task of determining what separates the active
(binding) compounds from the inactive ones
Introduction (2/4)
• Design new compounds
– Not only bind
– But also possess certain other properties
required for a drug
• This determination task can be seen, in a
machine learning context, as one of feature
selection
Introduction (3/4)
• Challenging
– Few positive examples
• Little information is given indicating positive correlation
between features and the labels
– Large number of features
• Selected from a huge collection of potentially useful features
• Some features are in reality uncorrelated with the labels
– Different distributions
• Cannot expect the data to come from a fixed distribution
Introduction (4/4)
• Many conventional machine learning
algorithms are ill-equipped to deal with these challenges
• Many algorithms generalize poorly
– Due to the high dimensionality of the problem
– At this problem size, many methods are no longer
computationally feasible
– Most cannot deal with training and testing data
coming from different distributions
Overcome (1/2)
• Feature selection criterion
– Called unbalanced correlation score
• Take into account the unbalanced nature of the data
• Simple enough to avoid overfitting
• Classifier
– Takes into account the different distributions in
the test data compared to the training data
• Induction
• Transduction
Overcome (2/2)
• Induction
– Builds a model based only on the distribution of
the training data
• Transduction
– Also takes into account the test data inputs
• By combining these two techniques, we obtain
improved prediction accuracy
KDD Cup Competition (1/2)
• We focused on a well-studied data set
– KDD Cup 2001 competition
• Knowledge Discovery and Data Mining
• One of the premier meetings of the data mining
community
– http://www.kdnuggets.com/datasets/kddcup.html
KDD Cup Competition (2/2)
• KDD Cup 2006
– data mining for medical diagnosis, specifically identifying
pulmonary embolisms from three-dimensional computed
tomography data
• KDD Cup 2004
– features tasks in particle physics and bioinformatics
evaluated on a variety of different measures
• KDD Cup 2002
– focus: bioinformatics and text mining
• KDD Cup 2001
– focus: bioinformatics and drug discovery
KDD Cup 2001 (1/2)
• Objective
– Prediction of molecular bioactivity for drug
design -- binding to Thrombin
• Data
– Training: 1909 cases (42 positive), 139,351
binary features
– Test: 634 cases
KDD Cup 2001 (2/2)
• Challenge
– Highly imbalanced, high-dimensional, different
distribution
• Approach
– Bayesian network predictive model
– Data PreProcessor system
– BN PowerPredictor system
– BN PowerConstructor system
Data Set (1/3)
• Provided by DuPont Pharmaceuticals
– Drug binds to a target site on thrombin, a key
receptor in blood clotting
• Each example has a fixed length vector of
139,351 binary features in {0, 1}
– Which describe three-dimensional properties of
the molecule
Data Set (2/3)
• Positive examples are labeled +1
• Negative examples are labeled -1
• In the training set
– 1909 examples, 42 of which bind
(rather unbalanced: positives are only 2.2%)
• In the test set
– 634 additional compounds
Data Set (3/3)
• An important characteristic of the data
– Very few of the feature entries are non-zero
(0.68% of the entries in the 1,909 × 139,351 training matrix)
System Assessment
• Performance is evaluated according to a
weighted accuracy criterion
– The score of an estimate y’ of the labels y
$\mathrm{lbal}(y, y') = \frac{1}{2}\,\frac{\#\{i : y_i = +1 \wedge y'_i = +1\}}{\#\{i : y_i = +1\}} + \frac{1}{2}\,\frac{\#\{i : y_i = -1 \wedge y'_i = -1\}}{\#\{i : y_i = -1\}}$
– Complete success is a score of 1
• Multiply this score by 100 as the percentage weighted
success rate
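A minimal Python sketch of this score (the function name and array encoding are my own, not from the paper):

```python
import numpy as np

def weighted_success(y_true, y_pred):
    """Mean of per-class accuracies for labels in {-1, +1}; 1.0 = complete success."""
    acc_pos = np.mean(y_pred[y_true == 1] == 1)    # fraction of positives recovered
    acc_neg = np.mean(y_pred[y_true == -1] == -1)  # fraction of negatives recovered
    return 0.5 * acc_pos + 0.5 * acc_neg

# weighted_success(y, y_hat) * 100 gives the percentage weighted success rate
```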
Methodology (1/2)
• Predict the labels on the test set by using a
machine learning algorithm
• The positively and negatively labeled training
examples are split randomly into n groups
– For n-fold cross-validation, such that as close to
1/n of the positively labeled examples as
possible are present in each group
• Called balanced cross-validation
– Necessary because there are so few positive examples (a sketch follows below)
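The described balanced split corresponds to what scikit-learn calls stratified k-fold; a sketch on toy data (the n_splits value is illustrative, not from the paper):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 20))      # toy binary data matrix
y = np.where(rng.random(100) < 0.1, 1, -1)  # unbalanced {-1, +1} labels

# Each fold receives as close to 1/n of the positive examples as possible.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # train on X[train_idx], evaluate on X[test_idx]; average the n scores
```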
Methodology (2/2)
• The method is
– Trained on n−1 of the groups
– Tested on the remaining group
– Repeated n times (different group for testing)
– Final score: mean of the n scores
Feature Selection (1/2)
• Called the unbalanced correlation score
$f_j = \sum_{i:\, y_i = +1} X_{ij} \;-\; \lambda \sum_{i:\, y_i = -1} X_{ij}$
– fj: the score of feature j
– X: training data as a matrix X where columns are
features and examples are rows
• Take λ very large (λ ≥ 3) in order to select features
whose non-zero entries occur mainly on positive examples
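A minimal sketch of this criterion in Python (toy data; the function name is mine):

```python
import numpy as np

def unbalanced_correlation(X, y, lam=3.0):
    """f_j = sum of X_ij over positive examples - lam * sum over negatives."""
    return X[y == 1].sum(axis=0) - lam * X[y == -1].sum(axis=0)

# toy example: feature 0 fires only on positives, so it scores highest
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 1, -1, -1])
top_d = np.argsort(-unbalanced_correlation(X, y))[:2]  # keep the best d features
```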
Feature Selection (2/2)
• This score is an attempt to encode prior
information that
– The data is unbalanced
– Large number of features
– Only positive correlations are likely to be useful
Justification
• Justify the unbalanced correlation score
using methods of information theory
– Entropy: higher values indicate less regularity
$H = -\sum_i p_i \ln(p_i)$
• $p_i$: the probability of appearance of event i
Entropy (1/2)
• The probability of random appearance of a
feature with an unbalanced score of N=Np-Nn
$P_1(T_p, T_n, N_p, N_n) = \binom{N_p+N_n}{N_p}\;\frac{\prod_{i=0}^{N_p-1}(T_p-i)\,\prod_{i=0}^{N_n-1}(T_n-i)}{\prod_{i=0}^{N_p+N_n-1}(T_p+T_n-i)}$
– Np = number of 1-entries associated with +1 labels
– Nn = number of 1-entries associated with −1 labels
– Tp = total number of positive labels in the training set
– Tn = total number of negative labels in the training set
Entropy (2/2)
• Need to compute the probability that a
certain N might occur randomly
$P_2(T_p, T_n, N) = \frac{1}{T_p T_n}\sum_{i=0}^{\min(T_p-N,\;T_n+N)} P_1(T_p, T_n, \max(0,N)+i, \max(0,-N)+i)$
• Finally, compute the entropy for each feature
$-P_1 P_2 \log(P_1 P_2)$
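A sketch of these quantities in Python, following the reconstruction above (the exact normalization in P2 and the summation limits are my reading of the garbled formula):

```python
from math import comb, prod, log

def p1(tp, tn, np_, nn):
    """Probability that np_ of the np_+nn randomly placed 1-entries fall on
    the tp positive examples and nn on the tn negative ones."""
    if np_ + nn > tp + tn:
        return 0.0
    num = prod(tp - i for i in range(np_)) * prod(tn - i for i in range(nn))
    den = prod(tp + tn - i for i in range(np_ + nn))
    return comb(np_ + nn, np_) * num / den

def p2(tp, tn, n):
    """Probability that an unbalanced score of N = Np - Nn occurs randomly."""
    upper = min(tp - n, tn + n)
    s = sum(p1(tp, tn, max(0, n) + i, max(0, -n) + i) for i in range(upper + 1))
    return s / (tp * tn)

def entropy_term(tp, tn, np_, nn):
    """-P1*P2*log(P1*P2): the per-feature entropy from the slide."""
    p = p1(tp, tn, np_, nn) * p2(tp, tn, np_ - nn)
    return 0.0 if p == 0.0 else -p * log(p)
```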
Entropy and unbalanced score (1/2)
• The entropy and the unbalanced score will not
always select the same features
– Because the unbalanced correlation score will
not select features with negative correlations
• In this particular problem
– Reach a similar ranking of the features
• Due to the unbalanced nature of the data
Entropy and unbalanced score (2/2)
• Of the first 6 features chosen by each score
– 5 out of 6 are the same
– Of the first 16 features, 12 coincide
– The unbalanced score pays more attention to positive correlations
Multivariate unbalanced correlation (1/2)
• The feature selection algorithm described so
far is univariate
– This reduces the chance of overfitting
– But if the relationships between the inputs and targets
are too complex, this assumption may be too restrictive
• We extend our criterion to assign a rank to a
subset of features
– Rather than just a single feature
Multivariate unbalanced correlation (2/2)
• By computing the logical OR of the subset of
features S (as they are binary)
$X_i(S) = 1 - \prod_{j \in S} (1 - X_{ij})$
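In NumPy, assuming X is the binary example-by-feature matrix, the OR-ed column can be built and then scored exactly like a single feature (a sketch; the function name is mine):

```python
import numpy as np

def or_of_subset(X, S):
    """X_i(S) = 1 - prod_{j in S} (1 - X_ij): logical OR of the columns in S."""
    return 1 - np.prod(1 - X[:, list(S)], axis=1)

# the resulting column can be ranked with the unbalanced correlation score,
# giving a multivariate score for the whole subset S
```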
Fisher Score
$f_j = \frac{(\mu_j(+) - \mu_j(-))^2}{\sigma_j(+)^2 + \sigma_j(-)^2}$
– μj(+): the mean of feature j over the positive examples
– μj(−): the mean of feature j over the negative examples
– σj(+), σj(−): the corresponding standard deviations
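A sketch of this score in Python (the epsilon guard for constant features is my addition):

```python
import numpy as np

def fisher_score(X, y):
    """(mu_j(+) - mu_j(-))^2 / (sigma_j(+)^2 + sigma_j(-)^2) per feature j."""
    Xp, Xn = X[y == 1], X[y == -1]
    num = (Xp.mean(axis=0) - Xn.mean(axis=0)) ** 2
    den = Xp.var(axis=0) + Xn.var(axis=0)
    return num / np.maximum(den, 1e-12)  # avoid division by zero
```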
• In each case, the algorithms are evaluated
for different numbers of features d
– The range d = 1, …, 40
• Choose a small number of features in order to keep
the decision function interpretable
• It is anticipated that a large number of features are
noisy and should not be selected
Classification algorithms (Inductive)
• The task may not simply be just to identify
relevant characteristics via feature selection
– But also to provide a prediction system
• The simplest of classifiers:
$f(x) = +1 \;\text{ if }\; \sum_{i=1}^{d} x_{(i)} > 0, \quad -1 \;\text{ otherwise}$
– where x(1), …, x(d) are the d selected features
– We call this a logical OR classifier
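Assuming X_sel holds only the d selected binary columns, the classifier is one line (a sketch):

```python
import numpy as np

def or_classify(X_sel):
    """Predict +1 if any of the d selected binary features fires, else -1."""
    return np.where(X_sel.sum(axis=1) > 0, 1, -1)
```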
Comparison Techniques
• We compared a number of rather more
sophisticated classification methods
– Support vector machines (SVM)
– SVM*
• Searches over all possible values of the
threshold parameter in the linear model after training
– K-nearest neighbors (K-NN)
– K-NN* (parameter γ)
– C4.5 (decision tree learner)
Transductive Inference (1/4)
• Induction: one is given labeled data from which one
builds a general model
– Then applies this model to classify previously
unseen (test) data
• Transduction: takes into account not only the given
(labeled) training set but also the unlabeled data
– That one wishes to classify
Transductive Inference (2/4)
• Different models can be built
– Trying to classify different test sets
– Even if the training set is the same in all cases
• It is this characteristic which helps to solve the
third problem
– The data we are given has different distributions
in the training and test sets
Transductive Inference (3/4)
• Transduction is not useful in all tasks
– In drug discovery in particular we believe it is
useful
• Developers often have access to huge
databases of compounds
– Compounds are often generated using virtual
Combinatorial Chemistry
– Compound descriptors can be computed even
though the compounds have not been
synthesized yet
Transductive Inference (4/4)
• Drug discovery is an iterative process
– The machine learning method helps choose the
next test set
– It acts as one step in a two-step candidate selection procedure
• After a candidate test set has been produced
• The method's predictions on it determine the final test set
Transductive algorithm
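(The original slide showed the algorithm as a figure, which did not survive extraction.) The following is only a hypothetical sketch of the iterative scheme the paper describes: alternating feature selection and OR classification, feeding the predicted test labels back into the unbalanced correlation score. It is my reading, not a verbatim reproduction:

```python
import numpy as np

def transductive_or(X_tr, y_tr, X_te, d=10, lam=3.0, max_iter=10):
    """Hypothetical sketch: re-select features with the unbalanced correlation
    score after self-labeling the test inputs, until the labels stabilize."""
    y_te = -np.ones(len(X_te), dtype=int)        # initial guess: all inactive
    top = None
    for _ in range(max_iter):
        X_all = np.vstack([X_tr, X_te])
        y_all = np.concatenate([y_tr, y_te])
        f = X_all[y_all == 1].sum(0) - lam * X_all[y_all == -1].sum(0)
        top = np.argsort(-f)[:d]                 # transductive feature ranking
        y_new = np.where(X_te[:, top].sum(1) > 0, 1, -1)
        if np.array_equal(y_new, y_te):          # predicted labels stabilized
            break
        y_te = y_new
    return y_te, top
```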
Results (with unbalanced correlation score)
• C4.5 gave only a 50% success rate for all numbers of features
• The transductive algorithm consistently selects more relevant features than the inductive one
• The unbalanced correlation score outperformed the Fisher score
(The results plots from the paper appeared here.)
Further Results (1/2)
• We also tested some more sophisticated
multivariate feature selection methods
– Not as good as using the unbalanced correlation
score
• Using non-linear SVMs
– Not improve results (50% success)
• Using SVMs as the base classifier for our
transduction
– Gave an improvement over inductive SVMs
Further Results (2/2)
• Also tried training the classifiers with larger
numbers of features
– Inductive methods
• Failed to learn anything beyond 200 features
– Transductive methods
• Exhibited generalization behavior up to 1000 features
• (TRANS-Orcub: 58% success with d=1000, 77% with d=200)
– The KDD Cup champion
• Success rate of 68.4% (only 7% of entrants scored higher than 60%)
Thanks for your attention