Canadian Bioinformatics Workshops
www.bioinformatics.ca

Module 2: Clustering, Classification and Feature Selection
Sohrab Shah
Centre for Translational and Applied Genomics
Molecular Oncology Breast Cancer Research Program
BC Cancer Agency
sshah@bccrc.ca

Module Overview
• Introduction to clustering
  – distance metrics
  – hierarchical, partitioning and model based clustering
• Introduction to classification
  – building a classifier
  – avoiding overfitting
  – cross validation
• Feature selection in clustering and classification

Introduction to clustering
• What is clustering?
  – unsupervised learning
  – discovery of patterns in data
  – class discovery
• Grouping together “objects” that are most similar (or least dissimilar)
  – objects may be genes, or samples, or both
• Example question: are there samples in my cohort that can be subgrouped based on molecular profiling?
  – Do these groups correlate with clinical outcome?

Distance metrics
• In order to perform clustering, we need a way to measure how similar (or dissimilar) two objects are
• Euclidean distance: $d_{xy} = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
• Manhattan distance: $d_{xy} = \sum_{i=1}^{p} |x_i - y_i|$
• 1 − correlation
  – proportional to Euclidean distance, but invariant to the range of measurement from one sample to the next

Distance metrics compared
[Figure: the same data clustered with Euclidean, Manhattan and 1 − Pearson distances]
Conclusion: distance matters!

Other distance metrics
• Hamming distance for ordinal, binary or categorical data: $d_{xy} = \sum_{i=1}^{p} I(x_i \neq y_i)$

Approaches to clustering
• Partitioning methods
  – K-means
  – K-medoids (partitioning around medoids)
  – model based approaches
• Hierarchical methods
  – nested clusters
    • start with pairs
    • build a tree up to the root

Partitioning methods
• Anatomy of a partitioning based method
  – data matrix
  – distance function
  – number of groups
• Output
  – group assignment of every object

Partitioning based methods
• Choose K groups
  – initialise group centers (aka centroids, medoids)
  – assign each object to the nearest centroid according to the distance metric
  – reassign (or recompute) centroids
  – repeat the last two steps until the assignment stabilizes

K-medoids in action
[Figure: K-medoids iterations on example data]

K-means vs K-medoids (R functions: kmeans vs pam)
• Centroids: K-means uses the ‘mean’ of each cluster; K-medoids uses an actual object that minimizes the total within-cluster distance
• Updates: K-means must recompute centroids every iteration; K-medoids can determine the medoid by a quick lookup in the distance matrix
• Initialisation: difficult for K-means, as the notion of a centroid may be unclear before beginning; for K-medoids it is simply K randomly selected objects

Partitioning based methods
Advantages:
• Number of groups is well defined
• A clear, deterministic assignment of an object to a group
• Simple algorithms for inference
Disadvantages:
• Have to choose the number of groups
• Sometimes objects do not fit well to any cluster
• Can converge on locally optimal solutions and often require multiple restarts with random initializations
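A minimal R sketch (not part of the original slides) contrasting kmeans() with cluster::pam() on simulated data; the toy matrix, the choice of K = 2 and the number of restarts are illustrative assumptions.

# Sketch: K-means vs K-medoids on simulated data (assumed toy example)
library(cluster)

set.seed(1)
# 20 samples x 50 features: two groups that differ in mean expression
toy <- rbind(matrix(rnorm(10 * 50, mean = 0), nrow = 10),
             matrix(rnorm(10 * 50, mean = 2), nrow = 10))

# K-means: centroids are per-cluster means, recomputed each iteration;
# several random restarts (nstart) guard against poor local optima
km <- kmeans(toy, centers = 2, nstart = 25)

# K-medoids (PAM): medoids are actual samples, chosen from a distance matrix
pm <- pam(dist(toy, method = "euclidean"), k = 2)

table(kmeans = km$cluster, pam = pm$clustering)   # compare the two partitions

The nstart = 25 restarts address the local-optimum issue listed in the disadvantages above.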
Agglomerative hierarchical clustering
[Figure: objects joined pairwise into progressively larger clusters]

Hierarchical clustering
• Anatomy of hierarchical clustering
  – distance matrix
  – linkage method
• Output
  – dendrogram
    • a tree that defines the relationships between objects and the distance between clusters
    • a nested sequence of clusters

Linkage methods
[Figure: single linkage, complete linkage, distance between centroids, and average linkage]

Linkage methods
• Ward (1963)
  – form partitions that minimize the loss associated with each grouping
  – loss defined as the error sum of squares (ESS)
  – consider 10 objects with scores (2, 6, 5, 6, 2, 2, 2, 0, 0, 0):
    ESS(one group) = (2 − 2.5)² + (6 − 2.5)² + ... + (0 − 2.5)² = 50.5
    If instead the 10 objects are classified according to their scores into four sets, {0,0,0}, {2,2,2,2}, {5}, {6,6}, the ESS is the sum of four separate error sums of squares:
    ESS(four groups) = ESS(group 1) + ESS(group 2) + ESS(group 3) + ESS(group 4) = 0.0
    Thus, clustering the 10 scores into 4 clusters results in no loss of information.

Linkage methods in action
• clustering based on single linkage
  single <- hclust(dist(t(exprMatSub), method="euclidean"), method="single"); plot(single)

Linkage methods in action
• clustering based on complete linkage
  complete <- hclust(dist(t(exprMatSub), method="euclidean"), method="complete"); plot(complete)

Linkage methods in action
• clustering based on centroid linkage
  centroid <- hclust(dist(t(exprMatSub), method="euclidean"), method="centroid"); plot(centroid)

Linkage methods in action
• clustering based on average linkage
  average <- hclust(dist(t(exprMatSub), method="euclidean"), method="average"); plot(average)

Linkage methods in action
• clustering based on Ward linkage
  ward <- hclust(dist(t(exprMatSub), method="euclidean"), method="ward"); plot(ward)

Linkage methods in action
[Figure: the same data clustered with each linkage method]
Conclusion: linkage matters!

Hierarchical clustering analyzed
Advantages:
• There may be small clusters nested inside large ones
• No need to specify the number of groups ahead of time
• Flexible linkage methods
Disadvantages:
• Clusters might not be naturally represented by a hierarchical structure
• It is necessary to ‘cut’ the dendrogram in order to produce clusters
• Bottom-up clustering can result in poor structure at the top of the tree; early joins cannot be ‘undone’
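To make the ‘cut the dendrogram’ step concrete, a minimal sketch, assuming a matrix shaped like the lab’s exprMatSub (genes in rows, samples in columns); the simulated values, the linkage choices and k = 3 are placeholders.

# Sketch: same distance matrix, different linkage, then cut the tree (assumed data)
set.seed(2)
exprMatSub <- matrix(rnorm(100 * 12), nrow = 100,
                     dimnames = list(paste0("gene", 1:100), paste0("s", 1:12)))

d <- dist(t(exprMatSub), method = "euclidean")   # cluster the samples

# Same distances, different linkage methods -> potentially different trees
trees <- list(single   = hclust(d, method = "single"),
              complete = hclust(d, method = "complete"),
              average  = hclust(d, method = "average"))

# The dendrogram must be 'cut' to obtain discrete clusters, e.g. k = 3 groups
lapply(trees, cutree, k = 3)

plot(trees$complete) draws the corresponding dendrogram, as on the preceding slides.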
Model based approaches
• Assume the data are ‘generated’ from a mixture of K distributions
  – What cluster assignment and parameters of the K distributions best explain the data?
• ‘Fit’ a model to the data and try to get the best fit
• Classical example: mixture of Gaussians (mixture of normals)
• Take advantage of probability theory and well-defined distributions in statistics

Model based clustering: array CGH
[Figure: array CGH copy number profiles]

Model based clustering of aCGH
• Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect
• Approach: cluster the data by extending the profiling to the multi-group setting with a mixture of HMMs: HMM-Mix
[Figure: HMM-Mix plate diagram — group g, sparse profiles, distribution of calls in a group, profile state c, CNA calls state k, patient p, raw data]
Shah et al (Bioinformatics, 2009)

Advantages of model based approaches
• In addition to clustering patients into groups, we output a ‘model’ that best represents the patients in a group
• We can then associate each model with clinical variables and simply output a classifier to be used on new patients
• Choosing the number of groups becomes a model selection problem (e.g. using the Bayesian Information Criterion)
  – see Yeung et al, Bioinformatics (2001)

Clustering 106 follicular lymphoma patients with HMM-Mix
[Figure: initialisation, converged profiles and clinical annotation]
• Recapitulates known FL subgroups
• Subgroups have clinical relevance

Feature selection
• Most features (genes, SNP probesets, BAC clones) in high dimensional datasets will be uninformative
  – examples: unexpressed genes, housekeeping genes, ‘passenger alterations’
• Clustering (and classification) has a much higher chance of success if uninformative features are removed
• Simple approaches (see the sketch after this slide):
  – select intrinsically variable genes
  – require a minimum level of expression in a proportion of samples
  – genefilter package (Bioconductor): Lab 1
• We return to feature selection in the context of classification
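A minimal base-R sketch of the simple filters listed above (the genefilter package used in Lab 1 provides equivalent convenience functions); the simulated matrix and both cutoffs are assumptions chosen for illustration.

# Sketch: variance and expression-level filtering (assumed data and cutoffs)
set.seed(3)
exprMat <- matrix(rnorm(5000 * 40, mean = 6, sd = 1), nrow = 5000)
exprMat[1:500, ] <- matrix(rnorm(500 * 40, mean = 8, sd = 2), nrow = 500)  # 'informative' genes

# Keep genes that are intrinsically variable across samples...
variable  <- apply(exprMat, 1, sd) > 1.5

# ...and expressed above a minimum level in at least 25% of samples
expressed <- rowMeans(exprMat > 7) >= 0.25

exprMatSub <- exprMat[variable & expressed, , drop = FALSE]
dim(exprMatSub)   # far fewer, more informative features for clustering

The filtered matrix plays the role of the exprMatSub object clustered in the earlier hclust examples.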
Advanced topics in clustering
• Top-down clustering
• Bi-clustering or ‘two-way’ clustering
• Principal components analysis
• Choosing the number of groups – model selection
  – AIC, BIC
  – silhouette coefficient
  – the gap curve
• Joint clustering and feature selection

What Have We Learned?
• There are three main types of clustering approaches
  – hierarchical
  – partitioning
  – model based
• Feature selection is important
  – reduces computational time
  – more likely to identify well-separated groups
• The distance metric matters
• The linkage method matters in hierarchical clustering
• Model based approaches offer principled probabilistic methods

Module Overview
• Clustering
• Classification
• Feature Selection

Classification
• What is classification?
  – supervised learning
  – discriminant analysis
• Work from a set of objects with predefined classes
  – e.g. basal vs luminal, or good responder vs poor responder
• Task: learn from the features of the objects what the basis for discrimination is
• Statistically and mathematically heavy

Classification
[Figure: samples labelled ‘poor response’ and ‘good response’ are used to learn a classifier; for a new patient, what is the most likely response?]

Example: DLBCL subtypes
[Figure: Wright et al, PNAS (2003)]

DLBCL subtypes
[Figure: Wright et al, PNAS (2003)]

Classification approaches
• Wright et al, PNAS (2003)
• Weighted features in a linear predictor score: $LPS(X) = \sum_j a_j X_j$
  – $a_j$: weight of gene j, determined by the t-test statistic
  – $X_j$: expression value of gene j
• Assume there are 2 distinct distributions of LPS: one for ABC, one for GCB

Wright et al, DLBCL, cont’d
• Use Bayes’ rule to determine the probability that a sample comes from group 1:
  $P(\text{group 1} \mid X) = \dfrac{\phi(LPS(X); \mu_1, \sigma_1)}{\phi(LPS(X); \mu_1, \sigma_1) + \phi(LPS(X); \mu_2, \sigma_2)}$
  – $\phi(\cdot; \mu_1, \sigma_1)$: probability density function that represents group 1

Learning the classifier, Wright et al
• Choosing the genes (feature selection):
  – use cross validation
  – leave-one-out cross validation (a minimal sketch appears at the end of these notes):
    • pick a set of samples
    • use all but one of the samples as training, leaving one out for testing
    • fit the model using the training data
    • can the classifier correctly pick the class of the remaining case?
    • repeat exhaustively, leaving out each sample in turn
  – repeat using different sets and numbers of genes chosen by t-statistic
  – pick the set of genes that gives the highest accuracy

Overfitting
• In many cases in biology, the number of features is much larger than the number of samples
• Important features may not be represented in the training data
• This can result in overfitting
  – when a classifier discriminates well on its training data, but does not generalise to orthogonally derived data sets
• Validation is required in at least one external cohort to believe the results
  – example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets

Overfitting
• To reduce the problem of overfitting, one can use Bayesian priors to ‘regularize’ the parameter estimates of the model
• Some methods now integrate feature selection and classification in a unified analytical framework
  – see Law et al, IEEE (2005): Sparse Multinomial Logistic Regression (SMLR): http://www.cs.duke.edu/~amink/software/smlr/
• Cross validation should always be used in training a classifier

Evaluating a classifier
• The receiver operating characteristic (ROC) curve
  – plots the true positive rate (TPR) against the false positive rate (FPR)
• Given ground truth and a probabilistic classifier, for a range of probability thresholds:
  – compute the TPR – the proportion of true positives that are predicted as positive
  – compute the FPR – the proportion of true negatives that are incorrectly predicted as positive

Other methods for classification
• Support vector machines
• Linear discriminant analysis
• Logistic regression
• Random forests
• See:
  – Ma and Huang, Briefings in Bioinformatics (2008)
  – Saeys et al, Bioinformatics (2007)

Questions?

Lab: Clustering and feature selection
• Get familiar with clustering tools and plotting
  – feature selection methods
  – distance matrices
  – linkage methods
  – partition methods
• Try to reproduce some of the figures from Chin et al using the freely available data

Module 2: Lab
Coffee break. Back at: 15:00
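As referenced on the ‘Learning the classifier’ slide, here is a minimal leave-one-out cross-validation sketch in R, in the spirit of the Wright et al linear predictor score but deliberately simplified: genes are weighted by their training-set t statistics and the held-out sample is assigned to the class with the nearer mean training LPS (a nearest-mean rule, not the Bayes-rule step of the actual paper). The simulated matrix, the number of informative genes and the assignment rule are illustrative assumptions.

# Sketch: leave-one-out cross-validation of a t-statistic-weighted score (assumed data)
set.seed(4)
n.genes <- 200; n.samp <- 30
labels  <- factor(rep(c("ABC", "GCB"), each = n.samp / 2))
X <- matrix(rnorm(n.genes * n.samp), nrow = n.genes)
X[1:20, labels == "ABC"] <- X[1:20, labels == "ABC"] + 1.5   # 20 informative genes

predicted <- character(n.samp)
for (i in seq_len(n.samp)) {
  train.x <- X[, -i, drop = FALSE]          # leave sample i out
  train.y <- labels[-i]
  # weight each gene by its two-sample t statistic, computed on training data only
  w <- apply(train.x, 1, function(g) t.test(g[train.y == "ABC"],
                                            g[train.y == "GCB"])$statistic)
  lps.train <- colSums(train.x * w)         # linear predictor scores of training samples
  lps.test  <- sum(X[, i] * w)              # score of the held-out sample
  centers <- tapply(lps.train, train.y, mean)
  predicted[i] <- names(which.min(abs(centers - lps.test)))   # nearest class mean
}
mean(predicted == labels)                    # leave-one-out accuracy

In practice the gene set itself should also be reselected inside each cross-validation fold, exactly as the slide describes, to avoid optimistically biased accuracy estimates.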