Orthogonal Decision Trees and Beyond

Hillol Kargupta
Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County
http://www.cs.umbc.edu/~hillol, hillol@cs.umbc.edu
AGNIK, LLC, http://www.agnik.com, hillol@agnik.com
Acknowledgements: Haimonti Dutta, Byung-Hoon Park, Rajeev Ayyagari

Roadmap
– Introduction
– Analysis of models and ensembles
– Fourier spectrum of decision tree ensembles
– Orthogonal decision trees
– Genetic code and Fourier analysis
– Conclusions and future work

Research & Development at UMBC: DIADIC Laboratory and AGNIK, LLC
– Distributed data mining and computation.
– Supported by NASA, an NSF CAREER award, the US Air Force, NSF 0083946, NSF 9803660, the TRW Research Foundation, the Maryland Technology Development Council, and others.
– AGNIK, LLC: a spin-off from the DIADIC Lab, specializing in mobile and distributed data mining and management.

The Story of Two Watchmakers
“There once were two watchmakers, named Hora and Tempus, who manufactured very fine watches. Both of them were highly regarded, and the phones in their workshops rang frequently. New customers were constantly calling them. However, Hora prospered while Tempus became poorer and poorer and finally lost his shop. What was the reason? ………”
H. Simon, 1962, The Architecture of Complexity.
The moral: one can build more stable ensembles.

Exploring and Engineering Complex Systems
Most complex systems are ensembles. Examples:
– Ecology
– Large man-made complex systems
– Biology

Ecology: ant colony. An ensemble of simpler activities produces emergent behavior.

Large engineered complex systems: an ensemble of different functional modules.

Biology: Gene Expression
DNA → (alphabet transformation: transcription) → mRNA → (alphabet transformation: translation) → protein sequence → (mapping from sequence to Euclidean space) → folded protein
A set of representation transformations:
– Transcription (DNA → mRNA)
– Translation (mRNA → protein)
– Folding of proteins

Gene Expression: An Ensemble Effect
Different portions of the DNA produce different proteins in different cells: a distributed, ensemble-based computation of gene expression.

Analysis of Model Ensembles in Science & Engineering: A Few Examples
– The k-armed bandit problem and the allocation-of-trials problem in an ensemble of organisms (Holland, 1975)
– The schema theorem (Holland, 1975)

More Examples
Dimensional analysis:
– Describe the physical process in terms of an ensemble of dimensionless quantities.
– Find a way to aggregate those quantities.

Another Example
Variational techniques and finite elements:
– Example: solving Ax = b is equivalent to minimizing P(x) = (1/2)xᵀAx − xᵀb.
– Rayleigh–Ritz principle: choose n trial functions and minimize over the subspace defined by the “ensemble” of trial functions.

Fourier Analysis of Complex Models
– Decision trees
– Genetic code-like transformations

Decision Trees
[Figure: an example decision tree with internal nodes x1, x2, x3, edges labeled high/low, large/small, red/blue, and leaves labeled + or −.]
• A decision tree builds a classification tree from a labeled data set.
• Internal nodes correspond to features; links correspond to feature values.
• Leaf nodes correspond to class labels.

Ensemble of Decision Trees
Ensemble classifiers:
– Bagging (Breiman, 1996)
– Random forest (Breiman, 2001)
– Arcing (Breiman, 1997)
– SEA, the Streaming Ensemble Algorithm (Street and Kim, 2001)
Problems:
– Large ensembles are difficult to interpret.
– The response time of large ensembles can be slow.
– Can we create a non-redundant, effective ensemble? (A minimal sketch of such an ensemble follows below.)
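To make the ensemble setting concrete before it is formalized in the next section, here is a minimal Python sketch of bagged decision trees combined by a weighted sum f(x) = ∑i ai fi(x). The dataset, the uniform weights, and the use of scikit-learn are illustrative assumptions, not details from the talk.

```python
# A minimal sketch of an ensemble of bagged decision trees combined by a
# weighted sum, f(x) = sum_i a_i * f_i(x), with labels in {-1, +1}.
# Dataset, weights, and scikit-learn usage are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
y = 2 * y - 1  # map labels {0, 1} -> {-1, +1} so a weighted sum makes sense

n_trees = 40
weights = np.full(n_trees, 1.0 / n_trees)        # uniform weights a_i
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), len(X))        # bootstrap sample (bagging)
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

def ensemble_predict(X):
    """f(x) = sign(sum_i a_i f_i(x)): weighted sum of the tree outputs."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n)
    return np.sign(weights @ votes)

print("training accuracy:", np.mean(ensemble_predict(X) == y))
```

With uniform weights this reduces to plain majority voting; non-uniform ai, for example weights derived from validation accuracy, fit the same formula.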
Classification Function of an Ensemble Classifier
The ensemble output is a weighted sum of the member classifiers:
f(x) = ∑i ai fi(x)
where ai is the weight of tree i and fi(x) is the classification produced by tree i.

Decision Trees as Functions
[Figure: the classic PlayTennis decision tree over Outlook (Sunny/Overcast/Rain), Humidity (High/Normal), and Wind (Strong/Weak), with 0/1 leaves.]
A decision tree can be viewed as a numeric function.

Fourier Representation of a Decision Tree
Each leaf of a tree corresponds to a partition of the input space, and the tree's function can be expanded over Fourier basis functions:
f(x) = ∑j wj Ψj(x)
where the wj are the Fourier coefficients (FCs) and the Ψj(x) are the Fourier basis functions.

Discrete Fourier Spectrum of a Decision Tree
• A very sparse representation: a polynomial number of non-zero coefficients. If k is the depth of the tree, then all coefficients involving more than k features are zero.
• Higher-order coefficients are exponentially smaller than the low-order coefficients (Kushilevitz and Mansour, 1991).
• The spectrum can therefore be approximated by the coefficients with significant magnitude.
[Figure: exponential decay of Fourier coefficient magnitude with increasing order.]

Aggregation of Multiple Decision Trees
With member spectra F1(x) = ∑ wj Ψj(x), F2(x) = ∑ wj Ψj(x), F3(x) = ∑ wj Ψj(x), the ensemble
F(x) = a1 F1(x) + a2 F2(x) + a3 F3(x)
is the weighted average of the decision trees, computed directly in the Fourier domain by adding the weighted coefficients.

From the Fourier Spectrum Back to a Decision Tree
The spectrum is invertible: a decision tree can be reconstructed from its Fourier spectrum.

Fourier Spectrum and Decision Trees
We developed efficient algorithms to:
– compute the Fourier spectrum of a decision tree (IEEE TKDE; SIAM Data Mining Conf.; IEEE Data Mining Conf.; ACM SIGKDD Explorations);
– compute a tree from the Fourier spectrum (IEEE Transactions on SMC, Part B).

Fourier Spectrum and the Inner Product of Decision Trees
If f1(x) and f2(x) are two decision trees and W1 and W2 are the corresponding Fourier spectra, then
⟨f1(x), f2(x)⟩ = ⟨W1, W2⟩.

The Fourier Spectra Matrix and Its Eigenanalysis
Consider the matrix W, where Wi,j is the Fourier coefficient of the i-th basis function in the spectrum of tree Tj. Compute the eigenvectors and eigenvalues of WᵀW.

Orthogonal Decision Trees
– Compute the Fourier spectrum of each decision tree in the ensemble.
– PCA: an eigenanalysis of the covariance matrix.
– Each eigenvector represents the Fourier spectrum of a decision tree.
– Construct a tree from each eigenvector.
These trees are functionally orthogonal to each other and constitute a redundancy-free ensemble. (A numerical sketch of this pipeline appears after the SPECT results below.)

[Figure: an ensemble of decision trees, and an orthogonal tree generated from the ensemble with internal nodes Attribute 15, Attribute 10, and Attribute 19 and 0/1 leaves.]

Experimental Results: SPECT Data
Single Photon Emission Computed Tomography (SPECT) image data from the UC Irvine repository: 267 images, 22 binary features, Boolean classification.

Method of classification     Tree complexity   Error percentage
C4.5                         13                24.59%
Bagging (40 trees)           202               20.85%
Aggregated Fourier trees     3                 19.78%
Orthogonal decision trees    3                 8.02%

Comparing Random Forests and ODT Ensembles

Method of classification                              Error percentage   Tree complexity (nodes in all trees)
Random forest (40 trees)                              23.2%              322
ODT (first principal component, 99.67% of variance)   9.09%              7
ODTs (projection onto 40 eigenvectors)                9.09%              120

The ODT formed by projection onto the most dominant eigenvector performs as well as an ensemble of 40 different ODT trees!
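The pipeline described above (decision tree, to Fourier spectrum, to the spectra matrix W, to the eigenanalysis of WᵀW) can be illustrated numerically. The following Python/NumPy sketch uses tiny ±1-valued Boolean functions as stand-ins for decision trees; the parity basis convention, the toy trees, and the ensemble weights are assumptions made for illustration, and the talk's tree-from-spectrum reconstruction step is not reproduced.

```python
# A numerical sketch of the Fourier/ODT pipeline on toy +/-1-valued
# "trees" over n Boolean features. The toy functions and weights are
# illustrative assumptions, not the talk's algorithms or data.
import numpy as np

n = 4                        # number of Boolean features
N = 2 ** n                   # size of the truth table
X = np.array([[(x >> i) & 1 for i in range(n)] for x in range(N)])

def spectrum(f_vals):
    """Coefficients w_j of f(x) = sum_j w_j psi_j(x), with the parity
    basis psi_j(x) = (-1)^(j . x), over the full truth table."""
    H = np.array([[(-1) ** bin(i & j).count("1") for j in range(N)]
                  for i in range(N)])
    return H @ f_vals / N

# Three shallow "decision trees" over x0..x3, written as +/-1 functions.
trees = [
    np.where(X[:, 0] == 1, 1.0, np.where(X[:, 1] == 1, 1.0, -1.0)),
    np.where(X[:, 2] == 1, 1.0, -1.0),
    np.where(X[:, 0] == 1, np.where(X[:, 3] == 1, 1.0, -1.0), -1.0),
]
W = np.column_stack([spectrum(f) for f in trees])  # W[i, j]: coeff i, tree j

# Aggregation: the spectrum of a weighted sum is the weighted sum of spectra.
a = np.array([0.5, 0.3, 0.2])                      # ensemble weights a_i
agg = a[0] * trees[0] + a[1] * trees[1] + a[2] * trees[2]
assert np.allclose(spectrum(agg), W @ a)

# Inner-product preservation: <f1, f2> = <W1, W2>.
assert np.isclose(trees[0] @ trees[1] / N, W[:, 0] @ W[:, 1])

# Eigenanalysis of W^T W (PCA of the spectra). Each eigenvector yields the
# spectrum of an "orthogonal tree" as a linear combination of the members;
# converting those spectra back into actual trees is not reproduced here.
eigvals, eigvecs = np.linalg.eigh(W.T @ W)
ortho_spectra = W @ eigvecs                        # columns: ODT spectra
gram = ortho_spectra.T @ ortho_spectra
print(np.round(gram, 6))  # off-diagonals ~ 0: functionally orthogonal
```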
Experimental Results: NASDAQ Data
Discretized NASDAQ data: 99 stocks used to predict ups and downs in the Yahoo stock.

Method of classification            Tree complexity   Error percentage
C4.5                                103               12.6%
Bagging (60 trees)                  92.85             11.2%
Aggregated Fourier trees            33                9.2%
Orthogonal decision trees           5                 9.2%

Experimental Results: DNA Data
DNA data from the UC Irvine repository.

Method of classification            Tree complexity   Error percentage
C4.5                                131               6.5%
Bagging (10 trees)                  34                8.9%
Aggregated Fourier trees            3                 8.3%
Dominant orthogonal decision tree   3                 10.2%

Experimental Results: House of Votes Data
House of votes data from the UC Irvine repository: 435 instances, 16 Boolean-valued attributes.

Method of classification            Tree complexity   Error percentage
C4.5                                9                 8.0%
Bagging (40 trees)                  79                11.0%
Aggregated Fourier trees            5                 11.0%
Orthogonal decision trees           15                11.0%

Haar DWT for Representing Classifiers
R. Mulvaney and D. Phatak (2003). "Multiclass Multidimensional Modified Haar DWT for Classification Function Representation and its Realization via Fast Fixed-Depth Network."

Observations
Orthogonal decision trees are:
– redundancy-free;
– functionally orthogonal to each other;
– an efficient yet meaningful representation of a large ensemble.
They bring linear systems theory to the analysis of classifier ensembles, e.g., stability analysis of an ensemble.

Fourier Analysis of Genetic Code-like Transformations

From DNA to Protein (recap)
DNA → (transcription) → mRNA → (translation) → protein sequence → (mapping from sequence to Euclidean space) → folded protein.

The Genetic Code That Controls Translation

Alanine          GCA, GCC, GCG, GCU
Cysteine         UGC, UGU
Aspartic acid    GAC, GAU
Glutamic acid    GAA, GAG
Phenylalanine    UUC, UUU
Glycine          GGA, GGC, GGG, GGU
Histidine        CAC, CAU
Isoleucine       AUA, AUC, AUU
Lysine           AAA, AAG
Leucine          UUA, UUG, CUA, CUC, CUG, CUU
Methionine       AUG
Asparagine       AAC, AAU
Proline          CCA, CCC, CCG, CCU
Glutamine        CAA, CAG
Arginine         AGA, AGG, CGA, CGC, CGG, CGU
Serine           AGC, AGU, UCA, UCC, UCG, UCU
Threonine        ACA, ACC, ACG, ACU
Valine           GUA, GUC, GUG, GUU
Tryptophan       UGG
Tyrosine         UAC, UAU
STOP             UAA, UAG, UGA

Genetic Code-like Transformations (GCTs)
η maps every feature of x to c features in the new representation. η: Xⁿ → Xᶜⁿ is defined by a code book; such an η is a genetic code-like transformation (GCT).

An Example GCT (codon size c = 3)

x    codons (z1 z2 z3)
1    100, 011, 001, 110
0    111, 101, 010, 000

Every variable x in Xⁿ is mapped to c variables in Xᶜⁿ. The code is redundant, with codon size c.

Illustration: Translation-Induced Equivalence (TIE) Class
The 2-bit string 11 maps to codon pairs such as 100 100, 100 011, 001 100, 110 …, 110 110. The set of all encodings of a given string forms its TIE class.

Examples

A 2-bit mapping:   x = 1 → {00, 11};  x = 0 → {01, 10}
A 3-bit mapping:   x = 1 → {001, 110, 100, 011};  x = 0 → {010, 101, 111, 000}
A 4-bit mapping:   x = 1 → {0010, 1101, 1000, 0111, 0101, 1010, 1001, 0110}
                   x = 0 → {0000, 1111, 0001, 1110, 0011, 1100, 0100, 1011}

What Is the Effect of a GCT?
Consider some function f(x) in a given representation. What is the effect of a GCT on the representation of f(x)? Is f(η(x)) “interesting”?

Discrete Fourier Analysis of Randomized GCTs
– Higher-order Fourier coefficients become less significant at an exponential rate.
– Randomized GCTs have a “linearizing” effect: they appear to make functions more linear. (A minimal sketch of a randomized GCT follows below.)
For more details: H. Kargupta, R. Ayyagari, and S. Ghosh (2003). Learning Functions Using Randomized Expansions: Probabilistic Properties and Experimentations. IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 8, pages 894-908.
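Here is a minimal sketch of a randomized GCT of the kind analyzed above: a randomly drawn code book assigns half of the 2^c codons of size c to each bit value, every bit of x is replaced by a randomly chosen codon for it, and translation maps each codon back. The random code book and the helper names are assumptions for illustration.

```python
# A minimal sketch of a randomized genetic code-like transformation (GCT):
# eta: X^n -> X^{cn}, defined by a code book assigning half of the 2^c
# size-c codons to each bit value. The random code book is an illustrative
# assumption; the talk's code books are fixed examples.
import itertools
import random

random.seed(0)
c = 3                                            # codon size
codons = ["".join(bits) for bits in itertools.product("01", repeat=c)]
random.shuffle(codons)
codebook = {"0": codons[: 2 ** (c - 1)],         # codons that decode to 0
            "1": codons[2 ** (c - 1):]}          # codons that decode to 1
decode = {cw: bit for bit, group in codebook.items() for cw in group}

def gct(x):
    """eta(x): replace each bit of x by a randomly chosen codon for it."""
    return "".join(random.choice(codebook[bit]) for bit in x)

def inverse_gct(z):
    """Translate codon by codon back to the original representation."""
    return "".join(decode[z[i:i + c]] for i in range(0, len(z), c))

x = "1101"
z = gct(x)
print(x, "->", z, "->", inverse_gct(z))
# All the different encodings of the same x form one translation-induced
# equivalence (TIE) class, so f(eta(x)) is evaluated on a randomly drawn
# member of that class; the reference above studies the resulting spectra.
```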
A Single Perceptron
[Figure: a perceptron with inputs x1, x2, weights w1, w2, threshold θ, and output f(x).]
– A linear classifier, with a learning algorithm that has a convergence proof.
– It can learn only functions whose Fourier spectra contain nothing beyond constant and order-1 coefficients.
– The Fourier spectrum of a two-bit XOR has a constant and an order-2 coefficient, so a perceptron cannot learn XOR.

Performance of a Perceptron on XOR
[Figures: classification error vs. size of the XOR problem, for a plain perceptron and for perceptrons with 2-bit, 3-bit, and 4-bit codons.]
(A minimal code sketch of the XOR argument appears at the end of the document.)

Conclusions
– Ensembles play a fundamental role in many physical processes.
– Analyzing ensemble properties may provide deeper understanding.
– This may require the development of an algebra for ensembles.

References
H. Kargupta, R. Ayyagari, and S. Ghosh (2003). Learning Functions Using Randomized Expansions: Probabilistic Properties and Experimentations. IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 8, pages 894-908.
H. Kargupta, B. Park, and H. Dutta (2004). Orthogonal Decision Trees. ICDM '04 (extended version in communication).
H. Kargupta and B. Park (2004). Mining Data Streams from Mobile Devices Using the Fourier Spectrum of Decision Trees. IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 2, pages 216-229.

Brief Bio of Hillol Kargupta
Hillol Kargupta is an Associate Professor in the Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 1996. He is also a co-founder of AGNIK LLC, a ubiquitous data intelligence company. His research interests include mobile and distributed data mining and computation in gene expression. Dr. Kargupta won a National Science Foundation CAREER award in 2001 for his research on ubiquitous and distributed data mining. He and his co-authors received the best paper award at the 2003 IEEE International Conference on Data Mining for a paper on privacy-preserving data mining. He won the 2000 TRW Foundation Award and the 1997 Los Alamos Award for Outstanding Technical Achievement. His dissertation earned him the 1996 Society for Industrial and Applied Mathematics (SIAM) annual best student paper prize. He has published more than eighty peer-reviewed articles in journals, conferences, and books. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering and the IEEE Transactions on Systems, Man, and Cybernetics, Part B. He served as Associate General Chair of the 2003 ACM SIGKDD Conference, is Program Co-Chair of the 2005 SIAM Data Mining Conference, and is Vice-Chair of the 2005 IEEE International Conference on Data Mining. He has co-edited two books: (1) Advances in Distributed and Parallel Knowledge Discovery, AAAI/MIT Press, and (2) Data Mining: Next Generation Challenges and Future Directions, AAAI/MIT Press. He serves on the program committees of almost every major data mining conference (e.g., ACM, IEEE, SIAM) and has been a member of the organizing committee of the SIAM Data Mining Conference every year from 2001 through 2005. He has organized many workshops and edited journal special issues on distributed data mining and related topics, and he regularly serves as an invited speaker at international conferences and workshops. More information about him can be found at http://www.cs.umbc.edu/~hillol.
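As a closing illustration of the perceptron-and-XOR argument referenced earlier, here is a minimal sketch that computes the Fourier spectrum of the two-bit XOR and then trains a perceptron on it. The training settings are illustrative assumptions, and the codon-expanded experiments shown in the slides are not reproduced.

```python
# A minimal sketch: the Fourier spectrum of 2-bit XOR has only a constant
# and an order-2 coefficient, and a perceptron cannot learn XOR. The
# learning rate and epoch count are illustrative assumptions.
import itertools

points = list(itertools.product([0, 1], repeat=2))
xor = {x: x[0] ^ x[1] for x in points}

# Spectrum: f(x) = sum_j w_j (-1)^(j . x), so w_j = mean_x f(x)(-1)^(j . x).
for j in points:
    w = sum(xor[x] * (-1) ** (j[0] * x[0] + j[1] * x[1]) for x in points) / 4
    print("coefficient for basis", j, "=", w)  # 0.5 at (0,0), -0.5 at (1,1)

# Perceptron training: no (w1, w2, theta) separates XOR, so the update
# rule keeps cycling and accuracy never reaches 1.0.
w1 = w2 = theta = 0.0
for _ in range(100):
    for (x1, x2), y in xor.items():
        err = y - int(w1 * x1 + w2 * x2 > theta)
        w1, w2 = w1 + 0.1 * err * x1, w2 + 0.1 * err * x2
        theta -= 0.1 * err
acc = sum(int(w1 * x1 + w2 * x2 > theta) == y
          for (x1, x2), y in xor.items()) / 4
print("perceptron accuracy on XOR:", acc)      # at most 0.75
```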