* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download FAST Lab Group Meeting 4/11/06
Computational complexity theory wikipedia , lookup
Pattern recognition wikipedia , lookup
Mathematical optimization wikipedia , lookup
Computational electromagnetics wikipedia , lookup
Discrete cosine transform wikipedia , lookup
Probabilistic context-free grammar wikipedia , lookup
Multidimensional empirical mode decomposition wikipedia , lookup
Simplex algorithm wikipedia , lookup
Operational transformation wikipedia , lookup
Selection algorithm wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Theoretical computer science wikipedia , lookup
Genetic algorithm wikipedia , lookup
Fast Fourier transform wikipedia , lookup
Corecursion wikipedia , lookup
Non-negative matrix factorization wikipedia , lookup
Recursion (computer science) wikipedia , lookup
Dijkstra's algorithm wikipedia , lookup
Cluster analysis wikipedia , lookup
K-nearest neighbors algorithm wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Time complexity wikipedia , lookup
Factorization of polynomials over finite fields wikipedia , lookup
Fast Algorithms & Data Structures for Visualization and Machine Learning on Massive Data Sets Alexander Gray Fundamental Algorithmic and Statistical Tools Laboratory (FASTlab) Computational Science and Engineering Division College of Computing Georgia Institute of Technology The FASTlab • • • • • • • • • • • • • • • • • • Arkadas Ozakin: Research scientist, PhD theoretical physics Dong Ryeol Lee: PhD student, CS + Math Ryan Riegel: PhD student, CS + Math Parikshit Ram: PhD student, CS + Math William March: PhD student, Math + CS James Waters: PhD student, Physics + CS Nadeem Syed: PhD student, CS Hua Ouyang: PhD student, CS Sooraj Bhat: PhD student, CS Ravi Sastry: PhD student, CS Long Tran: PhD student, CS Michael Holmes: PhD student, CS + Physics (co-supervised) Nikolaos Vasiloglou: PhD student, EE (co-supervised) Wei Guan: PhD student, CS (co-supervised) Nishant Mehta: PhD student, CS (co-supervised) Wee Chin Wong: PhD student, ChemE (co-supervised) Abhimanyu Aditya: MS student, CS Yatin Kanetkar: MS student, CS Goal • New displays for high-dimensional data – – – – Isometric non-negative matrix factorization Rank-based embedding Density-preserving maps Co-occurrence embedding • New algorithms for scaling them to big datasets – Distances: Generalized Fast Multipole Method – Dot products: Cosine Trees and QUIC-SVD – MLPACK Plotting high-D in 2-D Dimension reduction beyond PCA: manifolds, embedding, etc. Goal • New displays for high-dimensional data – – – – Isometric non-negative matrix factorization Rank-based embedding Density-preserving maps Co-occurrence embedding • New algorithms for scaling them to big datasets – Distances: Generalized Fast Multipole Method – Dot products: Cosine Trees and QUIC-SVD – MLPACK Isometric Non-negative Matrix Factorization • NMF maintains the interpretability of components of data like images or text or spectra (SDSS) • However as a low-D display it is not faithful in general to the original distances • Isometric NMF [Vasiloglou, Gray, Anderson, to be submitted SIAM DM 2008] preserves both distances and nonnegativity; geometric prog formulation Rank-based Embedding • Suppose you don’t really have meaningful or reliable distances… but you can say “A and B are farther apart than A and C”, e.g. in document relevance • It is still possible to make an embedding! In fact there is some indication that using ranks is more stable than using distances • Can be formulated using hyperkernels; becomes either an SDP or a QP [Ouyang and Gray, ICML 2008] Density-preserving Maps • Preserving densities is statistically more meaningful than preserving distances • Might allow more reliable conclusions from the low-D display about clustering and outliers • DC formulation (Ozakin and Gray, to be submitted AISTATS 2008) Co-occurrence Embedding • Consider InBio data: 3M occurrences of species in Costa Rica • Densities are not reliable as the sampling strategy is unknown • But the overlapping of two species’ densities (cooccurrence) may be more reliable • How can distribution distances be embedded (Syed, Ozakin, Gray to be submitted ICML 2009)? Goal • New displays for high-dimensional data – – – – Isometric non-negative matrix factorization Rank-based embedding Density-preserving maps Co-occurrence embedding • New algorithms for scaling them to big datasets – Distances: Generalized Fast Multipole Method – Dot products: Cosine Trees and QUIC-SVD – MLPACK Computational problem • Such manifold methods are expensive: typically O(N3) – But it is big datasets that are often the most important to visually summarize • What are the underlying computations? – – – – All-k-nearest-neighbors Kernel summations Eigendecomposition Convex optimization Computational problem • Such manifold methods are expensive: typically O(N3) – But it is big datasets that are often the most important to visually summarize • What are the underlying computations? – – – – All-k-nearest-neighbors (distances) Kernel summations (distances) Eigendecomposition (dot products) Convex optimization (dot products) Distances: Generalized Fast Multipole Method • Generalized N-body Problems (Gray and Moore NIPS 2000; Riegel, Boyer, and Gray TR 2008) include: – – – – All-k-nearest-neighbors Kernel summations Force summations in physics A very large number of bottleneck statistics and machine learning computations • Defined using category theory Distances: Generalized Fast Multipole Method • There exists a generalization (Gray and Moore NIPS 2000; Riegel, Boyer, and Gray TR 2008) of the Fast Multipole Method (Greengard and Rokhlin 1987) which: – specializes to each of these problems – is the fastest practical algorithm for these problems – elucidates general principles for such problems • Parallel: THOR (Tree-based Higher-Order Reduce) Distances: Generalized Fast Multipole Method • Elements of the GFMM: – A spatial tree data structure, e.g. kd-trees, metric trees, cover-trees, SVD trees, disk trees – A tree expansion pattern – Tree-stored cached statistics – An error criterion and pruning criterion – A local approximation/pruning scheme with error bounds, e.g. Hermite expansions, Monte Carlo, exact pruning kd-trees: most widely-used spacepartitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977],[Moore & Lee 1995] A kd-tree: level 1 A kd-tree: level 2 A kd-tree: level 3 A kd-tree: level 4 A kd-tree: level 5 A kd-tree: level 6 Example: Generalized histogram ˆf (q )   I  q  x j  h  j bandwidth h query point q Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Pruned! (inclusion) Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm Pruned! (exclusion) Range-count recursive algorithm Range-count recursive algorithm Range-count recursive algorithm fastest practical algorithm [Bentley 1975] our algorithms can use any tree Dot products: Cosine Trees and QUIC-SVD QUIC-SVD (Holmes, Gray, Isbell NIPS 2008) • Cosine Trees: Trees for dot products • Use Monte Carlo within cosine trees to achieve best-rank approximation with userspecified relative error • Very fast, but with probabilistic bounds Dot products: Cosine Trees and QUIC-SVD Uses of QUIC-SVD: • PCA, KPCA eigendecomposition • Working on: fast interior-point convex optimization Bigger goal: make all the best statistical/learning methods efficient! • Ground rules: – – – – Asymptotic speedup as well as practical speedup Arbitrarily high accuracy with error guarantees No manual tweaking Really works (validated in a big real-world problem) • Treating entire classes of methods – Methods based on distances (generalized N-body problems) – Methods based on dot products (linear algebra) – Soon: Methods based on discrete structures (combinatorial/graph problems) • Watch for MLPACK, coming Dec. 2008 – Meant to be the equivalent of linear algebra’s LAPACK So far: fastest algs for… • • • • • • • • • 2000 all-nearest-neighbors (1970) 2000 n-point correlation functions (1950) 2003,05,06 kernel density estimation (1953) 2004 nearest-neighbor classification (1965) 2005,06,08 nonparametric Bayes classifier (1951) 2006 mean-shift clustering/tracking (1972) 2006 k-means clustering (1960s) 2007 hierarchical clustering/EMST (1960s) 2007 affinity propagation/clustering (2007) So far: fastest algs for… • • • • 2008 principal component analysis* (1930s) 2008 local linear kernel regression (1960s) 2008 hidden Markov models* (1970s) Working on: – – – – – linear regression, Kalman filters (1960s) Gaussian process regression (1960s) Gaussian graphical models (1970s) Manifolds, spectral clustering (2000s) Convex kernel machines (2000s) Some application highlights so far… • First large-scale dark energy confirmation: top Science breakthrough of 2003) • First large-scale cosmic magnification confirmation (Nature, 2005) • Integration into Google image search (we think), 2005 • Integration into Microsoft SQL Server, 2008 • Working on: – Integration into Large Hadron Collider pipeline, 2008 – Fast IP-based spam filtering (Secure Comp, 2008) – Fast recommendation (Netflix) To find out more: • Best way: agray@cc.gatech.edu • Mostly outdated: www.cc.gatech.edu/~agray
 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                            