Data Mining: Current Status and Research Directions
Jiawei Han
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca/~han
May 22, 2017 (Data Mining: Status and Directions)

Outline
- Why is data mining hot?
- Current status: Major technical progress
- Is data mining flying high, or not?
- How to fly data mining high? Research directions on data mining

Why Is Data Mining Hot?
- Data mining (knowledge discovery in databases)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories
- Necessity is the mother of invention
  - Data is everywhere; data mining should be everywhere, too!
  - Understand and use data: an imminent task!

Data, Data, Everywhere!!
- Relational database: A commodity of every enterprise
- Huge data warehouses are under construction
- POS (Point of Sales): Transactional DBs in terabytes
- Object-relational databases, distributed, heterogeneous, and legacy databases
- Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases
- Time-series data (e.g., stock trading) and temporal data
- Text (documents, emails) and multimedia databases
- WWW: A huge, hyper-linked, dynamic, global information system

Data Mining Is Everywhere, Too! A Multi-Dimensional View of Data Mining
- Databases to be mined
  - Relational, transactional, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning (AI), visualization, information science, and other disciplines]

Data Mining: One Can Trace Back to Early Civilization
- Most scientific discoveries involve "data mining"
  - Kepler's Law, Newton's Laws, periodic table of chemical elements, ..., from "big bang" to DNA
- Statistics: A discipline dedicated to data analysis
- Then why data mining? What are the differences?
  - Huge amount of data: in giga to tera bytes
  - Fast computer: quick response, interactive analysis
  - Multi-dimensional, powerful, thorough analysis
  - High-level, "declarative": user's ease and control
  - Automated or semi-automated: mining functions hidden or built-in in many systems

A Brief History of Data Mining Activities
- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- 1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
- More conferences on data mining
  - PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.

Research Progress in the Last Decade
- Multi-dimensional data analysis: Data warehouse and OLAP (on-line analytical processing)
- Association, correlation, and causality analysis
- Classification: scalability and new approaches
- Clustering and outlier analysis
- Sequential patterns and time-series analysis
- Similarity analysis: curves, trends, images, texts, etc.
- Text mining, Web mining and Weblog analysis
- Spatial, multimedia, scientific data analysis
- Data preprocessing and database compression
- Data visualization and visual data mining
- Many others, e.g., collaborative filtering

Multi-Dimensional Data Analysis
- Data warehousing: integration from heterogeneous or semi-structured databases
- Multi-dimensional modeling of data: star & snowflake schemas
- Efficient and scalable computation of data cubes or iceberg cubes
- OLAP (on-line analytical processing): drilling, dicing, slicing, etc.
- Discovery-driven exploration of data cubes
- From OLAP to OLAM: A multi-dimensional view for on-line analytical mining

Association and Frequent Pattern Analysis
- Efficient mining of frequent patterns and association rules:
  - Apriori and FP-growth algorithms
- Multi-level, multi-dimensional, quantitative association mining
- From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.
- Query and constraint-based association analysis

Classification: Scalable Methods and Handling of Complex Types of Data
- Classification has been an essential theme in machine learning and statistics research
  - Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
  - Tree-pruning, boosting, bagging techniques
- Efficient and scalable classification methods
  - Exploration of attribute-class pairs
  - SLIQ, SPRINT, RainForest, BOAT, etc.
- Classification of semi-structured and non-structured data
  - Classification by clustering association rules (ARCS)
  - Association-based classification
  - Web document classification

Clustering and Outlier Analysis
- Partitioning methods
  - k-means, k-medoids, CLARANS
- Hierarchical methods: micro-clusters
  - BIRCH, CURE, Chameleon
- Density-based methods:
  - DBSCAN and OPTICS, DENCLUE
- Grid-based methods
  - STING, CLIQUE, WaveCluster
- Outlier analysis:
  - statistics-based, distance-based, deviation-based
- Constraint-based clustering
  - COD (Clustering with Obstructed Distance)
  - User-specified constraints

Sequential Patterns and Time-Series Analysis
- Trend analysis
  - Trend movement vs. cyclic variations, seasonal variations and random fluctuations
- Similarity search in time-series database
  - Handling gaps, scaling, etc.
  - Indexing methods and query languages for time-series
- Sequential pattern mining
  - Various kinds of sequences, various methods
  - From GSP to PrefixSpan
- Periodicity analysis
  - Full periodicity, partial periodicity, cyclic association rules

Similarity Search: Similar Curves, Trends, Images, and Texts
- Various kinds of data, various similarity mining methods
- Discovery of similar trends in time-series data
  - Data transformation & high-dimensional structures
- Finding similar images based on color, texture, etc.
  - Content-based vs. keyword-based retrieval
  - Color histogram-based signature
  - Multi-feature composed signature
- Finding documents with similar texts
  - Similar keywords (synonymy & polysemy)
  - Term frequency matrix
  - Latent semantic indexing

Spatial, Multimedia, Scientific Data Analysis
- Multi-dimensional analysis of spatial, multimedia and scientific data
  - Geo-spatial data cube and spatial OLAP
  - The curse of dimensionality problem
- Association analysis
  - A progressive refinement methodology
- Micro-clustering can be used for preprocessing in the analysis of complex types of data
- Classification
  - Association-based for handling high-dimensionality and sparse data

Data Mining Industry and Applications
- From research prototypes to data mining products, languages, and standards
  - IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
  - A few data mining languages and standards (esp. MS OLEDB for Data Mining).
- Application achievements in many domains
  - Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

Is Data Mining Flying? Or Not??
- Data mining is flying
  - R & D have been striding forward greatly
  - Applications have been broadened substantially
- But not as high as some may have hoped. Why not?
  - Hope to see billions of $'s within years?
    - A young and coming technology, not a hype!
  - Not bread-and-butter but value-added service
    - DBMS, WWW, and other information systems will still be a "data mining" aircraft-carrier
  - Not off-the-shelf in nature
    - Need training, understanding, and customizing (re-develop.)
  - Young technology: need much R&D to fly high
    - Much research, development, and real problem solving!

How to Fly Data Mining High? Research Directions
- Web mining
- Towards integrated data mining environments and tools
- "Vertical" (or application-specific) data mining
- Invisible data mining
- Towards intelligent, efficient, and scalable data mining methods

Web Mining: A Fast Expanding Frontier in Data Mining
- Mine what Web search engine finds
- Automatic classification of Web documents
- Discovery of authoritative Web pages, Web structures and Web communities
- Meta-Web Warehousing: Web yellow page service
- Web usage mining

Mine What Web Search Engine Finds
- Current Web search engines: A convenient source for mining
  - keyword-based, return too many, often low quality answers, still missing a lot, not customized, etc.
- Data mining will help:
  - coverage: "Enlarge and then shrink," using synonyms and conceptual hierarchies
  - better search primitives: user preferences/hints
  - linkage analysis: authoritative pages and clusters
  - Web-based languages: XML + WebSQL + WebML
  - customization: home page + Weblog + user profiles

Discovery of Authoritative Pages in WWW
- Page-rank method (Brin and Page, 1998):
  - Rank the "importance" of Web pages, based on a model of a "random browser."
- Hub/authority method (Kleinberg, 1998):
  - Prominent authorities often do not endorse one another directly on the Web.
  - Hub pages have a large number of links to many relevant authorities.
  - Thus hubs and authorities exhibit a mutually reinforcing relationship.
- Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
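
The mutually reinforcing relationship between hubs and authorities can be illustrated with a small iteration. The sketch below is not taken from the slides: the function name `hits`, the toy link graph, the fixed iteration count, and the L2 normalization are all assumptions chosen for illustration. A page's authority score is the sum of the hub scores of the pages linking to it, and a page's hub score is the sum of the authority scores of the pages it links to.

```python
import math


def hits(links, iterations=50):
    """Hub/authority iteration on a tiny link graph (illustrative sketch).

    links: dict mapping each page to the list of pages it links to.
    Returns (hub, auth), two dicts of L2-normalized scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: add up hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: add up authority scores of the pages linked to.
        new_hub = {
            p: sum(new_auth[dst] for dst in links.get(p, []))
            for p in pages
        }

        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth


# Hypothetical toy graph: three hub pages pointing at two authorities.
# The authorities do not link to each other, as the slide notes is common.
toy_links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB"],
    "hub3": ["authA"],
    "authA": [],
    "authB": [],
}

hub, auth = hits(toy_links)
for page in sorted(auth, key=auth.get, reverse=True):
    print(f"{page}: authority={auth[page]:.3f} hub={hub[page]:.3f}")
```

In this toy graph, authA is pointed to by more hubs than authB and so ends up with the higher authority score, while hub1 and hub2 outrank hub3 as hubs because they point to both authorities.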

Automatic Classification of Web Documents
- Web document classification:
  - Good human classification: Yahoo!, CS term hierarchies
  - These classifications can be used as training sets to build up a learning model
- Keyword-based classification is different from multidimensional classification
  - Association or clustering-based classification is often more effective
  - Multi-level classification is important

A Multiple Layered Meta-Web Architecture
[Figure: layered meta-Web. Layer 0 at the bottom; Layer 1: generalized descriptions; ...; Layer n: more generalized descriptions]

Web Yellow Page Service: A Multi-Layer, Meta-Web Approach
- XML: facilitates structured and meta-information extraction
- Automatic classification of Web documents:
  - based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)
- Automatic ranking of important Web pages
  - authoritative site recognition and clustering Web pages
- Generalization-based multi-layer meta-Web construction
  - With the assistance of clustering and classification analysis
- Meta-Web can be warehoused and incrementally updated
- Querying and mining can be performed on or assisted by meta-Web

Importance of Constructing Multi-Layer Meta Web
- Benefits of Multi-Layer Meta-Web:
  - Multi-dimensional Web info summary analysis
  - Approximate and intelligent query answering
  - Web high-level query answering (WebSQL, WebML)
  - Web content and structure mining
  - Observing the dynamics/evolution of the Web
- Is it realistic to construct such a meta-Web?
  - It benefits even if it is partially constructed
  - The benefit may justify the cost of tool development, standardization, and partial restructuring

Web Usage (Click-Stream) Mining
- Weblog provides rich information about Web dynamics
- Multidimensional Weblog analysis:
  - disclose potential customers, users, markets, etc.
- Plan mining (mining general Web accessing regularities):
  - Web linkage adjustment, performance improvements
- Web accessing association/sequential pattern analysis:
  - Web caching, prefetching, swapping
- Trend analysis:
  - Dynamics of the Web: what has been changing?
- Customized to individual users

Towards Integrated Data Mining Environments and Tools
- OLAP Mining: Integration of Data Warehousing and Data Mining
- Querying and Mining: An Integrated Information Analysis Environment
- Basic Mining Operations and Mining Query Optimization
- "Vertical" (or application-specific) data mining
- Invisible data mining

OLAP Mining: An Integration of Data Mining and Data Warehousing
- Data mining systems, DBMS, Data warehouse systems coupling
  - No coupling, loose-coupling, semi-tight-coupling, tight-coupling
- On-line analytical mining data
  - integration of mining and OLAP technologies
- Interactive mining multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
- Integration of multiple mining functions
  - Characterized classification, first clustering and then association

An OLAM Architecture
[Figure: OLAM architecture in four layers. Layer 1, data repository: databases and data warehouse, with data cleaning, data integration, and filtering through a database API. Layer 2: MDDB and meta data, with filtering and integration. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine over a data cube API. Layer 4, user interface: user GUI API accepting mining queries and returning mining results.]

Querying and Mining: An Integrated Information Analysis Environment
- Data mining as a component of DBMS, data warehouse, or Web information system
  - Integrated information processing environment
    - MS/SQLServer-2000 (Analysis service)
    - IBM IntelligentMiner on DB2
    - SAS EnterpriseMiner: data warehousing + mining
- Query-based mining
  - Querying database/DW/Web knowledge
  - Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.

Basic Mining Operations and Mining Query Optimization
- Relational databases: There is a set of basic relational operations and a standard query language, SQL
  - E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.
- Is there a set of standard data mining operations, on which optimizations can be done?
  - Difficulty: different definitions on operations
  - Importance: optimization can be performed on them systematically, standardization to facilitate information exchange and system interoperability

"Vertical" Data Mining
- Generic data mining tools? Too simple to match domain-specific, sophisticated applications
  - Expert knowledge and business logic represent many years of work in their own fields!
  - Data mining + business logic + domain experts
- A multi-dimensional view of data miners
  - Complexity of data: Web, sequence, spatial, multimedia, ...
  - Complexity of domains: DNA, astronomy, market, telecom, ...
- Domain-specific data mining tools
  - Provide concrete, killer solution to specific problems
  - Feedback to build more powerful tools

Invisible Data Mining
- Build mining functions into daily information services
  - Web search engine (link analysis, authoritative pages, user profiles), adaptive web sites, etc.
  - Improvement of query processing: history + data
  - Making service smart and efficient
- Benefits from/to data mining research
  - Data mining research has produced many scalable, efficient, novel mining solutions
  - Applications feed new challenge problems to research

Towards Intelligent Tools for Data Mining
- Integration paves the way to intelligent mining
- Smart interface brings intelligence
  - Easy to use, understand and manipulate
- One picture may be worth 1,000 words
  - Visual and audio data mining
- Human-Centered Data Mining
- Towards self-tuning, self-managing, self-triggering data mining

Integrated Mining: A Booster for Intelligent Mining
- Integration paves the way to intelligent mining
- Data mining integrates with DBMS, DW, WebDB, etc.
  - Integration inherits the power of up-to-date information technology: querying, MD analysis, similarity search, etc.
  - Mining can be viewed as querying database knowledge
- Integration leads to standard interface/language, function/process standardization, utility, and reachability
- Efficiency and scalability bring intelligent mining to reality

One Picture May Be Worth 1000 Words!
- Visual Data Mining
  - Visualization of data
  - Visualization of data mining results
  - Visualization of data mining processes
  - Interactive data mining: visual classification
- One melody may be worth 1000 words too!
  - Audio data mining: turn data into music and melody!
  - Uses audio signals to indicate the patterns of data or the features of data mining results

[Figure: Visualization of data mining results in SAS Enterprise Miner: scatter plots]

[Figure: Visualization of association rules in MineSet 3.0]

[Figure: Visualization of a decision tree in MineSet 3.0]

[Figure: Visualization of Data Mining Processes by Clementine]

[Figure: Interactive Visual Mining by Perception-Based Classification (PBC)]

Human-Centered Data Mining
- Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
- Data mining should be an interactive process
  - User directs what to be mined
- Users must be provided with a set of primitives to communicate with the data mining system, using a data mining query language
- User should provide constraints on what to be mined
- System should use such constraints to guide the mining process (constraint-based mining or mining query optimization)

Constraint-Based Mining
- What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries
    - Find products sold together in Vancouver in Feb. '01.
  - Dimension/level constraints:
    - in relevance to region, price, brand, customer category.
  - Rule constraints:
    - small sales (price < $10) triggers big sales (sum > $200).
  - Interestingness constraints:
    - E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).

Rule Constraints: A Classification
- Succinctness
- Anti-monotonicity
- Monotonicity
- Convertible constraints
- Inconvertible constraints

Constraint-Based Clustering Analysis
- User-specified constraints: no cluster has fewer than 1000 gold customers
- Resource allocation (clustering) with obstacles

Towards Automated Data Mining?
- It is not realistic to automatically find all the knowledge in a large database
- Thus we promote human-centered, constraint-based mining
- However, to achieve genuinely intelligent data mining, the data mining process should be self-tuning, self-managing, self-triggering
- Functions should be developed to achieve such performance

Conclusions
- Data mining: A promising research frontier
- Data mining research has been striding forward greatly in the last decade
- However, data mining, as an industry, has not been flying as high as expected
- Much research and application exploration are needed
  - Web mining
  - Towards integrated data mining environments and tools
  - Towards intelligent, efficient, and scalable data mining methods

http://www.cs.sfu.ca/~han
http://db.cs.sfu.ca
Thank you !!!

References
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
- J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.