
Mining Knowledge in Data Explosion Age
廖宜恩, Department of Computer Science and Engineering, National Chung Hsing University (NCHU)

Outline
• Some News Reports
• Why Data Mining
• What is Data Mining
• Knowledge Discovery Process
• Data Mining Functionalities
• Data Mining Process
• Data Mining Tools
• Trends in Data Mining
• Some Research Results on Data Mining
• Conclusions

Some News Reports
• Time's Person of the Year for 2006
• 12 IT skills that employers can't say no to
• F.B.I. Data Mining Reached Beyond Initial Targets
• MIT names its top 10 emerging technologies for 2008
• Effect of US Recession on Data Mining Demand (July 2008)

Why Data Mining
• Data Explosion Problem
– Data in the world doubles every 20 months!
– NASA's Earth Observing System: 46 megabytes of data per second
• 4,000,000,000,000 bytes a day (4 TB/day; twenty 200 GB hard disks)
– FBI fingerprint image library:
• 200,000,000,000,000 bytes (200 TB)
– In-line image analysis for particle detection: 1 megabyte per second

Why Data Mining? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department and grocery stores
– Bank and credit card transactions
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Data Mining? Scientific Viewpoint
• Data is collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data

Mining Large Data Sets - Motivation
• There is often information "hidden" in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
• We are drowning in data, but starving for knowledge!
[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]

What is Data Mining?
• Data Mining (Knowledge Discovery in Databases, KDD):
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules

Knowledge Discovery Process
• Data mining is the core of the knowledge discovery process.
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– Curse of high dimensionality
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

Data Mining Functionalities
1. Concept description: characterization and discrimination
2. Classification
3. Association rule mining
4. Clustering
5. Sequence analysis
6. Anomaly detection

Concept description: Characterization and discrimination
• Concept description:
– Characterization: provides a concise summarization of the given collection of data
• Example: Describe general characteristics of graduate students in the NCHU database
– Discrimination: provides descriptions comparing two or more collections of data
• Example: Compare graduate and undergraduate students of NCHU using discriminant rules

Classification
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set, used to build the model, and a test set, used to validate it.

Decision Tree Classification Task
Training set (Learn Model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (Apply Model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Example of a Decision Tree
Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: decision tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes: NO
Refund = No:
  MarSt = Married: NO
  MarSt = Single, Divorced:
    TaxInc < 80K: NO
    TaxInc > 80K: YES

Apply Model to Test Data
Start from the root of the tree.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
[Figure: the decision tree applied to the test record]
Refund = Yes: NO
Refund = No:
  MarSt = Married: NO
  MarSt = Single, Divorced:
    TaxInc < 80K: NO
    TaxInc > 80K: YES

Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.

Association rule mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Association Rule Discovery: Application 1
• Marketing and Sales Promotion:
– Let the rule discovered be {Beer, …} → {Potato Chips}
– Potato Chips as consequent: can be used to determine what should be done to boost its sales.
– Beer in the antecedent: can be used to see which products would be affected if the store discontinues selling beer.
– Beer in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with beer to promote the sale of potato chips.

Clustering
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.

Illustrating Clustering
• Euclidean distance based clustering in 3-D space.
[Figure: intracluster distances are minimized; intercluster distances are maximized]

Clustering: Applications
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
• Document Clustering:
– Goal: find groups of documents that are similar to each other based on the important terms appearing in them.

Clustering of Microarray Data
[Figure: clustering of microarray data]

Sequence analysis
Sequence database: Customer
– Sequence: purchase history of a given customer
– Element (transaction): a set of items bought by a customer at time t
– Event (item): books, diary products, CDs, etc.
Sequence database: Web data
– Sequence: browsing activity of a particular Web visitor
– Element (transaction): a collection of files viewed by a Web visitor after a single mouse click
– Event (item): home page, index page, contact info, etc.
Sequence database: Event data
– Sequence: history of events generated by a given sensor
– Element (transaction): events triggered by a sensor at time t
– Event (item): types of alarms generated by sensors
Sequence database: Genome sequences
– Sequence: DNA sequence of a particular species
– Element (transaction): an element of the DNA sequence
– Event (item): bases A, T, G, C
[Figure: a sequence drawn as a series of elements (transactions), each containing events (items) such as E1, E2, E3, E4]

How does the human genome stack up?
Organism                             Genome Size (Bases)  Estimated Genes
Human (Homo sapiens)                 3 billion            25,000
Laboratory mouse (M. musculus)       2.6 billion          30,000
Mustard weed (A. thaliana)           100 million          25,000
Roundworm (C. elegans)               97 million           19,000
Fruit fly (D. melanogaster)          137 million          13,000
Yeast (S. cerevisiae)                12.1 million         6,000
Bacterium (E. coli)                  4.6 million          3,200
Human immunodeficiency virus (HIV)   9700                 9
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Why Is Finding a (15,4) Motif Difficult?
Uppercase letters mark the implanted motif instances in each sequence:
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Aligning two of the instances shows that they agree in only 8 of 15 positions:
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG

Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
– Credit card fraud detection
– Network intrusion detection
• Typical network traffic at the university level may exceed 100 million connections per day

Social Network Analysis (Link Mining)
• From the stage play Six Degrees of Separation: "I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation: that is the distance between people on this planet."
• Link: a relationship among data objects
• Link-Based Object Ranking (LBR): exploit the link structure of a graph to order or prioritize the set of objects within the graph
• Web information analysis methods such as PageRank and HITS are typical LBR approaches

Complex Network
• A complex network is a network (graph) that has certain non-trivial topological features that do not occur in simple networks.
• Such non-trivial features include: a heavy tail in the degree distribution; a high clustering coefficient; assortativity or disassortativity among vertices (a tendency of nodes to link to nodes of similar or dissimilar degree); and evidence of a hierarchical structure.
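Two of the features listed for complex networks, node degree and the clustering coefficient, can be computed directly from an adjacency list. The sketch below is only an illustration: the toy graph and the function names are hypothetical and do not come from the slides, and nodes with fewer than two neighbours are given a coefficient of 0 by convention.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        # Convention assumed here: no neighbour pairs means a coefficient of 0.
        return 0.0
    linked_pairs = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return linked_pairs / (k * (k - 1) / 2)


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


# Hypothetical undirected toy graph: a triangle a-b-c plus a pendant node d on c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

# Degree of each node; a heavy-tailed network would show a few very large values.
degrees = {node: len(nbrs) for node, nbrs in toy_graph.items()}

if __name__ == "__main__":
    print(degrees)
    print(average_clustering(toy_graph))
```

On a real network the same functions would be applied to the full adjacency structure; a high average value relative to a random graph with the same number of edges is what the slide calls a high clustering coefficient.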
Web Mining
• Web Usage Mining
• Web Structure Mining
• Web Content Mining
– Google has a precious asset: the Database of Intentions

Graph Mining
• Find frequent subgraphs in a given graph database
• Graphs are ubiquitous
– Web databases, XML databases
– Cheminformatics (chemical compounds)
– Bioinformatics (protein structures, pathways)
– Workflow analysis
– Social network analysis

Example (Cheminformatics)
[Figure: a graph dataset of three compounds (A), (B), (C) and the frequent patterns (1), (2) found with a minimum support of 2]

Data Mining Process
• Define the problem
• Build data mining database
• Explore data
• Prepare data for modeling
• Build model
• Evaluate model
• Deploy model

Examples of data mining in science & engineering
• Data mining in Biomedical Engineering
– "Robotic Arm Control Using Data Mining Techniques"

Data Mining Process: 1. Define the problem
• Control a robotic arm by means of EMG signals from the biceps and triceps muscles.
• Electromyography (EMG) is a medical technique for evaluating and recording physiologic properties of muscles at rest and while contracting.

Muscle contraction (H = high, L = low):
Move        Biceps  Triceps
Supination  H       H
Pronation   L       L
Flexion     H       L
Extension   L       H

Data Mining Process: 2. Build a data mining database
The dataset includes 80 records. There are two input variables: the biceps signal and the triceps signal. There is one output variable with four possible values: supination, pronation, flexion, and extension.

Data Mining Process: 3. Explore data
[Figure: scatter plot of the triceps signal by record number, labeled Flexion, Extension, Supination, Pronation]

Data Mining Process: 3. Explore data (cont.)
[Figure: scatter plot of the biceps signal by record number, labeled Flexion, Extension, Supination, Pronation]

Data Mining Process: 4. Prepare data for modeling
Build a dataset in the ARFF format:
  @relation EMG
  @attribute Triceps real
  @attribute Biceps real
  @attribute Move {Flexion,Extension,Pronation,Supination}
  @data
  13,31,Flexion
  14,30,Flexion
  10,31,Flexion
  13,29,Flexion
  ……

Data Mining Process: 5. Build Model
Classification methods:
• OneR
• Decision Tree
• Naïve Bayes
• K-Nearest Neighbors
• Neural Networks
• Linear Discriminant Analysis
• Support Vector Machines
• …

Data Mining Process: 5. Decision Tree
1. Find the attribute that best classifies the training data.
2. Use this attribute as the root of the decision tree.
3. Repeat the process for each subtree.
Resulting tree:
Triceps <= 37:
  Triceps <= 14: Flexion
  Triceps > 14: Pronation
Triceps > 37:
  Biceps <= 17: Extension
  Biceps > 17: Supination

Data Mining Process: 6. Evaluate Models
• Simple validation: training set and test set
• n-fold cross-validation
• Leave-one-out
Accuracy under 10-fold cross-validation:
OneR                 76%
Decision Tree        90%
Naïve Bayes          98%
1-Nearest Neighbor   100%
Neural Networks      100%

Data Mining Process: 7. Deploy Model
The neural network model was successfully implemented inside the robotic arm.
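The decision tree from step 5 can be read as nested threshold tests on the two EMG signals. The sketch below is an illustration under assumptions: the branch layout is inferred from the flattened figure (Triceps split at 37, then Triceps at 14 or Biceps at 17), the function name `classify_move` is hypothetical, and only the four records visible in the ARFF excerpt are used, since the rest of the 80-record dataset is not shown.

```python
def classify_move(triceps, biceps):
    """Apply the thresholds read off the decision-tree slide (layout assumed)."""
    if triceps <= 37:
        # Left subtree: a second split on the triceps signal.
        return "Flexion" if triceps <= 14 else "Pronation"
    # Right subtree: split on the biceps signal.
    return "Extension" if biceps <= 17 else "Supination"


# The four records shown in the ARFF excerpt: (Triceps, Biceps, Move).
records = [
    (13, 31, "Flexion"),
    (14, 30, "Flexion"),
    (10, 31, "Flexion"),
    (13, 29, "Flexion"),
]

predictions = [classify_move(t, b) for t, b, _ in records]

# Fraction of the visible records that the tree labels correctly.
accuracy = sum(
    pred == label for pred, (_, _, label) in zip(predictions, records)
) / len(records)

if __name__ == "__main__":
    print(predictions, accuracy)
```

Agreement on these four records says nothing about the reported cross-validation accuracy, which was measured on the full dataset; the point is only that the tree reduces to a handful of comparisons, which makes it easy to inspect.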
Data Mining Tools
• Commercial tools: SAS Enterprise Miner, IBM Intelligent Miner, SPSS Clementine
• Open source tools:
– WEKA: http://www.cs.waikato.ac.nz/ml/weka
– RapidMiner: http://rapid-i.com/index.php?lang=en
• Poll: Data mining/analytic tools you used in 2006
• Good portals for data mining: KDnuggets

Trends in Data Mining
• Application exploration
– Development of application-specific data mining systems
– Invisible data mining (mining as a built-in function)
• Scalable data mining methods
– Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns
• Integration of data mining with database systems, data warehouse systems, and Web database systems

Trends in Data Mining (cont.)
• Web mining
• Social network analysis
• Recommender systems:
– US$1 million prize for a 10% improvement on the Cinematch movie recommender system (Netflix)
– "If You Liked This, You're Sure to Love That" (New York Times, Nov. 21, 2008)

Trends in Data Mining (cont.)
• Spam filters:
– Cost of spam: "How much does spam cost you? Google will calculate"
– http://www.google.com/a/help/intl/en/security/roi_calculator.html
• Privacy protection and information security in data mining
• Bioinformatics

Some Research Results on DM
• Localization system for WLAN
• Rogue Access Point Detection System Based on Packet Analysis
• Library Recommender System Based on Personal Ontology Model

Localization system for WLAN
• "Enhancing the Accuracy of WLAN-based Location Determination Systems Using Predicted Orientation Information" (Information Sciences, Vol. 178, No. 4, Feb. 15, 2008, pp. 1049–1068)
• We proposed the Accumulated Orientation Strength (AOS) algorithm, based on a Bayesian classifier, to predict the orientation of a mobile user and thereby improve the accuracy of the localization system.
Rogue Access Point Detection System
• A paper entitled "Detecting Rogue Access Points Using Client-side Bottleneck Bandwidth Analysis" has been accepted for publication in Computers & Security.
• Managing APs on a university campus is a big challenge: NCHU operates a class B network with more than 50 departmental networks.
[Figure: Rogue Access Point Detection System: Intruders from the Air]
• Proposed a novel approach for detecting rogue access points by estimating client-side bottleneck bandwidth with the ACK packet pair technique.
• The system is implemented and tested in the Computer and Information Network Center at NCHU.
• Experimental results show that the accuracy is higher than 90%.

Library Recommender System Based on Personal Ontology Model (PORE)
• A paper entitled "PORE: A Personal Ontology Recommender System for Digital Library" has been accepted for publication in The Electronic Library.
• Proposed a personal ontology model for recommending books to library patrons based on keywords extracted from the books borrowed by the user.
• Collaborative filtering techniques are also incorporated into the PORE system.
• The PORE system is in service at the NCHU Library.

Conclusions
• We are drowning in data, but starving for knowledge!
• Data mining is the key to knowledge discovery.
• Applications of data mining techniques can be found in almost every research area of computer science and engineering.
• Even in a recession, data mining services are still in strong demand.

References
1. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006.
2. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Ed., Morgan Kaufmann, 2005.
3. Neil Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
4. http://www.chem-eng.utoronto.ca/~datamining/
5. Duncan Watts, Six Degrees (Chinese edition: 6個人的小世界), Locus Publishing (大塊文化), 2004.
6. Mark Buchanan, Nexus (Chinese edition: 連結), Commonwealth Publishing (天下文化), 2003.
7. http://www.kdnuggets.com/
