Download Overview - Texas Tech University

Data Analytics
Fall 2014
Rattikorn Hewett
Computer Science Department, Texas Tech University

Class Information
• Contact: Tel: 325-742-3527, E-mail: Rattikorn.Hewett@ttu.edu
• Course Materials: http://redwood.cs.ttu.edu/~hewett/teach.html

Acknowledgements
Materials in this course are adapted from various sources including our texts and data mining courses by:
• Prof. Jeff Ullman, Stanford University
• Prof. Chris Clifton, Purdue University
• Prof. Osmar Zaiane, University of Alberta

Texts
• Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann 2000
• Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by I. Witten and E. Frank, Morgan Kaufmann 1999

What you should get out of this course
• Concepts and techniques in data analytics, data mining and knowledge discovery in data (KDD)
• Understanding underlying processes and algorithms
• Experience with tools
• Exposure to complex applications and research in data analytics

Evaluation
• Projects/reports: 60%
• Paper presentation: 35%
• Class participation: 5%
There will be implementation projects and research papers to read, review and present.

Remarks
• Academic integrity: read the statement of Academic Conduct for Engineering students (see the syllabus)
• Citation: unless noted, work submitted should reflect your own capabilities
  - If unsure, acknowledge sources and help

Data Analytics: Overview

Outline: Part I
• Motivation
• What are data analytics, data mining and KDD?
• Why is it a new multidisciplinary subject?
• Research Community & Resources
• Where do we see data analytics being used?

Motivation
Advanced technology for data collection, generation and storage + computerization of business and government transactions and documents
→ Flood of undigested data
→ Can we automate this process?
→ Useful knowledge for decision-making

What we need
• New technologies that can intelligently and automatically assist humans in analyzing and transforming rapidly growing volumes of digital data into useful information
• KDD (Knowledge Discovery in Databases) [Fayyad et al., 96]

Why KDD?
• Manual analysis and interpretation is slow, expensive and highly subjective
• Databases are rapidly growing in size
  - Hundreds of millions of objects
  - Hundreds to thousands of attributes
• Need to scale up human analysis capabilities to cope with the data overload problem

Data mining, a KDD process
Diagram: Databases or data warehouse → Pre-processing → Selected cleaned data → Data Mining → Patterns → Post-processing → Useful information (with a refinement loop)
• Data mining is the core step of discovery in KDD
• Blindly applying data mining can lead to meaningless and invalid patterns
• Pre- and post-processing are essential to ensure that useful knowledge is derived from the data

Data Mining - Then
• The term (~1983) was used in the statistics community for "overusing data to draw invalid inferences"
• Bonferroni's theorem suggests that if there are too many possible conclusions, some will be true for purely statistical reasons with no physical validity
• Famous example: ESP test by J. B. Rhine at Duke in 1950 – declare students who can guess cards correctly 100% to have ESP
• Data mining has a negative implication

Data Mining - Now
• Extraction of "interesting" information (knowledge) from huge amounts of data
• Discovery of useful summaries of data (Ullman)
• Alternative terms: data analysis, pattern analysis, data dredging, data exploration, data understanding, data summarization, data abstraction, KDD (other places), etc.
• A misnomer?

Data Analytics
• A new buzzword in business intelligence
• Data leverage in specific applications or functional processes to enable context-specific insight that is actionable (by Gartner)
• Scientific process of transforming data into insight for making better decisions (by INFORMS)
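The Bonferroni caution and the ESP example under "Data Mining - Then" can be made concrete with a little arithmetic. The sketch below is only an illustration: the number of students, the number of cards and the 50% chance of a correct guess are hypothetical values chosen for this example, not figures from the original experiment.

```python
def chance_of_perfect_run(n_cards: int, p_correct: float) -> float:
    """Probability that one subject guesses every card correctly by pure luck."""
    return p_correct ** n_cards


def expected_lucky_subjects(n_subjects: int, n_cards: int, p_correct: float) -> float:
    """Expected number of subjects with a perfect score if nobody has ESP."""
    return n_subjects * chance_of_perfect_run(n_cards, p_correct)


def prob_at_least_one_lucky(n_subjects: int, n_cards: int, p_correct: float) -> float:
    """Probability that at least one subject scores 100% by chance alone."""
    p_nobody = (1.0 - chance_of_perfect_run(n_cards, p_correct)) ** n_subjects
    return 1.0 - p_nobody


# Hypothetical setup (assumed for illustration only).
N_SUBJECTS = 1000   # students tested
N_CARDS = 10        # cards each student must guess
P_CORRECT = 0.5     # chance of guessing one card correctly

expected = expected_lucky_subjects(N_SUBJECTS, N_CARDS, P_CORRECT)
at_least_one = prob_at_least_one_lucky(N_SUBJECTS, N_CARDS, P_CORRECT)

print(f"Expected perfect scorers by luck: {expected:.3f}")
print(f"Chance of at least one perfect scorer: {at_least_one:.3f}")
```

Under these assumed numbers about one student is expected to score 100% purely by luck, so declaring perfect scorers to have ESP would "discover" a pattern with no physical validity.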
In this class: Data Analytics ~ Data Science ~ Data Mining (used with Big Data) ~ KDD?

Data Mining & our daily life
• Groceries:
  - Beer -- Diapers (add Chips)
  - Wine -- Chocolate -- Flowers
• Internet: Google search
• E-commerce:
  - Amazon.com
  - Expedia.com

KDD Process
Diagram: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge
• Preprocessing may take 60% of effort

KDD Process (an iterative process)
Adapted from: Chris Clifton, Purdue University and U. Fayyad, et al. (1995), "From Knowledge Discovery to Data Mining: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
1. Data cleaning: remove noise & inconsistent data
2. Data integration: from multiple sources (stored in a data warehouse)
3. Data transformation and reduction: transform or consolidate data into forms appropriate for data mining, select relevant data
4. Data mining: extracts patterns
5. Pattern evaluation/interpretation: by using interestingness measures
6. Knowledge presentation: visualization and knowledge representation are used to present the mined knowledge to the user

Data Mining Algorithms
Diagram: Data Set → Data Mining Algorithm → Model (Pattern or Knowledge)
• Data set, many possible characteristics:
  - deterministic/stochastic relationships
  - static/dynamic processes
• Data mining algorithm, many different types, including:
  - classification algorithms (e.g., C4.5)
  - association algorithms (e.g., Apriori)
  - causal learning algorithms (e.g., PC)
• Model provides:
  - prediction/classification of unseen cases
  - understanding relationships among variables

Data Mining
Involves:
• Fitting models to observed data, as in Statistics
• Generalizing models that represent behaviors of the system generating the data, as in Machine Learning
• Finding patterns in observed data, as in Pattern Recognition

Interdisciplinary KDD
Diagram: KDD (pre-processing → data analytics/mining → post-processing) surrounded by contributing fields:
• Data infrastructures
• High performance computing: parallel and distributed computing
• Databases
• Information retrieval: indexing, inverted files
• Data warehousing
• Knowledge acquisition
• Statistics
• Pattern recognition
• Machine learning
• Expert systems
• Other AI areas
• Visualization, HCI, computer graphics

Big Data Analytics
Must cope with at least three issues:
• Very large amount of data
  - Not all data can be contained in main memory
• Scalability in size and complexity
  - Data analytics is "scalable" if run time grows linearly in proportion to size
• Efficiency
  - High performance algorithms are desired

Data Mining – in database context
Data Mining – A new discipline?
How is it different from existing fields?
• Statistics – hypothesis testing
• Machine learning – all data is contained in main memory
• Database systems – typically do not infer/generalize data
• Pattern recognition – hard for high-volume and high-dimensional data
• All – not explicitly concerned with efficiency and huge amounts of data

In a database context, data mining can be thought of as:
• Algorithms for executing very complex queries on non-main-memory data
• An advanced form of on-line analytical processing (OLAP)
  - OLAP supports summarization, consolidation, aggregation and viewing in multiple perspectives

Outline: Part I
• What are data analytics, data mining and KDD?
• Why is it a multidisciplinary subject?
• Why is it a new discipline?
• Research Community & Resources
• Where do we see data mining being used?

KDD Research Community
• Key founders:
  - Usama Fayyad, JPL (then Microsoft, now has his own company, Digimine)
  - Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
  - Rakesh Agrawal (IBM Research)
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

KDD Research Community (contd)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
• 1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
• More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

KDD Research Community (contd)
Other research communities in related fields:
• Statistics
• Machine Learning
• Clustering
• Visualization
• Databases
• Information Retrieval
• Distributed and Parallel Computation
Useful Resources
• KDNuggets (http://www.kdnuggets.com)
• Weka 3 – open source data mining software (http://www.cs.waikato.ac.nz/ml/weka/index.html)
• UCI machine learning repository (http://archive.ics.uci.edu/ml/)
• KDD archive (http://kdd.ics.uci.edu/)

Example Applications
• Marketing & Retailing
  - Cross reference of items: market-basket analysis to find associations of items bought to increase retail (e.g., diapers and beer → adding chips)
  - Purchase recommendation: customer profiling to advertise to most likely buyers (e.g., hot items, amazon.com)
  - Customer retention: from purchasing records (loyalty card and credit card transactions), detect changes in customer consumption to adjust price/quality
• Finance and investment
  - Credit and loan: use bank-loan records (of factors that may influence loan payment) to build a predictive model to decide whether a loan should be granted
  - Predict trends of stock investment: e.g., LBS Capital Management manages portfolios totaling $600 million since 1993
  - Identify potential money laundering & financial crimes from reports of large cash transactions: e.g., FAIS of U.S. Treas. Financial Crimes Enforcement Network

Example Applications
• Telecommunication
  - Detect fraud: e.g., use records on phone services (destination, time, duration) to detect patterns that deviate from the expected norm
  - Improve availability or promote sales of communication services: e.g., from communication traffic records, associate communication needs and events to avoid overload of communication facilities
• Manufacturing & Engineering
  - Construct control model for controlling manufacturing processes (e.g., semiconductor industries)
  - Inventory forecast – avoid overstock
  - Improve aviation safety, from FAA's pilot deviation database and NTSB's accident and incident database: describe types of human errors (e.g., mistakes, slips, others) that caused accidents; predict accident problems

Example Applications
• Science
  - Earth & Environmental Science: construct predictive model for lake inflows from solar activity and climate conditions
  - Bioinformatics: comparing genotypes of people with/without a condition allowed discovery of a set of genes that together account for many cases of diabetes
  - Astronomy: Skycat and Sloan Sky Survey – clustering sky objects by their radiation levels – distinguish galaxies, stars
• Web
  - Internet search (e.g., Google): find pages with matching contents, rank, and summarize content
  - E-commerce: IBM Surf-Aid analyzes web access logs to target customers, improve web organization or identify pages for advertisement
  - FIREFLY – music recommendation agents

Example Applications
• Sport & Entertainment
  - IBM's Advanced Scout: analyzes NBA game statistics to gain competitive advantage for NY Knicks and Miami Heat
  - Sharp Lab: uses data mining to summarize sport video
• Homeland Security
  - Intelligent analysis
  - Surveillance cameras – detect suspected individuals

A closer look

Outline: Part II
• Data Mining
  - Input/Output
  - Tasks & Functionalities
  - System Architecture & System Categories
• Mining the Data
  - Steps
  - Tools & Demos
• Challenges and Issues

Input: What kind of data to be mined?
• Forms:
  - Structured data: Relational (or Object-oriented or Object-relational) Databases, Data Warehouses, Transactional Databases
  - Semi-structured data: web pages, XML, html, other special purpose domain
  - Unstructured data: text, e-mail
• Types of media & content:
  - Multimedia: Image/Audio/Video Databases
  - Spatial: Maps, Geographic database
  - Temporal and Time series Database
  - WWW (Web pages, Web access logs)
  - Heterogeneous database: an interconnected set of different types of stand-alone databases
  - Legacy database: a group of heterogeneous databases created in the past

Examples
• A relational database: Relation: customer (Cust_ID, Name, Contact, Credit_info, ...)
• A multidimensional data cube used in data warehousing (figure labels: Date, Country, ...)
• A transactional database:

  Date/Time/Register   12/6 13:15 2   12/6 13:16 3
  Fish                 N              Y
  Turkey               Y              N
  Cranberries          Y              N
  Wine                 N              Y

Data Sources: Where are the data from?
• Public scientific databases
  - National laboratories and data centers (e.g., NOAA, human genome, NASA's EOS, DOD & Intelligence)
• Health-related service databases (e.g., benefits, medical analysis)
• Financial, commercial and business transactions (e.g., credit card transactions, loyalty cards, discount coupons, customer complaint calls)
• News group, e-mail, documents

Output: What are the mined outputs?
• Knowledge Types: (depends on data mining tasks)
  - Descriptions of general properties: summary reports, answers of complex queries
  - Patterns (or Models) of regularities - classification models (classifiers), categories or clusters of data
  - Pattern of irregularities
  - Sequences or trends of regularities
  - Inferences on available data: predictive models for predicting unseen cases
• Forms: (depends on data mining functions)
  - Texts or query languages
  - Mathematical models, e.g., Bayesian network, neural net or regression models
  - Symbolic models, e.g., rules (association rules, DNF forms), decision trees
  - Visual presentation

Examples
• Decision trees: [Figure: decision tree with nodes Income, debt and Credit history, and leaves H Risk, M Risk and L Risk]
• Rules: LHS → RHS
  - Color = yellow & shape = cylinder-like → fruit = banana
  - Turkey → Cranberries, with support 90% and confidence 80%
  - Event = Failed Midterm & Unfinished Project → Future Event = Drop or Fail the course

Examples (cont.)
• Visualization of file organization using ring visualization representation
• From NSF and Science Magazine Visualization Grand Challenge, First Prize in category illustration

Data Mining Tasks
• Discovery: (patterns in various granularities from databases)
  - Description: find human-interpretable patterns describing general properties of data
  - Prediction: find patterns that predict future behavior by using variables in the data to predict other unknown variable values
• Verification: find patterns that confirm user's hypothesis

How?
• Summarize
• Cluster
• Classify
• Identify sequences/links/dependencies
• Detect deviation

Data Mining Functionality
• Characterization: summarizes general features of objects in a target concept (or class or pattern to describe)
  - Concept description
• Discrimination: compares general features of objects between a target class and a contrasting class
  - Concept comparison

Data Mining Functionality (cont.)
• Association: studies the frequency of items occurring together in transaction databases
  - Ex: buys(x, beer) → buys(x, nuts)
• Prediction: predicts some unknown or missing values based on known data
  - Ex: forecast stock values based on company records, political climates and economy

Data Mining Functionality (cont.)
• Classification: describes data in a given class based on class features of known classes (labeled data)
  - Supervised learning
  - Ex: classify housing prices based on locations and conditions
• Clustering: groups data in classes (or categories or clusters) based on similarity of their features
  - Unsupervised learning
  - Min. inter-class similarity and max. intra-class similarity

Data Mining Functionality (cont.)
• Outlier analysis: identifies and explains exceptions (surprises)
• Time-series analysis: identifies trends and deviations; sequential patterns, similar sequences

System Architectures
Diagram (layers, top to bottom):
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning & data integration, filtering
• Databases, data warehouse
A knowledge base sits alongside the pattern evaluation and data mining engine layers.

System Categories
Data mining systems can be classified based on:
• Types of knowledge to be discovered
  - Summary, comparison, association, classification knowledge, deviation, trends
  - Knowledge can be at various levels of abstraction, e.g., year, quarter, month, date, time
• Types of data to be mined
  - Transaction data, time-series data, spatial data, text data, www data, heterogeneous/distributed data
• Types of data models and techniques used
  - Database-oriented
  - Machine learning models
  - Statistical models
  - Visualization models
• Types of application domains
  - Text mining systems
  - Web mining systems
  - Gene sequence analyzers
  - Multimedia mining systems
  - Micro array data analysis systems

Steps in mining the data
• Learning the application domain: relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining: summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Some Data Mining Tools & Systems
• C4.5, a decision tree learning system [Quinlan, 1994]
  - See5
• SOM, Self-organizing Maps [Kohonen, 1995]
• Neural Net with Back Propagation learning [Rumelhart, 89]
• CBA, Classifier Based on Association rule mining [Liu et al., 1998]
• SORCER, Second-Order Relation Compaction for Extraction of Rules [Hewett and Leuchner, 2002]
• Naïve Bayes Classifier (Microsoft)
• Tetrad, a Bayes net learning system (CMU)
• BNT, Bayes Net Toolbox (MIT)

Some Data Mining Suites
• DBMiner, IBM's DataQuest Group
• WEKA, Machine learning group at Waikato University
Many more can be found at www.kdnuggets.com
Let's see them in action ...

Issues in Data Mining
• User Interface issues
• Performance issues
• Data source issues
• Security and Social issues
• Mining Methodology issues

User Interface issues
• Visualization issues:
  - Understandability and interpretation of results
  - Information representation and rendering
• Interactivity
  - Manipulation of mined knowledge
  - Focus and refine tasks
  - Focus and refine results

Performance issues
• Efficiency and scalability of mining algorithms
  - Need at least linear time complexity algorithms or bounded computation
  - Sampling
  - Parallelism
  - Incremental

Data source issues
• Diversity of data types
  - Handling complex types of data
  - Is it possible to build a system that performs well on all kinds of data?
• Data collection
  - Many collect data for archive
  - Identify problems before mining them – can we use divide and conquer?

Security and Social issues
• Social impacts
  - Private/sensitive data are mined without consent
  - New implicit knowledge is disclosed (confidentiality, integrity)
  - Knowledge sharing
• Regulations
  - There is a need for data mining policy to protect data security, integrity and privacy

Mining Methodology issues
• Mining different types of knowledge from diverse data types (e.g., bio, stream, Web)
• Incorporation of background knowledge
• Handling noise and missing data
• Performance: efficiency, effectiveness and scalability
• Parallel, distributed and incremental mining methods
• Evaluation: the interestingness problem
• Knowledge fusion: integration of discovered knowledge with existing knowledge

The Interestingness Problems
• Is all that is discovered "interesting"? No.
• How do we measure "interestingness"? A pattern is "interesting" if it is:
  - Easy to understand by humans
  - Valid on test data with some degree of certainty
  - Potentially useful (for users)
  - Novel, or validates the user's hypothesis

Measures of "interestingness"
• Objective: uses statistics based on frequency of occurrences (e.g., regular); might miss important rare events
• Subjective: user's beliefs

The Interestingness Problems (cont)
• Can the data mining system find all interesting patterns? → completeness
  - ??? Read text and tell me in next class
• Can the data mining system find only interesting patterns? → optimality
  - Yes, in some. E.g., mining query optimization

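The "objective" interestingness measures mentioned above are based on frequency of occurrence. For association rules such as Turkey → Cranberries, two common frequency-based measures are support (the fraction of transactions containing every item in the rule) and confidence (among transactions containing the left-hand side, the fraction that also contain the right-hand side). The sketch below computes both; the basket list and function names are hypothetical and chosen only for illustration, and it does not reproduce the 90%/80% figures quoted in the slides.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Among transactions containing `lhs`, the fraction also containing `rhs`."""
    lhs_items = frozenset(lhs)
    lhs_support = support(transactions, lhs_items)
    if lhs_support == 0:
        return 0.0  # rule never applies, so confidence is taken as 0
    return support(transactions, lhs_items | frozenset(rhs)) / lhs_support


# Hypothetical market baskets (assumed data, for illustration only).
baskets = [
    frozenset({"turkey", "cranberries", "wine"}),
    frozenset({"turkey", "cranberries"}),
    frozenset({"turkey", "fish"}),
    frozenset({"fish", "wine"}),
    frozenset({"turkey", "cranberries", "fish"}),
]

rule_support = support(baskets, {"turkey", "cranberries"})
rule_confidence = confidence(baskets, {"turkey"}, {"cranberries"})

print(f"support(Turkey -> Cranberries)    = {rule_support:.2f}")
print(f"confidence(Turkey -> Cranberries) = {rule_confidence:.2f}")
```

In this toy data three of five baskets contain both items and four contain turkey. A rule seen in only a handful of baskets would score low on support even if it mattered, which is the "might miss important rare events" caveat noted for objective measures.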