Data Analytics
Fall 2014
Rattikorn Hewett
Computer Science Department
Texas Tech University

Class Information
• Contact:
  - Tel: 325-742-3527
  - E-mail: Rattikorn.Hewett@ttu.edu
• Course Materials:
  - http://redwood.cs.ttu.edu/~hewett/teach.html

Acknowledgements
Materials in this course are adapted from various sources, including our texts and data mining courses by:
• Prof. Jeff Ullman, Stanford University
• Prof. Chris Clifton, Purdue University
• Prof. Osmar Zaiane, University of Alberta

Texts
• Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann, 2000
• Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by I. Witten and E. Frank, Morgan Kaufmann, 1999

What you should get out of this course
• Concepts and techniques in data analytics, data mining and knowledge discovery in data (KDD)
• Understanding of the underlying processes and algorithms
• Experience with tools
• Exposure to complex applications and research in data analytics

Evaluation
• Projects/reports: 60%
• Paper presentation: 35%
• Class participation: 5%
There will be implementation projects, and research papers to read, review and present.

Remarks
• Academic integrity: read the statement of Academic Conduct for Engineering students (see the syllabus)
• Citation: unless noted, work submitted should reflect your own capabilities
  - If unsure, acknowledge sources and help

Data Analytics: Overview

Outline: Part I
• What are data analytics, data mining and KDD?
• Why is it a new multidisciplinary subject?
• Research Community & Resources
• Where do we see data analytics being used?

Motivation
Advanced technology for data collection, generation and storage
+ Computerization of business and government transactions and documents
-> Flood of undigested data
-> Can we automate this process?
-> Useful knowledge for decision-making

What we need
New technologies that can intelligently and automatically assist humans in analyzing and transforming rapidly growing volumes of digital data into useful information
-> KDD (Knowledge Discovery in Databases) [Fayyad et al., 96]

Why KDD?
• Manual analysis and interpretation
  - Slow, expensive and highly subjective
• Databases are rapidly growing in size
  - Hundreds of millions of objects
  - Hundreds to thousands of attributes
• Need to scale up human analysis capabilities to cope with the data overload problem

Data mining, a KDD process
Databases or data warehouse -> Pre-processing -> Selected, cleaned data -> Data Mining -> Patterns -> Post-processing -> Useful information (with refinement)
• Data mining is the core step of discovery in KDD
• Blindly applying data mining can lead to meaningless and invalid patterns
• Pre- and post-processing are essential to ensure that useful knowledge is derived from the data

Data Mining - Then
• Data mining had a negative connotation
• The term was used (~1983) in the statistics community for "overusing data to draw invalid inferences"
• Bonferroni's theorem suggests that if there are too many possible conclusions, some will be true for purely statistical reasons, with no physical validity
• Famous example: the ESP test by J. B. Rhine at Duke in 1950, which declared students who guessed cards 100% correctly to have ESP

Data Mining - Now
• Extraction of "interesting" information (knowledge) from huge amounts of data
• Discovery of useful summaries of data (Ullman)
• Alternative terms:
  - Data analysis, pattern analysis, data dredging, data exploration, data understanding, data summarization, data abstraction, KDD (other places), etc.
• A misnomer?

Data Analytics
• A new buzzword in business intelligence
  - Data leverage in specific applications or functional processes to enable context-specific insight that is actionable (by Gartner)
  - Scientific process of transforming data into insight for making better decisions (by INFORMS)
• Used with Big Data
• In this class: Data Analytics ~ Data Science ~ Data Mining ~ KDD?
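
The Bonferroni point from the "Data Mining - Then" slide can be checked with a little arithmetic. The sketch below assumes a hypothetical setup that is not given in the slides: 1,000 students each guess the color (red or blue) of 10 cards purely by chance. The function names are illustrative only.

```python
def prob_all_correct(num_cards: int, num_choices: int = 2) -> float:
    """Chance that one purely random guesser gets every card right."""
    return (1.0 / num_choices) ** num_cards


def expected_perfect_guessers(num_students: int, num_cards: int) -> float:
    """Expected number of students with a perfect score, by chance alone."""
    return num_students * prob_all_correct(num_cards)


def prob_at_least_one_perfect(num_students: int, num_cards: int) -> float:
    """Probability that at least one student scores 100% by chance."""
    p = prob_all_correct(num_cards)
    return 1.0 - (1.0 - p) ** num_students


# Assumed numbers (hypothetical): 1,000 students, 10 two-color cards each.
print(expected_perfect_guessers(1000, 10))
print(prob_at_least_one_perfect(1000, 10))
```

Under these assumed numbers, about one student is expected to score 100% with no ESP at all, and the chance of seeing at least one such student is roughly 62%. A "discovery" of this kind is what the slide means by a conclusion that is true for purely statistical reasons.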

Data Mining & our daily life
• Groceries:
  - Beer -- Diapers (add Chips)
  - Wine -- Chocolate -- Flowers
• Internet: Google search
• E-commerce:
  - Amazon.com
  - Expedia.com

Outline: Part I
• What are data analytics, data mining and KDD?
• Why is it a new multidisciplinary subject?
• Research Community & Resources
• Where do we see data mining being used?

KDD Process
Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge
• Iterative process
• Preprocessing may take 60% of effort
Adapted from: Chris Clifton, Purdue University and U. Fayyad, et al. (1996), "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

KDD Process
1. Data cleaning: remove noise & inconsistent data
2. Data integration: from multiple sources (the result is stored in a Data Warehouse)
3. Data transformation and reduction: transform or consolidate data into forms appropriate for data mining, select relevant data
4. Data mining: extracts patterns
5. Pattern evaluation/interpretation: by using interestingness measures
6. Knowledge presentation: visualization and knowledge representation are used to present the mined knowledge to the user
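
The six steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two toy "stores", the record layout, and every function name are hypothetical, and pair counting stands in for a real mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy sources; real data would come from databases or files.
STORE_A = [{"items": ["Beer", "Diapers"]}, {"items": []},
           {"items": ["beer", "diapers", "chips"]}]
STORE_B = [{"items": ["Wine", "Chocolate"]}, {"items": None},
           {"items": ["Beer ", "Diapers"]}]


def clean(records):
    """Step 1: drop records whose item list is missing or empty."""
    return [r for r in records if r.get("items")]


def integrate(*sources):
    """Step 2: merge records from multiple sources into one list."""
    return [r for source in sources for r in source]


def transform(records):
    """Step 3: keep only the relevant field, normalized to lowercase sets."""
    return [frozenset(i.strip().lower() for i in r["items"]) for r in records]


def mine(baskets):
    """Step 4: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """Step 5: keep pairs whose support reaches the threshold."""
    return {pair: n / n_baskets for pair, n in pair_counts.items()
            if n / n_baskets >= min_support}


def present(patterns):
    """Step 6: render the surviving patterns as readable text lines."""
    return [f"{a} & {b}: support {s:.2f}"
            for (a, b), s in sorted(patterns.items())]


def run_kdd(*sources, min_support=0.5):
    baskets = transform(integrate(*(clean(s) for s in sources)))
    return present(evaluate(mine(baskets), len(baskets), min_support))


print(run_kdd(STORE_A, STORE_B))
```

Each stage takes the previous stage's output, so any step can be revisited and rerun, which reflects the iterative nature of the process.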

Data Mining Algorithms
Data Set -> Data Mining Algorithm -> Model (Pattern or Knowledge)
Data Set: many possible characteristics:
• deterministic/stochastic relationships
• static/dynamic processes
Data Mining Algorithm: many different types, including:
• classification algorithms (e.g., C4.5)
• association algorithms (e.g., Apriori)
• causal learning algorithms (e.g., PC)
Model (Pattern or Knowledge) provides:
• prediction/classification of unseen cases
• understanding relationships among variables

Data Mining
Involves:
• Fitting models to observed data, as in Statistics
• Generalizing models that represent behaviors of the system generating the data, as in Machine Learning
• Finding patterns in observed data, as in Pattern Recognition

Interdisciplinary KDD
[Figure: KDD / Data Analytics / Big Data Analytics shown among related fields]
• Pre-processing
• Data Infrastructures
  - High Performance Computing: Parallel and Distributed Computing
  - Databases
  - Information Retrieval: Indexing, Inverted files
  - Data Warehousing
• Knowledge Acquisition
• Statistics
• Pattern Recognition
• Machine Learning
• Expert Systems
• Other AI areas
• Visualization, HCI, Computer Graphics
• Post-processing

Data Analytics/Mining
Must cope with at least three issues:
• Very large amount of data
  - Not all data can be contained in main memory
• Scalability in size and complexity
  - "Scalable" if run time grows linearly in proportion to size
• Efficiency
  - High performance algorithms are desired
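
To illustrate the first two issues, here is a minimal sketch of a one-pass summary that never holds the whole data set in main memory and whose run time grows linearly with the number of values. The generator `synthetic_readings` is a hypothetical stand-in for reading records from disk or a network stream.

```python
from typing import Iterable, Iterator


def synthetic_readings(n: int) -> Iterator[float]:
    """Hypothetical data source: yields values one at a time."""
    for i in range(n):
        yield float(i % 10)


def streaming_summary(values: Iterable[float]) -> dict:
    """Single pass, constant memory: count, running mean, min and max."""
    count = 0
    mean = 0.0
    lowest = None
    highest = None
    for x in values:
        count += 1
        mean += (x - mean) / count  # incremental mean update
        lowest = x if lowest is None or x < lowest else lowest
        highest = x if highest is None or x > highest else highest
    return {"count": count, "mean": mean if count else None,
            "min": lowest, "max": highest}


print(streaming_summary(synthetic_readings(1000)))
```

Because the function only needs an iterable, the same code works whether the values come from a list, a file read line by line, or a database cursor.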

Data Mining - A new discipline?
How is it different from existing fields?
• Statistics: hypothesis testing
• Machine learning: all data is assumed to fit in main memory
• Database systems: typically do not infer/generalize data
• Pattern Recognition: hard for high-volume and high-dimensional data
• All: not explicitly concerned with efficiency and huge amounts of data

Data Mining - in database context
Can be thought of as:
• Algorithms for executing very complex queries on non-main-memory data
• An advanced on-line analytical processing (OLAP)
  - OLAP supports summarization, consolidation, aggregation and viewing from multiple perspectives
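
As a rough illustration of the OLAP-style summarization and "viewing from multiple perspectives" mentioned above, the sketch below aggregates a toy fact table along different sets of dimensions. The table contents, the dimension names and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions and one measure.
FACTS = [
    {"country": "US", "quarter": "Q1", "item": "wine", "sales": 100},
    {"country": "US", "quarter": "Q2", "item": "wine", "sales": 150},
    {"country": "US", "quarter": "Q1", "item": "fish", "sales": 80},
    {"country": "CA", "quarter": "Q1", "item": "wine", "sales": 60},
    {"country": "CA", "quarter": "Q2", "item": "fish", "sales": 40},
]


def roll_up(facts, dims, measure="sales"):
    """Sum the measure, grouped by the chosen dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)


print(roll_up(FACTS, ["country"]))          # one perspective
print(roll_up(FACTS, ["quarter", "item"]))  # another perspective
print(roll_up(FACTS, []))                   # grand total
```

Choosing fewer dimensions gives a coarser summary; choosing more gives a finer one. That is the basic idea behind consolidating the same data at several levels.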

Outline: Part I
• What are data analytics, data mining and KDD?
• Why is it a multidisciplinary subject?
• Why is it a new discipline?
• Research Community & Resources
• Where do we see data mining being used?

KDD Research Community
• Key founders:
  - Usama Fayyad, JPL (then Microsoft, now has his own company, Digimine)
  - Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
  - Rakesh Agrawal (IBM Research)
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

KDD Research Community (contd)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
• 1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
• More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

KDD Research Community (contd)
Other research communities in related fields:
• Statistics
• Machine Learning
• Clustering
• Visualization
• Databases
• Information Retrieval
• Distributed and Parallel Computation

Useful Resources
• KDNuggets (http://www.kdnuggets.com)
• Weka 3, open source data mining software (http://www.cs.waikato.ac.nz/ml/weka/index.html)
• UCI machine learning repository (http://archive.ics.uci.edu/ml/)
• KDD archive (http://kdd.ics.uci.edu/)

Outline: Part I
• What is data mining and KDD?
• Why is it a multidisciplinary subject?
• Why is it a new discipline?
• Research Community & Resources
• Where do we see data mining being used?

Example Applications
Marketing & Retailing
• Cross reference of items
  - Market-basket analysis to find associations of items bought, to increase retail sales (e.g., diapers and beer -> adding chips)
• Purchase recommendation
  - Customer profiling to advertise to most likely buyers (e.g., hot items, amazon.com)
• Customer retention
  - From purchasing records (loyalty card and credit card transactions), detect changes in customer consumption to adjust price/quality

Example Applications
Finance and investment
• Credit and Loan
  - Use bank-loan records (of factors that may influence loan payment) to build a predictive model to decide whether a loan should be granted
• Predict trends of stock investment
  - E.g., LBS Capital Management has managed portfolios totaling $600 million since 1993
• Identify potential money laundering & financial crimes from reports of large cash transactions
  - E.g., FAIS of U.S. Treas. Financial Crimes Enforcement Network

Example Applications
Telecommunication
• Detect fraud
  - E.g., use records on phone services (destination, time, duration) to detect patterns that deviate from the expected norm
• Improve availability or promote sales of communication services
  - E.g., from communication traffic records, associate communication needs and events to avoid overload of communication facilities

Example Applications
Manufacturing & Engineering
• Construct control models for controlling manufacturing processes (e.g., semi-conductor industries)
• Inventory forecast: avoid overstock
• Improve aviation safety, from FAA's pilot deviation database and NTSB's accident and incident database
  - Describe types of human errors (e.g., mistakes, slips, others) that caused accidents
  - Predict accident problems

Example Applications
Science
• Earth & Environmental Science
  - Construct predictive model for lake inflows from solar activity and climate conditions
• Bioinformatics
  - Comparing genotypes of people with/without a condition allowed discovery of a set of genes that together account for many cases of diabetes
• Astronomy
  - Skycat and Sloan Sky Survey: clustering sky objects by their radiation levels to distinguish galaxies and stars

Example Applications
Internet
• Web Search (e.g., Google)
  - Find pages with matching contents, rank, and summarize content
• E-commerce
  - IBM Surf-Aid analyzes web access logs to target customers, improve web organization or identify pages for advertisement
  - FIREFLY: music recommendation agents

Example Applications
Sport & Entertainment
• IBM's Advanced Scout: analyzes NBA game statistics to gain competitive advantage for NY Knicks and Miami Heat
• Sharp Lab: uses data mining to summarize sport video
Homeland Security
• Intelligent analysis
• Surveillance cameras: detect suspected individuals

A closer look

Outline: Part II
• Data Mining
  - Input/Output
  - Tasks & Functionalities
  - System Architecture & System Categories
• Mining the Data
  - Steps
  - Tools & Demos
• Challenges and Issues

Input: What kind of data to be mined?
Forms:
• Structured data: Relational (or Object-oriented or Object-relational) Databases, Data Warehouses, Transactional Databases
• Semi-structured data: web pages, XML, HTML, other special-purpose domains
• Unstructured data: text, e-mail

Input: What kind of data to be mined?
Types of media & content:
• Multimedia: Image/Audio/Video
• Spatial Databases: Maps, Geographic database
• Temporal and Time series Database
• WWW (Web pages, Web access logs)
• Heterogeneous database: an interconnected set of different types of stand-alone databases
• Legacy database: a group of heterogeneous databases created in the past

Examples
• A relational database: Relation: customer (Cust_ID, Name, Contact, Credit_info)
• A transactional database:

  Date/Time/Register   Fish   Turkey   Cranberries   Wine
  12/6 13:15 2         N      Y        Y             N
  12/6 13:16 3         Y      N        N             Y
  ...                  ...    ...      ...           ...

• A multidimensional data cube used in data warehousing (dimensions shown include Date and Country)

Data Sources: Where are the data from?
• Public scientific databases
  - National laboratories and data centers (e.g., NOAA, human genome, NASA's EOS, DOD & Intelligence)
• Health-related service databases (e.g., benefits, medical analysis)
• Financial, commercial and business transactions (e.g., credit card transactions, loyalty cards, discount coupons, customer complaint calls)
• News groups, e-mail, documents

Output: What are the mined outputs?
Knowledge Types: (depends on data mining tasks)
• Descriptions of general properties
  - Summary reports
  - Answers of complex queries
• Patterns (or Models) of regularities: Classification models (classifiers), Categories or Clusters of data
• Patterns of irregularities
• Sequences or trends of regularities
• Inferences on available data
  - Predictive models for predicting unseen cases

Output: What are the mined outputs?
Forms: (depends on data mining functions)
• Texts or query languages
• Mathematical models, e.g.,
  - Neural net or regression models
• Symbolic models, e.g.,
  - Rules: association rules, DNF forms
  - Decision Trees
  - Bayesian network
• Visual presentation

Examples
• Decision trees:
  [Figure: decision tree splitting on Income (L/M/H), Debt and Credit history (U/Bad/Good), with leaves H Risk, M Risk and L Risk]
• Rules: LHS -> RHS
  - Color = yellow & shape = cylinder-like -> fruit = banana
  - Turkey -> Cranberries, with support 90% and confidence 80%
  - Event = Failed Midterm & Unfinished Project -> Future Event = Drop or Fail the course
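
As a hedged illustration of how support and confidence for a rule such as Turkey -> Cranberries are commonly computed, the sketch below uses one widespread convention: support is the fraction of transactions containing both sides of the rule, and confidence is that fraction divided by the fraction containing the left-hand side. The transactions are hypothetical (the first two mirror the transactional table shown earlier), and the slide may be using a different convention for support, since under the definitions assumed here confidence is never lower than support.

```python
# Hypothetical baskets; the first two mirror the transactional example table.
TRANSACTIONS = [
    {"turkey", "cranberries"},
    {"fish", "wine"},
    {"turkey", "cranberries", "wine"},
    {"turkey"},
    {"turkey", "cranberries"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction also containing rhs."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # rule never applies; report zero rather than divide by 0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


print(support(TRANSACTIONS, {"turkey", "cranberries"}))
print(confidence(TRANSACTIONS, {"turkey"}, {"cranberries"}))
```

On this toy data, three of five baskets contain both items (support 60%), and three of the four turkey baskets also contain cranberries (confidence 75%).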

Examples (cont.)
• Visualization of file organization using ring visualization representation
  - From NSF and Science Magazine Visualization Grand Challenge
  - First Prize in category illustration

Outline: Part II
• Data Mining
  - Input/Output
  - Tasks & Functionalities
  - System Architecture & System Categories
• Mining the Data
  - Steps
  - Tools & Demos
• Challenges and Issues

Data Mining Tasks
• Description: find human-interpretable patterns describing general properties of data
• Prediction: find patterns that predict future behavior by using variables in the data to predict other unknown variable values

How?
• Discovery: (patterns in various granularities from databases)
  - Summarize
  - Cluster
  - Classify
  - Identify sequences/links/dependencies
  - Detect deviation
• Verification: find patterns that confirm the user's hypothesis

Data Mining Functionality
• Characterization:
  Summarizes general features of objects in a target concept (or class or pattern to describe)
  - Concept description
• Discrimination:
  Compares general features of objects between a target class and a contrasting class
  - Concept comparison

Data Mining Functionality (cont.)
• Association:
  Studies the frequency of items occurring together in transaction databases
  Ex: buys(x, beer) -> buys(x, nuts)
• Prediction:
  Predicts some unknown or missing values based on known data
  Ex: Forecast stock values based on company records, political climates and economy

Data Mining Functionality (cont.)
• Classification:
  Describes data in a given class based on class features of known classes (labeled data)
  - Supervised learning
  Ex: Classify housing prices based on locations and conditions
• Clustering:
  Groups data in classes (or categories or clusters) based on similarity of their features
  - Unsupervised learning
  * Min. inter-class similarity and Max. intra-class similarity

Data Mining Functionality (cont.)
• Outlier analysis:
  Identifies and explains exceptions (surprises)
• Time-series analysis:
  Identifies trends and deviations; sequential patterns, similar sequences
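
To make the clustering idea above concrete (group unlabeled data so that members of a group are similar and groups differ from each other), here is a minimal one-dimensional k-means sketch. k-means is just one clustering technique chosen for illustration; it is not named in the slides, and the data points and starting centroids are hypothetical.

```python
def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans_1d(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # An empty cluster keeps its old centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, initial_centroids=[1.0, 2.0])
print(centroids, clusters)
```

No class labels are used anywhere, which is what makes this unsupervised; a classifier would instead be trained on points whose classes are already known.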

Outline: Part II
• Data Mining
  - Input/Output
  - Tasks & Functionalities
  - System Architecture & System Categories
• Mining the Data
  - Steps
  - Tools & Demos
• Challenges and Issues

System Architectures
[Figure: layered system architecture, listed top to bottom]
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning & data integration; filtering
• Databases; Data Warehouse
• Knowledge base (side component)

System Categories
Data Mining systems can be classified based on:
• Types of knowledge to be discovered
• Types of data to be mined
• Types of techniques applied
• Types of application domains

System Categories
Data Mining systems can be classified based on:
• Types of knowledge to be discovered
  - Summary, comparison, association, classification knowledge, deviation, trends
  - Knowledge can be at various levels of abstraction, e.g., year, quarter, month, date, time

System Categories
Data Mining systems can be classified based on:
• Types of data to be mined
  - Transaction data, time-series data, spatial data, text data, WWW data, heterogeneous/distributed data

System Categories
Data Mining systems can be classified based on:
• Types of data models and techniques used
  - Database-oriented
  - Machine learning models
  - Statistical models
  - Visualization models

System Categories
Data Mining systems can be classified based on:
• Types of application domains
  - Text mining systems
  - Web mining systems
  - Gene sequence analyzers
  - Multimedia mining systems
  - Micro array data analysis systems

Outline: Part II
• Data Mining
  - Input/Output
  - Tasks & Functionalities
  - System Architecture & System Categories
• Mining the Data
  - Steps
  - Tools & Demos
• Challenges and Issues

Steps in mining the data
• Learning the application domain
  - Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation
  - Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Some Data Mining Tools & Systems
• C4.5, a decision tree learning system [Quinlan, 1994] -> See5
• SOM, Self-organizing Maps [Kohonen, 1995]
• Neural Net with Back Propagation learning [Rumelhart, 89]
• CBA, Classifier Based on Association rule mining [Liu et al., 1998]
• SORCER, Second-Order Relation Compaction for Extraction of Rules [Hewett and Leuchner, 2002]
• Naïve Bayes Classifier (Microsoft)
• Tetrad, a Bayes net learning system (CMU)
• BNT, Bayes Net Toolbox (MIT)

Some Data Mining Suites
• DBMiner, IBM's DataQuest Group
• WEKA, Machine learning group at Waikato University
Many more can be found at www.kdnuggets.com
Let's see them in action ...

Outline: Part II
• Data Mining
  - Input/Output
  - Tasks & Functionalities
  - System Architecture & System Categories
• Mining the Data
  - Steps
  - Tools & Demos
• Challenges and Issues

Issues in Data Mining
• User Interface issues
• Performance issues
• Data source issues
• Security and Social issues
• Mining Methodology issues

User Interface issues
• Visualization issues:
  - Understandability and interpretation of results
  - Information representation and rendering
• Interactivity
  - Manipulation of mined knowledge
  - Focus and refine tasks
  - Focus and refine results

Performance issues
• Efficiency and scalability of mining algorithms
  - Need algorithms with at most linear time complexity, or bounded computation
  - Sampling
• Parallelism
  - Incremental: can we use divide and conquer?

Data source issues
• Diversity of data types
  - Handling complex types of data
  - Is it possible to build a system that performs well on all kinds of data?
• Data Collection
  - Many collect data for archive
  - Identify problems before mining them
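
The "Sampling" bullet under performance issues can be illustrated with reservoir sampling, a standard way to draw a fixed-size uniform sample in one pass over data of unknown length. The slides do not name a sampling method, so this choice, along with the function name and parameters, is an assumption made for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream, in one pass."""
    rng = random.Random(seed)  # local generator; seed makes runs repeatable
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # replace with probability k / (i + 1)
    return sample


# Sample 5 records from a hypothetical stream of 10,000 record ids.
print(reservoir_sample(range(10_000), k=5, seed=42))
```

Memory use depends only on k, not on the size of the stream, so a mining algorithm that is too slow on the full data can be run on the sample instead, at the cost of possibly missing rare patterns.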

Security and Social issues
• Social Impacts
  - Private/sensitive data are mined without consent
  - New implicit knowledge is disclosed (confidentiality, integrity)
  - Knowledge sharing
• Regulations
  - There is a need for data mining policy to protect data security, integrity and privacy

Mining Methodology issues
• Mining different types of knowledge from diverse data types (e.g., bio, stream, Web)
• Incorporation of background knowledge
• Handling noise and missing data
• Performance: efficiency, effectiveness and scalability
• Parallel, distributed and incremental mining methods
• Evaluation: the interestingness problem
• Knowledge fusion: integration of discovered knowledge with existing knowledge

The Interestingness Problems
• Is all that is discovered "interesting"?
  No.
• How do we measure "interestingness"?

Measures of "interestingness"
A pattern is "interesting" if it is:
• Easy to understand by humans
• Valid on test data with some degree of certainty
• Potentially useful (for users)
• Novel, or validates the user's hypothesis
Measures:
• Objective: uses statistics based on frequency of occurrence (e.g., regularity); might miss important rare events
• Subjective: based on the user's beliefs

The Interestingness Problems (cont)
• Can the data mining system find all interesting patterns? -> completeness
  ??? Read the text and tell me in the next class
• Can the data mining system find only interesting patterns? -> optimality
  Yes, in some cases.
  E.g., mining query optimization