Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Data Mining:
Concepts and Techniques
— Chapter 1 —
Richong Zhang
Office: New Main Building, G521
Email:zhangrc@act.buaa.edu.cn
This slide is made based on the slides provided by Jiawei Han,
Micheline Kamber, and Jian Pei. ©2012 Han, Kamber & Pei.
1
Course Information
Office hour: Thursday 14:00-16:00
Course web site: http://act.buaa.edu.cn/zhangrc/
Course Organization
• The first part of the course will be an introduction by the instructor
• The second part of the course will have students presenting relevant
research papers (recent three year papers from ICDM and KDD or review
papers from reputable journals, such as ACM Trans on Knowledge
Discovery from Data or IEEE Trans on Knowledge and Data Engineering)
• The final part of the course will be presenting their research projects on
topics of data mining.
Workload and Assessment:
•
•
•
•
Presentation of papers on a topic of data mining (with one or two-page
handout/summary): 20%
A comprehensive survey (written report, 6 pages, IEEE double column) 20%
Class participation: 10%
Term-long research project which is considered as the take-home final exam for
the course and consists of two components: (a) project presentation: 20% and (b)
written project report (written report, 6 pages, IEEE double column): 30%.
This report is potentially to be published in a reputable conference/journal.
April 28, 2017
Data Mining: Concepts and Techniques
2
Course Information
Important Dates:
Nov. 4th: Choice of at least three papers for the comprehensive survey,
due by 12:00 noon.
Nov. 15th: Two-pages project proposal due by 12:00 noon.
Jan. 11st: Written project report due by 12:00 noon.
April 28, 2017
Data Mining: Concepts and Techniques
3
Coverage (Chapters 1-10, 3rd Ed.)
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
Introduction
Getting to Know Your Data
Data Preprocessing
Data Warehouse and OLAP Technology: An Introduction
Advanced Data Cube Technology
Mining Frequent Patterns & Association: Basic Concepts
Mining Frequent Patterns & Association: Advanced Methods
Classification: Basic Concepts
Classification: Advanced Methods
Cluster Analysis: Basic Concepts
Data Mining:
Cluster Analysis: Advanced Methods
Concepts and
Techniques, 3rd ed.
Jiawei Han, Micheline
Kamber and Jian Pei
Available at www.chinapub.com or
http://product.chinapub.com/3683062
4
More Advanced Topics
Mining data streams, time-series, and sequence data
(Temporal/Spatial Data Mining)
Mining graph data
Mining social and information networks
Mining object, spatial, multimedia, text and Web data
Mining complex data objects
Spatial and spatiotemporal data mining
Multimedia data mining
Text and Web mining
Additional (often current) themes if time permits
many real-world applications of data mining
5
A comprehensive survey on a focused topic
Possible topics:
1. Stream data mining
2. Sequential pattern mining, sequence classification and clustering
3. Time-series analysis, regression and trend analysis
4. Graph pattern mining, graph classification and clustering
5. Social network analysis
6. Spatial, spatiotemporal data mining
7. Multimedia data mining
8. Text mining
9. Mining software programs
10. Statistical data mining methods
11. Other possible topics, which needs to be approved by the instructor
April 28, 2017
Data Mining: Concepts and Techniques
6
Course Projects
Finish a related (to the survey) problem as the final project
http://snap.stanford.edu/data/
http://www.kdnuggets.com/
Choose a data set/data sets from public dataset collections and propose a research
topic on this data set
The code should of your implementation should be submitted together with your
finial paper.
The topic of your survey paper should be submitted to zhangrc@act.buaa.edu.cn
before Oct. 20st. The proposal (2 pages) of your finial project should be submitted
to the above email address before Nov. 15st.
Examples of topics (need to be focused and specific)
Discovering Communities from Flickr Users
Mining Temporal/Spatial Patterns from User-generated Photos
7
Written Reports (Survey and Project)
Introduction of the topic
Related works
Choose at least three related areas/problems to be investigated
The advantages and disadvantages of these works
What is the difference between your approach and the existing works
Problem formulation/Models/Algorithms
Background
Motivation
Summarization/Limitation of existing works
The contribution of this work
Notations/Equations/Theories/Inference/Algorithms
Empirical studies/Evaluation
Data set
Evaluation metric
Comparative results
Discussion
Conclusion and Future work
8
Presentations
Choose a topic
Presentation
Paper presentations will be in 25-minute time slots: 20 minutes for presentation followed by 5
minutes for questions and answers.
the proposed presentation time here is just tentative for now and may need to be adjusted later
on, once the actual size of the class is known.
Prepare handouts for your presentation.
Send me an email by 12:00 noon on Oct 20, stating your names, the topic you choose as well as
your choice of at least 5 papers your are going to read. Send me the proposal of your finial
project before Nov. 15st. I will discuss the topic with students individually and evaluate the
feasibility of your proposed approach.
I will construct the Schedule of Paper Presentations and post it on the course website by
Oct. 25.
A handout is a one- or two-page summary of the paper to be presented, which is handed out to
the audience right before your presentation.
Things to be included
What is the main idea or contribution of the papers (or project)?
Is it useful or valuable? Why?
What assumptions were made?
Are these assumptions reasonable or practical?
Are the reported results correct or convincing?
How can the results be extended or improved?
What applications may arise from the main idea of the paper (or project)?
Are there any unclear points in the paper (or project)?
9
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
10
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes BIG DATA!
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
11
Why Data Mining
Credit ratings/targeted marketing:
Identify likely responders to sales promotions
Fraud detection
Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent, given
the demographics and transactional history of a particular
customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor? :
Data Mining helps extract such
information
Data mining
Process of semi-automatically analyzing large
databases to find patterns that are:
valid: hold on new data with some certainity
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to interpret
the pattern
Also known as Knowledge Discovery in
Databases (KDD)
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
14
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
Simple search and query processing
(Deductive) expert systems
15
What is (not) Data Mining?
What is not Data
Mining?
l
l
What is Data Mining?
– Look up phone
number in phone
directory
– Certain names are more
prevalent in certain US
locations (O’Brien, O’Rurke,
O’Reilly… in Boston area)
– Query a Web
search engine for
information
about “Amazon”
– Group together similar
documents returned by
search engine according to
their context (e.g. Amazon
rainforest, Amazon.com,)
Applications
Banking: loan/credit card approval
Customer relationship management:
identify likely responders to promotions
Fraud detection: telecommunications, financial
transactions
identify those who are likely to leave for a competitor.
Targeted marketing:
predict good customers based on old customers
from an online stream of event identify fraudulent events
Manufacturing and production:
automatically adjust knobs when process parameter changes
Applications (continued)
Medicine: disease outcome, effectiveness of
treatments
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
analyze patient disease history: find relationship
between diseases
identify new galaxies by searching for sub clusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
Knowledge Discovery (KDD) Process
This is a view from typical
database systems and data
Pattern Evaluation
warehousing communities
Data mining plays an essential
role in the knowledge discovery
Data Mining
process
Task-relevant Data
Data Warehouse
Selection
Data Cleaning
Data Integration
Databases
19
The KDD process
Problem fomulation
Data collection
Pre-processing: cleaning
subset data: sampling might hurt if highly skewed data
feature selection: principal component analysis, heuristic
search
name/address cleaning, different meanings (annual, yearly),
duplicate removal, supplying missing values
Transformation:
map complex objects e.g. time series data to features e.g.
frequency
Choosing mining task and mining method:
Result evaluation and Visualization:
Knowledge discovery is an iterative process
Relationship with other fields
Overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
but more stress on
scalability of number of features and instances
stress on algorithms and architectures whereas
foundations of methods and formulations provided
by statistics and machine learning.
automation for handling large, heterogeneous data
Some basic operations
Predictive:
Regression
Classification
Collaborative Filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
Example: A Web Mining Framework
Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored into
knowledge-base
23
Data Mining in Business Intelligence
Increasing potential
to support
business decisions
Decision
Making
Data Presentation
Visualization Techniques
End User
Business
Analyst
Data Mining
Information Discovery
Data
Analyst
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
DBA
24
KDD Process: A Typical View from ML and
Statistics
Input Data
Data PreProcessing
Data integration
Normalization
Feature selection
Dimension reduction
Data
Mining
Pattern discovery
Association & correlation
Classification
Clustering
Outlier analysis
…………
PostProcessing
Pattern
Pattern
Pattern
Pattern
evaluation
selection
interpretation
visualization
This is a view from typical machine learning and statistics communities
25
Which View Do You Prefer?
Which view do you prefer?
KDD vs. ML/Stat. vs. Business Intelligence
Depending on the data, applications, and your focus
Data Mining vs. Data Exploration
Business intelligence view
Warehouse, data cube, reporting but not much
mining
Business objects vs. data mining tools
Supply chain example: mining vs. OLAP vs.
presentation tools
Data presentation vs. data exploration
26
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
27
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social
and information networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
28
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
29
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
30
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
31
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cube technology
Data cleaning, transformation, integration, and
multidimensional data model
Scalable methods for computing (i.e., materializing)
multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: Characterization
and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
32
Data Mining Function: (2) Association and
Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your
Walmart?
Association, correlation vs. causality
A typical association rule
Diaper Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering,
and other applications?
33
Association rules
T
Milk, cereal
Tea, milk
Given set T of groups of items
Example: set of item sets purchased
Goal: find all rules on itemsets of the Tea, rice, bread
form a-->b such that
support of a and b > user threshold s
conditional probability (confidence) of b
given a > user threshold c
Example: Milk --> bread
Purchase of product A --> service B
cereal
The Apriori Algorithm—An Example
Database TDB
Tid
Items
10
A, C, D
20
B, C, E
30
A, B, C, E
40
B, E
Supmin = 2
Itemset
{A, C}
{B, C}
{B, E}
{C, E}
sup
{A}
2
{B}
3
{C}
3
{D}
1
{E}
3
C1
1st scan
C2
L2
Itemset
sup
2
2
3
2
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
sup
1
2
1
2
3
2
Itemset
sup
{A}
2
{B}
3
{C}
3
{E}
3
L1
C2
2nd scan
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
C3
April 28, 2017
Itemset
{B, C, E}
3rd scan
L3
Itemset
sup
{B, C, E}
2
Data Mining: Concepts and Techniques
35
Variants
High confidence may not imply high correlation
Use correlations. Find expected support and
large departures from that interesting..
see statistical literature on contingency tables.
Still too many rules, need to prune...
Applications of fast itemset
counting
Find correlated events:
Applications in medicine: find redundant tests
Cross selling in retail, banking
Improve predictive capability of classifiers that
assume attribute independence
New similarity measures of categorical attributes
[Mannila et al, KDD 98]
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
Predict some unknown class labels
Typical methods
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, patternbased classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
38
Classification
Given old data about customers and payments,
predict new applicant’s loan eligibility.
Previous customers
Age
Salary
Profession
Location
Customer type
Classifier
Decision rules
Salary > 5 L
Prof. = Exec
New applicant’s data
Good/
bad
Classification Example
Tid Refund Marital
Status
Taxable
Income Cheat
Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
No
Single
75K
?
2
No
Married
100K
No
Yes
Married
50K
?
3
No
Single
70K
No
No
Married
150K
?
4
Yes
Married
120K
No
Yes
Divorced 90K
?
5
No
Divorced 95K
Yes
No
Single
40K
?
6
No
Married
No
No
Married
80K
?
60K
10
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
10
No
Single
90K
Yes
Training
Set
Learn
Classifier
Test
Set
Model
Classification—A Two-Step Process
41
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation
(test) set
Process (1): Model Construction
Training
Data
NAME
M ike
M ary
B ill
Jim
D ave
A nne
42
RANK
YEARS TENURED
A ssistant P rof
3
no
A ssistant P rof
7
yes
P rofessor
2
yes
A ssociate P rof
7
yes
A ssistant P rof
6
no
A ssociate P rof
3
no
Classification
Algorithms
Classifier
(Model)
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in
Prediction
Classifier
Testing
Data
Unseen Data
(Jeff, Professor, 4)
NAME
T om
M erlisa
G eorge
Joseph
43
RANK
YEARS TENURED
A ssistant P rof
2
no
A ssociate P rof
7
no
P rofessor
5
yes
A ssistant P rof
7
yes
Tenured?
The Classification Problem
(informal definition)
Given a collection of annotated
data. In this case 5 instances
Katydids of and five of
Grasshoppers, decide what
type of insect the unlabeled
example is.
Katydid or Grasshopper?
Katydids
Grasshoppers
For any domain of interest, we can measure featu
Color {Green, Brown, Gray,
Other}
Abdom
Thora
en
x
Length
Lengt
h
Has Wings?
Antenn
ae
Length
Mandible
Size
Spiracle
Diameter
Leg Length
We can store
features in a
database.
The classification
problem can now be
expressed as:
My_Collection
Insect Abdomen Antennae Insect Class
ID
Length
Length
Grasshopper
1
2.7
5.5
2
3
4
5
6
7
8
9
10
8.0
0.9
1.1
5.4
2.9
6.1
0.5
8.3
8.1
• Given a training
database
(My_Collection),
predict the class
label of a previously
unseen
instance
previously
unseen instance11=
9.1
4.7
3.1
8.5
1.9
6.6
1.0
6.6
4.7
5.1
7.0
Katydid
Grasshopper
Grasshopper
Katydid
Grasshopper
Katydid
Grasshopper
Katydid
Katydids
???????
Grasshoppers
Katydids
Antenna Length
10
9
8
7
6
5
4
3
2
1
1 2 3 4 5 6 7 8 9 10
Abdomen
Length
Antenna Length
Grasshoppers
We will also use this lager
dataset as a motivating
10 example…
9
8
7
6
5
4
3
2
1
1 2 3 4 5 6 7 8 9 10
Abdomen
Length
Katydids
Each of these data
objects are
called…
• exemplars
• (training)
examples
• instances
• tuples
Classification methods
Goal: Predict class Ci = f(x1, x2, .. Xn)
Regression: (linear or any other polynomial)
a*x1 + b*x2 + c = Ci.
Nearest neighour
Decision tree classifier: divide decision space into
piecewise constant regions.
Probabilistic/generative models
Neural networks: partition by non-linear boundaries
Nearest neighbor
Define proximity between instances, find
neighbors of new instance and assign majority
class
Case based reasoning: when attributes are
more complicated than real-valued.
• Pros
+ Fast
training
• Cons
– Slow during
application.
– No feature selection.
– Notion of proximity
Decision trees
Tree where internal nodes are simple
decision rules on one or more
attributes and leaf nodes are
predicted class labels.
Salary < 1 M
Prof = teacher
Good
Bad
Age < 30
Bad
Good
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity & minimizing
interclass similarity
Many methods and applications
52
Clustering
Unsupervised learning when old data with class
labels not available e.g. when introducing a new
product.
Group/cluster existing customers based on time
series of payment history such that similar
customers in same cluster.
Key requirement: Need a good measure of
similarity between instances.
Identify micro-markets and develop policies for
each
Applications
Customer segmentation e.g. for targeted
marketing
Collaborative filtering:
Group/cluster existing customers based on time
series of payment history such that similar customers
in same cluster.
Identify micro-markets and develop policies for each
group based on common items purchased
Text clustering
Compression
Distance functions
Numeric data: euclidean, manhattan distances
Categorical data: 0/1 to indicate
presence/absence followed by
Hamming distance (# dissimilarity)
Jaccard coefficients: #similarity in 1s/(# of 1s)
data dependent measures: similarity of A and B
depends on co-occurance with C.
Combined numeric and categorical data:
weighted normalized distance:
Clustering methods
Hierarchical clustering
agglomerative Vs divisive
single link Vs complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based:
Example of clustering
Rangeela
QSQT Rangeela
100 daysAnand
Sholay
Anand
QSQT
100 days
VertigoDeewar
DeewarVertigo
Sholay
Smita
Vijay
Vijay
Rajesh
Mohan
Mohan
Rajesh
Nina
Nina
Smita
Nitin
Nitin
?
?
?
?
?
?
?
?
??
??
Model-based approach
People and movies belong to unknown classes
Pk = probability a random person is in class k
Pl = probability a random movie is in class l
Pkl = probability of a class-k person liking a class-l
movie
Gibbs sampling: iterate
Pick a person or movie at random and assign to a class
with probability proportional to Pk or Pl
Estimate new parameters
Need statistics background to understand details
Partitional methods: K-means
Criteria: minimize sum of square of distance
Between each point and centroid of the cluster.
Between each pair of points in the cluster
Algorithm:
Select initial partition with K clusters: random, first
K, K separated points
Repeat until stabilization:
Assign each point to closest cluster center
Generate new cluster centers
Adjust clusters by merging/splitting
Collaborative Filtering
Given database of user preferences, predict
preference of new user
Example: predict what new movies you will like
based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person may
want to buy
(and suggest it, or give discounts to tempt customer)
Collaborative recommendation
•Possible approaches:
•
Average vote along columns [Same prediction for all]
• Weight vote based on similarity of likings [GroupLens]
RangeelaQSQT
RangeelaQSQT
Smita
Vijay
Mohan
Rajesh
Nina
Nitin
?
?
100 daysAnand Sholay Deewar Vertigo
?
?
?
?
Cluster-based approaches
External attributes of people and movies to
cluster
Cluster people based on movie preferences
age, gender of people
actors and directors of movies.
[ May not be available]
misses information about similarity of movies
Repeated clustering:
cluster movies based on people, then people based on
movies, and repeat
ad hoc, might smear out groups
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general
behavior of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
64
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g.,
regression and value prediction
Sequential pattern mining
e.g., first buy digital camera, then buy large SD
memory cards
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite, data streams
65
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be multiple information networks: friends,
family, classmates, …
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, …
66
Evaluation of Knowledge
Are all mined knowledge interesting?
One can mine tremendous amount of “patterns” and knowledge
Some may fit only certain dimension space (time, location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only
interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
67
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
68
Data Mining: Confluence of Multiple Disciplines
Machine
Learning
Applications
Algorithm
Pattern
Recognition
Data Mining
Database
Technology
Statistics
Visualization
High-Performance
Computing
69
Why Confluence of Multiple Disciplines?
Tremendous amount of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Algorithms must be highly scalable to handle such as tera-bytes of
data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
70
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
71
Applications of Data Mining
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQLServer Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
72
Data Mining in Practice
73
Application Areas
Industry
Finance
Insurance
Telecommunication
Transport
Consumer goods
Data Service providers
Utilities
Application
Credit Card Analysis
Claims, Fraud Analysis
Call record analysis
Logistics management
promotion analysis
Value added data
Power usage analysis
Why Now?
Data is being produced
Data is being warehoused
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
Data Mining works with
Warehouse Data
Data Warehousing provides the
Enterprise with a memory
ÑData Mining provides the
Enterprise with intelligence
Usage scenarios
Data warehouse mining:
assimilate data from operational sources
mine static data
Mining log data
Continuous mining: example in process control
Stages in mining:
data selection pre-processing: cleaning
transformation mining result evaluation
visualization
Vertical integration: Mining on the web
Web log analysis for site design:
what are popular pages,
what links are hard to find.
Electronic stores sales enhancements:
recommendations, advertisement:
Collaborative filtering: Net perception, Wisewire
Inventory control: what was a shopper looking for and
could not find..
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
79
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
80
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
81
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
82
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
1991-1994 Workshops on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), WSDM (2008), etc.
ACM Transactions on KDD (2007)
83
Conferences and Journals on Data Mining
KDD Conferences
ACM SIGKDD Int. Conf. on
Knowledge Discovery in
Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining
(ICDM)
European Conf. on Machine
Learning and Principles and
practices of Knowledge Discovery
and Data Mining (ECML-PKDD)
Pacific-Asia Conf. on Knowledge
Discovery and Data Mining
(PAKDD)
Int. Conf. on Web Search and
Data Mining (WSDM)
Other related conferences
DB conferences: ACM SIGMOD,
VLDB, ICDE, EDBT, ICDT, …
Web and IR conferences: WWW,
SIGIR, WSDM
ML conferences: ICML, NIPS
PR conferences: CVPR,
Journals
Data Mining and Knowledge
Discovery (DAMI or DMKD)
IEEE Trans. On Knowledge and
Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
84
Where to Find References? DBLP, CiteSeer, Google
Data mining and KDD (SIGKDD: CDROM)
Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems,
Statistics
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems,
IEEE-PAMI, etc.
Web and IR
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.
85
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Kinds of Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and Data Mining Society
Summary
86
Summary
Data mining: Discovering interesting patterns and knowledge from
massive amount of data
A natural evolution of science and information technology, in great
demand, with wide applications
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, trend and outlier analysis, etc.
Data mining technologies and applications
Major issues in data mining
87