Data Mining:
A Database Perspective
Presented by YC Liu
References
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Chapter 6.
• M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 8(6): 866-883, 1996.
• J. Liu, Y. Pan, K. Wang, and J. Han, "Mining Frequent Item Sets by Opportunistic Projection", in Proc. of 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, July 2002.
Outline
• Introduction
• Mining Association Rules
• Multilevel Data Generalization, Summarization, and Characterization
• Data Classification
• Clustering Analysis
• (Pattern-Based Similarity Search)
• (Mining Path Traversal Patterns)
• (Recommendation)
• (Web Mining)
• (Text Mining)
Introduction(1/5)
• Knowledge Discovery in Databases
• A process of nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
Introduction(2/5)
• Main functions: turning Data into Knowledge
  – Mine knowledge from databases
  – Understand user behavior
  – Help enterprises make decisions
  – Increase business opportunities
• Why has data mining taken off?
  – Widespread use of product bar codes
  – Computerization of the business world
  – Millions of databases in active use
  – Large volumes of business transaction data accumulated over the years
Introduction(3/5)
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: the KDD process: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Introduction(4/5)
Challenges of Data Mining(1/2)
• Handling of Different Types of Data
• Efficiency and Scalability of Data
Mining Algorithms
• Usefulness, Certainty, and
Expressiveness of Data Mining
Results
• Expression of Various Kinds of Data
Mining Requests and Results
Introduction(5/5)
Challenges of Data Mining(2/2)
• Interactive Mining of Knowledge at
Multiple Abstraction Levels
• Mining Information from Different
Sources of Data
• Protection of Privacy and Data
Security
An Overview of Data Mining
Techniques
• Classifying Data Mining Techniques
– What kinds of databases to work on
• Relational database, transaction database,
spatial database, temporal database.....
– What kinds of knowledge to be mined
• Association rules, classification,
clustering...
– What kinds of techniques to be utilized
• Generalization-based mining, pattern-based mining, mining based on statistical or mathematical theories.
Mining Different Kinds of
Knowledge from Databases
– Association Rules
– Data generalization, summarization,
and characterization
– Data classification
– Data clustering
– Pattern-based similarity search
– Path traversal patterns
– Recommendation
– Web Mining
– Text Mining
Mining Association Rules
• An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
• The rule X => Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
• The rule X => Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
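As a concrete illustration of these definitions, here is a minimal Python sketch (the function names are mine, and the toy transactions anticipate the example table used later in the talk) that computes the support and confidence of a candidate rule X => Y:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# Toy transaction database (illustrative only)
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

x, y = {"A"}, {"C"}
print(f"support(A => C)    = {support(x | y, transactions):.0%}")   # 50%
print(f"confidence(A => C) = {confidence(x, y, transactions):.1%}") # 66.7%
```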
What Is Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Cross-marketing and attached mailing applications.
  – Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns.
• Examples:
  – Rule form: "Body => Head [support, confidence]"
  – buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") => grade(x, "A") [1%, 75%]
Association Rule: Basic
Concepts
• Given: (1) database of transactions, (2)
each transaction is a list of items (purchased
by a customer in a visit)
• Find: all rules that correlate the presence of
one set of items with that of another set of
items
– E.g., 98% of people who purchase tires and auto
accessories also get automotive services done
• Applications
– * => Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
– Home Electronics => * (What other products should the store stock up on?)
Rule Measures: Support and Confidence
• Find all the rules X & Y => Z with minimum confidence and support
  – support, s: probability that a transaction contains {X ∪ Y ∪ Z}
  – confidence, c: conditional probability that a transaction having {X ∪ Y} also contains Z
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and the overlap (customers who buy both)]

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
  – A => C (50%, 66.6%)
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
  – age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
• Single dimension vs. multiple dimensional associations
  – age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
• Single level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions
  – Correlation, causality analysis
Mining Association Rules – An Example

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%
Min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A => C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle:
  Any subset of a frequent itemset must be frequent
Mining Association Rules
• Steps for mining association rules
  – Discover all large (frequent) itemsets
  – Use the large itemsets to generate the association rules for the database
• To identify the large itemsets – Algorithm Apriori (see the sketch below)
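Here is a minimal Python sketch of the Apriori level-wise search (the function name, structure, and toy data are mine, not taken from the cited papers); it reproduces the frequent itemsets of the example above:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori search for frequent (large) itemsets.

    transactions: list of sets of items; min_support: fraction in [0, 1].
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    n = len(transactions)
    support = lambda items: sum(1 for t in transactions if items <= t) / n

    # L1: frequent 1-itemsets
    all_items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): support(frozenset([i]))
                for i in all_items if support(frozenset([i])) >= min_support}
    result, k = dict(frequent), 2

    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori principle).
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c: support(c) for c in candidates if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(transactions, min_support=0.5))
# {A}: 0.75, {B}: 0.5, {C}: 0.5, {A, C}: 0.5
```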
Mining generalized and multilevel association rules
• Interesting associations among data items
often occur at a relatively high concept
level
Interestingness of Discovered
Association Rules
• Example 1 (Aggarwal & Yu, PODS98)
  – Among 5000 students
    • 3000 play basketball
    • 3750 eat cereal
    • 2000 both play basketball and eat cereal
  – play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  – play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.

              basketball   not basketball   sum(row)
  cereal      2000         1750             3750
  not cereal  1000         250              1250
  sum(col.)   3000         2000             5000
Interestingness of Discovered
Association Rules
• An association rule "A => B" is interesting if its confidence exceeds a certain measure, i.e., if

    P(A ∪ B) / P(A) − P(B) > d

  where d is a suitable constant.
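Applying this measure to the example above (the arithmetic is mine, using the figures from the contingency table): for play basketball => eat cereal, P(A ∪ B)/P(A) − P(B) = 2000/3000 − 3750/5000 = 0.667 − 0.75 = −0.083, which is negative, so for any reasonable d the rule is not reported as interesting even though its confidence is 66.7%.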
Improving the Efficiency of
Mining Association Rules
• Database Scan Reduction
– FP-tree......
• Sampling
• Incremental Updating of Discovered
Association Rules
• Parallel Data Mining
Classification
• A process of learning a function that maps a
data item into one of several predefined
classes.
• Every classification based on inductive-learning algorithms is given as input a set of samples that consist of vectors of attribute values and a corresponding class.
• predicts categorical class labels
• classifies data (constructs a model) based
on the training set and the values (class
labels) in a classifying attribute and uses it
in classifying new data
Classification Process (1): Model Construction
[Training Data → Classification Algorithms → Classifier (Model)]

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
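To make the prediction step concrete, here is a minimal Python sketch (the function name is mine) of the rule learned in the model-construction step, applied to the unseen record from the slide:

```python
def tenured(rank, years):
    """Classifier learned above: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Unseen data: (Jeff, Professor, 4)
print(tenured("professor", 4))  # -> "yes"
```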
Data Classification
• Decision-tree-based Classification
Method
– Decision Tree Learning System, ID3
– Evaluation Functions
• Information Gain (entropy)

    I = − Σ_i p_i ln(p_i)

• Gini Index

    gini(T) = 1 − Σ_{j=1..n} p_j²
Training Dataset
This follows an example from Quinlan's ID3.

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no
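To connect the table with the evaluation functions above, here is a small Python sketch (helper names are mine; the rows are the table above) that scores each attribute for the root split of the decision tree shown next:

```python
from math import log2
from collections import Counter

# (age, income, student, credit_rating, buys_computer) rows from the table above
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def info(rows):
    """Entropy (base 2) of the class label, i.e. the last column."""
    counts = Counter(r[-1] for r in rows)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, col):
    """Information gain of splitting `rows` on attribute column `col`."""
    groups = Counter(r[col] for r in rows)
    total = len(rows)
    remainder = sum(n / total * info([r for r in rows if r[col] == v])
                    for v, n in groups.items())
    return info(rows) - remainder

for i, name in enumerate(attrs):
    print(f"gain({name}) = {gain(data, i):.3f}")
# age has the highest gain, so ID3 picks it as the root of the tree below.
```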
Output: A Decision Tree for "buys_computer"

age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit_rating?
             excellent → no
             fair      → yes
Performance Improvement
• Database Indices
• Attribute-oriented Induction
• Two-phase Multiattribute Extraction
– Inference Power
– Feature Extraction Phase
– Feature Combination Phase
Clustering Analysis
• Clustering:
The process of grouping physical or
abstract objects into classes of similar
objects.
• Clustering Analysis:
to construct a meaningful partitioning of a large set of objects based on a "divide and conquer" methodology.
• Methods:
  – Statistical Analysis (Bayesian Classification Method)
  – Probability Analysis
Clustering Based on Randomized
Search
• PAM
(Partitioning Around Medoids)
• CLARA
(CLustering LARge Applications)
• CLARANS
(Clustering Large Applications Based Upon
RANdomized Search)
PAM (Partitioning Around
Medoids) (1987)
• PAM (Kaufman and Rousseeuw, 1987), built into S-PLUS
• Uses real objects (medoids) to represent the clusters
– Select k representative objects arbitrarily
– For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
– For each pair of i and h,
• If TCih < 0, i is replaced by h
• Then assign each non-selected object to the most
similar representative object
– Repeat steps 2-3 until there is no change (see the sketch below)
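The loop above can be sketched in a few lines of Python (a simplified PAM with my own function names and toy data; it recomputes the full cost of a swap instead of the incremental TC_ih, which is slower but easier to read):

```python
import random

def pam(points, k, dist):
    """Simplified PAM: greedily swap a medoid i with a non-medoid h
    whenever the swap reduces the total cost (sum of distances of
    every point to its closest medoid)."""
    medoids = random.sample(points, k)

    def total_cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    improved = True
    while improved:
        improved = False
        for i in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # A lower total cost corresponds to TC_ih < 0
                if total_cost(candidate) < total_cost(medoids):
                    medoids, improved = candidate, True
    # Assign each non-selected object to the most similar representative object
    clusters = {m: [] for m in medoids}
    for p in points:
        best = min(medoids, key=lambda m: dist(p, m))
        clusters[best].append(p)
    return medoids, clusters

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(pam(pts, k=2, dist=euclid))
```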
PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih

[Figure: four panels illustrating the contribution C_jih of a non-selected object j when selected object i is swapped with non-selected object h (t denotes another selected object):
  C_jih = d(j, h) − d(j, i)
  C_jih = 0
  C_jih = d(j, t) − d(j, i)
  C_jih = d(j, h) − d(j, t)]
CLARA (Clustering Large
Applications) (1990)
• CLARA (Kaufman and Rousseeuw, 1990)
  – Built into statistical analysis packages, such as S+
• It draws multiple samples of the data set,
applies PAM on each sample, and gives the best
clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
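A minimal sketch of the CLARA idea, assuming the `pam` function from the previous sketch is in scope (the sample count and sample size are illustrative assumptions, not the values used by the original algorithm):

```python
import random

def clara(points, k, dist, n_samples=5, sample_size=40):
    """Draw several samples, run PAM on each, and keep the medoid set
    that gives the lowest total cost on the *whole* data set."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids, _ = pam(sample, k, dist)  # PAM on the sample only
        cost = sum(min(dist(p, m) for m in medoids) for p in points)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```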
Focusing Methods
• Focusing Methods
– CLARANS assumes that all the objects to be
clustered are all stored in main memory
– The most computationally expensive step of
CLARANS is calculating the total distances
between the two clusters
– Reducing the number of objects considered
• Only the most central object of each leaf node of the R*-tree is used to compute the medoids of the clusters
– Restricting the access
• Focus on Relevant Clusters
• Focus on a Cluster
BIRCH (Balanced Iterative Reducing and Clustering)
• An incremental method that can adjust its memory requirements to the amount of memory that is available
• Clustering Features
– Summarize information about the subclusters
of points instead of storing all points
• CF Trees
– Branching factor B and threshold T
• By changing the threshold value we can
change the size of the tree
– Use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
  N:  number of data points
  LS: linear sum of the N data points, Σ_{i=1..N} X_i
  SS: square sum of the N data points, Σ_{i=1..N} X_i²

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16, 30), (54, 190))
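A short Python sketch (function names are mine) of computing a clustering feature and of the additivity that lets BIRCH merge subclusters without revisiting the raw points:

```python
def clustering_feature(points):
    """CF = (N, LS, SS) summarizing a set of d-dimensional points."""
    n = len(points)
    dims = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dims))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
    return n, ls, ss

def merge_cf(cf1, cf2):
    """CFs are additive: merging two subclusters just adds their CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # (5, (16, 30), (54, 190))
```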
CF Tree

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 ... CF6, each pointing to a child node; non-leaf nodes hold CF entries that summarize their children; leaf nodes hold CF entries for subclusters and are chained together with prev/next pointers.]
Data Generalization,
Summarization, and
Characterization
• Data Generalization:
  A process which abstracts a large set of relevant data in a database from low concept levels to relatively high ones.
• Approaches:
  1. Data Cube Approach
  2. Attribute-oriented Induction Approach
Data Cube Approach
• Multidimensional database, OLAP, ....
• The general idea of the approach is to materialize certain expensive computations that are frequently inquired about
  – such as count, sum, average, max, min, ...
  – Fast response time and flexible views of data from different angles and at different abstraction levels
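As a toy illustration of materializing such aggregates ahead of time (the fact rows, dimension names, and grouping scheme are invented for the example, not taken from the slides), one could precompute counts and sums for every combination of dimension values:

```python
from collections import defaultdict
from itertools import combinations

# Invented fact rows: (region, year, product, sales)
facts = [
    ("north", 2001, "pc", 120),
    ("north", 2002, "tv", 80),
    ("south", 2001, "pc", 200),
    ("south", 2002, "pc", 150),
]
dims = ("region", "year", "product")

# Materialize count and sum for every subset of dimensions (a simple cube).
cube = defaultdict(lambda: [0, 0])  # key -> [count, sum(sales)]
for row in facts:
    values, sales = row[:3], row[3]
    for r in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), r):
            key = tuple((dims[i], values[i]) for i in subset)
            cube[key][0] += 1
            cube[key][1] += sales

# Cheap lookups afterwards, e.g. count and total sales for region = "south":
print(cube[(("region", "south"),)])  # [2, 350]
```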
Attribute-oriented Induction
Approach
• Essential Background Knowledge: Concept Hierarchy
• Steps:
  – Retrieval of the initial relation
  – Attribute removal
  – Concept-tree climbing
  – Vote propagation
  – Threshold control
  – Rule transformation
Concept Hierarchy and
Concept-Tree
• The concept hierarchy must be clearly defined before induction. The most general concept is denoted by "ANY" or "ALL", and the most specific concepts correspond to particular values of the attribute in the database. For example, the concept hierarchy of the attribute Birth place can be represented this way.
Example
• Suppose we want to find the characteristic rule of graduate students:
Example
• The concept hierarchy table of the attributes (Concept Hierarchy Table)
Example
• Filter out the tuples whose Status attribute is Graduate. A "Vote" column is also added to each tuple of the table to record, during induction, the number of original tuples that match that tuple.
Example-Attribute Removal
• Remove every attribute for which no higher-level concept exists in the concept hierarchy.
Example-Concept-Tree Climbing and Vote Propagation
• If a higher-level concept exists for an attribute in the concept hierarchy, the attribute values are replaced by their higher-level values. In this example, history, physics, math, ... are replaced by science.
• After the attribute values climb up, identical tuples are merged into one and their vote values are accumulated into the generalized tuple.
Example-Threshold Control and Rule Transformation
• Threshold Control
• The induction is complete.
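A small Python sketch of the generalization loop described above (the concept hierarchy entries, the relation, and the function name are simplified assumptions for illustration; only history/physics/math → science comes from the slides):

```python
from collections import Counter

# Simplified concept hierarchy: specific value -> more general value
hierarchy = {
    "history": "science", "physics": "science", "math": "science",
    "Taipei": "Taiwan", "Tainan": "Taiwan", "Vancouver": "Canada",
}

# Task-relevant tuples after the initial retrieval: (major, birth_place)
relation = [
    ("history", "Taipei"),
    ("physics", "Tainan"),
    ("math", "Vancouver"),
    ("physics", "Taipei"),
]

def generalize(tuples, hierarchy):
    """One round of concept-tree climbing plus vote propagation:
    replace each value by its higher-level concept (if any), then
    merge identical tuples and accumulate their votes."""
    votes = Counter()
    for row in tuples:
        climbed = tuple(hierarchy.get(v, v) for v in row)
        votes[climbed] += 1
    return votes

print(generalize(relation, hierarchy))
# Counter({('science', 'Taiwan'): 3, ('science', 'Canada'): 1})
```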