Outline
- Motivation
- Data mining primitives
- Data mining query languages
- Designing GUIs for data mining systems
- Architectures
Data Mining Primitives
CS 5331 by Rattikorn Hewett
Texas Tech University
Motivations: Why primitives?
- Data mining systems uncover a large set of patterns: not all are interesting.
- Data mining should be an interactive process: the user directs what is to be mined.
- Users need data mining primitives to communicate with the data mining system, by incorporating them in a data mining query language.

Data mining primitives
- Data mining tasks can be specified in the form of data mining queries by five data mining primitives:
  - Task-relevant data (input)
  - The kinds of knowledge to be mined (function & output)
  - Background knowledge (interpretation)
  - Interestingness measures (evaluation)
  - Visualization of the discovered patterns (presentation)
- Benefits:
  - More flexible user interaction
  - Foundation for the design of graphical user interfaces
  - Standardization of data mining industry and practice
Task-relevant data
- Specify the data to be mined:
  - Database, data warehouse, relation, cube
  - Conditions for selection & grouping
  - Relevant attributes

Knowledge to be mined
- Specify the data mining "functions":
  - Characterization/discrimination
  - Association
  - Classification/prediction
  - Clustering
Background Knowledge
- Typically, in the form of concept hierarchies:
  - Schema hierarchy: e.g., street < city < state < country
  - Set-grouping hierarchy: e.g., {30..49} = low, {50..100} = high, {low, high} = all
  - Operation-derived hierarchy: e.g., an email address such as dmbook@cs.ttu.edu yields login-name < department < university < organization
  - Rule-based hierarchy: e.g., 87 <= temperature < 90 -> normal_temperature

Interestingness
- Objective measures:
  - Simplicity: (association) rule length, (decision) tree size; simpler rules are easier to understand and more likely to be interesting
  - Certainty: validity of the rule. A rule A => B has confidence P(B|A) = #(A and B) / #(A); related measures include classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
  - Support: a rule A => B has support #(A and B) / sample size; noise threshold (for description)
  - Utility: potential usefulness
  - Novelty: not previously known, surprising (used to remove redundant rules)
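The support and confidence measures above can be computed directly from transaction data. A minimal sketch, assuming a hypothetical set of market-basket transactions and the rule {bread} => {butter}:

```python
# Hypothetical toy transactions (not from the slides), used to
# illustrate support = #(A and B)/n and confidence = #(A and B)/#(A).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_a = sum(1 for t in transactions if "bread" in t)               # #(A)
count_ab = sum(1 for t in transactions if {"bread", "butter"} <= t)  # #(A and B)

support = count_ab / n           # #(A and B) / sample size
confidence = count_ab / count_a  # P(B|A) = #(A and B) / #(A)

print(support, confidence)  # 0.6 0.75
```

Here {bread} appears in 4 of 5 transactions and {bread, butter} in 3, giving support 3/5 and confidence 3/4.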
Visualization of Discovered Patterns
- Specify the form in which to view the patterns: e.g., rules, tables, charts, decision trees, cubes, reports, etc.
- Specify operations for data exploration at multiple levels of abstraction: e.g., drill-down, roll-up, etc.

DMQL (data mining query language)
- A DMQL can provide the ability to support ad-hoc and interactive data mining
- By providing a standardized language, the hope is to achieve an effect similar to that of SQL on relational databases:
  - Foundation for system development and evolution
  - Facilitates information exchange, technology transfer, commercialization and wide acceptance
- DMQL is designed with the primitives described earlier
Languages & Standardization Efforts
- Association rule language specifications:
  - MSQL (Imielinski & Virmani '99)
  - MineRule (Meo, Psaila and Ceri '96)
  - Query flocks based on Datalog syntax (Tsur et al. '98)
- OLEDB for DM (Microsoft 2000)
  - Based on OLE, OLE DB, OLE DB for OLAP
  - Integrating DBMS, data warehouse and data mining
- CRISP-DM (CRoss-Industry Standard Process for Data Mining)
  - Provides a platform and process structure for effective data mining
  - Emphasizes deploying data mining technology to solve business problems

Designing GUIs based on DMQL
- What tasks should be considered in the design of GUIs based on a data mining query language?
  - Data collection and data mining query composition
  - Presentation of discovered patterns
  - Hierarchy specification and manipulation
  - Manipulation of data mining primitives
  - Interactive multilevel mining
  - Other information
Architectures
Coupling the data mining system with a DB/DW system:
- No coupling: flat file processing; not recommended
- Loose coupling: fetching data from the DB/DW
- Semi-tight coupling: enhanced DM performance
  - Provide efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
- Tight coupling: a uniform information processing environment
  - DM is smoothly integrated into the DB/DW system; mining queries are optimized using mining query analysis, indexing, query processing methods, etc.
Concept Description
Outline
- Review terms
- Characterization
  - Summarization
  - Hierarchical generalization
  - Attribute relevance analysis
- Comparison/discrimination
- Descriptive statistical measures

Review terms
- Descriptive vs. predictive data mining
  - Descriptive: describes the data set in concise, summarative, informative, discriminative forms
  - Predictive: constructs models representing the data set, and uses them to predict behaviors of unknown data
- Concept description involves:
  - Characterization: provides a concise and succinct summarization of the given collection of data
  - Comparison (discrimination): provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
- Concept description:
  - can handle complex data types (e.g., text, image) for the attributes and their aggregations
  - a more automated process
- OLAP:
  - restricted to a small number of dimension and measure data types
  - a user-controlled process
Characterization methods
- One approach to characterization is to transform data from low conceptual levels to high ones: "data generalization"
  - E.g., daily sales -> annual sales; Biology -> Science
- Two methods:
  - Summarization, as in the data cube's OLAP
  - Hierarchical generalization: attribute-oriented induction

Summarization by OLAP
- Data are stored in data cubes
- Identify summarization computations, e.g., count(), sum(), average(), max()
- Perform the computations and store the results in data cubes
- Generalization and specialization can be performed on a data cube by roll-up and drill-down
- An efficient implementation of data generalization
- Limitations:
  - Can handle only simple non-numeric data types for dimensions
  - Can handle only summarization of numeric data
  - Does not guide users on which dimensions to explore or which levels to reach
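A roll-up over a tiny fact table can be sketched in plain Python (the sales data and dimension values below are hypothetical, chosen to mirror the daily sales -> annual sales example; no OLAP engine is assumed):

```python
from collections import defaultdict

# Hypothetical fact table: (city, month, sales). Rolling up the time
# dimension from month to year generalizes the data one level.
facts = [
    ("Lubbock", "2003-01", 100),
    ("Lubbock", "2003-02", 150),
    ("Dallas",  "2003-01", 200),
    ("Dallas",  "2004-03", 300),
]

rollup = defaultdict(int)
for city, month, sales in facts:
    year = month.split("-")[0]     # climb the time hierarchy: month -> year
    rollup[(city, year)] += sales  # sum() is the summarization computation

print(dict(rollup))
```

Drill-down is the reverse: returning to the finer (city, month) cells stored in the cube.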
Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data nor particular measures
- How is it done?
  - Collect the task-relevant data (initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  - Interactive presentation with users
Basic Elements
- Data focusing: task-relevant data, including dimensions; the result is the initial relation
- Attribute removal and attribute generalization, when attribute A has a large set of distinct values:
  - If there is no generalization operator on A, or A's higher-level concepts are expressed in terms of other attributes (giving redundancy): remove A
  - If there exists a set of generalization operators on A: select an operator to generalize A
- Generalization threshold controls:
  - Attribute generalization threshold: controls the size of an attribute's value set for generalization or removal (~2-8, specified/default)
  - Relation generalization threshold: controls the final relation/rule size (~10-30)

General Steps
1. InitialRel: query processing of the task-relevant data, deriving the initial relation
2. PreGen: based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal? or how high to generalize?
3. PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts
4. Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations
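The generalize-and-merge core of the steps above can be sketched as follows (the concept hierarchies and tuples are toy assumptions for illustration, loosely modeled on the student example):

```python
from collections import Counter

# Hypothetical concept hierarchies acting as generalization operators.
major_hier = {"CS": "Science", "Physics": "Science", "EE": "Eng"}
city_hier = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"}

# Initial relation: (name, major, birth_city). "name" has many distinct
# values and no generalization operator, so it is removed.
tuples = [
    ("Jim", "CS", "Vancouver"),
    ("Scott", "CS", "Montreal"),
    ("Laura", "Physics", "Seattle"),
    ("Amy", "EE", "Seattle"),
]

# Generalize each retained attribute one level up its hierarchy, then
# merge identical generalized tuples, accumulating their counts.
prime = Counter(
    (major_hier[major], city_hier[city]) for _, major, city in tuples
)

print(dict(prime))
```

The resulting Counter plays the role of the prime generalized relation: each key is a generalized tuple and each value its accumulated count.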
Example
DMQL: describe general characteristics of graduate students in the Big-University database:

  use Big_University_DB
  mine characteristics as "Science_Students"
  in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
  from student
  where status in "graduate"

Transform to the corresponding SQL statement:

  select name, gender, major, birth_place, birth_date, residence, phone#, gpa
  from student
  where status in {"Msc", "MBA", "PhD"}

Initial relation:

  Name           | Gender | Major   | Birth_Place           | Birth_date | Residence                | Phone #  | GPA
  Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
  Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
  Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
  ...            | ...    | ...     | ...                   | ...        | ...                      | ...      | ...

Example (cont.)
Generalization plan: Name is removed; Gender is retained; Major is generalized to {Sci, Eng, Bus}; Birth_Place is generalized to country (Birth_country); Birth_date is generalized to Age_range; Residence is generalized to city; Phone # is removed; GPA is generalized to {Excl, VG, ...}.

Prime generalized relation:

  Gender | Major   | Birth_country | Age_range | Residence | GPA       | Count
  M      | Science | Canada        | 20-25     | Richmond  | Very-good | 16
  F      | Science | Foreign       | 25-30     | Burnaby   | Excellent | 22
  ...    | ...     | ...           | ...       | ...       | ...       | ...

Presentation of results
- Generalized relation: relations where some or all attributes are generalized, with counts or other aggregation values accumulated
- Cross tabulation: mapping results into a cross tabulation, e.g., counts of attribute Birth_Region by Gender:

  Birth_Region | M  | F  | Total
  Canada       | 16 | 10 | 26
  Foreign      | 14 | 22 | 36
  Total        | 30 | 32 | 62

- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms
- Quantitative characteristic rules: mapping the generalized result into characteristic rules with quantitative information associated with them (e.g., t = typicality):

  grad(x) ∧ male(x) => birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
Analysis of Attribute Relevance
- Goal: filter out statistically irrelevant attributes, or rank attributes for mining
  - Irrelevant attributes lead to inaccurate or unnecessarily complex patterns
- Idea: compute a measure that quantifies the relevance of an attribute with respect to a given class or concept
  - An attribute is highly relevant for classifying/predicting a class if its values are likely to distinguish that class from others
  - E.g., to describe cheap vs. expensive cars, is "color" a relevant attribute? What about using "color" to compare bananas and apples?

Methods
- These measures can be:
  - Information gain
  - The Gini index
  - Uncertainty
  - Correlation coefficients
Example
How relevant is the attribute "Major" to the classification of graduate/undergraduate students?
Relevance measure: information gain

Review formulae:
- For an attribute value set S, where each data point is labeled with a class in C and pi is the probability that class i occurs in S, the entropy is

  Ent(S) = - Σ_{i ∈ C} pi log2 pi

- The expected information needed to classify a sample if it is partitioned into sets Si, one per value i of attribute A, is

  I(A) = Σ_{i ∈ dom(A)} (|Si| / |S|) Ent(Si)

- Information gain: Gain(A) = Ent(S) - I(A)

Data (the first six rows are the 120 graduates; the last six are the 130 undergraduates):

  Gender | Major    | Birth_country | Age_range | GPA       | Count
  M      | Science  | Canada        | 20-25     | Very-good | 16
  F      | Science  | Foreign       | 25-30     | Excellent | 22
  M      | Eng      | Foreign       | ...       | ...       | 18
  F      | Science  | Foreign       | ...       | ...       | 25
  M      | Science  | Canada        | ...       | ...       | 21
  F      | Eng      | Canada        | ...       | ...       | 18
  M      | Science  | Foreign       | ...       | ...       | 18
  F      | Business | Canada        | ...       | ...       | 20
  M      | Business | Canada        | ...       | ...       | 22
  F      | Science  | Canada        | ...       | ...       | 24
  M      | Eng      | Foreign       | ...       | ...       | 22
  F      | Eng      | Canada        | ...       | ...       | 24

Dom(Major) = {Science, Eng, Business}
Partition the data into S_Sc, S_Eng, S_Bus: the sets of data points whose Major is Science, Eng, and Business, respectively.
Example (cont.)
120 Graduates: Science = 84 (= 16+22+25+21), Eng = 36, Business = 0
130 Undergraduates: Science = 42, Eng = 46, Business = 42

Class information captured from S:

  Ent(S) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988

Expected class information induced by the attribute Major:

  Ent(S_Sc)  = -(84/126) log2(84/126) - (42/126) log2(42/126) = 0.9183
  Ent(S_Eng) = -(36/82) log2(36/82) - (46/82) log2(46/82) = 0.9892
  Ent(S_Bus) = -(0/42) log2(0/42) - (42/42) log2(42/42) = 0   (taking 0 log2 0 = 0)

  I(Major) = (126/250) Ent(S_Sc) + (82/250) Ent(S_Eng) + (42/250) Ent(S_Bus) = 0.7873

Information gain:

  Gain(Major) = Ent(S) - I(Major) = 0.9988 - 0.7873 = 0.2115

Similarly, find Gain(Gender), Gain(Birth_country), Gain(Age_range), and Gain(GPA).
- We can rank the "importance" or degree of "relevance" of attributes by their Gain values
- We can use a threshold to prune out attributes that are less "relevant"
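The entropy and gain computation above is mechanical enough to check in a few lines of Python, using the class counts given in the example:

```python
from math import log2

def ent(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution over S: 120 graduates vs. 130 undergraduates.
ent_s = ent([120, 130])  # ~0.9988

# Partitions of S by Major, as (graduate, undergraduate) counts.
partitions = {"Science": [84, 42], "Eng": [36, 46], "Business": [0, 42]}

n = 250
i_major = sum(sum(c) / n * ent(c) for c in partitions.values())  # ~0.7873
gain_major = ent_s - i_major                                     # ~0.2115

print(round(ent_s, 4), round(i_major, 4))
```

The `if c > 0` guard implements the 0 log2 0 = 0 convention needed for the empty Business class among graduates.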
Class comparison
- Goal: mine properties (or rules) that compare a target class with a contrasting class
- The two classes must be comparable:
  - E.g., address and gender are not comparable
  - store_address and home_address are comparable
  - CS students and Eng students are comparable
- Comparable classes should be generalized to the same conceptual level
- Approaches:
  - Use attribute-oriented induction or a data cube to generalize the data for the two contrasting classes, then compare the results
  - Pattern recognition approach: approximate discriminating rules from a data set, repeatedly fine-tuning until the errors are small enough
Descriptive statistical measures
Data characteristics that can be computed:
- Central tendency:
  - mean: when is "mean" not an appropriate measure?
  - median: for a very large data set, how do we compute the median?
- Dispersion:
  - five-number summary: Min, Quartile1, Median, Quartile3, Max
  - variance, standard deviation: spread about the mean. What does var = 0 mean?
- Outliers:
  - detected by rules of thumb: values falling at least 1.5 × (Q3 - Q1) above Q3 or below Q1
- Useful displays: boxplots, quantile-quantile plot (q-q plot), scatter plot, loess curve
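The five-number summary and the 1.5 × IQR outlier rule of thumb can be sketched with the standard library's quantile function (the sample below is hypothetical, with one value planted to be flagged):

```python
import statistics

# Hypothetical sample; 40 is deliberately far above the rest.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12, 40]

q1, median, q3 = statistics.quantiles(data, n=4)  # Quartile1, Median, Quartile3
summary = (min(data), q1, median, q3, max(data))  # five-number summary

iqr = q3 - q1  # interquartile range, Q3 - Q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(summary)
print(outliers)  # [40]
```

Note that `statistics.quantiles` defaults to the exclusive (n+1-based) method, so the exact quartile values can differ slightly from other conventions; the outlier rule is a rule of thumb either way.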
References
E. Baralis and G. Psaila. Designing templates for mining association rules. Journal of Intelligent
Information Systems, 9:7-32, 1997.
Microsoft Corp., OLEDB for Data Mining, version 1.0, http://www.microsoft.com/data/oledb/dm,
Aug. 2000.
J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane, “DMQL: A Data Mining Query Language
for Relational Databases”, DMKD'96, Montreal, Canada, June 1996.
T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and
Knowledge Discovery, 3:373-408, 1999.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting
rules from large sets of discovered association rules. CIKM’94, Gaithersburg, Maryland, Nov.
1994.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96,
pages 122-133, Bombay, India, Sept. 1996.
A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems.
IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational
database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A
generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.