Course on Data Mining (581550-4)
Data mining: KDD Process
21.11.2001

Course schedule:
  Intro/Ass. Rules   24./26.10.
  Clustering         30.10.
  Episodes           7.11.
  Text Mining        14.11.
  KDD Process        21.11.
  Appl./Summary      28.11.
  Home Exam
Course on Data Mining (581550-4)
Today 22.11.2001
• Today's subject:
o KDD Process
• Next week's program:
o Lecture: Data mining applications, future, summary
o Exercise: KDD Process
o Seminar: KDD Process
KDD process - overview
• Overview
• Preprocessing
• Post-processing
• Summary
What is KDD? A process!
• Aim: the selection and processing
of data for
o the identification of novel,
accurate, and useful patterns,
and
o the modeling of real-world
phenomena
• Data mining is a major
component of the KDD process
Typical KDD process
Operational database -> raw data
  (1) Preprocessing -> target data set: cleaned, verified, focused input data
  (2) Data mining -> evaluation of interestingness
  (3) Postprocessing -> results: selected, usable patterns
  -> utilization of the results
Phases of the KDD process (1)
• Learning the domain
• Creating a target data set
• Preprocessing
  o data cleaning, integration and transformation
  o data reduction and projection
• Choosing the DM task
Phases of the KDD process (2)
• Choosing the DM algorithm(s)
• Data mining: search
• Postprocessing
  o pattern evaluation and interpretation
  o knowledge presentation
• Use of discovered knowledge
Preprocessing - overview
• Why data preprocessing?
• Data cleaning
• Data integration and transformation
• Data reduction
Why data preprocessing?
• Aim: to select the data that is relevant to the mining task at hand
• Data in the real world is dirty
o incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
o noisy: containing errors or outliers
o inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
Measures of data quality
o accuracy
o completeness
o consistency
o timeliness
o believability
o value added
o interpretability
o accessibility
Preprocessing tasks (1)
• Data cleaning
o fill in missing values, smooth
noisy data, identify or remove
outliers, and resolve
inconsistencies
• Data integration
o integration of multiple
databases, files, etc.
• Data transformation
o normalization and aggregation
Preprocessing tasks (2)
• Data reduction (including
discretization)
o obtains a representation that is much
smaller in volume, but produces the
same or similar analytical results
o data discretization is part of
data reduction, but with
particular importance,
especially for numerical data
Preprocessing tasks (3)
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data cleaning tasks
• Fill in missing values
• Identify outliers and
smooth out noisy data
• Correct inconsistent
data
Missing Data
• Data is not always available
• Missing data may be due to
o equipment malfunction
o inconsistent with other recorded data, and
thus deleted
o data not entered due to misunderstanding
o certain data may not be considered
important at the time of entry
o history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Data? (1)
• Ignore the tuple
o usually done when the class label is missing
o not effective when the percentage of missing values
per attribute varies considerably
• Fill in the missing value manually
o tedious + infeasible?
• Use a global constant to fill in the missing value
o e.g., “unknown”, a new class?!
How to Handle Missing Data? (2)
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to
the same class to fill in the missing value
o smarter solution than using the “general” attribute
mean
• Use the most probable value to fill in the missing
value
o inference-based tools such as decision tree
induction or a Bayesian formalism
o regression
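As an illustration of the two mean-based strategies, here is a minimal sketch in Python with pandas (not from the original slides; the column names and values are made up):

    import pandas as pd

    df = pd.DataFrame({
        "income": [3000.0, None, 4500.0, None, 5200.0, 2800.0],
        "class":  ["low", "low", "high", "high", "high", "low"],
    })

    # Strategy: fill with the overall attribute mean
    overall = df["income"].fillna(df["income"].mean())

    # Strategy: fill with the mean of samples belonging to the same class
    by_class = df.groupby("class")["income"].transform(
        lambda s: s.fillna(s.mean()))

    print(overall.tolist())   # missing values become 3875.0
    print(by_class.tolist())  # "low" gap -> 2900.0, "high" gap -> 4850.0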
Noisy Data
• Noise: random error or variance in a
measured variable
• Incorrect attribute values may be due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitation
o inconsistency in naming convention
How to Handle Noisy Data?
• Binning
o smooth a sorted data value by looking at the values
around it
• Clustering
o detect and remove outliers
• Combined computer and human inspection
o detect suspicious values and have a human check them
• Regression
o smooth by fitting the data into regression functions
Binning methods (1)
• Equal-depth (frequency) partitioning
o sort the data and partition it into N intervals
(bins), each containing approximately the
same number of samples
o smooth by bin means, bin median, bin
boundaries, etc.
o good data scaling
o managing categorical attributes can be
tricky
Binning methods (2)
• Equal-width (distance) partitioning
o divide the range into N intervals of equal
size: uniform grid
o if A and B are the lowest and highest
values of the attribute, the width of
intervals will be: W = (B-A)/N.
o the most straightforward
o outliers may dominate presentation
o skewed data is not handled well
Equal-depth binning - Example
• Sorted data for price (in dollars):
o 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equal-depth) bins:
o Bin 1: 4, 8, 9, 15
o Bin 2: 21, 21, 24, 25
o Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  o Bin 1: 9, 9, 9, 9
  o Bin 2: 23, 23, 23, 23
  o Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  o Bin 1: 4, 4, 4, 15
  o Bin 2: 21, 21, 25, 25
  o Bin 3: 26, 26, 26, 34
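The same computation as a sketch in plain Python (rounding the bin means to integers is an assumption made to match the slide):

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

    # Equal-depth: 3 bins, 4 values each (data is already sorted)
    bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

    # Smoothing by bin means: each value is replaced by its bin's mean
    means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: each value moves to the nearer boundary
    bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
              for b in bins]

    # Equal-width partitioning for comparison: W = (B - A) / N
    A, B, N = min(prices), max(prices), 3
    W = (B - A) / N                                  # = 10.0 for this data
    labels = [min(int((v - A) // W), N - 1) for v in prices]

    print(means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
    print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]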
Data Integration (1)
• Data integration
o combines data from multiple
sources into a coherent store
• Schema integration
o integrate metadata from different
sources
o entity identification problem:
identify real world entities from
multiple data sources, e.g.,
A.cust-id ≡ B.cust-#
Data Integration (2)
• Detecting and resolving data value
conflicts
o for the same real world entity,
attribute values from different
sources are different
o possible reasons: different
representations, different scales,
e.g., metric vs. British units
Handling Redundant Data
• Redundant data often occur when multiple databases
are integrated
o the same attribute may have different names in different
databases
o one attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant data may be detected by correlation
analysis
• Careful integration of data from multiple sources may
o help to reduce/avoid redundancies and inconsistencies
o improve mining speed and quality
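A minimal sketch of the correlation check (Python with numpy; the attributes and the 0.95 cut-off are illustrative assumptions):

    import numpy as np

    monthly_revenue = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
    annual_revenue = monthly_revenue * 12   # a "derived" attribute

    # Pearson correlation between the two attributes
    r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
    if abs(r) > 0.95:
        print(f"correlation {r:.2f}: one attribute is likely redundant")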
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range, e.g.,
  o min-max normalization
  o normalization by decimal scaling
• Attribute/feature construction
  o new attributes constructed from the given ones
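The two normalization variants as a small sketch in plain Python (the example values are made up):

    def min_max(values, new_min=0.0, new_max=1.0):
        # Scale linearly so min(values) -> new_min and max(values) -> new_max
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
                for v in values]

    def decimal_scaling(values):
        # Divide by the smallest power of 10 that maps all values into [-1, 1]
        j = len(str(int(max(abs(v) for v in values))))
        return [v / 10 ** j for v in values]

    print(min_max([200, 400, 600, 1000]))    # [0.0, 0.25, 0.5, 1.0]
    print(decimal_scaling([-986, 217, 92]))  # [-0.986, 0.217, 0.092]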
Data Reduction
• Data reduction
o obtains a reduced representation of the data set that is
much smaller in volume
o produces the same (or almost the same) analytical
results as the original data
• Data reduction strategies
o dimensionality reduction
o numerosity reduction
o discretization and concept hierarchy generation
Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
o select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features
o reduces the number of attributes appearing in the
discovered patterns, making them easier to understand
• Heuristic methods (due to exponential # of choices):
o step-wise forward selection
o step-wise backward elimination
o combining forward selection and backward elimination
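A sketch of step-wise forward selection (the greedy loop is the point; the scoring function below is a made-up stand-in for a real class-separation measure):

    def forward_selection(attributes, score, k):
        # Greedily add the attribute that most improves the subset score
        selected = []
        while len(selected) < k:
            best = max((a for a in attributes if a not in selected),
                       key=lambda a: score(selected + [a]))
            selected.append(best)
        return selected

    # Toy score: how many truly informative attributes the subset contains
    informative = {"A1", "A4", "A6"}
    attrs = ["A1", "A2", "A3", "A4", "A5", "A6"]
    print(forward_selection(attrs, lambda s: len(informative & set(s)), 3))
    # -> ['A1', 'A4', 'A6']

Step-wise backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.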
Dimensionality Reduction Example
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Decision tree induced from the data: the root tests A4, its subtrees test A6 and A1, and the leaves are Class 1 and Class 2]
=> Reduced attribute set: {A1, A4, A6}
Numerosity Reduction
• Parametric methods
o assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
o e.g., regression analysis, log-linear models
• Non-parametric methods
o do not assume models
o e.g., histograms, clustering, sampling
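A sketch of the parametric idea in Python with numpy (synthetic data): fit a regression line and keep only its two parameters instead of all the points.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(100.0)
    y = 3.0 * x + 5.0 + rng.normal(0.0, 2.0, size=100)

    # Store only the model parameters (a, b) of y = a*x + b
    a, b = np.polyfit(x, y, deg=1)
    print(f"2 parameters instead of 100 values: a={a:.2f}, b={b:.2f}")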
Discretization
• Reduce the number of values for a
given continuous attribute by
dividing the range of the attribute into
intervals
• Interval labels can then be used to
replace actual data values
• Some classification algorithms only
accept categorical attributes
Concept Hierarchies
• Reduce the data by collecting low-level
concepts and replacing them with
higher-level concepts
• For example, replace numeric
values for the attribute age by
more general values young,
middle-aged, or senior
Discretization and concept hierarchy generation for numeric data
• Binning
• Histogram analysis
• Clustering analysis
• Entropy-based discretization
• Segmentation by natural partitioning
Concept hierarchy generation
for categorical data
• Specification of a partial ordering of
attributes explicitly at the schema
level by users or experts
• Specification of a portion of a
hierarchy by explicit data grouping
• Specification of a set of attributes,
but not of their partial ordering
• Specification of only a partial set of
attributes
Specification of
a set of attributes
• Concept hierarchy can be automatically generated based on the
number of distinct values per attribute in the given attribute set. The
attribute with the most distinct values is placed at the lowest level of
the hierarchy.
  country: 15 distinct values
  province_or_state: 65 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values
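A minimal sketch of that heuristic in plain Python (the counts are taken from the slide):

    distinct = {"street": 674339, "country": 15,
                "city": 3567, "province_or_state": 65}

    # Fewest distinct values -> highest level of the hierarchy
    hierarchy = sorted(distinct, key=distinct.get)
    print(" -> ".join(hierarchy))
    # country -> province_or_state -> city -> street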
Post-processing - overview
• Why data post-processing?
• Interestingness
• Visualization
• Utilization
Why data post-processing? (1)
• Aim: to show the results, or more precisely the most
interesting findings, of the data mining phase to a
user/users in an understandable way
• A possible post-processing methodology:
o find all potentially interesting patterns according to
some rather loose criteria
o provide flexible methods for iteratively and
interactively creating different views of the
discovered patterns
• Other more restrictive or focused methodologies
possible as well
Why data post-processing? (2)
• A post-processing methodology is useful, if
o the desired focus is not known in advance (the
search process cannot be optimized to look only for
the interesting patterns)
o there is an algorithm that can produce all patterns
from a class of potentially interesting patterns (the
result is complete)
o the time required to discover all potentially
interesting patterns is not considerably longer than
it would be if the discovery were focused on a small
subset of potentially interesting patterns
Are all the discovered patterns
interesting?
• A data mining system/query may
generate thousands of patterns,
but are they all interesting?
Usually NOT!
• How could we then choose the
interesting patterns?
=> Interestingness
Interestingness criteria (1)
• Some possible criteria for interestingness:
o evidence: statistical significance of
finding?
o redundancy: similarity between findings?
o usefulness: meeting the user's
needs/goals?
o novelty: is the finding already part of prior knowledge?
o simplicity: syntactical complexity?
o generality: how many examples covered?
Interestingness criteria (2)
• One division of interestingness criteria:
o objective measures that are based on statistics and
structures of patterns, e.g.,
  - J-measure: statistical significance
  - certainty factor: support or frequency
  - strength: confidence
o subjective measures that are based on the user's beliefs
about the data, e.g.,
  - unexpectedness: "is the found pattern surprising?"
  - actionability: "can I do something with it?"
Criticism: Support & Confidence
• Example: (Aggarwal & Yu, PODS98)
o among 5000 students
  - 3000 play basketball, 3750 eat cereal
  - 2000 both play basketball and eat cereal
o the rule "play basketball => eat cereal" [40%, 66.7%]
is misleading, because the overall percentage of students
eating cereal is 75%, which is higher than 66.7%
o the rule "play basketball => not eat cereal" [20%,
33.3%] is far more accurate, although it has lower
support and confidence
Interest
• Yet another objective measure for interestingness is
interest, defined as

    interest(A => B) = P(A & B) / (P(A) * P(B))

• Properties of this measure:
  o takes both P(A) and P(B) into consideration
  o P(A & B) = P(A) * P(B), i.e., interest = 1, if A and B are independent events
  o A and B are negatively correlated if the value is less than 1,
    and positively correlated otherwise
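Plugging in the basketball/cereal numbers from the earlier slide (plain Python):

    n = 5000
    p_basket = 3000 / n            # P(A) = 0.60
    p_cereal = 3750 / n            # P(B) = 0.75
    p_both = 2000 / n              # P(A & B) = 0.40

    interest = p_both / (p_basket * p_cereal)
    print(round(interest, 2))      # 0.89 < 1: negatively correlated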
J-measure
• The J-measure is another objective measure for interestingness,
defined for a rule A => B as

    J(A => B) = P(A) * [ P(B|A) * log( P(B|A) / P(B) )
                       + (1 - P(B|A)) * log( (1 - P(B|A)) / (1 - P(B)) ) ]

where P(A) and P(B) are the frequencies of the left- and right-hand
sides and P(B|A) is the confidence of the rule
• Properties of the J-measure:
  o again, takes both P(A) and P(B) into consideration
  o value is always between 0 and 1
  o can be computed using pre-calculated values
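A sketch of the measure as a plain Python function (using a base-2 logarithm, which is an assumption), evaluated on the basketball/cereal rule:

    from math import log2

    def j_measure(p_a, p_b, conf):
        # J(A => B) with conf = P(B|A); the guards handle conf = 0 or 1
        term1 = conf * log2(conf / p_b) if conf > 0 else 0.0
        term2 = ((1 - conf) * log2((1 - conf) / (1 - p_b))
                 if conf < 1 else 0.0)
        return p_a * (term1 + term2)

    # play basketball => eat cereal: P(A)=0.6, P(B)=0.75, conf=2000/3000
    print(round(j_measure(0.6, 0.75, 2 / 3), 4))   # ~0.015, a weak rule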
Support/Frequency/J-measure

[Chart: number of rules (0-3000) as a function of the support/frequency/J-measure threshold (0-1), plotted for six datasets]
Confidence

[Chart: number of rules (0-3000) as a function of the confidence threshold (0-1), plotted for six datasets]
Example – Selection of
Interesting Association Rules
• For reducing the number of
association rules that have to be
considered, we could, for example,
use one of the following selection
criteria:
o frequency and confidence
o J-measure or interest
o maximum rule size (whole rule,
left-hand side, right-hand side)
o rule attributes (e.g., templates)
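A toy sketch of such filtering in plain Python (the rule records and thresholds are made up):

    rules = [
        {"lhs": ("basketball",), "rhs": ("cereal",),
         "freq": 0.40, "conf": 0.67},
        {"lhs": ("basketball", "soda"), "rhs": ("chips",),
         "freq": 0.05, "conf": 0.90},
    ]

    selected = [r for r in rules
                if r["freq"] >= 0.10 and r["conf"] >= 0.60  # freq & conf
                and len(r["lhs"]) + len(r["rhs"]) <= 2]     # max rule size
    print(selected)   # only the first rule survives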
Example –
Problems with selection of rules
• A rule can correspond to prior knowledge or
expectations
o how to encode the background knowledge into the
system?
• A rule can refer to uninteresting attributes or
attribute combinations
o could this be avoided by enhancing the
preprocessing phase?
• Rules can be redundant
o redundancy elimination by rule covers etc.
Interpretation and evaluation of
the results of data mining
• Evaluation
o statistical validation and significance testing
o qualitative review by experts in the field
o pilot surveys to evaluate model accuracy
• Interpretation
o tree and rule models can be read directly
o clustering results can be graphed and tabulated
o code can be automatically generated by
some systems
Visualization of
Discovered Patterns (1)
• In some cases, visualization of the
results of data mining (rules, clusters,
networks…) can be very helpful
• Visualization is already important in the
preprocessing phase, for selecting the
appropriate data and for inspecting it
• Visualization requires training and
practice
Visualization of
Discovered Patterns (2)
• Different backgrounds/usages may require different
forms of representation
o e.g., rules, tables, cross-tabulations, or pie/bar chart
• Concept hierarchy is also important
o discovered knowledge might be more understandable
when represented at a high level of abstraction
o interactive drill-up/down, pivoting, slicing and dicing
provide different perspectives on the data
• Different kinds of knowledge require different kinds of
representation
o association, classification, clustering, etc.
Visualization

[Two slides of example visualizations]
Utilization of the results
From bottom to top, with increasing potential to support business decisions:
o Data sources: paper, files, information providers, database systems, OLTP (DBA)
o Data warehouses / data marts: OLAP, MDA (DBA)
o Data exploration: statistical analysis, querying and reporting (Data Analyst)
o Data mining: information discovery (Data Analyst)
o Data presentation: visualization techniques (Business Analyst)
o Making decisions (End User)
Summary
• Data mining: semi-automatic
discovery of interesting
patterns from large data sets
• Knowledge discovery is a
process:
o preprocessing
o data mining
o post-processing
o using and utilizing the
knowledge
Summary
• Preprocessing is important in
order to get useful results!
• If a loosely defined mining methodology is used,
post-processing is needed in order to find the
interesting results!
• Visualization is useful in pre- and post-processing!
• One has to be able to utilize
the found knowledge!
References – KDD Process
• P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
• R.J. Brachman and T. Anand. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
• M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
• Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
• D. Keim. Visual techniques for exploring databases. Tutorial notes, KDD'97, Newport Beach, CA, USA, 1997.
• D. Keim. Visual data mining. Tutorial notes, VLDB'97, Athens, Greece, 1997.
• D. Keim and H.-P. Kriegel. Visual techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.
References – KDD Process
• W. Kloesgen. Explora: A multipattern and multistrategy discovery assistant. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 249-271. AAAI/MIT Press, 1996.
• M. Klemettinen. A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A-1999-1, 1999.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
• G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
• T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
• A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.
References – KDD Process
• Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
• R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.
Reminder: Course Organization
Course Evaluation
• Passing the course: min 30 points
  o home exam: min 13 points (max 30 points)
  o exercises/experiments: min 8 points (max 20 points)
    - at least 3 returned and reported experiments
  o group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
  o attending the lectures (5/7)
  o attending the seminars (4/5)
  o attending the exercises (4/5)
Seminar Presentations/Groups 9-10
Visualization and data mining
D. Keim, H.-P. Kriegel, T. Seidl: "Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE'94.
Seminar Presentations/Groups 9-10
Interestingness
G. Piatetsky-Shapiro, C.J. Matheus: "The Interestingness of Deviations", KDD'94.
KDD process
Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides, which greatly helped in preparing this lecture!
Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.