2011 Data Mining
Industrial & Information Systems Engineering
Chapter 2: Overview of Data Mining Process
•Pilsung Kang
•Industrial & Information Systems Engineering
•Seoul National University of Science & Technology
2011 Data Mining, IISE, SNUT
Data Mining Definition Revisited
▪ Extracting useful information from large datasets. (Hand et al., 2001)
▪ Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. (Berry and Linoff, 1997, 2000)
▪ Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. (Gartner Group, 2004)

Descriptive vs. Predictive (purpose)
Descriptive Modeling
▪ Look back to the past.
▪ To extract compact and easily understood information from large, sometimes gigantic, databases.
▪ OLAP (online analytical processing), SQL (structured query language).
Predictive Modeling
▪ Predict the future.
▪ Identify strong links between variables of data.
▪ To predict the unknown consequence (dependent variable) based on the information provided (independent variables).
▪ y = f(x1, x2, ..., xn) + ε

Supervised vs. Unsupervised (methods)
Supervised Learning
▪ Goal: predict a single “target” or “outcome” variable.
▪ Finds relations between X and Y.
▪ Train (learn) on data where the target value is known.
▪ Score data where the target value is not known.
Unsupervised Learning
▪ Explores intrinsic characteristics.
▪ Estimates the underlying distribution.
▪ Segments data into meaningful groups or detects patterns.
▪ There is no target (outcome) variable to predict or classify.

Data Mining Techniques
1. Data Visualization
▪ Graphs and plots of data.
▪ Histograms, boxplots, bar charts, scatterplots.
▪ Especially useful to examine relationships between pairs of variables.
▪ Descriptive & Unsupervised

Data Mining Techniques
2. Data Reduction
▪ Distillation of complex/large data into simpler/smaller data.
  ✓ Reducing the number of variables/columns.
    ➢ Also called dimensionality reduction (variable selection, variable extraction, e.g., principal component analysis).
  ✓ Reducing the number of records/rows.
    ➢ Also called data compression (e.g., sampling and clustering).
▪ Descriptive & Unsupervised
Data Visualization + Data Reduction = Data Exploration

Data Mining Techniques
3. Segmentation/Clustering
▪ Goal: divide the entire data into a small number of subgroups.
▪ Homogeneous within groups while heterogeneous between groups.
▪ Examples: Market segmentation, social network analysis.
▪ Descriptive & Unsupervised

Data Mining Techniques
3. Segmentation/Clustering example: hierarchical clustering

Data Mining Techniques
4. Classification
▪ Goal: predict a categorical target (outcome) variable.
▪ Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy.
▪ Each row is a case/record/instance.
▪ Each column is a variable.
▪ Target variable is often binary (yes/no).
▪ Predictive & Supervised

Data Mining Techniques
4. Classification Example: Decision Tree

Data Mining Techniques
4. Classification Example: Logistic Regression
[Figure: plot with x-axis from -5 to 5 and y-axis from 0 to 1]
▪ Play if 1/(1+exp(-0.2*outlook+0.4*humidity+0.8*windy)) > 0.5
▪ Else, do not play

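As a minimal sketch, the play/do-not-play rule on this slide can be written in Python. The coefficients are copied from the rule as printed; the slide does not say how outlook, humidity, and windy are encoded as numbers, so the input values below are hypothetical.

```python
import math


def play_probability(outlook: float, humidity: float, windy: float) -> float:
    """Score from the slide's rule: 1 / (1 + exp(-0.2*outlook + 0.4*humidity + 0.8*windy))."""
    exponent = -0.2 * outlook + 0.4 * humidity + 0.8 * windy
    return 1.0 / (1.0 + math.exp(exponent))


def decide(outlook: float, humidity: float, windy: float, cutoff: float = 0.5) -> str:
    """Return "play" when the score exceeds the cutoff, otherwise "do not play"."""
    score = play_probability(outlook, humidity, windy)
    return "play" if score > cutoff else "do not play"


# Hypothetical encodings: the slide does not define how the inputs are coded.
print(decide(outlook=1, humidity=0, windy=0))  # exponent = -0.2, so the score is above 0.5
print(decide(outlook=0, humidity=1, windy=1))  # exponent = 1.2, so the score is below 0.5
```

With a cutoff of 0.5, the rule reduces to checking the sign of the exponent: the score exceeds 0.5 exactly when the exponent is negative.
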
Data Mining Techniques
4. Classification Examples
“Separate the riding mower buyers (●) from non-buyers (○)”
(x-axis: income (x $1,000), y-axis: lot size (x 1,000 sqft))

Data Mining Techniques
5. Prediction
▪ Goal: predict a numerical target (outcome) variable.
▪ Examples: sales, revenue, performance.
▪ As in classification:
  ✓ Each row is a case/record/instance.
  ✓ Each column is a variable.
▪ Taken together, classification and prediction constitute “predictive analytics”.
▪ Predictive & Supervised

Data Mining Techniques
5. Prediction Example: Neural Networks

Data Mining Techniques
6. Association Rule
▪ Goal: produce rules that define “what goes with what”.
▪ Example: “If X was purchased, Y was also purchased”.
▪ Rows are transactions.
▪ Used in recommender systems: “Our records show you bought X, you may also like Y”.
▪ Also called “affinity analysis” or “market basket analysis”.
▪ Predictive & Unsupervised

Data Mining Techniques
6. Association Rule Example: Market Basket Analysis
[Figure: examples from Walmart (USA) and E-Mart (Korea)]

Data Mining Techniques
7. Novelty Detection
▪ Goal: identify whether a new case is similar to the given ‘normal’ cases.
▪ Examples: medical diagnosis, fault detection, identity verification.
▪ Each row is a case/record/instance.
▪ Each column is a variable.
▪ No explicit target variable, but it is assumed that all records have the same target.
▪ Also called “outlier detection” or “one-class classification”.
▪ Predictive & Unsupervised

Data Mining Techniques
7. Novelty Detection Example: Keystroke Dynamics-based User Authentication
http://ksd.snu.ac.kr

Data Mining Techniques
| | Descriptive Modeling | Predictive Modeling |
| Supervised Learning | … | Classification; Prediction |
| Unsupervised Learning | Data Visualization; Data Reduction; Segmentation/clustering | Association Rules; Novelty Detection |

Steps in Data Mining
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

Steps in Data Mining
1. Define and understand the purpose of the data mining project
▪ Why do we have to conduct this project?
(Jun, 2010: http://www.kdnuggets.com)
▪ What would be achieved if the project succeeds?

Steps in Data Mining
2. Formulate the data mining problem
▪ What is the purpose?
  ✓ Increase sales.
  ✓ Detect cancer patients.
▪ What data mining task is appropriate?
  ✓ Classification.
  ✓ Prediction.
  ✓ Association rules, …

Steps in Data Mining
3. Obtain/verify/modify the data: Data acquisition
▪ Data source
  ✓ Data warehouse,
  ✓ Data mart, …
▪ Define input variables and target variable if necessary
  ✓ Ex: Churn prediction for credit card service
    • Inputs: age, sex, tenure, amount of spending, risk grade, …
    • Target: whether he/she leaves the company.

Steps in Data Mining
3. Obtain/verify/modify the data: Outlier detection
▪ Outlier
  ✓ “A value that the variable cannot have” or “an extremely rare value” (ex: age 990, height -150cm, …)
  ✓ Real databases contain a number of outliers, for many reasons.
▪ How to deal with outliers?
  ✓ Ignore the records with outliers if the total number of records is sufficient.
  ✓ Replace with another value (mean, median, estimate from a certain pdf, etc.) if the total number of records is insufficient.

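The two ways of handling outliers listed on this slide can be sketched as follows. This is only an illustration: the valid range of 0 to 120 for age is an assumption made here, and the median of the valid values stands in for any of the replacement choices the slide mentions.

```python
import statistics


def drop_outliers(values, lower, upper):
    """Ignore outliers: keep only the values inside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def replace_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with the median of the valid values."""
    valid = drop_outliers(values, lower, upper)
    if not valid:
        raise ValueError("no valid values to estimate a replacement from")
    median = statistics.median(valid)
    return [v if lower <= v <= upper else median for v in values]


ages = [23, 35, 990, 41, 29]           # 990 is the impossible age from the slide
print(drop_outliers(ages, 0, 120))     # [23, 35, 41, 29]
print(replace_outliers(ages, 0, 120))  # [23, 35, 32.0, 41, 29]
```

A range check only catches values the variable cannot have; "extremely rare" values would need a statistical rule, which is not shown here.
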
Steps in Data Mining
3. Obtain/verify/modify the data: Missing Value Imputation
▪ Missing value
  ✓ A variable is missing when it has a null value in the database although it should have a certain real value.
  ✓ Causes: operational errors, human errors.
▪ How to deal with missing values?
  ✓ Ignore the records with missing values if the total number of records is sufficient.
  ✓ Replace with another value (mean, median, estimate from a certain pdf, etc.) if the total number of records is insufficient.

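A small sketch of the two options for missing values, assuming records are stored as dictionaries and a missing value is represented by None. The field names and the sample records are hypothetical.

```python
import statistics


def drop_missing(records):
    """Ignore incomplete records: keep only those with no missing (None) field."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute_mean(records, field):
    """Fill missing values of one numeric field with the mean of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = statistics.mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


# Hypothetical customer records; the second one has a missing age.
customers = [
    {"age": 30, "tenure": 5},
    {"age": None, "tenure": 2},
    {"age": 50, "tenure": 8},
]
print(len(drop_missing(customers)))
print(impute_mean(customers, "age")[1])
```

Dropping records is simplest when data is plentiful, while imputation keeps the other fields of the incomplete record available for modeling.
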
Steps in Data Mining
3. Obtain/verify/modify the data: Variable handling
▪ Type of variables
  ✓ Binary: 0/1 (ex: benign/malignant in medical diagnosis).
  ✓ Categorical: more than two values, ordered (high, middle, low) or not ordered (ex: color, job).
  ✓ Ordinal: continuous, differences between two consecutive values are not identical (ex: rank of the final exam).
  ✓ Interval: continuous, differences between two consecutive values are identical (ex: age, height, weight).

Steps in Data Mining
3. Obtain/verify/modify the data: Variable handling
▪ Variable transformation
  ✓ Binning:
    • interval → binary or ordered categorical (e.g., Low / Mid / High).
  ✓ 1-of-C coding:
    • unordered categorical → binary.
    “Color: yellow, red, blue, green”
| Color | d1 | d2 | d3 |
| yellow | 1 | 0 | 0 |
| red | 0 | 1 | 0 |
| blue | 0 | 0 | 1 |
| green | 0 | 0 | 0 |

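Both transformations can be sketched in a few lines of Python. The coding follows the color table on this slide, where four colors are represented by three binary variables and green is the all-zero case; the cut points used for binning are hypothetical.

```python
def bin_value(x, low_cut, high_cut):
    """Binning: map a numeric value to an ordered category."""
    if x < low_cut:
        return "Low"
    if x < high_cut:
        return "Mid"
    return "High"


def one_of_c(value, categories):
    """1-of-C coding with the last category as the all-zero reference level."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[:-1]]


colors = ["yellow", "red", "blue", "green"]
for color in colors:
    print(color, one_of_c(color, colors))  # green prints [0, 0, 0]

# Hypothetical cut points for a numeric variable.
print(bin_value(25, low_cut=30, high_cut=60))  # Low
```

Using C-1 indicator variables for C categories avoids a redundant column, since the last category is implied when all the others are 0.
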
Steps in Data Mining
4. Explore and customize the data: Data Visualization
▪ Single variable
  ✓ Histogram:
    • shows the distribution of a single variable.
    • possible to check the normality.
  ✓ Box plot
[Figure: histogram (x-axis: MEDV, y-axis: Frequency) and box plot (labels: “max”, quartile 3, mean, median, quartile 1, “min”, outliers)]

Steps in Data Mining
4. Explore and customize the data: Data Visualization
▪ Multiple variables
  ✓ Correlation table:
    • Indicates which variables are highly (positively or negatively) correlated.
    • Helps to remove irrelevant variables or select representative variables.
| | CRIM | ZN | INDUS | CHAS | NOX | RM |
| CRIM | 1 | | | | | |
| ZN | -0.20047 | 1 | | | | |
| INDUS | 0.406583 | -0.53383 | 1 | | | |
| CHAS | -0.05589 | -0.0427 | 0.062938 | 1 | | |
| NOX | 0.420972 | -0.5166 | 0.763651 | 0.091203 | 1 | |
| RM | -0.21925 | 0.311991 | -0.39168 | 0.091251 | -0.30219 | 1 |

Steps in Data Mining
4. Explore and customize the data: Data Visualization
▪ Multiple variables
  ✓ Scatter plot matrix:
    • Shows the relation between each pair of variables.
[Figure: scatter plot matrix of Var. 1, Var. 2, Var. 3, and Var. 4]

Steps in Data Mining
4. Explore and customize the data: Dimensionality Reduction
“If there are various logical ways to explain a certain phenomenon, the simplest is the best” - Occam’s Razor
▪ Curse of dimensionality
  ✓ The number of records needed to sustain the same explanatory power increases exponentially as the number of variables increases.
  ✓ 2^1 = 2, 2^2 = 4, 2^3 = 8

Steps in Data Mining
4. Explore and customize the data: Dimensionality Reduction
▪ Variable reduction
  ✓ Select a small set of relevant variables.
  ✓ Correlation analysis, Kolmogorov-Smirnov test, …
| | V1 | V2 | V3 | V4 | V5 | V6 |
| V1 | 1 | 0.9 | -0.8 | 0.1 | 0.2 | 0 |
| V2 | | 1 | -0.7 | 0.2 | 0.1 | 0.1 |
| V3 | | | 1 | -0.1 | 0.1 | -0.1 |
| V4 | | | | 1 | 0.9 | 0.3 |
| V5 | | | | | 1 | -0.9 |
| V6 | | | | | | 1 |
Select V1 & V4

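One possible way to arrive at a selection like "V1 & V4" from the slide's correlation table is sketched below. The slide does not state the procedure, so both the grouping rule (variables linked by an absolute correlation of at least 0.7 fall into the same group, and the first variable of each group is kept) and the 0.7 threshold are assumptions made for illustration.

```python
# Upper-triangular correlations copied from the slide's example table.
names = ["V1", "V2", "V3", "V4", "V5", "V6"]
upper = [
    [1, 0.9, -0.8, 0.1, 0.2, 0.0],
    [1, -0.7, 0.2, 0.1, 0.1],
    [1, -0.1, 0.1, -0.1],
    [1, 0.9, 0.3],
    [1, -0.9],
    [1],
]


def corr(i, j):
    """Look up the correlation between variables i and j (the table is symmetric)."""
    i, j = min(i, j), max(i, j)
    return upper[i][j - i]


def select_representatives(n, threshold=0.7):
    """Group variables linked by |correlation| >= threshold; keep the first of each group."""
    parent = list(range(n))  # parent[i] == i marks a group representative

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr(i, j)) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return sorted({find(i) for i in range(n)})


selected = [names[i] for i in select_representatives(len(names))]
print(selected)  # ['V1', 'V4']
```

Here V6 joins the V4 group through its strong correlation with V5, even though its direct correlation with V4 is only 0.3.
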
Steps in Data Mining
4. Explore and customize the data: Dimensionality Reduction
▪ Variable extraction
  ✓ Construct a new variable that contains more condensed information than the original variables.
  ✓ Principal component analysis (PCA), …
▪ Example:
  ✓ Original variables:
    • Age, sex, height, weight
    • Income, property, tax paid
  ✓ Constructed variables:
    • Var1: age + 3*I(sex = female) + 0.2*height - 0.3*weight
    • Var2: income + 0.1*property + 2*tax paid

Steps in Data Mining
4. Explore and customize the data: Instance Reduction
▪ Random sampling
  ✓ Select a small set of records with a uniform sampling rate.
  ✓ In classification, class ratios are (approximately) preserved.
▪ Stratified sampling
  ✓ Select a set of records such that rare events have a higher probability of being selected.
  ✓ In classification, class ratios are modified.
    • Under-sampling: preserve minority, reduce majority.
    • Over-sampling: preserve majority, increase minority.

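Under-sampling and over-sampling can be sketched with Python's standard random module. Balancing the two classes to exactly equal sizes is a choice made for this illustration; the slide only says that the class ratios are modified. The data set is hypothetical.

```python
import random


def under_sample(records, label_of, minority_label, seed=0):
    """Keep every minority record and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    minority = [r for r in records if label_of(r) == minority_label]
    majority = [r for r in records if label_of(r) != minority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return minority + kept


def over_sample(records, label_of, minority_label, seed=0):
    """Keep every record and re-draw minority records until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [r for r in records if label_of(r) == minority_label]
    majority = [r for r in records if label_of(r) != minority_label]
    if not minority:
        return list(records)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def label(record):
    """Each hypothetical record is an (id, class label) pair."""
    return record[1]


# Hypothetical data: 5 positive (label 1) and 95 negative (label 0) records.
data = [(i, 1) for i in range(5)] + [(i, 0) for i in range(5, 100)]

print(len(under_sample(data, label, minority_label=1)))  # 10 records: 5 + 5
print(len(over_sample(data, label, minority_label=1)))   # 190 records: 95 + 95
```

Under-sampling discards majority records, while over-sampling duplicates minority records, so the two trade information loss against repeated cases.
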
Steps in Data Mining
4. Explore and customize the data: Data separation
▪ Over-fitting
  ✓ Occurs when data mining algorithms ‘memorize’ the given data, including unnecessary details (noise, outliers, etc.).
[Figure: two plots, each with x-axis from 0 to 10 and y-axis from 0 to 10]

Steps in Data Mining
4. Explore and customize the data: Data partition
▪ Training Data
  ✓ Used to build a model or train a data mining algorithm.
▪ Validation Data
  ✓ Used to select the best parameters for the model.
▪ Test Data
  ✓ Used to select the best model among the algorithms considered.
[Figure: Algorithms A-1, A-2, A-3, B-1, B-2, B-3 listed under each of Training Data, Validation Data, and Test Data]

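A random three-way partition can be sketched as below. The 60/20/20 proportions and the fixed seed are assumptions for illustration; the slide does not prescribe partition sizes.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))  # 60 20 20
```

Shuffling before splitting keeps any ordering in the source data (for example, by date or by class) from leaking into one partition.
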
Steps in Data Mining
4. Explore and customize the data: Data normalization
▪ Normalization (Standardization)
  ✓ Eliminate the effect caused by different measurement scales or units.
  ✓ z-score: (value - mean) / (standard deviation).
Original data
| Id | Age | Income |
| 1 | 25 | 1,000,000 |
| 2 | 35 | 2,000,000 |
| 3 | 45 | 3,000,000 |
| … | … | … |
| Mean | 35 | 2,000,000 |
| Stdev | 5 | 1,000,000 |
Normalized data
| Id | Age | Income |
| 1 | -2 | -1 |
| 2 | 0 | 0 |
| 3 | 2 | 1 |
| … | … | … |
| Mean | 0 | 0 |
| Stdev | 1 | 1 |

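The z-score on this slide can be checked with a short sketch. The first function uses the mean and standard deviation stated in the slide's table; the second estimates them from the data, and its use of the population standard deviation is an assumption, since the slide does not say which variant is meant.

```python
import statistics


def z_score(value, mean, stdev):
    """z-score: (value - mean) / (standard deviation)."""
    return (value - mean) / stdev


def standardize(values):
    """Standardize a whole column using its own mean and population standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [z_score(v, mean, stdev) for v in values]


# Mean and standard deviation as given in the slide's table.
print([z_score(age, 35, 5) for age in [25, 35, 45]])  # [-2.0, 0.0, 2.0]
print([z_score(x, 2_000_000, 1_000_000) for x in [1_000_000, 2_000_000, 3_000_000]])  # [-1.0, 0.0, 1.0]
```

After standardization, age and income are on the same scale, so neither dominates a distance-based method merely because of its unit.
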
Steps in Data Mining
5. Build data mining models
▪ Data mining algorithm
  ✓ Classification
    • Logistic regression, k-nearest neighbor, naïve Bayes, classification trees, neural networks, linear discriminant analysis.
  ✓ Prediction
    • Linear regression, k-nearest neighbor, regression trees, neural networks.
  ✓ Association rules: Apriori algorithm.
  ✓ Clustering: Hierarchical clustering, K-Means clustering.

Steps in Data Mining
6. Evaluate and interpret the results
▪ Classification performance
  ✓ Confusion matrix
| | Predicted 1(+) | Predicted 0(-) |
| Actual 1(+) | True positive (A) | False negative, Type II error (B) |
| Actual 0(-) | False positive, Type I error (C) | True negative (D) |
  ✓ Sensitivity: A/(A+B); specificity: D/(C+D)
  ✓ Simple accuracy: (A+D)/(A+B+C+D)
  ✓ Balanced correction rate: $\sqrt{\frac{A}{A+B} \cdot \frac{D}{C+D}}$
  ✓ Lift charts, receiver operating characteristic (ROC) curve, etc.

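The classification measures can be computed directly from the four cells of the confusion matrix. This sketch takes accuracy as the share of correct predictions, (A+D)/(A+B+C+D), and assumes the balanced correction rate is the geometric mean of sensitivity and specificity. The counts are hypothetical.

```python
import math


def classification_metrics(a, b, c, d):
    """a = true positives, b = false negatives, c = false positives, d = true negatives."""
    sensitivity = a / (a + b)              # share of actual positives predicted positive
    specificity = d / (c + d)              # share of actual negatives predicted negative
    accuracy = (a + d) / (a + b + c + d)   # share of all records predicted correctly
    bcr = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "bcr": bcr,
    }


# Hypothetical counts: 50 actual positives and 50 actual negatives.
metrics = classification_metrics(a=40, b=10, c=5, d=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike simple accuracy, the balanced correction rate drops to 0 when a model predicts only one class, which makes it more informative for imbalanced data.
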
Steps in Data Mining
6. Evaluate and interpret the results
▪ Prediction performance
  ✓ y: actual target value, y’: predicted target value
    • Mean squared error, Root mean squared error
      $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2$
      $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2}$
    • Mean absolute error
      $MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - y'_i\right|$
    • Mean absolute percentage error
      $MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - y'_i}{y_i}\right|$
[Figure: plot with x-axis from 0 to 10 and y-axis from 0 to 10]

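The four error measures can be sketched as plain Python functions. MAPE is returned as a fraction rather than a percentage, and it is undefined when an actual value is 0; both points are conventions chosen here. The sample values are hypothetical.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; undefined if any actual value is 0."""
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted target values.
y_actual = [2.0, 4.0, 6.0]
y_predicted = [1.0, 4.0, 8.0]
print(mse(y_actual, y_predicted), rmse(y_actual, y_predicted))
print(mae(y_actual, y_predicted), mape(y_actual, y_predicted))
```

MSE and RMSE penalize large errors more heavily than MAE, while MAPE expresses the error relative to the size of the actual value.
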
Steps in Data Mining
6. Evaluate and interpret the results
▪ Clustering
  ✓ Within variance: variance among the records in a single cluster.
  ✓ Between variance: variance between clusters.
  ✓ Good clustering: high between variance and low within variance.
▪ Association rules
  ✓ Support: $P(A, B)$
  ✓ Confidence: $P(A \mid B) = \frac{P(A, B)}{P(B)}$
  ✓ Lift: $\frac{P(A \mid B)}{P(A)} = \frac{P(A, B)}{P(A) \cdot P(B)}$

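Support, confidence, and lift can be estimated from a list of transactions by counting. The sketch follows the slide's notation, with confidence as P(A | B) = P(A, B) / P(B), and takes lift as P(A | B) / P(A). The transactions and item names are hypothetical.

```python
def rule_metrics(transactions, a, b):
    """Estimate support P(A, B), confidence P(A | B), and lift from transaction counts."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if a in t) / n
    p_b = sum(1 for t in transactions if b in t) / n
    p_ab = sum(1 for t in transactions if a in t and b in t) / n
    confidence = p_ab / p_b   # P(A | B)
    lift = confidence / p_a   # equals P(A, B) / (P(A) * P(B))
    return {"support": p_ab, "confidence": confidence, "lift": lift}


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]
result = rule_metrics(baskets, a="beer", b="diaper")
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

A lift above 1 suggests the two items appear together more often than would be expected if they were purchased independently.
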
Steps in Data Mining
7. Deploy and monitor the model
▪ Deployment
  ✓ Integrate the data mining model into the operational system.
  ✓ Run the model on real data to produce decisions or actions.
    • “Send Mr. Kang a coupon because his likelihood of leaving the company next month is 80%”
▪ Monitoring
  ✓ Evaluate the performance of the model after deployment.
  ✓ Update or redevelop if necessary.