Sistemas de Información Biomédica
Grado en Ingeniería Biomédica
Universidad Politécnica de Madrid
KDD & Data Mining
Prof. David Pérez del Rey
dperezdelrey@fi.upm.es
School of Computer Science - UPM
Room 2104
Tel: +34 91 336 74 45
Outline – 2 + 1 hours
KDD and DM – 1 hour (first day)
KDD and Data Mining
Simple Examples
Biomedical Applications
Further resources
DM Exercises and Assignment selection – 1 hour (first day)
Groups
Assignment
Start working…
Presentations – up to 1 hour (in 2 weeks)
A 10-15 minute presentation per group
Motivation – Data growth
Nowadays the amount of data available is increasing dramatically:
Bank, telecom…
Astronomy, biology, medicine…
Web, text, e-commerce…
Very little of this data is ever examined by humans
Motivation – Data growth in Biomedicine
More information = better decisions?
Having more data available does not automatically mean more knowledge to be used in decisions
We need automatic Knowledge Discovery in Databases (KDD) methods
What is (not) Data Mining?
What is not Data Mining?
– Looking up a phone number in a phone directory
– Querying a Web search engine for information about “Amazon”
What is Data Mining?
– Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
© Tan,Steinbach, Kumar
Introduction to Data Mining
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996
Phases of the KDD Process (Fayyad et al., 1996)
“Discovery process of non-trivial and useful knowledge”
Original Data → (Selection) → Target Data → (Data Cleaning & Integration) → Integrated Data → (Transformation & Data Reduction) → Transformed Data → (Data Mining) → Patterns → (Interpretation) → Knowledge
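As a rough illustration of these phases, the following Python sketch walks a hypothetical patient file through selection, cleaning, transformation, mining and interpretation. It uses pandas and scikit-learn rather than the tools introduced later in this course, and the file name and columns ("patients.csv", "age", "glucose", "outcome") are invented for the example.

```python
# Hypothetical walk through the KDD phases (illustrative only).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

original = pd.read_csv("patients.csv")                # original data (assumed file)
target = original[["age", "glucose", "outcome"]]      # selection -> target data
integrated = target.dropna().drop_duplicates()        # cleaning & integration -> integrated data
X = StandardScaler().fit_transform(integrated[["age", "glucose"]])  # transformation -> transformed data
model = DecisionTreeClassifier(random_state=0)
model.fit(X, integrated["outcome"])                   # data mining -> patterns
print(dict(zip(["age", "glucose"], model.feature_importances_)))    # interpretation -> knowledge
```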
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future values of other variables
Description Methods
Find human-interpretable patterns that describe the data
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Related Fields
Data Mining and Knowledge Discovery lies at the intersection of several related fields: Machine Learning, Statistics, Databases & Integration, and Visualization
BIG Data!!!
Big data = a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications
Traditional technologies have limitations in:
Analysis
Capture
Curation
Search
Sharing
Storage
Transfer
Visualization
Privacy violations
…
Typical application domains:
Internet search
Physics simulations
Meteorology
Genomics
Finance
…
Types of Attributes
There are different types of attributes
Nominal
○ Examples: ID numbers, eye color, zip codes
Ordinal
○ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval
○ Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio
○ Examples: temperature in Kelvin, length, time, counts
© Tan,Steinbach, Kumar
Introduction to Data Mining
Properties of Attribute Values
The type of an attribute depends on which of the
following properties it possesses:
Distinctness: = ≠
Order: < >
Addition: + −
Multiplication: * /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
© Tan,Steinbach, Kumar
Introduction to Data Mining
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a
collection of documents
Often represented as integer variables
Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits
Continuous attributes are typically represented as floating-point variables
© Tan,Steinbach, Kumar
Introduction to Data Mining
Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their
probabilities)
© Tan,Steinbach, Kumar
Introduction to Data Mining
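A minimal sketch of the handling strategies listed above, using pandas; the small DataFrame and its columns are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 40, 31],
                   "weight": [70, 80, np.nan, 65]})

eliminated = df.dropna()                            # eliminate data objects with missing values
estimated = df.fillna(df.mean(numeric_only=True))   # estimate missing values (mean imputation)
# A third option is to leave the NaNs in place and let an algorithm that
# tolerates missing values ignore them during analysis.
print(eliminated, estimated, sep="\n\n")
```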
Duplicate Data
Data set may include data objects that are
duplicates, or almost duplicates of one another
Major issue when merging data from heterogeneous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
© Tan,Steinbach, Kumar
Introduction to Data Mining
Previous Phases to Data Mining
High-quality data preparation is key to producing
valid and reliable models
Data Understanding
Integration
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
…
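Two of these preparation steps, discretization and binarization, can be sketched with pandas as follows; the income values and bin edges are illustrative assumptions, not part of the course material.

```python
import pandas as pd

income = pd.Series([60, 75, 95, 120, 220], name="income")
# Discretization: continuous income -> ordinal categories
levels = pd.cut(income, bins=[0, 80, 150, 1000], labels=["low", "medium", "high"])
# Binarization: one binary (0/1) column per category
binary = pd.get_dummies(levels, prefix="income")
print(binary)
```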
Data mining
Unsupervised Techniques
Cluster Analysis, Principal Components
Association Rules, Collaborative Filtering
Supervised Techniques
Prediction (Estimation):
○ Regression, Regression Trees, k-Nearest Neighbors
Classification:
○ k-Nearest Neighbors, Naïve Bayes, Classification
Trees, Logistic Regression, Neural Nets
Clustering Definition
Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
Data points in one cluster are more similar to one another
Data points in separate clusters are less similar to one
another
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
© Tan,Steinbach, Kumar
Introduction to Data Mining
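A minimal clustering sketch with k-means (Euclidean distance) from scikit-learn; the points and the choice of k = 2 are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # centroids: intracluster distances small, intercluster distances large
```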
Illustrating Clustering
Euclidean distance based clustering in 3-D space:
Intracluster distances are minimized
Intercluster distances are maximized
© Tan,Steinbach, Kumar
Introduction to Data Mining
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix
Approach:
○ Collect different attributes of customers based on their
geographical and lifestyle related information
○ Find clusters of similar customers
○ Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters
© Tan,Steinbach, Kumar
Introduction to Data Mining
Regression
Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency
Greatly studied in statistics, neural network fields
Examples:
Predicting sales amounts of new product based on
advertising expenditure
Predicting wind velocities as a function of
temperature, humidity, air pressure, etc
Time series prediction of stock market indices
© Tan,Steinbach, Kumar
Introduction to Data Mining
Data Mining Tasks: Regression
[Figure: a fitted line y = x + 1; for an input X1 the model predicts Y1’, compared with the observed value Y1]
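The idea in the figure can be sketched in a few lines of numpy: fit a least-squares line to noisy samples of y = x + 1 and predict Y1' at a new point X1. The sample points and noise level are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x + 1 + rng.normal(scale=0.1, size=x.shape)   # noisy observations of y = x + 1
slope, intercept = np.polyfit(x, y, deg=1)        # least-squares line
x1 = 2.5
y1_hat = slope * x1 + intercept                   # predicted value Y1' at the new point X1
print(slope, intercept, y1_hat)
```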
Classification: Definition
Given a collection of records (training set )
Each record contains a set of attributes, one of the
attributes is the class
Find a model for class attribute as a function of the
values of other attributes
Goal: previously unseen records should be assigned a
class as accurately as possible
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it
© Tan,Steinbach, Kumar
Introduction to Data Mining
Examples of Classification Task
Predicting relapse of cancer
Classifying credit card transactions
as legitimate or fraudulent
Classifying structures of protein
Categorizing news stories as finance,
weather, entertainment, sports, etc…
Classification: Application
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product
Approach:
○ Use the data for a similar product introduced before
○ We know which customers decided to buy and which
decided otherwise - This {buy, don’t buy} decision forms
the class attribute.
○ Collect various demographic, lifestyle, and company-interaction related information about all such customers
Type of business, where they stay, how much they earn, etc.
○ Use this information as input attributes to learn a classifier
model
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the
telescopic survey images (from Palomar
Observatory).
3000 images with 23,040 x 23,040 pixels per image.
Approach:
○ Segment the image
○ Measure image attributes (features) - 40 of them per object
○ Model the class based on these features
○ Success Story: Could find 16 new high red-shift quasars,
some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Class: stages of formation (early, intermediate, late)
Attributes: image features, characteristics of light waves received, etc.
Data size: 72 million stars, 20 million galaxies; object catalog: 9 GB; image database: 150 GB
Courtesy: http://aps.umn.edu
Classification: Application
Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions
Approach:
○ Use credit card transactions and the information on its
account-holder as attributes
When does a customer buy, what does he buy, how often he pays
on time, etc
○ Label past transactions as fraud or fair transactions - This
forms the class attribute
○ Learn a model for the class of the transactions.
○ Use this model to detect fraud by observing credit card
transactions on an account
© Tan,Steinbach, Kumar
Introduction to Data Mining
Illustrating Classification Task
Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

The training set is fed to a learning algorithm (induction) to learn a model.

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

The model is then applied to the test set (deduction) to assign a class to each record.
© Tan,Steinbach, Kumar
Introduction to Data Mining
Example of a Decision Tree
Training Data:
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
  Refund?
    Yes → NO
    No  → MarSt?
            Married          → NO
            Single, Divorced → TaxInc?
                                 < 80K → NO
                                 > 80K → YES
Another Example of Decision Tree
Training data: same as in the previous example.

Model: Decision Tree
  MarSt?
    Married          → NO
    Single, Divorced → Refund?
                         Yes → NO
                         No  → TaxInc?
                                 < 80K → NO
                                 > 80K → YES

There could be more than one tree that fits the same data!
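The toy "cheat" data set above is small enough to reproduce directly. The sketch below trains a decision tree on it with scikit-learn (not the tool used in the original slides); the one-hot encoding of the nominal attributes is our own choice, so the induced tree may differ from the two trees shown, which itself illustrates that several trees can fit the same data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MarSt":  ["Single", "Married", "Single", "Married", "Divorced",
               "Married", "Divorced", "Single", "Married", "Single"],
    "TaxInc": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":  ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
# One-hot encode the nominal attributes, keep taxable income numeric
X = pd.get_dummies(data[["Refund", "MarSt"]]).assign(TaxInc=data["TaxInc"])
tree = DecisionTreeClassifier(random_state=0).fit(X, data["Cheat"])
print(export_text(tree, feature_names=list(X.columns)))
```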
Decision Tree Classification Task
Training set and test set: the same Tid 1-10 / Tid 11-15 tables shown above.
The training set is fed to a tree induction algorithm (induction) to learn a decision tree model; the model is then applied to the test set (deduction) to assign a class to each record.
Apply Model to Test Data
Test Data:
  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

Start from the root of the tree and follow the branch that matches the record at each node:
  Refund = No     → take the “No” branch to MarSt
  MarSt = Married → take the “Married” branch to the leaf NO
Assign Cheat to “No”
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
Simple Examples
[A series of worked examples; figures from Witten, I.H., and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations]
Decision Tree Example
Let a computer do it for us: WEKA
From Witten, I.H., and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
Model Evaluation
Metrics for Performance Evaluation
How to evaluate the performance of a model?
Methods for Performance Evaluation
How to obtain reliable estimates?
Methods for Model Comparison
How to compare the relative performance among
competing models?
© Tan,Steinbach, Kumar
Introduction to Data Mining
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
Confusion Matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL    Class=Yes       a           b
  CLASS     Class=No        c           d

  a: TP (true positive)
  b: FN (false negative)
  c: FP (false positive)
  d: TN (true negative)
Metrics for Performance Evaluation…
                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL    Class=Yes     a (TP)      b (FN)
  CLASS     Class=No      c (FP)      d (TN)

Most widely used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
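A short sketch computing the same quantities with scikit-learn; the label vectors are invented for the example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"]
y_pred = ["Yes", "No",  "No", "No", "Yes", "Yes", "No", "No"]

# rows = actual class, columns = predicted class, in the order [No, Yes]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["No", "Yes"]).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, fn, fp, tn)
print(accuracy, accuracy_score(y_true, y_pred))   # both give 0.75 here
```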
Limitation of Accuracy
Consider a 2-class problem (e.g. Ebola or not)
Number of Class 0 examples = 999
Number of Class 1 examples = 1
If model predicts everything to be class 0,
accuracy is 999/1000 = 99.9 %
Accuracy is misleading because model does not detect
any class 1 example
Cost Matrix
                        PREDICTED CLASS
  C(i|j)                Class=Yes     Class=No
  ACTUAL    Class=Yes   C(Yes|Yes)    C(No|Yes)
  CLASS     Class=No    C(Yes|No)     C(No|No)
C(i|j): Cost of misclassifying class j example as class i
Computing Cost of Classification
Cost Matrix:
                 PREDICTED CLASS
  C(i|j)         +       -
  ACTUAL   +    -1     100
  CLASS    -     1       0

Model M1:
                 PREDICTED CLASS
                 +       -
  ACTUAL   +   150      40
  CLASS    -    60     250
  Accuracy = 80%, Cost = 3910

Model M2:
                 PREDICTED CLASS
                 +       -
  ACTUAL   +   250      45
  CLASS    -     5     200
  Accuracy = 90%, Cost = 4255
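The numbers above can be checked with a couple of lines of numpy: the cost is the element-wise product of each confusion matrix and the cost matrix, summed over all cells.

```python
import numpy as np

cost_matrix = np.array([[-1, 100],    # actual +: C(+|+), C(-|+)
                        [  1,   0]])  # actual -: C(+|-), C(-|-)
m1 = np.array([[150,  40],
               [ 60, 250]])
m2 = np.array([[250,  45],
               [  5, 200]])
for name, cm in (("M1", m1), ("M2", m2)):
    accuracy = np.trace(cm) / cm.sum()
    cost = (cm * cost_matrix).sum()
    print(name, accuracy, cost)   # M1: 0.80, 3910  /  M2: 0.90, 4255
```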
Cost vs Accuracy
Count:
                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL    Class=Yes       a           b
  CLASS     Class=No        c           d

  N = a + b + c + d
  Accuracy = (a + d) / N

Accuracy is proportional to cost if
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p

Cost:
                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL    Class=Yes       p           q
  CLASS     Class=No        q           p

  Cost = p (a + d) + q (b + c)
       = p (a + d) + q (N − a − d)
       = q N − (q − p)(a + d)
       = N [q − (q − p) × Accuracy]
Cost-Sensitive Measures
  Precision (p) = a / (a + c)
  Recall (r) = a / (a + b)
  F-measure (F) = 2 r p / (r + p) = 2a / (2a + b + c)
  Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
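Written out in terms of the confusion-matrix counts (a = TP, b = FN, c = FP, d = TN), the measures above are easy to compute directly; the counts and weights below are invented for the example.

```python
a, b, c, d = 70, 30, 20, 880          # TP, FN, FP, TN

precision = a / (a + c)
recall = a / (a + b)
f_measure = 2 * recall * precision / (recall + precision)   # equals 2a / (2a + b + c)

w1, w2, w3, w4 = 1.0, 1.0, 1.0, 1.0   # equal weights recover plain accuracy
weighted_accuracy = (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)
print(precision, recall, f_measure, weighted_accuracy)
```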
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other
factors besides the learning algorithm:
Class distribution
Cost of misclassification
Size of training and test sets
Learning Curve
A learning curve shows how accuracy changes with varying sample size
Effect of small sample size:
- Bias in the estimate
- Variance of the estimate
Methods of Estimation
Holdout
Reserve 2/3 for training and 1/3 for testing
Random subsampling
Repeated holdout
Cross validation
Partition data into k disjoint subsets
k-fold: train on k-1 partitions, test on the remaining one
Leave-one-out: k=n
Stratified sampling
oversampling vs undersampling
Bootstrap
Sampling with replacement
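A sketch of two of these strategies (holdout and k-fold cross-validation) with scikit-learn, on its bundled breast-cancer data set; the choice of classifier is an assumption made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: reserve 2/3 for training and 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation: train on 9 partitions, test on the remaining one
print("10-fold CV accuracy:", cross_val_score(clf, X, y, cv=10).mean())
```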
ROC (Receiver Operating Characteristic)
Developed in 1950s for signal detection theory to
analyze noisy signals
Characterize the trade-off between positive hits and false
alarms
ROC curve plots the TP rate (on the y-axis) against the FP rate (on
the x-axis)
Performance of each classifier represented as a
point on the ROC curve
changing the threshold of algorithm, sample distribution or
cost matrix changes the location of the point
ROC Curve
(TP,FP):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
Random guessing
Below diagonal line:
○ prediction is opposite of
the true class
Using ROC for Model Comparison
No model consistently outperforms the other:
M1 is better for small FPR
M2 is better for large FPR
Area Under the ROC Curve (AUC):
Ideal: Area = 1
Random guess: Area = 0.5
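A minimal sketch of an ROC curve and its AUC with scikit-learn; the true labels and classifier scores below are invented, standing in for a model's predicted probabilities of the positive class.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]   # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, scores)     # each threshold gives one (FPR, TPR) point
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, scores))         # 1.0 is ideal, 0.5 is random guessing
```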
Data Mining in Biomedicine
Health Care
Disease diagnosis
Drug discovery
Symptom clustering
Decision Support Systems
…
Bioinformatics / Genomics
Gene expression
Microarrays analysis
○ Many columns (variables) – Moderate number of rows
(observation units)
Protein structure prediction
…
Major challenge: Integration of multi-scale data
Example: ALL/AML data
38 training cases, 34 test, ~ 7,000 genes
2 Classes: Acute Lymphoblastic Leukemia (ALL) vs
Acute Myeloid Leukemia (AML)
Use training data to build a diagnostic model
ALL
AML
Results on test data:
33/34 correct, 1 error may be mislabeled
Protein Structure
SPIDER Data Mining Project: Scalable, Parallel
and Interactive Data Mining and Exploration at RPI
http://www.cs.rpi.edu/~zaki
From http://www.cs.rpi.edu/~zaki
Resources
References
Witten, I.H., and Frank, E., Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
Pardalos P, Boginski V, Vazakopoulos A: Data mining in biomedicine.
Springer; 2007. (Google Books)
Wang JTL, Zaki MJ, Toivonen HTT, et al. (eds). Data Mining in Bioinformatics.
Springer-Verlag, 2004. (Google Books)
Online
University of Minnesota: http://www-
users.cs.umn.edu/~kumar/dmbook/index.php
University of Regina:
http://www2.cs.uregina.ca/~hamilton/courses/831/index.html
University of Waikato: Weka Software
University of Ljubljana: Orange Software
Assignment – Weka / Orange
Using the Weka (http://www.cs.waikato.ac.nz/ml/weka/) or Orange (http://orange.biolab.si/) framework
Data mining analysis, comparing performance results for a classification problem:
breast-cancer.arff – Train a model to predict recurrence of cancer
At most 3 different classifiers for each dataset (including ZeroR and J48)
“cross-validation” vs “only training set”
(optional) Manual missing value preprocessing (delete or estimate)
(optional) Manual field selection
(optional) Cost estimate
Data mining analysis, comparing performance results for a regression problem:
breastTumor.arff – Train a model to predict tumor size
At most 3 different classifiers for each dataset (including ZeroR and Neural Networks)
(optional) Investigate evaluation measures
Report
Length: up to 2-3 pages in .pdf, sent to dperezdelrey@fi.upm.es by 27th October at 12:00
10-15 minute presentation and discussion on 29th October at 17:30