Machine Learning
Part 2: Intermediate and Active
Sampling Methods
Jaime Carbonell (with contributions from Pinar
Donmez and Jingrui He)
Carnegie Mellon University
jgc@cs.cmu.edu
December, 2008
© 2008, Jaime G Carbonell
Beyond “Standard” Learning:
Multi-Objective Learning
Structuring Unstructured Data
Text Categorization
Temporal Prediction
Cycle & trend detection
Semi-Supervised Methods
Labeled + Unlabeled Data
Active Learning
Proactive Learning
“Unsupervised” Learning
Predictor attributes, but no explicit objective
Clustering methods
Rare category detection
Multi-Objective Supervised Learning
Several objectives to predict, overlapping sets of
predictor attributes
(Diagram: predictor attributes p1 … p6 feed three objectives obj1, obj2, obj3, with overlapping attribute subsets per objective.)
-- Independent predictions: each objective solved ignoring the others
-- Dependent predictions: results of earlier predictions partially feed the next round
Dependent case: sequence the predictions. If there is feedback, cycle until stability (or a fixed N iterations)
The Vector Space Model
How to Convert Text to “Data”
Definitions of document and query vectors, where $w_j$ is the $j$th word and $c(w_j, d_i)$ counts the occurrences of $w_j$ in document $d_i$.
For topic categorization, use $w_{n+1}$ as the objective category to predict (e.g. “finance”, “sports”).
Vocabulary $= \{w_1, w_2, \dots, w_n\}$
$d_i = [c(w_1, d_i), c(w_2, d_i), \dots, c(w_n, d_i)]$
$q_i = [c(w_1, q_i), c(w_2, q_i), \dots, c(w_n, q_i)]$
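To make the conversion concrete, here is a minimal Python sketch (not from the slides; the toy corpus, whitespace tokenization, and category labels are illustrative):

```python
from collections import Counter

docs = ["the market rallied as tech stocks rose",
        "the home team won the final game"]
vocabulary = sorted(set(w for d in docs for w in d.split()))

def to_vector(text, vocab):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]   # c(w_j, d_i) for each word w_j

d_vectors = [to_vector(d, vocabulary) for d in docs]
# For topic categorization, carry the objective category as element n+1:
labeled = list(zip(d_vectors, ["finance", "sports"]))
```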
Refinements to Word-Based
Features
Well-known methods
Stop-word removal (e.g., “it”, “the”, “in”, …)
Phrasing (e.g., “White House”, “heart attack”, …)
Morphology (e.g., “countries” => “country”)
Feature Expansion
Query expansion (e.g., “cheap” =>
“inexpensive”, “discount”, “economic”,…)
Feature Transformation & Reduction
Singular-value decomposition (SVD)
Linear discriminant analysis (LDA)
Query-Document Similarity
(For Retrieval and for kNN)
Traditional “cosine similarity”:
$$Sim(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|}, \qquad \text{where } \|d\| = \sqrt{\sum_{i=1,\dots,n} d_i^2}$$
Each element in the query and document vectors is a word weight.
Rare words count more, e.g.: $d_i = \log_2(D_{all} / D_{freq}(word_i))$
Getting the top-k documents (or web pages) is done by:
$$Retrieve(q, k) = \operatorname*{Argmax}_{d \in D}[k, Sim(d, q)]$$
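A small Python sketch of this retrieval step, assuming count vectors have already been scaled by the idf-style weight above (the helper names and in-memory corpus layout are illustrative):

```python
import math
import heapq

def idf(word_doc_freq, n_docs):
    # Rare words count more: log2(D_all / D_freq(word_i))
    return math.log2(n_docs / word_doc_freq) if word_doc_freq else 0.0

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

def retrieve(q_vec, doc_vecs, k):
    # Argmax[k, Sim(d, q)]: indices of the k most similar documents
    return heapq.nlargest(k, range(len(doc_vecs)),
                          key=lambda i: cosine(q_vec, doc_vecs[i]))
```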
Multi-tier Text Categorization
News Event
  Terrorist Event
    Bombing
    Shooting
  Economic disaster
    Asian Crisis
    US tech crisis
Given text, predict category at each level
Issue: What if we need to go beyond words as features?
Time Series Prediction Process
Find leading indicators
“predictor” variables from earlier epochs
Code values per distinct time interval
E.g. “sales at t-1, at t-2, t-3 …”
E.g. “advertisement $ at t, t-1, t-2”
Objective is to predict desired variable at
current or future epochs
E.g. “sales at t, t+1, t+2”
Apply machine learning methods you learned
Regression, d-trees, kNN, Bayesian, …
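For instance, a least-squares sketch with lagged predictor features (numpy only; the lag depth and toy quarterly series are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

def lagged_features(series, n_lags):
    # Row t holds [x_{t-1}, ..., x_{t-n_lags}]; the target is x_t
    X = [series[t - n_lags:t][::-1] for t in range(n_lags, len(series))]
    return np.array(X, float), np.array(series[n_lags:], float)

sales = [9.5, 8.5, 7.5, 11, 11, 10, 8.5, 13, 12, 11, 9.5]  # quarterly
X, y = lagged_features(sales, n_lags=4)
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
forecast = np.r_[sales[-1:-5:-1], 1.0] @ coef  # one-step-ahead estimate
```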
Time Series Prediction: caveat 1
Quarterly total sales:
        2006    2007    2008
  Q1    9.5M    11M     12M
  Q2    8.5M    10M     11M
  Q3    7.5M    8.5M    9.5M
  Q4    11M     13M     ??
Method:
  1. Determine the periodic cycle
  2. Find the within-cycle trend
  3. Find the cross-cycle trend
  4. Combine both components
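One plausible way to code steps 2-4 on the toy table (the quarterly cycle from step 1 is taken as given; the combination rule and the resulting Q4 2008 estimate are illustrations of the method, not figures from the slides):

```python
import numpy as np

sales = {2006: [9.5, 8.5, 7.5, 11.0],
         2007: [11.0, 10.0, 8.5, 13.0],
         2008: [12.0, 11.0, 9.5, None]}   # Q4 2008 is the unknown

# Step 2 (within-cycle trend): average Q4 premium over that year's Q1-Q3
q4_premium = np.mean([sales[y][3] - np.mean(sales[y][:3])
                      for y in (2006, 2007)])
# Step 3 (cross-cycle trend): Q4 grew 11M -> 13M; extrapolate linearly
q4_growth = sales[2007][3] - sales[2006][3]
# Step 4: combine both components (here, a simple average of the two)
est_within = np.mean(sales[2008][:3]) + q4_premium
est_cross = sales[2007][3] + q4_growth
print((est_within + est_cross) / 2)   # ~14.3M with these toy numbers
```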
Time Series Prediction: caveat 2
2008 total airline sales (with 2006 and 2007 for comparison):
        2006    2007    2008
  Q1    9.5M    11M     12M
  Q2    8.5M    10M     11M
  Q3    7.5M    8.5M    9.5M
  Q4    11M     13M     ??
Watch for exogenous variables! (The World Trade Center attack wreaked havoc with airline-industry predictions.) Less tragic and less obvious one-of-a-kind events matter too.
Leveraging Existing Data-Collection Systems
1999 influenza outbreak: many existing data streams tracked it [Moore, 2002]
  Influenza cultures
  Sentinel physicians
  WebMD queries about ‘cough’ etc.
  School absenteeism
  Sales of cough and cold meds
  Sales of cough syrup
  ER respiratory complaints
  ER ‘viral’ complaints
  Influenza-related deaths
(Plot: weekly counts for each signal over the 1999-2000 season.)
Adaptive Filtering over a Document Stream
(Diagram: a document stream ordered in time for each of Topic 1, Topic 2, Topic 3, …: training documents in the past, then unlabeled documents, then test documents. For each current document the system decides: on-topic? On-topic documents feed back into the model via relevance feedback (RF); off-topic documents do not.)
Classifier = Rocchio, Topic = Civil War (R76 in TREC10), Threshold = MLR
MLR threshold function:
locally linear, globally non-linear
Time Series in a Nutshell
Time-series prediction requires regression, except:
  Historical data come per time period (aka “epoch”)
  Predictor attributes come from both the current and earlier epochs
  The objective attribute from earlier epochs becomes a predictor attribute for the current epoch
Process differences with normal machine learning:
  First detect cyclical patterns among epochs
  Predict within a cycle
  Predict cross-cycle using corresponding epochs only (then combine with the within-cycle prediction)
Active Learning
Assume:
  Very few “labeled” instances {x, y}
  Very many “unlabeled” instances {x}
  An omniscient “oracle” which can assign a label to an unlabeled instance
Objective:
  Select instances to label such that learning accuracy is maximized with the fewest oracle labeling requests
Active Learning (overall idea)
(Diagram: the learning mechanism draws unlabeled data from the data source, issues a label request to the expert, receives labeled data back, learns a new model, and delivers output to the user.)
Why is Active Learning Important?
Labeled data volumes << unlabeled data volumes
  1.2% of all proteins have known structures
  0.01% of all galaxies in the Sloan Sky Survey have consensus type labels
  0.0001% of all web pages have topic labels
If labeling is costly, or limited, we want to select the points that will have maximal impact
Review of Supervised Learning
Training data: $\{x_i, y_i\}_{i=1,\dots,k}$, with $y \in \mathbb{R}$ (or, to simplify, $y \in \{-1, +1\}$)
Functional space: $\{f_j(p_l)\}$, model families $f_j$ with parameters $p_l$
Fitness criterion:
$$\arg\min_{j,l} \sum_i |y_i - f_{j,p_l}(x_i)| + \Omega(f_{j,p_l})$$
where $\Omega$ is a penalty on model complexity
Variants: online learning, noisy data, …
Active Learning
Training data: $\{x_i, y_i\}_{i=1,\dots,k} \cup \{x_i\}_{i=k+1,\dots,n}$, with an oracle $O: x_i \to y_i$
Special case: $k = 0$
Functional space: $\{f_j(p_l)\}$
Fitness criterion (a.k.a. loss function):
$$\arg\min_{j,l} \sum_i |y_i - f_{j,p_l}(x_i)| + \Omega(f_{j,p_l})$$
Sampling strategy: pick the unlabeled instance whose label most reduces expected test loss:
$$x^* = \arg\min_{x_i \in \{x_{k+1},\dots,x_n\}} \hat L\big(f(x_{test}, y_{test}) \mid \{x_1,\dots,x_k\} \cup \{x_i\}\big)$$
Sampling Strategies
Random sampling (preserves distribution)
Uncertainty sampling (Tong & Koller, 2000)
proximity to decision boundary
maximal distance to labeled x’s
Density sampling (kNN-inspired McCallum & Nigam, 2004)
Representative sampling (Xu et al, 2003)
Instability sampling (probability-weighted)
x’s that maximally change decision boundary
Ensemble Strategies
Boosting-like ensemble (Baram, 2003)
DUAL (Donmez & Carbonell, 2007)
Dynamically switches strategies from Density-Based to
Uncertainty-Based by estimating derivative of expected
residual error reduction
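Minimal sketches of two of these strategies (assuming a classifier that outputs class probabilities; the k-nearest-neighbor density proxy is one of several reasonable realizations):

```python
import numpy as np

def uncertainty_sampling(probs):
    # probs: (n_unlabeled, n_classes) predicted probabilities;
    # closest to the decision boundary = smallest top-class probability
    return int(np.argmin(probs.max(axis=1)))

def density_sampling(X_unlabeled, k=10):
    # kNN-inspired: pick the point with the densest neighborhood,
    # i.e. smallest mean distance to its k nearest neighbors
    d = np.linalg.norm(X_unlabeled[:, None] - X_unlabeled[None, :], axis=2)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    return int(np.argmin(knn_mean))
```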
Which point to sample?
(Figure: green = unlabeled, red = class A, brown = class B.)
Density-Based Sampling
Centroid of largest unsampled cluster
Uncertainty Sampling
Closest to decision boundary
Maximal Diversity Sampling
Maximally distant from labeled x’s
Ensemble-Based Possibilities
Uncertainty + Diversity criteria
Density + uncertainty criteria
Active Learning Issues
Interaction of active sampling with the underlying classifier(s).
On-line sampling vs. batch sampling.
Active sampling for rank learning and for structured learning (e.g. HMMs, sCRFs).
What if the oracle is fallible, reluctant, or differentially expensive? → proactive learning.
How does noisy data affect active learning?
What if we do not have even the first labeled point(s) for one or more classes? → new-class discovery.
How to “optimally” combine A.L. strategies?
Strategy Selection:
No Universal Optimum
• Optimal operating
range for AL sampling
strategies differs
• How to get the best of
both worlds?
• (Hint: ensemble
methods, e.g. DUAL)
Motivation for DUAL
Strength of DWUS (density-weighted uncertainty sampling):
  favors higher-density samples close to the decision boundary
  fast decrease in error
But DWUS exhibits diminishing returns! Why?
  • Early iterations -> many points are highly uncertain
  • Later iterations -> points with high uncertainty are no longer in dense regions
  • DWUS wastes time picking instances with no direct effect on the error
How does DUAL do better?
Run DWUS until it estimates a cross-over: monitor the change in expected error at each iteration to detect when it is stuck in a local minimum,
$$\hat\delta(DWUS) = \frac{\partial}{\partial t}\left[\frac{1}{n_t}\sum_i \hat E[(y_i - \hat y_i)^2 \mid x_i]\right] \approx 0$$
DUAL uses a mixture model after the cross-over (saturation) point:
$$x_s = \operatorname*{argmax}_{i \in I_U}\; \pi \cdot \hat E[(y_i - \hat y_i)^2 \mid x_i] + (1 - \pi) \cdot \hat p(x_i)$$
Our goal should be to minimize the expected future error:
  If we knew the future error of Uncertainty Sampling (US) to be zero, we would force $\pi = 1$
  But in practice, we do not know it
More on DUAL
After the cross-over, US does better => the uncertainty score should be given more weight
  The weight should reflect how well US performs
  It can be calculated from the expected error of US on the unlabeled data*: $\hat\epsilon(US)$
Finally, we have the following selection criterion for DUAL:
$$x_s = \operatorname*{argmax}_{i \in I_U}\; (1 - \hat\epsilon(US)) \cdot \hat E[(y_i - \hat y_i)^2 \mid x_i] + \hat\epsilon(US) \cdot \hat p(x_i)$$
* US is allowed to choose data only from among the already sampled instances, and $\hat\epsilon(US)$ is calculated on the remaining unlabeled set
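A sketch of this post-cross-over selection rule, assuming per-point estimates of expected error and density are already available (the array names are illustrative, and it follows the reconstructed criterion above):

```python
import numpy as np

def dual_select(exp_error, density, eps_us):
    # exp_error[i] ~ E[(y_i - yhat_i)^2 | x_i]; density[i] ~ p(x_i);
    # eps_us = estimated expected error of US on the unlabeled pool.
    # The better US performs (small eps_us), the more weight the
    # uncertainty term receives.
    scores = (1.0 - eps_us) * exp_error + eps_us * density
    return int(np.argmax(scores))
```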
Results: DUAL vs DWUS
Paired Density-Based Sampling
(Donmez & Carbonell, 2008)
Desiderata
Balanced Sampling from both (all) classes
Combine density-based with coverage-based
Method
  Non-Euclidean distance function:
  $$d(x_i, x_j) = \ln\Big(1 + \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} \big(e^{\|p^k - p^{k+1}\|} - 1\big)\Big)$$
  where $P_{ij}$ is the set of paths between $x_i$ and $x_j$
  Select maximally separated pairs of points based on maximizing a utility function
Paired Density Method (cont.)
Utility function:
$$U(i, j) = \log\big[\hat p(x_i)\,\hat p(x_j)\big] + \log\!\!\sum_{k \ne i,\, k \in N(x_i)}\!\! e^{-\|x_i - x_k\|^2} \min_{y_k \in \{\pm1\}} \hat P(y_k \mid x_k) + \log\!\!\sum_{r \ne j,\, r \in N(x_j)}\!\! e^{-\|x_j - x_r\|^2} \min_{y_r \in \{\pm1\}} \hat P(y_r \mid x_r) + s \cdot \Big[\min_{y_i \in \{\pm1\}} \hat P(y_i \mid x_i) + \min_{y_j \in \{\pm1\}} \hat P(y_j \mid x_j)\Big]$$
Select the two points that optimize utility and are maximally distant:
$$(i^*, j^*) = \operatorname*{argmax}_{i \ne j \in I_U} \|x_i - x_j\|^2 \cdot U(i, j)$$
Results of Paired-Density Sampling
Active Learning model in NLP
(Diagram: an active learner mediates between an un-annotated corpus and the active training set. Sample selection picks samples from the unlabeled set; annotation adds them to the training data, which builds the model that is evaluated on test data. Target systems: parsing models, machine translation, named entity recognition modules, word sense disambiguation models.)
Word-Sense Disambiguation
Needed in NLP for parsing, translation, search…
Example:
  “Line” → ax+by+c, rope, queue, track, …
  “Banco” → bench, financial institution, sand bank, …
Challenge: how to disambiguate from context
Approach: build an ML classifier (sense = class)
Problem: insufficient training data
Amelioration: active learning
Word Sense Disambiguation:
Active Learning Methods
Entropy sampling
  The vector q represents the trained model’s predictions
  $q_c$ = prediction probability of class c
  Pick the example whose prediction vector displays the greatest entropy
Margin sampling
  If c and c′ are the two most likely categories, pick the example with the smallest margin $q_c - q_{c'}$
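Both criteria in a few lines of Python (assuming Q holds one prediction vector per candidate example):

```python
import numpy as np

def entropy_sampling(Q):
    # Pick the example whose prediction vector has the greatest entropy
    H = -(Q * np.log(np.clip(Q, 1e-12, 1.0))).sum(axis=1)
    return int(np.argmax(H))

def margin_sampling(Q):
    # Margin = q_c - q_c' for the two most likely senses c and c'
    part = np.partition(Q, -2, axis=1)
    return int(np.argmin(part[:, -1] - part[:, -2]))
```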
Word Sense Disambiguation:
Experiment
On 5 English verbs that had coarse-grained senses.
Double-blind tagging applied to 50 instances of the target word.
If the inter-tagger agreement (ITA) < 90%, the sense entry is revised by adding examples and explanations.
Word Sense Disambiguation Results
Active vs. Proactive Learning
ACTIVE LEARNING                            PROACTIVE LEARNING
All x’s cost the same to label             Labeling cost is f1(D(x), O)
Max number of labels                       Max labeling budget
Omniscient oracle: never errs              Fallible oracles: errs with p(E(x)) ~ f2(D(x), O)
Indefatigable oracle: always answers       Reluctant oracles: answers with p(A(x)) …
Single oracle: selection unnecessary       Multiple oracles: joint optimization of oracle and instance selection
Scenario 1: Reluctance
2 oracles:
  reliable oracle: expensive but always answers with a correct label
  reluctant oracle: cheap but may not respond to some queries
Define a utility score as the expected value of information at unit cost:
$$U(x, k) = \frac{P(ans \mid x, k) \cdot V(x)}{C_k}$$
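A toy sketch of choosing an oracle by this utility (the answer probabilities, costs, and V(x) below are made-up numbers for illustration):

```python
def utility(p_ans, value, cost):
    # U(x, k) = P(ans | x, k) * V(x) / C_k
    return p_ans * value / cost

oracles = {"reliable": (1.0, 5.0),    # (P(ans), cost): always answers
           "reluctant": (0.6, 1.0)}   # cheap, answers 60% of the time
v_x = 2.0                             # estimated value of labeling x
best = max(oracles, key=lambda k: utility(oracles[k][0], v_x, oracles[k][1]))
print(best)   # "reluctant": 0.6*2/1 = 1.2 beats 1.0*2/5 = 0.4
```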
How to estimate $\hat P(ans \mid x, k)$?
  Cluster the unlabeled data using k-means
  Ask the reluctant oracle for the label of each cluster centroid $x_{c_t}$. If a
    label is received: increase $\hat P(ans \mid x, reluctant)$ for nearby points
    no label is received: decrease $\hat P(ans \mid x, reluctant)$ for nearby points
  The update, normalized by Z, weights each point x in cluster $C_t$ by its distance to the queried centroid, where $h(x_{c_t}, y_{c_t}) \in \{1, -1\}$ equals 1 when a label was received and −1 otherwise
  The number of clusters depends on the clustering budget and the oracle fee
Algorithm for Scenario 1
Scenario 2: Fallibility
Two oracles:
  One perfect but expensive oracle
  One fallible but cheap oracle that always answers
Algorithm similar to Scenario 1, with slight modifications
During exploration:
  The fallible oracle provides the label with its confidence
  Confidence = $\hat P(y \mid x)$ of the fallible oracle
  If $\hat P(y \mid x) \in [0.45, 0.5]$ then we don’t use the label, but we still update $\hat P(correct \mid x, k)$
Scenario 3: Non-uniform Cost
Uniform cost: fraud detection, face recognition, etc.
Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc.
2 oracles:
  Fixed-cost oracle
  Variable-cost oracle, with cost
$$C_{nonunif}(x) = 1 - \frac{\max_{y \in Y} \hat P(y \mid x) - \frac{1}{|Y|}}{1 - \frac{1}{|Y|}}$$
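The reconstructed cost formula in code (a sketch; it makes confidently classified instances cheap and maximally ambiguous ones cost 1):

```python
import numpy as np

def non_uniform_cost(probs):
    # C_nonunif(x) = 1 - (max_y P(y|x) - 1/|Y|) / (1 - 1/|Y|)
    m = len(probs)
    return 1.0 - (np.max(probs) - 1.0 / m) / (1.0 - 1.0 / m)

print(non_uniform_cost(np.array([0.9, 0.1])))  # 0.2: near-certain, cheap
print(non_uniform_cost(np.array([0.5, 0.5])))  # 1.0: maximally ambiguous
```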
Outline of Scenario 3
Underlying Sampling Strategy
Conditional-entropy-based sampling, weighted by a density measure
Captures the information content of a close neighborhood:
$$U(x) = -\log \min_{y \in \{\pm1\}} \hat P(y \mid x, \hat w) \cdot \sum_{k \in N(x)} e^{-\|x - k\|^2} \min_{y \in \{\pm1\}} \hat P(y \mid k, \hat w)$$
where $N(x)$ is the set of close neighbors of x
Results: Reluctance
Results where cost varies non-uniformly: the improvement is statistically significant (p < 0.01)
Proactive Learning in General
Multiple experts (a.k.a. oracles)
  Different areas of expertise
  Different costs
  Different reliabilities
  Different availability
What question to ask and whom to query?
  Joint optimization of query & oracle selection
  Referrals among oracles (with referral fees)
  Learn about oracle capabilities while solving the active learning problem at hand
Unsupervised Learning in DM
What does it mean to learn without an objective?
Explore the data for natural groupings
Learn association rules, and later examine
whether they can be of any business use
Illustrative examples
  Market basket analysis → later optimize shelf allocation & placements
  Cascaded or correlated mechanical faults
  Demographic grouping beyond known classes
  Plan product-bundling offers
Example Similarity Functions
Determine a similarity metric:
  Euclidean: $sim_{euclid}(d_i, d_j) = \Big[\sum_{k=1,\dots,n} (d_{i,k} - d_{j,k})^2\Big]^{-1/2}$
  Cosine: $sim_{cos}(q, d_i) = \frac{q \cdot d_i}{\|q\|_2 \, \|d_i\|_2}$
  KL-divergence
Determine a clustering algorithm:
  Incremental, agglomerative, k-means, …
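The three metrics as Python helpers (a sketch; the +1 in the Euclidean denominator and the smoothing in the KL variant are guards I have added against division by zero and log(0)):

```python
import numpy as np

def sim_euclid(a, b):
    return 1.0 / (1.0 + np.linalg.norm(a - b))   # larger = more similar

def sim_cos(q, d):
    return (q @ d) / (np.linalg.norm(q) * np.linalg.norm(d))

def kl_divergence(p, q, eps=1e-12):
    # Treat the vectors as probability distributions
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float((p * np.log(p / q)).sum())
```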
Hierarchical Agglomerative
Clustering Methods
Generic agglomerative procedure (Salton ’89); results in nested clusters via iteration:
1. Compute all pairwise document-document similarity coefficients
2. Place each of the n documents into a class of its own
3. Merge the two most similar clusters into one:
   - replace the two clusters by the new cluster
   - recompute intercluster similarity scores w.r.t. the new cluster
   - if cluster radius > max-size, block further merging
4. Repeat step 3 until there are only k clusters left (note: k could = 1)
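Steps 1-4 map onto SciPy's hierarchical-clustering API (a sketch; the max-radius blocking in step 3 would need a custom merge loop, which this library call does not do, and the random matrix stands in for document vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)                          # 20 documents, 5 features
Z = linkage(X, method="average", metric="cosine")  # pairwise sims + merges
labels = fcluster(Z, t=4, criterion="maxclust")    # stop at k = 4 clusters
```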
Group Agglomerative Clustering
(Figure: nine numbered points merged step by step into nested clusters.)
K-Means Clustering
1. Select k seeds s.t. $d(k_i, k_j) > d_{min}$
2. Assign points to clusters by minimum distance:
   $Cluster(p_i) = \operatorname{Argmin}_{s_j \in \{s_1,\dots,s_k\}} d(p_i, s_j)$
3. Compute new cluster centroids:
   $c_j = \frac{1}{n_j} \sum_{p_i \in j\text{th cluster}} p_i$
4. Reassign points to clusters (as in step 2)
5. Iterate until no points change clusters
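A from-scratch numpy sketch of steps 2-5 (step 1 here uses plain random seeds rather than enforcing d(k_i, k_j) > d_min):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]    # step 1
    for _ in range(iters):
        # steps 2/4: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: recompute centroids (keep old one if a cluster empties)
        new = np.array([X[labels == j].mean(axis=0)
                        if (labels == j).any() else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):                    # step 5
            break
        centroids = new
    return labels, centroids
```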
K-Means Clustering: Initial Data Points
  Step 1: Select k random seeds s.t. $d(k_i, k_j) > d_{min}$ (initial seeds, if k = 3)
K-Means Clustering: First-Pass Clusters
  Step 2: Assign points to clusters by min distance: $Cluster(p_i) = \operatorname{Argmin}_{s_j \in \{s_1,\dots,s_k\}} d(p_i, s_j)$
K-Means Clustering: Seeds → Centroids
  Step 3: Compute new cluster centroids: $c_j = \frac{1}{n_j} \sum_{p_i \in j\text{th cluster}} p_i$
K-Means Clustering: Second-Pass Clusters
  Step 4: Recompute $Cluster(p_i) = \operatorname{Argmin}_{c_j \in \{c_1,\dots,c_k\}} d(p_i, c_j)$; note: some data points are reassigned
Cluster Optimization (finding “k”)
Minimize the ratio of within-cluster to between-cluster distances:
$$k^* = \operatorname*{Argmin}_{k \in [1,n]} \frac{\operatorname{average}\big(d(x_i, x_j)\big),\; x_i, x_j \in \text{same cluster},\, i \ne j}{\operatorname{average}\big(d(x_k, x_l)\big),\; x_k, x_l \in \text{different clusters},\, k \ne l}$$
or, normalizing per cluster and using cluster centroids:
$$k^* = \operatorname*{Argmin}_{k \in [1,n]} \frac{\frac{1}{k}\sum_{c \in C_k} \frac{1}{|c|^2} \sum_{x_i, x_j \in c} d(x_i, x_j)}{\frac{1}{k^2} \sum_{c_l \ne c_m \in C_k} d\big(cen(c_l), cen(c_m)\big)}$$
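A sketch of the search over k, using distance-to-centroid as a cheap stand-in for the average pairwise intra-cluster distance:

```python
import numpy as np

def ratio_score(X, labels):
    ks = list(np.unique(labels))
    if len(ks) < 2:
        return np.inf
    cents = np.array([X[labels == c].mean(axis=0) for c in ks])
    intra = np.mean([np.linalg.norm(x - cents[ks.index(c)])
                     for x, c in zip(X, labels)])
    inter = np.mean([np.linalg.norm(cents[i] - cents[j])
                     for i in range(len(ks)) for j in range(i + 1, len(ks))])
    return intra / inter

# e.g. with the kmeans() sketch above:
# best_k = min(range(2, 10), key=lambda k: ratio_score(X, kmeans(X, k)[0]))
```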
Clustering for Novelty Detection
Functionality → Technology
  Build background model → (Hierarchical) k-means; expected events become clusters
  Find divergences → Divergence metrics: radial density gradients from the cluster centroid; temporally-adaptive distance measures; secondary peaks in the density function
    Individual outliers (but many false positives)
    New mini-clusters (unmasked new-event detection)
    Detect when a novel event is masked by ordinary ones
  Trigger alerts → Route & prioritize; formulate hypotheses for the analyst
  Modeling methods → Create analyst profiles; RETE-based SAMs methods (last PI-meeting ARGUS paper)
Cluster Evolution
  (Figures: constant event, new obfuscated event, new un-obfuscated event, growing event)
Cluster Density Changes
  (Figures: constant event, new obfuscated event, new unobfuscated event, growing event)
Novelty Detection and Profile Management
(Diagram: data streams feed the matcher and the novelty detector; the matcher consults analyst profiles; novel items generate new profiles for the analyst.)
Results on Medical Data
New mini-cluster analysis reveals outbreaks of:
  • Tularemia
  • Dengue Fever
  • Myiasis
  • Chagas Disease
SARS outbreak simulation:
  Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
  The graph shows the resulting secondary peak in the pulmonary disease density function
What Is Rare Category Detection?
  Start de novo
  Very skewed classes:
    majority classes
    minority classes
  Labeling oracle
Goal: discover the minority classes with a few label requests
Comparison with Outlier
Detection
Rare classes: a group of points; clustered; non-separable from the majority classes
Outliers: a single point; scattered; separable
Applications
  Fraud detection
  Network intrusion detection
  Astronomy
  Spam image detection
The Big Picture
(Diagram: raw data → feature extraction → feature representation (relational, temporal) → unbalanced, unlabeled data set → rare category detection → learning in unbalanced settings → classifier.)
Questions We Want to Address
How to detect rare categories in an unbalanced, unlabeled
data set with the help of an oracle?
How to detect rare categories with different data types, such
as graph data, stream data, etc?
How to do rare category detection with the least information
about the data set?
How to select relevant features for the rare categories?
How to design effective classification algorithms which fully
exploit the property of the minority classes (rare category
classification)?
Notation
  Unlabeled examples: $S = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
  m classes: $y_i \in \{1, \dots, m\}$
  m-1 rare classes with priors $p_2, \dots, p_m$
  One majority class with prior $p_1$, where $p_1 \gg p_c$, $2 \le c \le m$
Goal: find at least ONE example from each rare class by requesting a few labels
Assumptions
The distribution of the majority class is sufficiently
smooth
Examples from the minority classes form compact
clusters in the feature space
(Figure: a smooth majority-class density with a compact minority-class cluster appearing as a narrow spike.)
Two Classes: NNDB
1. Calculate the class-specific radius r
2. $\forall x_i \in S$: $NN(x_i, r) = \{x : \|x - x_i\| \le r\}$, $n_i = |NN(x_i, r)|$
3. $s_i = \max_{x_j \in NN(x_i, tr)} (n_i - n_j)$
4. Query $x' = \arg\max_{x_i \in S} s_i$
5. If x′ belongs to the rare class, output x′
6. Otherwise increase t by 1 and repeat from step 3
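A compact sketch of the loop (here r is passed in directly, whereas the real algorithm derives a class-specific r from the rare-class prior; oracle is a hypothetical callback returning True for rare-class points):

```python
import numpy as np

def nndb(X, oracle, r, max_queries=20):
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    n = (d <= r).sum(axis=1)                   # n_i = |NN(x_i, r)|
    asked = set()
    for t in range(1, max_queries + 1):
        # s_i = max over x_j in NN(x_i, t*r) of (n_i - n_j)
        s = np.array([n[i] - n[d[i] <= t * r].min()
                      for i in range(len(X))], dtype=float)
        s[list(asked)] = -np.inf               # don't re-query points
        q = int(np.argmax(s))
        asked.add(q)
        if oracle(q):                          # rare-class example found
            return q
    return None
```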
NNDB: Calculate Nearest Neighbors
(Figure: each point’s neighborhood of radius r; $NN(x_i, r) = \{x : \|x - x_i\| \le r\}$, $n_i = |NN(x_i, r)|$.)
NNDB: Calculate the Scores
(Figure: neighborhoods grown to radius tr; $s_i = \max_{x_j \in NN(x_i, tr)} (n_i - n_j)$; query $x' = \arg\max_{x_i \in S} s_i$.)
NNDB: Pick the Next Candidate
(Figure: t increased by 1; $s_i = \max_{x_j \in NN(x_i, (t+1)r)} (n_i - n_j)$; query $x' = \arg\max_{x_i \in S} s_i$.)
Why NNDB Works
Theoretically
  Theorem 1 [He & Carbonell 2007]: under certain conditions, with high probability, after a few iteration steps, NNDB queries at least one example whose probability of coming from the minority class is at least 1/3
Intuitively
  The score $s_i$ measures the change in local density
  (Figure: the minority cluster produces a sharp local change in density.)
Multiple Classes: ALICE
  m-1 rare classes with priors $p_2, \dots, p_m$; one majority class with prior $p_1$, $p_1 \gg p_c$, $2 \le c \le m$
1. For each rare class c, $2 \le c \le m$:
2.   If we have already found examples from class c, skip it
3.   Otherwise, run NNDB with prior $p_c$
Why ALICE Works
Theoretically
Theorem 2 [He & Carbonell 2008]: under
certain conditions, with high probability, in
each outer loop of ALICE, after a few
iteration steps in NNDB, ALICE queries at least
one example whose probability of coming from
one minority class is at least 1/3
Implementation Issues
ALICE
  Problem: repeatedly sampling from the same rare class
MALICE
  Solution: relevance feedback with a class-specific radius
Results on Synthetic Data Sets
Summary of Real Data Sets
            Abalone          Shuttle
Examples    4177             4515
Features    7-dimensional    9-dimensional
Classes     20               7
Largest     16.50%           75.53%
Smallest    0.34%            0.13%
Results on Real Data Sets
(Plots: Abalone and Shuttle; curves for MALICE, Interleave, and random sampling.)
Imprecise priors
(Plots: classes discovered vs. number of selected examples on Abalone (up to 20 classes, 250 examples) and Shuttle (up to 7 classes, 100 examples), with class priors perturbed by -20% to +20%.)
Specially Designed Exponential Families [Efron & Tibshirani 1996]
  A favorable compromise between parametric and nonparametric density estimation
  Estimated density: $g(x) = g_0(x) \exp\big(\theta_0 + \theta^T t(x)\big)$
  where $g_0(x)$ is the carrier density, $\theta_0$ the normalizing parameter, $\theta$ a $p \times 1$ parameter vector, and $t(x)$ a $p \times 1$ vector of sufficient statistics
SEDER Algorithm
  Carrier density: kernel density estimator
  Sufficient statistics: $t(x) = \big((x^1)^2, \dots, (x^d)^2\big)^T$
  To decouple the estimation of the different parameters: decompose $\theta_0 = \sum_{j=1}^d \theta_0^j$ and relax the normalization constraint so that each dimension j integrates out separately with its own kernel bandwidth $\sigma_j$
Parameter Estimation
  Theorem 3 [to appear]: the maximum likelihood estimates $\hat\theta_{0i}^j$ and $\hat\sigma_j$, $j = 1, \dots, d$, satisfy a set of coupled fixed-point conditions in which each training point is re-weighted by Gaussian kernel terms $\exp\big(-\frac{(x_k^j - x_i^j)^2}{2\sigma_j^2}\big)$
Parameter Estimation (cont.)
  Substituting $b^j$, a positive parameter, the condition for each $j = 1, \dots, d$ reduces to a quadratic with solution
$$\hat b^j = \frac{B \pm \sqrt{B^2 - 4AC}}{2A}$$
  where A, B, and C are kernel-weighted sums over the data; $\hat b^j \ge 1$ in most cases
Scoring Function
  The estimated density $\tilde g(x)$ is a product, over the d dimensions, of kernel terms with the rescaled bandwidths $b^j \sigma_j$, averaged over the n data points
  Scoring function: the norm of the gradient of the estimated density at each point $x_k$, so points where the density changes fastest score highest:
$$s_k = \big\|\nabla \tilde g(x_k)\big\|$$
Results on Synthetic Data Sets
Summary of Real Data Sets
Data Set      n     d   m   Largest Class   Smallest Class
Ecoli         336   7   6   42.56%          2.68%            (moderately skewed)
Glass         214   9   6   35.51%          4.21%            (moderately skewed)
Page Blocks   5473  10  5   89.77%          0.51%            (extremely skewed)
Abalone       4177  7   20  16.50%          0.34%            (extremely skewed)
Shuttle       4515  9   7   75.53%          0.13%            (extremely skewed)
Moderately Skewed Data Sets
(Plots: Ecoli and Glass; MALICE shown for comparison.)
Extremely Skewed Data Sets
(Plots: Page Blocks, Abalone, and Shuttle; MALICE shown for comparison.)
Additional Notation
  $W$: $n \times n$ pairwise similarity matrix
  $D$: $n \times n$ diagonal matrix with $D_{ii} = \sum_{j=1}^n W_{ij}$
  $\bar W = D^{-1/2} W D^{-1/2}$: normalized similarity matrix
  $A = (I_{n \times n} - \alpha \bar W)^{-1}$: global similarity matrix, where $I_{n \times n}$ is the identity matrix and $\alpha$ is a positive parameter close to 1
Global Similarity Matrix
  $A = (I_{n \times n} - \alpha \bar W)^{-1}$
  Better than the pairwise similarity matrix for rare category detection
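Computing A from a pairwise similarity matrix (a sketch following the reconstructed definition above; alpha = 0.95 is an arbitrary choice of the "close to 1" parameter):

```python
import numpy as np

def global_similarity(W, alpha=0.95):
    # A = (I - alpha * W_bar)^(-1), with W_bar = D^(-1/2) W D^(-1/2)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    W_bar = d_inv_sqrt @ W @ d_inv_sqrt
    return np.linalg.inv(np.eye(len(W)) - alpha * W_bar)
```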
GRADE: Full Prior Information
1. For each rare class c, $2 \le c \le m$:
2.   Calculate the class-specific similarity threshold $a_c$
3.   $\forall x_i \in S$: $NN(x_i, a_c) = \{x : A(x, x_i) \ge a_c\}$, $n_i^c = |NN(x_i, a_c)|$
4.   $s_i = \max_{x_j \in NN(x_i, a_c/t)} (n_i^c - n_j^c)$, applying relevance feedback each round
5.   Query $x' = \arg\max_{x_i \in S} s_i$
6.   If x′ belongs to class c, output x′; otherwise increase t by 1 and repeat from step 4
GRADE-LI: Less Prior Information
1. Calculate the problem-specific similarity threshold a
2. $\forall x_i \in S$: $NN(x_i, a) = \{x : A(x, x_i) \ge a\}$, $n_i = |NN(x_i, a)|$
3. $s_i = \max_{x_j \in NN(x_i, a/t)} (n_i - n_j)$, applying relevance feedback each round
4. Query $x' = \arg\max_{x_i \in S} s_i$
5. If x′ belongs to a new class, output x′
6. Increase t by 1 and repeat from step 3 until the labeling budget is exhausted
Results on Real Data Sets
(Plots: Ecoli, Glass, Abalone, and Shuttle; MALICE shown for comparison.)
Applying Machine Learning for
Data Mining in Business
Step 1: Have a clear objective to optimize
Step 2: Have sufficient data
Step 3: Clean, normalize, then clean the data some more
Step 4: Make sure there isn’t an easy solution (e.g. a small number of rules from an expert)
Step 5: Do the data mining for real
Step 6: Cross-validate, improve, go to step 5
Managing the Data Mining Process
Ingredients for successful DM
Data (warehouse, stream, DBs, …)
Right problems (objectives, …)
Tools (Machine Learning tool suites, …)
People (analogy to surgical team: next slide)
Estimate (size) problem, approach, progress
ROI (max, min, realistic)
Determine if DM is likely best approach
Deploy team
Evaluate intermediate results
The Data Mining Team
The Administrator (manager & domain)
Pick problem, resources, ROI calc, monitor, …
The Surgeon (ML specialist w/domain knowledge)
Select ML method, predictor atts, objective, …
The Anesthesiologist (preparer)
Chief data specialist, sampling, coverage, …
The Nurses (assistants)
DB manager, programmers, gophers …
The Medical Students
Prepare new surgeons: learn by doing
Need Some Domain Expertise
Data Preparation
What are good candidate predictor att’s?
How to combine multiple objectives?
How to sample? (e.g. id cyclic periods)
Progress monitoring and results interpretation
How accurate must prediction be?
Do we need more or different data?
Are we pursuing reasonable objective(s)?
Application of DM after accomplished
Update of DM when/as environment evolves
Typical Data Mining Pitfalls
Insufficient data to establish predictive patterns
Incorrect selection of predictor attributes
Statistics to the rescue (e.g. χ² test)
Unrealistic objectives (e.g. fraud recovery)
Inappropriate ML method selection
Data preparation problems
Failure to normalize across data sets
Systematic bias in original data collection
Belief in DM as panacea or black magic
Giving up too soon (very common)
Final Words on Data Mining
Data Mining is:
1/3 science (math, algorithms, …)
…and 1/3 engineering (data prep, analysis, …)
…and 1/3 “art” (experience really counts)
10 years ago it was mostly art
10 years from now it will be mostly engineering
What to expect from the research labs?
Better supervised algorithms
Focus on unsupervised learning + optimization
Move to incorporate semi-structured (text) data
THANK YOU!