Mining Generalized
Association Rules
Ramakrishnan Srikant
Rakesh Agrawal
Data Mining Seminar, spring semester, 2003
Prof. Amos Fiat
Student: Idit Haran
Outline
Motivation
Terms & Definitions
Interest Measure
Algorithms for mining generalized association rules
Comparison
Conclusions
Motivation
Find association rules of the form:
Diapers → Beer
Different kinds of diapers: Huggies/Pampers, S/M/L, etc.
Different kinds of beers: Heineken/Maccabi, in a bottle/in a can, etc.
The information on the bar-code is of the form: Huggies Diapers M, Heineken Beer in a bottle.
The rule at the bar-code level is not interesting, and probably will not have minimum support.
Taxonomy: is-a hierarchies

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots
Taxonomy - Example
Let's say we found the rule:
Outwear → Hiking Boots
with minimum support and confidence.
The rule Jackets → Hiking Boots may not have minimum support.
The rule Clothes → Hiking Boots may not have minimum confidence.
Taxonomy
Users are interested in generating rules that span different levels of the taxonomy.
Rules at lower levels may not have minimum support.
The taxonomy can be used to prune uninteresting or redundant rules.
Multiple taxonomies may be present, for example: category, price (cheap/expensive), "items on sale", etc.
Multiple taxonomies may be modeled as a forest, or a DAG.
Notations
[Taxonomy diagram: an edge represents an is_a relationship; a node p is the parent of its children c1 and c2; the nodes above an item are its ancestors (marked with ^, e.g. z), and the nodes below it are its descendants.]
Notations
I = {i1, i2, …, im} – the set of items.
T – a transaction, a set of items T ⊆ I (we expect the items in T to be leaves of the taxonomy).
D – the set of transactions.
T supports item x if x is in T or x is an ancestor of some item in T.
T supports X ⊆ I if it supports every item in X.
Notations
A generalized association rule X → Y holds if X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X.
The rule X → Y has confidence c in D if c% of the transactions in D that support X also support Y.
The rule X → Y has support s in D if s% of the transactions in D support X ∪ Y.
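To make the support and confidence definitions concrete, here is a minimal Python sketch (illustrative only; the parent table, the helper names and the example transactions are assumptions based on the example used later in this deck, not code from the paper). Each transaction is extended with the ancestors of its items before the subset test.

# Illustrative sketch: parent table, names and data are assumptions.
parent = {  # item -> parent in the taxonomy
    "Jacket": "Outwear", "Ski Pants": "Outwear", "Outwear": "Clothes",
    "Shirt": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear",
}

def ancestors(item):
    """All ancestors of an item, obtained by walking up the taxonomy."""
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def extend(transaction):
    """T' = T plus the ancestors of every item in T."""
    return set(transaction).union(*(ancestors(i) for i in transaction))

def support(D, itemset):
    """Fraction of transactions whose extension contains every item of the itemset."""
    return sum(set(itemset) <= extend(t) for t in D) / len(D)

def confidence(D, X, Y):
    return support(D, set(X) | set(Y)) / support(D, X)

D = [{"Shirt"}, {"Jacket", "Hiking Boots"}, {"Ski Pants", "Hiking Boots"},
     {"Shoes"}, {"Shoes"}, {"Jacket"}]
print(support(D, {"Outwear", "Hiking Boots"}))       # 2/6 ≈ 0.33
print(confidence(D, {"Outwear"}, {"Hiking Boots"}))  # 2/3 ≈ 0.67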
Problem Statement
To find all generalized association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.
Example
Recall the taxonomy:

Clothes
  Outwear
    Jackets
    Ski Pants
  Shirts
Footwear
  Shoes
  Hiking Boots
Frequent Itemsets - Example

Database D:

  Transaction   Items Bought
  100           Shirt
  200           Jacket, Hiking Boots
  300           Ski Pants, Hiking Boots
  400           Shoes
  500           Shoes
  600           Jacket

Frequent Itemsets:

  Itemset                    Support
  {Jacket}                   2
  {Outwear}                  3
  {Clothes}                  4
  {Shoes}                    2
  {Hiking Boots}             2
  {Footwear}                 4
  {Clothes, Hiking Boots}    2
  {Outwear, Footwear}        2
  {Clothes, Footwear}        2
  {Outwear, Hiking Boots}    2

Rules (minsup = 30%, minconf = 60%):

  Rule                       Support   Confidence
  Outwear → Hiking Boots     33%       66.6%
  Outwear → Footwear         33%       66.6%
  Hiking Boots → Outwear     33%       100%
  Hiking Boots → Clothes     33%       100%
Observation 1
If the set {x, y} has minimum support, so do {x, y^}, {x^, y} and {x^, y^}.
For example: if {Jacket, Shoes} has minsup, so will {Outwear, Shoes}, {Jacket, Footwear}, and {Outwear, Footwear}.
(See the Clothes/Footwear taxonomy above.)
Observation 2
If the rule x → y has minimum support and confidence, only x → y^ is guaranteed to have both minsup and minconf.
For example: the rule Outwear → Hiking Boots has minsup and minconf, so the rule Outwear → Footwear also has both minsup and minconf.
Observation 2 – cont.
However, while the rules x^ → y and x^ → y^ will have minsup, they may not have minconf.
For example: the rules Clothes → Hiking Boots and Clothes → Footwear have minsup, but not minconf.
Interesting Rules - Previous Work
A rule X → Y is not interesting if:
support(X → Y) ≈ support(X) • support(Y)
Previous work does not consider the taxonomy.
The previous interest measure pruned less than 1% of the rules on a real database.
Interesting Rules - Using the Taxonomy
Milk → Cereal (8% support, 70% confidence)
Milk is the parent of Skim Milk, and 25% of the sales of Milk are Skim Milk.
We expect Skim Milk → Cereal to have 2% support and 70% confidence.
R-Interesting Rules
A rule X → Y is R-interesting w.r.t. an ancestor X^ → Y^ if:
  real support(X → Y) > R • expected support of (X → Y) based on (X^ → Y^), or
  real confidence(X → Y) > R • expected confidence of (X → Y) based on (X^ → Y^).
With R = 1.1, about 40-55% of the rules were pruned.
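A minimal sketch of the R-interest test above (the helper names and the concrete numbers are assumptions; only the comparison itself follows the slide). The expected support of a specialized rule is obtained by scaling the ancestor rule's support by the share each specialized item has of its ancestor, as in the Milk/Skim Milk example; the expected confidence is taken to be unchanged.

# Illustrative sketch: function names and example numbers are assumptions.
def expected_support(ancestor_support, item_fractions):
    """Scale the ancestor rule's support by each specialized item's share
    of its ancestor (e.g. Skim Milk is 25% of Milk sales)."""
    exp = ancestor_support
    for fraction in item_fractions:
        exp *= fraction
    return exp

def is_r_interesting(sup, conf, exp_sup, exp_conf, R=1.1):
    """Keep a rule if its real support OR real confidence exceeds R times
    the value expected from its ancestor rule."""
    return sup > R * exp_sup or conf > R * exp_conf

# Milk -> Cereal: 8% support, 70% confidence; Skim Milk is 25% of Milk sales.
exp_sup = expected_support(0.08, [0.25])   # 2% expected support for Skim Milk -> Cereal
print(is_r_interesting(0.021, 0.70, exp_sup, 0.70))  # False: close to the expectation
print(is_r_interesting(0.040, 0.70, exp_sup, 0.70))  # True: well above the expectation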
Problem Statement (new)
To find all generalized R-interesting association rules (R is a user-specified minimum interest called min-interest) that have support and confidence greater than minsup and minconf, respectively.
Algorithms – 3 steps
1. Find all itemsets whose support is greater than minsup. These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules (see the sketch below):
   if ABCD and AB are frequent, then conf(AB → CD) = support(ABCD) / support(AB)
3. Prune all uninteresting rules from this set.
*All presented algorithms implement only step 1.
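A minimal sketch of step 2 (hypothetical helper names; it assumes the frequent itemsets and their supports from step 1 are available as a dictionary). For every frequent itemset, each non-empty proper subset is tried as an antecedent, and the rule is kept if its confidence reaches minconf.

from itertools import combinations

def generate_rules(frequent, minconf):
    """frequent: dict mapping frozenset(itemset) -> support (a fraction).
    Returns a list of (antecedent, consequent, support, confidence)."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # conf(A -> B) = support(A ∪ B) / support(A); every subset of a
                # frequent itemset is frequent, so the lookup always succeeds.
                conf = sup / frequent[antecedent]
                if conf >= minconf:
                    rules.append((set(antecedent), set(itemset - antecedent), sup, conf))
    return rules

frequent = {
    frozenset({"Outwear"}): 0.50,
    frozenset({"Hiking Boots"}): 0.33,
    frozenset({"Outwear", "Hiking Boots"}): 0.33,
}
for rule in generate_rules(frequent, minconf=0.60):
    print(rule)   # Outwear -> Hiking Boots (66%), Hiking Boots -> Outwear (100%)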
Algorithms (step 1)
Input: Database, Taxonomy
Output: all frequent itemsets
3 algorithms (same output, different run-time): Basic, Cumulate, EstMerge
Algorithm Basic – Main Idea
Is itemset X frequent?
Does transaction T support X?
(X may contain items from different levels of the taxonomy; T contains only leaves.)
T’ = T + ancestors(T)
Answer: T supports X ⟺ X ⊆ T’
Algorithm Basic
L1 = {frequent 1-itemsets}                         // count item occurrences
for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
    Ck = apriori-gen(Lk-1)                         // generate new candidate k-itemsets
    forall transactions t ∈ D do begin
        t = add-ancestors(t, T)                    // add all ancestors of each item in t to t, removing duplicates
        Ct = subset(Ck, t)                         // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++                              // find the support of all the candidates
    end
    Lk = { c ∈ Ck | c.count ≥ minsup }             // take only those with support over minsup
end
Answer = ∪k Lk
Candidate generation
Join step: p and q are two frequent (k-1)-itemsets that are identical in their first k-2 items. Join them by adding the last item of q to p:

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Prune step: check all the (k-1)-subsets, and remove any candidate with an infrequent subset:

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if ( s ∉ Lk-1 ) then
            delete c from Ck
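A minimal Python sketch of the candidate-generation step above (itemsets are represented as sorted tuples; the function name matches the pseudocode, but the implementation details are mine).

from itertools import combinations

def apriori_gen(L_prev):
    """L_prev: collection of frequent (k-1)-itemsets, each a sorted tuple.
    Returns the candidate k-itemsets after the join and prune steps."""
    L_prev = set(L_prev)
    # Join step: merge two (k-1)-itemsets that agree on their first k-2 items.
    joined = {p + (q[-1],)
              for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: drop candidates that have an infrequent (k-1)-subset.
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, len(c) - 1))}

L2 = {("Clothes", "Hiking Boots"), ("Clothes", "Shoes"), ("Hiking Boots", "Shoes")}
print(apriori_gen(L2))   # {('Clothes', 'Hiking Boots', 'Shoes')}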
Optimization 1
Filtering the ancestors added to transactions
We only need to add to transaction t the ancestors that appear in one of the candidates.
If the original item is not in any candidate itemset, it can be dropped from the transaction.
Example: the only candidate is {Clothes, Shoes}. The transaction t = {Jacket, …} can be replaced with {Clothes, …}.
Optimization 2
Pre-computing ancestors
Rather than finding the ancestors of each item by traversing the taxonomy graph, we can pre-compute the ancestors of each item.
At the same time, we can drop ancestors that are not contained in any of the candidates.
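A minimal sketch of optimizations 1 and 2 combined (the data structures and function names are assumptions): the ancestors of every item are pre-computed once, and ancestors that do not occur in any candidate are then dropped from the table.

def precompute_ancestors(parent):
    """parent: item -> parent in the taxonomy. Returns item -> set of all ancestors."""
    items = set(parent) | set(parent.values())
    table = {}
    for item in items:
        anc, cur = set(), item
        while cur in parent:
            cur = parent[cur]
            anc.add(cur)
        table[item] = anc
    return table

def filter_ancestors(table, candidates):
    """Optimization 1: keep only ancestors that appear in some candidate itemset."""
    needed = set().union(*candidates) if candidates else set()
    return {item: anc & needed for item, anc in table.items()}

parent = {"Jacket": "Outwear", "Ski Pants": "Outwear", "Outwear": "Clothes",
          "Shirt": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear"}
T_star = filter_ancestors(precompute_ancestors(parent), [{"Clothes", "Shoes"}])
print(T_star["Jacket"])   # {'Clothes'} -- Outwear is dropped, no candidate needs it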
Optimization 3
Pruning itemsets containing an item and its ancestor
If we have {Jacket} and {Outwear}, we will have the candidate {Jacket, Outwear}, which is not interesting:
support({Jacket}) = support({Jacket, Outwear})
Deleting {Jacket, Outwear} at k = 2 ensures it will not reappear for k > 2 (because of the prune step of the candidate-generation method).
Therefore, we need to prune candidates containing an item and its ancestor only at k = 2; in the following steps no candidate will include an item together with its ancestor.
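A minimal sketch of optimization 3 (names are assumptions): at k = 2, every candidate pair in which one item is an ancestor of the other is dropped; because of the prune step of candidate generation, later passes then never produce a candidate containing an item together with its ancestor.

def prune_item_ancestor_pairs(C2, ancestor_table):
    """Remove 2-candidates {a, b} where one item is an ancestor of the other."""
    kept = []
    for a, b in C2:
        if b in ancestor_table.get(a, set()) or a in ancestor_table.get(b, set()):
            continue   # support({a}) == support({a, b}): the pair is not interesting
        kept.append((a, b))
    return kept

ancestor_table = {"Jacket": {"Outwear", "Clothes"}, "Outwear": {"Clothes"}}
C2 = [("Jacket", "Outwear"), ("Jacket", "Shoes")]
print(prune_item_ancestor_pairs(C2, ancestor_table))   # [('Jacket', 'Shoes')]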
Algorithm Cumulate
Compute T* (the set of ancestors of each item) from the taxonomy T      // Optimization 2
L1 = {frequent 1-itemsets}                                              // count item occurrences
for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
    Ck = apriori-gen(Lk-1)
    if ( k = 2 ) then prune(C2)                 // Optimization 3: delete any candidate in C2 that consists of an item and its ancestor
    T* = remove-unnecessary(T*, Ck)             // Optimization 1: delete ancestors in T* that are not present in any of the candidates in Ck
    forall transactions t ∈ D do begin
        t = add-ancestors(t, T*)                // Optimization 2: for each item x ∈ t, add all ancestors of x in T* to t, then remove duplicates
        Ct = subset(Ck, t)
        forall candidates c ∈ Ct do
            c.count++
    end
    Lk = { c ∈ Ck | c.count ≥ minsup }
end
Answer = ∪k Lk
Stratification

Consider the candidates {Clothes, Shoes}, {Outwear, Shoes}, {Jacket, Shoes}.
If {Clothes, Shoes} does not have minimum support, we don’t need to count either {Outwear, Shoes} or {Jacket, Shoes}.
We will count in steps:
step 1: count {Clothes, Shoes}, and if it has minsup –
step 2: count {Outwear, Shoes}, and if it has minsup –
step 3: count {Jacket, Shoes}
Version 1: Stratify
Depth of an itemset (see the sketch after this slide):
  Itemsets with no parents are of depth 0.
  Otherwise: depth(X) = max({depth(X^) | X^ is a parent of X}) + 1
The algorithm:
  Count all itemsets C0 of depth 0.
  Delete candidates that are descendants of the itemsets in C0 that did not have minsup.
  Count the remaining itemsets at depth 1 (C1).
  Delete candidates that are descendants of the itemsets in C1 that did not have minsup.
  Count the remaining itemsets at depth 2 (C2), etc.
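A minimal sketch of the depth computation used by Stratify (itemsets as frozensets; the parent table and helper names are assumptions). The parents of an itemset are obtained by replacing one item with its taxonomy parent, keeping only itemsets that are themselves candidates.

parent = {"Jacket": "Outwear", "Ski Pants": "Outwear", "Outwear": "Clothes",
          "Shirt": "Clothes", "Shoes": "Footwear", "Hiking Boots": "Footwear"}

def itemset_parents(itemset, candidates):
    """Replace one item by its taxonomy parent; keep only candidate itemsets."""
    result = set()
    for item in itemset:
        if item in parent:
            generalized = (itemset - {item}) | {parent[item]}
            if generalized in candidates:
                result.add(generalized)
    return result

def depth(itemset, candidates, memo=None):
    """depth = 0 if the itemset has no parents, else 1 + max depth of its parents."""
    memo = {} if memo is None else memo
    if itemset not in memo:
        parents = itemset_parents(itemset, candidates)
        memo[itemset] = 0 if not parents else 1 + max(
            depth(p, candidates, memo) for p in parents)
    return memo[itemset]

candidates = {frozenset(s) for s in ({"Clothes", "Shoes"},
                                     {"Outwear", "Shoes"}, {"Jacket", "Shoes"})}
for c in candidates:
    print(sorted(c), depth(c, candidates))
# {Clothes, Shoes}: depth 0, {Outwear, Shoes}: depth 1, {Jacket, Shoes}: depth 2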
Tradeoff & Optimizations
Tradeoff: the number of candidates counted vs. the number of passes over the database (from counting each depth on a different pass, at one extreme, to Cumulate, which counts all candidates in a single pass, at the other).
Optimization 1: count multiple depths together from a certain level.
Optimization 2: count more than 20% of the candidates per pass.
Version 2: Estimate
Estimating candidates’ support using a sample.
1st pass (C’k):
  count the candidates that are expected to have minsup (a candidate is expected to have minsup if its support in the sample is at least 0.9·minsup);
  count the candidates whose parents are expected to have minsup.
2nd pass (C”k):
  count the children of candidates in C’k that were not expected to have minsup.
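A minimal sketch of how this sample-based split might be implemented (the 0.9 factor comes from the slide; the function names and data structures are my assumptions, not the paper's code). A candidate is placed in C'k if the sample suggests it is frequent or if one of its parents is; the remaining candidates are deferred to C''k.

def split_candidates(candidates, sample_support, parents_of, minsup, factor=0.9):
    """candidates: iterable of itemsets; sample_support: itemset -> support in the sample;
    parents_of: itemset -> iterable of parent itemsets.
    Returns (C_prime, C_second): counted now vs. deferred."""
    def expected_frequent(c):
        return sample_support.get(c, 0.0) >= factor * minsup
    C_prime, C_second = set(), set()
    for c in candidates:
        if expected_frequent(c) or any(expected_frequent(p) for p in parents_of(c)):
            C_prime.add(c)
        else:
            C_second.add(c)
    return C_prime, C_second

# Numbers from the example on the next slide (minsup = 5%):
CS, OS, JS = (frozenset(s) for s in ({"Clothes", "Shoes"},
                                     {"Outwear", "Shoes"}, {"Jacket", "Shoes"}))
sample = {CS: 0.08, OS: 0.04, JS: 0.02}
parents = {CS: set(), OS: {CS}, JS: {OS}}
print(split_candidates([CS, OS, JS], sample, lambda c: parents[c], minsup=0.05))
# C'k = {CS, OS}: CS is expected frequent and it is OS's parent; C''k = {JS}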
Example for Estimate
minsup = 5%

  Candidate            Support in Sample    Support in Database
                                            Scenario A    Scenario B
  {Clothes, Shoes}     8%                   7%            9%
  {Outwear, Shoes}     4%                   4%            6%
  {Jacket, Shoes}      2%                   –             –
Version 3: EstMerge
Motivation: eliminate the 2nd pass of algorithm Estimate.
Implementation: count the candidates of C”k together with the candidates in C’k+1.
Restriction: to create C’k+1 we assume that all candidates in C”k have minsup.
The tradeoff: the extra candidates counted by EstMerge vs. the extra pass made by Estimate.
Algorithm EstMerge
L1 = {frequent 1-itemsets}                              // count item occurrences
Ds = generate-sample(D)                                 // generate a sample of the database in the first pass
for ( k = 2, C”1 = ∅; Lk-1 ≠ ∅ or C”k-1 ≠ ∅; k++ ) do begin
    Ck = generate-candidates(Lk-1, C”k-1)               // generate new candidate k-itemsets from Lk-1 ∪ C”k-1
    C’k = expected-frequent-and-sons(Ds, Ck)            // estimate the candidates’ support over Ds; C’k = candidates expected to have minsup, plus candidates whose parents are expected to have minsup
    find-support(D, C’k, C”k-1)                         // find the support of C’k ∪ C”k-1 in one pass over D
    Ck = prune-descendants(Ck, C’k)                     // delete candidates in Ck whose ancestors in C’k don’t have minsup
    C”k = Ck - C’k                                      // remaining candidates in Ck that are not in C’k
    Lk = { c ∈ C’k | c.count ≥ minsup }                 // all candidates in C’k with minsup
    Lk-1 = Lk-1 ∪ { c ∈ C”k-1 | c.count ≥ minsup }      // add all candidates in C”k-1 with minsup
end
Answer = ∪k Lk
Stratify - Variants
Size of Sample
Pr[support in sample < a]:

                  p = 5%           p = 1%           p = 0.5%         p = 0.1%
                  a=.8p   a=.9p    a=.8p   a=.9p    a=.8p   a=.9p    a=.8p   a=.9p
  n = 1,000       0.32    0.76     0.80    0.95     0.89    0.97     0.98    0.99
  n = 10,000      0.00    0.07     0.11    0.59     0.34    0.77     0.80    0.95
  n = 100,000     0.00    0.00     0.00    0.01     0.00    0.07     0.12    0.60
  n = 1,000,000   0.00    0.00     0.00    0.00     0.00    0.00     0.00    0.01
Performance Evaluation
Compare the running time of the 3 algorithms: Basic, Cumulate and EstMerge.
On synthetic data: the effect of each parameter on performance.
On real data: Supermarket Data, Department Store Data.
Synthetic Data Generation
  Parameter                                                           Default Value
  |D|   Number of transactions                                        1,000,000
  |T|   Average size of the transactions                              10
  |I|   Average size of the maximal potentially frequent itemsets     4
  |𝓘|   Number of maximal potentially frequent itemsets               10,000
  N     Number of items                                               100,000
  R     Number of roots                                               250
  L     Number of levels                                              4-5
  F     Fanout                                                        5
  D     Depth-ratio (probability that an item in a rule comes from    1
        level i / probability that the item comes from level i+1)
Performance graphs (omitted): running time as a function of minimum support, number of transactions, fanout, and number of items.
Reality Check
Supermarket Data:
  548,000 items
  Taxonomy: 4 levels, 118 roots
  ~1.5 million transactions
  Average of 9.6 items per transaction
Department Store Data:
  228,000 items
  Taxonomy: 7 levels, 89 roots
  570,000 transactions
  Average of 4.4 items per transaction
Results
Conclusions
Cumulate and EstMerge were 2 to 5 times faster than Basic on all synthetic datasets. On the supermarket database they were 100 times faster!
EstMerge was ~25-30% faster than Cumulate.
Both EstMerge and Cumulate exhibit linear scale-up with the number of transactions.
Summary
The use of a taxonomy is necessary for finding association rules between items at any level of the hierarchy.
The obvious solution (algorithm Basic) is not very fast.
New algorithms that exploit the taxonomy are much faster.
We can use the taxonomy to prune uninteresting rules.