Download CVFDT algorithm

Mining Decision Trees from Data Streams Tong Suk Man Ivy CSIS DB Seminar February 12, 2003 1 Contents  Introduction: problems in mining data streams  Classification of stream data  VFDT algorithm  Window approach  CVFDT algorithm  Experimental results  Conclusions  Future work 2 Data Streams  Characteristics  Large volume of ordered data points, possibly infinite  Arrive continuously  Fast changing  Appropriate model for many applications:  Phone call records  Network and security monitoring  Financial applications (stock exchange)  Sensor networks 3 Problems in Mining Data Streams  Traditional data mining techniques usually require  Entire data set to be present  Random access (or multiple passes) to the data  Much time per data item  Challenges of stream mining  Impractical to store the whole data  Random access is expensive  Simple calculation per data due to time and space constraints 4 Classification of Stream Data  VFDT algorithm  “Mining High-Speed Data Streams”, KDD 2000. Pedro Domingos, Geoff Hulten  CVFDT algorithm (window approach)  “Mining Time-Changing Data Streams”, KDD 2001. Geoff Hulten, Laurie Spencer, Pedro Domingos 5 Hoeffding Trees 6 Definitions  A classification problem is defined as:  N is a set of training examples of the form (x, y)  x is a vector of d attributes  y is a discrete class label  Goal: To produce from the examples a model y=f(x) that predict the classes y for future examples x with high accuracy 7 Decision Tree Learning  One of the most effective and widely-used classification methods  Induce models in the form of decision trees  Each node contains a test on the attribute  Each branch from a node corresponds to a possible outcome of the test  Each leaf contains a class prediction  A decision tree is learned by recursively replacing leaves by test nodes, starting at the root Age<30? Yes No Car Type= Sports Car? Yes Yes No No 8 Challenges  Classic decision tree learners assume all training data can be simultaneously stored in main memory  Disk-based decision tree learners repeatedly read training data from disk sequentially  Prohibitively expensive when learning complex trees  Goal: design decision tree learners that read each example at most once, and use a small constant time to process it 9 Key Observation  In order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node.  Given a stream of examples, use the first ones to choose the root attribute.  Once the root attribute is chosen, the successive examples are passed down to the corresponding leaves, and used to choose the attribute there, and so on recursively.  Use Hoeffding bound to decide how many examples are enough at each node 10 Hoeffding Bound  Consider a random variable a whose range is R  Suppose we have n observations of a _  Mean: a  Hoeffding bound states: With probability 1- , the true mean of a is at least _ a   , where R 2 ln( 1 /  )  2n 11 How many examples are enough?  Let G(Xi) be the heuristic measure used to choose test attributes (e.g. Information Gain, Gini Index)  Xa : the attribute with the highest attribute evaluation value after seeing n examples.  Xb : the attribute with the second highest split evaluation function value after seeing n examples.  Given a desired , if G  G( X a )  G( X b )   after seeing n examples at a node,  Hoeffding bound guarantees the true G  G    0 probability 1-. , with  This node can be split using Xa, the succeeding examples will be passed to the new leaves. R 2 ln( 1 /  )  2n 12 Algorithm  Calculate the information gain for the attributes and determines the best two attributes  Pre-pruning: consider a “null” attribute that consists of not splitting the node  At each node, check for the condition G  G( X a )  G( X b )    If condition satisfied, create child nodes based on the test at the node  If not, stream in more examples and perform calculations till condition satisfied 13 Age<30? Yes Data Stream No Yes _ _ G(Car Type) - G(Gender)   Age<30? Yes No Car Type= Sports Car? Yes Data Stream Yes No No 14 Performance Analysis  p: probability that an example passed through DT to level i will fall into a leaf at that point  The expected disagreement between the tree produced by Hoeffding tree algorithm and that produced using infinite examples at each node is no greater than  /p.  Required memory: O(leaves * attributes * values * classes) 15 VFDT 16 VFDT (Very Fast Decision Tree)  A decision-tree learning system based on the Hoeffding tree algorithm  Split on the current best attribute, if the difference is less than a user-specified threshold  Wasteful to decide between identical attributes  Compute G and check for split periodically  Memory management  Memory dominated by sufficient statistics  Deactivate or drop less promising leaves when needed  Bootstrap with traditional learner  Rescan old data when time available 17 VFDT(2)  Scales better than pure memory-based or pure disk-based learners  Access data sequentially  Use subsampling to potentially require much less than one scan  VFDT is incremental and anytime  New examples can be quickly incorporated as they arrive  A usable model is available after the first few examples and then progressively defined 18 Experiment Results (VFDT vs. C4.5)  Compared VFDT and C4.5 (Quinlan, 1993)  Same memory limit for both (40 MB)  100k examples for C4.5  VFDT settings: δ= 10-7, τ= 5%, nmin=200  Domains: 2 classes, 100 binary attributes  Fifteen synthetic trees 2.2k – 500k leaves  Noise from 0% to 30% 19 Experiment Results Accuracy as a function of the number of training examples 20 Experiment Results Tree size as a function of number of training examples 21 Mining Time-Changing Data Stream  Most KDD systems, include VFDT, assume training data is a sample drawn from stationary distribution  Most large databases or data streams violate this assumption  Concept Drift: data is generated by a time-changing concept function, e.g.  Seasonal effects  Economic cycles  Goal:  Mining continuously changing data streams  Scale well 22 Window Approach  Common Approach: when a new example arrives, reapply a traditional learner to a sliding window of w most recent examples  Sensitive to window size  If w is small relative to the concept shift rate, assure the availability of a model reflecting the current concept  Too small w may lead to insufficient examples to learn the concept  If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high. 23 CVFDT 24 CVFDT  CVFDT (Concept-adapting Very Fast Decision Tree learner)  Extend VFDT  Maintain VFDT’s speed and accuracy  Detect and respond to changes in the examplegenerating process 25 Observations  With a time-changing concept, the current splitting attribute of some nodes may not be the best any more.  An outdated subtree may still be better than the best single leaf, particularly if it is near the root.  Grow an alternative subtree with the new best attribute at its root, when the old attribute seems out-of-date.  Periodically use a bunch of samples to evaluate qualities of trees.  Replace the old subtree when the alternate one becomes more accurate. 26 CVFDT algorithm  Alternate trees for each node in HT start as empty.  Process examples from the stream indefinitely. For each example (x, y),  Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.  Add (x, y) to the sliding window of examples.  Remove and forget the effect of the oldest examples, if the sliding window overflows.  CVFDTGrow  CheckSplitValidity if f examples seen since last checking of alternate trees.  Return HT. 27 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 28 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 29 CVFDTGrow  For each node reached by the example in HT,  Increment the corresponding statistics at the node.  For each alternate tree Talt of the node,  CVFDTGrow  If enough examples seen at the leaf in HT which the example reaches,  Choose the attribute that has the highest average value of the attribute evaluation measure (information gain or gini index).  If the best attribute is not the “null” attribute, create a node for each possible value of this attribute 30 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 31 Forget old example  Maintain the sufficient statistics at every node in HT to monitor the validity of its previous decisions.  VFDT only maintain such statistics at leaves.  HT might have grown or changed since the example was initially incorporated.  Assigned each node a unique, monotonically increasing ID as they are created.  forgetExample (HT, example, maxID)  For each node reached by the old example with node ID no larger than the max leave ID the example reaches,  Decrement the corresponding statistics at the node.  For each alternate tree Talt of the node, forget(Talt, example, maxID). 32 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 33 CheckSplitValidtiy  Periodically scans the internal nodes of HT.  Start a new alternate tree when a new winning attribute is found.  Tighter criteria to avoid excessive alternate tree creation.  Limit the total number of alternate trees. 34 Smoothly adjust to concept drift  Alternate trees are grown the same way HT is.  Periodically each node with non-empty alternate trees enter a testing Yes mode.  M training examples to compare accuracy.  Prune alternate trees with non-increasing accuracy over time.  Replace if an alternate tree is more accurate. Yes Age<30? Yes Married? No No No Car Type= Sports Car? Yes Yes No Experience <1 year? No Yes No No Yes 35 Adjust to concept drift(2)  Dynamically change the window size  Shrink the window when many nodes gets questionable or data rate changes rapidly.  Increase the window size when few nodes are questionable. 36 Performance  Require memory O(nodes * attributes * attribute values * classes).  Independent of the total number of examples.  Running time O(Lc * attributes * attribute values * number of classes).  Lc : the longest length an example passes through * number of alternate trees.  Model learned by CVFDT vs. the one learned by VFDT-Window:  Similar in accuracy  O(1) vs. O(window size) per new example. 37 Experiment Results     Compare CVFDT, VFDT, VFDT-Window 5 million training examples Concept changed at every 50k examples Drift Level: average percentage of the test points that changes label at each concept change.  About 8% of test points change label each drift  100,000 examples in window  5% noise  Test the model every 10k examples throughout the run, averaged these results. 38 Experiment Results (CVFDT vs. VFDT) drift level Error rate as a function of number of attributes 39 Experiment Results (CVFDT vs. VFDT) Tree size as a function of number of attributes 40 Experiment Results (CVFDT vs. VFDT) Portion of data set that is labelled -ve Error rates of learners as a function of the number of examples seen 41 Experiment Results (CVFDT vs. VFDT) Error rates as a function of the amount of concept drift 42 Experiment Results CVFDT’s drift characteristics 43 Experiment Results (CVFDT vs. VFDT vs. VFDT-window) Stimulated by running VFDT on W for every 100K examples instead of every example Error Rate: VFDT: 19.4% CVFDT: 16.3% VFDT-Window: 15.3% Running Time: VFDT: 10 minutes CVFDT: 46 minutes VFDT-Window: expect 548 days Error rates over time of CVFDT, VFDT, and VFDT-window 44 Experiment Results  CVFDT not use too much RAM  D=50, CVFDT never uses more than 70MB  Use as little as half the RAM of VFDT  VFDT often had twice as many leaves as the number of nodes in CVFDT’s HT and alternate subtrees combined  Reason: VFDT considers many more outdated examples and is forced to grow larger trees to make up for its earlier wrong decisions due to concept drift 45 Conclusions  CVFDT – a decision-tree induction system capable of learning accurate models from high speed, concept-drifting data streams  Grow an alternative subtree whenever an old one becomes questionable  Replace the old subtree when the new more accurate  Similar in accuracy to applying VFDT to a moving window of examples 46 Future Work  Concepts changed periodically and removed subtrees may become useful again  Comparisons with related systems  Continuous attributes  Weighting examples 47 Reference List  P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.  G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001  V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the Sixteenth International Conference on Data Engineering, 2000.  J. Gehrke, V. Ganti, R. Ramakrishnan, and W.L. Loh. BOAT: optimistic decision tree construction. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999. 48 The end Q&A 49 Thank You! 50

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download CVFDT algorithm