Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mining Decision Trees from Data Streams Thanks: Tong Suk Man Ivy HKU 1 Contents  Introduction: problems in mining data streams  Classification of stream data  VFDT algorithm  Window approach  CVFDT algorithm  Experimental results  Conclusions  Future work 2 Data Streams  Characteristics  Large volume of ordered data points, possibly infinite  Arrive continuously  Fast changing  Appropriate model for many applications:  Phone call records  Network and security monitoring  Financial applications (stock exchange)  Sensor networks 3 Problems in Mining Data Streams  Traditional data mining techniques usually require  Entire data set to be present  Random access (or multiple passes) to the data  Much time per data item  Challenges of stream mining  Impractical to store the whole data  Random access is expensive  Simple calculation per data due to time and space constraints 4 Processing Data Streams: Motivation  A growing number of applications generate streams of data  Performance measurements in network monitoring and traffic management  Call detail records in telecommunications  Transactions in retail chains, ATM operations in banks  Log records generated by Web Servers  Sensor network data  Application characteristics  Massive volumes of data (several terabytes)  Records arrive at a rapid rate  Goal: Mine patterns, process queries and compute statistics on data streams in real-time (from VLDB’02 Tutorial) 5 Data Streams: Computation Model Data Streams Synopsis in Memory Stream Processing Engine e1 ,..., en (Approximate) Answer  A data stream is a (massive) sequence of elements: Stream processing requirements  Single pass: Each record is examined at most once  Bounded storage: Limited Memory (M) for storing synopsis  Real-time: Per record processing time (to maintain synopsis) must be low 6 Network Management Application  Network Management involves monitoring and configuring network hardware and software to ensure smooth operation  Monitor link bandwidth usage, estimate traffic demands  Quickly detect faults, congestion and isolate root cause  Load balancing, improve utilization of network resources Measurements Alarms Network Operations Center Network (from VLDB’02 Tutorial) 7 IP Network Measurement Data  IP session data (collected using Cisco NetFlow) Source 10.1.0.2 18.6.7.1 13.9.4.3 15.2.2.9 12.4.3.8 10.5.1.3 11.1.0.6 19.7.1.2 Destination 16.2.3.7 12.4.0.3 11.6.8.2 17.1.2.1 14.8.7.4 13.0.0.1 10.3.4.5 16.5.5.8 Duration 12 16 15 19 26 27 32 18 Bytes 20K 24K 20K 40K 58K 100K 300K 80K Protocol http http http http http ftp ftp ftp  AT&T collects 100 GBs of NetFlow data each day! (from VLDB’02 Tutorial) 8 (from VLDB’02 Tutorial) Network Data Processing  Traffic estimation  How many bytes were sent between a pair of IP addresses?  What fraction network IP addresses are active?  List the top 100 IP addresses in terms of traffic  Traffic analysis  What is the average duration of an IP session?  What is the median of the number of bytes in each IP session?  Fraud detection  List all sessions that transmitted more than 1000 bytes  Identify all sessions whose duration was more than twice the normal  Security/Denial of Service  List all IP addresses that have witnessed a sudden spike in traffic  Identify IP addresses involved in more than 1000 sessions 9 Data Stream Processing Algorithms  Generally, algorithms compute approximate answers  Difficult to compute answers accurately with limited memory  Approximate answers - Deterministic bounds  Algorithms only compute an approximate answer, but bounds on error  Approximate answers - Probabilistic bounds  Algorithms compute an approximate answer with high probability  With probability at least of the actual  answer 1,the  computed answer is within a factor  Single-pass algorithms for processing streams also applicable to (massive) terabyte databases! (from VLDB’02 Tutorial) 10 Classification of Stream Data  VFDT algorithm  “Mining High-Speed Data Streams”, KDD 2000. Pedro Domingos, Geoff Hulten  CVFDT algorithm (window approach)  “Mining Time-Changing Data Streams”, KDD 2001. Geoff Hulten, Laurie Spencer, Pedro Domingos 11 Hoeffding Trees 12 Definitions  A classification problem is defined as:  N is a set of training examples of the form (x, y)  x is a vector of d attributes  y is a discrete class label  Goal: To produce from the examples a model y=f(x) that predict the classes y for future examples x with high accuracy 13 Decision Tree Learning  One of the most effective and widely-used classification methods  Induce models in the form of decision trees  Each node contains a test on the attribute  Each branch from a node corresponds to a possible outcome of the test  Each leaf contains a class prediction  A decision tree is learned by recursively replacing leaves by test nodes, starting at the root Age<30? Yes No Car Type= Sports Car? Yes Yes No No 14 Challenges  Classic decision tree learners assume all training data can be simultaneously stored in main memory  Disk-based decision tree learners repeatedly read training data from disk sequentially  Prohibitively expensive when learning complex trees  Goal: design decision tree learners that read each example at most once, and use a small constant time to process it 15 Key Observation  In order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node.  Given a stream of examples, use the first ones to choose the root attribute.  Once the root attribute is chosen, the successive examples are passed down to the corresponding leaves, and used to choose the attribute there, and so on recursively.  Use Hoeffding bound to decide how many examples are enough at each node 16 Hoeffding Bound  Consider a random variable a whose range is R  Suppose we have n observations of a _  Mean: a  Hoeffding bound states: With probability 1- , the true mean of a is at least _ a   , where R 2 ln( 1 /  )  2n 17 How many examples are enough?  Let G(Xi) be the heuristic measure used to choose test attributes (e.g. Information Gain, Gini Index)  Xa : the attribute with the highest attribute evaluation value after seeing n examples.  Xb : the attribute with the second highest split evaluation function value after seeing n examples.  Given a desired , if G  G( X a )  G( X b )   after seeing n examples at a node,  Hoeffding bound guarantees the true G  G    0 probability 1-. , with  This node can be split using Xa, the succeeding examples will be passed to the new leaves. R 2 ln( 1 /  )  2n 18 Algorithm  Calculate the information gain for the attributes and determines the best two attributes  Pre-pruning: consider a “null” attribute that consists of not splitting the node  At each node, check for the condition G  G( X a )  G( X b )    If condition satisfied, create child nodes based on the test at the node  If not, stream in more examples and perform calculations till condition satisfied 19 Age<30? Yes Data Stream No Yes _ _ G(Car Type) - G(Gender)   Age<30? Yes No Car Type= Sports Car? Yes Data Stream Yes No No 20 Performance Analysis  p: probability that an example passed through DT to level i will fall into a leaf at that point  The expected disagreement between the tree produced by Hoeffding tree algorithm and that produced using infinite examples at each node is no greater than  /p.  Required memory: O(leaves * attributes * values * classes) 21 VFDT 22 VFDT (Very Fast Decision Tree)  A decision-tree learning system based on the Hoeffding tree algorithm  Split on the current best attribute, if the difference is less than a user-specified threshold  Wasteful to decide between identical attributes  Compute G and check for split periodically  Memory management  Memory dominated by sufficient statistics  Deactivate or drop less promising leaves when needed  Bootstrap with traditional learner  Rescan old data when time available 23 VFDT(2)  Scales better than pure memory-based or pure disk-based learners  Access data sequentially  Use subsampling to potentially require much less than one scan  VFDT is incremental and anytime  New examples can be quickly incorporated as they arrive  A usable model is available after the first few examples and then progressively defined 24 Experiment Results (VFDT vs. C4.5)  Compared VFDT and C4.5 (Quinlan, 1993)  Same memory limit for both (40 MB)  100k examples for C4.5  VFDT settings: δ= 10-7, τ= 5%, nmin=200  Domains: 2 classes, 100 binary attributes  Fifteen synthetic trees 2.2k – 500k leaves  Noise from 0% to 30% 25 Experiment Results Accuracy as a function of the number of training examples 26 Experiment Results Tree size as a function of number of training examples 27 Mining Time-Changing Data Stream  Most KDD systems, include VFDT, assume training data is a sample drawn from stationary distribution  Most large databases or data streams violate this assumption  Concept Drift: data is generated by a time-changing concept function, e.g.  Seasonal effects  Economic cycles  Goal:  Mining continuously changing data streams  Scale well 28 Window Approach  Common Approach: when a new example arrives, reapply a traditional learner to a sliding window of w most recent examples  Sensitive to window size  If w is small relative to the concept shift rate, assure the availability of a model reflecting the current concept  Too small w may lead to insufficient examples to learn the concept  If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high. 29 CVFDT 30 CVFDT  CVFDT (Concept-adapting Very Fast Decision Tree learner)  Extend VFDT  Maintain VFDT’s speed and accuracy  Detect and respond to changes in the examplegenerating process 31 Observations  With a time-changing concept, the current splitting attribute of some nodes may not be the best any more.  An outdated subtree may still be better than the best single leaf, particularly if it is near the root.  Grow an alternative subtree with the new best attribute at its root, when the old attribute seems out-of-date.  Periodically use a bunch of samples to evaluate qualities of trees.  Replace the old subtree when the alternate one becomes more accurate. 32 CVFDT algorithm  Alternate trees for each node in HT start as empty.  Process examples from the stream indefinitely. For each example (x, y),  Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.  Add (x, y) to the sliding window of examples.  Remove and forget the effect of the oldest examples, if the sliding window overflows.  CVFDTGrow  CheckSplitValidity if f examples seen since last checking of alternate trees.  Return HT. 33 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 34 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 35 CVFDTGrow  For each node reached by the example in HT,  Increment the corresponding statistics at the node.  For each alternate tree Talt of the node,  CVFDTGrow  If enough examples seen at the leaf in HT which the example reaches,  Choose the attribute that has the highest average value of the attribute evaluation measure (information gain or gini index).  If the best attribute is not the “null” attribute, create a node for each possible value of this attribute 36 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 37 Forget old example  Maintain the sufficient statistics at every node in HT to monitor the validity of its previous decisions.  VFDT only maintain such statistics at leaves.  HT might have grown or changed since the example was initially incorporated.  Assigned each node a unique, monotonically increasing ID as they are created.  forgetExample (HT, example, maxID)  For each node reached by the old example with node ID no larger than the max leave ID the example reaches,  Decrement the corresponding statistics at the node.  For each alternate tree Talt of the node, forget(Talt, example, maxID). 38 CVFDT algorithm: process each example Pass example down to leaves add example to sliding window Window overflow? Read new example Yes Forget oldest example No CVFDTGrow f examples since last checking? Yes No CheckSplitValidty 39 CheckSplitValidtiy  Periodically scans the internal nodes of HT.  Start a new alternate tree when a new winning attribute is found.  Tighter criteria to avoid excessive alternate tree creation.  Limit the total number of alternate trees. 40 Smoothly adjust to concept drift  Alternate trees are grown the same way HT is.  Periodically each node with non-empty alternate trees enter a testing mode. Yes  M training examples to compare accuracy.  Prune alternate trees with non-increasing accuracy over time.  Replace if an alternate tree is more accurate. Yes Age<30? Yes Married? No No No Car Type= Sports Car? Yes Yes No Experience <1 year? No Yes No No Yes 41 Adjust to concept drift(2)  Dynamically change the window size  Shrink the window when many nodes gets questionable or data rate changes rapidly.  Increase the window size when few nodes are questionable. 42 Performance  Require memory O(nodes * attributes * attribute values * classes).  Independent of the total number of examples.  Running time O(Lc * attributes * attribute values * number of classes).  Lc : the longest length an example passes through * number of alternate trees.  Model learned by CVFDT vs. the one learned by VFDT-Window:  Similar in accuracy  O(1) vs. O(window size) per new example. 43 Experiment Results     Compare CVFDT, VFDT, VFDT-Window 5 million training examples Concept changed at every 50k examples Drift Level: average percentage of the test points that changes label at each concept change.  About 8% of test points change label each drift  100,000 examples in window  5% noise  Test the model every 10k examples throughout the run, averaged these results. 44 Experiment Results (CVFDT vs. VFDT) drift level Error rate as a function of number of attributes 45 Experiment Results (CVFDT vs. VFDT) Tree size as a function of number of attributes 46 Experiment Results (CVFDT vs. VFDT) Portion of data set that is labelled -ve Error rates of learners as a function of the number of examples seen 47 Experiment Results (CVFDT vs. VFDT) Error rates as a function of the amount of concept drift 48 Experiment Results CVFDT’s drift characteristics 49 Experiment Results (CVFDT vs. VFDT vs. VFDT-window) Stimulated by running VFDT on W for every 100K examples instead of every example Error Rate: VFDT: 19.4% CVFDT: 16.3% VFDT-Window: 15.3% Running Time: VFDT: 10 minutes CVFDT: 46 minutes VFDT-Window: expect 548 days Error rates over time of CVFDT, VFDT, and VFDT-window 50 Experiment Results  CVFDT not use too much RAM  D=50, CVFDT never uses more than 70MB  Use as little as half the RAM of VFDT  VFDT often had twice as many leaves as the number of nodes in CVFDT’s HT and alternate subtrees combined  Reason: VFDT considers many more outdated examples and is forced to grow larger trees to make up for its earlier wrong decisions due to concept drift 51 Conclusions  CVFDT – a decision-tree induction system capable of learning accurate models from high speed, concept-drifting data streams  Grow an alternative subtree whenever an old one becomes questionable  Replace the old subtree when the new more accurate  Similar in accuracy to applying VFDT to a moving window of examples 52 Future Work  Concepts changed periodically and removed subtrees may become useful again  Comparisons with related systems  Continuous attributes  Weighting examples 53 Reference List  P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.  G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001  V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the Sixteenth International Conference on Data Engineering, 2000.  J. Gehrke, V. Ganti, R. Ramakrishnan, and W.L. Loh. BOAT: optimistic decision tree construction. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999. 54 The end Q&A 55 Thank You! 56