Online Mining of Maximal Frequent Itemsequences from Data Streams

Guojun Mao1,2, Xindong Wu2, Chunnian Liu1, Xingquan Zhu2, Gong Chen2, Yue Sun1, and Xu Liu1
1 School of Computer Science, Beijing University of Technology, Beijing 100022, P.R. China
2 Department of Computer Science, University of Vermont, Burlington VT 05405, U.S.A.
E-mail: maoguojun@bjut.edu.cn; xwu@cs.uvm.edu

Abstract

Mining data streams often requires real-time extraction of interesting patterns from dynamic and continuously growing data. This requirement imposes the challenge of discovering and outputting currently useful patterns instantly, commonly referred to as online streaming data mining. In this paper, we present INSTANT, a novel algorithm that explores maximal frequent itemsequences from streaming data in an online fashion. We first provide useful operators on the lattice of itemsequential sets, and then apply them to the design of INSTANT. In comparison with the most popular methods, such as closed-itemset based mining algorithms, INSTANT has solid theoretical foundations ensuring that its in-memory data structures are more compact than closed itemsequences. Experimental results show that our method achieves better results than previous related methods in terms of both time and space efficiency.

1. Introduction

Discovering frequent itemsets from transaction data streams is a typical problem that has received intensive study [1, 4, 7, 13]. Recent research efforts on mining frequent itemsets from large volumes of streaming data have centered on the development of in-memory data structures and the design of algorithms with effective time efficiency and space utilization [4, 8, 11]. Dong et al. [11] argued that one of the keys to mining data streams is online mining of changes. Furthermore, online mining methods for data streams should output current patterns to users as soon as changes result in new patterns, because in many streaming-data applications, such as stock analysis and market prediction, users need to review pattern changes as they happen. Thus, an online mining method is expected to respond to streaming data in real time, and the results generated by the mining algorithm should be displayed to users instantly. A fully online algorithm must be able to maintain the intermediate information obtained from scanning the data stream and, upon a user's request, quickly display all available results.

In comparison with traditional static data mining, online mining of data streams can impose higher system resource requirements for maintaining historical information. However, these requirements must be controlled so that frequent patterns can be maintained efficiently in dynamic streaming environments. Therefore, attractive algorithms for mining large-volume data streams should remain relatively stable, scaling up only modestly with increasing volumes of streaming data and varying user-specified parameters. Moreover, due to the large volume and unpredictable speed of data streams, a shortage of system resources can occur at any time, so online mining algorithms must provide good strategies to cope with system overload. Realizing load shedding while minimizing the degradation in accuracy is a challenging task in mining data streams.

This paper aims at online mining of frequent itemsequences from data streams.
We present an efficient algorithm called INSTANT (maxImal frequeNt So-far iTemsequence mAiNTainer), which is based on the new mining theory developed in this paper. The paper also discusses the performance of the proposed algorithm from both theoretical and experimental perspectives.

1.1 Related Work

The problem of mining frequent itemsets in databases was first addressed by Agrawal et al. [2], who established the Apriori property for frequent itemset mining: all nonempty sub-itemsets of a frequent itemset must be frequent. During the last decade, many efforts have been made in mining frequent itemsets, where two approaches have received intensive attention: closed itemsets [19] and FP-Tree patterns [14]. Pasquier et al. [19] first addressed the problem of mining closed itemsets, and provided an improved mining theory relative to the Apriori principle: all nonempty closed sub-itemsets of a frequent closed itemset must be frequent. Since [19], many excellent algorithms based on closed itemset mining have been proposed [20, 23]. Han et al. [14] proposed the FP-Tree method, which was the first effort to mine frequent itemsets without candidate generation and with only two scans over the database. The compact in-memory data structure of the FP-Tree has since been widely adopted [15, 17].

Recently, discovering frequent itemsets has been successfully extended to data stream mining, which is more challenging than mining transaction databases. Manku et al. [18] gave an algorithm called LOSSY COUNTING for mining all frequent itemsets over the entire history of the streaming data. This algorithm is based on the Apriori property, but it is a one-pass algorithm over data streams. Chi et al. [9] proposed an algorithm called MOMENT, which may be the first to mine closed patterns from data streams; it uses an in-memory data structure called CET to maintain the closed itemsets obtained by scanning the streaming data. There are also algorithms based on tree structures for discovering frequent itemsets from data streams. Giannella et al. [12] presented a data structure called FP-Stream for maintaining information about frequent itemsets in data streams; by scanning the generated FP-Stream, the frequent patterns in an arbitrary time interval can be obtained. Another typical tree-based algorithm is DSM-FI, proposed by Li et al. [17], which extends a prefix-tree-based compact pattern representation. To output frequent itemsets to users, DSM-FI executes a top-down frequent itemset discovery scheme over its maintained in-memory data structures. Like our work in this paper, these methods all try to exploit good in-memory data structures to find frequent itemsets in streaming data. Unlike our work, however, they employ a two-phase implementation: they first scan the data stream to produce interim in-memory data structures, and then generate frequent patterns for users from those structures. We believe the output of frequent patterns should itself be a dynamic streaming process: once an object becomes frequent, it should be output instantly.

Theoretically, a data stream can grow continuously and infinitely over time, so selecting a current handling window is also a key problem in mining streaming data. Zhu et al. [24] gave three windowing models for mining data streams: landmark windows, sliding windows, and damped windows. In a landmark window, algorithms cope with the data from a specific time point, called the landmark, to the present.
Without additional data-updating techniques, this model cannot handle continuous high-volume data streams well. The sliding window is a popular model in mining data streams. Data updating over a sliding window of fixed size is simple: an old transaction is cleaned up when a new transaction enters the window. MOMENT [9] uses the sliding window technique to maintain its current CET. Another typical algorithm using sliding windows is FTP-DS, proposed by Teng et al. [22]. In the damped window model, the weight of each transaction in a data stream is a function of its arrival time: the later a transaction arrives, the higher its weight. Chang et al. [8] developed a weight function that decreases with age and designed an algorithm, estDec, for mining frequent itemsets in streaming data. FP-Stream [12] created an aging weight function and can mine frequent itemsets at multiple time granularities via a novel tilted-time windowing technique. Different window models have their own advantages and disadvantages. We believe, however, that damped windows provide a more flexible way to update data, supporting load shedding under different strategies such as periodical, background, threshold-driven, and integrated shedding plans.

The continuous, high-speed arrival of data in streams can cause system overload, so a good load shedding scheme is necessary for data-stream mining to decide when and how to discard aged data in memory. There are two basic strategies for handling system congestion: (a) prevention - the system actively estimates its workload based on the current input rate of the data stream and performs load shedding to discard some data tuples before a congestion occurs; and (b) post-treatment - a load shedding mechanism is invoked only when system performance degrades seriously or the system stops working. In general, accurate prevention of stream jams suffers from costly computation, whereas pure post-treatment loses up-to-date responses to continuous data streams. How to select the data to discard from memory is therefore crucial for mining data streams. Similar to the aging weight function used in FP-Stream, Chang et al. [7] developed an algorithm for maintaining frequent itemsets in streaming data that assigns each transaction a weight related to its age. Recently, several references have discussed this issue and provided effective methods [5, 6, 10]. Babcock et al. [5, 6] assumed that a set of Quality-of-Service (QoS) specifications is available, and designed a load shedding scheme that uses the QoS specifications to decide when and how to discard data. Chi et al. [10] proposed a load shedding scheme for classifying multiple data streams, introducing a new metric called Quality of Decision (QoD) to measure the load status. In addition, simple shedding mechanisms like the aging weight function in FP-Stream can achieve good efficiency for real-time or online systems.

Several references have discussed the problem of online mining of patterns from data streams [1, 12, 16, 21]. Asai et al. [1] gave an online algorithm called StreamT that aims at mining patterns from streams of semi-structured data such as XML data. Keogh et al. [16] and Palpanas et al. [21] considered the problem of online mining of streaming time series and gave algorithms for it.
Our data source format in this paper is quite different from those of the above algorithms.

1.2 Our Contributions

Our focus in this paper is on dynamic information maintenance over continuous streaming data, and on instant output of current frequent itemsequences. Our contributions can be summarized as follows. (1) We assume that there is a lexicographical order among all items in a data stream; when all items in a transaction are represented as an itemsequence, all transactions in a data stream can be modeled as an itemsequential set. Based on the algebraic lattice of itemsequential sets, we provide new mining operators, and apply these theoretical results to mining data streams. (2) We present a new online algorithm, INSTANT, which has provable space and time efficiency. (3) The in-memory data structures used in INSTANT have lower space costs than closed itemsequences. Most importantly, INSTANT can directly display current frequent itemsequences as they are generated, without re-scanning any in-memory data structure to output frequent patterns. Therefore, INSTANT has an obvious online mining character.

The rest of the paper is organized as follows. In Section 2, we present our problem statement and some theoretical results on the algebraic lattice of itemsequential sets. Section 3 describes our algorithm INSTANT and analyzes its theoretical performance properties. Experimental studies are provided in Section 4. Section 5 concludes the paper.

2. Operators on Itemsequential Sets

Before presenting our mining algorithm in Section 3, we introduce the relevant concepts and notation with a theoretical analysis.

2.1 Problem Statement

A popular formulation of the problem of mining transaction databases is via the term itemset. From this viewpoint, a transaction database is a series of tuples, each of which includes an itemset, and discovering frequent itemsets in the transaction database is considered a key phase in pattern mining. In this paper, we consider the term itemsequence rather than itemset. In short, an itemsequence is an ordered list of items.

Definition 1 (Itemsequence). An itemsequence is an ordered list of items, where the order of items is given by a specific criterion. Let α = a1a2…am and β = b1b2…bn be two itemsequences. We say that β contains α, denoted by α ⊑s β, if there exist integers 1 ≤ k1 < k2 < … < km ≤ n such that ai = bki (i = 1, 2, …, m). In this situation, we also call α a sub-itemsequence of β, or β a super-itemsequence of α.

Example 1. Consider the capital letters of the English alphabet as the items of interest, and let the order be alphabetic. Then ABC is an itemsequence, but ACB is not. Assuming β = ABCDEF, we have ABC ⊑s β, but ABG ⋢s β.

Having introduced the term itemsequence, we can now formalize the problem to be tackled in this paper as follows. Let I = {i1, i2, …, im} be an item alphabet, called the items, and DS = {t1, t2, …, tn, …} be a data stream in which every element represents a transaction. A transaction ti is modeled as an itemsequence on I (i = 1, 2, …) and is associated with a unique transaction identifier, TID, which increases over time. Given an arbitrary itemsequence, its support is the ratio of the number of transactions containing (⊑s) this itemsequence to the number of all transactions in DS. Due to the potentially infinite nature of a data stream, it is not feasible to obtain the full support information of an itemsequence in DS. However, by analyzing what has happened in DS so far, we can obtain its current patterns.
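To make the containment relation ⊑s of Definition 1 concrete, the following minimal Python sketch (ours, not part of the paper; the function name is_subitemseq is only illustrative) tests whether one itemsequence is a sub-itemsequence of another, modeling itemsequences as strings of single-character items:

    def is_subitemseq(alpha: str, beta: str) -> bool:
        """Test alpha 'is contained by' beta: alpha's items occur in beta in order (Definition 1)."""
        i = 0
        for b in beta:                        # scan beta left to right
            if i < len(alpha) and alpha[i] == b:
                i += 1                        # matched alpha[i] at a strictly increasing position
        return i == len(alpha)               # True iff every item of alpha was matched

    # Reproduces Example 1 with beta = "ABCDEF":
    assert is_subitemseq("ABC", "ABCDEF")     # ABC is contained by beta
    assert not is_subitemseq("ABG", "ABCDEF") # ABG is not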
Definition 2 (Support). Given an item alphabet I = {i1, i2, …, im} and a data stream DS = {t1, t2, …, tn, …}, the current support of an itemsequence t, denoted by Csup(t), is the ratio of the number of transactions containing t as a sub-itemsequence to the number of all transactions that have occurred so far in DS. The global support of t, denoted by Gsup(t), is the ratio of the number of transactions that contain t as a sub-itemsequence to the number of all transactions in DS. Note that we sometimes use the term support count instead of support; the support count of an itemsequence simply means the number of times this itemsequence occurs in a specific period.

Definition 3 (Maximal itemsequence). Given a set of itemsequences S, an itemsequence is a maximal itemsequence in S if it is not a sub-itemsequence of any other itemsequence in S.

Definition 4 (Maximal frequent itemsequence). Given I, DS, and a minimum support Msup, an itemsequence t is called a frequent itemsequence if Gsup(t) ≥ Msup, and a maximal frequent itemsequence if, in addition, it is not a sub-itemsequence of any other frequent itemsequence in DS. Likewise, t is called a frequent so-far itemsequence if Csup(t) ≥ Msup, and a maximal frequent so-far itemsequence if it is not a sub-itemsequence of any other frequent itemsequence discovered so far.

Our research objective in this paper is to develop a fast single-pass algorithm that finds maximal frequent so-far itemsequences and instantly outputs them when they are discovered.

2.2 Itemsequential Set Theory

In Section 2.1 we defined itemsequences; in this subsection we extend them to itemsequential sets. With this extension, we can create useful operators for discovering frequent itemsequences.

Definition 5 (Itemsequential set). An itemsequential set is a set of itemsequences on I. Let t be an itemsequence, and s1 and s2 be two itemsequential sets. Then:
(1) t sub-belongs to s1, denoted by t ∈sub s1, if there exists an itemsequence s in s1 such that t ⊑s s.
(2) t is an element of the sub-intersection set of s1 and s2, denoted by s1 ∩sub s2, if both t ∈sub s1 and t ∈sub s2.
(3) t is an element of the sub-union set of s1 and s2, denoted by s1 ∪sub s2, if either t ∈sub s1 or t ∈sub s2.

Example 2. Assume s1 = {AB, CD} and s2 = {ABC, AD}. Under the general set operators, AB ∈ s1, AB ∉ s2, s1 ∩ s2 = Φ, and s1 ∪ s2 = {AB, CD, ABC, AD}. According to Definition 5, however, we can say AB ∈sub s2, s1 ∩sub s2 = {A, B, C, D, AB}, and s1 ∪sub s2 = {A, B, C, D, AB, CD, AC, BC, AD, ABC}.

Definition 6 (Maximal sub-operators). Let t be an itemsequence, and s1 and s2 be two itemsequential sets. Then:
(1) t is an element of the maximal sub-intersection set of s1 and s2, denoted by s1 ∩ms s2, if t is an element of s1 ∩sub s2 and is not contained by any other element of s1 ∩sub s2.
(2) t is an element of the maximal sub-union set of s1 and s2, denoted by s1 ∪ms s2, if t is an element of s1 ∪sub s2 and is not contained by any other element of s1 ∪sub s2.

Example 3. Assuming s1 = {AB, CD} and s2 = {ABC, AD}, then s1 ∩ms s2 = {AB, C, D} and s1 ∪ms s2 = {ABC, CD, AD}.

Property 1 (Idempotent law). s1 ∩ms s1 = s1; s1 ∪ms s1 = s1.
Property 2 (Commutative law). s1 ∩ms s2 = s2 ∩ms s1; s1 ∪ms s2 = s2 ∪ms s1.
Property 3 (Associative law). (s1 ∩ms s2) ∩ms s3 = s1 ∩ms (s2 ∩ms s3); (s1 ∪ms s2) ∪ms s3 = s1 ∪ms (s2 ∪ms s3).
Property 4 (Absorption law). s1 ∪ms (s1 ∩ms s2) = s1; s1 ∩ms (s1 ∪ms s2) = s1.
Property 5 (Distributive law). s1 ∩ms (s2 ∪ms s3) = (s1 ∩ms s2) ∪ms (s1 ∩ms s3); s1 ∪ms (s2 ∩ms s3) = (s1 ∪ms s2) ∩ms (s1 ∪ms s3).

These properties can be easily derived from Definitions 5 and 6.
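As a concrete, brute-force illustration of Definitions 5 and 6, the following Python sketch enumerates sub-itemsequences explicitly. It is meant only to make the operators precise on small examples, not to be efficient, and all function names (subseqs, sub_belongs, maximal, ms_intersect, ms_union) are ours:

    from itertools import combinations

    def is_subitemseq(alpha, beta):
        """alpha is a sub-itemsequence of beta (Definition 1)."""
        i = 0
        for b in beta:
            if i < len(alpha) and alpha[i] == b:
                i += 1
        return i == len(alpha)

    def subseqs(t):
        """All nonempty sub-itemsequences of itemsequence t."""
        return {"".join(c) for n in range(1, len(t) + 1) for c in combinations(t, n)}

    def sub_belongs(t, s):
        """t sub-belongs to s: t is contained by some itemsequence in s (Definition 5)."""
        return any(is_subitemseq(t, u) for u in s)

    def maximal(s):
        """Keep only itemsequences not contained by another element (Definition 3)."""
        return {t for t in s if not any(t != u and is_subitemseq(t, u) for u in s)}

    def ms_intersect(s1, s2):
        """Maximal sub-intersection of s1 and s2 (Definition 6)."""
        common = {t for u in s1 for t in subseqs(u) if sub_belongs(t, s2)}
        return maximal(common)

    def ms_union(s1, s2):
        """Maximal sub-union of s1 and s2 (Definition 6): the maximal elements of
        the sub-union set are exactly the maximal actual elements of s1 and s2."""
        return maximal(set(s1) | set(s2))

    # Reproduces Example 3:
    assert ms_intersect({"AB", "CD"}, {"ABC", "AD"}) == {"AB", "C", "D"}
    assert ms_union({"AB", "CD"}, {"ABC", "AD"}) == {"ABC", "CD", "AD"}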
3. Algorithm and Analysis

In this section, we first present our algorithm design and then provide a theoretical analysis of its performance properties.

3.1 Algorithm Design

In comparison with other types of data, streaming data is more difficult to deal with in pattern mining. On the one hand, a data stream grows dynamically, so its patterns should be formed incrementally. From this point of view, algorithms for mining data streams should have an online character: any so-far patterns should be delivered to users as soon as they are found. On the other hand, a data stream is a high-volume collection of data, so efficient use of main memory becomes the bottleneck of mining data streams. To break through this bottleneck, it is necessary to design a compact in-memory data structure: redundant information should be stored sparingly or not at all, and active pruning measures must be taken. Also, since a data stream is theoretically infinite, aged or less important data must be shed from memory in time.

Based on the theoretical analysis in the previous section, we design our algorithm in a rather succinct way; Figure 1 gives its pseudocode. INSTANT uses two main in-memory data structures: (1) K, an itemsequential set that stores the maximal frequent so-far itemsequences found by a given time; and (2) U, an array of itemsequential sets, where U[i] stores the maximal itemsequences that are infrequent with a support count of i by a given time.

Algorithm INSTANT
INPUT: (1) A continuous data stream DS; (2) minimum support count δ; (3) memory space available to the user ϕ.
OUTPUT: Maximal frequent so-far itemsequences.
Main:
  Initialize(K, δ); Initialize(U, δ);
  REPEAT
    α = get an itemsequence from DS;
    IF (α ∉sub K)
      Fre_maker(K, α, U[δ-1]);
      Sup_maintainer(U, α, δ);
      IF (memory usage ≥ ϕ) Shedder(U, ϕ); ENDIF
    ENDIF
  UNTIL endof(DS)

Figure 1 Description of Algorithm INSTANT

When an itemsequence α arrives in memory from DS, INSTANT first tests whether α ∈sub K. If α ∈sub K, no action needs to be taken, because α or its super-itemsequences have already been stored and output as frequent patterns. If α ∉sub K, the following three procedures are called.

(1) Fre_maker(K, α, U[δ-1]): when α appears, it is possible that α or some of α's sub-itemsequences become frequent. Figure 2 presents this procedure. By executing S0 = {α} ∩ms U[δ-1], it derives the new frequent itemsequential set S0 related to α and updates K with K = K ∪ms S0. By calling Output(S0), the procedure displays frequent so-far itemsequences right after they are generated, which gives the algorithm its obvious online mining character.

Procedure Fre_maker(K, α, U[δ-1])
  S0 = {α} ∩ms U[δ-1];
  IF (S0 ≠ Φ)
    K = K ∪ms S0;
    Output(S0);
  ENDIF

Figure 2 Description of Fre_maker()

(2) Sup_maintainer(U, α, δ): when a new itemsequence α arrives from DS, new (sub-)itemsequences may have to be inserted into U, or the supports of existing elements of U updated. This procedure maintains the changes that α brings to U[1], U[2], …, U[δ-1] in a hierarchical way. As Figure 3 shows, Sup_maintainer() can be described succinctly using the sub-operators on itemsequential sets introduced in Section 2.

Procedure Sup_maintainer(U, α, δ)
  S1 = {α};
  FOR (i = 1; i < δ; i++)
    S2 = U[i] ∩ms S1;
    U[i] = U[i] ∪ms S1;
    U[i] = U[i] - S2;
    S1 = S2;
  ENDFOR

Figure 3 Description of Sup_maintainer()
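To show how Figures 1-3 fit together, here is a hedged Python sketch of INSTANT's core loop; it is our reading of the pseudocode, it reuses is_subitemseq, sub_belongs, ms_intersect, ms_union, and maximal from the sketches above (assumed in scope), and Shedder is treated separately below:

    def instant(stream, delta):
        """Maintain K (maximal frequent so-far itemsequences) and U[i] (maximal
        itemsequences with support count i), printing patterns as they turn frequent."""
        K = set()
        U = [set() for _ in range(delta)]          # U[0] is unused
        for alpha in stream:
            if sub_belongs(alpha, K):
                continue                           # already covered by a frequent pattern
            # Fre_maker (Figure 2): anything in U[delta-1] that alpha contains
            # now reaches the minimum support count delta.
            S0 = ms_intersect({alpha}, U[delta - 1])
            if S0:
                K = ms_union(K, S0)
                print("frequent so-far:", sorted(S0))   # instant output
            # Sup_maintainer (Figure 3): promote alpha's sub-itemsequences one
            # support level at a time.
            S1 = {alpha}
            for i in range(1, delta):
                S2 = ms_intersect(U[i], S1)        # support count rises from i to i+1
                U[i] = ms_union(U[i], S1) - S2
                S1 = S2
        return K, U

Run with delta = 3 on the stream ABC, ACD, ABCD, ABD, ACDE, ABCD, BC of Example 4 below, this sketch reproduces the trace in Table 1, ending with K = {ACD, ABC, ABD}.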
(3) Shedder(U, ϕ): as far as system overload is concerned, a shedding strategy needs to be chosen carefully. In general, the elements to be driven out of memory should be those with the least influence on future mining results. To minimize this influence, our algorithm maintains a weighted function for every itemsequence t in DS:

Imp(t) = w1*Time(t) + w2*Length(t) + w3*Csup(t).

We consider three factors: (a) occurrence time Time(t): the earlier an itemsequence occurs in DS, the less important it is; (b) itemsequence length Length(t): the longer an itemsequence is, the more important it is; and (c) current support Csup(t): the larger the support count of an itemsequence, the more important it is. Based on this Imp() function, we adopt the following load shedding solution: given a user-specified memory threshold ϕ, Shedder(U, ϕ) selects the itemsequences in U with the least Imp values and discards them. Figure 4 shows the main steps of this procedure.

Procedure Shedder(U, ϕ)
  REPEAT
    α = select an itemsequence in U with the least Imp value;
    delete(α, U);
  UNTIL (the memory usage < ϕ)

Figure 4 Description of Shedder()

In general, ϕ should be set well below the overall system capacity, in order to keep enough space available for processing possible arrival peaks in the data stream. Unlike the above two procedures, Shedder() may be invoked periodically as a background process.
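The paper fixes only the form of Imp(); the sketch below is one possible realization under stated assumptions: we invent a map arrival_tid from each resident itemsequence to the TID of its last occurrence, a caller-supplied memory_ok() probe for the threshold test, and a batch parameter, none of which appear in the paper.

    import heapq

    def imp(t, arrival_tid, csup, w1=1.0, w2=1.0, w3=1.0):
        """Imp(t) = w1*Time(t) + w2*Length(t) + w3*Csup(t): later arrival,
        greater length, and higher support all make t more important to keep."""
        return w1 * arrival_tid + w2 * len(t) + w3 * csup

    def shedder(U, arrival_tid, memory_ok, batch=10):
        """Discard the lowest-Imp itemsequences from U until memory_ok() is True."""
        while not memory_ok():
            # Score every resident infrequent itemsequence; t in U[i] implies Csup(t) = i.
            scored = [(imp(t, arrival_tid.get(t, 0), i), i, t)
                      for i in range(1, len(U)) for t in U[i]]
            if not scored:
                break                              # nothing left to shed
            for _, i, t in heapq.nsmallest(batch, scored):
                U[i].discard(t)                    # delete(α, U) in Figure 4

Shedding in small batches between transactions approximates the periodical background invocation suggested above.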
Example 4. Assume DS = {ABC, ACD, ABCD, ABD, ACDE, ABCD, BC} and Msup = 3 (as a count). Table 1 shows how INSTANT runs: each row gives K, U[1], and U[2] after the itemsequence α is processed (the first row is the initial state). The last transaction, BC, satisfies BC ∈sub K, so it triggers no updates.

Table 1 An Example for INSTANT
α      | K               | U[1]         | U[2]
(init) | Φ               | Φ            | Φ
ABC    | Φ               | {ABC}        | Φ
ACD    | Φ               | {ABC, ACD}   | {AC}
ABCD   | {AC}            | {ABCD}       | {ABC, ACD}
ABD    | {AC, AB, AD}    | {ABCD}       | {ABC, ACD, ABD}
ACDE   | {ACD, AB}       | {ABCD, ACDE} | {ABC, ABD}
ABCD   | {ACD, ABC, ABD} | {ACDE}       | {ABCD}

3.2 Theoretical Performance Analysis

Mining complete frequent itemsets in a transaction database raises several problems, one of which is memory usage, especially when the support threshold is low or the database is high-volume. A widely discussed remedy is to mine closed frequent itemsets instead of complete frequent itemsets; many references have discussed the related problems and useful algorithms [9, 19, 20, 23]. The notion of a closed itemset can be extended to the itemsequences of Definition 1. Following the definitions in [20], we define closed itemsequences and closed frequent itemsequences as follows.

Definition 7 (Closed itemsequence). Given an itemsequential set S, an itemsequence t is a closed itemsequence in S if there does not exist an itemsequence t0 in S such that (1) t0 is a proper super-itemsequence of t, and (2) every itemsequence containing (⊒s) t also contains t0.

Definition 8 (Closed frequent itemsequence). An itemsequence t is a closed frequent itemsequence if it is a closed itemsequence and its support passes the given minimum support threshold. For mining a data stream, the support of an itemsequence t can be either Gsup(t) or Csup(t) from Definition 2. When Csup(t) passes the minimum support threshold at a given time, t is called a closed frequent so-far itemsequence by that time.

Some theoretical performance properties of INSTANT can be proved by comparison with the closed itemsequences in the data stream.

Lemma 1. After an itemsequence is handled by INSTANT, for every i = 1, 2, …, δ-1, the support counts of all itemsequences in U[i] are exactly equal to i, where δ is the minimum support count.

Proof. Assume the data stream is (α1, α2, …, αn-1, αn, …), and let Ui^(n-1) and Ui^n be the values of U[i] before and after handling αn, respectively. We prove the lemma by mathematical induction on n.

(a) When n = 1: U1^0 = U2^0 = … = Uδ-1^0 = Φ, and U1^1 = {α1}, U2^1 = … = Uδ-1^1 = Φ. Obviously Lemma 1 holds.

(b) When n > 1: given the induction hypothesis that Lemma 1 holds for all n < k, we prove that it also holds for n = k. According to Algorithm INSTANT, we have

U1^k = (U1^(k-1) ∪ms {αk}) - (U1^(k-1) ∩ms {αk}),
U2^k = (U2^(k-1) ∪ms (U1^(k-1) ∩ms {αk})) - (U2^(k-1) ∩ms (U1^(k-1) ∩ms {αk})),
……,
Ui^k = (Ui^(k-1) ∪ms ((∪ms_{j=1..i-1} Uj^(k-1)) ∩ms {αk})) - ((∪ms_{j=1..i} Uj^(k-1)) ∩ms {αk}), …… (1)
……,
Uδ-1^k = (Uδ-1^(k-1) ∪ms ((∪ms_{j=1..δ-2} Uj^(k-1)) ∩ms {αk})) - ((∪ms_{j=1..δ-1} Uj^(k-1)) ∩ms {αk}).

By (1) and the induction hypothesis, all itemsequences in Ui^(k-1) (i = 1, 2, …, δ-1) have a support count of exactly i, so we obtain:

∀t ∈ Ui^(k-1): Csup(t) = i, …… (2)
∀t ∈ (∪ms_{j=1..i-1} Uj^(k-1)): Csup(t) ≤ i-1, …… (3)
∀t ∈ ((∪ms_{j=1..i-1} Uj^(k-1)) ∩ms {αk}): Csup(t) ≤ i, …… (4)
∀t ∈ ((∪ms_{j=1..i} Uj^(k-1)) ∩ms {αk}): Csup(t) ≤ i+1. …… (5)

By (2) and (4), we have

∀t ∈ (Ui^(k-1) ∪ms ((∪ms_{j=1..i-1} Uj^(k-1)) ∩ms {αk})): Csup(t) ≤ i. …… (6)

Applying (6) and (5) to (1), every t ∈ Ui^k = (Ui^(k-1) ∪ms ((∪ms_{j=1..i-1} Uj^(k-1)) ∩ms {αk})) - ((∪ms_{j=1..i} Uj^(k-1)) ∩ms {αk}) satisfies Csup(t) ≥ i but not Csup(t) ≥ i+1; that is, all elements in Ui^k have a support count of exactly i. Thus Lemma 1 holds for n = k.

By (a) and (b), for a data stream (α1, α2, …, αn, …), after processing each itemsequence αn (n = 1, 2, …), INSTANT keeps U[i] with an exact support count of i (i = 1, 2, …, δ-1). □

Lemma 2. At any time, Algorithm INSTANT satisfies: if i < j and t ∈ U[i], then t ∉sub U[j].

Proof. For i < j, if t ∈ U[i], then by Lemma 1, Csup(t) = i. For any t1 such that t ⊑s t1, Csup(t1) ≤ Csup(t) = i < j, so t1 ∉ U[j]. Therefore, t ∉sub U[j]. □

Lemma 3. At any time, for every i = 1, 2, …, δ-1, Algorithm INSTANT keeps U[i] as a set of maximal itemsequences.

Proof. U[i] is updated by Procedure Sup_maintainer(), which uses the maximal sub-operators ∩ms and ∪ms to renew its elements. By the definition of ∪ms, after executing U[i] ∪ms S1, no itemsequence of U[i] can be contained by another itemsequence in U[i]. Thus, U[i] is a set of maximal itemsequences. □

Lemma 4. At any time, Algorithm INSTANT keeps K as a set of maximal frequent itemsequences.

Proof. This can be proved in the same way as Lemma 3. □

Lemma 5. Given a data stream DS and a certain time T, let CF be the set of closed frequent itemsequences in DS at T, and K be the set of frequent itemsequences generated by Algorithm INSTANT by T. Then K ⊆ CF.

Proof. By Lemma 4, K is a set of maximal frequent itemsequences. That is, for any t ∈ K, no proper super-itemsequence of t is frequent. By Definitions 7 and 8, t is also a closed frequent itemsequence, i.e., t ∈ CF. Therefore, K ⊆ CF. □
Theorem. Given a data stream DS and a certain time T, let C be the set of all closed itemsequences in DS at T, K be the set of frequent itemsequences generated by Algorithm INSTANT by T, and U[i] (i = 1, 2, …, δ-1) be the sets of itemsequences with support count i generated by Algorithm INSTANT by T. Then (K ∪ (∪_{i=1..δ-1} U[i])) ⊆ C.

Proof. Let CF and NF be the sets of closed frequent and closed infrequent itemsequences in DS at T, respectively. By Lemma 5, K ⊆ CF. Further, for any t ∈ (∪_{i=1..δ-1} U[i]), there must exist an integer j (j = 1, 2, …, δ-1) such that t ∈ U[j]. By Lemma 3, no proper super-itemsequence of t exists in U[j], and by Lemma 2, t ∉sub U[k] for any k > j. Thus, for any t ∈ (∪_{i=1..δ-1} U[i]), no proper super-itemsequence of t has the same or a higher support count than t. By Definition 7, every element of ∪_{i=1..δ-1} U[i] is closed, i.e., ∪_{i=1..δ-1} U[i] ⊆ NF. Since C = CF ∪ NF, (K ∪ (∪_{i=1..δ-1} U[i])) ⊆ C. □

Based on the above analysis, we conclude that INSTANT uses in-memory data structures (the itemsequential sets K and U) that are more compact than the closed itemsets or closed itemsequences of a data stream.

4. Experimental Evaluation

To evaluate the performance of INSTANT, we compared it with two representative algorithms, MOMENT [9] and DSM-FI [17], both reviewed in Section 1.1.

MOMENT aims at mining closed frequent itemsets from a data stream using the sliding window technique. When new transactions are added to the sliding window or old ones are removed from it, the in-memory data structure CET is updated. MOMENT is thus similar to our method in maintaining special itemsets (closed or maximal), and can be expected to deliver comparable performance.

DSM-FI is an algorithm for mining all frequent itemsets over the entire history of a data stream. It employs an in-memory data structure called IsFI-forest, which can be seen as a typical tree pattern in data stream mining. DSM-FI is a two-phase algorithm: (1) maintaining the IsFI-forest from the data stream; and (2) generating frequent itemsets from the IsFI-forest. Compared with DSM-FI, our method has two obvious differences: (1) it maintains only maximal so-far itemsequences; and (2) it directly outputs frequent so-far patterns once they are generated.

All experiments in this section were implemented in C++ and compiled with g++ 3.2.2, and were conducted on a 2.8 GHz CPU with 1 GB of RAM running Redhat Linux 9.0. The datasets were generated using the data generator described in [3]. The main generator parameters are: the number of transactions D = 10,000 to 90,000, the average transaction size T = 20, and the average size of the maximal potentially frequent itemsets I = 5. According to their sizes, we denote these data sets by T20I5D10K through T20I5D90K.

Experiment 1. The running times of INSTANT, MOMENT and DSM-FI against various minimum supports and different data sizes. Figure 5 shows the results.

Set-up: The size of the current window for INSTANT, the sliding window for MOMENT, and the block for DSM-FI is 10K transactions in all cases. Figure 5(a) uses the single data set T20I5D90K, and Figure 5(b) uses a fixed minimum support of 0.3%.

Results: INSTANT is faster than DSM-FI at all minimum supports. As the minimum support decreases, INSTANT's running time remains relatively stable while MOMENT's grows markedly.
As the number of transactions increases, INSTANT scales up less than both MOMENT and DSM-FI.

[Figure 5: Running time (sec) of INSTANT, MOMENT and DSM-FI with (a) minimum supports from 1% down to 0.1% on T20I5D90K, and (b) numbers of transactions from 10K to 90K at minimum support 0.3%.]

Analysis of results: INSTANT's stable running time comes from its efficient storage structure and hierarchical maintenance of maximal itemsequences. With a lower support, the maximal so-far itemsequences are sorted into fewer groups but the set of frequent so-far itemsequences is bigger; with a higher support, they are sorted into more groups but the set of frequent so-far itemsequences is smaller. Also, as the number of transactions from the stream increases, the set of maximal itemsequences stays relatively stable, so INSTANT's running time scales up very little. As Figure 5(b) shows, although INSTANT can be slower than MOMENT and DSM-FI at the beginning, it achieves better performance than both as the number of transactions increases.

Experiment 2. Data characteristics in memory, and data shed out of memory, under different minimum supports. Table 2 gives the relevant statistics.

Set-up: Data set T20I5D50K; the size of the current window for INSTANT, the sliding window for MOMENT, and the block for DSM-FI is 25K transactions in all cases.

Results: The second and third columns of Table 2 indicate that as the minimum support decreases, the number of itemsequences kept by INSTANT and the number of itemsets kept by MOMENT both grow, but INSTANT uses less storage and its memory consumption scales up at a lower rate than MOMENT's. The last two columns of Table 2 show that INSTANT sheds less data out of memory than DSM-FI at every minimum support.

Table 2. Data Characteristics Related to Memory
Min. Sup. (%) | INSTANT: Maximal Itemseq. # | MOMENT: Closed Itemset # | INSTANT: Removed Itemseq. # | DSM-FI: Removed Itemset #
1.0 |  586 |  1156 | 15 | 431
0.9 |  754 |  1420 | 16 | 389
0.8 | 1045 |  1800 | 18 | 354
0.7 | 1234 |  2235 | 22 | 313
0.6 | 1409 |  3056 | 22 | 263
0.5 | 1636 |  4527 | 23 | 211
0.4 | 1924 |  7059 | 25 | 165
0.3 | 2153 |  7059 | 33 | 100
0.2 | 2314 | 10935 | 36 |  62
0.1 | 3178 | 19497 | 41 |  48

Analysis of results: INSTANT's better memory efficiency is guaranteed by the theorem in Section 3.2. Moreover, INSTANT and DSM-FI both mine all frequent itemsets or itemsequences over the entire history of the data stream, but INSTANT can keep more useful historical information than DSM-FI in a limited memory space. Thus, with the same memory consumption, INSTANT can cope with more transactions than DSM-FI in a large-scale data stream.

5. Conclusion

Discovering frequent itemsets or itemsequences in an online fashion is an essential challenge in streaming data mining, and this paper provides a solution. By modeling a transaction as an itemsequence, we have presented an online algorithm, INSTANT, that finds maximal frequent so-far itemsequences and instantly outputs them. The uniqueness of our proposed approach is that it adopts useful operators on itemsequential sets to facilitate the algorithm's performance. Most importantly, the in-memory data structures maintained by INSTANT are smaller than the closed so-far itemsequences, so it achieves better performance in both time and space consumption.

References

[1] T. Asai, H. Arimura, K. Abe, S. Kawasoe, and S. Arikawa, Online algorithms for mining semi-structured data streams. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining (ICDM'02), Maebashi City, Japan, December 2002, pp. 27-36.
[2] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conf. on Management of Data, 1993.
[3] R. Agrawal and R. Srikant, Fast algorithms for mining association rules. In Proc. of the 20th Intl. Conf. on Very Large Data Bases (VLDB'94), Santiago, Chile, September 1994, pp. 487-499.
[4] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Models and issues in data stream systems. In Proc. of SIGMOD/PODS, Madison, Wisconsin, USA, June 3-5, 2002, pp. 1-16.
[5] B. Babcock, M. Datar, and R. Motwani, Load shedding techniques for data stream systems. In Proc. of the 2003 Workshop on Management and Processing of Data Streams (MPDS 2003), San Diego, California, USA, June 8, 2003.
[6] B. Babcock, M. Datar, and R. Motwani, Load shedding for aggregation queries over data streams. In Proc. of the 20th Intl. Conf. on Data Engineering, Boston, Massachusetts, USA, March 30 - April 2, 2004, pp. 350-358.
[7] J. Chang and W. Lee, Finding recent frequent itemsets adaptively over online data streams. In Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD 2003), Washington, DC, August 24-27, 2003, pp. 226-235.
[8] J. Chang and W. Lee, Decaying obsolete information in finding recent frequent itemsets over data streams. IEICE Transactions on Information and Systems, Vol. E87-D, No. 6, June 2004.
[9] Y. Chi, H. Wang, P. Yu, and R. Muntz, MOMENT: Maintaining closed frequent itemsets over a stream sliding window. In Proc. of the 4th IEEE Intl. Conf. on Data Mining, Brighton, UK, November 2004, pp. 59-66.
[10] Y. Chi, P. Yu, H. Wang, and R. Muntz, Loadstar: A load shedding scheme for classifying data streams. In Proc. of the 2005 SIAM Intl. Conf. on Data Mining (SDM), Newport Beach, USA, April 2005.
[11] G. Dong, J. Han, L. Lakshmanan, J. Pei, H. Wang, and P. Yu, Online mining of changes from data streams: Research problems and preliminary results. In Proc. of the 2003 Workshop on Management and Processing of Data Streams (MPDS 2003), San Diego, California, USA, June 8, 2003.
[12] C. Giannella, J. Han, E. Robertson, and C. Liu, Mining frequent itemsets over arbitrary time intervals in data streams. Technical Report TR587, Indiana University, 2003.
[13] J. Hsu, Data mining trends and developments: The key data mining technologies and applications for the 21st century. In D. Colton, M. J. Payne, N. Bhatnagar, and C. R. Woratschek (Eds.), Proc. of ISECON 2002, v. 19 (San Antonio): 224b. AITP Foundation for Information Technology Education. ISSN: 1542-7382.
[14] J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation. In Proc. of the SIGMOD Conference, Dallas, Texas, USA: ACM Press, 2000, pp. 1-12.
[15] J. Han, J. Pei, Y. Yin, and R. Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2004, 8(1):53-87.
[16] E. Keogh, S. Chu, D. Hart, and M. Pazzani, An online algorithm for segmenting time series. In Proc. of the IEEE Intl. Conf. on Data Mining, 2001, pp. 289-296.
[17] H. Li, S. Lee, and M. Shan, An efficient algorithm for mining frequent itemsets over the entire history of data streams. In Proc. of the 1st Intl. Workshop on Knowledge Discovery in Data Streams, Pisa, Italy, September 20-24, 2004.
Listed at the Mining Data Streams Bibliography: http://www.csse.monash.edu.au/~mgaber/WResources.htm.
[18] G. S. Manku and R. Motwani, Approximate frequency counts over data streams. In Proc. of the 28th VLDB Conference, Hong Kong, China, August 2002, pp. 346-357.
[19] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, Discovering frequent closed itemsets for association rules. In C. Beeri et al. (Eds.), Proc. of the 7th Intl. Conf. on Database Theory, Jerusalem: Springer-Verlag, 1999, pp. 398-416.
[20] J. Pei, J. Han, and R. Mao, CLOSET: An efficient algorithm for mining frequent closed itemsets. In D. Gunopulos et al. (Eds.), Proc. of the 2000 ACM SIGMOD Intl. Workshop on Data Mining and Knowledge Discovery, Dallas: ACM Press, 2000, pp. 21-30.
[21] T. Palpanas, M. Vlachos, E. Keogh, D. Gunopulos, and W. Truppel, Online amnesic approximation of streaming time series. In Proc. of ICDE 2004, Boston, MA, USA, March 2004, pp. 338-349.
[22] W. Teng, M. Chen, and P. Yu, A regression-based temporal pattern mining scheme for data streams. In Proc. of the 29th VLDB Conference, Berlin, Germany, September 2003, pp. 93-104.
[23] M. J. Zaki and C. J. Hsiao, CHARM: An efficient algorithm for closed itemset mining. In R. Grossman et al. (Eds.), Proc. of the 2nd SIAM Intl. Conf. on Data Mining, Arlington: SIAM, 2002, pp. 12-28.
[24] Y. Zhu and D. Shasha, StatStream: Statistical monitoring of thousands of data streams in real time. In P. Bernstein, Y. Ioannidis, and R. Ramakrishnan (Eds.), Proc. of the 28th Intl. Conf. on Very Large Data Bases, Hong Kong: Morgan Kaufmann, 2002, pp. 358-369.