MINE RULE: Semantic Dimensions in Association Rule Mining
Rosa Meo and Giuseppe Psaila
Università degli Studi di Torino and Università degli Studi di Bergamo, Italy
INTRODUCTION
Mining of association rules is one of the most widely adopted data mining techniques across a broad range of application domains. A great deal of work has been carried out in recent years on the development of efficient algorithms for association rule extraction. Indeed, this problem is a computationally difficult task (known to be NP-hard, see Calders 2004), made harder by the fact that association rules are normally extracted from very large databases. Moreover, in order to increase the relevance and interestingness of the results and to reduce their overall volume, constraints on association rules are introduced and must be evaluated (Ng et al. 1998, Srikant et al. 1997).
However, in this contribution we do not focus on the problem of developing efficient algorithms but on the semantic problem behind the extraction of association rules (see also Tsur et al. 1998 for an interesting generalization of this problem). We want to highlight the semantic dimensions that characterize the extraction of association rules; that is, we describe in a more general way which classes of problems association rules solve. To accomplish this, we adopt a general-purpose query language designed for the extraction of association rules from relational databases. The operator of this language, MINE RULE, allows the expression of constraints in the form of standard SQL predicates, which makes it suitable for many diverse application problems. For a comparison between this query language and other state-of-the-art languages for data mining, see Imielinski, Virmani, Abdoulghani 1996, Han et al. 1996, Netz et al. 2001, and Botta et al. 2004.
In Imielinski, Mannila 1996 a new approach to data mining is proposed, based on a new generation of databases called Inductive Databases (IDBs). With an IDB the user/analyst can use advanced query languages for data mining to interact with the knowledge discovery (KDD) system, extract descriptive and predictive data mining patterns from the database, and store them in the database. Boulicaut et al. 1998 and Baralis et al. 1999 discuss the usage of MINE RULE in this context.
We want to show that, thanks to a highly expressive query language, it is possible to exploit all the semantic possibilities of association rules and to solve very different problems with a single language, whose statements are instantiated along the different semantic dimensions of the same application domain. We discuss examples of statements solving problems in three application domains that are nowadays of great importance. The first application is the analysis of retail data, whose aim is market basket analysis (Agrawal et al. 1993) and the discovery of user profiles for customer relationship management (CRM). The second application is the analysis of data registered by a Web server on user accesses to Web sites (Cooley et al. 2000 present a study on the same application domain). The last domain is the analysis of genomic databases containing data on microarray experiments (Fayyad 2003). We show many practical examples of MINE RULE statements and discuss the application problems that can be solved by analysing the association rules that result from those statements.
BACKGROUND
An association rule has the form B ⇒ H, where B and H are sets of items, respectively called body (the antecedent) and head (the consequent). An association rule (also called simply a rule) intuitively means that the items in B and H are often associated within the observed data.
Two numerical parameters denote the validity of the rule: support is the fraction of source data for which the rule holds; confidence is the conditional probability that H holds provided that B holds. Two minimum thresholds for support and confidence are specified before rules are extracted, so that only significant rules are extracted.
This very general definition is, however, incomplete and very ambiguous. For example, what is the meaning of “fraction of source data for which the rule holds”? Which items are associated by a rule? If we do not answer these basic questions, an association rule does not have a precise meaning.
Consider for instance the original problem for which association rules were initially proposed in Agrawal et al. 1993, market basket analysis. If we have a database collecting single purchase transactions (for instance, transactions performed by customers in a retail store), we might wish to extract association rules that associate items sold within the same transaction. Intuitively, we are defining the semantics of our problem: items are associated by a rule if they appear together in the same transaction; support denotes the fraction of the total transactions that contain all the items in the rule (both B and H), while confidence denotes the conditional probability that, having found B in a transaction, H is also found in the same transaction. Thus a rule
{pants, shirt} ⇒ {socks, shoes} support=0.02 confidence=0.23
means that the items “pants”, “shirt”, “socks” and “shoes” appear together in 2% of the transactions, while, having found the items “pants” and “shirt” in a transaction, the probability that the same transaction also contains “socks” and “shoes” is 23%.
Semantic Dimensions
MINE RULE highlights the semantic dimensions that characterize the extraction of association rules from relational databases, and forces users (typically analysts) to understand these semantic dimensions.
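As a brief aside before the dimensions are discussed, the transaction-based reading of support and confidence given above for market basket analysis can be illustrated with a minimal Python sketch. It is not part of MINE RULE; the function name `support_confidence` and the five sample transactions are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def support_confidence(
    groups: Dict[str, Set[str]], body: Iterable[str], head: Iterable[str]
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule body => head over the groups."""
    body_set, head_set = set(body), set(head)
    rule_items = body_set | head_set
    n_groups = len(groups)
    # Groups (here: transactions) that contain every item of the body.
    n_body = sum(1 for items in groups.values() if body_set <= items)
    # Groups that contain every item of both body and head.
    n_rule = sum(1 for items in groups.values() if rule_items <= items)
    support = n_rule / n_groups if n_groups else 0.0
    confidence = n_rule / n_body if n_body else 0.0
    return support, confidence


# Hypothetical purchase transactions, keyed by transaction identifier.
transactions = {
    "t1": {"pants", "shirt", "socks", "shoes"},
    "t2": {"pants", "shirt"},
    "t3": {"pants", "shirt", "socks"},
    "t4": {"socks", "shoes"},
    "t5": {"pants", "shirt", "socks", "shoes"},
}

support, confidence = support_confidence(
    transactions, {"pants", "shirt"}, {"socks", "shoes"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

In this toy data the rule holds in 2 of the 5 transactions (support 0.4) and in 2 of the 4 transactions that contain the body (confidence 0.5). Changing what a group is, for example one group per date instead of one per transaction, changes both figures, which is the ambiguity the semantic dimensions are meant to resolve.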
Indeed, extracted association rules describe the most recurrent values of certain attributes that occur in the data (in the above example, the values of the purchased product). This is the first semantic dimension that characterizes the problem. These recurrent values are observed within sets of data grouped by some common features (such as the transaction identifier in the previous example, but in general the date, the customer identifier, and so on). This constitutes the second semantic dimension of the association rule problem. Therefore, extracted association rules describe the observed values of the first dimension that are recurrent in entities identified by the second dimension. When values belonging to the first dimension are associated, it is possible that not every association is suitable, and that only a subset of them should be selected, based on a coupling condition on attributes of the analyzed data (for instance, a temporal sequence between events described in B and H). This is the third semantic dimension of the problem; the coupling condition is called the mining condition.
It is clear that MINE RULE is not tied to any particular application domain, since the semantic dimensions make it possible to discover significant and unexpected information in very different application domains.
The main features and clauses of MINE RULE are the following (see Meo et al. 1998 for a detailed description):
– Selection of the relevant set of data for a data mining process. This feature is specified by the FROM clause.
– Selection of the grouping features with respect to which data are observed. These features are expressed by the GROUP BY clause.
– Definition of the structure of rules and of cardinality constraints on body and head, specified in the SELECT clause. Elements in rules can be single values or tuples.
– Definition of coupling constraints. These are constraints applied at the rule level for coupling values (the mining condition, expressed by a WHERE clause associated with SELECT).
– Definition of rule evaluation measures and minimum thresholds. These are support and confidence (even if, theoretically, other statistical measures would also be possible). The support of a rule is the number of groups in which the rule occurs and satisfies the given constraints, taken over the total number of groups. Confidence is the ratio between the rule support and the support of the body satisfying the given constraints. Thresholds are specified by the EXTRACTING RULES WITH clause.
MAIN THRUST OF THE CHAPTER
In this section we introduce MINE RULE in the context of the three application domains. We describe many examples of queries that can be regarded as templates, because they are instantiated along the relevant dimensions of an application domain and solve frequent, similar and critical situations for users of different applications.
First Application: Retail Data Analysis
We consider a typical data warehouse gathering information on customer purchases in a retail store:
FactTable (TransId, CustId, TimeId, ItemId, Num, Discount)
Customer (CustId, Profession, Age, Sex)
Rows in FactTable describe sales. The dimensions of data are the customer (CustId), the time (TimeId) and the purchased item (ItemId); each sale is characterized by the number of sold pieces (Num) and the discount (Discount); the transaction identifier (TransId) is reported as well. We also report table Customer.
Example 1: We want to extract a set of association rules, named FrequentItemSets, that finds the associations between sets of items (first dimension of the problem) purchased together on a sufficient number of dates (second dimension), with no specific coupling condition (third dimension). These associations provide the business-relevant sets of items because they are the most frequent over time. The MINE RULE statement is reported below.
MINE RULE FrequentItemSets AS
SELECT DISTINCT 1..n ItemId AS BODY, 1..n ItemId AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable
GROUP BY TimeId
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4
The first dimension of the problem is specified in the SELECT clause, which gives the schema of each element in association rules, the cardinality of body and head (in terms of lower and upper bound) and the statistical measures for the evaluation of association rules (support and confidence); in the example, body and head are non-empty sets of items and their upper bound is unlimited. The GROUP BY clause provides the second dimension of the problem: since attribute TimeId is specified, rules denote that the associated items have been sold on the same date (intuitively, rows are grouped by values of TimeId, and rules associate values of attribute ItemId appearing in the same group). The support of an association rule is computed in terms of the number of groups in which all the elements of the rule co-occur; confidence is computed analogously. In this example, support is computed over the different instants of time, since grouping is made according to the time identifier. Support and confidence of rules must be not lower than the values in the EXTRACTING clause (respectively 0.2 and 0.4).
Example 2: Customer profiling is a key problem in CRM applications. Association rules make it possible to obtain a description of customers (e.g., with respect to age and profession) in terms of frequently purchased products. To do that, values coming from two distinct dimensions of data must be associated.
MINE RULE CustomerProfiles AS
SELECT DISTINCT 1..1 Profession, Age AS BODY, 1..n ItemId AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable JOIN Customer ON FactTable.CustId=Customer.CustId
GROUP BY CustId
EXTRACTING RULES WITH SUPPORT:0.6, CONFIDENCE:0.9
The observed entity is the customer (first dimension of data), described by a single pair in the body (cardinality constraint 1..1); the head associates products frequently purchased by customers (second dimension of data) with the profile reported in the body (see the SELECT clause). Thus a rule
{(employee, 35)} ⇒ {socks, shoes} support=0.7 confidence=0.96
means that customers who are employees and 35 years old often (in 96% of cases) buy socks and shoes. Support indicates how frequent the profile is in the customer base (GROUP BY clause). This solution can be easily generalized to any profiling problem.
Second Application: Web Log Analysis
Typically, Web servers store information concerning accesses to Web sites in a standard log file. This can be seen as a relational table (WebLogTable) that typically contains at least the following attributes:
RequestID: identifier of the request;
IPcaller: IP address from which the request originated;
Date: date of the request;
TS: time stamp;
Operation: kind of operation (for instance, get or put);
PageUrl: URL of the requested page;
Protocol: transfer protocol (such as TCP/IP);
Return Code: code returned by the Web server;
Dimension: dimension of the page (in bytes).
Example 1: To discover Web communities of users on the basis of the pages they frequently visit, we might find associations between sets of users (first dimension) that have all visited a certain number of pages (second dimension); no coupling conditions are necessary (third dimension). Users are observed by means of their IP address, IPcaller, whose values are associated by rules (see SELECT).
Support and confidence of association rules, in this case, are computed based on the number of pages visited by the users in the rules (see GROUP BY). Thus a rule
{Ip1, Ip2} ⇒ {Ip3, Ip4} support=0.4 confidence=0.45
means that users operating from Ip1, Ip2, Ip3 and Ip4 visited the same set of pages, which constitutes 40% of the total pages in the site.
MINE RULE UsersSamePages AS
SELECT DISTINCT 1..n IPcaller AS BODY, 1..n IPcaller AS HEAD, SUPPORT, CONFIDENCE
FROM WebLogTable
GROUP BY PageUrl
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4
Example 2: In Web log analysis, it is interesting to discover the most frequent crawling paths.
MINE RULE FreqSeqPages AS
SELECT DISTINCT 1..n PageUrl AS BODY, 1..n PageUrl AS HEAD, SUPPORT, CONFIDENCE
WHERE BODY.Date < HEAD.Date
FROM WebLogTable
GROUP BY IPcaller
EXTRACTING RULES WITH SUPPORT:0.3, CONFIDENCE:0.4
Rows are grouped by user (IPcaller), and sets of pages frequently visited by a sufficient number of users are associated. Furthermore, pages are associated only if they denote a sequential pattern (third dimension): the mining condition WHERE BODY.Date < HEAD.Date constrains the temporal ordering between the pages in the antecedent and in the consequent of rules. Consequently, a rule
{P1, P2} ⇒ {P3, P4, P5} support=0.5 confidence=0.6
means that 50% of users visit pages P3, P4 and P5 after pages P1 and P2. This solution can be easily generalized to any problem requiring the search for sequential patterns.
Many other examples are possible, such as rules that associate users with frequently visited Web pages (which highlight the fidelity of users to the service provided by a Web site) or frequent requests of a page by a browser that cause an error in the Web server (interesting because they constitute a favourable situation for hackers’ attacks).
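The interplay between grouping and the mining condition in the FreqSeqPages statement above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not the MINE RULE engine: the function names, the integer day numbers used as dates and the sample log are hypothetical, and the sketch assumes that a rule holds in a group when, for every pair of a body page and a head page, some visit to the body page precedes some visit to the head page.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def rule_holds_in_group(
    visits: List[Tuple[str, int]], body: Set[str], head: Set[str]
) -> bool:
    """Check the condition BODY.Date < HEAD.Date within one user's visits."""
    dates = defaultdict(list)
    for page, date in visits:
        dates[page].append(date)
    # Every page of the rule must have been visited by this user.
    if any(page not in dates for page in body | head):
        return False
    # Assumed semantics: for each (body page, head page) pair, at least one
    # body visit is earlier than at least one head visit.
    return all(min(dates[b]) < max(dates[h]) for b in body for h in head)


def sequential_support(
    log: Iterable[Tuple[str, str, int]], body: Set[str], head: Set[str]
) -> float:
    """Fraction of users (groups by IPcaller) in which the rule holds."""
    groups = defaultdict(list)
    for ip, page, date in log:
        groups[ip].append((page, date))
    if not groups:
        return 0.0
    hits = sum(rule_holds_in_group(v, body, head) for v in groups.values())
    return hits / len(groups)


# Hypothetical log rows: (IPcaller, PageUrl, Date as a day number).
log = [
    ("ip1", "P1", 1), ("ip1", "P3", 2),
    ("ip2", "P3", 1), ("ip2", "P1", 2),
    ("ip3", "P1", 1), ("ip3", "P2", 2),
    ("ip4", "P1", 3), ("ip4", "P3", 5),
]

print(sequential_support(log, {"P1"}, {"P3"}))
```

With this sample log the rule {P1} ⇒ {P3} holds for ip1 and ip4 only, so the computed support is 0.5; ip2 visited both pages but in the opposite order, which is exactly what the mining condition filters out.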
Third Application: Genes Classification by Microarray Experiments
We consider information on a single microarray experiment, containing data on several samples of biological tissue, tied to corresponding probes on a silicon chip. Each sample is treated (or hybridized) in various ways and under different experimental conditions: these can determine the over-expression of a set of genes. This means that the genes in the set are active in the experimental conditions (or inactive if, on the contrary, they are under-expressed). Biologists are interested in discovering which sets of genes are similarly expressed and under which conditions.
A microarray typically contains some hundreds of samples, and for each sample several thousands of genes are measured. Thus the input relation, called MicroArrayTable, contains the following information:
SampleId: identifier of the sample of biological tissue tied to a probe on the microchip;
GeneId: identifier of the gene measured in the sample;
TreatmentConditionId: identifier of the experimental conditions under which the sample has been treated;
LevelOfExpression: measured value; if it is higher than a threshold T2, the gene is over-expressed; if it is lower than another threshold T1, the gene is under-expressed.
Example: This analysis discovers sets of genes (first dimension of the problem) that, in the same experimental conditions (second dimension), are similarly expressed (third dimension).
MINE RULE SimilarlyCorrelatedGenes AS
SELECT DISTINCT 1..n GeneId AS BODY, 1..n GeneId AS HEAD, SUPPORT, CONFIDENCE
WHERE BODY.LevelOfExpression < T1 AND HEAD.LevelOfExpression < T1 OR BODY.LevelOfExpression > T2 AND HEAD.LevelOfExpression > T2
FROM MicroArrayTable
GROUP BY SampleId, TreatmentConditionId
EXTRACTING RULES WITH SUPPORT:0.95, CONFIDENCE:0.8
The mining condition introduced by WHERE constrains both sets of genes to be similarly expressed in the same experimental conditions (i.e., samples of tissue treated in the same conditions).
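The WHERE clause of the statement above relies on the usual SQL precedence, in which AND binds more tightly than OR, so it reads as "both under-expressed, or both over-expressed". A minimal Python sketch of that predicate follows; the function name and the numeric threshold values are hypothetical, since the text does not give concrete values for T1 and T2.

```python
def similarly_expressed(
    body_level: float, head_level: float, t1: float, t2: float
) -> bool:
    """Mining condition: both genes under-expressed or both over-expressed.

    t1 is the lower (under-expression) threshold and t2 the higher
    (over-expression) threshold, as in the description of LevelOfExpression.
    """
    both_under = body_level < t1 and head_level < t1
    both_over = body_level > t2 and head_level > t2
    return both_under or both_over


# Hypothetical thresholds, chosen only for illustration.
T1, T2 = 0.5, 2.0

print(similarly_expressed(0.1, 0.2, T1, T2))  # both below T1
print(similarly_expressed(3.0, 2.5, T1, T2))  # both above T2
print(similarly_expressed(0.1, 3.0, T1, T2))  # one under, one over
```

Pairs in which one gene is under-expressed and the other over-expressed, or in which either level falls between T1 and T2, do not satisfy the condition and therefore contribute to no rule.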
The support threshold (0.95) determines the proportion of samples in which the sets of genes must be similarly expressed; confidence determines how strongly the two sets of genes are correlated. This statement might help biologists to discover the sets of genes that are involved in the production of proteins involved in the development of certain diseases (e.g., cancer).
FUTURE TRENDS
This contribution aims to evaluate the usability of a mining query language and of its results (association rules) in some practical applications. We identified many useful patterns, corresponding to concrete user problems. We showed that exploiting the nuggets of information embedded in the databases, together with the specialized mining constructs provided by the query language, enables the rapid customization of mining procedures according to the real needs of users. Given our experience, we also claim that, independently of the application domain, the use of queries in advanced languages, as opposed to ad-hoc heuristics, eases the specification and the discovery of a large spectrum of patterns. This motivates the need for powerful query languages in KDD systems.
For the future, we believe that a critical point will be the availability of powerful query optimizers, such as the one proposed in Meo 2003. That optimizer is able to answer data mining queries incrementally, that is, by modifying the results of previous queries materialized in the database.
CONCLUSION
In this contribution, we focused on the semantic problem behind the extraction of association rules. We highlighted the semantic dimensions that characterize the extraction of association rules; we did this by applying a general-purpose query language designed for the extraction of association rules, named MINE RULE, to three important application domains.
The query examples we provided show that the mining language is powerful and, at the same time, versatile, because its operational semantics seems to be the basic one. Indeed, these experiments allow us to claim that Mannila and Imielinski’s initial view on inductive databases was correct: “There is no such thing as real discovery, just a matter of the expressive power of the query languages”.
REFERENCES
Agrawal R., Imielinski T., Swami A. (1993). Mining association rules between sets of items in large databases. Proceedings of ACM SIGMOD International Conference on Management of Data.
Baralis E., Psaila G. (1999). Incremental refinement of mining queries. Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery, Italy.
Botta M., Boulicaut J.-F., Masson C., Meo R. (2004). Query languages supporting descriptive rule mining: a comparative study. In R. Meo, P. Lanzi, M. Klemettinen (eds.), Database Support for Data Mining Applications, p. 27-54, Springer-Verlag, LNCS 2682.
Boulicaut J.-F., Klemettinen M., Mannila H. (1998). Querying inductive databases: A case study on the MINE RULE operator. Proceedings of PKDD 1998 International Conference on Principles of Data Mining and Knowledge Discovery, France.
Calders T. (2004). Computational complexity of itemset frequency satisfiability. Proceedings of ACM PODS Principles Of Database Systems, June.
Cooley R., Tan P.N., Srivastava J. (2000). Discovery of interesting usage patterns from Web data. LNCS/LNAI, Springer Verlag.
Fayyad U.M. (2003). Special Issue of SIGKDD Explorations on Microarray Data Mining, vol. 4, n. 2.
Han J., Fu Y., Wang W., Koperski K., Zaiane O. (1996). DMQL: A data mining query language for relational databases. Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Canada.
Imielinski T., Mannila H. (1996). A database perspective on knowledge discovery. Communications of the ACM, 39(11):58–64, November.
Imielinski T., Virmani A., Abdoulghani A. (1996). DataMine: application programming interface and query language for database mining. Proceedings of ACM International Conference on SIGKDD.
Meo R., Psaila G., Ceri S. (1998). An extension to SQL for mining association rules. Journal of Data Mining and Knowledge Discovery, 2(2).
Meo R. (2003). Optimization of a language for data mining. Proceedings of ACM Symposium on Applied Computing – Special track on Data Mining, Florida.
Netz A., Chaudhuri S., Fayyad U.M., Bernhardt J. (2001). Integrating data mining with SQL databases: OLE DB for data mining. Proceedings IEEE ICDE International Conference on Data Engineering, Germany.
Ng R.T., Lakshmanan V.S., Han J., Pang A. (1998). Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of ACM SIGMOD International Conference Management of Data.
Srikant R., Vu Q., Agrawal R. (1997). Mining association rules with item constraints. Proceedings of 1997 ACM SIGKDD International Conference on Knowledge Discovery from Databases.
Tsur D., Ullman J.D., Abiteboul S., Clifton C., Motwani R., Nestorov S., Rosenthal A. (1998). Query Flocks: A generalization of association-rule mining. Proceedings of 1998 ACM SIGMOD International Conference Management of Data.
TERMS AND THEIR DEFINITION
Association Rule: An association between two sets of items co-occurring frequently in groups of data.
Constraint-based Mining: Data mining obtained by means of the evaluation of queries in a query language allowing predicates.
CRM: Management, understanding and control of data on the customers of a company for the purposes of enhancing business and minimizing customer churn.
Inductive Database: Database system that integrates in the database both source data and data mining patterns, the latter defined as the result of data mining queries on source data.
KDD: Knowledge Discovery in Databases, the process that performs the tasks of data preprocessing, transformation and selection, extraction of data mining patterns, and their post-processing and interpretation.
Semantic Dimension: Concept or entity of the studied domain that is being observed in terms of other concepts or entities.
Web Log: File stored by the Web server containing data on user accesses to a Web site.