Download Entry for Encyclopedia of Data Warehousing and Mining

MINE RULE: Semantic Dimensions in
Association Rule Mining
Rosa Meo and Giuseppe Psaila
Università degli Studi di Torino and Università degli Studi di Bergamo, Italy
INTRODUCTION
Mining of association rules is one of the most widely adopted data mining techniques
across a broad range of application domains. A great deal of work has been carried out in
recent years on the development of efficient algorithms for association rule extraction.
Indeed, this problem is computationally difficult (known to be NP-hard, see Calders
2004), and the difficulty is aggravated by the fact that association rules are normally
extracted from very large databases. Moreover, in order to increase the relevance and
interestingness of the results and to reduce their overall volume,
constraints on association rules are introduced and must be evaluated (Ng et al. 1998,
Srikant et al. 1997). In this contribution, however, we do not focus on the problem of
developing efficient algorithms but on the semantic problem behind the extraction of
association rules (see also Tsur et al. 1998 for an interesting generalization of this
problem).
We want to highlight the semantic dimensions that characterize the
extraction of association rules; that is, we describe in a more general way which classes
of problems association rules solve. To accomplish this, we adopt a general-purpose
query language designed for the extraction of association rules from relational
databases. The operator of this language, MINE RULE, allows the expression of
constraints in the form of standard SQL predicates, which makes it suitable
for many diverse application problems. For a comparison between this query
language and other state-of-the-art languages for data mining, see Imielinski, Virmani,
Abdoulghani 1996, Han et al. 1996, Netz et al. 2001, and Botta et al. 2004.
In Imielinski, Mannila 1996 a new approach to data mining is proposed, based on
a new generation of databases called Inductive Databases (IDBs). With an
IDB the user/analyst can use advanced query languages for data mining to interact with
the knowledge discovery (KDD) system, extract descriptive and predictive
patterns from the database and store them in the database itself. Boulicaut et al. 1998 and
Baralis et al. 1999 discuss the usage of MINE RULE in this context.
We want to show that, thanks to a highly expressive query language, it is possible to
exploit all the semantic possibilities of association rules and solve very different
problems with a single language, whose statements are instantiated along the different
semantic dimensions of the same application domain. We discuss examples of statements
solving problems in three application domains that are nowadays of great
importance. The first application is the analysis of retail data, whose aim is market
basket analysis (Agrawal et al. 1993) and the discovery of user profiles for customer
relationship management (CRM). The second application is the analysis of data
registered in a Web server on the accesses to Web sites by users (Cooley et al. 2000
present a study on the same application domain). The last domain is the analysis of
genomic databases containing data on microarray experiments (Fayyad 2003).
We show many practical examples of MINE RULE statements and discuss the
application problems that can be solved by analysing the association rules that result from
those statements.
BACKGROUND
An association rule has the form B ⇒ H, where B and H are sets of items,
respectively called body (the antecedent) and head (the consequent). An association rule
(also called simply a rule) intuitively means that items in B and H are often associated
within the observed data. Two numerical parameters denote the validity of the rule:
support is the fraction of source data for which the rule holds; confidence is the
conditional probability that H holds provided that B holds. Two minimum thresholds for
support and confidence are specified before rules are extracted, so that only significant
rules are extracted.
This very general definition is, however, incomplete and very ambiguous. For
example, what is the meaning of “fraction of source data for which the rule holds”? And
which are the items associated by a rule? If we do not answer these basic questions, an
association rule does not have a precise meaning. Consider for instance the original
problem for which association rules were initially proposed in Agrawal et al. 1993,
market basket analysis. If we have a database collecting single purchase transactions (for
instance, transactions performed by customers in a retail store) we might wish to extract
association rules that associate items sold within the same transaction. Intuitively, we are
defining the semantics of our problem: items are associated by a rule if they appear
together in the same transaction; support denotes the fraction of the total transactions that
contain all the items in the rule (both B and H), while confidence denotes the conditional
probability that, having found B in a transaction, H is also found in the same transaction.
Thus a rule
{pants, shirt} ⇒ {socks, shoes} support=0.02 confidence=0.23
means that the items “pants”, “shirt”, “socks” and “shoes” appear together in
2% of transactions, while, given that the items “pants” and “shirt” are found in a transaction, the
probability that the same transaction also contains “socks” and “shoes” is 23%.
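The transaction-based reading of support and confidence given above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy transactions are assumptions made for this example and do not come from the original work.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, body, head):
    """Estimate of P(head | body): support(body and head) / support(body)."""
    body_support = support(transactions, body)
    if body_support == 0:
        return 0.0
    return support(transactions, set(body) | set(head)) / body_support


# Hypothetical toy data: one set of purchased items per transaction.
transactions = [
    {"pants", "shirt", "socks", "shoes"},
    {"pants", "shirt"},
    {"pants", "shirt", "socks"},
    {"socks", "shoes"},
    {"shirt"},
]

rule_support = support(transactions, {"pants", "shirt", "socks", "shoes"})
rule_confidence = confidence(transactions, {"pants", "shirt"}, {"socks", "shoes"})
print(rule_support, rule_confidence)
```

On this toy data one transaction out of five contains all four items, and three contain the body, so the rule has support 1/5 and confidence 1/3.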
Semantic Dimensions
MINE RULE highlights the semantic dimensions that characterize the
extraction of association rules from relational databases, and forces users
(typically analysts) to understand these semantic dimensions. Indeed, extracted
association rules describe the most recurrent values of certain attributes that occur in the
data (in the above example, the values of the purchased product). This is the first semantic
dimension that characterizes the problem. These recurrent values are observed within sets
of data grouped by some common features (such as the transaction identifier in the
previous example, but in general the date, the customer identifier, and so on). This
constitutes the second semantic dimension of the association rule problem. Therefore,
extracted association rules describe the observed values of the first dimension that are
recurrent in entities identified by the second dimension.
When values belonging to the first dimension are associated, it is possible that not
all associations are suitable, and that only a subset of them should be selected, based on a
coupling condition on attributes of the analyzed data (for instance, a temporal sequence
between the events described in B and H). This is the third semantic dimension of the
problem; the coupling condition is called the mining condition.
It is clear that MINE RULE is not tied to any particular application domain, since
the semantic dimensions make it possible to discover significant and unexpected information in very
different application domains.
The main features and clauses of MINE RULE are (see Meo et al. 1998 for a detailed
description):
– Selection of the relevant set of data for a data mining process. This feature is
specified by the FROM clause.
– Selection of the grouping features w.r.t. which data are observed. These features
are expressed by the GROUP BY clause.
– Definition of the structure of rules and cardinality constraints on body and head,
specified in the SELECT clause. Elements in rules can be single values or tuples.
– Definition of coupling constraints. These are constraints applied at the rule level
(mining condition, instantiated by a WHERE clause associated with SELECT) for coupling
values.
– Definition of rule evaluation measures and minimum thresholds. These are
support and confidence (although other statistical measures would also be possible
in principle). Support of a rule is computed over the total number of groups, counting those
in which the rule occurs and satisfies the given constraints. Confidence is the ratio between the rule support and
the support of the body satisfying the given constraints. Thresholds are specified by the
clause EXTRACTING RULES WITH.
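The two measures described in the list above can be written compactly as follows. The notation (G, supp, conf) is introduced here only for illustration and is not part of the MINE RULE syntax.

```latex
\[
\mathrm{supp}(B \Rightarrow H) \;=\;
  \frac{\lvert \{\, g \in G \;:\; B \Rightarrow H \text{ holds in } g \,\} \rvert}{\lvert G \rvert},
\qquad
\mathrm{conf}(B \Rightarrow H) \;=\;
  \frac{\mathrm{supp}(B \Rightarrow H)}{\mathrm{supp}(B)}
\]
```

Here G is the set of groups produced by the GROUP BY clause, "holds in g" means that all elements of B and H occur in group g and satisfy the given constraints, and supp(B) is the fraction of groups in which the body occurs and satisfies the given constraints.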
MAIN THRUST OF THE CHAPTER
In this section we introduce MINE RULE in the context of the three application
domains. We describe many examples of queries that can be regarded as templates,
because they are instantiated along the relevant dimensions of an application domain
and solve frequent, similar and critical situations for users of different applications.
First Application: Retail Data Analysis
We consider a typical data warehouse gathering information on customers'
purchases in a retail store:
FactTable (TransId, CustId, TimeId, ItemId, Num, Discount)
Customer (CustId, Profession, Age, Sex)
Rows in FactTable describe sales. The dimensions of data are the customer
(CustId), the time (TimeId) and the purchased item (ItemId); each sale is characterized by
the number of sold pieces (Num) and the discount (Discount); the transaction identifier
(TransId) is reported as well. We also report table Customer.
Example 1: We want to extract a set of association rules, named
FrequentItemSets, that finds the associations between sets of items (first
dimension of the problem) purchased together on a sufficient number of dates (second
dimension) with no specific coupling condition (third dimension). These associations
provide the business-relevant sets of items, because they are the most frequent over time. The
MINE RULE statement is reported below.
MINE RULE FrequentItemSets AS
SELECT DISTINCT 1..n ItemId AS BODY, 1..n ItemId AS HEAD, SUPPORT, CONFIDENCE
FROM FactTable
GROUP BY TimeId
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4
The first dimension of the problem is specified in the SELECT clause, which defines
the schema of each element in association rules, the cardinality of body and head (in
terms of lower and upper bound) and the statistical measures for the evaluation of
association rules (support and confidence); in the example, body and head are non-empty
sets of items and their upper bound is unlimited.
The GROUP BY clause provides the second dimension of the problem: since attribute
TimeId is specified, rules denote that the associated items have been sold on the same date
(intuitively, rows are grouped by values of TimeId, and rules associate values of
attribute ItemId appearing in the same group).
Support of an association rule is computed as the fraction of groups in which all the
elements of the rule co-occur; confidence is computed analogously. In this example,
support is computed over the different dates, since grouping is made according
to the time identifier. Support and confidence of rules must not be lower than the values
in the EXTRACTING RULES WITH clause (0.2 and 0.4, respectively).
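The role of the grouping attribute can be illustrated with a small Python sketch. It is not the MINE RULE implementation: the helper names and the toy rows are assumptions, and only the attribute names TransId, TimeId and ItemId are taken from FactTable. The same rule is evaluated twice, once grouping by TimeId as in Example 1 and once grouping by TransId, to show that the second dimension changes what support and confidence mean.

```python
from collections import defaultdict


def group_items(rows, group_attr, item_attr):
    """Collect, for each value of group_attr, the set of item_attr values."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[group_attr]].add(row[item_attr])
    return groups


def rule_measures(groups, body, head):
    """Support and confidence of body => head, counted over groups."""
    body, head = set(body), set(head)
    body_count = sum(1 for items in groups.values() if body <= items)
    rule_count = sum(1 for items in groups.values() if (body | head) <= items)
    support = rule_count / len(groups)
    confidence = rule_count / body_count if body_count else 0.0
    return support, confidence


# Hypothetical rows of FactTable (only the attributes needed here).
fact_table = [
    {"TransId": 1, "TimeId": "d1", "ItemId": "pants"},
    {"TransId": 1, "TimeId": "d1", "ItemId": "shirt"},
    {"TransId": 2, "TimeId": "d1", "ItemId": "socks"},
    {"TransId": 3, "TimeId": "d2", "ItemId": "pants"},
    {"TransId": 3, "TimeId": "d2", "ItemId": "socks"},
    {"TransId": 4, "TimeId": "d3", "ItemId": "shirt"},
]

# GROUP BY TimeId: items sold on the same date are associated.
by_date = rule_measures(group_items(fact_table, "TimeId", "ItemId"),
                        {"pants"}, {"socks"})
# GROUP BY TransId: items sold in the same transaction are associated.
by_transaction = rule_measures(group_items(fact_table, "TransId", "ItemId"),
                               {"pants"}, {"socks"})
print(by_date, by_transaction)
```

Grouping by date, the rule {pants} ⇒ {socks} holds in two of the three dates; grouping by transaction, it holds in only one of the four transactions, because on date d1 the two items were bought in different transactions.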
Example 2: Customer profiling is a key problem in CRM applications. Association rules
make it possible to obtain a description of customers (e.g. w.r.t. age and profession) in terms of
frequently purchased products. To do that, values coming from two distinct dimensions
of data must be associated.
MINE RULE CustomerProfiles AS
SELECT DISTINCT 1..1 Profession, Age AS BODY, 1..n ItemId AS HEAD, SUPPORT,
CONFIDENCE
FROM FactTable JOIN Customer ON FactTable.CustId=Customer.CustId
GROUP BY CustId
EXTRACTING RULES WITH SUPPORT:0.6, CONFIDENCE:0.9
The observed entity is the customer (first dimension of data), described by a single
(Profession, Age) pair in the body (cardinality constraint 1..1); the head associates the
products frequently purchased by customers (second dimension of data) with the profile
reported in the body (see the SELECT clause). Thus a rule
{(employee, 35)} ⇒ {socks, shoes} support=0.7 confidence=0.96
means that customers who are employees and 35 years old often (in 96% of cases) buy
socks and shoes. Support gives the frequency of the rule in the customer
base, since groups are customers (GROUP BY clause). This solution can easily be
generalized to any profiling problem.
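A profiling rule of this kind, with a tuple in the body and items in the head, can be evaluated as in the following Python sketch. The function name and the toy rows are assumptions made for illustration; the rows stand for the result of joining FactTable and Customer, and each customer is assumed to have exactly one (Profession, Age) pair.

```python
from collections import defaultdict


def profile_rule_measures(rows, profile, head):
    """Support and confidence of {profile} => head, with one group per customer."""
    profiles = {}               # CustId -> (Profession, Age)
    items = defaultdict(set)    # CustId -> set of purchased ItemId values
    for row in rows:
        profiles[row["CustId"]] = (row["Profession"], row["Age"])
        items[row["CustId"]].add(row["ItemId"])

    head = set(head)
    with_profile = [c for c, p in profiles.items() if p == profile]
    with_rule = [c for c in with_profile if head <= items[c]]
    support = len(with_rule) / len(profiles)
    confidence = len(with_rule) / len(with_profile) if with_profile else 0.0
    return support, confidence


# Hypothetical rows of FactTable JOIN Customer.
joined = [
    {"CustId": 1, "Profession": "employee", "Age": 35, "ItemId": "socks"},
    {"CustId": 1, "Profession": "employee", "Age": 35, "ItemId": "shoes"},
    {"CustId": 2, "Profession": "employee", "Age": 35, "ItemId": "socks"},
    {"CustId": 2, "Profession": "employee", "Age": 35, "ItemId": "shoes"},
    {"CustId": 3, "Profession": "employee", "Age": 35, "ItemId": "pants"},
    {"CustId": 4, "Profession": "student", "Age": 22, "ItemId": "socks"},
]

profile_support, profile_confidence = profile_rule_measures(
    joined, ("employee", 35), {"socks", "shoes"})
print(profile_support, profile_confidence)
```

In this toy customer base, three of the four customers match the profile and two of them bought both items, so the rule has support 0.5 and confidence 2/3.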
Second Application: Web Log Analysis
Typically, Web servers store information concerning accesses to Web sites in a
standard log file. It can be viewed as a relational table (WebLogTable) that typically contains at
least the following attributes:
RequestID: identifier of the request;
IPcaller: IP address from which the request is originated;
Date: date of the request;
TS: time stamp;
Operation: kind of operation (for instance, get or put);
PageUrl: URL of the requested page;
Protocol: transfer protocol (such as TCP/IP);
Return Code: code returned by the Web server;
Dimension: dimension of the page (in Bytes).
Example 1: To discover Web communities of users on the basis of the pages they
frequently visit, we might find associations between sets of users (first dimension) that
have all visited a certain number of pages (second dimension); no coupling conditions are
necessary (third dimension). Users are observed by means of their IP address,
IPcaller, whose values are associated by rules (see SELECT). Support and confidence
of association rules, in this case, are computed based on the number of pages visited by
the users in the rules (see GROUP BY). Thus the rule
{Ip1, Ip2} ⇒ {Ip3, Ip4} support=0.4 confidence=0.45
means that users operating from Ip1, Ip2, Ip3 and Ip4 visited the same set of pages, which
constitutes 40% of the total pages in the site.
MINE RULE UsersSamePages AS
SELECT DISTINCT 1..n IPcaller AS BODY, 1..n IPcaller AS HEAD, SUPPORT, CONFIDENCE
FROM WebLogTable
GROUP BY PageUrl
EXTRACTING RULES WITH SUPPORT:0.2, CONFIDENCE:0.4
Example 2: In Web log analysis, it is interesting to discover the most frequent crawling
paths.
MINE RULE FreqSeqPages AS
SELECT DISTINCT 1..n PageUrl AS BODY, 1..n PageUrl AS HEAD, SUPPORT, CONFIDENCE
WHERE BODY.Date < HEAD.Date
FROM WebLogTable
GROUP BY IPcaller
EXTRACTING RULES WITH SUPPORT:0.3, CONFIDENCE:0.4
Rows are grouped by user (IPcaller), and sets of pages frequently visited by a
sufficient number of users are associated. Furthermore, pages are associated only if they
denote a sequential pattern (third dimension): in fact, the mining condition WHERE
BODY.Date < HEAD.Date constrains the temporal ordering between the pages in the
antecedent and the consequent of rules; consequently, the rule
{P1, P2} ⇒ {P3, P4, P5} support=0.5 confidence=0.6
means that 50% of users visit pages P3, P4 and P5 after pages P1 and P2.
This solution can easily be generalized to any problem requiring the search for
sequential patterns.
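The effect of the mining condition BODY.Date < HEAD.Date can be sketched in Python as follows. The sketch rests on an assumption that should be checked against Meo et al. 1998: a rule is taken to hold in a group when, for every body page b and head page h, the group contains a visit to b dated earlier than some visit to h. The function name and the toy log rows are hypothetical; body support simply counts the users who visited all the body pages.

```python
from collections import defaultdict


def sequential_rule_measures(rows, body, head):
    """Support and confidence of body => head under BODY.Date < HEAD.Date."""
    visits = defaultdict(list)  # IPcaller -> list of (PageUrl, Date)
    for row in rows:
        visits[row["IPcaller"]].append((row["PageUrl"], row["Date"]))

    body, head = set(body), set(head)
    body_groups = 0
    rule_groups = 0
    for user_visits in visits.values():
        first, last = {}, {}  # earliest and latest visit date of each page
        for page, date in user_visits:
            first[page] = min(date, first.get(page, date))
            last[page] = max(date, last.get(page, date))
        pages = set(first)
        if body <= pages:
            body_groups += 1
            # Assumed semantics: every (b, h) pair must admit b before h.
            if head <= pages and all(
                first[b] < last[h] for b in body for h in head
            ):
                rule_groups += 1
    support = rule_groups / len(visits)
    confidence = rule_groups / body_groups if body_groups else 0.0
    return support, confidence


# Hypothetical log rows; ISO-formatted dates compare correctly as strings.
web_log = [
    {"IPcaller": "ip1", "PageUrl": "P1", "Date": "2004-03-01"},
    {"IPcaller": "ip1", "PageUrl": "P2", "Date": "2004-03-02"},
    {"IPcaller": "ip1", "PageUrl": "P3", "Date": "2004-03-03"},
    {"IPcaller": "ip2", "PageUrl": "P3", "Date": "2004-03-01"},
    {"IPcaller": "ip2", "PageUrl": "P1", "Date": "2004-03-02"},
    {"IPcaller": "ip2", "PageUrl": "P2", "Date": "2004-03-03"},
    {"IPcaller": "ip3", "PageUrl": "P1", "Date": "2004-03-01"},
    {"IPcaller": "ip3", "PageUrl": "P3", "Date": "2004-03-02"},
    {"IPcaller": "ip4", "PageUrl": "P2", "Date": "2004-03-01"},
]

seq_result = sequential_rule_measures(web_log, {"P1", "P2"}, {"P3"})
print(seq_result)
```

Here only ip1 visits P3 after both P1 and P2; ip2 visits the same three pages but in the wrong order, so it contributes to the body count but not to the rule count.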
Many other examples are possible, such as rules that associate users with frequently visited
Web pages (highlighting the loyalty of users to the service provided by a Web site) or
frequent requests of a page by a browser that cause an error in the Web server (interesting
because such errors constitute a favourable situation for hackers' attacks).
Third Application: Genes Classification by Microarray Experiments
We consider information on a single microarray experiment, containing data on
several samples of biological tissue, tied to corresponding probes on a silicon chip. Each
sample is treated (or hybridized) in various ways and under different experimental
conditions: these can determine the over-expression of a set of genes. This means that the
genes in the set are active in the experimental conditions (or inactive if, on the contrary,
they are under-expressed). Biologists are interested in discovering which sets of genes are similarly
expressed and under which conditions.
A microarray typically contains some hundreds of samples, and for each sample
several thousands of genes are measured. Thus the input relation, called MicroArrayTable,
contains the following information:
SampleId: identifier of the sample of biological tissue tied to a probe on the
microchip;
GeneId: identifier of the gene measured in the sample;
TreatmentConditionId: identifier of the experimental conditions under which the
sample has been treated;
LevelOfExpression: measured value; if higher than a threshold T2, the gene is
over-expressed; if lower than another threshold T1, the gene is under-expressed.
Example: This analysis discovers sets of genes (first dimension of the problem) that, in
the same experimental conditions (second dimension), are similarly expressed (third
dimension).
MINE RULE SimilarlyCorrelatedGenes AS
SELECT DISTINCT 1..n GeneId AS BODY, 1..n GeneId AS HEAD, SUPPORT, CONFIDENCE
WHERE (BODY.LevelOfExpression < T1 AND HEAD.LevelOfExpression < T1) OR
(BODY.LevelOfExpression > T2 AND HEAD.LevelOfExpression > T2)
FROM MicroArrayTable
GROUP BY SampleId, TreatmentConditionId
EXTRACTING RULES WITH SUPPORT:0.95, CONFIDENCE:0.8
The mining condition introduced by WHERE constrains both sets of genes to be
similarly expressed in the same experimental conditions (i.e., samples of tissue treated in
the same conditions). The support threshold (0.95) determines the minimum proportion of samples in
which the sets of genes must be similarly expressed; confidence determines how strongly
the two sets of genes are correlated.
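The mining condition of this statement can be sketched in Python as follows. The thresholds T1 and T2, the function names and the toy measurements are hypothetical. The sketch assumes that a rule holds in a group when every body gene and every head gene fall in the same expression class (all under-expressed or all over-expressed), and it computes only support, leaving confidence aside.

```python
from collections import defaultdict

T1, T2 = 0.5, 2.0  # hypothetical under- and over-expression thresholds


def expression_class(level):
    """Classify a measured level as 'under', 'over' or None (neither)."""
    if level < T1:
        return "under"
    if level > T2:
        return "over"
    return None


def similar_expression_support(rows, body, head):
    """Fraction of (SampleId, TreatmentConditionId) groups where body => head holds."""
    groups = defaultdict(dict)  # group key -> {GeneId: expression class}
    for row in rows:
        key = (row["SampleId"], row["TreatmentConditionId"])
        groups[key][row["GeneId"]] = expression_class(row["LevelOfExpression"])

    genes = set(body) | set(head)
    holding = 0
    for classes in groups.values():
        found = {classes.get(g) for g in genes}
        # Assumed semantics: all genes share one class, and that class is not None.
        if len(found) == 1 and None not in found:
            holding += 1
    return holding / len(groups)


# Hypothetical measurements: three samples, three genes each.
micro_array = [
    {"SampleId": "s1", "TreatmentConditionId": "c1", "GeneId": "g1", "LevelOfExpression": 0.2},
    {"SampleId": "s1", "TreatmentConditionId": "c1", "GeneId": "g2", "LevelOfExpression": 0.3},
    {"SampleId": "s1", "TreatmentConditionId": "c1", "GeneId": "g3", "LevelOfExpression": 2.5},
    {"SampleId": "s2", "TreatmentConditionId": "c1", "GeneId": "g1", "LevelOfExpression": 2.4},
    {"SampleId": "s2", "TreatmentConditionId": "c1", "GeneId": "g2", "LevelOfExpression": 3.0},
    {"SampleId": "s2", "TreatmentConditionId": "c1", "GeneId": "g3", "LevelOfExpression": 2.2},
    {"SampleId": "s3", "TreatmentConditionId": "c2", "GeneId": "g1", "LevelOfExpression": 0.1},
    {"SampleId": "s3", "TreatmentConditionId": "c2", "GeneId": "g2", "LevelOfExpression": 1.0},
    {"SampleId": "s3", "TreatmentConditionId": "c2", "GeneId": "g3", "LevelOfExpression": 0.4},
]

support_g1_g2 = similar_expression_support(micro_array, {"g1"}, {"g2"})
support_g1g2_g3 = similar_expression_support(micro_array, {"g1", "g2"}, {"g3"})
print(support_g1_g2, support_g1g2_g3)
```

In the toy data, g1 and g2 are similarly expressed in two of the three samples, while all three genes are similarly expressed only in sample s2.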
This statement might help biologists to discover the sets of genes that are involved
in the production of proteins that play a role in the development of certain diseases (e.g.,
cancer).
FUTURE TRENDS
This contribution aims to evaluate the usability of a mining query language and of its
results (association rules) in some practical applications. We identified many useful
patterns, corresponding to concrete user problems. We showed that the exploitation of the
nuggets of information embedded in the databases and of the specialized mining
constructs provided by the query languages enables the rapid customization of the
mining procedures according to the real users' needs. Given our experience, we also
claim that, independently of the application domain, the use of queries in advanced
languages, as opposed to ad-hoc heuristics, eases the specification and the discovery of a
large spectrum of patterns. This motivates the need for powerful query languages in KDD
systems.
For the future, we believe that a critical point will be the availability of powerful
query optimizers, such as the one proposed in Meo 2003. That optimizer is able to solve data
mining queries incrementally, that is, by modifying the results of previous queries
materialized in the database.
CONCLUSION
In this contribution, we focused on the semantic problem behind the extraction of
association rules. We highlighted the semantic dimensions that characterize
the extraction of association rules; we did this by applying a general-purpose query
language designed for the extraction of association rules, named MINE RULE, to three
important application domains.
The query examples we provided show that the mining language is powerful, and at the
same time versatile, because its operational semantics seems to be the basic one. Indeed,
these experiments allow us to claim that Mannila and Imielinski's initial view on
inductive databases was correct: “There is no such thing as real discovery, just a matter
of the expressive power of the query languages”.
REFERENCES
Agrawal R., Imielinski T., Swami A. (1993). Mining association rules between sets of items in large
databases. Proceedings of ACM SIGMOD International Conference on Management of Data.
Baralis E., Psaila G. (1999). Incremental refinement of mining queries, Proceedings of the First International
Conference on Data Warehousing and Knowledge Discovery, Italy.
Botta M., Boulicaut J.-F., Masson C., Meo R. (2004). Query languages supporting descriptive rule mining: a
comparative study, in R.Meo, P.Lanzi, M.Klemettinen, (eds.) Database Support for Data Mining Applications,
p. 27-54, Springer-Verlag, LNCS 2682.
Boulicaut J.-F., Klemettinen M., Mannila H. (1998). Querying inductive databases: A case study on the MINE
RULE operator, Proceedings of PKDD 1998 International Conference on Principles of Data Mining and
Knowledge Discovery, France.
Calders T. (2004). Computational complexity of itemset frequency satisfiability, Proceedings of ACM PODS
Principles Of Database Systems, June.
Cooley R., Tan P.N., Srivastava J. (2000). Discovery of interesting usage patterns from Web data.
LNCS/LNAI. Springer Verlag, 2000.
Fayyad U.M. (2003). Special Issue of SIGKDD Explorations on Microarray Data Mining, vol. 4, n.2.
Han J., Fu Y., Wang W., Koperski K., Zaiane O. (1996). DMQL: A data mining query language for relational
databases, Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge
Discovery, Canada.
Imielinski T., Mannila H. (1996). A database perspective on knowledge discovery. Communications of the
ACM, 39(11):58–64, November.
Imielinski T., Virmani A., Abdoulghani A. (1996). DataMine: application programming interface and query
language for database mining, Proceedings of ACM International Conference on SIGKDD.
Meo R., Psaila G., Ceri S. (1998). An extension to SQL for mining association rules. Journal of Data Mining
and Knowledge Discovery, 2(2).
Meo R. (2003). Optimization of a language for data mining. Proceedings of ACM Symposium on Applied
Computing – Special track on Data Mining, Florida.
Netz A., Chaudhuri S., Fayyad U.M., Bernhardt J. (2001). Integrating data mining with SQL databases: OLE
DB for data mining, Proceedings IEEE ICDE International Conference on Data Engineering, Germany.
Ng R.T., Lakshmanan V.S., Han J., Pang A. (1998). Exploratory mining and pruning optimizations of
constrained associations rules, Proceedings of ACM SIGMOD International Conference Management of
Data.
Srikant R., Vu Q., Agrawal R. (1997). Mining association rules with item constraints, Proceedings of 1997
ACM SIGKDD International Conference on Knowledge Discovery from Databases.
Tsur D., Ullman J.D., Abiteboul S., Clifton C., Motwani R., Nestorov S., Rosenthal A. (1998). Query Flocks:
A generalization of association-rule mining, Proceedings of 1998 ACM SIGMOD International Conference
Management of Data.
TERMS AND THEIR DEFINITION
Association Rule: An association between two sets of items co-occurring frequently in
groups of data.
Constraint-based Mining: Data mining obtained by means of evaluation of queries in a
query language allowing predicates.
CRM: Customer Relationship Management. Management, understanding and control of data
on the customers of a company for the purposes of enhancing business and minimizing customer churn.
Inductive Database: Database system integrating in the database source data and data
mining patterns defined as the result of data mining queries on source data.
KDD: Knowledge Discovery in Databases. The process that performs data preprocessing,
transformation and selection, the extraction of data mining patterns and
their post-processing and interpretation.
Semantic Dimension: Concept or entity of the studied domain that is being observed in
terms of other concepts or entities.
Web Log: File stored by the Web server containing data on user accesses to a Web site.