International Journal of Electrical & Computer Science IJECS-IJENS Vol:14 No:03
Apriori Based: Mining Infrequent and Non-Present
Item Sets from Transactional Data Bases
Sujatha Kamepalli (1), Raja Sekhara Rao Kurra (2) and Sundara Krishna.Y.K. (3)
(1) Research Scholar, CSE Department, Krishna University, Machilipatnam, Andhra Pradesh, India. sujatha101012@gmail.com
(2) Director, Sri Prakash College of Engineering, Thuni, Andhra Pradesh, India. krr_it@yahoo.co.in
(3) Professor, CSE Department, Krishna University, Machilipatnam, Andhra Pradesh, India. yksk2010@gmail.com
Abstract-- Item set mining has been an active area of research
due to its successful application in various data mining scenarios
including finding association rules. Though most of the past work
has been on finding frequent item sets, infrequent item set mining
has demonstrated its utility in web mining, bioinformatics and
other fields. In this paper, we propose a new method based on
the Apriori algorithm to find infrequent item sets and non-present
item sets. Finally, we analyze the behavior of our proposed
method by considering a transactional data base.
Index Terms-- Data mining, frequent item sets, infrequent item
sets, non-present item sets, Apriori.
1. INTRODUCTION
Frequent pattern mining was first proposed by
Agrawal et al. (1993) [2] for market basket analysis in the
form of association rule mining. It analyses customer buying
habits by finding associations between the different items that
customers place in their “shopping baskets”. For instance, if
customers are buying milk, how likely are they going to also
buy cereal (and what kind of cereal) on the same trip to the
supermarket? Such information can lead to increased sales by
helping retailers do selective marketing and arrange their shelf
space [3]. Patterns that are rarely found in a database are often
considered to be uninteresting and are eliminated using the
support measure. Such patterns are known as infrequent
patterns [4]. An infrequent pattern is an item set or a rule
whose support is less than the minimum support threshold.
Mining frequent item sets has found extensive
use in various data mining applications, including
consumer market-basket analysis [13], inference of patterns
from web page access logs [17], and iceberg-cube computation
[14]. Extensive research has, therefore, been conducted on
finding efficient algorithms for frequent item set mining,
especially for finding association rules [15]. However,
significantly less attention has been paid to the mining of
infrequent item sets, even though it has important uses in
(i) mining of negative association rules from infrequent item
sets [18], (ii) statistical disclosure risk assessment, where rare
patterns in anonymous census data can lead to statistical
disclosure [16], (iii) fraud detection, where rare patterns in
financial or tax data may suggest unusual activity associated
with fraudulent behavior [16], and (iv) bioinformatics, where
rare patterns in microarray data may suggest genetic disorders
[16].
Although a vast majority of infrequent patterns are
uninteresting, some of them might be useful to the analysis,
particularly those that correspond to negative correlations in
data. For example, the sale of DVDs and VCRs together is low
because any customer who buys a DVD will most likely not
buy a VCR and vice versa. Such negatively correlated patterns
help to identify competing items, which are items
that can be substituted for one another. Examples of competing
items include tea versus coffee, butter versus margarine,
regular versus diet soda, and desktop versus laptop computers.
Some infrequent patterns may also suggest the occurrence of
interesting rare events or exceptional situations in the data. For
example, if {Fire = Yes} is frequent but {Fire = Yes, Alarm =
On} is infrequent, then the latter is an interesting infrequent
pattern because it may indicate faulty alarm systems. To detect
such unusual situations, the expected support of a pattern must
be determined, so that, if a pattern turns out to have a
considerably lower support than expected, it is declared as an
interesting infrequent pattern [5]. The mining task that focuses
on discovering frequent patterns from the databases is called
frequent pattern mining. In frequent pattern mining, only
frequent patterns are returned while infrequent patterns are
simply discarded without further consideration. This is
because the most valuable information is carried by the
frequent patterns and the infrequent patterns cannot adequately
reflect the typical characteristics from the data because of their
rare occurrence [8]. However, since the late 1990s, more and
more researchers have realized the importance of infrequent
patterns with the increasing demands from applications of
anomaly detection, especially in medicine [6], genetics [10],
molecular biology [7] and network security [11]. In these
1412503-7474-IJECS-IJENS © June 2014 IJENS
areas, infrequent patterns are considered significant due to the
huge influence they may have. In the study of finding a better
treatment approach for a special disease, researchers would
like to spend more time on studying an abnormal case rather than
reading the millions of records of healthy people [9]. In this
scenario, more effort has been put into the development of
infrequent and non-present pattern mining [12].
2. FREQUENT, INFREQUENT AND NON-PRESENT ITEM SETS
Frequent, infrequent and non-present item set mining
are special cases of pattern mining. The problem of frequent
item set mining consists of looking only for patterns with
support at least equal to a fixed threshold. In
infrequent item set mining, however, we look exclusively for
non-common patterns: patterns that are present in the data
transactions with a frequency smaller than a fixed threshold. In
the case of non-present pattern mining, we are only interested
in patterns that are not present [1].
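The three cases above can be illustrated with a minimal Python sketch. It uses the six-transaction data base from the worked example in Section 4; the function names `support_count` and `classify` and the threshold value 3 are assumptions made for illustration and do not come from the paper.

```python
from typing import Dict, FrozenSet, Iterable

# Transactional data base from the worked example in Section 4 (TID -> items).
DATABASE: Dict[int, FrozenSet[int]] = {
    10: frozenset({1, 3, 4, 7}),
    20: frozenset({2, 3, 5, 6, 8}),
    30: frozenset({1, 2, 3, 5, 7}),
    40: frozenset({2, 5, 8}),
    50: frozenset({1, 3, 7, 8}),
    60: frozenset({2, 3, 7}),
}


def support_count(item_set: Iterable[int], database: Dict[int, FrozenSet[int]]) -> int:
    """Number of transactions that contain every item of item_set."""
    items = frozenset(item_set)
    return sum(1 for transaction in database.values() if items <= transaction)


def classify(item_set: Iterable[int], database: Dict[int, FrozenSet[int]],
             min_support: int) -> str:
    """Label an item set as frequent, infrequent or non-present.

    min_support is a hypothetical absolute support threshold.
    """
    count = support_count(item_set, database)
    if count == 0:
        return "non-present"   # occurs in no transaction at all
    if count < min_support:
        return "infrequent"    # present, but below the threshold
    return "frequent"          # support at least equal to the threshold


print(classify({3, 7}, DATABASE, min_support=3))  # frequent
print(classify({1, 2}, DATABASE, min_support=3))  # infrequent
print(classify({1, 6}, DATABASE, min_support=3))  # non-present
```

In this sketch a non-present item set is kept apart from the infrequent ones, which follows the wording of this section; whether non-present sets should also count as infrequent is a design choice.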
Mining infrequent patterns is a challenging endeavor
because there are enormous numbers of such patterns that can
be derived from a given data set. More specifically, the key
issues in mining infrequent patterns are:
(1) How to identify interesting infrequent patterns, and
(2) How to efficiently discover them in large data sets.
3. RELATED WORK
The problem of mining frequent item sets was first
introduced by Agrawal et al. [2], who proposed the Apriori
algorithm. Mehdi Adda et al. [1] describe a framework
for rare and non-present pattern mining. Jiawei Han et al. [3]
discuss frequent pattern mining and its future
directions. Ashish Gupta et al. proposed algorithms for
Minimally Infrequent Itemset Mining using Pattern-Growth
Paradigm and Residual Trees. Alex Tze Hiang Sim et al.
discuss Mining Infrequent and Interesting Rules from
Transaction Records. Laszlo Szathmary et al. describe
Generating Rare Association Rules Using the Minimal Rare
Item sets Family. Research on infrequent and non-present
item set mining is not restricted to the works above. However,
these papers do not address infrequent and non-present item
set mining alone; they combine it with frequent pattern mining.
4. PROPOSED METHOD
In the general Apriori algorithm the frequent item sets are
found by generating the candidate item sets based on minimum
support threshold. In that process the frequent item sets are
considered while discarding the infrequent item sets.
Since mining infrequent item sets is also very
important in many applications, such as the medical field, fraud
detection and anomaly detection, our proposed algorithm
finds only infrequent item sets and non-present item sets.
For finding the infrequent item sets a support threshold has to be
considered; here a minimum negative count is used, that is, a lower
bound on the number of transactions that do not contain the item set.
For non-present item sets it is not necessary to consider the
minimum support threshold.
Table I
Infrequent 1-Item sets

Data base
TID   ITEM SET
10    1,3,4,7
20    2,3,5,6,8
30    1,2,3,5,7
40    2,5,8
50    1,3,7,8
60    2,3,7

Infrequent 1-item sets
ITEM  INFREQUENCY  TID
1     3            20,40,60
2     2            10,50
3     1            40
4     5            20,30,40,50,60
5     3            10,50,60
6     5            10,30,40,50,60
7     2            20,40
8     3            10,30,60
Under the support framework, let the minimum support threshold,
that is the minimum negative count, be 3. In the above example,
items 1, 4, 5, 6 and 8 are then considered infrequent, because each
of them is absent from at least 3 transactions.
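The infrequency column of Table I can be recomputed with a short Python sketch. The function name `absent_tids` is hypothetical; the data base and the minimum negative count of 3 are taken from the example above.

```python
from typing import Dict, FrozenSet, List

# Transactional data base from Table I (TID -> items).
DATABASE: Dict[int, FrozenSet[int]] = {
    10: frozenset({1, 3, 4, 7}),
    20: frozenset({2, 3, 5, 6, 8}),
    30: frozenset({1, 2, 3, 5, 7}),
    40: frozenset({2, 5, 8}),
    50: frozenset({1, 3, 7, 8}),
    60: frozenset({2, 3, 7}),
}

MIN_NEGATIVE_COUNT = 3  # threshold used in the example


def absent_tids(database: Dict[int, FrozenSet[int]]) -> Dict[int, List[int]]:
    """Map each item to the sorted TIDs of the transactions that lack it."""
    all_items = sorted(set().union(*database.values()))
    return {
        item: [tid for tid in sorted(database) if item not in database[tid]]
        for item in all_items
    }


absent = absent_tids(DATABASE)
# An item is infrequent when it is missing from at least MIN_NEGATIVE_COUNT transactions.
infrequent_items = [item for item, tids in absent.items()
                    if len(tids) >= MIN_NEGATIVE_COUNT]

for item, tids in absent.items():
    print(item, len(tids), tids)
print(infrequent_items)  # [1, 4, 5, 6, 8]
```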
The general Apriori algorithm finds frequent
patterns and discards infrequent ones, so it applies a
pruning strategy based on the Apriori principle, i.e. if an
item set is found to be infrequent then all its supersets are also
infrequent.
The Apriori property also applies in the proposed algorithm
for finding infrequent and non-present item sets:
if an item set is found to be infrequent then all its supersets
are infrequent. Here, however, these supersets are not
pruned as in Apriori; they are included in the
solution as infrequent k-item sets. The frequent item sets cannot
be pruned either, since the supersets of a frequent item set
may or may not be frequent. So it is necessary to go
through all possible 1-item, 2-item, ... sets to find all possible
infrequent item sets.
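The reason frequent item sets cannot be pruned can be checked on the example data base with a small Python sketch. Items 2, 3 and 7 are not infrequent under a minimum negative count of 3, yet some of their 2-item supersets are. The helper name `infrequency` is an assumption made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

# Transactional data base from Table I (TID -> items).
DATABASE: Dict[int, FrozenSet[int]] = {
    10: frozenset({1, 3, 4, 7}),
    20: frozenset({2, 3, 5, 6, 8}),
    30: frozenset({1, 2, 3, 5, 7}),
    40: frozenset({2, 5, 8}),
    50: frozenset({1, 3, 7, 8}),
    60: frozenset({2, 3, 7}),
}

MIN_NEGATIVE_COUNT = 3


def infrequency(item_set: Iterable[int], database: Dict[int, FrozenSet[int]]) -> int:
    """Number of transactions that do NOT contain the whole item set."""
    items = frozenset(item_set)
    return sum(1 for transaction in database.values() if not items <= transaction)


frequent_items = [2, 3, 7]  # infrequency 2, 1 and 2, all below the threshold
pair_infrequency = {
    pair: infrequency(pair, DATABASE) for pair in combinations(frequent_items, 2)
}

for pair, count in pair_infrequency.items():
    label = "infrequent" if count >= MIN_NEGATIVE_COUNT else "not infrequent"
    print(pair, count, label)
# (2, 3) 3 infrequent
# (2, 7) 4 infrequent
# (3, 7) 2 not infrequent
```

The infrequency of a superset is never smaller than that of any of its subsets, which is the Apriori property stated above in terms of negative counts.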
The non-present item sets are found directly from the
infrequent 1-item sets. For example, consider the 2-item set {1,
6}. From the above infrequent 1-item set table it
is clear that item 1 is not present in TIDs 20, 40, 60 and
item 6 is not present in TIDs 10, 30, 40, 50, 60. Taking the
union of these TIDs shows that the 2-item set {1, 6} is not present
in any of the TIDs 10, 20, 30, 40, 50, 60. So the set {1, 6} is a
non-present item set. By following the same procedure all the
infrequent and non-present item sets are found.
Table II
Infrequent 2-Item sets corresponding to item 1
ITEM SET  INFREQUENCY  TIDS
(1,2)     5            10,20,40,50,60
(1,3)     3            20,40,60
(1,4)     5            20,30,40,50,60
(1,5)     5            10,20,40,50,60
(1,6)     6            10,20,30,40,50,60
(1,7)     3            20,40,60
(1,8)     5            10,20,30,40,60
In the above table it is clear that the item set (1, 6) is a
non-present item set and the remaining are infrequent item
sets. These infrequent item sets are found from the infrequent
1-item set table.
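The rows of Table II can be derived from the absent-TID lists of the 1-item sets alone, by taking unions, as in the following Python sketch. The names `absent_tid_sets` and `pairs_with` are hypothetical helpers, not part of the paper.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Transactional data base from Table I (TID -> items).
DATABASE: Dict[int, FrozenSet[int]] = {
    10: frozenset({1, 3, 4, 7}),
    20: frozenset({2, 3, 5, 6, 8}),
    30: frozenset({1, 2, 3, 5, 7}),
    40: frozenset({2, 5, 8}),
    50: frozenset({1, 3, 7, 8}),
    60: frozenset({2, 3, 7}),
}

Row = Tuple[Tuple[int, int], int, List[int]]


def absent_tid_sets(database: Dict[int, FrozenSet[int]]) -> Dict[int, Set[int]]:
    """Single pass over the data base: item -> set of TIDs that lack the item."""
    all_items = set().union(*database.values())
    return {
        item: {tid for tid, transaction in database.items() if item not in transaction}
        for item in all_items
    }


def pairs_with(first: int, absent: Dict[int, Set[int]]) -> List[Row]:
    """Rows (pair, infrequency, TIDs) for every 2-item set (first, other), other > first."""
    rows: List[Row] = []
    for other in sorted(absent):
        if other > first:
            # A pair is missing from a transaction when either item is missing.
            tids = sorted(absent[first] | absent[other])
            rows.append(((first, other), len(tids), tids))
    return rows


ABSENT = absent_tid_sets(DATABASE)
TABLE_II = pairs_with(1, ABSENT)

for pair, count, tids in TABLE_II:
    note = "non-present" if count == len(DATABASE) else ""
    print(pair, count, tids, note)
```

Only the pair (1, 6) reaches an infrequency equal to the number of transactions, so it is the only row marked as non-present.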
Table III
Infrequent 2-Item sets
ITEM SET  INFREQUENCY  TIDS
(2,3)     3            10,40,50
(2,4)     6            10,20,30,40,50,60
(2,5)     3            10,50,60
(2,6)     5            10,30,40,50,60
(2,7)     4            10,20,40,50
(2,8)     4            10,30,50,60

Table IV
Infrequent 2-Item sets
ITEM SET  INFREQUENCY  TIDS
(3,4)     5            20,30,40,50,60
(3,5)     4            10,40,50,60
(3,6)     5            10,30,40,50,60
(3,7)     2            20,40
(3,8)     4            10,30,40,60

Table V
Infrequent 2-Item sets
ITEM SET  INFREQUENCY  TIDS
(4,5)     6            10,20,30,40,50,60
(4,6)     6            10,20,30,40,50,60
(4,7)     5            20,30,40,50,60
(4,8)     6            10,20,30,40,50,60

Table VI
Infrequent 2-Item sets
ITEM SET  INFREQUENCY  TIDS
(5,6)     5            10,30,40,50,60
(5,7)     5            10,20,40,50,60
(5,8)     4            10,30,50,60

Table VII
Infrequent 2-Item sets
ITEM SET  INFREQUENCY  TIDS
(6,7)     6            10,20,30,40,50,60
(6,8)     5            10,30,40,50,60

Table VIII
Infrequent 2-Item sets
ITEM SET  INFREQUENCY  TIDS
(7,8)     5            10,20,30,40,60
The remaining k-item sets can be calculated by the
same procedure.
4.1. Algorithm:
Input: A transactional data base, minimum negative count.
Output: Infrequent k-item sets and non-present item sets.
Method:
1. start
2. scan the given transactional data base
3. for (k = 1 to n) do // n is the max no. of items
4.   for (each k-item set i) do
5.     for (each transaction T) do
6.       if (i is not contained in T)
7.         record the TID of T for i;
8.     end for // transactions
9.   end for // k-item sets
10. end for // k
11. for (each k-item set i) do
12.   count the number of TIDs recorded for i // count gives the infrequency
13.   if (count = total number of transactions)
14.     report i as a non-present item set;
15.   else if (count >= minimum negative count)
16.     report i as an infrequent k-item set;
17. end for
18. end
The algorithm above gives the pseudo code for the
process described. A practical implementation is not yet
complete; it is in progress.
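As an illustration only, the pseudo code can be sketched in Python as follows. This is not the authors' implementation: the function name `mine_infrequent_and_non_present` and the optional `max_size` parameter are assumptions, and the sketch checks for a non-present item set before applying the minimum negative count so that the two outputs stay disjoint.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Tuple

ItemSet = Tuple[int, ...]

# Transactional data base from Table I (TID -> items).
DATABASE: Dict[int, FrozenSet[int]] = {
    10: frozenset({1, 3, 4, 7}),
    20: frozenset({2, 3, 5, 6, 8}),
    30: frozenset({1, 2, 3, 5, 7}),
    40: frozenset({2, 5, 8}),
    50: frozenset({1, 3, 7, 8}),
    60: frozenset({2, 3, 7}),
}


def mine_infrequent_and_non_present(
    database: Dict[int, FrozenSet[int]],
    min_negative_count: int,
    max_size: Optional[int] = None,  # hypothetical cap on k; None means all sizes
) -> Tuple[Dict[ItemSet, List[int]], Dict[ItemSet, List[int]]]:
    """Return (infrequent, non_present); each maps an item set to the TIDs lacking it."""
    all_items = sorted(set().union(*database.values()))
    largest_k = len(all_items) if max_size is None else max_size
    total = len(database)
    infrequent: Dict[ItemSet, List[int]] = {}
    non_present: Dict[ItemSet, List[int]] = {}

    for k in range(1, largest_k + 1):                 # k = 1 .. n
        for item_set in combinations(all_items, k):   # every k-item set, no pruning
            items = frozenset(item_set)
            # Record the TID of each transaction that does not contain the item set.
            missing = [tid for tid in sorted(database) if not items <= database[tid]]
            count = len(missing)                      # the infrequency
            if count == total:
                non_present[item_set] = missing
            elif count >= min_negative_count:
                infrequent[item_set] = missing
    return infrequent, non_present


infrequent, non_present = mine_infrequent_and_non_present(
    DATABASE, min_negative_count=3, max_size=2
)
print(sorted(s for s in infrequent if len(s) == 1))  # [(1,), (4,), (5,), (6,), (8,)]
print(sorted(non_present))
# [(1, 6), (2, 4), (4, 5), (4, 6), (4, 8), (6, 7)]
```

Because no pruning is applied, the number of candidate item sets grows exponentially with the number of items, which is why the usage example limits k to 2.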
5. CONCLUSIONS
Since the late 1990s, more and more researchers have realized
the importance of infrequent patterns with the increasing
demands from applications of anomaly detection, especially in
medicine [6], genetics [10], molecular biology [7] and network
security [11]. In these areas, infrequent patterns are considered
significant due to the huge influence they may have. In the
study of finding a better treatment approach for a special
disease, researchers would like to spend more time on studying
an abnormal case rather than reading the millions of records of
healthy people [9]. In this scenario, more effort has been put
into the development of infrequent and non-present pattern mining [12].
The proposed algorithm finds the infrequent
patterns and non-present patterns within one data base scan:
once the infrequent 1-item sets have been found, all the
remaining infrequent item sets and non-present item sets can be
derived from them.
6. FUTURE WORK
The proposed algorithm finds all the
infrequent item sets and non-present item sets within one data
base scan. However, the proposed method does not use any pruning
strategy. Adding a pruning strategy would
reduce the complexity of the proposed method.
REFERENCES
[1] A. Sehgal, A. Presente, L. Dudus, and J.F. Engelhardt. Isolation of differentially expressed DNAs during ferret tracheal development: Application of differential display PCR. Experimental Lung Research, 22(4):419–434, 1996.
[2] Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM-SIGMOD international conference on management of data (SIGMOD'93), Washington, DC, pp 207–216.
[3] B. Saha, M. Lazarescu, and S. Venkatesh. Infrequent item mining in multiple data streams. In Proceedings of IEEE International Conference on Data Mining (ICDM 2007), pages 569–574, Omaha, NE, October 2007.
[4] D. J. Haglin and A. M. Manning. On minimal infrequent itemset mining. In DMIN, pages 141–147, 2007.
[5] F. Medici, M.I. Hawa, A. Giorgini, A. Panelo, C.M. Solfelix, R.D.G. Leslie, and P. Pozzilli. Antibodies to GAD65 and a tyrosine phosphatase-like molecule IA-2ic in Filipino Type I diabetic patients. Diabetes Care, 22(9):1458–1461, 1999.
[6] Infrequent Patterns, Chapter 5, Introduction to data mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson Education, book.
[7] Introduction to data mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Pearson Education, book.
[8] J. Srivastava and R. Cooley. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1:12–23, 2000.
[9] J. Yang and J. Logan. A data mining and survey study on diseases associated with paraesophageal hernia. In AMIA Annual Symposium Proceedings, pages 829–833, 2006.
[10] Jiawei Han, Hong Cheng, Dong Xin, Xifeng Yan. "Frequent pattern mining: current status and future directions". Data Min Knowl Disc (2007) 15:55–86, DOI 10.1007/s10618-006-0059-1.
[11] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cube. SIGMOD Rec., 28:359–370, 1999.
[12] Mehdi Adda, Lei Wu, Sharon White, Yi Feng (2012). "Pattern detection with rare item-set mining". International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.1, No.1, August 2012.
[13] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[14] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 207–216, Washington DC, 1993. ACM Press.
[15] W. Shi, F.K. Ngok, and D.R. Zusman. Cell density regulates cellular reversal frequency in Myxococcus xanthus. In Proceedings of the National Academy of Sciences of the United States of America, volume 93(9), pages 4142–4146, 1996.
[16] X. Dong, Z. Niu, X. Shi, X. Zhang, and D. Zhu. Mining both positive and negative association rules from frequent and infrequent itemsets. In ADMA, pages 122–133, 2007.
[17] X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3):381–405, 2004.
K. Sujatha is pursuing her Ph.D. at Krishna University, Machilipatnam,
A.P. She is interested in research in data mining and is presently
carrying out her research in Infrequent Pattern Mining. She has seven
international journal publications with high impact factors and indexing,
and two national journal publications in peer-refereed journals. She has
presented papers at two national conferences and has attended several
workshops and faculty development programs. Recently she attended
two AICTE-sponsored 2-week workshops. She is a member of the Indian
Association of Engineers (IAE). She has a total of 10 years of experience in
teaching. She is working as an Associate Professor in the CSE Department,
Malineni Lakshmaiah Engineering College, Singaraya konda, Prakasam
District, A.P.
Prof. K. Rajasekhara Rao is Director of Sri Prakash College of
Engineering, Thuni, East Godavari District, Andhra Pradesh. He has held
several key positions in K.L.University, as Dean (Administration) &
Principal, K L College of Engineering (Autonomous). Having more than
26 years of teaching and research experience, Prof. Rao is actively
engaged in research related to Embedded Systems, Software
Engineering and Knowledge Management. He obtained his Ph.D in
Computer Science & Engineering from Acharya Nagarjuna University
(ANU), Guntur, Andhra Pradesh and has produced 58 publications in various
International/National Journals and Conferences. Prof. KRR was awarded
the "Patron Award" for his outstanding contribution by India's
prestigious professional society, the Computer Society of India (CSI), for the
year 2011 in Ahemadabad. He has been adjudged best teacher and has
been honored with the "Best Teacher Award" seven times. Dr. Rajasekhar is
a Fellow of IETE and a Life Member of IE, ISTE, ISCA & CSI (Computer
Society of India). Dr. Rajasekhar has been nominated as sectional committee
member for Engineering Sciences of the 100th Annual Convention of the Indian
Science Congress Association. He is a past Chairman of the
Koneru Chapter of CSI.
Dr. Y. K. Sundara Krishna obtained his Ph.D. in Computer Science and
Engineering from Osmania University, Hyderabad. He is now working as
Professor in the Department of Computer Science, Krishna University,
Machilipatnam, and presently holds several key positions in Krishna
University. His research interests are Mobile Computing, Service Oriented
Architecture and Geographical Information Systems, and he has practical
work experience in the areas of Computing Systems, including Developing
Simulators for Distributed Dynamic Cellular Computing Systems,
Applications of Embedded and Win32 clients, and Maintenance of Multi-user
System Software. He has about 24 international publications and about 2
national publications. He has attended about 5 international conferences and
25 national conferences.