Journal of Information Science
http://jis.sagepub.com/
MapReduce-based web mining for prediction of web-user navigation
Meijing Li, Xiuming Yu and Keun Ho Ryu
Journal of Information Science published online 29 July 2014
DOI: 10.1177/0165551514544096
The online version of this article can be found at:
http://jis.sagepub.com/content/early/2014/07/18/0165551514544096
A more recent version of this article was published on 12 September 2014
Published by:
http://www.sagepublications.com
On behalf of:
Chartered Institute of Library and Information Professionals
Additional services and information for Journal of Information Science can be found at:
Email Alerts: http://jis.sagepub.com/cgi/alerts
Subscriptions: http://jis.sagepub.com/subscriptions
Reprints: http://www.sagepub.com/journalsReprints.nav
Permissions: http://www.sagepub.com/journalsPermissions.nav
Version of Record - Sep 12, 2014
>> OnlineFirst Version of Record - Jul 29, 2014
Downloaded from jis.sagepub.com at CHUNGBUK NATIONAL UNIV LIB on October 30, 2014
Article
MapReduce-based web mining for
prediction of web-user navigation
Journal of Information Science
1–11
© The Author(s) 2014
Reprints and permissions:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165551514544096
jis.sagepub.com
Meijing Li
Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea
Xiuming Yu
Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea
Keun Ho Ryu
Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea
Abstract
Predicting web-user behaviour is typically an application of finding frequent sequence patterns. With the rapid growth of the Internet, a large amount of information is stored in web logs. Traditional frequent-sequence-pattern-mining algorithms struggle to analyse such big datasets. In this paper, we propose an efficient way to predict the navigation patterns of web users by improving frequent-sequence-pattern-mining algorithms based on the MapReduce programming model, which can handle huge datasets efficiently. In our experiments, we show that our proposed MapReduce-based algorithm is more efficient than traditional frequent-sequence-pattern-mining algorithms, and by comparing our proposed algorithms with existing algorithms in web-usage mining, we also show that using the MapReduce programming model saves time.
Keywords
Frequent sequence patterns; MapReduce; web-usage mining; web user behaviour
1. Introduction
Data mining is a technology for analysing data to obtain valuable information. The web has become very popular and is
widely used. With the rapid growth in the number of transactions on the web, a large amount of data is automatically
gathered by web servers. A vast amount of useful information is hidden in these recorded web data. Analysis of this web
data has become a challenge in the field of data mining.
Web mining [1] is aimed at finding interesting and valuable patterns in web data. According to the kind of web data involved, there are three categories of web mining: web-content mining, which extracts useful information and knowledge from web page contents, such as text and multimedia; web-structure mining, which discovers useful knowledge from the structure of hyperlinks; and web-usage mining, which finds user access patterns in web log files. In this paper, the groundwork of our proposed approach is finding information about web users' behaviour, so web-usage mining is the theme of our research. Recently, many studies on web-usage mining have been proposed [2–6],
such as the Apriori algorithm [7] and the FP-Growth algorithm [8]. The Apriori algorithm is the most classic and most widely used algorithm for finding frequent-sequence patterns, but because it has to scan the dataset many times and generates a large number of candidate sequences, its cost becomes prohibitive when the dataset is large. Because the resources in our work environment are limited, applying the Apriori algorithm to a huge dataset results in poor performance. Compared with the Apriori algorithm, the FP-Growth algorithm is a better algorithm for finding frequent-sequence patterns; it only needs to scan the dataset twice. The first scan computes a list (F-list), and the second scan is to
Corresponding author:
Keun Ho Ryu, Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea.
Email: khryu@dblab.chungbuk.ac.kr
compress the dataset into an FP-tree. However, when the dataset is huge, recursively searching and constructing the FP-tree during that second scan is also very expensive.
Hadoop-MapReduce [9, 10] is a programming model and an associated implementation for processing and generating
large datasets in parallel [11]. It frees users from the tedious details of coordinating parallel sub-tasks and maintaining distributed file storage, which greatly increases productivity when processing large amounts of data in parallel. A MapReduce program consists of two parts: a map function,
which processes a key/value pair to generate a set of intermediate key/value pairs; and a reduce function, which merges
all of the intermediate values associated with the same intermediate key [12].
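To make this contract concrete, the map/shuffle/reduce dataflow described above can be simulated sequentially in plain Python (a sketch of the programming model only, not of Hadoop's distributed implementation; the page-hit example data are hypothetical):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential simulation of the MapReduce dataflow:
    map -> shuffle (group by intermediate key) -> reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map emits intermediate pairs
            groups[key].append(value)       # shuffle groups them by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word-count-style example: count page hits per URL.
logs = ["/index", "/news", "/index", "/index"]
result = run_mapreduce(
    logs,
    map_fn=lambda url: [(url, 1)],
    reduce_fn=lambda key, values: sum(values),
)
print(result)  # {'/index': 3, '/news': 1}
```

In a real Hadoop job the shuffle and the calls to the two functions are distributed across nodes; only the two user-supplied functions change between problems.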
Given the above problems, many solutions have been presented that improve the existing algorithms using Hadoop-MapReduce, such as the MR-Apriori algorithm, which improves the Apriori algorithm based on the Hadoop-MapReduce programming model, and the MR-FPGrowth algorithm, which improves the FP-Growth algorithm in the same way. In this paper, we apply these improved frequent-sequence-pattern-mining algorithms to web-usage mining, but it is impossible to apply them to web log data directly. Moreover, when the dataset is huge, there is still room to improve the algorithms. In our proposed approach, web log data is transformed into sequence data in parallel based on MapReduce, and the current parallel implementations of the frequent-sequence algorithms are improved by reducing the experimental dataset size, also based on MapReduce, before applying the algorithms.
The rest of this paper is organized as follows. Related work is discussed in Section 2. The problems are stated and illustrated in Section 3. In Section 4, the details of our proposed approach are presented. The experiment and evaluations are
discussed in Section 5. Finally, Section 6 concludes the paper.
2. Related work
As mentioned in the previous section, many algorithms have been proposed for sequence-pattern mining. The two classic algorithms are the Apriori [7] and FP-Growth [8] algorithms. Subsequently, parallel Apriori [13–16] and FP-Growth [17] algorithms based on the Hadoop-MapReduce model were proposed, because the performance of the traditional Apriori and FP-Growth algorithms is poor, especially when dealing with huge datasets.
Although the MRApriori [16] and Parallel FP-Growth (PFP) [17] algorithms perform better than traditional sequence-pattern-mining algorithms on relatively large datasets, they are still underpowered when mining very large web log files. This paper proposes an improved implementation of the MRApriori and PFP algorithms by adopting a large page set (LPS) approach based on the Hadoop-MapReduce model, called the LPS-MRApriori and LPS-PFP algorithms. In our experiments, the proposed approach is compared with the existing sequence-pattern-mining algorithms (MRApriori and PFP), and we conclude that our proposed approach for web-usage mining outperforms them.
3. Problem statement
This section states the problems addressed in this paper. To predict user navigation from a huge web log dataset, two sub-problems need to be solved: transforming the huge web log dataset into sequence data, and enhancing the efficiency of the current state-of-the-art frequent-sequence-pattern-mining algorithms.
An access record in a web log file generally consists of many parameters, each with a different meaning. Figure 1 shows a common web log format.
Web log data in its raw format is not fit to be used directly for the data-mining task. The data in the web log file must be transformed into sequence data, which can be done on the basis of user sessions.
Figure 1. Common log format for web log data.
Definition 1: A user session is a period of activity by one web user, identified by a unique IP address or user name, on a web site during a specified period of time. For example, if a web site expires a session after more than 10 min without any operation, so that the user cannot continue without starting a new session, then the user-session length of this web site is 10 min.
Definition 2: A large page set is a set of frequent web pages, where a web page is frequent if its support is greater than or equal to a user-specified minimum support threshold minsup_lps. The support of a web page is defined as shown in equation (1):

support(p_i) = N_i / N    (1)
where N denotes the number of user sessions, support(p_i) denotes the support of web page p_i, and N_i denotes the number of the N user sessions in which web page p_i is accessed. For example, assume that there are 100 user sessions and the first web page p_1 is accessed in 80 of them; the support of this web page is then:

support(p_1) = 80 / 100 = 80%.
We also assume that the minimum support threshold minsup_lps is 70%; the first web page is then a large web page because its support is greater than minsup_lps.
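Equation (1) can be checked with a few lines of Python (the session data here are hypothetical, chosen to reproduce the 80-out-of-100 example):

```python
def support(page, sessions):
    """Equation (1): fraction of the N user sessions in which `page` appears."""
    n_i = sum(1 for s in sessions if page in s)
    return n_i / len(sessions)

# Hypothetical data: page p1 appears in 80 of 100 sessions -> support 80%.
sessions = [{"p1", "p2"}] * 80 + [{"p2"}] * 20
print(support("p1", sessions))  # 0.8
```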
4. Proposed approach for web-usage mining
In this paper, we apply an improved frequent sequence pattern-mining algorithm to web-usage mining to predict web
user navigation. Our approach consists of two phases: data preprocessing and frequent-sequence-pattern mining. In the
data preprocessing phase, the web log file is processed by removing irrelevant attributes, transforming URLs into code numbers and removing records with missing values, to get clean data for our mining task. In the second phase, the
web log file resulting from the first phase is transformed into sequence data on the basis of user session using the
MapReduce programming model. Then, in order to reduce the dataset size, a large web page set is generated based on
the MapReduce programming model. Finally, existing parallel frequent-sequence-pattern-mining algorithms are applied
to the reduced dataset to obtain access patterns. Figure 2 shows the work flow of our approach.
Figure 2. Work flow of our approach.
4.1. Data preprocessing
Web log data is automatically recorded in web log files on web servers when web users access the server through their browsers. Not all of the records stored in the web log files are valid or necessary for the mining task, so before analysing the web log data, a data-cleaning phase needs to be carried out.
4.1.1. Removing records with missing values. Some of the records stored in the web log file are incomplete because
some of their parameters are lost. For example, if a click-through to a web page was executed while the web
server was shut down, then only the IP address, user ID and access time are recorded in the log file; the method, URL,
referrer and agent will be lost. This kind of record is invalid for our mining task, so these records must be removed.
4.1.2. Removing illegal records with exception status numbers. Some illegal records are caused by errors in the requests or by
the server. Although the records are intact, the activity does not execute normally. For example, records with the status
numbers 400 or 404 are caused by HTTP client errors, bad requests or a request not found. Records with status numbers
500 or 505 are caused by HTTP server errors, when the internal server cannot connect, or when the HTTP version is not
supported. These kinds of data are illegal for our task, so the records must be removed.
4.1.3. Removing irrelevant records with no significant URLs. Some URLs in the records have .txt, .jpg, .gif or .js extensions; such requests are generated automatically while a web page loads. These records are irrelevant to our mining task, so they must be removed.
4.1.4. Selecting the essential attributes. As shown in the common log format of web log data in Figure 1, there are many
attributes in one record, but for web-usage mining, not all the attributes are necessary. In this paper, the attributes for IP
address, time and URL are essential attributes to our task, so they should remain but the rest of the attributes should be
discarded.
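The four cleaning steps above can be sketched as a single filtering pass. This is a hypothetical illustration: the regular expression, field names and the exact status list are assumptions based on the Common Log Format, not the paper's code (a real cleaner might drop all 4xx/5xx statuses):

```python
import re

# Hypothetical cleaning pass over Common Log Format lines (Section 4.1).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
SKIP_EXT = (".txt", ".jpg", ".gif", ".js")      # 4.1.3: irrelevant URLs
BAD_STATUS = {"400", "404", "500", "505"}       # 4.1.2: exception statuses

def clean(lines):
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:                           # 4.1.1: missing-value record
            continue
        if m.group("status") in BAD_STATUS:     # 4.1.2: error status
            continue
        if m.group("url").lower().endswith(SKIP_EXT):  # 4.1.3
            continue
        # 4.1.4: keep only the essential attributes (IP, time, URL)
        yield m.group("ip"), m.group("time"), m.group("url")

logs = [
    '1.2.3.4 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    '1.2.3.4 - - [01/Jul/1995:00:00:02 -0400] "GET /images/logo.gif HTTP/1.0" 200 786',
    '1.2.3.4 - - [01/Jul/1995:00:00:03 -0400] "GET /missing HTTP/1.0" 404 0',
]
print(list(clean(logs)))  # only the first record survives
```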
4.2. Frequent sequence-pattern mining
In this section, an efficient approach to finding access patterns is presented. Working from the web log file, the web log data is first transformed into sequence data. Then, based on MapReduce, a large page set is generated to reduce the size of the dataset. Finally, two existing parallel frequent-sequence-pattern-mining algorithms, MRApriori and PFP, are applied to the reduced sequence dataset to obtain access patterns.
4.2.1. Generation of sequence data. The challenge in generating sequence data from web log files is how to identify a single activity (transaction) among the accesses to a website. We identify an activity based on the definition of user session (Definition 1). Some websites specify their own user sessions; for example, for security-sensitive websites, the specified period of a user session is shorter, such as 10 min, while for general websites it is longer, often set at 1 h. Some websites do not define a user session at all.
In this paper, we define the specified period of a user session as 1 h, which means that all the access records of one
web user with a unique IP address within 1 h can be considered a transaction. All the URLs in a transaction can be sorted
as one sequence. To obtain all the transactions that are transformed from the web log data, we first sort the preprocessed
data by IP address and time. Then, we group the records by splitting the records into hours for each IP address, based on
the specified period of a user session. Then, all the URLs requested in one user session and by the same user are sorted
into one transaction based on the MapReduce programming model. In the MapReduce map function, the attributes IP
and Time are defined as the intermediate key; they are connected by a colon, and the URL is defined as the intermediate
value. The process of the map function can be defined as map(IP:Time, URL) → list(IP:Time, URL), and an intermediate
key/value pair will be generated. In the reduce function for MapReduce, all of the intermediate values associated with
the same intermediate key will be merged, and the reduce function can be written as reduce(IP:Time, list(URL)) → (IP:Time, URLs). For example, in Figure 3, the data in the left table serve as input to the map function; the first two records are processed as map(82.117.202.158:1, 4) and map(82.117.202.158:1, 1), which generates the list {(82.117.202.158:1, 4), (82.117.202.158:1, 1)} as output from the map function. In the reduce function, ‘82.117.202.158:1’ is defined as an intermediate key, and the input to the reduce function can be written as reduce(82.117.202.158:1, {4, 1}), in which {4, 1} is the list(URL). Then, merging the intermediate records with the same key value ‘82.117.202.158:1’, we get the output from the reduce function: (82.117.202.158:1, 4:1). The final sequence data for the sample data in the left table is shown in the right table of Figure 3.

Figure 3. Generation of sequence data.
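A sequential sketch of this map/reduce pass follows; the (IP, hour, URL) record layout and the sample values are assumptions for illustration, not the paper's Hadoop code:

```python
from collections import defaultdict

# Sketch of Section 4.2.1: group cleaned records into sessions keyed by
# "IP:hour-bucket" and concatenate the URLs into one sequence.
def map_fn(record):
    ip, hour, url = record
    return f"{ip}:{hour}", url            # intermediate (key, value) pair

def to_sequences(records):
    groups = defaultdict(list)
    for record in records:                # map + shuffle
        key, url = map_fn(record)
        groups[key].append(url)
    # reduce: merge all URLs that share an IP:hour key into one sequence
    return {key: ":".join(urls) for key, urls in groups.items()}

records = [
    ("82.117.202.158", 1, "4"),
    ("82.117.202.158", 1, "1"),
    ("82.117.202.158", 2, "7"),
]
print(to_sequences(records))
# {'82.117.202.158:1': '4:1', '82.117.202.158:2': '7'}
```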
4.2.2. Generation of large page set. A large page set is the frequent web page set defined in Definition 2. In the above section, the web log file is converted into sequence data, and an LPS is generated from sequence data based on the
MapReduce programming model. In the map function, the attributes IP and Event are defined as the
intermediate key (connected by a colon), and ‘1’ is defined as the intermediate value. The process of the map
function can be defined as map(IP:event, 1) → list(IP:event, 1), and intermediate key/value pairs will be generated. In
the reduce function, all the intermediate values associated with the same intermediate key will be
merged, and the reduce function can be written as reduce(event, list(1)) → (event, sum). As shown in the
sequence data in the left table in Figure 4, there are four IP addresses, each IP address denotes a web user, and the first
sequence data can be processed as map(82.117.202.158:4, 1) and map(82.117.202.158:1, 1) for input to the map function; the fourth sequence of data can be processed as map(82.117.202.158:4, 1) for input to the map function. For the
output of the map function, we merge the sequence data with the same intermediate key. For example, for the first web
user, the sequence data can be merged as {(82.117.202.158:1, 1), (82.117.202.158:2, 1), (82.117.202.158:4, 1),
(82.117.202.158:6, 1), (82.117.202.158:7, 1)}, and each event occurs once in each IP address in the output from the map
function, although some events may occur more than once per IP address in the input to the map function. The pseudo-code of the map function is shown in Algorithm 1.
In the reduce function, we count the number of web user sessions in which each event occurs. For example, in
the first IP address, the input to the reduce function is set as reduce(1,1), reduce(2,1), reduce(4,1), reduce(6,1) and
reduce(7,1); in the second IP address, the input to the reduce function is set as reduce(2,1), reduce(4,1) and reduce(7,1);
in the third IP address, the input to the reduce function is set as reduce(2,1), reduce(6,1) and reduce(7,1); in the fourth IP
address, the input to the reduce function is set as reduce(3,1), reduce(4,1), reduce(5,1), reduce(6,1) and reduce(7,1). The
output from the reduce function is generated as shown in the right table in Figure 4. Here, we assume that the minimum
support threshold of the large page set minsup_lps is 75%, which is equivalent to a minimum support count equal to
three out of four web user sessions, and then we can get the large pages {2, 4, 6, 7} where the support value is greater
than, or equal to, the minimum support count 3. The pseudo-code of reduce function is shown in Algorithm 2.
Figure 4. Generation of the large page set.
Algorithm 1. Map(key, value)
Input: A set of sequences of web data D.
Output: <key′, value′>, where key′ is set to ‘IP:event’ and value′ equals 1.
1. for each sequence data Di in D
2.   key′.set(Di.IP + ‘:’ + Di.event);
3.   value′.set(1);
4.   context.write(key′, value′);
Algorithm 2. Reduce(key, value)
Input: A set of candidate <key′, value′> pairs generated by Algorithm 1, the user-specified minimum support threshold minsup_lps, and user_session, the number of web user sessions.
Output: <key″, value″>, where key″ is a large web page and value″ is the support count of key″.
1. for each key″ in context { // generated by Algorithm 1
2.   value″ = value″ + 1;
3. }
4. if (value″ / user_session >= minsup_lps) {
5.   context.write(key″, value″);
6. }
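Algorithms 1 and 2 can be simulated sequentially as follows (a sketch, not the paper's Hadoop code; the per-session deduplication here stands in for the per-IP merge performed in the map stage, and the session data are hypothetical, chosen to reproduce the {2, 4, 6, 7} example above):

```python
from collections import Counter

# Sequential sketch of Algorithms 1 and 2: count, for each event (page),
# the number of sessions in which it occurs, then keep the large pages.
def lps_map(sessions):
    """Map stage: emit each (event, 1) pair at most once per session."""
    for ip, events in sessions:
        for event in set(events):          # dedupe within a session
            yield event, 1

def lps_reduce(pairs, n_sessions, minsup_lps):
    """Reduce stage: sum the counts and keep pages meeting minsup_lps."""
    counts = Counter()
    for event, one in pairs:
        counts[event] += one
    return {e: c for e, c in counts.items() if c / n_sessions >= minsup_lps}

sessions = [
    ("82.117.202.158", ["4", "1", "2", "4", "6", "7"]),
    ("10.0.0.2", ["2", "4", "7"]),
    ("10.0.0.3", ["2", "6", "7"]),
    ("10.0.0.4", ["3", "4", "5", "6", "7"]),
]
large = lps_reduce(lps_map(sessions), len(sessions), 0.75)
print(sorted(large))  # ['2', '4', '6', '7']
```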
4.2.3. Generation of frequent-sequence patterns. After obtaining the large page set, the web pages that are visited frequently are known. We then apply parallel frequent-sequence-pattern algorithms to the sequence dataset after filtering out the infrequent events. In this paper, we improve the existing algorithms (MRApriori, the MapReduce-based Apriori algorithm, and PFP, the parallel FP-Growth algorithm) by first obtaining a large page set and then applying the proposed approach to the web log data. As shown in Figure 5, before finding frequent-sequence patterns, infrequent events in the sequence dataset are filtered out based on the large page set, which can greatly reduce the dataset size. In finding frequent-sequence patterns, we assume that the minimum support threshold of frequent sequence patterns minsup_fsp is 50%, which is equivalent to a minimum support count of two out of four web users; we then get the final frequent-sequence patterns {{6, 2}, {6, 7}} with support counts 2 and 3, respectively.
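This filter-then-mine step can be sketched as follows; counting ordered page pairs stands in for the full parallel miners (MRApriori/PFP), and the input sequences are hypothetical, chosen to reproduce the {6, 2} and {6, 7} example:

```python
from collections import Counter
from itertools import combinations

# Sketch of Section 4.2.3: prune each sequence to the large pages, then
# count length-2 subsequences (order preserved) that meet the threshold.
def prune(sequences, large_pages):
    return [[e for e in seq if e in large_pages] for seq in sequences]

def frequent_pairs(sequences, minsup_count):
    counts = Counter()
    for seq in sequences:
        pairs = set(combinations(seq, 2))   # ordered subsequences of length 2
        counts.update(pairs)                # count each pair once per sequence
    return {p: c for p, c in counts.items() if c >= minsup_count}

sequences = [["6", "2", "7"], ["6", "7"], ["6", "2"], ["3", "6", "7"]]
pruned = prune(sequences, {"2", "6", "7"})  # drop infrequent page "3"
print(sorted(frequent_pairs(pruned, 2).items()))
# [(('6', '2'), 2), (('6', '7'), 3)]
```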
Figure 5. Generation of frequent-sequence patterns.
5. Experiment and evaluation
We ran several experiments to evaluate the performance of the proposed algorithms. All experiments were performed in
Hadoop 1.0.3 on a personal computer with an Intel Core2 Duo E7500 CPU, 4 GB RAM, and a 480 GB hard disk running
the Windows operating system. The proposed algorithms were implemented in Java.
5.1. Experiment data
In the experiments, we used web log data from a NASA website (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html), which is cited in Chordia and Adhiya [18], because that web log is large-scale temporal data with time points. An event file can be generated from a web log file, including the URL requested, the HTTP method, the IP address from which the request originated and a timestamp. For example, the IP address in a web log record can be assigned to a customer ID, the HTTP method with the URL can be converted to an event type and the timestamp can be assigned to the event time.
Two web log files, NASA_access_log_Jul95 and NASA_access_log_Aug95, were used as our two experiment datasets. The first dataset was collected from 00:00:00 1 July 1995 to 23:59:59 31 July 1995 (a total of 31 days). The second
dataset was collected from 00:00:00 1 August 1995 to 23:59:59 31 August 1995. The uncompressed content of the first
dataset is 205.2 MB, and it contains 1,891,715 records. The uncompressed content of the second dataset is 167.8 MB,
and it contains 1,569,898 records.
5.2. Analysis from getting the large page set
The phase in which we obtain the large page set finds frequently visited web pages among the numerous web users; in other words, it finds the web pages that are relatively interesting for analysis. The degree of relevance is controlled by the value of minsup_lps: the smaller the value, the larger the number of large pages obtained. As shown in Figure 6, there exists a value of the minimum support threshold for the large page set beyond which the number of large pages changes little or not at all.
5.3. Analysis of the proposed approach
Comparisons of the efficiency of our proposed approach against the existing algorithms are given in Figures 7 and 8. To evaluate the proposed approach on small and large datasets, we selected experiment data from the web log file NASA_access_log_Jul95 by using a random sampling scheme to generate datasets with 1000 to 1,000,000 records.
Figure 6. The effect of parameter minsup_lps.
Figure 7. Execution time of the improved Apriori algorithm vs the traditional algorithms.
We compared our proposed approach (LPS-MRApriori) with the existing Apriori and MRApriori algorithms, and the proposed approach performs well in execution time.
The experiment results are shown in Figure 7; from them, we can see that the MapReduce-based sequence-pattern-mining algorithms (LPS-MRApriori and MRApriori) achieve virtually linear speedup, and our proposed approach has a better execution time than the existing MRApriori algorithm.
When the dataset is small enough, the MapReduce-based algorithms consume more time because of the additional costs of distributed computation for managing and assigning the task nodes in MapReduce. In other words, MapReduce-based algorithms perform better the bigger the dataset is.
We also compared our proposed approach (LPS-PFP) with the existing FP-Growth and PFP algorithms, and the proposed approach performs well in execution time. The experiment results are shown in Figure 8: the MapReduce-based sequence-pattern-mining algorithm (PFP) achieves virtually linear speedup, with LPS-PFP having a better execution time than the existing FP-Growth and PFP algorithms.
We also analysed the proposed approach based on the speed difference between the different algorithms (Apriori and FP-Growth). Previous research has shown that the FP-Growth algorithm performs better at finding frequent-sequence patterns than the Apriori algorithm; we re-implemented both, and the result is shown in Figure 9. Comparing it with Figure 10, which shows the speed difference between our proposed LPS algorithm combined with the MapReduce-based Apriori and FP-Growth algorithms (LPS-MRApriori and LPS-PFP), we see that combining our proposed LPS algorithm with the existing frequent-sequence-pattern-mining algorithms reduces the difference between the algorithms.
Figure 8. Execution time of the improved FP-Growth algorithm vs the traditional algorithms.
Figure 9. Speed difference between traditional Apriori and FP-Growth algorithms.
Figure 10. Speed difference between the proposed LPS-MRApriori and LPS-PFP algorithms.

5.4. Analysis of data nodes
In this section, we run our proposed approach on the NASA_access_log_Jul95 data while varying the number of data nodes. We set the number of data nodes to 1, 2, 4, 6 and 8, respectively; the resulting relationships between execution time and the number of data nodes are shown in Table 1 and Figure 11. From the experimental results, we can see that execution time swings considerably as the number of data nodes changes. Adding more data nodes does not keep shortening the execution time; there is a specific number of data nodes that makes the execution time shortest.
Table 1. Execution time comparison with different numbers of data nodes.

Data node                             1         2         4         6         8
Input split size (MB)               200       100        50        35        25
Total time (ms)                  96,814    56,241    27,582    30,521    33,857
Average time of Map task (ms)    70,154    42,625    18,859    21,018    21,664
Finished time of Map task (ms)   74,751    46,253    20,661    22,219    23,472
Average time of shuffle (ms)      2,586     2,105     1,252     1,511     1,425
Finished time of shuffle (ms)    82,147    49,552    22,065    24,512    26,574
Average time of Reduce task (ms)  6,514     4,051     1,105     1,045     1,216
Finished time of Reduce task (ms) 91,218   55,142    25,886    28,563    29,963
Killed Map/Reduce task attempts     0/0       0/4       0/5       0/3       0/5

Figure 11. The effect of the number of data nodes.

6. Conclusion
We presented applications of an improved parallel Apriori algorithm (MRApriori) and FP-Growth algorithm (PFP), obtained by combining them with the proposed LPS algorithm, which is based on MapReduce, for web log data. A novel approach to data preprocessing was presented to generate a clean dataset for the frequent-sequence-pattern-mining task. We proposed a novel way to convert web log data into sequence data based on MapReduce. We also presented a novel way to obtain a large page set to reduce the size of the input database, based on MapReduce, in order to save search space and execution time. In the experiments, we found that our proposed MapReduce-based algorithms perform better than the traditional algorithms. In future work, we will compare our proposed algorithms with other frequent-sequence-mining algorithms.
Funding
This work was supported by the National Research Foundation of Korea grant funded by the Korea Government (no. 2008-0062611)
and the Ministry of Science, ICT and Future Planning, Korea, under the Information Technology Research Center support program
(NIPA-2014-H0301-14-1022) supervised by the National IT Industry Promotion Agency and Basic Science Research Program
through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (no.
2013R1A2A2A01068923).
References
[1] Kosala R and Blockeel H. Web mining research: A survey. ACM SIGKDD Explorations Newsletter 2000; 2(1): 1–15.
[2] Wang X, Wang L and Yuan F. Web usage mining. Journal of Hebei University Natural Science Edition 2002; 22: 404–409.
[3] Mobasher B. Web usage mining. In: Web data mining: Exploring hyperlinks, contents and usage data, 1st edn. Berlin: Springer, 2006, pp. 449–483.
[4] Yu X, Li M, Kim T, Jeong SP and Ryu KH. An application of improved gap-BIDE algorithm for discovering access patterns. Applied Computational Intelligence and Soft Computing 2012; 11.
[5] Yu X, Li M, Lee DG, Kim KD and Ryu KH. Application of closed gap-constrained sequential pattern mining in web log data. In: Advances in control and communication. Berlin: Springer, 2012, pp. 649–656.
[6] Yu X, Li M, Paik I and Ryu KH. Prediction of web user behavior by discovering temporal relational rules from web log data. In: Database and expert systems applications. Berlin: Springer, 2012, pp. 31–38.
[7] Agrawal R and Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), Vol. 1215, 1994, pp. 487–499.
[8] Han J, Pei J, Yin Y and Mao R. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery 2004; 8(1): 53–87.
[9] Taylor RC. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 2010; 11(suppl. 12): S1.
[10] Bhandarkar M. MapReduce programming with Apache Hadoop. In: IEEE international symposium on parallel and distributed processing (IPDPS), 2010, p. 1.
[11] Dean J and Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 2008; 51(1): 107–113.
[12] Hadoop Map/Reduce tutorial. Apache, 2010, http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html
[13] Yang XY, Liu Z and Fu Y. MapReduce as a programming model for association rules algorithm on Hadoop. In: 3rd international conference on information sciences and interaction sciences (ICIS), 2010, pp. 99–102.
[14] Li N, Zeng L, He Q and Shi Z. Parallel implementation of Apriori algorithm based on MapReduce. In: 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel & distributed computing (SNPD), 2012, pp. 236–241.
[15] Lin MY, Lee PY and Hsueh SC. Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication. New York: ACM, 2012, p. 76.
[16] Yahya O, Hegazy O and Ezat E. An efficient implementation of Apriori algorithm based on Hadoop-MapReduce model. International Journal of Reviews in Computing 2012; 12: 59.
[17] Li H, Wang Y, Zhang D, Zhang M and Chang EY. PFP: Parallel FP-growth for query recommendation. In: Proceedings of the 2008 ACM conference on recommender systems. New York: ACM, 2008, pp. 107–114.
[18] Chordia BS and Adhiya KP. Grouping web access sequences using sequence alignment method. Indian Journal of Computer Science and Engineering 2011; 2(3): 308–314.