Article

MapReduce-based web mining for prediction of web-user navigation

Journal of Information Science 1–11
© The Author(s) 2014
DOI: 10.1177/0165551514544096

Meijing Li
Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea

Xiuming Yu
Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea

Keun Ho Ryu
Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea

Corresponding author: Keun Ho Ryu, Database/Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, South Korea. Email: khryu@dblab.chungbuk.ac.kr

Abstract
Predicting web user behaviour is typically an application for finding frequent sequence patterns. With the rapid growth of the Internet, a large amount of information is stored in web logs. Traditional frequent-sequence-pattern-mining algorithms are hard pressed to analyse information within big datasets. In this paper, we propose an efficient way to predict the navigation patterns of web users by improving frequent-sequence-pattern-mining algorithms with the programming model of MapReduce, which can handle huge datasets efficiently. Our experiments show that the proposed MapReduce-based algorithm is more efficient than traditional frequent-sequence-pattern-mining algorithms, and a comparison of our proposed algorithms with existing algorithms in web-usage mining also shows that using the MapReduce programming model saves time.

Keywords
Frequent sequence patterns; MapReduce; web-usage mining; web user behaviour

1. Introduction

Data mining is a technology for analysing data to obtain valuable information. The web has become very popular and is widely used, and with the rapid growth in the number of transactions on the web, a large amount of data is automatically gathered by web servers. A vast amount of useful information is hidden in these recorded web data, and analysing them has become a challenge in the field of data mining. Web mining [1] aims to find interesting and valuable patterns in web data. According to the kind of web data analysed, web mining falls into three categories: web-content mining, which extracts useful information and knowledge from web-page contents, such as text and multimedia; web-structure mining, which discovers useful knowledge from the structure of hyperlinks; and web-usage mining, which finds user access patterns in web log files.
In this paper, the groundwork of our proposed approach is finding information about web users' behaviour, so web-usage mining is the theme of our research. Many studies on web-usage mining have been proposed recently [2–6], building on algorithms such as Apriori [7] and FP-Growth [8]. The Apriori algorithm is the most classic and most widely used algorithm for finding frequent-sequence patterns, but it has to scan the dataset many times and generates a large number of candidate sequences when the dataset is large, so it costs too much. Because the resources in our work environment are limited, applying the Apriori algorithm to a huge dataset results in inefficient performance. Compared with Apriori, FP-Growth is a better algorithm for finding frequent-sequence patterns: it needs to scan the dataset only twice. The first scan computes a frequency list (F-list), and the second scan compresses the dataset into an FP-tree. However, when the dataset is huge, that second scan is also very expensive, because of the recursive searching involved in constructing the FP-tree.

Hadoop MapReduce [9, 10] is a programming model and an associated implementation for processing and generating large datasets in parallel [11]. It frees users from the irritating details of coordinating parallel sub-tasks and maintaining distributed file storage, which greatly increases productivity when processing large amounts of data in parallel. A MapReduce program consists of two parts: a map function, which processes a key/value pair to generate a set of intermediate key/value pairs; and a reduce function, which merges all of the intermediate values associated with the same intermediate key [12]. A minimal example of this division of labour is sketched below.
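To make the map/reduce division of labour concrete, the following minimal word-count sketch uses the org.apache.hadoop.mapreduce Java API: the map function emits an intermediate (word, 1) pair per token, and the reduce function sums the values that share a key. This is our own illustration, not code from this paper; the class names and two-argument command line are assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: process one input line and emit an intermediate (word, 1) pair per token.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: merge all intermediate values associated with the same intermediate key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable one : ones) sum += one.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count"); // Hadoop 2.x-style job setup
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}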
To address the above problems, many solutions have been presented that improve the existing algorithms using Hadoop MapReduce, such as the MR-Apriori algorithm, which improves the Apriori algorithm based on the Hadoop MapReduce programming model, and the MR-FPGrowth algorithm, which does the same for the FP-Growth algorithm. In this paper, we apply these improved frequent-sequence-pattern-mining algorithms to web-usage mining; however, the improved algorithms cannot be applied to web log data directly, and when the dataset is huge there is still room for improvement. In our proposed approach, the web log data is transformed into sequence data in parallel using MapReduce, and the current parallel implementations of the frequent-sequence algorithms are improved by reducing the dataset size, again based on MapReduce, before the algorithms are applied.

The rest of this paper is organized as follows. Related work is discussed in Section 2. The problems are stated and illustrated in Section 3. In Section 4, the details of our proposed approach are presented. The experiments and evaluations are discussed in Section 5. Finally, Section 6 concludes the paper.

2. Related work

As mentioned in the previous section, many algorithms have been proposed for sequence-pattern mining; two typical ones are the Apriori [7] and FP-Growth [8] algorithms. Later, parallel Apriori [13–16] and FP-Growth [17] algorithms based on the Hadoop MapReduce model were proposed, because the performance of the traditional Apriori and FP-Growth algorithms is inefficient, especially when dealing with huge datasets. Although the MRApriori [16] and Parallel FP-Growth (PFP) [17] algorithms outperform the traditional sequence-pattern-mining algorithms on relatively large datasets, they are still underpowered for mining web log files when the dataset is very large. This paper proposes an improved implementation of the MRApriori and PFP algorithms that adopts a large page set (LPS) approach based on the Hadoop MapReduce model; we call the resulting algorithms LPS-MRApriori and LPS-PFP. In our experiments, the proposed approach is compared with the existing sequence-pattern-mining algorithms (MRApriori and PFP), and we conclude that our proposed approach to web-usage mining outperforms the others.

3. Problem statement

This section is devoted to the problem statement of this paper. To predict user navigation from a huge web log dataset, two sub-problems need to be solved: transforming the huge web log dataset into sequence data, and enhancing the efficiency of the current best frequent-sequence-pattern-mining algorithms when they are applied.

An access record in a web log file generally consists of many parameters, each with a different meaning. Figure 1 shows a common web log format. Web log data in its raw format is not fit to be used directly for the data-mining task; the data in the web log file must be transformed into sequence data, which can be done on the basis of user sessions.

Figure 1. Common log format for web log data.

Definition 1: A user session is a session of activity for one web user, identified by a unique IP address or user name, on a web site during a specified period of time. For example, if a web site expires after being idle for more than 10 min and must then be accessed anew, the user-session size of this web site is 10 min.

Definition 2: A large page set is a set of frequent web pages, namely the web pages whose support values are greater than or equal to a user-specified minimum support threshold minsup_lps. The support value of a web page is defined as shown in equation (1):

support(p_i) = N_i / N    (1)

where N denotes the number of user sessions, support(p_i) denotes the support value of web page p_i, and N_i denotes the number of user sessions (out of N) in which web page p_i is accessed. For example, assume that there are 100 user sessions and the first web page p_1 is accessed in 80 of them; the support value of this web page is then

support(p_1) = 80/100 = 80%.

If we also assume that the minimum support threshold minsup_lps is 70%, then the first web page is a large web page, because its support value is greater than minsup_lps. A short code illustration of this definition follows.
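To make Definition 2 and equation (1) concrete, the following minimal sketch computes per-page support over an in-memory collection of user sessions and keeps the pages that reach minsup_lps. It is a single-machine illustration only, not the MapReduce implementation described later; all names are our own.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LargePageSet {
    /** Returns the pages whose support N_i / N >= minsupLps, as in equation (1). */
    public static Set<String> largePages(List<Set<String>> sessions, double minsupLps) {
        int n = sessions.size();                        // N: total number of user sessions
        Map<String, Integer> counts = new HashMap<>();  // N_i per page
        for (Set<String> session : sessions) {
            for (String page : session) {               // a page counts at most once per session
                counts.merge(page, 1, Integer::sum);
            }
        }
        Set<String> large = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if ((double) e.getValue() / n >= minsupLps) {
                large.add(e.getKey());                  // support value clears minsup_lps
            }
        }
        return large;
    }
}

With the 100-session example above, largePages would keep p_1 for any minsupLps up to 0.8.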
4. Proposed approach for web-usage mining

In this paper, we apply an improved frequent-sequence-pattern-mining algorithm to web-usage mining to predict web-user navigation. Our approach consists of two phases: data preprocessing and frequent-sequence-pattern mining. In the data preprocessing phase, the web log file is processed by removing irrelevant attributes, transforming URLs into code numbers and removing records with missing values, to obtain clean data for our mining task. In the second phase, the web log file resulting from the first phase is transformed into sequence data on the basis of user sessions using the MapReduce programming model. Then, in order to reduce the dataset size, a large web page set is generated, again based on the MapReduce programming model. Finally, existing parallel frequent-sequence-pattern-mining algorithms are applied to the reduced dataset to obtain access patterns. Figure 2 shows the workflow of our approach.

Figure 2. Workflow of our approach.

4.1. Data preprocessing

Web log data is automatically recorded in web log files on web servers when web users access the server through their browsers. Not all of the records stored in the web log files are legal or necessary for the mining task, so before the web log data is analysed, a data-cleaning phase needs to be carried out. A sketch of such a record filter is given after the four steps below.

4.1.1. Removing records with missing values. Some of the records stored in the web log file are incomplete because some of their parameters are lost. For example, if a click-through to a web page was executed while the web server was shut down, only the IP address, user ID and access time are recorded in the log file; the method, URL, referrer and agent are lost. This kind of record is illegal for our mining task, so these records must be removed.

4.1.2. Removing illegal records with exception status numbers. Some illegal records are caused by errors in the requests or by the server. Although these records are intact, the activity did not execute normally. For example, records with status numbers 400 or 404 are caused by HTTP client errors (a bad request, or a requested resource not found), and records with status numbers 500 or 505 are caused by HTTP server errors (an internal server error, or an unsupported HTTP version). These kinds of data are illegal for our task, so the records must be removed.

4.1.3. Removing irrelevant records with no significant URLs. Some URLs in the records have .txt, .jpg, .gif or .js extensions; such requests are generated automatically while a web page is being loaded. These records are irrelevant to our mining task, so they must be removed.

4.1.4. Selecting the essential attributes. As the common log format in Figure 1 shows, there are many attributes in one record, but not all of them are necessary for web-usage mining. In this paper, the IP address, time and URL are the essential attributes for our task, so they remain and the rest of the attributes are discarded.
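As referenced above, a minimal single-machine sketch of steps 4.1.1–4.1.4 might look as follows. The regular expression and field positions assume the common log format of Figure 1; the class name and tab-separated output layout are our own illustration, not the paper's exact implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogCleaner {
    // Common log format: host ident authuser [date] "method url protocol" status bytes
    private static final Pattern LOG = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"\\S+ (\\S+) [^\"]*\" (\\d{3}) \\S+.*$");
    private static final Pattern IRRELEVANT =
        Pattern.compile(".*\\.(txt|jpg|gif|js)$", Pattern.CASE_INSENSITIVE);

    /** Returns "ip<TAB>time<TAB>url" for a clean record, or null if the record must be dropped. */
    public static String clean(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.matches()) return null;                  // 4.1.1: incomplete record, parameters lost
        int status = Integer.parseInt(m.group(4));
        if (status == 400 || status == 404 || status == 500 || status == 505) {
            return null;                                // 4.1.2: HTTP client/server error statuses
        }
        String url = m.group(3);
        if (IRRELEVANT.matcher(url).matches()) return null; // 4.1.3: .txt/.jpg/.gif/.js requests
        return m.group(1) + "\t" + m.group(2) + "\t" + url; // 4.1.4: keep only IP, time, URL
    }
}

clean() would be invoked once per log line, either in a plain loop or from the map function of a preprocessing MapReduce job.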
4.2. Frequent-sequence-pattern mining

In this section, an efficient approach to finding access patterns is presented. Starting from the web log file, the web log data is first transformed into sequence data. Then, based on MapReduce, a large web page set is generated to reduce the size of the dataset. Finally, two existing parallel frequent-sequence-pattern-mining algorithms, MRApriori and PFP, are applied to the reduced sequence dataset to obtain access patterns.

4.2.1. Generation of sequence data. The challenge in generating sequence data from web log files is identifying a single activity of accessing a website. In this section, we identify such an activity based on the definition of a user session (Definition 1). Some websites specify user sessions themselves: for security-sensitive websites the specified period of a user session is shorter, such as 10 min, while for general websites it is longer, typically set at 1 h. Some websites do not define a user session at all. In this paper, we define the specified period of a user session as 1 h, which means that all the access records of one web user with a unique IP address within 1 h can be considered one transaction, and all the URLs in a transaction can be sorted as one sequence.

To obtain all the transactions transformed from the web log data, we first sort the preprocessed data by IP address and time. We then group the records by splitting them into hours for each IP address, based on the specified period of a user session, and all the URLs requested in one user session by the same user are sorted into one transaction based on the MapReduce programming model. In the map function, the IP and Time attributes, connected by a colon, are defined as the intermediate key, and the URL is defined as the intermediate value. The map function can be written as map(IP:Time, URL) → list(IP:Time, URL), and intermediate key/value pairs are generated. In the reduce function, all of the intermediate values associated with the same intermediate key are merged, so the reduce function can be written as reduce(IP:Time, list(URL)) → (IP:Time, URLs).

For example, in Figure 3 the data in the left table is the input to the map function. The first two records are processed as map(82.117.202.158:1, 4) and map(82.117.202.158:1, 1), which generates the map output list {(82.117.202.158:1, 4), (82.117.202.158:1, 1)}. In the reduce function, '82.117.202.158:1' is the intermediate key, so the input to the reduce function is reduce(82.117.202.158:1, {4, 1}), in which {4, 1} is the list of URLs. Merging the intermediate records with the same key '82.117.202.158:1' yields the reduce output (82.117.202.158:1, 4:1). The final sequence data for the sample data in the left table is shown in the right table of Figure 3. A Hadoop sketch of this map/reduce pair is given below.

Figure 3. Generation of sequence data.
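A minimal Hadoop sketch of the map/reduce pair just described might look as follows. It assumes the cleaned input lines carry "IP<TAB>session-number<TAB>URL-code"; the field layout and class names are our assumptions, and a secondary sort on time would be needed to guarantee the URL order within each sequence.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SequenceGeneration {
    // map(IP:Time, URL) -> list(IP:Time, URL)
    public static class SessionMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");   // f[0]=IP, f[1]=session no., f[2]=URL code
            ctx.write(new Text(f[0] + ":" + f[1]), new Text(f[2]));
        }
    }

    // reduce(IP:Time, list(URL)) -> (IP:Time, URL1:URL2:...)
    public static class SessionReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> urls, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder seq = new StringBuilder();
            for (Text url : urls) {                     // concatenate the session's URL codes
                if (seq.length() > 0) seq.append(':');
                seq.append(url.toString());
            }
            ctx.write(key, new Text(seq.toString()));   // e.g. (82.117.202.158:1, 4:1)
        }
    }
}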
4.2.2. Generation of the large page set. A large page set is the set of frequent web pages defined in Definition 2. The web log file having been converted into sequence data in the previous section, an LPS is now generated from the sequence data based on the MapReduce programming model. In the map function, the IP and event attributes, connected by a colon, are defined as the intermediate key, and '1' is defined as the intermediate value. The map function can be written as map(IP:event, 1) → list(IP:event, 1), and intermediate key/value pairs are generated. In the reduce function, all the intermediate values associated with the same intermediate key are merged, and the reduce function can be written as reduce(event, 1) → (event, sum).

As shown in the sequence data in the left table of Figure 4, there are four IP addresses, each denoting one web user. The first sequence is processed as map(82.117.202.158:4, 1) and map(82.117.202.158:1, 1) on input to the map function; the fourth sequence is processed as map(82.117.202.158:4, 1). For the output of the map function, we merge the pairs with the same intermediate key; for example, for the first web user the output is merged into {(82.117.202.158:1, 1), (82.117.202.158:2, 1), (82.117.202.158:4, 1), (82.117.202.158:6, 1), (82.117.202.158:7, 1)}. In this way, each event occurs at most once per IP address in the map output, even though some events occur more than once per IP address in the map input. The pseudocode of the map function is shown in Algorithm 1.

In the reduce function, we count the number of web users with whom each event occurs. For example, for the first IP address the input to the reduce function is reduce(1,1), reduce(2,1), reduce(4,1), reduce(6,1) and reduce(7,1); for the second, reduce(2,1), reduce(4,1) and reduce(7,1); for the third, reduce(2,1), reduce(6,1) and reduce(7,1); and for the fourth, reduce(3,1), reduce(4,1), reduce(5,1), reduce(6,1) and reduce(7,1). The output of the reduce function is shown in the right table of Figure 4. Here, we assume that the minimum support threshold of the large page set, minsup_lps, is 75%, which is equivalent to a minimum support count of three out of four web user sessions; we then get the large pages {2, 4, 6, 7}, whose support counts are greater than or equal to the minimum support count 3. The pseudocode of the reduce function is shown in Algorithm 2, and a Hadoop-style rendering of both algorithms follows.

Figure 4. Generation of the large page set.

Algorithm 1. Map(key, value)
Input: a set D of sequences of web data.
Output: <key#, value#>, where key# is set to "IP:event" and value# equals 1.
1. for each sequence Di in D
2.     key#.set(Di.IP + ":" + Di.event);
3.     value#.set(1);
4.     context.write(key#, value#);

Algorithm 2. Reduce(key, value)
Input: the set of candidate <key#, value#> pairs generated by Algorithm 1; the user-specified minimum support threshold minsup_lps; and user_session, the number of web user sessions.
Output: <key##, value##>, where key## is a large web page and value## is the support count of key##.
1. value## = 0;
2. for each value# associated with key## in context {   // pairs generated by Algorithm 1
3.     value## = value## + 1;
4. }
5. if (value## / user_session >= minsup_lps) {
6.     context.write(key##, value##);
7. }
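A minimal Hadoop-style rendering of Algorithms 1 and 2 might look as follows. Deduplication of events within one user session (so that an event counts once per session, as described above) is done here inside the mapper with a HashSet; this is our reading of the paper, not its exact code, and the configuration keys, class names and input layout ("IP:session<TAB>URL1:URL2:...") are assumptions.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LargePageSetJob {
    // Algorithm 1: one input line per user session, e.g. "82.117.202.158:1<TAB>4:1:2:6:7".
    public static class LpsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            Set<String> events = new HashSet<>();      // an event counts once per session
            for (String e : f[1].split(":")) events.add(e);
            for (String e : events) ctx.write(new Text(e), ONE);
        }
    }

    // Algorithm 2: sum the per-session counts and keep the events reaching minsup_lps.
    public static class LpsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text event, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            float minsupLps = ctx.getConfiguration().getFloat("minsup_lps", 0.75f);
            long userSessions = ctx.getConfiguration().getLong("user_sessions", 1L);
            int count = 0;
            for (IntWritable v : ones) count += v.get();
            if ((double) count / userSessions >= minsupLps) {
                ctx.write(event, new IntWritable(count)); // e.g. large pages {2, 4, 6, 7}
            }
        }
    }
}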
4.2.3. Generation of frequent-sequence patterns. Once the large page set is obtained, the frequently visited web pages are known, and the parallel frequent-sequence-pattern algorithms can be applied to the sequence dataset after the infrequent events have been filtered out. In this paper, we improve the existing algorithms (MRApriori, the MapReduce-based Apriori algorithm, and PFP, the Parallel FP-Growth algorithm) by first computing the large page set and then applying the proposed approach to the web log data. As shown in Figure 5, before the frequent-sequence patterns are mined, the infrequent events in the sequence dataset are filtered out based on the large page set, which can greatly reduce the dataset size; a short sketch of this filter is given below. When mining the frequent-sequence patterns, we assume that the minimum support threshold of frequent-sequence patterns, minsup_fsp, is 50%, which is equivalent to a minimum support count of two out of four web users; we then get the final frequent-sequence patterns {{6, 2}, {6, 7}}, with support counts 2 and 3 respectively.

Figure 5. Generation of frequent-sequence patterns.
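The pre-filtering step might be sketched as follows: each sequence is reduced to the events that appear in the large page set before the pattern-mining algorithms run. The in-memory setting and names are our illustration; in the full approach this filtering would itself run as a MapReduce pass.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SequenceFilter {
    /** Drops every event that is not in the large page set; empty sequences are discarded. */
    public static List<List<String>> filter(List<List<String>> sequences, Set<String> largePages) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> seq : sequences) {
            List<String> kept = new ArrayList<>();
            for (String event : seq) {
                if (largePages.contains(event)) kept.add(event);
            }
            if (!kept.isEmpty()) out.add(kept);   // shrunken input for MRApriori / PFP
        }
        return out;
    }
}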
5. Experiment and evaluation

We ran several experiments to evaluate the performance of the proposed algorithms. All experiments were performed with Hadoop 1.0.3 on a personal computer with an Intel Core 2 Duo E7500 CPU, 4 GB RAM and a 480 GB hard disk running the Windows operating system. The proposed algorithms were implemented in Java.

5.1. Experiment data

In the experiments, we used web log data from a NASA website (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html), which is cited in Chordia and Adhiya [18], because that web log is large-scale temporal data with time points. An event file can be generated from a web log file including the URL requested, the HTTP method requested, the IP address from which the request originated and a timestamp: the IP address in a web log record is assigned to the customer-ID, the HTTP method with the URL is converted to an event type and the timestamp is assigned to the event-time. Two web log files, NASA_access_log_Jul95 and NASA_access_log_Aug95, were used as our two experiment datasets. The first dataset was collected from 00:00:00 1 July 1995 to 23:59:59 31 July 1995 (a total of 31 days); its uncompressed content is 205.2 MB and it contains 1,891,715 records. The second dataset was collected from 00:00:00 1 August 1995 to 23:59:59 31 August 1995; its uncompressed content is 167.8 MB and it contains 1,569,898 records.

5.2. Analysis of getting the large page set

The phase in which the large page set is obtained with the proposed approach finds the web pages visited frequently among the numerous web users; in other words, it finds the relatively interesting web pages for analysis. The degree of relativity is controlled by the value of minsup_lps: the smaller the value, the more large pages are obtained. As shown in Figure 6, there exists a value of the minimum support threshold for the large page set beyond which the number of large pages changes little or not at all.

Figure 6. The effect of parameter minsup_lps.

5.3. Analysis of the proposed approach

Comparisons of the efficiency of our proposed approach against the existing algorithms are given in Figures 7 and 8. To evaluate the proposed approach with small and large datasets, we selected experiment data from the web log file NASA_access_log_Jul95 using a random sampling scheme to generate datasets with 1000–1,000,000 records.

We compared our proposed approach (LPS-MRApriori) with the existing Apriori and MRApriori algorithms, and the proposed approach performs well in execution time. The experiment results are shown in Figure 7; from them, we can see that the MapReduce-based sequence-pattern-mining algorithms (LPS-MRApriori and MRApriori) achieve virtually linear speedup, and that our proposed approach performs better in execution time than the existing MRApriori algorithm. When the dataset is small enough, the MapReduce-based algorithms consume more time, because of the additional distributed-computation costs of managing and assigning the task nodes in MapReduce; in other words, MapReduce-based algorithms perform better when the dataset is bigger.

Figure 7. Execution time of the improved Apriori algorithm vs the traditional algorithms.

We also compared our proposed approach (LPS-PFP) with the existing FP-Growth and PFP algorithms, and the proposed approach performs well in execution time. The experiment results are shown in Figure 8: the MapReduce-based sequence-pattern-mining algorithm (PFP) achieves virtually linear speedup, with LPS-PFP performing better in execution time than both the existing FP-Growth and PFP algorithms.

Figure 8. Execution time of the improved FP-Growth algorithm vs the traditional algorithms.

We also analysed the proposed approach in terms of the speed difference between the underlying algorithms (Apriori and FP-Growth). Previous research has shown that the FP-Growth algorithm performs better at finding frequent-sequence patterns than the Apriori algorithm; we re-implemented this comparison, and the result is shown in Figure 9. Comparing it with Figure 10, which shows the speed difference between our proposed LPS algorithm combined with the MapReduce-based Apriori and FP-Growth algorithms (LPS-MRApriori and LPS-PFP), we see that combining our proposed LPS algorithm with the existing frequent-sequence-pattern-mining algorithms reduces the difference between the algorithms.

Figure 9. Speed difference between traditional Apriori and FP-Growth algorithms.

Figure 10. Speed difference between the proposed LPS-MRApriori and LPS-PFP algorithms.

5.4. Analysis of data nodes

In this section, we ran our proposed approach on the data of NASA_access_log_Jul95 while varying the number of data nodes. We set the number of data nodes to 1, 2, 4, 6 and 8, respectively; the relationship between execution time and the number of data nodes is shown in Table 1 and Figure 11. From the experimental results, we can see that the execution time swings considerably as the number of data nodes changes. The execution time does not keep decreasing as the number of data nodes grows; rather, there is a specific number of data nodes at which the execution time is shortest.

Table 1. Execution time comparison with different numbers of data nodes.

Data nodes                               1        2        4        6        8
Input split size (MB)                  200      100       50       35       25
Total time (ms)                     96,814   56,241   27,582   30,521   33,857
Average time of Map task (ms)       70,154   42,625   18,859   21,018   21,664
Finished time of Map task (ms)      74,751   46,253   20,661   22,219   23,472
Average time of shuffle (ms)         2,586    2,105    1,252    1,511    1,425
Finished time of shuffle (ms)       82,147   49,552   22,065   24,512   26,574
Average time of Reduce task (ms)     6,514    4,051    1,105    1,045    1,216
Finished time of Reduce task (ms)   91,218   55,142   25,886   28,563   29,963
Killed map/Reduce task attempts        0/0      0/4      0/5      0/3      0/5

Figure 11. The effect of the number of data nodes.
6. Conclusion

We presented applications of an improved parallel Apriori algorithm (MRApriori) and FP-Growth algorithm (PFP) by combining them with the proposed LPS algorithm, which is based on MapReduce, for web log data. A novel approach to data preprocessing was presented to generate a clean dataset for the frequent-sequence-pattern-mining task. We proposed a novel way to convert web log data into sequence data based on MapReduce, and a novel way to obtain a large event dataset that reduces the size of the input database, again based on MapReduce, in order to save search space and execution time. In the experiments, we found that our proposed MapReduce-based algorithms perform better than the traditional algorithms. In future work, we will compare our proposed algorithms with other frequent-sequence-mining algorithms.

Funding

This work was supported by the National Research Foundation of Korea grant funded by the Korea Government (no. 2008-0062611); by the Ministry of Science, ICT and Future Planning, Korea, under the Information Technology Research Center support programme (NIPA-2014-H0301-14-1022) supervised by the National IT Industry Promotion Agency; and by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT and Future Planning (no. 2013R1A2A2A01068923).

References

[1] Kosala R and Blockeel H. Web mining research: A survey. ACM SIGKDD Explorations Newsletter 2000; 2(1): 1–15.
[2] Wang X, Wang L and Yuan F. Web usage mining. Journal of Hebei University Natural Science Edition 2002; 22: 404–409.
[3] Mobasher B. Web usage mining. In: Web data mining: Exploring hyperlinks, contents and usage data, 1st edn. Berlin: Springer, 2006, pp. 449–483.
[4] Yu X, Li M, Kim T, Jeong SP and Ryu KH. An application of improved gap-BIDE algorithm for discovering access patterns. Applied Computational Intelligence and Soft Computing 2012; 11.
[5] Yu X, Li M, Lee DG, Kim KD and Ryu KH. Application of closed gap-constrained sequential pattern mining in web log data. In: Advances in control and communication. Berlin: Springer, 2012, pp. 649–656.
[6] Yu X, Li M, Paik I and Ryu KH. Prediction of web user behavior by discovering temporal relational rules from web log data. In: Database and expert systems applications. Berlin: Springer, 2012, pp. 31–38.
[7] Agrawal R and Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), Vol. 1215, 1994, pp. 487–499.
[8] Han J, Pei J, Yin Y and Mao R. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery 2004; 8(1): 53–87.
[9] Taylor RC.
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 2010; 11(suppl. 12): S1.
[10] Bhandarkar M. MapReduce programming with Apache Hadoop. In: IEEE international symposium on parallel and distributed processing (IPDPS), 2010, p. 1.
[11] Dean J and Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 2008; 51(1): 107–113.
[12] Hadoop Map/Reduce tutorial. Apache, 2010, http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html.
[13] Yang XY, Liu Z and Fu Y. MapReduce as a programming model for association rules algorithm on Hadoop. In: IEEE 2010 3rd international conference on information sciences and interaction sciences (ICIS), 2010, pp. 99–102.
[14] Li N, Zeng L, He Q and Shi Z. Parallel implementation of Apriori algorithm based on MapReduce. In: IEEE 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel & distributed computing (SNPD), 2012, pp. 236–241.
[15] Lin MY, Lee PY and Hsueh SC. Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication. New York: ACM, 2012, p. 76.
[16] Yahya O, Hegazy O and Ezat E. An efficient implementation of Apriori algorithm based on Hadoop-MapReduce model. International Journal of Reviews in Computing 2012; 12: 59.
[17] Li H, Wang Y, Zhang D, Zhang M and Chang EY. PFP: Parallel FP-Growth for query recommendation. In: Proceedings of the 2008 ACM conference on recommender systems. New York: ACM, 2008, pp. 107–114.
[18] Chordia BS and Adhiya KP. Grouping web access sequences using sequence alignment method. Indian Journal of Computer Science and Engineering 2011; 2(3): 308–314.