Download 1 - wseas

Using Fuzzy Techniques for Different Aspects of Web Mining

CARO LUCAS¹, AMIR HOSSEIN KEYHANIPOUR², TAYYEBE SARHADI²
¹Control and Intelligent Processing Center of Excellence, Electrical and Computer Eng. Department, University of Tehran, Tehran, Iran, and School of Intelligent Systems, Institute for Studies in Theoretical Physics and Mathematics, Tehran, Iran
²Control and Intelligent Processing Center of Excellence, Electrical and Computer Eng. Department, University of Tehran, Tehran, Iran

Abstract: Fuzzy ideas and techniques are widely used in real-world applications. Web mining is a challenging application area because of the size and growth of the Web. Several attempts have been made to overcome these problems, and one of the most promising is to improve traditional web mining techniques with fuzzy ideas. In this research, the basics of a fuzzy search engine are introduced using some fuzzy techniques. In content mining, fuzzy techniques such as linguistic indexing and fuzzy pattern matching are used. Some basic fuzzy ideas are employed to improve the performance of conventional structure mining, and finally fuzzy clustering techniques are used in web usage mining. A comparison of the search results with those of traditional web mining techniques shows the efficiency of the proposed algorithms.

Key-words: Web Mining, Search engine, Fuzzy, Content Mining, Structure Mining, Usage Mining, Linguistic Indexing, Pattern Matching.

1. Introduction

The evolution of the World Wide Web has brought us enormous and ever-growing amounts of data and information. According to [1], Google had indexed about 4.3 billion text documents by May 2004. With the abundant data it provides, the web has become an important resource for research. However, traditional data extraction and mining techniques cannot be applied directly to the web because of its semistructured or even unstructured nature.
Traditional web mining is concerned with “the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services” [3]. Three areas of web mining are commonly distinguished: content mining, structure mining, and usage mining (see Fig. 1) [2].

Web content mining is a form of text mining. For web documents, the mining methods focus mainly on information extraction and integration (i.e., gathering explicit information from different web sites so that it can be accessed).

Web structure mining usually operates on the hyperlink structure of web pages. The primary web resource being mined is a set of pages, ranging from a single web site to the Web as a whole. Web structure mining exploits the additional information that is (often implicitly) contained in the structure of hypertext.

In web usage mining, the primary web resource being mined is a record of the requests made by visitors to a web site, most often collected in a web server log. This data yields a template of user behavior, which can be used to personalize and improve search results according to user characteristics.

Numerous projects have been defined and carried out in this field. We mention only one of them, “Web Personalization and Mining Using Robust Fuzzy Clustering Methods” [9]. Its objectives include developing new robust fuzzy relational clustering algorithms that are suitable for large applications, together with suitable similarity metrics between feature vectors, and developing new similarity metrics to handle non-numeric features when a numeric approach is infeasible.

In this paper we apply fuzzy techniques in all areas of traditional web mining to enhance the process. We first review previous work and then present our web mining approaches in the following sections. Finally, experimental results are discussed.

2. Related Works

Several fuzzy data mining techniques have been proposed, such as the fuzzy linguistic approach to information retrieval [4,5,6] and fuzzy aggregation methods for information retrieval [7,8]. Data mining tools that use these techniques have also been proposed; a well-known one is DataEngine [15], produced by “Intelligenter Technologien GmbH”. It is a data retrieval tool that uses decision trees, fuzzy rules, K-means, neural networks (MLP, Kohonen) and linear regression.

Another group of existing tools is fuzzy search engines. The best known are Excite and SearchNZ. The latter is a fault-tolerant ("fuzzy") search engine restricted to New Zealand cyberspace. Here "fuzzy" means that it lets you search for words even if you are not sure of the correct spelling.

3. Fuzzy Web Mining

Web mining, viewed in data mining terms, has three operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed). As in most real-world problems, the clusters and associations in web mining do not have crisp boundaries and often overlap considerably. In addition, bad exemplars (outliers) and incomplete data can easily occur in the data set, for a wide variety of reasons inherent to web browsing and logging. Another important limitation is the query interface, which is based on keyword-oriented search. Fuzzifying these operations is a natural way to address these issues. In the following sections we introduce fuzzy extensions of web content, structure and usage mining.

4. Fuzzy Web Content Mining

Fuzzy web content mining is the use of fuzzy ideas in traditional web content mining. Traditional web content mining has two phases: web page content mining and search result mining. The core of the former phase is resource identification, which is the process of retrieving the intended web documents. It is done by web search and metasearch engines, or by crawlers. We will focus on crawlers and fuzzify them. The latter phase includes clustering search results and categorizing documents using phrases in titles and snippets. The critical operations are:

• Preprocessing consists of two tasks: selecting interesting data from the downloaded web documents, and transforming this data into a formal representation. Most methods use wrappers to extract simple data (e.g., proper names, prices, phone numbers, e-mail addresses) from web documents and construct tables as formal representations. Fuzzy aggregation (e.g., OWA operators) can be used for linguistic indexing of documents.
• Generalization is the automatic discovery of patterns across multiple web documents. Most methods use data mining techniques to discover association rules, clusters, classification trees and classification rules. Fuzzy clustering and classification are fuzzy extensions of this process.

4.1. Fuzzy Crawlers

Crawlers are robots (spiders) that traverse the hypertext structure of the Web and collect information from visited pages. They are used to construct indexes for search engines. There are several types of crawlers: traditional crawlers, periodic crawlers, incremental crawlers and focused crawlers. Here we describe a fuzzy focused crawler. A focused crawler [12] retrieves documents relevant to a predefined topic, trying to avoid irrelevant areas of the Web. It uses a classifier to relate documents to topics. The classifier also determines how useful outgoing links are. Using fuzzy classifiers can improve the crawling mechanism.

4.2. Fuzzy Pattern Matching

The aim of pattern matching in this part is to find the similarity between a query and document clusters. A fuzzy approach, as described here, can improve the quality of search results. Fundamentally, the close relationship between fuzzy set theory and pattern recognition comes from the fact that the vast majority of real-world classes are fuzzy in nature. Of all the fuzzy techniques that have been developed, the fuzzy c-means algorithm is probably the most widely used. Fuzzy c-means (FCM) is a clustering method that allows one piece of data to belong to two or more clusters [16]. It is frequently used in pattern recognition and is based on minimization of the following objective function:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2, \qquad 1 \le m < \infty \tag{1}$$

where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$, $x_i$ is the $i$th of the $d$-dimensional measured data, $c_j$ is the $d$-dimensional center of cluster $j$, and $\lVert \cdot \rVert$ is any norm expressing the similarity between a measured datum and a center. Fuzzy partitioning is carried out through an iterative optimization of the objective function above, with the memberships $u_{ij}$ and the cluster centers $c_j$ updated by:

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \tag{2}$$

The iteration stops when $\max_{ij} \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| < \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ is the iteration step. This procedure converges to a local minimum or a saddle point of $J_m$.

4.3. Fuzzy Indexing

Search results are documents containing a combination of the search keywords. To identify the importance of these results, they must be indexed according to a policy. The traditional approach indexes the results with crisp methods. A fuzzy approach is LOWA (linguistic indexing using OWA aggregation). The LOWA operator [10] is based on the OWA operator defined by Yager [11].

Definition 1. Let $A = \{a_1, \dots, a_m\}$ be a set of labels to be aggregated. Then the LOWA operator $\phi$ is defined as:

$$\phi(a_1, \dots, a_m) = W \cdot B^T = C^m\{w_k, b_k, k = 1, \dots, m\} = w_1 \odot b_1 \oplus (1 - w_1) \odot C^{m-1}\{\beta_h, b_h, h = 2, \dots, m\} \tag{3}$$

where $W = [w_1, \dots, w_m]$ is a weighting vector such that (i) $w_i \in [0,1]$ and (ii) $\sum_i w_i = 1$; $\beta_h = w_h / \sum_{k=2}^{m} w_k$, $h = 2, \dots, m$; and $B = \{b_1, \dots, b_m\}$ is a vector associated with $A$ such that

$$B = \sigma(A) = \{a_{\sigma(1)}, \dots, a_{\sigma(m)}\} \tag{4}$$

in which $a_{\sigma(j)} \le a_{\sigma(i)}$ for all $i \le j$, with $\sigma$ being a permutation over the set of labels $A$. $C^m$ is the convex combination operator of $m$ labels; if $m = 2$, it is defined as

$$C^2\{w_i, b_i, i = 1, 2\} = w_1 \odot s_j \oplus (1 - w_1) \odot s_i = s_k, \qquad s_j, s_i \in S, \ (j \ge i)$$

such that $k = \min\{T, \ i + \mathrm{round}(w_1 \cdot (j - i))\}$, where "round" is the usual round operation, and $b_1 = s_j$, $b_2 = s_i$. If $w_j = 1$ and $w_i = 0$ for all $i \ne j$, then the convex combination is defined as:

$$C^m\{w_i, b_i, i = 1, \dots, m\} = b_j$$

The first step is the assignment of linguistic quantifiers to keywords, as described below. The retrieved documents are then indexed using the LOWA operator. Examples of using LOWA are described in [11]. A possible way to calculate the weighting vector of the LOWA operator was proposed by Yager; for a nondecreasing proportional fuzzy linguistic quantifier $Q$, it is given by:

$$w_i = Q\!\left(\frac{i}{n}\right) - Q\!\left(\frac{i-1}{n}\right), \qquad i = 1, \dots, n \tag{5}$$

where $n$ is the number of labels being aggregated and the membership function of $Q$ is:

$$Q(r) = \begin{cases} 0 & \text{if } r < a \\ \dfrac{r - a}{b - a} & \text{if } a \le r \le b \\ 1 & \text{if } r > b \end{cases} \tag{6}$$

5. Fuzzy Web Structure Mining

Traditional web structure mining usually operates on the hyperlink structure of web pages to exploit the additional information that is (often implicitly) contained in the structure of hypertext. An important application area is therefore the identification of the relative relevance of different pages that appear equally pertinent when their content is analyzed in isolation. Another application area is the identification of the relative importance of web pages, which is used to prioritize the pages returned from a search. Two of the main techniques used in this area are PageRanking and the CLEVER method.
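Before the structure mining techniques are described, the LOWA aggregation of Section 4.3 can be made concrete with a short sketch. It is only an illustration under stated assumptions: labels are represented by their integer indices in an ordered label set S = {s_0, ..., s_T} (so `top` stands for T), "round" is taken as round-half-up, and the quantifier parameters (0.3, 0.8) in the usage note are an assumed example rather than values given in this paper. The function names are hypothetical.

```python
import math


def quantifier(r, a, b):
    """Membership function Q(r) of equation (6); assumes a < b."""
    if r < a:
        return 0.0
    if r > b:
        return 1.0
    return (r - a) / (b - a)


def quantifier_weights(n, a, b):
    """Weights w_i = Q(i/n) - Q((i-1)/n) of equation (5)."""
    return [quantifier(i / n, a, b) - quantifier((i - 1) / n, a, b)
            for i in range(1, n + 1)]


def convex_combination(weights, labels, top):
    """C^m of equation (3). `labels` are label indices sorted in
    descending order; `top` is the index T of the highest label."""
    if len(labels) == 1:
        return labels[0]
    w1 = weights[0]
    rest = sum(weights[1:])
    if rest == 0:
        # w_1 = 1 and all other weights are 0: the result is b_1.
        return labels[0]
    betas = [w / rest for w in weights[1:]]
    i = convex_combination(betas, labels[1:], top)  # C^{m-1} on the tail
    j = labels[0]                                   # b_1 = s_j, with j >= i
    # k = min{T, i + round(w_1 * (j - i))}, rounding half up (assumption).
    return min(top, i + math.floor(w1 * (j - i) + 0.5))


def lowa(labels, weights, top):
    """LOWA operator: order the labels (equation (4)), then combine them."""
    return convex_combination(weights, sorted(labels, reverse=True), top)


weights = quantifier_weights(3, 0.3, 0.8)
index = lowa([5, 2, 3], weights, top=6)
```

With these assumed parameters the three weights sum to 1 and the labels s_5, s_2 and s_3 aggregate to s_3. The weight vectors [1, 0, 0] and [0, 0, 1] return the highest and the lowest label respectively, which matches the degenerate case stated for the convex combination.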
In the following subsections, fuzzy extensions of these techniques are introduced first, and then a fuzzy version of the HITS algorithm that uses them is presented.

5.1. Fuzzy PageRanking

PageRanking is used by Google to prioritize the pages returned from a search by looking at web structure. The importance of a page is calculated from the number of pages that point to it (its backlinks). Weighting is used to give more importance to backlinks coming from important pages. Traditional PageRanking is:

$$PR(p) = \frac{N_{1p}}{N_1} + \dots + \frac{N_{np}}{N_n} \tag{7}$$

where $N_{ip}$ is the number of links from page $i$ to page $p$, and $N_i$ is the number of links coming out of page $i$.

A suggestion for fuzzy PageRanking is:

$$PR(p) = \mathrm{Rank}\!\left( \mathrm{Sup}\!\left( \frac{N_{1p}}{N_1}, \dots, \frac{N_{np}}{N_n} \right) \right) \tag{8}$$

where $N_{ip}$ and $N_i$ are as in (7), Sup is a fuzzy union operator, and

$$\mathrm{Rank}(x) = \begin{cases} \text{Very Important}, & x \ge 0.9 \\ \text{Important}, & 0.7 \le x < 0.9 \\ \text{Related}, & 0.5 \le x < 0.7 \\ \text{Good}, & 0.3 \le x < 0.5 \\ \text{Irrelevant}, & x < 0.3 \end{cases}$$

5.2. Fuzzy CLEVER Method

Web structure mining analyzes hyperlink topology by discovering authoritative information sources. This information is found in authority pages, which are defined in relation to hubs as their counterparts: hubs are pages that link to many related authorities, and authorities are highly important pages that are the best sources of the requested information. According to these criteria, the CLEVER method ranks pages primarily by measuring the links between them. CLEVER works as follows:

1. Start by collecting a set of pages.
2. Gather all pages of the initial links, plus any pages linking to them.
3. Rank the result by counting links.
4. Links are noisy, so it is not yet clear which pages are best.
5. Recalculate the scores.
6. Pages with the most links are established as most important, and their links transmit more weight.
7. Repeat the calculation a number of times until the scores are refined.
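The difference between the traditional ranking (7) and the fuzzy ranking (8) of Section 5.1 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the link graph, the page names and the function names are hypothetical, Sup is taken to be the maximum, and the treatment of the boundary values of Rank (lower bound inclusive) is an assumption.

```python
def link_ratios(links, page):
    """Return N_ip / N_i for every page i that links to `page`."""
    ratios = []
    for targets in links.values():
        n_ip = targets.count(page)              # links from page i to page p
        if n_ip:
            ratios.append(n_ip / len(targets))  # over all links leaving page i
    return ratios


def traditional_rank(links, page):
    """Equation (7): the link ratios are added up."""
    return sum(link_ratios(links, page))


def rank_label(x):
    """Linguistic Rank of equation (8); boundary handling is an assumption."""
    if x >= 0.9:
        return "Very Important"
    if x >= 0.7:
        return "Important"
    if x >= 0.5:
        return "Related"
    if x >= 0.3:
        return "Good"
    return "Irrelevant"


def fuzzy_rank(links, page):
    """Equation (8): Sup (here max) replaces the sum, then Rank labels it."""
    score = max(link_ratios(links, page), default=0.0)
    return score, rank_label(score)


# Hypothetical link graph: page -> list of pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a", "b", "c", "e"]}
```

For page "c" the link ratios are 0.5, 1.0 and 0.25, so `traditional_rank(links, "c")` gives 1.75 while `fuzzy_rank(links, "c")` gives `(1.0, "Very Important")`: the many small contributions no longer accumulate, and only the strongest backlink counts.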
Using the fuzzy PageRanking technique can be considered a fuzzy extension of the CLEVER method.

5.3. Fuzzy HITS Algorithm

The HITS (Hyperlink-Induced Topic Search) algorithm, introduced by Kleinberg in 1998, is a topic-focused search method. It views the Web as a graph and uses a purely link-structure-based computation, ignoring the textual content. The main idea of HITS is this: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities. A fuzzy extension of this algorithm is as follows. Note that it uses fuzzy PageRanking to identify hub and authority pages.

Input:
    W    // WWW viewed as a directed graph
    q    // Query
    s    // Support
Output:
    A    // Set of authority pages
    H    // Set of hub pages
HITS Algorithm:
    R = SE(W, q);    // Collect the highest ranked pages for the query q from a text-based search engine
    B = R ∪ {pages which are linked to from R} ∪ {pages which link to pages in R};
    G(B, L) = Subgraph of W induced by B;
    G(B, L1) = Delete links in G within the same site;
    x_p = Sup y_q for all q, where <q, p> ∈ L1;    // Find authority weights.
    y_p = Sup x_q for all q, where <p, q> ∈ L1;    // Find hub weights.
    A = {p | p has one of the highest x_p};
    H = {p | p has one of the highest y_p};

6. Fuzzy Web Usage Mining

Traditional web usage mining is the application of data mining techniques to discover usage patterns from web data, in order to understand and better serve the needs of web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis [13]. The usage data are collected from different sources (e.g., client/server logs) and represent the access patterns of users.

In the fuzzy extension of this technique, we use fuzzy clustering to cluster web access logs and then use these clusters to discover and match usage patterns.
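Fuzzy clustering of this kind rests on membership updates such as those in equations (1) and (2) of Section 4.2. As a minimal sketch of that update loop, the code below runs plain fuzzy c-means on a few one-dimensional points. It is not the relational clustering applied to session data in this section; the sample points, the initial centers and the function names are hypothetical, the Euclidean norm is assumed, and m must be greater than 1.

```python
import math


def memberships(points, centers, m):
    """Membership update of equation (2): u_ij for each point i, cluster j."""
    power = 2.0 / (m - 1.0)
    u = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if 0.0 in d:
            # The point lies exactly on a center: crisp membership there.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            u.append([h / sum(hits) for h in hits])
        else:
            u.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return u


def update_centers(points, u, m):
    """Center update of equation (2): membership-weighted mean of the points."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wi * x[k] for wi, x in zip(w, points)) / total
            for k in range(dim)))
    return centers


def fuzzy_c_means(points, centers, m=2.0, eps=1e-5, max_iter=100):
    """Alternate both updates until the largest membership change is below eps."""
    u = memberships(points, centers, m)
    for _ in range(max_iter):
        centers = update_centers(points, u, m)
        new_u = memberships(points, centers, m)
        change = max(abs(a - b)
                     for old, new in zip(u, new_u)
                     for a, b in zip(old, new))
        u = new_u
        if change < eps:
            break
    return centers, u


# Hypothetical data: two well-separated groups on a line.
points = [(0.8,), (1.0,), (1.2,), (4.8,), (5.0,), (5.2,)]
centers, u = fuzzy_c_means(points, centers=[(0.0,), (6.0,)])
```

Each row of `u` holds the memberships of one point and sums to 1. On this data the two centers settle close to 1.0 and 5.0, and every point keeps a small but nonzero membership in the other cluster, which is the overlap that a crisp partition cannot express.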
Categories in most data mining tasks are rarely well separated, and hence the class partition is best described by fuzzy memberships. A well-known technique is the fuzzy Competitive Agglomeration (CA) algorithm, which can automatically cluster data into the optimal number of components. However, CA deals with object or feature data only, whereas session similarity data is relational. Moreover, the session dissimilarity measure is not Euclidean. Therefore, we use an extension of CA that works on non-Euclidean relational data: the Competitive Agglomeration for Relational Data (CARD) algorithm, which can deal with complex and subjective distance/similarity measures that are not restricted to be Euclidean. CARD is described in more detail in [14]. We use CARD for the automatic discovery of user session profiles in web log data. CARD uses the notion of a “user session” as a temporally compact sequence of web accesses by a user. It analyzes session profiles to capture both the individual URLs in a profile and the structure of the site. The resulting clusters are evaluated subjectively, as well as by standard statistical criteria.

7. Experimental Results

To assess the results of fuzzifying web mining methods, we applied two traditional sample search methods, as well as their fuzzified counterpart, to a sample data set representing the web at a small scale. The sample data set has about 500 HTML pages, each of which contains several hyperlinks to other pages of the same data set.

The modeled search engine takes two keywords, searches the data set for them, and returns the 20 pages most related to the keywords, in order of priority, which is determined by a page rank associated with each page as described below.

In the first method, the engine determines the relevance of a page to the keywords from the percentage of the page's words that are occurrences of the first keyword. The higher this percentage, the more related the page is considered. If two pages have equal percentages for the first keyword, the percentage of the second keyword determines the relevance in the same way. This method therefore gives priority to the first keyword when searching. In the second method, related pages are retrieved according to the product of the repetition ratios of the two keywords (occurrences divided by the total number of words in the page), so both keywords carry equal weight.

The fuzzy method, on the other hand, evaluates a page by the minimum of the two keyword percentages in that page and compares this value with that of other pages. As the logic of the problem implies, the best answers are the pages that are most related to both the first word and the second word. So the fuzzy method not only gives equal value to both keywords but also tends to extract the pages that contain both words to some extent, and thus leads to more logical results.

Page ranks in the traditional methods follow (7), while page ranks in the fuzzy method are evaluated with (8). The fuzzy method therefore eliminates the effects of unimportant pages by substituting the 'MAX' operator for the '+' operator when computing page ranks. The results of these methods are summarized in the tables below.
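The three relevance criteria described above can be compared with a small sketch. It is only an illustration: the page identifiers, percentages and page ranks are made-up sample values rather than the experimental data set, the function names are hypothetical, and k is set to 2 instead of 20 to keep the example short.

```python
def top_pages(pages, page_rank, relevance, k):
    """Select the k most relevant pages, then order them by page rank."""
    chosen = sorted(pages, key=lambda p: relevance(*pages[p]), reverse=True)[:k]
    return sorted(chosen, key=lambda p: page_rank[p], reverse=True)


def first_keyword_priority(pct1, pct2):
    """Traditional method 1: first keyword decides, second breaks ties."""
    return (pct1, pct2)


def product(pct1, pct2):
    """Traditional method 2: product of the two repetition percentages."""
    return pct1 * pct2


def fuzzy_min(pct1, pct2):
    """Fuzzy method: minimum of the two repetition percentages."""
    return min(pct1, pct2)


# Hypothetical pages: id -> (keyword 1 percentage, keyword 2 percentage).
pages = {"p1": (2.08, 10.42), "p2": (8.80, 2.40),
         "p3": (6.25, 0.00), "p4": (1.07, 2.86)}
# Hypothetical page ranks used only for the final ordering.
page_rank = {"p1": 21.58, "p2": 4.28, "p3": 1.14, "p4": 0.58}

method1 = top_pages(pages, page_rank, first_keyword_priority, k=2)
method2 = top_pages(pages, page_rank, product, k=2)
fuzzy = top_pages(pages, page_rank, fuzzy_min, k=2)
```

Here `method1` selects "p2" and "p3", even though "p3" does not contain the second keyword at all, whereas `method2` and `fuzzy` both select "p1" and "p2", the two pages that contain both keywords.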
Keyword1 Repetition   Keyword2 Repetition   Page Rank
Percentage            Percentage
2.0833335             10.416666             21.584131
1.980198              6.930693              5.3121285
8.8                   2.4                   4.2817764
1.5837106             0.45248872            1.8476298
1.8987341             0.0                   1.4291381
6.25                  0.0                   1.1428572
1.5748031             3.1496062             1.0430099
4.4871798             0.64102566            0.6896016
1.4184396             0.0                   0.3845039
2.0833335             10.416666             0.38095245
2.0833335             10.416666             0.38095245
2.0833335             10.416666             0.38095245
1.923077              0.0                   0.34237936
8.227848              4.43038               0.31307533
1.5384616             1.5384616             0.31151435
1.4084507             0.0                   0.26178026
3.508772              3.508772              0.25396827
2.857143              0.0                   0.25
2.4193547             1.6129031             0.02554945
1.5837106             0.45248872            0.017857144

Table 1: Results of Traditional Method 1

Keyword1 Repetition   Keyword2 Repetition   Page Rank
Percentage            Percentage
2.0833335             10.416666             21.584131
1.980198              6.930693              5.3121285
8.8                   2.4                   4.2817764
1.2658228             3.164557              2.440153
1.5748031             3.1496062             1.0430099
0.75757575            7.575758              1.0137123
1.2820513             3.846154              0.9156288
1.0714287             2.857143              0.5769231
2.0833335             10.416666             0.38095245
2.0833335             10.416666             0.38095245
2.0833335             10.416666             0.38095245
8.227848              4.43038               0.31307533
0.952381              8.571429              0.2809524
3.508772              3.508772              0.25396827
1.0714287             2.857143              0.24358976
1.0714287             2.857143              0.17777778
1.0714287             2.857143              0.16666667
1.0714287             2.857143              0.14285715
1.0714287             2.857143              0.083333336
2.4193547             1.6129031             0.02554945

Table 2: Results of Traditional Method 2

Keyword1 Repetition   Keyword2 Repetition   Page Rank
Percentage            Percentage
2.0833335             10.416666             0.5
8.8                   2.4                   0.33333334
1.980198              6.930693              0.33333334
1.0714287             2.857143              0.25
1.1363636             2.2727273             0.25
1.1811024             1.1811024             0.25
1.2345679             1.2345679             0.2
1.2658228             3.164557              0.2
1.2820513             3.846154              0.2
1.3157895             1.3157895             0.16666667
2.0833335             10.416666             0.16666667
2.0833335             10.416666             0.16666667
2.0833335             10.416666             0.16666667
8.227848              4.43038               0.14285715
1.5748031             3.1496062             0.14285715
1.1363636             2.2727273             0.14285715
1.0752689             1.0752689             0.125
3.508772              3.508772              0.11111111
1.5384616             1.5384616             0.11111111
2.4193547             1.6129031             0.017857144

Table 3: Results of Fuzzy Method

Fig. 2 shows the map of the data set used. As can be seen, there are some pages to which many links from other pages point, so they have high page ranks, which is why the first few pages in the three results are the same.

Fig. 2. Map of Dataset

8. Conclusion and Future Works

Web mining can be regarded as a three-part activity. In this paper we used fuzzy techniques in each of these parts to improve them. In content mining, the main activity is crawling the web to find pages related to specified categories of data. We therefore used fuzzy techniques in crawling and pattern matching, and finally fuzzy indexing was used to categorize the crawled pages. In structure mining, we fuzzified page ranking and used it in the CLEVER method. These algorithms were employed to enhance the performance of the HITS algorithm. In the last part, usage mining, fuzzy pattern matching was used to search and organize web server logs. Experimental results on the sample data set showed that mining the web with these techniques can produce better and more precise search results. Comparing the proposed approach with other indexing, clustering and pattern matching techniques is left as future work.

References
[1] http://searchenginewatch.com, May 2004.
[2] B. Berendt, A. Hotho, and G. Stumme, Towards Semantic Web Mining, Institute of Information Systems, Humboldt University Berlin, Germany, 2003.
[3] O. Etzioni, The World Wide Web: Quagmire or Gold Mine?, Communications of the ACM, Vol. 39, No. 11, pp. 65-68, Nov. 1996.
[4] E. Herrera-Viedma, Modeling the Retrieval Process of an Information Retrieval System Using an Ordinal Fuzzy Linguistic Approach, University of Granada, November 1997.
[5] F. Herrera, E. Herrera-Viedma, Aggregation Operators for Linguistic Weighted Information, University of Granada, October 1995.
[6] E. Herrera-Viedma and E. Peis, Evaluating the Informative Quality of Documents in SGML Format Using Fuzzy Linguistic Techniques Based on Computing with Words, Department of Computer Science and A.I., Library Science Studies School, University of Granada, 2001.
[7] G. Kazai, M. Lalmas and T. Rölleke, A Model for the Representation and Focussed Retrieval of Structured Documents based on Fuzzy Aggregation, Department of Computer Science, Queen Mary, University of London, 2003.
[8] H.L. Larsen, Importance weighted OWA aggregation of multicriteria queries, Proceedings of the North American Fuzzy Information Processing Society conference (NAFIPS'99), New York, 10-12 June 1999, pp. 740-744.
[9] http://www.cs.umbc.edu/~ajoshi/webmine, May 2004.
[10] F. Herrera, E. Herrera-Viedma, On the Linguistic OWA Operator and Extensions, University of Granada, September 1996.
[11] R.R. Yager, On Ordered Weighted Averaging Aggregation Operators in Multicriteria Decision Making, IEEE Transactions on Systems, Man, and Cybernetics 18 (1988) 183-190.
[12] S. Chakrabarti, M. van den Berg and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, Proceedings of the 8th International WWW Conference, pp. 545-562, Toronto, Canada, May 1999.
[13] J. Srivastava, R. Cooley, M. Deshpande and P. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, Department of Computer Science and Engineering, University of Minnesota, 2001.
[14] O. Nasraoui, H. Frigui, A. Joshi and R. Krishnapuram, Mining Web Access Logs Using Relational Competitive Fuzzy Clustering, 2001.
[15] www.dataengine.de, May 2004.
[16] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
