Using Fuzzy Techniques for Different Aspects of Web Mining
CARO LUCAS¹, AMIR HOSSEIN KEYHANIPOUR², TAYYEBE SARHADI²
¹Control and Intelligent Processing Center of Excellence, Electrical and Computer Eng. Department,
University of Tehran, Tehran, Iran and School of Intelligent Systems, Institute for studies in theoretical
Physics and Mathematics, Tehran, Iran
²Control and Intelligent Processing Center of Excellence, Electrical and Computer Eng. Department,
University of Tehran, Tehran, Iran
Abstract: Fuzzy ideas and techniques are widely used in real-world applications. One challenging
area of application is Web mining, because of the Web's size and growth. Several attempts have
been made to overcome these problems, and the most promising one is to improve traditional web
mining techniques with fuzzy ideas. In this research, the basics of a fuzzy search engine are
introduced using some fuzzy techniques. In content mining, fuzzy techniques such as Linguistic
Indexing and Fuzzy Pattern Matching are used. Some basic fuzzy ideas are employed to improve
the performance of conventional structure mining, and finally, fuzzy clustering techniques are
used in web usage mining. A comparison of the search results with those of traditional web
mining techniques shows the efficiency of the proposed algorithms.
Key-words: Web Mining, Search engine, Fuzzy, Content Mining, Structure Mining, Usage
Mining, Linguistic Indexing, Pattern Matching.
1. Introduction
The evolution of the World Wide Web has brought us enormous and ever-growing amounts of data and information. According to one statistic [1], Google had indexed about 4.3 billion textual documents by May 2004. With the abundant data it provides, the web has become an important resource for research. However, traditional data extraction and mining techniques cannot be applied directly to the web due to its semi-structured or even unstructured nature.
Traditional web mining is concerned with “the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services” [3]. Three areas of Web mining are commonly distinguished: content mining, structure mining, and usage mining (see Fig. 1) [2].
Web content mining is a form of text
mining. For web documents, the mining
methods are mainly focused on information
extraction and integration (i.e., gathering
explicit information from different web sites
for its access).
Web structure mining usually operates on
the hyperlink structure of Web pages. The
primary Web resource that is being mined is
a set of pages, ranging from a single Web
site to the Web as a whole. Web structure
mining exploits the additional information
that is (often implicitly) contained in the
structure of hypertext.
In web usage mining, the primary web
resource that is being mined is a record of
the requests made by visitors to a web site,
most often collected in a web server log.
This data suggests a pattern of the user's behavior, which can be used to personalize
and improve the search results according to the user's characteristics.
Numerous projects have also been defined and carried out in this field. We introduce only one interesting project, named “Web Personalization and Mining Using Robust Fuzzy Clustering Methods” [9]. Some objectives of this project are: developing new robust fuzzy relational clustering algorithms that are suitable for large applications, as well as suitable similarity metrics between such feature vectors, and developing new similarity metrics to handle non-numeric features when a numeric approach is infeasible.
In this paper we introduce the use of some fuzzy techniques in all areas of traditional web mining to enhance this process. We first review some previous work and then introduce our web mining approaches in the three following sections. Finally, experimental results are discussed.
2. Related Works
Several techniques for fuzzy data mining have been proposed, such as using a fuzzy linguistic approach for information retrieval [4,5,6] and using fuzzy aggregation methods for information retrieval [7,8]. Some data mining tools that use these techniques have also been proposed; a well-known one is DataEngine, produced by “Intelligenter Technologien GmbH”. It is a data retrieval tool which uses decision trees, fuzzy rules, K-means, neural networks (MLP, Kohonen) and (linear) regression. Another group of existing tools is fuzzy search engines. The most famous ones are Excite and SearchNZ. The latter is a fault-tolerant ("fuzzy") [15] search engine, restricted to New Zealand cyberspace. Here "fuzzy" means that it allows you to search for words even if you are not sure about the correct spelling.
3. Fuzzy Web Mining
Web mining, when looked upon in data mining terms, can be said to have three operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed). As in most real-world problems, the clusters and associations in Web mining do not have crisp boundaries and often overlap considerably. In addition, bad exemplars (outliers) and incomplete data can easily occur in the data set, due to a wide variety of reasons inherent to web browsing and logging. Another important aspect is the limited query interface based on keyword-oriented search. Fuzzification of these operations is an obvious way to address these issues.
In the following sections we introduce
fuzzy extensions of web content, structure
and usage mining.
4. Fuzzy Web Content Mining
Fuzzy Web Content Mining is the use of fuzzy ideas in traditional web content mining. Traditional web content mining has two phases: web page content mining and search result mining.
The core of the former phase is resource identification, which is the process of retrieving the intended web documents. It is done by web search and metasearch engines, or by crawlers. We will focus on crawlers and fuzzify them.
The latter phase includes clustering search results and categorizing documents using phrases in titles and snippets. Its critical operations are:
Preprocessing consists of two tasks: selecting interesting data from the downloaded web documents, and transforming this data into a formal representation. Most methods use wrappers for extracting simple data (e.g. proper names, prices, phone numbers, e-mail addresses, etc.) from web documents, and construct tables as formal representations. Fuzzy aggregation (e.g. OWA operators) can be used for linguistic indexing of documents.
Generalization is the automatic discovery of patterns across multiple web documents. Most methods use data mining techniques for discovering association rules, clusters, classification trees and classification rules. Fuzzy clustering and classification are fuzzy extensions of this process.
4.1. Fuzzy Crawlers
Crawlers are robots (spiders) that traverse the hypertext structure of the Web and collect information from visited pages. They are used to construct indexes for search engines. There are several types of crawlers: traditional crawlers, periodic crawlers, incremental crawlers and focused crawlers. Here we describe a fuzzy focused crawler.
A focused crawler [12] retrieves documents relevant to a predefined topic, trying to avoid irrelevant areas of the Web. It uses a classifier to relate documents to topics. The classifier also determines how useful outgoing links are. Using fuzzy classifiers can improve the crawling mechanism.
4.2. Fuzzy Pattern Matching
The aim of pattern matching in this part is to find the similarity between the query and document clusters. A fuzzy approach, as shown, will improve the quality of search results.
Fundamentally, the intimate relationship that exists between fuzzy set theory and pattern recognition comes from the fact that the vast majority of real-world classes are fuzzy in nature. Of all the fuzzy techniques that have been developed, the fuzzy c-means algorithm is probably the most widely used.
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters [16]. This method is frequently used in pattern recognition. It is based on minimization of the following objective function:
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2}, \qquad 1 \le m < \infty \tag{1}$$
where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in the cluster $j$, $x_i$ is the $i$th item of the d-dimensional measured data, $c_j$ is the d-dimensional center of the cluster, and $\lVert * \rVert$ is any norm expressing the similarity between any measured data and the center.
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the update of the memberships $u_{ij}$ and the cluster centers $c_j$ by:
$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \tag{2}$$
This iteration will stop when $\max_{ij} \lvert u_{ij}^{(k+1)} - u_{ij}^{(k)} \rvert < \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ is the iteration step. This procedure converges to a local minimum or a saddle point of $J_m$.
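The FCM iteration of (1) and (2) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the norm is taken to be Euclidean, the initial centers are simply picked from the data (the text does not specify an initialization), and a point that coincides with a center is given full membership in it to avoid division by zero.

```python
import math


def _memberships(point, centers, m):
    """Membership update of eq. (2) for one point (Euclidean norm assumed)."""
    dists = [math.dist(point, c) for c in centers]
    zero = [d == 0.0 for d in dists]
    if any(zero):
        # Assumption: a point lying exactly on a center belongs fully to it.
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


def _centers(points, u, m, n_clusters):
    """Center update of eq. (2): mean of the points weighted by u_ij^m."""
    dim = len(points[0])
    centers = []
    for j in range(n_clusters):
        w = [row[j] ** m for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wi * p[d] for wi, p in zip(w, points)) / total
            for d in range(dim)))
    return centers


def fuzzy_c_means(points, n_clusters, m=2.0, eps=1e-4, max_iter=100):
    """Return (memberships, centers); assumes the initial centers are distinct."""
    n = len(points)
    # Hypothetical initialization: centers spread evenly over the input order.
    step = (n - 1) // max(n_clusters - 1, 1)
    centers = [tuple(points[j * step]) for j in range(n_clusters)]
    u = None
    for _ in range(max_iter):
        new_u = [_memberships(p, centers, m) for p in points]
        centers = _centers(points, new_u, m, n_clusters)
        # Stop when the largest membership change falls below eps.
        done = u is not None and max(
            abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb)) < eps
        u = new_u
        if done:
            break
    return u, centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    memberships, centers = fuzzy_c_means(data, 2)
    for point, row in zip(data, memberships):
        print(point, [round(v, 3) for v in row])
```

Each row of the returned membership matrix sums to 1, so a document (or query) can be compared with every cluster at once instead of being forced into a single one.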
4.3. Fuzzy Indexing
Search results are documents containing a combination of the search keywords. To identify the importance of these results, they must be indexed according to a policy. The traditional approach is to index the results with crisp methods. A fuzzy approach is LOWA (Linguistic Indexing using OWA aggregation). The LOWA operator [10] is based on the OWA operator defined by Yager [11].
Definition 1. Let $A = \{a_1, \dots, a_m\}$ be a set of labels to be aggregated, then the LOWA operator, $\phi$, is defined as:
$$\phi(a_1, \dots, a_m) = W \cdot B^{T} = C^{m}\{w_k, b_k, k = 1, \dots, m\} = w_1 \odot b_1 \oplus (1 - w_1) \odot C^{m-1}\{\beta_h, b_h, h = 2, \dots, m\} \tag{3}$$
where $W = [w_1, \dots, w_m]$ is a weighting vector, such that (i) $w_i \in [0, 1]$ and (ii) $\sum_i w_i = 1$, $\beta_h = w_h / \sum_{k=2}^{m} w_k$, $h = 2, \dots, m$, and $B = \{b_1, \dots, b_m\}$ is a vector associated to $A$, such that
$$B = \sigma(A) = \{a_{\sigma(1)}, \dots, a_{\sigma(m)}\} \tag{4}$$
in which $a_{\sigma(j)} \le a_{\sigma(i)} \; \forall \, i \le j$, with $\sigma$ being a permutation over the set of labels $A$. $C^{m}$ is the convex combination operator of $m$ labels, and if $m = 2$, then it is defined as
$$C^{2}\{w_i, b_i, i = 1, 2\} = w_1 \odot s_j \oplus (1 - w_1) \odot s_i = s_k, \qquad s_j, s_i \in S, \; (j \ge i)$$
such that $k = \min\{T, \, i + \mathrm{round}(w_1 \cdot (j - i))\}$, where “round” is the usual round operation, and $b_1 = s_j$, $b_2 = s_i$.
If $w_j = 1$ and $w_i = 0$ with $i \ne j \; \forall i$, then the convex combination is defined as:
$$C^{m}\{w_i, b_i, i = 1, \dots, m\} = b_j.$$
The first step is the assignment of linguistic quantifiers to keywords, which is described below. Then the retrieved documents are indexed using the LOWA operator. Examples of using LOWA are described in [11].
A possible solution for the calculation of the weighting vector of the LOWA operator is proposed by Yager, which in the case of a non-decreasing proportional fuzzy linguistic quantifier, $Q$, is given by this expression:
$$w_i = Q\left(\frac{i}{m}\right) - Q\left(\frac{i-1}{m}\right), \qquad i = 1, \dots, m \tag{5}$$
with $Q(r)$ being the membership function of $Q$, as follows:
$$Q(r) = \begin{cases} 0 & \text{if } r < a \\ \dfrac{r - a}{b - a} & \text{if } a \le r \le b \\ 1 & \text{if } r > b \end{cases} \tag{6}$$
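A small sketch may help to see how the weights of (5) and (6) feed the recursive convex combination of the LOWA operator. It is only an illustration under assumptions that are not taken from the text: labels are represented by their integer indices in S = {s_0, ..., s_T}, the quantifier parameters a = 0.3 and b = 0.8 are illustrative values only, "round" is read as round-half-up, and all function names are hypothetical.

```python
import math


def quantifier(r, a, b):
    """Membership function Q(r) of eq. (6); assumes a < b."""
    if r < a:
        return 0.0
    if r > b:
        return 1.0
    return (r - a) / (b - a)


def yager_weights(m, a, b):
    """Weights w_i = Q(i/m) - Q((i-1)/m), i = 1..m, as in eq. (5)."""
    return [quantifier(i / m, a, b) - quantifier((i - 1) / m, a, b)
            for i in range(1, m + 1)]


def _half_up(x):
    # Assumption: the "usual round operation" rounds halves upwards.
    return int(math.floor(x + 0.5))


def convex_combination(weights, labels, top):
    """C^m over label indices sorted in decreasing order (weights sum to 1)."""
    if len(labels) == 1:
        return labels[0]
    w1 = weights[0]
    rest = sum(weights[1:])
    if rest == 0.0:
        # w_1 = 1 and every other weight is 0: the result is b_1.
        return labels[0]
    betas = [w / rest for w in weights[1:]]
    tail = convex_combination(betas, labels[1:], top)
    j, i = labels[0], tail  # b_1 = s_j and the combined rest s_i, with j >= i
    return min(top, i + _half_up(w1 * (j - i)))


def lowa(label_indices, weights, top):
    """LOWA aggregation of label indices taken from S = {s_0, ..., s_top}."""
    ordered = sorted(label_indices, reverse=True)  # the vector B of eq. (4)
    return convex_combination(list(weights), ordered, top)


if __name__ == "__main__":
    w = yager_weights(3, 0.3, 0.8)
    print([round(x, 4) for x in w])
    print(lowa([3, 5, 1], w, top=6))
```

Choosing the weight vector (1, 0, ..., 0) returns the highest label and (0, ..., 0, 1) the lowest, so the quantifier parameters decide how optimistic the aggregated index is.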
5. Fuzzy Web Structure Mining
Traditional Web structure mining usually operates on the hyperlink structure of Web pages to exploit the additional information that is (often implicitly) contained in the structure of hypertext. Therefore, an important application area is the identification of the relative relevance of different pages that appear equally pertinent when analyzed with respect to their content in isolation. Another application area of web structure mining is the identification of the relative importance of web pages, which is used to prioritize pages returned from a search. Two of the main techniques used in this area are PageRanking and the CLEVER method. In the subsequent sections, fuzzy extensions of these techniques are introduced first, and then we present a fuzzy version of the HITS algorithm that uses them.
5.1 Fuzzy PageRanking
PageRanking is used by Google to prioritize pages returned from a search by looking at Web structure. The importance of a page is calculated based on the number of pages which point to it (backlinks). Weighting is used to give more importance to backlinks coming from important pages. Traditional PageRanking is:
$$PR(p) = \frac{N_{1p}}{N_1} + \dots + \frac{N_{np}}{N_n} \tag{7}$$
where $N_{ip}$ is the number of links coming out of page $i$ toward page $p$, and $N_i$ is the number of links coming out of page $i$.
A suggestion for Fuzzy PageRanking is:
$$PR(p) = \mathrm{Rank}\left(\mathrm{Sup}\left\{\frac{N_{1p}}{N_1}, \dots, \frac{N_{np}}{N_n}\right\}\right) \tag{8}$$
where $N_{ip}$ is the number of links coming out of page $i$ toward page $p$, $N_i$ is the number of links coming out of page $i$, Sup is a fuzzy union operator, and
$$\mathrm{Rank}(x) = \begin{cases} \text{Very Important} & x \ge 0.9 \\ \text{Important} & 0.7 \le x < 0.9 \\ \text{Related} & 0.5 \le x < 0.7 \\ \text{Good} & 0.3 \le x < 0.5 \\ \text{Irrelevant} & x < 0.3 \end{cases}$$
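The difference between (7) and (8) can be seen on a toy link graph. The following sketch is illustrative only: the dictionary-of-lists graph format and the function names are assumptions, Sup is taken to be the maximum, and the interval boundaries of Rank (which side is inclusive) are one possible reading of the thresholds above.

```python
def link_ratios(links, p):
    """Ratios N_ip / N_i for every page i with at least one link to p."""
    return [targets.count(p) / len(targets)
            for targets in links.values() if p in targets]


def traditional_rank(links, p):
    """Eq. (7): sum of the link ratios of all pages pointing to p."""
    return sum(link_ratios(links, p))


def rank_label(x):
    """Linguistic label of eq. (8); boundary handling is an assumption."""
    if x >= 0.9:
        return "Very Important"
    if x >= 0.7:
        return "Important"
    if x >= 0.5:
        return "Related"
    if x >= 0.3:
        return "Good"
    return "Irrelevant"


def fuzzy_rank(links, p):
    """Eq. (8): supremum (max) of the link ratios, with its label."""
    sup = max(link_ratios(links, p), default=0.0)
    return sup, rank_label(sup)


if __name__ == "__main__":
    # Hypothetical graph: page -> list of outgoing link targets.
    links = {
        "a": ["c"],
        "b": ["c", "d"],
        "c": ["a"],
        "d": ["c", "a", "b", "c"],
    }
    for page in links:
        print(page, traditional_rank(links, page), fuzzy_rank(links, page))
```

With the sum, many weak backlinks can add up to a high rank; with the maximum, only the strongest single backlink counts, so the fuzzy rank always stays within [0, 1].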
5.2 Fuzzy CLEVER Method
Web structure mining analyzes hyperlink topology by discovering authoritative information sources. This information is found in authority pages, which are defined in relation to hubs as their counterparts: hubs are pages that link to many related authorities. Authorities are highly important pages which are the best source for requested information.
According to the above criteria, the CLEVER method ranks pages primarily by measuring links between them. CLEVER works as follows:
1. Starts by collecting a set of pages
2. Gathers all pages of initial link, plus any pages linking to them
3. Ranks result by counting links
4. Links have noise, not clear which pages are best
5. Recalculate scores
6. Pages with most links are established as most important, links transmit more weight
7. Repeat calculation number of times till scores are refined
Using the fuzzy PageRanking technique can be considered as a fuzzy extension of the CLEVER method.
5.3 Fuzzy HITS Algorithm
The HITS (Hyperlink Induced Topic Search) algorithm, introduced by Kleinberg in 1998, is a topic-focused search method. This method views the Web as a graph and uses a purely link structure-based computation, ignoring the textual content. The main idea of HITS is this: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities.
A fuzzy extension of this algorithm is as follows. Note that it uses fuzzy PageRanking to identify hub and authority pages.
Input:
W // WWW viewed as directed graph
q // Query
s // Support
Output:
A // Set of authority pages
H // Set of hub pages
HITS Algorithm:
R = SE(W, q); // Collect highest ranked pages for the query q from a text-based search engine
B = R ∪ {pages which are linked to from R} ∪ {pages which link to pages in R};
G(B, L) = Subgraph of W induced by B;
G(B, L1) = Delete links in G within the same site;
x_p = Sup{y_q} for all q, where <q, p> ∈ L1; // Find authority weights.
y_p = Sup{x_q} for all q, where <p, q> ∈ L1; // Find hub weights.
A = {p | p has one of the highest x_p};
H = {p | p has one of the highest y_p};
6 Fuzzy Web Usage Mining
Traditional Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis [13]. The usage data are collected at different sources (e.g. client/server logs), and they represent the access patterns of users.
In the fuzzy extension of this technique, we use fuzzy clustering to cluster Web access logs and then use these clusters for discovering and matching usage patterns.
Categories in most data mining tasks are rarely well separated, and hence the class partition is best described by fuzzy memberships. A well-known technique is the fuzzy Competitive Agglomeration (CA) algorithm, which can automatically cluster data into the optimal number of components. However, CA deals with object or feature data only, whereas session similarity data is relational. Moreover, the session dissimilarity measure we define is not Euclidean. Therefore, we use an extension of CA that works on non-Euclidean relational data: the Competitive Agglomeration for Relational Data (CARD) algorithm, which can deal with complex and subjective distance/similarity measures that are not restricted to be Euclidean. CARD is described in more detail in [14].
We use CARD for automatic discovery of user session profiles in Web log data. CARD uses the notion of a “user session” as a temporally compact sequence of web accesses by a user. CARD analyzes session profiles to capture both the individual URLs in a profile and the structure of the site. The resulting clusters are evaluated subjectively, as well as based on standard statistical criteria.
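The notion of a user session as a temporally compact sequence of accesses can be illustrated with a short preprocessing sketch. It is not part of CARD itself: the record format (user, timestamp in seconds, URL), the 30-minute gap threshold and the function name are assumptions made only for this example.

```python
from collections import defaultdict


def split_sessions(accesses, max_gap=1800):
    """Group (user, timestamp_seconds, url) records into per-user sessions.

    A new session starts whenever two consecutive accesses of the same user
    are more than max_gap seconds apart (hypothetical threshold).
    """
    by_user = defaultdict(list)
    for user, ts, url in accesses:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's accesses by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > max_gap:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


if __name__ == "__main__":
    log = [
        ("u1", 0, "/index.html"),
        ("u1", 60, "/papers.html"),
        ("u1", 4000, "/contact.html"),
        ("u2", 10, "/index.html"),
    ]
    for user, urls in split_sessions(log):
        print(user, urls)
```

The resulting sessions are the objects between which a relational (dis)similarity would be computed before clustering.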
7 Experimental Results
To examine the effect of fuzzifying web mining methods, we applied two traditional sample search methods, as well as their fuzzified counterpart, to a sample data set representing the web on a small scale. The sample data set has about 500 HTML pages, each of which contains several hyperlinks to other pages of the same data set.
The modeled search engine takes two keywords, searches the data set with them, and returns the 20 pages most related to the keywords, in order of their priority, which is determined by a page rank associated with each page as described below.
In the first traditional method, the engine determines the relevance of a page to the keywords from the percentage of the page's words that are occurrences of the first keyword. The higher this percentage, the more related the page is considered. If two pages have equal percentages for the first keyword, the percentage of the second keyword determines the relevance in the same way. This method therefore gives priority to the first keyword. In the second traditional method, related pages are retrieved according to the product of the first keyword's repetition ratio (its occurrences over the total number of words in the page) and that of the second keyword, so both keywords carry equal weight.
The fuzzy method, on the other hand, evaluates a page by the minimum of the percentages of the two keywords in that page and compares it with that of other pages. As the logic of the problem implies, the best answers are the pages that are most related to both the first and the second keyword. So the fuzzy method not only gives equal value to both keywords, it also tends to extract pages that contain both words to some extent, which leads to more logical results.
The rank of each page in the traditional methods follows (7), while the page ranks in the fuzzy method are evaluated with (8). The fuzzy method thus eliminates the effect of unimportant pages by substituting the 'MAX' operator for the '+' operator when computing page ranks. The results of these methods are summarized in the tables below.
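The three relevance criteria compared in this section (first keyword with the second as tie-breaker, product of the two percentages, and their minimum) can be contrasted on a toy collection. The sketch below covers only the relevance step, not the subsequent ordering by page rank; the page contents, keywords and function names are invented for illustration.

```python
def keyword_percentages(words, k1, k2):
    """Percentage of the page's words equal to each keyword."""
    total = len(words)
    if total == 0:
        return 0.0, 0.0
    return 100.0 * words.count(k1) / total, 100.0 * words.count(k2) / total


def method1_key(p1, p2):
    """Traditional method 1: first keyword first, second breaks ties."""
    return (p1, p2)


def method2_key(p1, p2):
    """Traditional method 2: product of the two percentages."""
    return p1 * p2


def fuzzy_key(p1, p2):
    """Fuzzy method: minimum of the two percentages (fuzzy AND)."""
    return min(p1, p2)


def top_pages(pages, k1, k2, key, limit=20):
    """Names of the most relevant pages under the given criterion."""
    scored = [(key(*keyword_percentages(words, k1, k2)), name)
              for name, words in pages.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:limit]]


if __name__ == "__main__":
    # Hypothetical three-page collection.
    pages = {
        "p1": "fuzzy fuzzy fuzzy web".split(),
        "p2": "fuzzy mining web web".split(),
        "p3": "mining mining fuzzy web web".split(),
    }
    for name, key in [("method 1", method1_key),
                      ("method 2", method2_key),
                      ("fuzzy", fuzzy_key)]:
        print(name, top_pages(pages, "fuzzy", "mining", key))
```

On this toy input the page that repeats only the first keyword is ranked first by method 1 but last by the other two criteria, which is the behavior described above.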
Keyword1 Repetition (%)   Keyword2 Repetition (%)   Page Rank
2.0833335                 10.416666                 21.584131
1.980198                  6.930693                  5.3121285
8.8                       2.4                       4.2817764
1.5837106                 0.45248872                1.8476298
1.8987341                 0.0                       1.4291381
6.25                      0.0                       1.1428572
1.5748031                 3.1496062                 1.0430099
4.4871798                 0.64102566                0.6896016
1.4184396                 0.0                       0.3845039
2.0833335                 10.416666                 0.38095245
2.0833335                 10.416666                 0.38095245
2.0833335                 10.416666                 0.38095245
1.923077                  0.0                       0.34237936
8.227848                  4.43038                   0.31307533
1.5384616                 1.5384616                 0.31151435
1.4084507                 0.0                       0.26178026
3.508772                  3.508772                  0.25396827
2.857143                  0.0                       0.25
2.4193547                 1.6129031                 0.02554945
1.5837106                 0.45248872                0.017857144
Table 1: Results of Traditional Method 1
Keyword1 Repetition (%)   Keyword2 Repetition (%)   Page Rank
2.0833335                 10.416666                 21.584131
1.980198                  6.930693                  5.3121285
8.8                       2.4                       4.2817764
1.2658228                 3.164557                  2.440153
1.5748031                 3.1496062                 1.0430099
0.75757575                7.575758                  1.0137123
1.2820513                 3.846154                  0.9156288
1.0714287                 2.857143                  0.5769231
2.0833335                 10.416666                 0.38095245
2.0833335                 10.416666                 0.38095245
2.0833335                 10.416666                 0.38095245
8.227848                  4.43038                   0.31307533
0.952381                  8.571429                  0.2809524
3.508772                  3.508772                  0.25396827
1.0714287                 2.857143                  0.24358976
1.0714287                 2.857143                  0.17777778
1.0714287                 2.857143                  0.16666667
1.0714287                 2.857143                  0.14285715
1.0714287                 2.857143                  0.083333336
2.4193547                 1.6129031                 0.02554945
Table 2: Results of Traditional Method 2
Keyword1 Repetition (%)   Keyword2 Repetition (%)   Page Rank
2.0833335                 10.416666                 0.5
8.8                       2.4                       0.33333334
1.980198                  6.930693                  0.33333334
1.0714287                 2.857143                  0.25
1.1363636                 2.2727273                 0.25
1.1811024                 1.1811024                 0.25
1.2345679                 1.2345679                 0.2
1.2658228                 3.164557                  0.2
1.2820513                 3.846154                  0.2
1.3157895                 1.3157895                 0.16666667
2.0833335                 10.416666                 0.16666667
2.0833335                 10.416666                 0.16666667
2.0833335                 10.416666                 0.16666667
8.227848                  4.43038                   0.14285715
1.5748031                 3.1496062                 0.14285715
1.1363636                 2.2727273                 0.14285715
1.0752689                 1.0752689                 0.125
3.508772                  3.508772                  0.11111111
1.5384616                 1.5384616                 0.11111111
2.4193547                 1.6129031                 0.017857144
Table 3: Results of Fuzzy Method
Fig. 2 shows the map of the data set used. As can be seen, there are some pages to which many links from other pages point, so they have high page ranks; that is why the first few pages in the three results are the same.
Fig.2. Map of Dataset
8 Conclusion and Future Works
Web mining can be regarded as a three-part activity. In this paper we used fuzzy techniques in each of these parts to improve them. In content mining, the main activity is crawling the web to find pages related to some specified categories of data. So we used fuzzy techniques in crawling and pattern matching, and finally fuzzy indexing was used to categorize crawled pages. In structure mining, we fuzzified page ranking and used it in the CLEVER method. These algorithms were employed to enhance the performance of the HITS algorithm. In the last part, usage mining, fuzzy pattern matching was used to search and organize web server logs. Experimental results showed that mining the web with these techniques can produce better and more precise search results.
Comparing the proposed approach with other indexing, clustering and pattern matching techniques can be regarded as future work.
References
[1] http://searchenginewatch.com, May
2004.
[2] B. Berendt, A. Hotho, and G. Stumme,
Towards Semantic Web Mining, Institute of
Information Systems, Humboldt University
Berlin, Germany, 2003.
[3] O. Etzioni, The World Wide Web: Quagmire or Gold Mine?, Communications
of the ACM, Vol. 39, No. 11, pp. 65-68, Nov. 1996.
[4] E. Herrera-Viedma, Modeling the
Retrieval Process of an Information
Retrieval System Using an Ordinal Fuzzy
Linguistic Approach, University of Granada,
November 1997.
[5] F. Herrera and E. Herrera-Viedma, Aggregation Operators for Linguistic
Weighted Information, University of Granada, October 1995.
[6] E. Herrera-Viedma and E. Peis,
Evaluating the Informative Quality of
Documents in SGML Format Using Fuzzy
Linguistic Techniques Based on Computing
with Words, Department of Computer
Science and A.I., Library Science Studies
School, University of Granada, 2001.
[7] G. Kazai, M. Lalmas and T. Rölleke, A
Model for the Representation and Focussed
Retrieval of Structured Documents based on
Fuzzy Aggregation, Department of
Computer Science Queen Mary, University
of London, 2003.
[8] H.L. Larsen, Importance weighted OWA
aggregation of multicriteria queries,
Proceedings of the North American Fuzzy
Information Processing Society conference,
New York, 10–12 June 1999 (NAFIPS'99).
Pp. 740-744.
[9] http://www.cs.umbc.edu/~ajoshi/webmine, May 2004.
[10] F. Herrera, E. Herrera-Viedma, On the
Linguistic OWA Operator and Extensions,
September 1996, University of Granada.
[11] R.R. Yager, On Ordered Weighted
Averaging Aggregation Operators in
Multicriteria Decision Making, IEEE
Transactions on Systems, Man, and
Cybernetics 18 (1988) 183-190.
[12] S. Chakrabarti, M. van den Berg and B.
Dom, Focused Crawling: A New Approach
to Topic-Specific Web Resource Discovery,
Proceedings of the 8th International WWW
Conference, pp. 545-562, Toronto, Canada,
May 1999.
[13] J. Srivastava, R. Cooley, M. Deshpande
and P. Tan, Web Usage Mining: Discovery
and Applications of Usage Patterns from
Web Data, 2001, Department of Computer
Science and Engineering, University of
Minnesota.
[14] O. Nasraoui, H. Frigui, A. Joshi and R.
Krishnapuram, Mining Web Access Logs
Using Relational Competitive Fuzzy
Clustering, 2001.
[15] www.dataengine.de, May 2004.
[16] J. C. Bezdek: Pattern Recognition with
Fuzzy Objective Function Algorithms,
Plenum Press, 1981, New York.