Proceedings of the International Conference , “Computational Systems and Communication
Technology” Jan.,9,2009 - by Lord Venkateshwaraa Engineering College, Kanchipuram Dt.PIN-631
605,INDIA
A SCALABLE WEB USAGE MINING FRAMEWORK
FOR EVOLVING PATTERNS IN DYNAMIC WEBSITES
L. Paul Jasmine Rani 1, T. Kalai Chelvi 2
1:M.E II year (cse), 2: Assistant professor, Department of Computer science and Engineering.
S.A Engineering College, Poonamallee, Chennai-77.
Email: pauljasminerani@yahoo.co.in
ABSTRACT
This paper presents a complete
framework and findings in mining Web usage
patterns from Web log files of a real Web site
that has all the challenging aspects of real-life
Web usage mining, including evolving user
profiles and external data describing an
ontology of the Web content. Even though the
Web site under study is part of a nonprofit
organization that does not “sell” any
products, it was crucial to understand “who”
the users were, “what” they looked at, and
“how their interests changed with time,” all of
which are important questions in Customer
Relationship Management (CRM). Hence, I
present an approach for discovering and
tracking evolving user profiles. I also
describe how the discovered user profiles can
be enriched with explicit information needs
inferred from search queries extracted from
Web log data. Profiles are also enriched with
other domain-specific information facets that
give a panoramic view of the discovered mass
usage modes. An objective validation strategy
is also used to assess the quality of the mined
profiles, in particular their adaptability in the
face of evolving user behavior.
Index Terms — Mining evolving clickstreams,
user profiles, Web usage mining, user access
patterns.
1 INTRODUCTION
Customer Relationship Management (CRM)
can use data from within and outside an
organization to allow an understanding of its
customers on an individual basis or on a group
basis such as by forming customer profiles. An
improved understanding of the customer’s
habits, needs, and interests can allow the
business to profit by, for instance, “cross
selling” or selling items related to the ones that
the customer wants to purchase. Hence,
reliable knowledge about the customers’
preferences and needs forms the basis for
effective CRM. As businesses move online,
the competition between businesses to keep
the loyalty of their old customers and to attract
new customers is even more important, since a
competitor’s Web site may be only one click
away. The large, fast-growing amounts of data
available in these online settings have recently
made it necessary to use automated data
mining or knowledge discovery techniques to
discover Web user profiles. These different
modes of usage or the so-called mass user
profiles can be discovered using Web usage
mining techniques that can automatically
extract frequent access patterns from the
history of previous user click streams stored in
Web log files. These profiles can later be
harnessed toward personalizing the Web site
to the user or to support targeted marketing.
Although there have been considerable
advances in Web usage mining, there have
been no detailed studies presenting a fully
integrated approach to mine a real Web site
with the challenging characteristics of today’s
Web sites, such as evolving profiles, dynamic
content, and the availability of taxonomy or
databases in addition to Web logs. This paper
presents a complete framework and a
summary of mining Web usage patterns with
real world challenges such as evolving access
patterns, dynamic pages, and external data
describing an ontology of the Web content and
how it relates to the business actors (in the
case of the studied Web site, the companies,
contractors, consultants, etc., in corrosion).
The Web site in this study is a portal that
provides access to news, events, resources,
company information (such as companies or
contractors supplying related products and
Copy Right @CSE/IT/ECE/MCA-LVEC-2009
services), and a library of technical and
regulatory documentation related to corrosion
and surface treatment. The portal also offers a
virtual meeting place between companies or
organizations seeking information about other
companies or organizations. The Web site in
my study is managed by a nonprofit
organization that does not sell anything but
only provides free information that is ideally
complete, accurate, and up to date. Hence, it
was crucial to understand the different modes
of usage and to know what kind of information
the visitors seek and read on the Web site and
how this information evolves with time. For
this reason, we perform clustering of the user
sessions extracted from the Web logs to
partition the users into several homogeneous
groups with similar activities and then extract
user profiles from each cluster as a set of
relevant URLs. This procedure is repeated in
subsequent new periods of Web logging (such
as biweekly), then the previously discovered
user profiles are tracked, and their evolution
pattern is categorized. When clustering the
user sessions, we exploit the Web site hierarchy to give
partial weights in the session similarity
between URLs that are distinct and yet located
close together in this hierarchy. The Web site
hierarchy is inferred both from the URL
address and from a Web site database that
organizes most of the dynamic URLs along an
“is-a” ontology of items. We also enrich the
cluster profiles with various facets, including
search queries submitted just before landing
on the Web site, and inquiring and inquired
companies, in case users from (inquiring)
companies inquire about any of the (inquired)
companies listed on the Web site, which
provide related services.
2 WEB USAGE MINING ARCHITECTURE
The architecture divides the Web usage
mining process into two main parts. The first
part includes the domain-dependent processes
of transforming the Web log data into a suitable
transaction form. This includes preprocessing,
transaction identification, and data integration
components. The second part includes the
largely domain-independent application of
generic data mining and pattern matching
techniques (such as the discovery of
association rules and sequential patterns) as
part of the system's data mining engine.
Data cleaning is the first step performed in the
Web usage mining process. Some low-level
data integration tasks may also be performed
at this stage, such as combining multiple logs,
incorporating referrer logs, etc. After the data
cleaning, the log entries must be partitioned
into logical clusters using one or a series of
transaction identification modules. The goal of
transaction identification is to create
meaningful clusters of references for each
user. The task of identifying transactions is
one of either dividing a large transaction into
multiple smaller ones or merging small
transactions into fewer larger ones. The input
and output transaction formats match so that
any number of modules can be combined in any
order, as the data analyst sees fit. Once the
domain-dependent data transformation phase
is completed, the resulting transaction data
must be formatted to conform to the data
model of the appropriate data mining task.
For instance, the format of the data for the
association rule discovery task may be
different from the format necessary for mining
sequential patterns. Finally, a query
mechanism will allow the user to provide more
control over the discovery process by
specifying various constraints.
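The idea of transaction identification modules whose input and output formats match, so that they can be chained in any order, can be illustrated with a minimal Python sketch. The module names `split_on_gap` and `merge_short`, the time threshold, and the sample requests are all hypothetical and are not taken from the system described in this paper.

```python
from typing import Callable, List, Tuple

# A request is (timestamp in seconds, URL); a transaction is a list of requests.
Request = Tuple[float, str]
Transaction = List[Request]
# Every module maps a list of transactions to a list of transactions,
# which is what allows modules to be combined in any order.
Module = Callable[[List[Transaction]], List[Transaction]]


def split_on_gap(max_gap: float) -> Module:
    """Divide a transaction wherever consecutive requests are more than max_gap apart."""
    def run(transactions: List[Transaction]) -> List[Transaction]:
        result: List[Transaction] = []
        for transaction in transactions:
            current: Transaction = []
            for request in transaction:
                if current and request[0] - current[-1][0] > max_gap:
                    result.append(current)
                    current = []
                current.append(request)
            if current:
                result.append(current)
        return result
    return run


def merge_short(min_length: int) -> Module:
    """Merge each transaction shorter than min_length into the one before it."""
    def run(transactions: List[Transaction]) -> List[Transaction]:
        result: List[Transaction] = []
        for transaction in transactions:
            if result and len(transaction) < min_length:
                result[-1] = result[-1] + transaction
            else:
                result.append(list(transaction))
        return result
    return run


def compose(*modules: Module) -> Module:
    """Chain modules; the output of one is the input of the next."""
    def run(transactions: List[Transaction]) -> List[Transaction]:
        for module in modules:
            transactions = module(transactions)
        return transactions
    return run


# Hypothetical requests of a single user.
user_requests = [(0, "/a"), (10, "/b"), (2000, "/c"), (2005, "/d"), (9000, "/e")]
pipeline = compose(split_on_gap(600), merge_short(2))
transactions = pipeline([user_requests])
```

Because both modules share one signature, the analyst could swap their order or add further modules without changing any other code.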
3 AN OVERVIEW OF WEB USAGE
MINING
Recently, data mining techniques have been
applied to extract usage patterns from Web log
data. This process, known as Web usage
mining, is traditionally performed in several
stages to achieve its goals:
1. collection of Web data such as
activities/clickstreams recorded in Web server
logs,
2. preprocessing of Web data such as filtering
crawler requests and requests to graphics, and
identifying unique sessions,
3. analysis of Web data, also known as Web
Usage Mining, to discover interesting usage
patterns or profiles,
4. interpretation/evaluation of the discovered
profiles, and
5. tracking the evolution of the discovered
profiles.
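The preprocessing stage (step 2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the log entry fields (`time`, `ip`, `url`, `agent`), the list of graphics extensions, the crawler markers, and the 30-minute inactivity timeout are hypothetical choices, not values reported in this paper.

```python
from collections import defaultdict

GRAPHIC_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico")  # assumed list
CRAWLER_MARKERS = ("bot", "crawler", "spider")  # assumed user-agent substrings
SESSION_TIMEOUT = 30 * 60  # assumed inactivity limit, in seconds


def is_relevant(entry):
    """Keep an entry only if it is neither a crawler request nor a graphics request."""
    agent = entry["agent"].lower()
    if any(marker in agent for marker in CRAWLER_MARKERS):
        return False
    return not entry["url"].lower().endswith(GRAPHIC_EXTENSIONS)


def extract_sessions(entries):
    """Group relevant entries by IP address and split them on long inactivity."""
    by_ip = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e["time"]):
        if is_relevant(entry):
            by_ip[entry["ip"]].append(entry)
    sessions = []
    for requests in by_ip.values():
        current = [requests[0]]
        for entry in requests[1:]:
            if entry["time"] - current[-1]["time"] > SESSION_TIMEOUT:
                sessions.append(current)
                current = [entry]
            else:
                current.append(entry)
        sessions.append(current)
    return sessions


# Hypothetical log entries (time in seconds).
log = [
    {"time": 0, "ip": "10.0.0.1", "url": "/news", "agent": "Mozilla/5.0"},
    {"time": 5, "ip": "10.0.0.1", "url": "/logo.gif", "agent": "Mozilla/5.0"},
    {"time": 60, "ip": "10.0.0.1", "url": "/events", "agent": "Mozilla/5.0"},
    {"time": 70, "ip": "10.0.0.2", "url": "/news", "agent": "ExampleBot/1.0"},
    {"time": 4000, "ip": "10.0.0.1", "url": "/library", "agent": "Mozilla/5.0"},
]
sessions = extract_sessions(log)
session_urls = [[entry["url"] for entry in session] for session in sessions]
```

In this example the image request and the crawler request are dropped, and the remaining requests from the same IP address are split into two sessions by the long pause.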
3.1 Handling Profile Evolution
Most previous research efforts in Web usage
mining have worked with the assumption that
the Web usage data is static. However, the
dynamic aspects of Web usage have recently
become important. This is because Web access
patterns on a Web site are dynamic due not
only to the dynamics of Web site content and
structure but also to changes in the user’s
interests and, thus, their navigation patterns.
Thus, it is desirable to study and discover Web
usage patterns at a higher level, where such
dynamic tendencies and temporal events can
be distinguished. According to Maloof and
Michalski, learning evolving concepts adds
another layer of difficulty to the process of
online learning, since concepts can no
longer be assumed to be constant. In related
work, a user profiling system was developed based on
monitoring the user’s Web browsing and
e-mail habits. This system used a clustering
algorithm to group user interests into several
interest themes, and the user profiles had to
adapt to changing interests of the users over
time.
Maloof and Michalski further classified the
way online learning systems work into three
different modes: no memory, partial memory,
or full memory. In the no-memory mode, the
system does not use any past training examples,
whereas in the partial-memory mode, a subset of the
previously seen training examples is used for
later learning. Finally, in the full-memory
mode, all past training examples are used in
updating an existing model. It is important to
note that apart from one approach (which was limited to a
small number of attributes and users), all of
the above approaches were proposed within a
supervised learning framework (classification)
or focused on adaptation to a single user
(predicting whether an object is relevant or
not). On the other hand, the work that we
present in this paper is based on an
unsupervised learning framework that tries to
learn mass anonymous user profiles on the
server side. Nonetheless, according to Maloof
and Michalski’s categorization of concept drift
systems, our proposed system can be
categorized as a no-memory revolutionary user
profile mining approach. However, the user
profile tracking and validation approach works
in the full-memory mode. Furthermore, in this
paper, we are more interested in quantifying
and categorizing or annotating the various
types of evolution (not only detecting
evolution and adapting to it), and this, in turn,
can form a higher level of knowledge, in
addition to the description of the profiles
themselves as user models. We adopt an
approach based on periodical batch mining
that has the advantage of being easy to adapt
to use any other unsupervised learning tool
that automatically discovers clusters in static
or dynamic data. In this work, we use the full-memory
mode (periodical or window based), in
part, because our goal was to describe the user
profiles in certain periodical increments (about
two weeks each). Hence, it was essential to
fully mine the Web logs from each period and
then compare the subsequent results.
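The three memory modes can be made concrete with a small Python sketch that selects which past sessions are used when a new period is mined. The function name `training_sessions`, the `window` parameter, and the window-based reading of the partial-memory mode are assumptions made for illustration; other ways of choosing the retained subset are possible.

```python
def training_sessions(batches, mode, window=1):
    """Select the sessions used to mine the latest period.

    batches: list of per-period session lists, oldest first.
    mode:    "no-memory", "partial-memory", or "full-memory".
    window:  number of earlier periods kept in partial-memory mode (assumed scheme).
    """
    if mode == "no-memory":
        selected = batches[-1:]               # current period only
    elif mode == "partial-memory":
        selected = batches[-(window + 1):]    # current period plus a few earlier ones
    elif mode == "full-memory":
        selected = batches                    # everything seen so far
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [session for batch in selected for session in batch]


# Hypothetical session identifiers for three consecutive periods.
batches = [["s1", "s2"], ["s3"], ["s4", "s5"]]
no_memory = training_sessions(batches, "no-memory")
partial_memory = training_sessions(batches, "partial-memory", window=1)
full_memory = training_sessions(batches, "full-memory")
```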
4 PROFILE DISCOVERY BASED ON WEB
USAGE MINING
The framework for our Web usage mining
and a road map to the rest of this paper are
summarized in Fig. 1, which starts with the
integration and preprocessing of Web server
logs and server content databases, includes
data cleaning and sessionization, and then
continues with the data mining/pattern
discovery via clustering. This is followed by
postprocessing of the clustering results to
obtain Web user profiles and finally ends with
tracking profile evolution. The automatic
identification of user profiles is a knowledge
discovery task consisting of periodically
mining new contents of the user access log
files and is summarized in the following steps:
1. Preprocess Web log file to extract user
sessions.
2. Cluster the user sessions by using
Hierarchical Unsupervised Niche Clustering
(H-UNC).
3. Summarize session clusters/categories into
user profiles.
4. Enrich the user profiles with additional
facets by using additional Web log data and
external domain knowledge.
5. Track current profiles against existing
profiles.
4.1 Preprocessing the Web Log File to
Extract User Sessions
The access log of a Web server is a record
of all files (URLs) accessed by users on a Web
site. Each log entry consists of the access time,
IP address, URL viewed, REFERRER (the
Web page visited just prior to the current one),
etc. The first step in preprocessing consists of
mapping the NU URLs on a Web site to
distinct indices. A user session consists of
requests from the same IP address within a
predefined time period. Each URL in the site
is assigned a unique number
j = 1, ..., NU, where NU is the total number of
valid URLs. The ith user session is then
encoded as an NU-dimensional binary attribute
vector S(i) with the following property:
$$ S_j^{(i)} = \begin{cases} 1 & \text{if the user accessed the } j\text{th URL during the } i\text{th session,} \\ 0 & \text{otherwise.} \end{cases} $$
4.2 Clustering Sessions into an Optimal
Number of Categories
To cluster user sessions, we use H-UNC, a
divisive hierarchical version of a robust
clustering approach (Unsupervised Niche
Clustering (UNC)) that uses a Genetic
Algorithm (GA) to evolve a population of
candidate solutions through generations of
competition and reproduction. The main
outline of the H-UNC algorithm is sketched in
the following. The reason that I use H-UNC
instead of other clustering algorithms is that,
unlike most other algorithms, H-UNC can
handle noise in the data and automatically
determines the number of clusters. In addition,
evolutionary optimization allows the use of
any domain-specific optimization criterion and
any similarity measure, in particular a
subjective measure that exploits domain
knowledge or ontologies.
However, unlike purely evolutionary search-based
algorithms, H-UNC combines evolution with
local Picard updates to estimate the scale σi
of each profile, thus converging fast (in about 20
generations).
4.3 Similarity Measure Used in Clustering
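The similarity score between an input
session s and the ith profile pi can be
computed using the cosine similarity as
follows (where NU is the total number of
URLs):
$$ S_{\cos}(p_i, s) = \frac{\sum_{k=1}^{N_U} p_{ik}\, s_k}{\sqrt{\sum_{k=1}^{N_U} p_{ik}^2}\,\sqrt{\sum_{k=1}^{N_U} s_k^2}} $$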
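As an illustration of the binary session encoding and the cosine similarity, the following Python sketch encodes a session over a small URL index and compares it with a profile. The URL index, the example session, and the profile weights are hypothetical values chosen for the example; the cosine formula is the standard one.

```python
import math


def encode_session(visited_urls, url_index):
    """Encode a session as a binary vector with one component per valid URL."""
    vector = [0] * len(url_index)
    for url in visited_urls:
        if url in url_index:
            vector[url_index[url]] = 1
    return vector


def cosine_similarity(session, profile):
    """Standard cosine similarity between a session vector and a profile vector."""
    dot = sum(s * p for s, p in zip(session, profile))
    norm = math.sqrt(sum(s * s for s in session)) * math.sqrt(sum(p * p for p in profile))
    return dot / norm if norm else 0.0


# Hypothetical mapping of the site's URLs to distinct indices.
url_index = {"/news": 0, "/events": 1, "/library": 2, "/companies": 3}
session = encode_session(["/news", "/events"], url_index)
profile = [1.0, 0.5, 0.0, 0.0]  # hypothetical URL relevance weights of one profile
similarity = cosine_similarity(session, profile)
```

A session that shares no URL with the profile gets a similarity of zero, which is the limitation that a structure-aware measure is meant to soften.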
If a hierarchical Web site structure is to be
taken into account, then a modification of the
cosine similarity that takes the Web site
structure into account can be used to yield
the following similarity measure:
where Su(i,j) is a URL-to-URL
similarity function that is computed based
on the amount of overlap between the
paths Pi and Pj leading from the root of the
Web site (the main page) to any two URLs
i and j. This is given by
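One way to realize a path-overlap similarity of this kind is sketched in Python here. The normalization (shared leading path components divided by the longer path length) and the pairwise averaging in `structure_similarity` are assumptions made for illustration and are not necessarily the exact measures used in this study.

```python
def url_path(url):
    """Return the path components leading from the site root to the URL."""
    return [part for part in url.strip("/").split("/") if part]


def url_similarity(url_i, url_j):
    """Assumed URL-to-URL similarity: shared leading components over the longer path."""
    path_i, path_j = url_path(url_i), url_path(url_j)
    longest = max(len(path_i), len(path_j))
    if longest == 0:
        return 1.0  # both URLs are the root page
    overlap = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        overlap += 1
    return overlap / longest


def structure_similarity(session_urls, profile_urls):
    """Assumed session-to-profile score: average URL-to-URL similarity over all pairs."""
    if not session_urls or not profile_urls:
        return 0.0
    total = sum(url_similarity(a, b) for a in session_urls for b in profile_urls)
    return total / (len(session_urls) * len(profile_urls))


# Hypothetical URLs: distinct pages that sit close together in the hierarchy.
close = url_similarity("/library/standards/iso", "/library/standards/astm")
far = url_similarity("/news", "/events")
same = url_similarity("/library/standards", "/library/standards")
```

With this choice, two distinct documents in the same directory receive a partial weight instead of zero, which mirrors the motivation given above.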
4.4 Postprocessing and Enrichment of
Session Clusters into Multifaceted User
Profiles
In addition to the viewed Web pages, the
profile properties include the following facets:
1. Search queries. These are queries
submitted to search engines before
visiting the Web site for sessions that
belong to this profile.
2. Inquiring companies. These are
companies/organizations of registered
users or unregistered users whose IP
addresses can be mapped.
3. Inquired companies. These are companies/
organizations that have been inquired about
during the sessions belonging to this profile.
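The search query facet can be illustrated with a short Python sketch that pulls the query string out of the REFERRER field of a log entry. The query parameter name `q` and the sample referrer URLs are assumptions; search engines differ in the parameter they use, so a real implementation would need a per-engine mapping.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def search_query_from_referrer(referrer, parameter="q"):
    """Return the search query carried by a referrer URL, or None if there is none."""
    values = parse_qs(urlparse(referrer).query).get(parameter)
    return values[0] if values else None


def profile_search_queries(referrers):
    """Count the search queries seen in the referrers of one profile's sessions."""
    queries = (search_query_from_referrer(referrer) for referrer in referrers)
    return Counter(query for query in queries if query)


# Hypothetical referrers of sessions assigned to one profile.
referrers = [
    "http://www.google.com/search?q=corrosion+inhibitors&hl=en",
    "http://www.google.com/search?q=corrosion+inhibitors",
    "http://example.org/links.html",
]
query_counts = profile_search_queries(referrers)
```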
5 TRACKING EVOLVING USER
PROFILES
Tracking different profile events across
different time periods can generate a better
understanding of the evolution of user access
patterns and seasonality. Note that both
profiles and clickstreams are typically
evolving, since the profiles are nothing more
than summaries of the clickstreams, which are
themselves evolving. Each profile pi is
discovered along with an automatically
determined measure of scale σi that represents
the amount of variance or dispersion of the
user sessions in a given cluster around the
cluster representative. This measure is used to
determine the boundary around each cluster
(an area located at a distance σi from the
profile pi) and thus allows us to automatically
determine whether two profiles are
compatible. Two profiles are compatible if
their boundaries overlap. The notion of
compatibility between profiles is essential for
tracking evolving profiles. After mining the
Web log of a given period, we perform an
automated comparison between all the profiles
discovered in the current batch and the profiles
discovered in the previous batch by a sequence
of SQL queries on the profiles that have been
stored in a database, as shown in the
“TrackProfiles” Algorithm. A typical query
for retrieving corresponding profiles between
Periods T1 and T1+1 is “SELECT ThisProfile,
TothisProfile FROM ProfileTrail WHERE
Period = T1.”
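The compatibility test and the tracking query can be sketched in Python with the standard `sqlite3` module. The table and column names come from the query quoted above, but the column types, the sample rows, and the overlap test (distance between representatives no larger than the sum of the two scales) are assumptions made for illustration.

```python
import sqlite3


def compatible(distance, scale_a, scale_b):
    """Assumed overlap test: the two boundaries meet if the distance between
    the profile representatives does not exceed the sum of their scales."""
    return distance <= scale_a + scale_b


# In-memory database standing in for the stored profile trail (assumed schema).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE ProfileTrail (Period INTEGER, ThisProfile INTEGER, TothisProfile INTEGER)"
)
connection.executemany(
    "INSERT INTO ProfileTrail VALUES (?, ?, ?)",
    [(1, 0, 2), (1, 1, 0), (2, 0, 1)],  # hypothetical correspondences
)
# Retrieve the corresponding profiles between Period 1 and the next period.
rows = connection.execute(
    "SELECT ThisProfile, TothisProfile FROM ProfileTrail "
    "WHERE Period = ? ORDER BY ThisProfile",
    (1,),
).fetchall()
connection.close()
```

Passing the period as a bound parameter rather than formatting it into the SQL string keeps the query reusable for every period.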
We define a profile evolution event as a
coarse categorization of possible real
evolution scenarios that describe how profiles
discovered during a certain period
relate to profiles discovered in another
period. The above comparison process
determines which new profiles are
compatible with the old profiles and which
new profiles are incompatible with any
previous profile. These last two cases,
respectively, give rise to two kinds of
events: Persistence and Birth. A third
event Death arises in case an old profile
does not find a compatible profile from the
new batch. It is also possible to track
profile reemergence in the long term. This
is the case of an old profile that disappears
and then reappears when it is found to be
compatible with a new profile in the
current batch. This event is labeled as
Atavism.
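The four events can be expressed as a small Python sketch. The function `categorize_events`, its arguments, and the way Atavism is detected (a new profile that matches no profile of the previous batch but matches a profile that died earlier) are one interpretation of the description above, given for illustration only.

```python
def categorize_events(current, previous, dead, is_compatible):
    """Label each current profile and list the previous profiles that died.

    current:  profiles discovered in the current batch
    previous: profiles discovered in the previous batch
    dead:     profiles that disappeared in earlier periods
    is_compatible(a, b): True if the boundaries of profiles a and b overlap
    """
    events = {}
    matched_previous = set()
    for new in current:
        matches = [old for old in previous if is_compatible(new, old)]
        if matches:
            events[new] = "Persistence"
            matched_previous.update(matches)
        elif any(is_compatible(new, old) for old in dead):
            events[new] = "Atavism"
        else:
            events[new] = "Birth"
    deaths = [old for old in previous if old not in matched_previous]
    return events, deaths


# Hypothetical profile names and compatibility relation.
compatible_pairs = {("new1", "old1"), ("new2", "gone1")}


def is_compatible(a, b):
    return (a, b) in compatible_pairs


events, deaths = categorize_events(
    ["new1", "new2", "new3"], ["old1", "old2"], ["gone1"], is_compatible
)
```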
Fig. 2. Visualization of the profile evolution
6 CONCLUSION
This paper presents a framework for mining,
tracking, and validating evolving multifaceted
user profiles on Web sites that have all the
challenging aspects of real-life Web usage
mining, including evolving user profiles and
access patterns, dynamic Web pages, and
external data describing an ontology of the
Web content. A multifaceted user profile
summarizes a group of users with similar
access activities and consists of their viewed
pages, search engine queries and inquiring and
inquired companies. Here, Web clickstreams
can be treated as an evolving data stream, or
some new sessions can be mapped to persistent
profiles that are then updated, hence
eliminating most sessions from further
analysis and focusing the mining on truly new
sessions.
REFERENCES
[1] R. Cooley, B. Mobasher, and J. Srivastava, “Web
Mining: Information and Pattern Discovery on the World
Wide Web,” Proc. Ninth IEEE Int’l Conf. Tools with AI
(ICTAI ’97), pp. 558-567, 1997.
[2] O. Nasraoui, R. Krishnapuram, and A. Joshi,
“Mining Web Access Logs Using a Relational Clustering
Algorithm Based on a Robust Estimator,” Proc. Eighth
Int’l World Wide Web Conf. (WWW ’99), pp. 40-41,
1999.
[3] O. Nasraoui, R. Krishnapuram, H. Frigui, and A.
Joshi, “Extracting Web User Profiles Using Relational
Competitive Fuzzy Clustering,” Int’l J. Artificial
Intelligence Tools, vol. 9, no. 4, pp. 509-526, 2000.
[4] J. Srivastava, R. Cooley, M. Deshpande, and P.-N.
Tan, “Web Usage Mining: Discovery and Applications of
Usage Patterns from Web Data,” SIGKDD Explorations,
vol. 1, no. 2, pp. 1-12, Jan. 2000.
[5] M. Spiliopoulou and L.C. Faulstich, “WUM: A Web
Utilization Miner,” Proc. First Int’l Workshop Web and
Databases (WebDB ’98), 1998.
[6] T. Yan, M. Jacobsen, H. Garcia-Molina, and U.
Dayal, “From User Access Patterns to Dynamic
Hypertext Linking,” Proc. Fifth Int’l World Wide Web
Conf. (WWW ’96), 1996.
[7] J. Borges and M. Levene, “Data Mining of User
Navigation Patterns,” Web Usage Analysis and User
Profiling, LNCS, H.A. Abbass, R.A. Sarker, and C.S.
Newton, eds. pp. 92-111, Springer-Verlag, 1999.
[8] O. Nasraoui and R. Krishnapuram, “A New
Evolutionary Approach to Web Usage and Context
Sensitive Associations Mining,” Int’l J. Computational
Intelligence and Applications, special issue on Internet
intelligent systems, vol. 2, no. 3, pp. 339-348, Sept. 2002.
[9] O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez,
“Mining Evolving User Profiles in Noisy Web
Clickstream Data with a Scalable Immune System
Clustering Algorithm,” Proc. Workshop Web Mining as
a Premise to Effective and Intelligent Web Applications
(WebKDD ’03), pp. 71-81, Aug. 2003.
[10] P. Desikan and J. Srivastava, “Mining Temporally
Evolving Graphs,” Proc. Workshop Web Mining and
Web Usage Analysis (WebKDD’ 04), 2004.
[11] M.A. Maloof and R.S. Michalski, “Learning
Evolving Concepts Using Partial Memory Approach,”
Working Notes AAAI Fall Symp. Active Learning 1995,
pp. 70-73, 1995.
[12] M.A. Maloof and R.S. Michalski, “Selecting
Examples for Partial Memory Learning,” Machine
Learning, vol. 41, no. 11, pp. 27-52, 2000.
[13] I. Crabtree and S. Soltysiak, “Identifying and
Tracking Changing Interests,” Int’l J. Digital
Libraries, vol. 2, pp. 38-53,