Data Mining - Emory Math/CS Department
Data Mining:
Concepts and Techniques
Web Mining
Li Xiong
Slide credits: Jiawei Han and Micheline Kamber; Anand Rajaraman and Jeffrey D. Ullman; Olfa Nasraoui; Bing Liu
4/9/2008
Web Mining
- Web mining vs. data mining
  - Structure (or lack of it): linkage structure, plus the lack of structure in textual information
  - Scale: the data generated per day is comparable to the largest conventional data warehouses
  - Speed: often need to react to evolving usage patterns in real time (e.g., merchandising)
Web Mining
- Structure mining: extracting information from the topology of the Web (links among pages)
- Content mining: extracting information from page content (text, images, audio, video, etc.); draws on natural language processing and information retrieval
- Usage mining: extracting information from users' usage data on the Web (how users visit pages or make transactions)
Web Mining
- Web structure mining: web graph structure and link analysis
- Web text mining: text representation and IR models
- Web usage mining: collaborative filtering
Structure of Web Graph
- Web as a directed graph: pages = nodes, hyperlinks = edges
- Problem: understand the macroscopic structure and evolution of the web graph
- Practical implications: crawling, browsing, computation of link analysis algorithms
- Power-law degree distribution (source: Broder et al., 2000)
- Bow-tie structure (Broder et al., 2000); daisy structure (Donato et al., 2005)
Link Analysis
- Problem: exploit the link structure of a graph to order or prioritize the set of objects within the graph
- An application of social network analysis at the actor level: centrality and prestige
- Algorithms: PageRank, HITS
PageRank (Brin & Page '98)
- Intuition
  - Web pages are not equally "important": www.joe-schmoe.com vs. www.stanford.edu
  - Links as citations: a page cited often is more important
    - www.stanford.edu has 23,400 inlinks
    - www.joe-schmoe.com has 1 inlink
  - Recursive model: links from heavily linked pages are weighted more
- PageRank is essentially the eigenvector prestige measure from social network analysis
Simple Recursive Flow Model
- Each link's vote is proportional to the importance of its source page
- If page P with importance x has n outlinks, each link gets x/n votes
- Page P's own importance is the sum of the votes on its inlinks

Example (Yahoo = y, Amazon = a, M'soft = m; Yahoo links to itself and Amazon, Amazon links to Yahoo and M'soft, M'soft links to Amazon):

  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Solving with the constraint y + a + m = 1: substituting m = a/2 into a = y/2 + m gives a = y/2 + a/2, so a = y; then y + y + y/2 = 1 yields

  y = 2/5, a = 2/5, m = 1/5
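The stated solution can be verified exactly with rational arithmetic; a minimal check (page order y, a, m assumed from the example above):

```python
from fractions import Fraction as F

y, a, m = F(2, 5), F(2, 5), F(1, 5)

# Each equation says a page's importance equals the votes on its inlinks.
assert y == y / 2 + a / 2   # Yahoo:  half of Yahoo's vote + half of Amazon's
assert a == y / 2 + m       # Amazon: half of Yahoo's vote + all of M'soft's
assert m == a / 2           # M'soft: half of Amazon's vote
assert y + a + m == 1       # normalization constraint
print("y = 2/5, a = 2/5, m = 1/5 satisfies the flow equations")
```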
Matrix formulation
- Web link matrix M: one row and one column per web page
  - M_ij = 1/O_j if page j links to page i ((j, i) ∈ E), and 0 otherwise, where O_j is the number of outlinks of page j
- Rank vector r: one entry per web page
- Flow equation: r = Mr, i.e., r_i = Σ_j M_ij r_j
- r is an eigenvector of M
Matrix formulation Example (Yahoo = y, Amazon = a, M'soft = m)

          y    a    m
    y   1/2  1/2   0
M = a   1/2   0    1
    m    0   1/2   0

r = Mr is exactly the flow equations:

  y = y/2 + a/2
  a = y/2 + m
  m = a/2
Power Iteration method
Solving the equation r = Mr:
- Suppose there are N web pages
- Initialize: r^(0) = [1/N, ..., 1/N]^T
- Iterate: r^(k+1) = M r^(k)
- Stop when |r^(k+1) - r^(k)|_1 < ε
  - |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
  - Any other vector norm can be used, e.g., Euclidean
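The iteration above can be sketched in plain Python on the three-page example; a minimal sketch (the matrix and page order y, a, m follow the slides):

```python
# Power iteration for r = Mr on the 3-page example (pages y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

def power_iteration(M, eps=1e-10, max_iter=1000):
    n = len(M)
    r = [1.0 / n] * n                      # r0 = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        # Stop when the L1 distance between successive iterates is small.
        if sum(abs(p - q) for p, q in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

r = power_iteration(M)
print(r)  # converges to y = 2/5, a = 2/5, m = 1/5
```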
Power Iteration Example (same M as before)

          y    a    m
    y   1/2  1/2   0
M = a   1/2   0    1
    m    0   1/2   0

Iterating r^(k+1) = M r^(k) from r^(0) = (1/3, 1/3, 1/3), the vector (y, a, m) evolves as:

  (1/3, 1/3, 1/3) → (1/3, 1/2, 1/6) → (5/12, 1/3, 1/4) → (3/8, 11/24, 1/6) → ... → (2/5, 2/5, 1/5)
Random Walk Interpretation
- Imagine a random web surfer
  - At any time t, the surfer is on some page P
  - At time t+1, the surfer follows an outlink from P uniformly at random
  - The surfer ends up on some page Q linked from P
  - The process repeats indefinitely
- p(t) is the probability distribution whose i-th component is the probability that the surfer is at page i at time t
The stationary distribution
- Where is the surfer at time t+1? p(t+1) = M p(t)
- Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t)
  - Then p(t) is a stationary distribution for the random walk
- Our rank vector r satisfies r = Mr, so it is a stationary distribution for the random surfer's walk
Existence and Uniqueness of the Solution
- From the theory of random walks (a.k.a. Markov processes): a finite Markov chain defined by a stochastic matrix has a unique stationary probability distribution if the matrix is irreducible and aperiodic.
M is not a stochastic matrix
- M is the transition matrix of the Web graph:

  M_ij = 1/O_j if page j links to page i ((j, i) ∈ E), and 0 otherwise

- It does not satisfy Σ_{i=1}^{n} M_ij = 1 for every column j
- Many web pages have no outlinks; such pages are called dangling pages.

(CS583, Bing Liu, UIC)
M is not irreducible
- Irreducible means that the Web graph G is strongly connected.
- Definition: a directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a path from u to v.
- A general Web graph is not irreducible because for some pairs of nodes u and v there is no path from u to v.
M is not aperiodic
- A state i in a Markov chain being periodic means that there exists a directed cycle that the chain has to traverse.
- Definition: a state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k.
  - If a state is not periodic (i.e., k = 1), it is aperiodic.
- A Markov chain is aperiodic if all its states are aperiodic.
Solution: Random teleports
- Add a link from each page to every page
- At each time step, the random surfer has a small probability of teleporting:
  - With probability β, follow an outlink at random
  - With probability 1-β, jump to some page uniformly at random
  - Common values for β are in the range 0.8 to 0.9
Random teleports Example (β = 0.8)

          ( 1/2  1/2   0 )         ( 1/3  1/3  1/3 )     ( 7/15  7/15   1/15 )
A = 0.8 · ( 1/2   0    0 )  + 0.2 · ( 1/3  1/3  1/3 )  =  ( 7/15  1/15   1/15 )
          (  0   1/2   1 )         ( 1/3  1/3  1/3 )     ( 1/15  7/15  13/15 )

(Note that in this example M'soft links only to itself.)

Iterating r = Ar from (y, a, m) = (1, 1, 1):

  (1.00, 0.60, 1.40) → (0.84, 0.60, 1.56) → (0.776, 0.536, 1.688) → ... → (7/11, 5/11, 21/11)
Matrix formulation
- Matrix A: A_ij = βM_ij + (1-β)/N
  - M_ij = 1/|O(j)| when j→i, and M_ij = 0 otherwise
- Verify that A is a stochastic matrix
- The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
- Equivalently, r is the stationary distribution of the random walk with teleports
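The teleport construction can be sketched directly from the formula A_ij = βM_ij + (1-β)/N; a minimal sketch using the slide's β = 0.8 example (where M'soft links only to itself):

```python
# PageRank with random teleports: A_ij = beta*M_ij + (1-beta)/N.
beta, N = 0.8, 3
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

# Each column of A sums to 1, so A is a (column-)stochastic matrix.
for j in range(N):
    assert abs(sum(A[i][j] for i in range(N)) - 1.0) < 1e-12

r = [1.0, 1.0, 1.0]            # the slide iterates from (1, 1, 1)
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]
print(r)  # approaches (7/11, 5/11, 21/11), as in the slide's example
```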
Advantages and Limitations of PageRank
- Fights spam
- PageRank is a global measure and is query independent
- Computed offline
- Criticism: query independence
  - It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
HITS: Capturing Authorities & Hubs (Kleinberg '98)
- Intuitions
  - Pages that are widely cited are good authorities
  - Pages that cite many other pages are good hubs
- HITS (Hypertext-Induced Topic Selection): when the user issues a search query, HITS expands the list of relevant pages returned by a search engine and produces two rankings
  1. Authorities: pages containing useful information, linked to by hubs
     - course home pages
     - home pages of auto manufacturers
  2. Hubs: pages that link to authorities
     - course bulletins
     - lists of US auto manufacturers
Matrix Formulation
- Transition (adjacency) matrix A: A[i, j] = 1 if page i links to page j, 0 if not
- Hub score vector h: a page's hub score is proportional to the sum of the authority scores of the pages it links to
  - h = λAa, where the constant λ is a scale factor
- Authority score vector a: a page's authority score is proportional to the sum of the hub scores of the pages it is linked from
  - a = μA^T h, where the constant μ is a scale factor
Transition Matrix Example (Yahoo = y, Amazon = a, M'soft = m)

        y a m
    y   1 1 1
A = a   1 0 1
    m   0 1 0
Iterative algorithm
- Initialize h, a to all 1's
- h = Aa; scale h so that its max entry is 1.0
- a = A^T h; scale a so that its max entry is 1.0
- Continue until h and a converge
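A minimal sketch of this iteration in Python, on the adjacency matrix used in the slides (page order y, a, m assumed):

```python
# HITS iterative algorithm on the slides' 3-page example.
# A[i][j] = 1 if page i links to page j (pages ordered y, a, m).
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
n = len(A)
h = [1.0] * n   # hub scores, initialized to all 1's
a = [1.0] * n   # authority scores, initialized to all 1's

for _ in range(100):
    # h = A a: a page's hub score sums the authority scores it links to.
    h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
    mh = max(h)
    h = [x / mh for x in h]          # scale so the max entry is 1.0
    # a = A^T h: a page's authority score sums the hub scores linking to it.
    a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
    ma = max(a)
    a = [x / ma for x in a]          # scale so the max entry is 1.0

print([round(x, 3) for x in h])  # -> [1.0, 0.732, 0.268]
print([round(x, 3) for x in a])  # -> [1.0, 0.732, 1.0]
```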
Iterative Algorithm Example

    1 1 1          1 1 0
A = 1 0 1    A^T = 1 0 1
    0 1 0          1 1 0

(a(yahoo), a(amazon), a(m'soft)): (1, 1, 1) → (1, 4/5, 1) → (1, 0.75, 1) → ... → (1, 0.732, 1)
(h(yahoo), h(amazon), h(m'soft)): (1, 1, 1) → (1, 2/3, 1/3) → (1, 0.71, 0.29) → (1, 0.73, 0.27) → ... → (1.000, 0.732, 0.268)
Existence and Uniqueness of the Solution

h = λAa and a = μA^T h together imply

  h = λμ A A^T h
  a = λμ A^T A a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
- h* is the principal eigenvector of the matrix A A^T
- a* is the principal eigenvector of the matrix A^T A
Strengths and weaknesses of HITS
- Strength: it ranks pages according to the query topic, which may provide more relevant authority and hub pages.
- Weaknesses:
  - Easily spammed
  - Topic drift
  - Inefficient at query time
PageRank and HITS
- Model
  - PageRank: depends on the links into S
  - HITS: depends on the value of the other links out of S
- Characteristics
  - Spam resistance
  - Query independence
- Destinies post-1998
  - PageRank: a trademark of Google
  - HITS: not commonly used by search engines (Ask.com?)
Web Mining
- Web structure mining: web graph structure; link analysis
- Web text mining
- Web usage mining: collaborative filtering
Text Mining
- Text mining refers to data mining using text documents as data.
- Tasks
  - Text summarization
  - Text classification
  - Text clustering
  - ...
- Intersects with information retrieval and natural language processing
Levels of text representations
- Character (character n-grams and sequences)
- Words (stop-words, stemming, lemmatization)
- Phrases (word n-grams, proximity features)
- Part-of-speech tags
- Taxonomies / thesauri
- Vector-space model
- Language models
- Full parsing
- Cross-modality
- Collaborative tagging / Web 2.0
- Templates / frames
- Ontologies / first-order theories
N-Gram
- An n-gram is a sub-sequence of n items from a given sequence.
  - The items can be characters, words, or base pairs, according to the application.
  - Unigram, bigram, trigram, ...
- Example: 4-grams from the Google n-gram corpus (with counts):

  serve as the incoming (92)
  serve as the incubator (99)
  serve as the independent (794)
  serve as the index (223)
  serve as the indication (72)
  serve as the indicator (120)
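The sliding-window definition above is a one-liner in code; a minimal sketch (the example phrase is taken from the slide's corpus sample):

```python
# Minimal word n-gram extraction. The items here are words; the same
# sliding-window idea works for characters or base pairs.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "serve as the index".split()
print(ngrams(words, 2))  # bigrams: [('serve', 'as'), ('as', 'the'), ('the', 'index')]
print(ngrams(words, 4))  # the single 4-gram: [('serve', 'as', 'the', 'index')]
```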
Bag-of-Words Document Representation

Vector space model
- Each document is represented as a vector.
- Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection. V is called the vocabulary.
- A weight wij > 0 is associated with each term ti of a document dj. For a term that does not appear in document dj, wij = 0.

  dj = (w1j, w2j, ..., w|V|j)
TFIDF Weighting
- TF (term frequency) and IDF (inverse document frequency) combine as:

  tfidf(w) = tf(w) · log(N / df(w))

- tf(w): term frequency (number of occurrences of the word in a document)
- df(w): document frequency (number of documents containing the word)
- N: total number of documents
- tfidf(w): relative importance of the word in the document
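The weighting can be sketched directly from the formula; a minimal sketch (the toy corpus below is illustrative, not from the slides):

```python
import math

# TF-IDF as defined on the slide: tfidf(w) = tf(w) * log(N / df(w)).
docs = [
    "web mining uses web data".split(),
    "data mining finds patterns".split(),
    "the web is a graph".split(),
]
N = len(docs)

# Document frequency: in how many documents does each word appear?
df = {}
for d in docs:
    for w in set(d):
        df[w] = df.get(w, 0) + 1

def tfidf(w, doc):
    tf = doc.count(w)                 # occurrences of w in this document
    return tf * math.log(N / df[w])   # rarer words get higher weight

d0 = docs[0]
print(round(tfidf("web", d0), 3))   # frequent in d0 but appears in 2 of 3 docs
print(round(tfidf("uses", d0), 3))  # appears only in d0, so higher idf
```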
Similarity between document vectors
- Each document is represented as a vector of weights
- Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors
  - It calculates the cosine of the angle between the document vectors
  - It is efficient to calculate (a sum of products over intersecting words)
  - The similarity value is between 0 (different) and 1 (the same)

  Sim(D1, D2) = Σ_i x1i·x2i / ( √(Σ_j x1j²) · √(Σ_k x2k²) )
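A minimal sketch of the cosine measure (the example vectors are illustrative):

```python
import math

# Cosine similarity between two weight vectors over the same vocabulary.
def cosine(x1, x2):
    dot = sum(a * b for a, b in zip(x1, x2))
    n1 = math.sqrt(sum(a * a for a in x1))
    n2 = math.sqrt(sum(b * b for b in x2))
    return dot / (n1 * n2)

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1
d3 = [0.0, 0.0, 3.0]   # shares no terms with d1
print(cosine(d1, d2))  # ~1.0: identical direction
print(cosine(d1, d3))  # 0.0: no intersecting words
```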
Web Mining
- Web structure mining: web graph structure; link analysis
- Web text mining
- Web usage mining: collaborative filtering
Web Usage Data
- Web logs: low level
  - Track queries and the individual pages/items requested by a Web browser
- Application logs: higher level
  - When customers check in and check out, items placed in or removed from the shopping cart, etc.
Web Usage Mining
- Association rule mining: discover associations between pages and products
- Sequential pattern discovery: help discover visit patterns and make predictions about future visits
- Clustering: group similar sessions into clusters, which may correspond to user profiles / modes of usage of the website
- Collaborative filtering: filter/recommend pages and products based on similar users
Collaborative Filtering: Motivation
- User perspective
  - Lots of web pages, online products, books, movies, etc.
  - "Reduce my choices... please..."
- Manager perspective
  - "If I have 3 million customers on the web, I should have 3 million stores on the web." - CEO of Amazon.com [SCH01]

(Slides: Data Mining: Principles and Algorithms)
Basic Approaches
- Collaborative filtering (CF)
  - Based on the active user's history
  - Based on other users' collective behavior
- Content-based filtering
  - Based on keywords and other features
Collaborative Filtering: A Framework
- Users U = {u1, u2, ..., um} and items I = {i1, i2, ..., in} form a partially filled m × n rating matrix: some entries are known ratings, while the entry rij for the active user and a candidate item is unknown (rij = ?).
- The task:
  - Q1: Find the unknown ratings
  - Q2: Which items should we recommend to this user?
- Unknown function f: U × I → R
Collaborative Filtering: Main Methods
- User-user methods
  - Memory-based: k-NN
  - Model-based: clustering
- Item-item methods
  - Correlation analysis
  - Linear regression
  - Belief networks
  - Association rule mining
User-User method: Intuition
Given a target customer and the ratings of other users:
- Q1: How to measure similarity?
- Q2: How to select neighbors?
- Q3: How to combine?
How to Measure Similarity?
- Pearson correlation coefficient, summing over the items commonly rated by users a and i:

  w_p(a, i) = Σ_j (r_aj - r̄_a)(r_ij - r̄_i) / ( √(Σ_j (r_aj - r̄_a)²) · √(Σ_j (r_ij - r̄_i)²) )

- Cosine measure: users are vectors in product-dimension space

  w_c(a, i) = (r_a · r_i) / (‖r_a‖₂ · ‖r_i‖₂)
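The Pearson measure over commonly rated items can be sketched as follows (the rating dictionaries are made-up illustrative data, not from the slides):

```python
import math

# Pearson correlation between users a and i over commonly rated items.
# Ratings are dicts mapping item -> rating.
def pearson(ra, ri):
    common = set(ra) & set(ri)
    if not common:
        return 0.0
    ma = sum(ra[j] for j in common) / len(common)   # mean rating of user a
    mi = sum(ri[j] for j in common) / len(common)   # mean rating of user i
    num = sum((ra[j] - ma) * (ri[j] - mi) for j in common)
    den = math.sqrt(sum((ra[j] - ma) ** 2 for j in common)) * \
          math.sqrt(sum((ri[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

a = {"i1": 3, "i2": 1, "i3": 4}
i = {"i1": 4, "i2": 2, "i3": 5}    # same preferences, shifted up by 1
print(pearson(a, i))  # ~1.0: perfectly correlated despite different scales
```

Note how mean-centering makes the measure insensitive to each user's rating scale, which is exactly why Pearson is preferred over raw overlap counts here.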
Nearest Neighbor Approaches [SAR00a]
- Offline phase: do nothing... just store transactions
- Online phase:
  - Identify users highly similar to the active one
    - the best K ones, or
    - all those with a similarity measure greater than a threshold
  - Prediction:

    r_aj = r̄_a + Σ_i w(a, i)(r_ij - r̄_i) / Σ_i w(a, i)

    where r̄_a is user a's neutral (mean) rating, (r_ij - r̄_i) is user i's deviation on item j, and the weighted sum estimates user a's deviation.
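The prediction formula above can be sketched directly (the weights and ratings below are illustrative, not from the slides):

```python
# Neighborhood prediction as on the slide:
# r_aj = mean(a) + sum_i w(a,i) * (r_ij - mean(i)) / sum_i w(a,i)
def predict(mean_a, neighbors):
    """neighbors: list of (weight, neighbor's rating of item j, neighbor's mean)."""
    num = sum(w * (r - m) for w, r, m in neighbors)
    den = sum(w for w, _, _ in neighbors)
    return mean_a + num / den

# The active user averages 3.0; two similar users rated item j above
# their own averages, so the prediction lands above 3.0 as well.
p = predict(3.0, [(0.9, 5.0, 4.0), (0.5, 4.0, 3.5)])
print(round(p, 3))  # -> 3.821
```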
Clustering [BRE98]
- Offline phase:
  - Build clusters: k-means, k-medoids, etc.
- Online phase:
  - Identify the nearest cluster to the active user
  - Prediction:
    - Use the center of the cluster, or
    - A weighted average over cluster members, where the weights depend on the active user
Clustering vs. k-NN Approaches
- k-NN using the Pearson measure is slower but more accurate
- Clustering is more scalable
- (Figure: an active user falling near a cluster boundary can receive bad recommendations.)
Reference: Link Analysis
- S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine (PageRank). Computer Networks and ISDN Systems, 1998.
- J. Kleinberg. Authoritative sources in a hyperlinked environment (HITS). In ACM-SIAM Symp. on Discrete Algorithms, 1998.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the link structure of the World Wide Web. IEEE Computer, 1999.
- D. Cai, X. He, J. Wen, and W. Ma. Block-level link analysis. SIGIR 2004.
References: Collaborative Filtering
- C. C. Aggarwal, J. L. Wolf, K.-L. Wu, and P. S. Yu. Horting hatches an egg: a new graph-theoretic approach to collaborative filtering. KDD 1999: 201-212.
- J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. 14th Conf. on Uncertainty in Artificial Intelligence, Madison, July 1998.
- Y. H. Cho and J. K. Kim. Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2003.
- W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. In Advances in Neural Processing Systems 10, Denver, CO, 1997.
- T. Kamishima. Nantonac collaborative filtering: recommendation based on order responses. KDD 2003: 583-588.
- C.-H. Lee, Y.-H. Kim, and P.-K. Rhee. Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications, 21(3), October 2001, pp. 131-137.
References: Collaborative Filtering
- W. Lin. Online presentation, 2001: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt
- W. Lin, S. A. Alvarez, and C. Ruiz. Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83-105, 2002.
- G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76-80, Jan. 2003.
- B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Analysis of recommendation algorithms for e-commerce. ACM Conf. on Electronic Commerce 2000: 158-167.
- B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender systems: a case study. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
- B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. WWW 2001.
References: Collaborative Filtering
- B. Sarwar. Online presentation, 2000: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
- J. B. Schafer, J. A. Konstan, and J. Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2):115-153, 2001.
- L. H. Ungar and D. P. Foster. Clustering methods for collaborative filtering. AAAI Workshop on Recommendation Systems, 1998.
- Y.-F. Wang, Y.-L. Chuang, M.-H. Hsu, and H.-C. Keh. A personalized recommender system for the cosmetic business. Expert Systems with Applications, 26(3), April 2004, pp. 427-434.
- S. Vucetic and Z. Obradovic. A regression-based approach for scaling-up personalized recommender systems in e-commerce. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
- K. Yu, X. Xu, M. Ester, and H.-P. Kriegel. Selecting relevant instances for efficient and accurate collaborative filtering. In Proc. 10th CIKM, pp. 239-246, ACM Press, 2001.
- C. Zhai. Online course notes, Spring 2003: http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt