Disambiguating Web Appearances of
People in a Social Network
Ron Bekkerman, Andrew McCallum
University of Massachusetts
WWW 2005 (Chiba, Japan)
Abstract
Looking for information about a particular person
The namesakes problem
Multiple people who are related in some way
Two unsupervised methods
One based on the link structure of the Web pages
Another using Agglomerative/Conglomerative Double
Clustering (A/CDC)
Dataset
Over 1000 Web pages retrieved from Google queries on 12
personal names appearing together in someone’s email
folder
Outperform traditional agglomerative clustering by more
than 20%, achieving over 80% F-measure
Introduction
Personalized tool that manages our social network
Tracks people we already know
Tells us about new people we meet
e.g., when we receive email messages from people, it can rapidly
search for any public facts about them
Public information about a person
A useful summary of public information about a person can be
gathered from the Web: news articles, corporate pages, university
pages, discussion forums, etc.
But how do we identify whether a given Web page is about
the person in question (a relevant page) or about a different
person with the same name?
Introduction (cont.)
Example:
David Mulford, the US Ambassador
most of the pages retrieved are actually related to the
Ambassador; however, there are also two business
managers, a musician, a student, a scientist, and a few
others
The task is to filter out information about the other namesakes
Previous Work
Automatically populating a database of contact information for
people in a user's social network
Homepage finding is used to extract institution, job title, address,
phone, fax, and email, with simple heuristics for
disambiguating person names, which sometimes fail
Introduction (cont.)
This paper
Finding all search engine hits about a person and separating
them from those of namesakes
Looks beyond homepages
Presents two statistical frameworks: one based on link
structure, the other based on the recently introduced multi-way
distributional clustering method
Rather than searching for people individually, we leverage an
existing social network of people who are known to be
somewhat connected, and use this extra information to aid the
disambiguation
Problem statement and related work
Problem statement
Provide a function f that answers whether or not a Web page d
refers to a particular person h, given a model M and
background knowledge K
Background Knowledge
Perfect background knowledge is unavailable
K can include training data: pages that are related or
unrelated to the person, though obtaining negative instances
can be much more difficult
Problem statement and related work
(cont.)
Related Work
The problem of disambiguating collections of Web
appearances has been explored surprisingly little
Homepage finding
AHOY! (1997)
TREC homepage finding competition in 2002
primarily use heuristics and pattern matching
Cross-document co-reference and name resolution
All use average-link clustering methods
Bagga and Baldwin use agglomerative clustering over the
vector space model (VSM)
Fleischman and Hovy construct a MaxEnt classifier to learn
distances between documents, which are then clustered
Methods - Link Structure Model
Application scenario
Given a group of people H = {h_1,…,h_N} who are related to
each other, we would like to identify the Web presence of all
of them simultaneously
Important observation
Web pages of a group of acquaintances are likely to be
interconnected; the term "interconnectedness" must be
defined
Construct a model M given the set of Web pages D
D is constructed by providing a search engine with queries
$t_{h_1},\dots,t_{h_N}$ and retrieving the top K hits of each query,
yielding N·K Web pages overall
Methods - Link Structure Model (cont.)
$G_{LS} = (V, E)$: the Link Structure Graph
The Maximal Connected Component (MCC) is the core of the model
Central cluster $C_0$: the largest connected component,
consisting of pages retrieved by more than one query
The Link Structure Model $M_{LS}$ is a pair (C, δ)
C = {C_1,…,C_M} (note that $C_0 \in C$)
δ: a distance threshold
Discrimination function f is defined as
$f(d, h \mid M_{LS}(K)) = \begin{cases} 1, & \text{if } d \in C_i \text{ with } \mathrm{dist}(C_i, C_0) \le \delta,\ i = 0,\dots,M \\ 0, & \text{otherwise} \end{cases}$
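A minimal sketch of this discrimination step, under the assumption that pages and pairwise linkage decisions are already available; `linked` and `dist` are placeholders for the design choices discussed on the next slide, and `networkx` is used only for the graph bookkeeping:

```python
import itertools
import networkx as nx

def link_structure_accept(pages, linked, dist, delta):
    """pages: dict mapping page id -> set of queries that retrieved it.
    Builds G_LS, extracts connected components, picks the central
    cluster C0, and accepts every component within distance delta of C0."""
    g = nx.Graph()
    g.add_nodes_from(pages)
    for d1, d2 in itertools.combinations(pages, 2):
        if linked(d1, d2):
            g.add_edge(d1, d2)
    components = [set(c) for c in nx.connected_components(g)]
    # C0: the largest component containing pages retrieved by >1 query
    multi = [c for c in components
             if len(set().union(*(pages[d] for d in c))) > 1]
    c0 = max(multi, key=len)  # assumes such a component exists
    # f(d, h | M_LS(K)) = 1 exactly for the pages collected here
    return set().union(*(c for c in components if dist(c, c0) <= delta))
```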
Methods - Link Structure Model (cont.)
Particular design choices (vary from system to system)
How to decide whether two pages are linked or not
Directly linked, reachable within three links, or belonging to
the same organization
How to decide a suitable value of δ
How to calculate the distance between two clusters C0 and Ci
Cosine similarity or Kullback-Leibler divergence
Methods - Link Structure Model (cont.)
In their experiment
Linked pages
url(d): the domain name of d together with its first directory name
links(d): the set of URLs that occur in d
Trusted URLs: TR(D) = {url(d_i)} \ POP, where POP is a list of
popular Web sites that are excluded
Link structure: LS(d) = (links(d) ∩ TR(D)) ∪ {url(d)}
Two pages d_1 and d_2 are linked to each other if their link
structures intersect, that is, LS(d_1) ∩ LS(d_2) ≠ ∅
Distance threshold
Set δ so that one third of the pages in the dataset fall within
the threshold
Distance measure between clusters
Cosine similarity with a variation of tf-idf term weighting:
$\mathrm{tfidf}(w) = \dfrac{\mathrm{tf}(w)}{\log\, \mathrm{google\_df}(w)}$
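A sketch of these specific choices. The contents of `POP` (the stoplist of popular sites) and the document-frequency numbers behind `google_df` are assumptions standing in for data the authors obtain externally:

```python
import math
from urllib.parse import urlparse

POP = {"amazon.com", "geocities.com"}  # hypothetical popularity stoplist

def url_key(u):
    """url(d): the domain name plus the first directory name of the path."""
    p = urlparse(u)
    first_dir = p.path.strip("/").split("/")[0]
    return f"{p.netloc}/{first_dir}"

def trusted_urls(all_urls):
    """TR(D) = {url(d_i)} \\ POP"""
    return {url_key(u) for u in all_urls if urlparse(u).netloc not in POP}

def link_structure(page_url, outgoing_links, trusted):
    """LS(d) = (links(d) ∩ TR(D)) ∪ {url(d)}"""
    return ({url_key(u) for u in outgoing_links} & trusted) | {url_key(page_url)}

def linked(ls1, ls2):
    """Two pages are linked iff their link structures intersect."""
    return bool(ls1 & ls2)

def tfidf(tf, google_df):
    """The slide's weighting variant: tfidf(w) = tf(w) / log(google_df(w))."""
    return tf / math.log(google_df)
```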
Methods - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model
Clustering Model $M_{CL}$
a pair (C, L(·)), where C is the set of clusters of documents
in D, and L(·) is the interconnectedness measure of a
cluster
Discrimination function f is defined as follows
$f(d, h \mid M_{CL}(K)) = \begin{cases} 1, & \text{if } d \in C^* : C^* = \arg\max_{i=1..M} L(C_i) \\ 0, & \text{otherwise} \end{cases}$
Apply the A/CDC algorithm, an instance of the new multi-way distributional clustering method
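A minimal sketch of this discrimination function; the interconnectedness measure L(·) is kept abstract here, since the slides do not spell out its exact definition:

```python
def acdc_accept(clusters, L):
    """C* = argmax_{i=1..M} L(C_i); f(d, h | M_CL(K)) = 1 iff d is in C*."""
    c_star = max(clusters, key=L)
    return lambda d: 1 if d in c_star else 0
```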
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Main idea of A/CDC
Employ the fact that similar documents have similar
distributions over words, while similar words are similarly
distributed over documents
Starting with one cluster containing all words and many
clusters of one document each, we iteratively split word
clusters and merge document clusters, conditioning each
clustering system on the other, until meaningful clusters
are obtained
Multi-way distributional clustering stands in close
correspondence with the Multivariate Information Bottleneck
(MIB) method
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Background
Information Bottleneck (IB)
a convenient information-theoretic framework for solving
clustering and information retrieval problems
The main idea behind IB clustering is to construct an
assignment of data points X into clusters $\tilde X$ that
maximizes the information about entities Y that are
interdependent with X
The information about Y gained from $\tilde X$ is
$I(\tilde X; Y) = \sum_{\tilde X, Y} P(\tilde X, Y) \log \dfrac{P(\tilde X, Y)}{P(\tilde X)\, P(Y)}$
Adding a compression constraint, the IB problem is stated as
$\arg\max_{\tilde X}\ \big[\, I(\tilde X; Y) - \beta\, I(X; \tilde X) \,\big]$
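As a concrete reading of these two formulas, the sketch below computes $I(\tilde X; Y)$ from a joint distribution matrix and evaluates the IB objective for a hard cluster assignment; all names are illustrative:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} P(x,y) log( P(x,y) / (P(x)P(y)) )"""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (px @ py)[nz])).sum())

def ib_objective(p_xy, assign, beta):
    """I(X~;Y) - beta * I(X;X~) for a hard assignment of each x to a cluster."""
    n_clusters = assign.max() + 1
    # P(x~, y): add up the rows of P(x, y) that fall into the same cluster
    p_cy = np.zeros((n_clusters, p_xy.shape[1]))
    np.add.at(p_cy, assign, p_xy)
    # P(x, x~): the assignment is deterministic, so the joint equals P(x)
    # where x matches its cluster and is zero elsewhere
    p_x = p_xy.sum(axis=1)
    p_xc = np.zeros((len(p_x), n_clusters))
    p_xc[np.arange(len(p_x)), assign] = p_x
    return mutual_information(p_cy) - beta * mutual_information(p_xc)
```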
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Background
Slonim and Tishby proposed a greedy agglomerative
algorithm for document clustering based on IB, where X
stands for documents and Y for the words in them, and
proposed the double clustering technique
Friedman et al. propose the Multivariate Information
Bottleneck (MIB) framework: they consider clustering
instances of a set of variables X = (X_1,…,X_n) into a set of
clustering systems $\tilde X = (\tilde X_1,\dots,\tilde X_n)$
The double clustering problem thus becomes a special case
of MIB and can be derived as
$\arg\max_{\tilde X, \tilde Y}\ \big[\, I(\tilde X; \tilde Y) - \beta\, ( I(X; \tilde X) + I(Y; \tilde Y) ) \,\big]$
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Motivation
Setting the Lagrange multiplier β to zero, the double clustering
objective is then derived as
$\arg\max_{\tilde X, \tilde Y}\ I(\tilde X; \tilde Y), \quad \text{subject to } |\tilde X| = N_{\tilde X},\ |\tilde Y| = N_{\tilde Y}$
Explore different possibilities while exploiting the hierarchical
structure of the clusters: agglomerative (bottom-up) and
conglomerative (top-down) clustering
Two top-down schemes: bad, as random splits lead to a
completely random clustering
Two bottom-up schemes: bad, because of computational issues
One top-down and one bottom-up: feasible, as the two systems
can "bootstrap" each other
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Motivation
Use one top-down and one bottom-up scheme
A/CDC is the simultaneous clustering of X by a top-down
scheme and Y by a bottom-up scheme
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Overview of algorithm
Break the objective function into two parts:
$\arg\max_{\tilde X}\ I(\tilde X; \tilde Y) \quad\text{and}\quad \arg\max_{\tilde Y}\ I(\tilde X; \tilde Y)$
Initialize the two clustering systems with one cluster $\tilde x$ that
contains all data points x and one cluster $\tilde y_i$ per data point
$y_i$, and calculate the initial mutual information $I(\tilde X; \tilde Y)$;
at each iteration of the algorithm, we perform four operations
(see the sketch after this list):
Split step: randomly split each cluster $\tilde x_i$ into two equally
sized clusters
Sequential pass: reassign each data point $x_j$ to the cluster that
maximizes $I(\tilde X; \tilde Y)$
Merge step: randomly select each cluster $\tilde y_i$ and find its best
merge mate, applying a criterion for minimizing the Bayes
classification error
Another sequential pass: the same sequential pass as step 2
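The sketch below is one self-contained reading of these four operations, not the authors' implementation: X is the top-down (split) system and Y the bottom-up (merged) system, exhaustive recomputation of $I(\tilde X; \tilde Y)$ replaces the paper's incremental updates, and a simple information-loss merge criterion stands in for the Bayes-error one:

```python
import numpy as np

def mi(p):
    """Mutual information of a joint distribution given as a matrix."""
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def joint(p_xy, xa, ya, n_xc, n_yc):
    """P(x~, y~) induced by hard row (xa) and column (ya) assignments."""
    p = np.zeros((n_xc, n_yc))
    np.add.at(p, (xa[:, None], ya[None, :]), p_xy)
    return p

def sequential_pass(p_xy, xa, ya, n_xc, n_yc):
    """Steps 2 and 4: reassign each x_j to the cluster maximizing I(X~;Y~)."""
    for j in range(len(xa)):
        scores = []
        for c in range(n_xc):
            xa[j] = c
            scores.append(mi(joint(p_xy, xa, ya, n_xc, n_yc)))
        xa[j] = int(np.argmax(scores))
    return xa

def acdc_iteration(p_xy, xa, ya, rng):
    # Step 1 (split): randomly split each cluster x~_i into two equal halves
    for c in range(xa.max() + 1):
        members = np.where(xa == c)[0]
        xa[rng.permutation(members)[: len(members) // 2]] = xa.max() + 1
    n_xc, n_yc = xa.max() + 1, ya.max() + 1
    # Step 2: sequential pass over the x side
    xa = sequential_pass(p_xy, xa, ya, n_xc, n_yc)
    # Step 3 (merge): join the pair of y~ clusters that keeps I(X~;Y~) highest
    best, best_mi = None, -np.inf
    for a in range(n_yc):
        for b in range(a + 1, n_yc):
            m = mi(joint(p_xy, xa, np.where(ya == b, a, ya), n_xc, n_yc))
            if m > best_mi:
                best, best_mi = (a, b), m
    a, b = best
    ya = np.where(ya == b, a, ya)
    ya[ya > b] -= 1  # keep cluster ids contiguous
    # Step 4: another sequential pass, identical to step 2
    return sequential_pass(p_xy, xa, ya, n_xc, n_yc - 1), ya
```

With the initialization from the slide (`xa = np.zeros(n_x, dtype=int)`, `ya = np.arange(n_y)`), repeating this iteration grows the top-down system and shrinks the bottom-up one; the exhaustive recomputation here is what the paper avoids with incremental updates to reach the complexity quoted on the next slide.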
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Overview of algorithm
To approach the global maximum, perform a number of
random restarts of steps 1-2 and then of steps 3-4
Computational complexity is $O(N_x N_y \log N_y)$
In this case, we use the top-down scheme for clustering
words and the bottom-up scheme for clustering documents
Continue the process until we have three document clusters
(one of which is then chosen to be the class of relevant
pages)
Method - LS+A/CDC Hybrid Model
Hybrid model
Overlap the groups built by the two methods
Compose a new central cluster $C_0^*$ by uniting all the
connected components that overlap with $C^*$:
$C_0^* = \bigcup_{\, C_i \cap C^* \neq \emptyset,\ i = 0..M} C_i$
Discrimination function f is defined as
$f(d, h \mid M(K)) = \begin{cases} 1, & \text{if } d \in C_i : C_i \subseteq C_0^*,\ i = 0..M \\ 0, & \text{otherwise} \end{cases}$
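A short sketch of the hybrid combination, assuming the link-structure components and the A/CDC winner $C^*$ have already been computed:

```python
def hybrid_central_cluster(components, c_star):
    """C0* = the union of all connected components C_i that overlap C*."""
    return set().union(*(c for c in components if c & c_star))

def f(d, c0_star):
    """f(d, h | M(K)) = 1 iff the page d falls inside C0*."""
    return 1 if d in c0_star else 0
```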
Dataset
1085 Web pages
12 person names were extracted from Melinda Gervasio's
email directory and issued as queries to Google; for each
query, the first 100 hits were retrieved
The dataset is publicly available
Results and Discussion
Baseline model
greedy agglomerative clustering based on cosine similarity
between clusters and tf-idf weighting (a rough scikit-learn
approximation is sketched after this slide)
A/CDC method
The relatively high deviation in precision and recall is
caused by the fact that the algorithm never ends up with
clusters of exactly the same size
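For reference, a rough scikit-learn approximation of such a baseline (average-link agglomerative clustering over cosine distances of tf-idf vectors); this is an illustration only, since the authors' tf-idf variant differs as shown earlier:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def baseline_clusters(texts, n_clusters):
    """Agglomerative clustering of pages with cosine distance on tf-idf."""
    vectors = TfidfVectorizer().fit_transform(texts)
    distances = cosine_distances(vectors)
    # scikit-learn >= 1.2 uses `metric`; older versions call it `affinity`
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed",
                                    linkage="average")
    return model.fit_predict(distances)
```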
Results and Discussion (cont.)
Bill Mark and Fernando Pereira are hard cases because both
of them have the "doubles" problem
Steve Hardt and Adam Cheyer show the worst recall
Adam's name often appears in an industrial context
Most of Steve's pages refer to an online game he created
Results and Discussion (cont.)
This result shows that the algorithm stopped with 3, 5, 9,
or 17 clusters
With 5 clusters, the "doubles" can be handled within the
A/CDC framework
Constructing clustering
systems with all possible
granularity levels is an
important feature of the
A/CDC algorithm
It can also solve the "homepage finding" task
Conclusions and Future Work
The first attempt to approach the problem of finding Web
appearances of a group of people
Proposed two purely unsupervised methods that involve a
minimum of prior knowledge about people
Built a large annotated dataset that is public
Working on more sophisticated probabilistic models that
would capture the relational structure of this problem
Web appearance disambiguation is novel and poses a lot of
exciting challenges