Disambiguating Web Appearances of
People in a Social Network
Ron Bekkerman, Andrew McCallum
University of Massachusetts
WWW 2005 (Chiba, Japan)
Abstract
Looking for information about a particular person
The namesakes problem
Multiple people who are related in some way
Two unsupervised methods
One based on the link structure of the Web pages
Another using Agglomerative/Conglomerative Double
Clustering (A/CDC)
Dataset
Over 1000 Web pages retrieved from Google queries on 12
personal names appearing together in someone’s email
folder
Outperform traditional agglomerative clustering by more
than 20%, achieving over 80% F-measure
Introduction
Personalized tool that manages our social network
Tracks people we already know
Tells us about new people we meet
e.g., when we receive email messages from people, it can rapidly
search for any public facts about them
Public information about a person
A useful summary of public information about a person can be
gathered from the Web: news articles, corporate pages, university
pages, discussion forums, etc.
But how do we identify whether a given Web page is about
the person in question (a relevant page) or about a different
person with the same name?
Introduction (cont.)
Example:
David Mulford, the US Ambassador
most of the pages retrieved are actually related to the
Ambassador; however, there are also two business
managers, a musician, a student, a scientist, and a few
others
The task is to filter out information about the other namesakes
Previous Work
Automatically populating a database of contact information for
people in a user's social network
Homepage finding is used to extract institution, job title, address,
phone, fax, and email, with simple heuristics for
disambiguating person names, which sometimes fail
Introduction (cont.)
This paper
Finding all search engine hits about a person and separating
them from those of namesakes
Looks beyond homepages
Presents two statistical frameworks: one based on link
structure, the other based on the recently introduced multi-way
distributional clustering method
Rather than searching for people individually, we leverage an
existing social network of people who are known to be
somewhat connected, and use this extra information to aid the
disambiguation
Problem statement and related work
Problem statement
Provide a function f that answers whether or not a Web page d
refers to a particular person h, given a model M and
background knowledge K
Background Knowledge
Perfect background knowledge is unavailable
K can include training data: pages that are related or
unrelated to the person, though obtaining negative instances
can be much more difficult
Problem statement and related work
(cont.)
Related Work
The problem of disambiguating collections of Web
appearances has been explored surprisingly little
Homepage finding
AHOY! (1997)
TREC homepage finding competition in 2002
primarily use heuristics and pattern matching
Cross-document co-reference and name resolution
All use average-link clustering methods
Bagga and Baldwin use agglomerative clustering over the
vector space model (VSM)
Fleischman and Hovy construct a MaxEnt classifier to learn
distances between documents, which are then clustered
Methods - Link Structure Model
Application scenario
Given a group of people H = {h_1,…,h_N} who are related to
each other, we would like to identify the Web presence of all
of them simultaneously
Important observation
Web pages of a group of acquaintances are likely to be
interconnected; the term "interconnectedness" must be
defined
Construct a model M given the set of Web pages D
D is constructed by providing a search engine with queries
$t_{h_1},\dots,t_{h_N}$ and retrieving the top K hits of each query,
yielding N·K Web pages overall
Methods - Link Structure Model (cont.)
$G_{LS} = (V, E)$: the Link Structure Graph
The Maximal Connected Component (MCC) is the core of the model
Central cluster $C_0$: the largest connected component,
consisting of pages retrieved by more than one query
The Link Structure Model $M_{LS}$ is a pair (C, δ)
C = {C_1,…,C_M} (note that $C_0 \in C$)
δ: a distance threshold
Discrimination function f is defined as
$f(d, h \mid M_{LS}(K)) = \begin{cases} 1, & \text{if } d \in C_i \text{ with } \mathrm{dist}(C_i, C_0) \le \delta,\ i = 0,\dots,M \\ 0, & \text{otherwise} \end{cases}$
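A minimal sketch of this discrimination step, under the assumption that pages and pairwise linkage decisions are already available; `linked` and `dist` are placeholders for the design choices discussed on the next slide, and `networkx` is used only for the graph bookkeeping:

```python
import itertools
import networkx as nx

def link_structure_accept(pages, linked, dist, delta):
    """pages: dict mapping page id -> set of queries that retrieved it.
    Builds G_LS, extracts connected components, picks the central
    cluster C0, and accepts every component within distance delta of C0."""
    g = nx.Graph()
    g.add_nodes_from(pages)
    for d1, d2 in itertools.combinations(pages, 2):
        if linked(d1, d2):
            g.add_edge(d1, d2)
    components = [set(c) for c in nx.connected_components(g)]
    # C0: the largest component containing pages retrieved by >1 query
    multi = [c for c in components
             if len(set().union(*(pages[d] for d in c))) > 1]
    c0 = max(multi, key=len)  # assumes such a component exists
    # f(d, h | M_LS(K)) = 1 exactly for the pages collected here
    return set().union(*(c for c in components if dist(c, c0) <= delta))
```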
Methods - Link Structure Model (cont.)
Particular design choices (vary from system to system)
How to decide whether two pages are linked or not
Directly linked, reachable within three links, or belonging to
the same organization
How to decide a suitable value of δ
How to calculate the distance between two clusters C0 and Ci
Cosine similarity or Kullback-Leibler divergence
Methods - Link Structure Model (cont.)
In their experiment
Linked pages
url(d): the domain name of d together with its first directory name
links(d): the set of URLs that occur in d
Trusted URLs: TR(D) = {url(d_i)} \ POP, where POP is a list of
popular Web sites that are excluded
Link structure: LS(d) = (links(d) ∩ TR(D)) ∪ {url(d)}
Two pages d_1 and d_2 are linked to each other if their link
structures intersect, that is, LS(d_1) ∩ LS(d_2) ≠ ∅
Distance threshold
Set δ so that one third of the pages in the dataset fall within
the threshold
Distance measure between clusters
Cosine similarity with a variation of tf-idf term weighting:
$\mathrm{tfidf}(w) = \dfrac{\mathrm{tf}(w)}{\log\, \mathrm{google\_df}(w)}$
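A sketch of these specific choices. The contents of `POP` (the stoplist of popular sites) and the document-frequency numbers behind `google_df` are assumptions standing in for data the authors obtain externally:

```python
import math
from urllib.parse import urlparse

POP = {"amazon.com", "geocities.com"}  # hypothetical popularity stoplist

def url_key(u):
    """url(d): the domain name plus the first directory name of the path."""
    p = urlparse(u)
    first_dir = p.path.strip("/").split("/")[0]
    return f"{p.netloc}/{first_dir}"

def trusted_urls(all_urls):
    """TR(D) = {url(d_i)} \\ POP"""
    return {url_key(u) for u in all_urls if urlparse(u).netloc not in POP}

def link_structure(page_url, outgoing_links, trusted):
    """LS(d) = (links(d) ∩ TR(D)) ∪ {url(d)}"""
    return ({url_key(u) for u in outgoing_links} & trusted) | {url_key(page_url)}

def linked(ls1, ls2):
    """Two pages are linked iff their link structures intersect."""
    return bool(ls1 & ls2)

def tfidf(tf, google_df):
    """The slide's weighting variant: tfidf(w) = tf(w) / log(google_df(w))."""
    return tf / math.log(google_df)
```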
Methods - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model
Clustering Model $M_{CL}$
a pair (C, L(·)), where C is the set of clusters of documents
in D, and L(·) is the interconnectedness measure of a
cluster
Discrimination function f is defined as follows
$f(d, h \mid M_{CL}(K)) = \begin{cases} 1, & \text{if } d \in C^* : C^* = \arg\max_{i=1..M} L(C_i) \\ 0, & \text{otherwise} \end{cases}$
Apply the A/CDC algorithm, an instance of the new multi-way distributional clustering method
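A minimal sketch of this discrimination function; the interconnectedness measure L(·) is kept abstract here, since the slides do not spell out its exact definition:

```python
def acdc_accept(clusters, L):
    """C* = argmax_{i=1..M} L(C_i); f(d, h | M_CL(K)) = 1 iff d is in C*."""
    c_star = max(clusters, key=L)
    return lambda d: 1 if d in c_star else 0
```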
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Main idea of A/CDC
Employ the fact that similar documents have similar
distributions over words, while similar words are similarly
distributed over documents
Starting with one cluster containing all words and many
clusters of one document each, we iteratively split word
clusters and merge document clusters, conditioning each
clustering system on the other, until meaningful clusters
are obtained
Multi-way distributional clustering stands in close
correspondence with the Multivariate Information Bottleneck
(MIB) method
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Background
Information Bottleneck (IB)
a convenient information-theoretic framework for solving
clustering and information retrieval problems
The main idea behind IB clustering is to construct an
assignment of data points X into clusters $\tilde X$ that
maximizes the information about entities Y that are
interdependent with X
The information about Y gained from $\tilde X$ is
$I(\tilde X; Y) = \sum_{\tilde X, Y} P(\tilde X, Y) \log \dfrac{P(\tilde X, Y)}{P(\tilde X)\, P(Y)}$
Adding a compression constraint, the IB problem is stated as
$\arg\max_{\tilde X}\ \big[\, I(\tilde X; Y) - \beta\, I(X; \tilde X) \,\big]$
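As a concrete reading of these two formulas, the sketch below computes $I(\tilde X; Y)$ from a joint distribution matrix and evaluates the IB objective for a hard cluster assignment; all names are illustrative:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} P(x,y) log( P(x,y) / (P(x)P(y)) )"""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (px @ py)[nz])).sum())

def ib_objective(p_xy, assign, beta):
    """I(X~;Y) - beta * I(X;X~) for a hard assignment of each x to a cluster."""
    n_clusters = assign.max() + 1
    # P(x~, y): add up the rows of P(x, y) that fall into the same cluster
    p_cy = np.zeros((n_clusters, p_xy.shape[1]))
    np.add.at(p_cy, assign, p_xy)
    # P(x, x~): the assignment is deterministic, so the joint equals P(x)
    # where x matches its cluster and is zero elsewhere
    p_x = p_xy.sum(axis=1)
    p_xc = np.zeros((len(p_x), n_clusters))
    p_xc[np.arange(len(p_x)), assign] = p_x
    return mutual_information(p_cy) - beta * mutual_information(p_xc)
```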
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Background
Slonim and Tishby proposed a greedy agglomerative
algorithm for document clustering based on IB, where X
stands for documents and Y for the words in them, and
proposed the double clustering technique
Friedman et al. propose the Multivariate Information
Bottleneck (MIB) framework: they consider clustering
instances of a set of variables X = (X_1,…,X_n) into a set of
clustering systems $\tilde X = (\tilde X_1,\dots,\tilde X_n)$
The double clustering problem thus becomes a special case
of MIB and can be derived as
$\arg\max_{\tilde X, \tilde Y}\ \big[\, I(\tilde X; \tilde Y) - \beta\, ( I(X; \tilde X) + I(Y; \tilde Y) ) \,\big]$
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Motivation
Setting the Lagrange multiplier β to zero, the double clustering
objective is then derived as
$\arg\max_{\tilde X, \tilde Y}\ I(\tilde X; \tilde Y), \quad \text{subject to } |\tilde X| = N_{\tilde X},\ |\tilde Y| = N_{\tilde Y}$
Explore different possibilities while exploiting the hierarchical
structure of the clusters: agglomerative (bottom-up) and
conglomerative (top-down) clustering
Two top-down schemes: bad, as random splits lead to a
completely random clustering
Two bottom-up schemes: bad, because of computational issues
One top-down and one bottom-up: feasible, as the two systems
can "bootstrap" each other
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Motivation
Use one top-down and one bottom-up scheme
A/CDC is the simultaneous clustering of X by a top-down
scheme and Y by a bottom-up scheme
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Overview of algorithm
Break the objective function into two parts:
$\arg\max_{\tilde X}\ I(\tilde X; \tilde Y) \quad\text{and}\quad \arg\max_{\tilde Y}\ I(\tilde X; \tilde Y)$
Initialize the two clustering systems with one cluster $\tilde x$ that
contains all data points x and one cluster $\tilde y_i$ per data point
$y_i$, and calculate the initial mutual information $I(\tilde X; \tilde Y)$;
at each iteration of the algorithm, we perform four operations
(see the sketch after this list):
Split step: randomly split each cluster $\tilde x_i$ into two equally
sized clusters
Sequential pass: reassign each data point $x_j$ to the cluster that
maximizes $I(\tilde X; \tilde Y)$
Merge step: randomly select each cluster $\tilde y_i$ and find its best
merge mate, applying a criterion for minimizing the Bayes
classification error
Another sequential pass: the same sequential pass as step 2
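The sketch below is one self-contained reading of these four operations, not the authors' implementation: X is the top-down (split) system and Y the bottom-up (merged) system, exhaustive recomputation of $I(\tilde X; \tilde Y)$ replaces the paper's incremental updates, and a simple information-loss merge criterion stands in for the Bayes-error one:

```python
import numpy as np

def mi(p):
    """Mutual information of a joint distribution given as a matrix."""
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def joint(p_xy, xa, ya, n_xc, n_yc):
    """P(x~, y~) induced by hard row (xa) and column (ya) assignments."""
    p = np.zeros((n_xc, n_yc))
    np.add.at(p, (xa[:, None], ya[None, :]), p_xy)
    return p

def sequential_pass(p_xy, xa, ya, n_xc, n_yc):
    """Steps 2 and 4: reassign each x_j to the cluster maximizing I(X~;Y~)."""
    for j in range(len(xa)):
        scores = []
        for c in range(n_xc):
            xa[j] = c
            scores.append(mi(joint(p_xy, xa, ya, n_xc, n_yc)))
        xa[j] = int(np.argmax(scores))
    return xa

def acdc_iteration(p_xy, xa, ya, rng):
    # Step 1 (split): randomly split each cluster x~_i into two equal halves
    for c in range(xa.max() + 1):
        members = np.where(xa == c)[0]
        xa[rng.permutation(members)[: len(members) // 2]] = xa.max() + 1
    n_xc, n_yc = xa.max() + 1, ya.max() + 1
    # Step 2: sequential pass over the x side
    xa = sequential_pass(p_xy, xa, ya, n_xc, n_yc)
    # Step 3 (merge): join the pair of y~ clusters that keeps I(X~;Y~) highest
    best, best_mi = None, -np.inf
    for a in range(n_yc):
        for b in range(a + 1, n_yc):
            m = mi(joint(p_xy, xa, np.where(ya == b, a, ya), n_xc, n_yc))
            if m > best_mi:
                best, best_mi = (a, b), m
    a, b = best
    ya = np.where(ya == b, a, ya)
    ya[ya > b] -= 1  # keep cluster ids contiguous
    # Step 4: another sequential pass, identical to step 2
    return sequential_pass(p_xy, xa, ya, n_xc, n_yc - 1), ya
```

With the initialization from the slide (`xa = np.zeros(n_x, dtype=int)`, `ya = np.arange(n_y)`), repeating this iteration grows the top-down system and shrinks the bottom-up one; the exhaustive recomputation here is what the paper avoids with incremental updates to reach the complexity quoted on the next slide.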
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)
Overview of algorithm
To approach the global maximum, perform a number of
random restarts of steps 1-2 and then of steps 3-4
Computational complexity is $O(N_x N_y \log N_y)$
In this case, we use the top-down scheme for clustering
words and the bottom-up scheme for clustering documents
Continue the process until we have three document clusters
(one of which is then chosen to be the class of relevant
pages)
Method - LS+A/CDC Hybrid Model
Hybrid model
Overlap the groups built by the two methods
Compose a new central cluster $C_0^*$ by uniting all the
connected components that overlap with $C^*$:
$C_0^* = \bigcup_{\, C_i \cap C^* \neq \emptyset,\ i = 0..M} C_i$
Discrimination function f is defined as
$f(d, h \mid M(K)) = \begin{cases} 1, & \text{if } d \in C_i : C_i \subseteq C_0^*,\ i = 0..M \\ 0, & \text{otherwise} \end{cases}$
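A short sketch of the hybrid combination, assuming the link-structure components and the A/CDC winner $C^*$ have already been computed:

```python
def hybrid_central_cluster(components, c_star):
    """C0* = the union of all connected components C_i that overlap C*."""
    return set().union(*(c for c in components if c & c_star))

def f(d, c0_star):
    """f(d, h | M(K)) = 1 iff the page d falls inside C0*."""
    return 1 if d in c0_star else 0
```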
Dataset
1085 Web pages
12 person names were extracted from Melinda Gervasio's
email directory and issued as queries to Google; for each
query, the first 100 hits were retrieved
The dataset is publicly available
Results and Discussion
Baseline model
greedy agglomerative clustering based on cosine similarity
between clusters and tf-idf weighting (a rough scikit-learn
approximation is sketched after this slide)
A/CDC method
The relatively high deviation in precision and recall is
caused by the fact that the algorithm never ends up with
clusters of exactly the same size
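For reference, a rough scikit-learn approximation of such a baseline (average-link agglomerative clustering over cosine distances of tf-idf vectors); this is an illustration only, since the authors' tf-idf variant differs as shown earlier:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def baseline_clusters(texts, n_clusters):
    """Agglomerative clustering of pages with cosine distance on tf-idf."""
    vectors = TfidfVectorizer().fit_transform(texts)
    distances = cosine_distances(vectors)
    # scikit-learn >= 1.2 uses `metric`; older versions call it `affinity`
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed",
                                    linkage="average")
    return model.fit_predict(distances)
```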
Results and Discussion (cont.)
Bill Mark and Fernando Pereira are hard cases because both
of them have the "doubles" problem
Steve Hardt and Adam Cheyer show the worst recall
Adam's name often appears in an industrial context
Most of Steve's pages refer to an online game he created
Results and Discussion (cont.)
This result shows that the algorithm stopped with 3, 5, 9,
or 17 clusters
With 5 clusters, the "doubles" can be handled within the
A/CDC framework
Constructing clustering
systems with all possible
granularity levels is an
important feature of the
A/CDC algorithm
It can also solve the "homepage finding" task
Conclusions and Future Work
The first attempt to approach the problem of finding Web
appearances of a group of people
Proposed two purely unsupervised methods that involve a
minimum of prior knowledge about people
Built a large annotated dataset that is public
Working on more sophisticated probabilistic models that
would capture the relational structure of this problem
Web appearance disambiguation is novel and poses a lot of
exciting challenges