Probabilistic Models in Information Retrieval
SI650: Information Retrieval, Winter 2010
School of Information, University of Michigan
Most slides are from Prof. ChengXiang Zhai's lectures

Essential Probability & Statistics

Prob/Statistics & Text Management
• Probability & statistics provide a principled way to quantify the uncertainties associated with natural language
• Allow us to answer questions like:
  – Given that we observe "baseball" three times and "game" once in a news article, how likely is it about "sports"? (text categorization, information retrieval)
  – Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)

Basic Concepts in Probability
• Random experiment: an experiment with an uncertain outcome (e.g., flipping a coin, picking a word from text)
• Sample space: all possible outcomes, e.g.,
  – Tossing 2 fair coins: S = {HH, HT, TH, TT}
• Event: a subset of the sample space, $E \subseteq S$; E happens iff the outcome is in E, e.g.,
  – E = {HH} (all heads)
  – E = {HH, TT} (same face)
  – Impossible event ({}), certain event (S)
• Probability of an event: $0 \le P(E) \le 1$, such that
  – P(S) = 1 (the outcome is always in S)
  – $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$ (e.g., A = same face, B = different face)

Example: Flip a Coin
• Sample space: {Head, Tail}
• Fair coin:
  – p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
  – p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
  – Sample space: {HH, HT, TH, TT}
• Example in modeling text:
  – Flip a coin to decide whether or not to include a word in a document
  – Sample space = {appear, absent}

Example: Toss a Die
• Sample space: S = {1, 2, 3, 4, 5, 6}
• Fair die:
  – p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die: p(1) = 0.3, p(2) = 0.2, ...
• N-sided die:
  – S = {1, 2, 3, 4, …, N}
• Example in modeling text:
  – Toss a die to decide which word to write in the next position
  – Sample space = {cat, dog, tiger, …}

Basic Concepts of Prob. (cont.)
• Joint probability: P(AB), also written as P(A, B)
• Conditional probability: P(B|A) = P(AB)/P(A)
  – P(AB) = P(A)P(B|A) = P(B)P(A|B)
  – So, P(A|B) = P(B|A)P(A)/P(B) (Bayes' Rule)
  – For independent events, P(AB) = P(A)P(B), so P(A|B) = P(A)
• Total probability: if A1, …, An form a partition of S, then
  – P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?)
  – So, $P(A_i|B) = \frac{P(B|A_i)P(A_i)}{P(B)} = \frac{P(B|A_i)P(A_i)}{P(B|A_1)P(A_1) + \dots + P(B|A_n)P(A_n)}$
  – This allows us to compute P(Ai|B) based on P(B|Ai)

Revisit: Bayes' Rule
• Hypothesis space: H = {H1, …, Hn}; evidence: E
  $$P(H_i|E) = \frac{P(E|H_i)\,P(H_i)}{P(E)}$$
  – P(Hi|E): posterior probability of Hi; P(Hi): prior probability of Hi; P(E|Hi): likelihood of the data/evidence if Hi is true
• In text classification: H is the class space; E is the data (features)
• If we want to pick the most likely hypothesis H*, we can drop P(E):
  $$P(H_i|E) \propto P(E|H_i)\,P(H_i)$$

Random Variable
• X: S → ℝ (a "measure" of the outcome)
  – E.g., number of heads, all same face?, …
• Events can be defined according to X
  – E(X = a) = {s_i | X(s_i) = a}
  – E(X ≥ a) = {s_i | X(s_i) ≥ a}
• So, probabilities can be defined on X
  – P(X = a) = P(E(X = a))
  – P(X ≥ a) = P(E(X ≥ a))
• Discrete vs. continuous random variable (think of "partitioning the sample space")

Getting to Statistics ...
• We are flipping an unfair coin, but P(Head) = ? (parameter estimation)
  – If we see the results of a huge number of random experiments, then
    $$\hat{P}(Head) = \frac{count(Heads)}{count(Flips)} = \frac{count(Heads)}{count(Heads) + count(Tails)}$$
  – But what if we only see a small sample (e.g., 2)? Is this estimate still reliable? If we flip twice and get two tails, does that mean P(Head) = 0?
• In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
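To make the estimation question above concrete, here is a minimal Python sketch, not from the slides, that computes the maximum-likelihood estimate of P(Head) from a sequence of flips; the data and function name are illustrative assumptions. It shows how a tiny sample can drive the estimate all the way to zero.

```python
import random
from collections import Counter

def ml_estimate_head(flips):
    """Maximum-likelihood estimate of P(Head): count(Heads) / count(Flips)."""
    counts = Counter(flips)
    return counts["H"] / len(flips)

# A large sample from an unfair coin with true P(Head) = 0.3
random.seed(0)
big_sample = random.choices(["H", "T"], weights=[0.3, 0.7], k=100_000)
print(ml_estimate_head(big_sample))   # close to 0.3

# A tiny sample: two tails -> the ML estimate is 0, even though P(Head) > 0
print(ml_estimate_head(["T", "T"]))   # 0.0
```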
Parameter Estimation
• General setting:
  – Given a (hypothesized & probabilistic) model that governs the random experiment
  – The model gives a probability of any data, p(D|θ), that depends on the parameter θ
  – Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?
• Intuitively, take your best guess of θ -- "best" means "best explaining/fitting the data"
• Generally an optimization problem

Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
  – "Best" means "data likelihood reaches maximum": $\hat{\theta} = \arg\max_{\theta} P(X|\theta)$
  – Problem: small samples
• Bayesian estimation
  – "Best" means being consistent with our "prior" knowledge and explaining the data well:
    $\hat{\theta} = \arg\max_{\theta} P(\theta|X) = \arg\max_{\theta} P(X|\theta)\,P(\theta)$
  – Problem: how to define the prior?

Illustration of Bayesian Estimation
[Figure: the prior p(θ), the likelihood p(X|θ) for data X = (x1, …, xN), and the posterior p(θ|X) ∝ p(X|θ)p(θ), with the prior mode, the posterior mode, and the ML estimate θ_ml marked on the θ axis.]

Maximum Likelihood Estimate
• Data: a document d with counts c(w1), …, c(wN) and length |d|
• Model: multinomial distribution M with parameters {p(wi)}
• Likelihood: p(d|M)
• Maximum likelihood estimator: M = argmax_M p(d|M), which gives
  $$p(w_i) = \frac{c(w_i)}{|d|}$$
• Problem: a document is short; if c(wi) = 0, does this mean p(wi) = 0?

If you want to know how we get it…
• Data: a document d with counts c(w1), …, c(wN) and length |d|; model: multinomial M with parameters θi = p(wi)
  $$p(d|M) = \binom{|d|}{c(w_1) \cdots c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)}, \quad \text{where } \theta_i = p(w_i) \text{ and } \sum_{i=1}^{N} \theta_i = 1$$
• We'll tune the p(wi) to maximize the log-likelihood (dropping the constant multinomial coefficient):
  $$l(d|M) = \log p(d|M) = \sum_{i=1}^{N} c(w_i) \log \theta_i$$
• Use the Lagrange multiplier approach to enforce the constraint $\sum_i \theta_i = 1$:
  $$l'(d|M) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$$
• Set the partial derivatives to zero:
  $$\frac{\partial l'}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda}$$
• Since $\sum_{i=1}^{N} \theta_i = 1$ and $\sum_{i=1}^{N} c(w_i) = |d|$, we get $\lambda = -|d|$, so
  $$\theta_i = p(w_i) = \frac{c(w_i)}{|d|} \quad \text{(ML estimate)}$$

Statistical Language Models

What is a Statistical Language Model?
• A probability distribution over word sequences
  – p("Today is Wednesday") ≈ 0.001
  – p("Today Wednesday is") ≈ 0.0000000000001
  – p("The eigenvalue is positive") ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model

Why is a Language Model Useful?
• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
  – Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
  – Given that we observe "baseball" three times and "game" once in a news article, how likely is it about "sports"? (text categorization, information retrieval)
  – Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)
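The ML result p(wi) = c(wi)/|d| derived above is easy to see in code. A small Python sketch follows, using toy tokens that are not from the slides; it also shows the zero-probability problem for unseen words that the slides raise next.

```python
from collections import Counter

def ml_unigram(document_tokens):
    """ML estimate of a unigram model: p(w) = c(w, d) / |d|."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {w: c / total for w, c in counts.items()}

doc = "text mining text retrieval mining text".split()
model = ml_unigram(doc)
print(model)                    # {'text': 0.5, 'mining': 0.333..., 'retrieval': 0.166...}
print(model.get("data", 0.0))   # 0.0 -- an unseen word gets zero probability (motivates smoothing)
```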
Source-Channel Framework (Communication System)
[Diagram: Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination; the source emits X with P(X), the channel produces Y with P(Y|X), and the receiver recovers X' by computing P(X|Y).]
  $$\hat{X} = \arg\max_{X} p(X|Y) = \arg\max_{X} p(Y|X)\,p(X)$$
• When X is text, p(X) is a language model

Speech Recognition
[Diagram: Speaker → words W with P(W) → acoustic signal A via the noisy channel P(A|W) → Recognizer produces the recognized words using P(W|A).]
• Given acoustic signal A, find the word sequence W:
  $$\hat{W} = \arg\max_{W} p(W|A) = \arg\max_{W} p(A|W)\,p(W)$$
  – p(A|W): acoustic model; p(W): language model

Machine Translation
[Diagram: English speaker → English words E with P(E) → Chinese words C via the noisy channel P(C|E) → Translator produces the English translation using P(E|C).]
• Given Chinese sentence C, find its English translation E:
  $$\hat{E} = \arg\max_{E} p(E|C) = \arg\max_{E} p(C|E)\,p(E)$$
  – p(C|E): English→Chinese translation model; p(E): English language model

Spelling/OCR Error Correction
[Diagram: Original text → original words O with P(O) → "erroneous" words E via the noisy channel P(E|O) → Corrector produces the corrected text using P(O|E).]
• Given corrupted text E, find the original text O:
  $$\hat{O} = \arg\max_{O} p(O|E) = \arg\max_{O} p(E|O)\,p(O)$$
  – p(E|O): spelling/OCR error model; p(O): "normal" language model

Basic Issues
• Define the probabilistic model
  – Events, random variables, joint/conditional probabilities
  – P(w1 w2 … wn) = f(θ1, θ2, …, θm)
• Estimate model parameters
  – Tune the model to best fit the data and our prior knowledge
  – θi = ?
• Apply the model to a particular task
  – Many applications

The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 … wn) = p(w1) p(w2) … p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution

Modeling Text As Tossing a Die
• Sample space: S = {w1, w2, w3, …, wN}
• The die is unfair:
  – P(cat) = 0.3, P(dog) = 0.2, ...
  – Toss the die to decide which word to write in the next position
• E.g., sample space = {cat, dog, tiger}, with P(cat) = 0.3, P(dog) = 0.5, P(tiger) = 0.2
  – Text = "cat cat dog"
  – P(Text) = P(cat) × P(cat) × P(dog) = 0.3 × 0.3 × 0.5 = 0.045

Text Generation with Unigram LM
[Diagram: sampling documents from (unigram) language models p(w|θ). Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001 → a sampled "text mining paper". Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 → a sampled "food nutrition paper".]

Estimation of Unigram LM
[Diagram: estimating p(w|θ) from a document. A "text mining paper" with 100 words in total and counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1 yields the estimates 10/100, 5/100, 3/100, 3/100, …, 1/100.]

Empirical distribution of words
• There are stable, language-independent patterns in how people use natural languages
• A few words occur very frequently; most occur rarely. E.g., in news articles:
  – Top 4 words: 10~15% of word occurrences
  – Top 50 words: 35~40% of word occurrences
• The most frequent word in one corpus may be rare in another

Revisit: Zipf's Law
• rank × frequency ≈ constant:
  $$F(w) = \frac{C}{r(w)^{\alpha}}, \quad \alpha \approx 1, \; C \approx 0.1$$
[Figure: word frequency vs. word rank (by frequency). The mid-frequency range is marked as the most useful words (Luhn 57); the rare tail raises the question "Is 'too rare' a problem?"; the high-frequency head (stop words) accounts for the biggest data structure.]
• Generalized Zipf's law, applicable in many domains:
  $$F(w) = \frac{C}{[r(w) + B]^{\alpha}}$$
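Here is a minimal Python sketch of the unigram "tossing a die" model described above; it reproduces the worked example P("cat cat dog") = 0.3 × 0.3 × 0.5 = 0.045. The dictionary and function name are illustrative, not from the slides.

```python
import math

def unigram_log_prob(text_tokens, word_probs):
    """Log-probability of a token sequence under a unigram model:
    log p(w1 ... wn) = sum_i log p(wi)."""
    return sum(math.log(word_probs[w]) for w in text_tokens)

word_probs = {"cat": 0.3, "dog": 0.5, "tiger": 0.2}
text = ["cat", "cat", "dog"]
print(math.exp(unigram_log_prob(text, word_probs)))  # ~0.045
```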
Problem with the ML Estimator
• What if a word doesn't appear in the text?
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
• This is what "smoothing" is about …

Language Model Smoothing (Illustration)
[Figure: P(w) plotted over words w. The maximum likelihood estimate $p_{ML}(w) = \frac{\text{count of } w}{\text{count of all words}}$ drops to zero for unseen words; the smoothed LM is a flatter curve that gives every word a non-zero probability.]

How to Smooth?
• All smoothing methods try to
  – discount the probability of words seen in a document
  – re-allocate the extra counts so that unseen words will have a non-zero count
• Method 1 (Additive smoothing): add a constant to the counts of each word; adding one gives "add one" (Laplace) smoothing:
  $$p(w|d) = \frac{c(w, d) + 1}{|d| + |V|}$$
  – c(w, d): count of w in d; |d|: length of d (total counts); |V|: vocabulary size
• Problems?

How to Smooth? (cont.)
• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:
  $$p(w|d) = \begin{cases} p_{seen}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w|REF) & \text{otherwise} \end{cases}$$
  – p_seen(w|d): discounted ML estimate; p(w|REF): reference language model
  – Normalization determines αd: $\sum_{w \text{ seen}} p_{seen}(w|d) + \alpha_d \sum_{w \text{ unseen}} p(w|REF) = 1$

Other Smoothing Methods
• Method 2 (Absolute discounting): subtract a constant δ from the counts of each word
  $$p(w|d) = \frac{\max(c(w, d) - \delta,\, 0) + \delta |d|_u \, p(w|REF)}{|d|}, \quad \alpha_d = \frac{\delta |d|_u}{|d|}$$
  – |d|_u: number of unique words in d
• Method 3 (Linear interpolation, Jelinek-Mercer): "shrink" uniformly toward p(w|REF)
  $$p(w|d) = (1 - \lambda) \frac{c(w, d)}{|d|} + \lambda \, p(w|REF), \quad \alpha_d = \lambda$$
  – c(w, d)/|d|: ML estimate; λ: parameter

Other Smoothing Methods (cont.)
• Method 4 (Dirichlet prior / Bayesian): assume μ pseudo counts distributed as p(w|REF)
  $$p(w|d) = \frac{c(w, d) + \mu \, p(w|REF)}{|d| + \mu} = \frac{|d|}{|d| + \mu} \cdot \frac{c(w, d)}{|d|} + \frac{\mu}{|d| + \mu} \, p(w|REF), \quad \alpha_d = \frac{\mu}{|d| + \mu}$$
  – μ: parameter
• Method 5 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of seen events in the same way
  $$p(w|d) = \frac{c^*(w, d)}{|d|}, \quad c^*(w, d) = (r + 1)\,\frac{n_{r+1}}{n_r} \;\text{ where } r = c(w, d)$$
  – nr is the number of words occurring r times; e.g., $0^* = \frac{n_1}{n_0}$, $1^* = \frac{2 n_2}{n_1}$, …
  – What if $n_r = 0$? What about p(w|REF)?

So, which method is the best?
• It depends on the data and the task!
• Many other sophisticated smoothing methods have been proposed…
• Cross validation is generally used to choose the best method and/or set the smoothing parameters…
• For retrieval, Dirichlet prior performs well…
• Smoothing will be discussed further in the course…

Language Models for IR

Text Generation with Unigram LM / Estimation of Unigram LM
[Recap of the two earlier diagrams: topic language models p(w|θ) for "text mining" and "health" generate documents by sampling words, and a unigram LM is estimated from a 100-word "text mining paper" by relative counts (text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100).]

Language Models for Retrieval (Ponte & Croft 98)
[Diagram: two document language models, one estimated from a "text mining paper" (text ?, mining ?, association ?, clustering ?, …, food ?) and one from a "food nutrition paper" (food ?, nutrition ?, healthy ?, diet ?, …). Given the query "data mining algorithms", which model would most likely have generated this query?]
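The sketch below implements two of the smoothing methods listed above, Jelinek-Mercer interpolation and the Dirichlet prior, using the collection as the reference model p(w|REF). The toy corpus, function names, and parameter values are illustrative assumptions, not code from the slides.

```python
from collections import Counter

def jelinek_mercer(w, doc_counts, doc_len, p_ref, lam=0.5):
    """p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)"""
    return (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_ref[w]

def dirichlet(w, doc_counts, doc_len, p_ref, mu=2000):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)"""
    return (doc_counts.get(w, 0) + mu * p_ref[w]) / (doc_len + mu)

# Toy collection: the reference model is the ML estimate over all documents
docs = [
    "text mining text retrieval".split(),
    "food nutrition healthy diet food".split(),
]
collection = [w for d in docs for w in d]
p_ref = {w: c / len(collection) for w, c in Counter(collection).items()}

d = docs[0]
dc, dlen = Counter(d), len(d)
for w in ["text", "food"]:   # "food" is unseen in d but still gets non-zero probability
    print(w, jelinek_mercer(w, dc, dlen, p_ref), dirichlet(w, dc, dlen, p_ref))
```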
Ranking Docs by Query Likelihood
[Diagram: each document d1, …, dN has its own document LM, and the query q is scored against each by its query likelihood p(q|d1), p(q|d2), …, p(q|dN).]

Retrieval as Language Model Estimation
• Document ranking based on query likelihood:
  $$\log p(q|d) = \sum_{i} \log p(w_i|d), \quad \text{where } q = w_1 w_2 \dots w_n$$
  – p(·|d): document language model
• The retrieval problem becomes the estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches

How to Estimate p(w|d)?
• Simplest solution: the maximum likelihood estimator
  – p(w|d) = relative frequency of word w in d
  – What if a word doesn't appear in the text? p(w|d) = 0
• In general, what probability should we give a word that has not been observed?
• If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
• This is what "smoothing" is about …

A General Smoothing Scheme
• All smoothing methods try to
  – discount the probability of words seen in a doc
  – re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words:
  $$p(w|d) = \begin{cases} p_{seen}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d \, p(w|C) & \text{otherwise} \end{cases}$$
  – p_seen(w|d): discounted ML estimate; p(w|C): collection language model

Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain
  $$\log p(q|d) = \sum_{w_i \in q,\, w_i \in d} \log \frac{p_{seen}(w_i|d)}{\alpha_d \, p(w_i|C)} \;+\; n \log \alpha_d \;+\; \sum_{i} \log p(w_i|C)$$
  – The first sum acts like TF weighting (via p_seen(wi|d)) combined with IDF weighting (via the 1/p(wi|C) factor)
  – n log αd acts as document length normalization (a long doc is expected to have a smaller αd)
  – The last sum is the same for all documents and can be ignored for ranking
• Smoothing with p(w|C) ≈ TF-IDF weighting + length normalization

Three Smoothing Methods (Zhai & Lafferty 01)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)
  $$p(w|d) = (1 - \lambda)\, p_{ml}(w|d) + \lambda \, p(w|C)$$
• Dirichlet prior (Bayesian): assume μ pseudo counts distributed as p(w|C)
  $$p(w|d) = \frac{c(w, d) + \mu \, p(w|C)}{|d| + \mu} = \frac{|d|}{|d| + \mu}\, p_{ml}(w|d) + \frac{\mu}{|d| + \mu}\, p(w|C)$$
• Absolute discounting: subtract a constant δ
  $$p(w|d) = \frac{\max(c(w, d) - \delta,\, 0) + \delta |d|_u \, p(w|C)}{|d|}$$

Assume We Use Dirichlet Smoothing
• With $p(w|d) = \frac{c(w, d) + \mu \, p(w|REF)}{|d| + \mu}$, the query likelihood log p(q|d) gives the ranking formula
  $$Score(d, q) = \sum_{w \in q,\, w \in d} c(w, q) \log \left( 1 + \frac{c(w, d)}{\mu \, p(w|C)} \right) + |q| \log \frac{\mu}{|d| + \mu}$$
  – c(w, q): query term frequency; c(w, d): doc term frequency; the 1/p(w|C) factor is similar to IDF; the last term depends on doc length
  – Only words in the query need to be considered

The Notion of Relevance
[Diagram: a taxonomy of retrieval models organized by how they capture relevance.
  – Similarity of representations, Relevance ≈ similarity of (Rep(q), Rep(d)), with different representations & similarity measures: the vector space model (Salton et al., 75) and the prob. distr. model (Wong & Yao, 89)
  – Probability of relevance, P(r=1|q,d), r ∈ {0,1}: the regression model (Fox 83), and generative models split into document generation -- the classical prob. model (Robertson & Sparck Jones, 76) -- and query generation -- the LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  – Probabilistic inference, P(d→q) or P(q→d), with different inference systems: the prob. concept space model (Wong & Yao, 95) and the inference network model (Turtle & Croft, 91)]
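As an illustration of the Dirichlet-smoothed query-likelihood ranking formula above, here is a short Python sketch that scores and ranks documents for a query; the toy documents, the μ value, and the function names are assumptions for illustration, not from the slides.

```python
import math
from collections import Counter

def dirichlet_score(query_tokens, doc_tokens, p_coll, mu=2000):
    """Score(d,q) = sum_{w in q and d} c(w,q) * log(1 + c(w,d) / (mu * p(w|C)))
                    + |q| * log(mu / (|d| + mu))"""
    qc, dc = Counter(query_tokens), Counter(doc_tokens)
    score = sum(qc[w] * math.log(1 + dc[w] / (mu * p_coll[w]))
                for w in qc if w in dc)
    return score + len(query_tokens) * math.log(mu / (len(doc_tokens) + mu))

docs = {
    "d1": "text mining algorithms for text data".split(),
    "d2": "food nutrition and healthy diet".split(),
}
collection = [w for d in docs.values() for w in d]
p_coll = {w: c / len(collection) for w, c in Counter(collection).items()}

query = "data mining algorithms".split()
ranking = sorted(docs, key=lambda d: dirichlet_score(query, docs[d], p_coll), reverse=True)
print(ranking)  # d1 should rank above d2 for this query
```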
Query Generation and Document Generation
• Document ranking based on the conditional probability of relevance P(R=1|D,Q), i.e., ranking based on the odds
  $$O(D, Q) = \frac{P(R=1|D, Q)}{P(R=0|D, Q)} \propto P(D, Q|R), \quad R \in \{0, 1\}$$
• Decomposing P(D, Q|R) gives two families of models:
  – Document generation, P(D|Q, R): the classical probabilistic model, Okapi/BM25
  – Query generation, P(Q|D, R): language models

The Notion of Relevance (revisited)
[The same taxonomy diagram as before, now highlighting where the two families sit: document generation leads to the classical prob. model (Robertson & Sparck Jones, 76), and query generation leads to the LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a).]

The Okapi/BM25 Retrieval Formula
• Classical probabilistic model -- among the best-performing retrieval formulas
  $$Score(d, q) = \sum_{w \in q,\, w \in d} \log \frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1)\, c(w, d)}{k_1 \left( (1 - b) + b \frac{|d|}{avdl} \right) + c(w, d)} \cdot \frac{(k_3 + 1)\, c(w, q)}{k_3 + c(w, q)}$$
  – Only words in the query are considered
  – The first factor is an IDF weight; the second uses the doc term frequency c(w, d) and the document length |d| (normalized by the average doc length avdl); the third uses the query term frequency c(w, q)

What are the Essential Features in a Retrieval Formula?
• Document TF: how many occurrences of each query term do we observe in a document
• Query TF: how many occurrences of each term do we observe in a query
• IDF: how distinctive/important each query term is (a term with a higher IDF is more distinctive)
• Document length: how long the document is (we want to penalize long documents, but don't want to over-penalize them!)
• Others?
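To tie the essential features together, here is a compact Python sketch of the Okapi/BM25 formula shown above. The toy documents are illustrative, and the k1, b, k3 values are common defaults assumed here rather than prescribed by the slides.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, df, N, avdl, k1=1.2, b=0.75, k3=8.0):
    """Okapi/BM25: IDF * saturated, length-normalized document TF * saturated query TF."""
    qc, dc = Counter(query_tokens), Counter(doc_tokens)
    score = 0.0
    for w in qc:
        if w not in dc:
            continue                      # only words in both the query and the document contribute
        idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))
        tf = ((k1 + 1) * dc[w]) / (k1 * ((1 - b) + b * len(doc_tokens) / avdl) + dc[w])
        qtf = ((k3 + 1) * qc[w]) / (k3 + qc[w])
        score += idf * tf * qtf
    return score

docs = {
    "d1": "text mining algorithms for large text data".split(),
    "d2": "food nutrition and healthy diet".split(),
    "d3": "food recipes and cooking".split(),
}
N = len(docs)
avdl = sum(len(d) for d in docs.values()) / N
df = Counter(w for d in docs.values() for w in set(d))   # document frequency of each word

query = "data mining algorithms".split()
for name, d in docs.items():
    print(name, bm25_score(query, d, df, N, avdl))       # d1 scores highest
```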