Latent Semantic Analysis: A Gentle Tutorial Introduction
M.A. Girolami, University of Glasgow, 25/05/2017
Tutorial resources: http://cis.paisley.ac.uk/giro-ci0/GU_LSA_TUT

Tutorial Contents
• Latent Semantic Analysis
– Motivation
– Singular Value Decomposition
– Term Document Matrix Structure
– Query and Document Similarity in Latent Space
• Probabilistic Views on LSA
– Factor Analytic Model
– Generative Model Representation
– Alternate Basis to the Principal Directions
• Latent Semantic & Document Clustering (in the bar later)
– Principal Direction Clustering
– Hierarchic Clustering with LSA

Latent Semantic Analysis: Motivation
• Lexical matching at the term level is (claimed to be) inaccurate
• Polysemy – one word with a number of 'meanings' – term matching returns irrelevant documents – impacts precision
• Synonymy – a number of words with the same 'meaning' – term matching misses relevant documents – impacts recall
• LSA assumes that there exists a LATENT structure in word usage, obscured by variability in word choice
• Analogous to the signal + additive noise model in signal processing

Latent Semantic Analysis
• Word usage is defined by term and document co-occurrence – a matrix structure
• Latent structure / semantics in word usage
• Clustering documents or words gives no shared space
• Two-mode factor analysis – a dyadic decomposition into a 'latent semantic' factor space – employing the Singular Value Decomposition
• Cubic computational scaling – reasonable!

Singular Value Decomposition
• M × N Term × Document matrix (M >> N): D = [d_1, d_2, …, d_N], with each document d = [t_1, t_2, …, t_M]^T
• Consider the linear combination of terms u_1 t_1 + u_2 t_2 + … + u_M t_M = u^T d which maximises
E{(u^T d)^2} = E{u^T d d^T u} = u^T E{d d^T} u ≈ u^T D D^T u
subject to u^T u = 1

Singular Value Decomposition
• Maximise u^T D D^T u subject to u^T u = 1
• Construct the Lagrangian u^T D D^T u – λ u^T u
• Set the vector of partial derivatives to zero: D D^T u – λu = (D D^T – λI) u = 0
• As u ≠ 0, the matrix D D^T – λI must be singular, i.e. |D D^T – λI| = 0
• This is a polynomial in λ of degree M; its characteristic roots are called the eigenvalues (German eigen = own, unique to, particular to)

Singular Value Decomposition
• The first root is called the principal eigenvalue, with an associated orthonormal (u^T u = 1) eigenvector u
• Subsequent roots are ordered such that λ_1 > λ_2 > … > λ_M, with rank(D) non-zero values
• The eigenvectors form an orthonormal basis, i.e. u_i^T u_j = δ_ij
• The eigenvalue decomposition of D D^T is U Σ U^T, where U = [u_1, u_2, …, u_M] and Σ = diag(λ_1, λ_2, …, λ_M)
• Similarly, the eigenvalue decomposition of D^T D is V Σ V^T
• The SVD is closely related to the above: D = U Σ^{1/2} V^T – left eigenvectors U, right eigenvectors V, and singular values equal to the square roots of the eigenvalues
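To make the eigenvalue/SVD relationship above concrete, here is a minimal numerical sketch (my own numpy illustration, not part of the original tutorial; the toy matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 12, 5                                  # toy Term x Document sizes, M >> N
D = rng.standard_normal((M, N))

# SVD: D = U diag(s) V^T
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# The eigenvalues of D D^T are the squared singular values of D
eigvals = np.linalg.eigvalsh(D @ D.T)[::-1][:N]   # sort largest first, keep top N
print(np.allclose(eigvals, s**2))                 # -> True

# Each column u_i of U is an eigenvector of D D^T: (D D^T) u_i = lambda_i u_i
print(np.allclose(D @ D.T @ U[:, 0], s[0]**2 * U[:, 0]))   # -> True
```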
SVD Properties
• D = U S V^T = Σ_{i=1..N} σ_i u_i v_i^T, and the truncation D_K = Σ_{i=1..K} σ_i u_i v_i^T = U_K S_K V_K^T with K < N and U_K^T U_K = I_K = V_K^T V_K
• D_K is then the best rank-K approximation to D in the Frobenius-norm sense
• The K-dimensional orthonormal projections S_K^{-1} U_K^T D = V_K^T preserve the maximum amount of variability
• Under the assumption that the columns of D are multivariate Gaussian, V defines the principal axes of the ellipse of constant variance λ_i in the original space

SVD Example
D (10 × 2):
 2.9002   3.6790
 4.0860   5.2366
 1.9954   3.3687
 3.5069   1.6748
 4.4620   2.7684
-2.9444  -4.6447
-4.1132  -4.7043
-3.6208  -5.0181
-3.0558  -4.1821
-6.1204  -2.4790

U (10 × 2):
-0.2750  -0.1242
-0.3896  -0.1846
-0.2247  -0.2369
-0.2150   0.3514
-0.3005   0.3318
 0.3177   0.2906
 0.3682   0.0833
 0.3613   0.2319
 0.3027   0.1861
 0.3563  -0.6935

S (2 × 2):
16.9491   0
 0        3.8491

V^T (2 × 2):
-0.6960  -0.7181
 0.7181  -0.6960

SVD Properties
• There is an implicit assumption that the observed data distribution is multivariate Gaussian
• Can be considered as a probabilistic generative model – the latent variables are Gaussian – sub-optimal in likelihood terms for non-Gaussian distributions
• Employed in signal processing for noise filtering – the dominant subspace contains the majority of the information-bearing part of the signal
• A similar rationale applies when using the SVD for LSI

Computing the SVD
• The power method is one numerical approach
• Random initialisation of a vector u^(0), then iterate:
ũ^(1) = D D^T u^(0) and u^(1) = ũ^(1) / √((ũ^(1))^T ũ^(1))
ũ^(2) = D D^T u^(1) and u^(2) = ũ^(2) / √((ũ^(2))^T ũ^(2))
…
ũ^(i) = D D^T u^(i-1) and u^(i) = ũ^(i) / √((ũ^(i))^T ũ^(i))
• As i → ∞, u^(i) → u_1 and √((ũ^(i))^T ũ^(i)) → λ_1
• Subsequent eigenvectors use deflation: ũ^(1) = (D D^T – λ_1 u_1 u_1^T) u^(0)
• Note: for a term-document matrix the computation of u_1 is inexpensive; subsequent eigenvectors require matrix-vector operations on a dense matrix
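A minimal sketch of this power iteration with deflation (again my own numpy illustration under toy assumptions, not the original tutorial's code):

```python
import numpy as np

def power_method(A, n_iter=200, seed=0):
    """Principal eigenvalue and eigenvector of a symmetric PSD matrix A."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[0])       # random initialisation u^(0)
    for _ in range(n_iter):
        u_tilde = A @ u                       # u~^(i) = A u^(i-1)
        lam = np.sqrt(u_tilde @ u_tilde)      # norm -> lambda_1 as i -> inf
        u = u_tilde / lam                     # renormalise to unit length
    return lam, u

rng = np.random.default_rng(1)
D = rng.standard_normal((12, 5))
A = D @ D.T                                   # term-term co-occurrence matrix

lam1, u1 = power_method(A)
lam2, u2 = power_method(A - lam1 * np.outer(u1, u1))   # deflate the first EV

s = np.linalg.svd(D, compute_uv=False)
print(np.allclose([lam1, lam2], s[:2] ** 2))           # -> True
```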
Term Document Matrix Structure
• Create an artificially heterogeneous collection: 100 documents from each of 3 distinct newsgroups
• Indexed using a standard stop-word list: 12418 distinct terms
• Term × Document matrix (12418 × 300) – an 8% fill of the sparse matrix
• Sorting terms by rank makes the structure apparent
• The matrix of cosine similarities between documents also shows clear structure

(Figure: the sorted Term × Document matrix and the document cosine-similarity matrix.)

Query and Document Similarity in Latent Space
• Rank 3: D_3 = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + σ_3 u_3 v_3^T
• Projection of all documents into the 3-d latent semantic space: S_3^{-1} U_3^T D
• A query q in the LSA space: S_3^{-1} U_3^T q
• Similarity in LSA space: (S_3^{-1} U_3^T q)^T (S_3^{-1} U_3^T D) = q^T U_3 S_3^{-1} S_3^{-1} U_3^T D = q^T U_3 Σ_3^{-1} U_3^T D = q^T Θ D (since S_3^2 = Σ_3)
• The LSA similarity metric Θ = U_3 Σ_3^{-1} U_3^T acts as a term expansion

Query and Document Similarity in Latent Space
• Project the documents into the 3-D latent space
• Project the query
(Figure: documents and query plotted in the 3-D latent space.)

Random Projections
• An important theoretical result: under a random projection from an M-dimensional to an L-dimensional space, where L << M, Euclidean distances and angles (norms and inner products) are preserved with high probability
• LSA can then be performed using the SVD of the reduced-dimensional L × N matrix (less costly)

LSA Performance
• LSA consistently improves recall on standard test collections (precision/recall generally improved)
• Variable performance on the larger TREC collections
• Dimensionality of the latent space is a magic number – 300 to 1000 seems to work fine – there is no satisfactory way of assessing its value
• Computational cost – at present – prohibitive

Probabilistic Views on LSA
• Factor Analytic Model
• Generative Model Representation
• Alternate Basis to the Principal Directions

Factor Analytic Model
• d = A f + n
• p(d) = Σ_f p(d|f) p(f)
• This probabilistic representation underlies LSA, where the prior and likelihood are both multivariate Gaussian
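To make the factor analytic model concrete, a minimal generative sketch (my own numpy illustration; the dimensions, loading matrix and noise level are arbitrary assumptions, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 100, 3                       # terms, latent factors (toy sizes)

A = rng.standard_normal((M, K))     # factor loading matrix
f = rng.standard_normal(K)          # latent factors, prior f ~ N(0, I)
n = 0.1 * rng.standard_normal(M)    # additive Gaussian noise

d = A @ f + n                       # one generated 'document' vector
# Marginally, p(d) is Gaussian with covariance A A^T + (0.1**2) I
```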
Generative Model Representation
• Generate a document d with probability p(d)
• Having observed d, generate a semantic factor with probability p(f|d)
• Having observed a semantic factor, generate a word with probability p(w|f)

(Figure: a document, e.g. "The cat sat on the mat and the quick brown fox jumped…", is drawn with probability p(d); it selects among factors via p(f|d); each factor generates words such as 'spider' via p(w|f).)

Generative Model Representation
• Model representation as a joint probability: p(d,w) = p(d) p(w|d) = p(d) Σ_f p(w|f) p(f|d)
• w and d are conditionally independent given f: p(d,w) = Σ_f p(w|f) p(f) p(d|f)
• Note the similarity with D_K = Σ_{i=1..K} σ_i u_i v_i^T

(Figure: worked example – a document with p(d) = 0.003 and factor posteriors p(f=1|d) = 0.6, p(f=2|d) = 0.1, p(f=3|d) = 0.25, p(f=4|d) = 0.05; the four factors generate the word 'spider' with probabilities 0.6, 0.02, 0.01 and 0.1, giving p(d,w) = p(d) Σ_f p(w|f) p(f|d) = 0.001.)

Generative Model Representation
• The distributions p(f|d) and p(w|f) are multinomial – counts in successive trials – more appropriate than a Gaussian
• Note that the Term × Document matrix is a sample from the true distribution p_t(d, w)
• Σ_ij D(i,j) log p(d_j, w_i) is the cross-entropy between the model and the realisation – maximise the likelihood that the model p(d_j, w_i) generated the realisation D, subject to the conditions on p(f|d) and p(w|f)

Generative Model Representation
• Estimation of p(f|d) and p(w|f) requires a standard EM algorithm
• Expectation Maximisation: a general iterative method for ML parameter estimation, ideal for 'missing variable' problems
• E-step: estimate p(f|d,w) using the current estimates of p(w|f) and p(f|d)
• M-step: estimate new values of p(w|f) and p(f|d) using the current estimate of p(f|d,w)
(A minimal sketch of these two steps appears at the end of this section.)

Generative Model Representation
• Once the parameters are estimated:
– p(f|d) gives the posterior probability that semantic factor f is associated with document d
– p(w|f) gives the probability of word w being generated from semantic factor f
• A nice clear interpretation, unlike the U and V terms in the SVD
• A 'sparse' representation – unlike the SVD

Generative Model Representation
• Take the toy collection generated earlier and estimate p(f|d) and p(w|f)
• Graphical representation of p(f|d)
(Figure: p(f|d) across the document collection.)

Generative Model Representation
• Ordered representation of p(w|f) – top words per factor:
Factor 1: image, files, graphics, format, ftp, net, email
Factor 2: people, science, god, objective, koresh, religious, jesus
Factor 3: people, hockey, game, year, good, team, time

Alternate Basis to the Principal Directions
• Similarity between query and documents can be assessed in 'factor' space – cf. LSA: Sim = Σ_f p(f|q) p(f|D), the averaged product of the query and document posterior probabilities over all 'factors' – the latent space
• Alternatively, note that D and q are sample instances from an unknown distribution
• All probabilities – word counts – estimated from D are 'noisy'
• Employ p(d_j, w_i) as a 'smoothed' version of tf and use the 'cosine' measure Σ_i p(D, w_i) × q_i – 'query expansion'

Alternate Basis to the Principal Directions
• Both forms of matching have been shown to improve on LSA (MED, CRAN, CACM)
• An elegant, statistically principled approach – can (in theory) employ Bayesian model assessment techniques
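The promised sketch of the EM procedure for this aspect model (my own numpy illustration, not the original tutorial's code; the function name, array layout and toy usage are assumptions):

```python
import numpy as np

def plsa_em(D, K, n_iter=50, seed=0):
    """EM for the aspect model p(d,w) = p(d) sum_f p(w|f) p(f|d).

    D is an M x N term-document count matrix (every document assumed
    to contain at least one word); K is the number of semantic factors.
    Returns p(w|f) as an M x K array and p(f|d) as a K x N array.
    """
    rng = np.random.default_rng(seed)
    M, N = D.shape
    p_w_f = rng.random((M, K)); p_w_f /= p_w_f.sum(axis=0)   # p(w|f)
    p_f_d = rng.random((K, N)); p_f_d /= p_f_d.sum(axis=0)   # p(f|d)

    for _ in range(n_iter):
        # E-step: posterior p(f|d,w) for every (word, doc) pair
        joint = p_w_f[:, :, None] * p_f_d[None, :, :]        # M x K x N, dense: toy scale only
        p_f_dw = joint / joint.sum(axis=1, keepdims=True)    # normalise over f

        # M-step: re-estimate p(w|f) and p(f|d) from expected counts
        exp_counts = D[:, None, :] * p_f_dw                  # n(d,w) * p(f|d,w)
        p_w_f = exp_counts.sum(axis=2)                       # sum over documents
        p_w_f /= p_w_f.sum(axis=0, keepdims=True)
        p_f_d = exp_counts.sum(axis=0)                       # sum over words
        p_f_d /= p_f_d.sum(axis=0, keepdims=True)
    return p_w_f, p_f_d

# Hypothetical toy usage: 3 factors on a small random count matrix
D = np.random.default_rng(2).poisson(0.5, size=(200, 30))
p_w_f, p_f_d = plsa_em(D, K=3)
print(p_f_d[:, 0])    # factor posterior p(f|d) for document 0
```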
• However, the likelihood is a nonlinear function of the parameters p(f|d) and p(w|f) – a huge parameter space with a relatively small number of samples – high bias and variance are to be expected
• The correlation between likelihood and P/R has yet to be studied in depth

Conclusions
SVD-based LSA:
• An SVD-defined basis provides P/R improvements over term matching
• Interpretation difficult
• Optimal dimension – open question
• Variable performance on LARGE collections
• Supercomputing muscle required

Probabilistic LSA:
• Probabilistic approaches provide improvements over the SVD
• Clear interpretation of the decomposition
• Optimal dimension – open question
• High variability of results due to nonlinear optimisation over a HUGE parameter space
• Improvements marginal in relation to cost

Latent Semantic & Hierarchic Document Clustering
Had enough? … To the bar…