Latent Semantic Analysis
A Gentle Tutorial Introduction
Tutorial Resources
http://cis.paisley.ac.uk/giro-ci0/GU_LSA_TUT
M.A. Girolami
25/05/2017
University of Glasgow
DCS Tutorial
Contents

• Latent Semantic Analysis
  • Motivation
  • Singular Value Decomposition
  • Term Document Matrix Structure
  • Query and Document Similarity in Latent Space
• Probabilistic Views on LSA
  • Factor Analytic Model
  • Generative Model Representation
  • Alternate Basis to the Principal Directions
• Latent Semantic & Document Clustering (In the Bar later)
  • Principal Direction Clustering
  • Hierarchic Clustering with LSA
Latent Semantic Analysis

Motivation
• Lexical matching at the term level is inaccurate (claimed)
• Polysemy – words with a number of ‘meanings’ – term matching returns irrelevant documents – impacts precision
• Synonymy – a number of words with the same ‘meaning’ – term matching misses relevant documents – impacts recall
• LSA assumes that there exists a LATENT structure in word usage, obscured by variability in word choice
• Analogous to the signal + additive noise model in signal processing
Latent Semantic Analysis
• Word usage defined by term and document co-occurrence – matrix structure
• Latent structure / semantics in word usage
• Clustering documents or words – no shared space
• Two-mode factor analysis – dyadic decomposition into a ‘latent semantic’ factor space – employing Singular Value Decomposition
• Cubic computational scaling – reasonable!
Singular Value Decomposition
M × N Term × Document matrix (M >> N)
D = [d1, d2, …, dN] and d = [t1, t2, …, tM]T
Consider the linear combination of terms
  u1t1 + u2t2 + … + uMtM = uTd
which maximises
  E{(uTd)2} = E{uTddTu} = uT E{ddT} u ≈ uTDDTu
subject to uTu = 1
Singular Value Decomposition
Maximise uTDDTu s.t. uTu = 1
Construct the Lagrangian uTDDTu – λuTu
Setting the vector of partial derivatives to zero gives
  DDTu – λu = (DDT – λI) u = 0
As u ≠ 0, DDT – λI must be singular, i.e.
  |DDT – λI| = 0
This is a polynomial in λ of degree M whose characteristic roots are called the eigenvalues
(German eigen = own, unique to, particular to)
Singular Value Decomposition
The first root is called the prinicipal eigenvalue which has an
associated orthonormal (uTu = 1) eigenvector u
Subsequent roots are ordered such that λ1> λ2 >… > λM with
rank(D) non-zero values.
Eigenvectors form an orthonormal basis i.e. uiTuj = δij
The eigenvalue decomposition of DDT = UΣUT
where U = [u1, u2, …, uM] and Σ = diag[λ 1, λ 2, …, λ M]
Similarly the eigenvalue decomposition of DTD = VΣVT
The SVD is closely related to the above D=U Σ1/2 VT
The left eigenvectors U, right eigenvectors V,
singular values = square root of eigenvalues.
SVD Properties
• D = USVT = ∑i=1..N σiuiviT and DK = ∑i=1..K σiuiviT = UKSKVKT with K < N; UKTUK = IK = VKTVK
• DK is then the best rank-K approximation to D, in the Frobenius-norm sense (see the sketch below)
• The K-dim orthonormal projections SK-1UKTD = VKT preserve the maximum amount of variability
• Under the assumption that the columns of D are multivariate Gaussian, V defines the principal axes of the ellipse of constant variance λi in the original space
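The rank-K truncation above is easy to reproduce; the following is a minimal sketch in Python/NumPy (my choice of library – the tutorial does not prescribe an implementation), using the Eckart–Young property that DK is the best rank-K approximation in the Frobenius norm:

import numpy as np

def rank_k_approximation(D, K):
    # D_K = sum_{i=1..K} sigma_i u_i v_i^T: keep the K largest singular triplets.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# The Frobenius error of D_K equals the root sum of the discarded
# squared singular values.
D = np.random.randn(100, 20)
DK = rank_k_approximation(D, K=5)
_, s, _ = np.linalg.svd(D, full_matrices=False)
print(np.linalg.norm(D - DK, 'fro'), np.sqrt((s[5:]**2).sum()))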
SVD Example
D – 10 × 2
   2.9002   3.6790
   4.0860   5.2366
   1.9954   3.3687
   3.5069   1.6748
   4.4620   2.7684
  -2.9444  -4.6447
  -4.1132  -4.7043
  -3.6208  -5.0181
  -3.0558  -4.1821
  -6.1204  -2.4790
U – 10 × 2
  -0.2750  -0.1242
  -0.3896  -0.1846
  -0.2247  -0.2369
  -0.2150   0.3514
  -0.3005   0.3318
   0.3177   0.2906
   0.3682   0.0833
   0.3613   0.2319
   0.3027   0.1861
   0.3563  -0.6935
S – 2 × 2
  16.9491   0
   0        3.8491
VT – 2 × 2
  -0.6960  -0.7181
   0.7181  -0.6960
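The example can be checked directly; a minimal NumPy sketch (note that singular vectors are only determined up to sign, so columns may differ from the slide by a factor of -1):

import numpy as np

# The 10 x 2 example matrix from the slide.
D = np.array([
    [ 2.9002,  3.6790], [ 4.0860,  5.2366], [ 1.9954,  3.3687],
    [ 3.5069,  1.6748], [ 4.4620,  2.7684], [-2.9444, -4.6447],
    [-4.1132, -4.7043], [-3.6208, -5.0181], [-3.0558, -4.1821],
    [-6.1204, -2.4790],
])

U, s, Vt = np.linalg.svd(D, full_matrices=False)
print(s)    # approximately [16.9491, 3.8491]
print(U)    # matches the slide's U up to sign
print(Vt)   # matches the slide's V^T up to sign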
SVD Properties
• There is an implicit assumption that the observed data distribution is multivariate Gaussian
• Can be considered as a probabilistic generative model – latent variables are Gaussian – sub-optimal in likelihood terms for non-Gaussian distributions
• Employed in signal processing for noise filtering – the dominant subspace contains the majority of the information-bearing part of the signal
• A similar rationale applies when using SVD for LSI
Computing SVD
• The Power Method is one numerical approach (see the sketch below)
  Random initialisation of vector u0
  Set  u1′ = DDTu0  and  u1 = u1′ / √((u1′)Tu1′)
  Then u2′ = DDTu1  and  u2 = u2′ / √((u2′)Tu2′)
  Then ui′ = DDTui-1  and  ui = ui′ / √((ui′)Tui′)
  As i → ∞, ui → u1 and √((ui′)Tui′) → λ1
• Subsequent eigenvectors are found by deflation:
  u1′ = (DDT – λ1u1u1T) u0
• Note that for a term-document matrix the computation of u1 is inexpensive – subsequent eigenvectors require matrix–vector operations on a dense matrix.
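A minimal Python/NumPy sketch of the power iteration with deflation described above (for a real term-document matrix one would keep D sparse and compute D @ (D.T @ u) rather than forming DDT explicitly):

import numpy as np

def power_method(C, n_iter=1000, tol=1e-10):
    # Leading eigenpair of the symmetric PSD matrix C = D D^T.
    u = np.random.randn(C.shape[0])
    lam = 0.0
    for _ in range(n_iter):
        v = C @ u                       # v = DD^T u_{i-1}
        lam_new = np.linalg.norm(v)     # converges to lambda_1
        u = v / lam_new                 # normalise so u^T u = 1
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam, u

D = np.random.randn(50, 10)
C = D @ D.T
lam1, u1 = power_method(C)
# Deflation: subtract the converged component, then iterate again for lambda_2.
lam2, u2 = power_method(C - lam1 * np.outer(u1, u1))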
Term Document Matrix Structure
• Create an artificially heterogeneous collection
• 100 documents from each of 3 distinct newsgroups
• Indexed using a standard stop word list
• 12418 distinct terms
• Term × Document matrix (12418 × 300)
• 8% fill of the sparse matrix
• Sort terms by rank – structure apparent
• Matrix of cosine similarity between documents – clear structure apparent
Term Document Matrix Structure
[Figure: sorted term-document matrix and document-document cosine-similarity matrix for the toy collection.]
Query and Document Similarity in Latent Space
• Rank 3: D3 = σ1u1v1T + σ2u2v2T + σ3u3v3T
• Projection into the 3-d latent semantic space of all documents achieved by S3-1U3TD
• A query q in the LSA space: S3-1U3Tq
• Similarity in LSA space (see the sketch below):
  (S3-1U3Tq)T S3-1U3TD
  = qTU3S3-1S3-1U3TD
  = qTU3Σ3-1U3TD
  = qTΘD
• The LSA similarity metric Θ = U3Σ3-1U3T acts as a term expansion
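A minimal sketch of the matching scheme above in Python/NumPy (K = 3 follows the slide; the variable names are my own):

import numpy as np

def lsa_similarity(D, q, K=3):
    # Score = q^T U_K Sigma_K^{-1} U_K^T D, i.e. the inner product of query
    # and documents after the projection S_K^{-1} U_K^T into latent space.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    UK, sK = U[:, :K], s[:K]
    docs_latent = (UK / sK).T @ D       # = S_K^{-1} U_K^T D = V_K^T
    q_latent = (UK / sK).T @ q          # query folded into latent space
    return q_latent @ docs_latent       # one similarity score per document

M, N = 200, 30
D = np.abs(np.random.randn(M, N))       # stand-in term-document counts
q = np.zeros(M); q[[3, 17, 42]] = 1.0   # query as a term-frequency vector
print(lsa_similarity(D, q))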
Query and Document Similarity in Latent Space
[Figure: documents of the toy collection projected into the 3-D latent space, together with the projected query.]
Random Projections
• An important theoretical result:
  for a random projection from an M-dim to an L-dim space, where L << M,
  Euclidean distances and angles (norms and inner products) are preserved with high probability
• LSA can then be performed using the SVD of the reduced-dimensional L × N matrix (less costly) – see the sketch below
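A minimal sketch of the idea in Python/NumPy (a Gaussian projection matrix is one standard construction; the slide does not fix a particular one):

import numpy as np

M, N, L = 10000, 300, 500                 # terms, documents, reduced dim
D = np.random.randn(M, N)                 # stand-in term-document matrix

R = np.random.randn(L, M) / np.sqrt(L)    # random L x M projection
D_proj = R @ D                            # L x N matrix, much cheaper to SVD

# Inner products (and hence norms and angles) are approximately preserved:
G, G_proj = D.T @ D, D_proj.T @ D_proj
print(np.linalg.norm(G - G_proj, 'fro') / np.linalg.norm(G, 'fro'))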
LSA Performance
• LSA consistently improves recall on standard test collections (precision/recall generally improved)
• Variable performance on the larger TREC collections
• Dimensionality of the latent space is a magic number – 300 to 1000 seems to work fine – no satisfactory way of assessing the value
• Computational cost – at present – prohibitive
Probabilistic Views on LSA
• Factor Analytic Model
• Generative Model Representation
• Alternate Basis to the Principal Directions
Factor Analytic Model
• d = Af + n
• p(d) = ∑f p(d|f)p(f)
• This probabilistic representation underlies LSA, where the prior and likelihood are both multivariate Gaussian.
Generative Model Representation
• Generate a document d with probability p(d)
• Having observed d, generate a semantic factor with probability p(f|d)
• Having observed a semantic factor, generate a word with probability p(w|f)
Generative Model Representation
[Figure: a document (“The cat sat on the mat and the quick brown fox jumped…”) is generated with probability P(d); semantic factors (Factor 1–3) are selected with probability P(f|d); words (e.g. ‘spider’) are generated with probability P(w|f).]
Generative Model Representation
• Model representation as a joint probability:
  p(d,w) = p(d)p(w|d) = p(d)∑f p(w|f)p(f|d)
• w and d are conditionally independent given f:
  p(d,w) = ∑f p(w|f)p(f)p(d|f)
• Note the similarity with DK = ∑i=1..K σiuiviT
p(d,w)
[Figure: worked example of the generative model – a document (“The cat sat on the mat and the quick brown fox jumped…”) with P(d) = 0.003 and factor posteriors p(f=1|d) = 0.6, p(f=2|d) = 0.1, p(f=3|d) = 0.25, p(f=4|d) = 0.05; the factors emit w = spider with probabilities 0.6, 0.02, 0.01 and 0.1, giving p(d)∑f p(w|f)p(f|d) = 0.001.]
Generative Model Representation
• The distributions p(f|d) and p(w|f) are multinomial – counts in successive trials
• More appropriate than a Gaussian
• Note that the Term × Document matrix is a sample from the true distribution pt(d, w)
• ∑ij D(i,j) log p(dj, wi) – the cross-entropy between model and realisation – maximise the likelihood that the model p(dj, wi) generated the realisation D, subject to conditions on p(f|d) and p(w|f) (written out below)
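Written out, the objective above is the log-likelihood of the observed counts under the aspect model (a sketch in LaTeX; the factorisation is the joint given two slides back):

\mathcal{L} = \sum_{i,j} D(i,j)\,\log p(d_j, w_i)
            = \sum_{i,j} D(i,j)\,\log \sum_{f} p(w_i \mid f)\, p(f)\, p(d_j \mid f)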
Generative Model Representation
• Estimation of p(f|d) and p(w|f) requires a standard EM algorithm (see the sketch below)
• Expectation Maximisation:
  A general iterative method for ML parameter estimation
  Ideal for ‘missing variable’ problems
  E-step: estimate p(f|d,w) using the current estimates of p(w|f) and p(f|d)
  M-step: estimate new values of p(w|f) and p(f|d) using the current estimate of p(f|d,w)
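A minimal sketch of the two EM steps in Python/NumPy, using the conditional form p(d,w) = p(d)∑f p(w|f)p(f|d) from earlier (array shapes and names are my own; a practical implementation would exploit the sparsity of D):

import numpy as np

def plsa_em(D, F=4, n_iter=100, seed=0):
    # D: M x N term-document count matrix; F: number of semantic factors.
    rng = np.random.default_rng(seed)
    M, N = D.shape
    p_w_f = rng.random((M, F)); p_w_f /= p_w_f.sum(axis=0)   # p(w|f)
    p_f_d = rng.random((F, N)); p_f_d /= p_f_d.sum(axis=0)   # p(f|d)
    for _ in range(n_iter):
        # E-step: p(f|d,w) from the current p(w|f) and p(f|d).
        joint = p_w_f[:, :, None] * p_f_d[None, :, :]        # M x F x N
        post = joint / joint.sum(axis=1, keepdims=True)      # p(f|d,w)
        # M-step: new p(w|f) and p(f|d) from expected counts.
        counts = D[:, None, :] * post                        # M x F x N
        p_w_f = counts.sum(axis=2); p_w_f /= p_w_f.sum(axis=0)
        p_f_d = counts.sum(axis=0); p_f_d /= p_f_d.sum(axis=0)
    return p_w_f, p_f_d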
Generative Model Representation
• Once the parameters are estimated:
  p(f|d) gives the posterior probability that semantic factor f is associated with d
  p(w|f) gives the probability of word w being generated from semantic factor f
• A nice, clear interpretation – unlike the U and V terms in the SVD
• A ‘sparse’ representation – unlike the SVD
Generative Model Representation
• Take the toy collection generated earlier – estimate p(f|d) and p(w|f)
• [Figure: graphical representation of p(f|d).]
Generative Model Representation
• Ordered representation of p(w|f) – top words per factor:

  Factor 1      Factor 2      Factor 3
  image         people        people
  files         science       hockey
  graphics      god           game
  format        objective     year
  ftp           koresh        good
  net           religious     team
  email         jesus         time
Alternate Basis to the Principal Directions
• Similarity between a query and documents can be assessed in ‘factor’ space – vis. LSA
• Sim = ∑f p(f|q)p(f|D): the averaged product of query and document posterior probabilities over all ‘factors’ – the latent space
• Alternatively, note that D and q are sample instances from an unknown distribution
• All probabilities – word counts – are estimated from D, so are ‘noisy’
• Employ p(dj, wi) as a ‘smoothed’ version of tf and use the ‘cosine’ measure ∑i p(D, wi) × qi – ‘query expansion’ (sketched below)
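A minimal sketch of the two matching schemes just described, assuming a fit like the earlier EM sketch (p(f|q) would come from folding the query in with EM while holding p(w|f) fixed; all names are my own):

import numpy as np

def factor_space_sim(p_f_q, p_f_D):
    # Sim = sum_f p(f|q) p(f|D): product of query and document factor
    # posteriors, one score per document (p_f_q: F, p_f_D: F x N).
    return p_f_q @ p_f_D

def smoothed_cosine_sim(p_dw, q):
    # 'Cosine'-style match sum_i p(D, w_i) * q_i, treating the model joint
    # p(d, w) as a smoothed term frequency (p_dw: M x N, q: M).
    return q @ p_dw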
Alternate Basis to the Principal Directions
• Both forms of matching have been shown to improve on LSA (MED, CRAN, CACM)
• An elegant, statistically principled approach – can employ (in theory) Bayesian model assessment techniques
• The likelihood is a nonlinear function of the parameters p(f|d) and p(w|f) – a huge parameter space with a small number of relative samples – high bias and variance expected
• Assessment of the correlation between likelihood and P/R – yet to be studied in depth
Conclusions
• SVD-defined bases provide P/R improvements over term matching
  Interpretation difficult
  Optimal dimension – open question
  Variable performance on LARGE collections
  Supercomputing muscle required
• Probabilistic approaches provide improvements over SVD
  Clear interpretation of the decomposition
  Optimal dimension – open question
  High variability of results due to nonlinear optimisation over a HUGE parameter space
  Improvements marginal in relation to cost
Latent Semantic & Hierarchic Document Clustering
• Had enough? …
• … To the Bar…