Painless embeddings of distributions: the function space view
Part 3 - Conditional independence with kernels

Arthur Gretton (MPI), Alex Smola (NICTA), Kenji Fukumizu (ISM)
fukumizu@ism.ac.jp
ICML 2008 Tutorial, July 5, Helsinki, Finland
Outline of Part 3
I. Introduction
II. Conditional independence with kernels
III. Application to causal inference
IV. Summary
I. Introduction
Functional Space View
Embedding into RKHS
– Feature map $\Phi$ from the original space $\Omega$ into the RKHS $H$:
  $\Phi(X) = k(\cdot, X)$
Basic statistics on RKHS
– Mean element → characterizes a probability
– Covariance operator → independence/dependence
– Conditional covariance operator → conditional independence/dependence
Conditional Independence
Definition
X, Y, Z: random variables with joint probability density $p_{XYZ}(x, y, z)$.
X and Y are conditionally independent given Z if
  $p_{Y|ZX}(y \mid z, x) = p_{Y|Z}(y \mid z)$
(if Z is known, the information of X is not needed to predict Y), or equivalently
  $p_{XY|Z}(x, y \mid z) = p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)$.
[Diagrams: X–Z–Y graph structures in which Z separates X and Y]
Example
Applications in statistical inference
– Graphical modeling:
  separation in a graph implies conditional independence.
– Causal inference:
  a formulation of causality is given by conditional independence.
Example: time series
– Does X cause Y?
  Non-causality: $p(Y_t \mid Y_{t-1}, X_{t-1}) = p(Y_t \mid Y_{t-1})$?
  Equivalently, $Y_t \perp X_{t-1} \mid Y_{t-1}$?
[Diagram: two coupled time series, with a possible arrow from X_{t-1} to Y_t]
II. Conditional independence with kernels
Review: Conditional Independence for Gaussian Variables
Conditional covariance of Gaussian variables
– $(X, Y, Z)$: multidimensional jointly Gaussian variable.
– Conditional covariance matrix:
  $V_{YX|Z} \equiv \mathrm{Cov}[Y, X \mid Z = z] = V_{YX} - V_{YZ} V_{ZZ}^{-1} V_{ZX}$
  ($V_{XY}$ etc.: covariance matrices)
  Note: $V_{YX|Z}$ does not depend on the value of $z$.
Conditional independence for Gaussian variables:
  $X \perp Y \mid Z \iff V_{XY|Z} = O$, i.e. $V_{YX} - V_{YZ} V_{ZZ}^{-1} V_{ZX} = O$.
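As a quick numerical check of this identity, here is a minimal NumPy sketch (our own illustrative code, not from the tutorial): it estimates the joint covariance of simulated Gaussian data in which X and Y interact only through Z, and verifies that the conditional covariance is near zero while the marginal covariance is not.

```python
import numpy as np

def cond_cov(V, idx_y, idx_x, idx_z):
    """Conditional covariance V_{YX|Z} = V_YX - V_YZ V_ZZ^{-1} V_ZX
    computed from a joint covariance matrix V (Gaussian case)."""
    Vyx = V[np.ix_(idx_y, idx_x)]
    Vyz = V[np.ix_(idx_y, idx_z)]
    Vzz = V[np.ix_(idx_z, idx_z)]
    Vzx = V[np.ix_(idx_z, idx_x)]
    return Vyx - Vyz @ np.linalg.solve(Vzz, Vzx)

# X and Y depend on each other only through Z:
rng = np.random.default_rng(0)
z = rng.normal(size=10000)
x = 2.0 * z + rng.normal(size=10000)
y = -1.5 * z + rng.normal(size=10000)
V = np.cov(np.vstack([x, y, z]))   # joint covariance of (X, Y, Z)
print(cond_cov(V, [1], [0], [2]))  # near 0: X is indep. of Y given Z
print(V[1, 0])                     # clearly nonzero: X and Y correlate
```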
Conditional Covariance on RKHS
Conditional cross-covariance operator
X, Y, Z: random variables taking values in X, Y, Z (resp.).
(H_X, k_X), (H_Y, k_Y), (H_Z, k_Z): RKHS defined on X, Y, Z (resp.).
– Conditional cross-covariance operator $H_X \to H_Y$:
  $\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX}$
  ($\Sigma_{YX}$ etc.: covariance operators)
  c.f. $V_{YX|Z} = V_{YX} - V_{YZ} V_{ZZ}^{-1} V_{ZX}$
Note: $\Sigma_{ZZ}^{-1}$ may not exist. But we have the decomposition
  $\Sigma_{YX} = \Sigma_{YY}^{1/2} W_{YX} \Sigma_{XX}^{1/2}$ with $\|W_{YX}\| \le 1$.
Rigorously, define
  $\Sigma_{YX|Z} \equiv \Sigma_{YX} - \Sigma_{YY}^{1/2} W_{YZ} W_{ZX} \Sigma_{XX}^{1/2}$.
* $A^{1/2} = U \Lambda^{1/2} U^T$ if $A = U \Lambda U^T$.
Conditional Covariance and Conditional Covariance Operator
The conditional covariance operator expresses the conditional covariance.
Theorem (FBJ'06, Sun et al. '07)
X, Y, Z: random variables taking values in X, Y, Z (resp.).
(H_X, k_X), (H_Y, k_Y), (H_Z, k_Z): RKHS defined on X, Y, Z (resp.).
Assume $k_Z$ is a characteristic kernel. Then
  $\langle g, \Sigma_{YX|Z} f \rangle = E\bigl[\mathrm{Cov}[g(Y), f(X) \mid Z]\bigr]$
or
  $\Sigma_{YX|Z} = E_Z\Bigl[\textstyle\int \Phi_Y(Y) \otimes \Phi_X(X)\, dP(X, Y \mid Z)\Bigr] - E_Z\Bigl[\textstyle\int \Phi_Y(Y) \otimes \Phi_X(X)\, dP(X \mid Z)\, dP(Y \mid Z)\Bigr]$,
the difference between the mean element of the joint distribution and that of $E_Z[P_{X|Z} \otimes P_{Y|Z}]$.
– c.f. for Gaussian variables,
  $a^T V_{XY|Z} b = \mathrm{Cov}[a^T X, b^T Y \mid Z]$
  (not dependent on the value of $z$).
Conditional Independence with Kernels
(FBJ 2004, FBJ 2006, Sun et al. 2007)
Extended variables are used:
  $\ddot{X} = (X, Z)$, $\ddot{Y} = (Y, Z)$, with kernels $k_{\ddot{X}} = k_X k_Z$, $k_{\ddot{Y}} = k_Y k_Z$.
Theorem (FBJ'06, Sun et al. '07)
Assume the kernels $k_{\ddot{X}}$, $k_Y$, and $k_Z$ are characteristic. Then
  $X \perp Y \mid Z \iff \Sigma_{Y\ddot{X}|Z} = O$
  $(\iff \Sigma_{\ddot{Y}X|Z} = O \iff \Sigma_{\ddot{Y}\ddot{X}|Z} = O)$
– c.f. for Gaussian variables, $X \perp Y \mid Z \iff V_{XY|Z} = O$.
– With characteristic kernels, comparison between the (conditional) mean elements on the RKHS characterizes conditional independence.
– Why is the "extended variable" needed?
  $\Sigma_{YX|Z} = O \Rightarrow p(x, y) = \int p(x \mid z)\, p(y \mid z)\, p(z)\, dz$
  $\Sigma_{Y[X,Z]|Z} = O \Rightarrow p(x, y, z') = \int p(x, z' \mid z)\, p(y \mid z)\, p(z)\, dz$,
  where $p(x, z' \mid z) = p(x \mid z)\, \delta(z' - z)$.
Measure of Conditional Independence
Hilbert-Schmidt norm of the conditional covariance operator:
  $\mathrm{HSCIC} = \|\Sigma_{\ddot{X}\ddot{Y}|Z}\|_{HS}^2$, where $\ddot{X} = (X, Z)$, $\ddot{Y} = (Y, Z)$.
With characteristic kernels,
  $\mathrm{HSCIC} = 0 \iff X \perp Y \mid Z$.
Empirical estimation is painless!
(X_1, Y_1, Z_1), ..., (X_N, Y_N, Z_N): data.
  $\Sigma_{\ddot{X}Z} \to \hat{\Sigma}_{\ddot{X}Z}^{(N)} = \frac{1}{N} \sum_{i=1}^N \bigl(k_{\ddot{X}}(\cdot, \ddot{X}_i) - \hat{m}_{\ddot{X}}\bigr) \otimes \bigl(k_Z(\cdot, Z_i) - \hat{m}_Z\bigr)$, \quad $\Sigma_{ZZ}^{-1} \to \bigl(\hat{\Sigma}_{ZZ}^{(N)} + \varepsilon_N I\bigr)^{-1}$
  $\mathrm{HSCIC} = \|\Sigma_{\ddot{X}\ddot{Y}|Z}\|_{HS}^2 \;\to\; \mathrm{HSCIC}_{emp} = \bigl\|\hat{\Sigma}_{\ddot{Y}\ddot{X}} - \hat{\Sigma}_{\ddot{Y}Z} (\hat{\Sigma}_{ZZ} + \varepsilon_N I)^{-1} \hat{\Sigma}_{Z\ddot{X}}\bigr\|_{HS}^2$
In terms of centered Gram matrices $\tilde{K}$,
  $\mathrm{HSCIC}_{emp} = \mathrm{Tr}\bigl[\tilde{K}_{\ddot{X}} \tilde{K}_{\ddot{Y}} - 2\, \tilde{K}_{\ddot{X}} (\tilde{K}_Z + N\varepsilon_N I_N)^{-1} \tilde{K}_Z \tilde{K}_{\ddot{Y}} + \tilde{K}_Z (\tilde{K}_Z + N\varepsilon_N I_N)^{-1} \tilde{K}_{\ddot{X}} (\tilde{K}_Z + N\varepsilon_N I_N)^{-1} \tilde{K}_Z \tilde{K}_{\ddot{Y}}\bigr]$
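A minimal NumPy sketch of this estimator (our own code, not from the tutorial; the Gaussian kernel, its bandwidth, and the regularization ε_N are illustrative choices):

```python
import numpy as np

def gram(x, sigma=1.0):
    """Gaussian RBF Gram matrix for data given as an (N, d) array."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def centered(K):
    """Centered Gram matrix K~ = HKH with H = I - (1/N) 1 1^T."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def hscic(X, Y, Z, sigma=1.0, eps=1e-3):
    """Empirical HSCIC via the trace formula above; product kernels
    realize the extended variables (X, Z) and (Y, Z)."""
    N = X.shape[0]
    Kx = centered(gram(X, sigma) * gram(Z, sigma))  # K~ for (X, Z)
    Ky = centered(gram(Y, sigma) * gram(Z, sigma))  # K~ for (Y, Z)
    Kz = centered(gram(Z, sigma))
    # (K~_Z + N eps I)^{-1} K~_Z; the factors commute, so this also
    # equals K~_Z (K~_Z + N eps I)^{-1}
    R = np.linalg.solve(Kz + N * eps * np.eye(N), Kz)
    return np.trace(Kx @ Ky - 2 * Kx @ R @ Ky + R @ Kx @ R @ Ky)

# X and Y depend on each other only through Z:
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 1))
X = Z + 0.3 * rng.normal(size=(200, 1))
Y = Z ** 2 + 0.3 * rng.normal(size=(200, 1))
Y2 = X + 0.3 * rng.normal(size=(200, 1))  # directly driven by X
print(hscic(X, Y, Z))   # small: X indep. of Y given Z
print(hscic(X, Y2, Z))  # larger: dependence remains after conditioning
```

The raw values are unnormalized; in a test one compares the statistic against a permutation null rather than reading it on an absolute scale.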
Normalized Cond. Covariance
Normalized conditional cross-covariance operator:
  $W_{YX|Z} = \Sigma_{YY}^{-1/2} \Sigma_{YX|Z} \Sigma_{XX}^{-1/2} = \Sigma_{YY}^{-1/2} \bigl(\Sigma_{YX} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZX}\bigr) \Sigma_{XX}^{-1/2}$
  Recall: $\Sigma_{YX} = \Sigma_{YY}^{1/2} W_{YX} \Sigma_{XX}^{1/2}$.
  $\mathrm{HSNCIC} = \|W_{\ddot{X}\ddot{Y}|Z}\|_{HS}^2$
  $\mathrm{HSNCIC}_{emp} = \mathrm{Tr}\bigl[R_{\ddot{X}} R_{\ddot{Y}} - 2\, R_{\ddot{X}} R_{\ddot{Y}} R_Z + R_{\ddot{X}} R_Z R_{\ddot{Y}} R_Z\bigr]$,
  where $R_{\ddot{X}} \equiv \tilde{K}_{\ddot{X}} (\tilde{K}_{\ddot{X}} + N\varepsilon_N I_N)^{-1}$ etc.
Kernel-free expression. With characteristic kernels,
  $\|W_{\ddot{Y}\ddot{X}|Z}\|_{HS}^2 = \iiint \left( \frac{p_{XYZ}(x, y, z) - p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z)\, p_Z(z)}{p_{XZ}(x, z)\, p_{YZ}(y, z)} \right)^2 p_{XZ}(x, z)\, p_{YZ}(y, z)\, dx\, dy\, dz$
  (the "conditional" mean square contingency).
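The normalized statistic is just as easy to compute; a self-contained sketch (again our own illustrative code, with the same arbitrary kernel and regularization choices):

```python
import numpy as np

def gram(x, sigma=1.0):
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def centered(K):
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def hsncic(X, Y, Z, sigma=1.0, eps=1e-3):
    """Empirical HSNCIC: Tr[R_X R_Y - 2 R_X R_Y R_Z + R_X R_Z R_Y R_Z]
    with R_V = K~_V (K~_V + N eps I)^{-1} (the two factors commute)."""
    N = X.shape[0]
    def R(K):
        return np.linalg.solve(K + N * eps * np.eye(N), K)
    Rx = R(centered(gram(X, sigma) * gram(Z, sigma)))  # extended (X, Z)
    Ry = R(centered(gram(Y, sigma) * gram(Z, sigma)))  # extended (Y, Z)
    Rz = R(centered(gram(Z, sigma)))
    return np.trace(Rx @ Ry - 2 * Rx @ Ry @ Rz + Rx @ Rz @ Ry @ Rz)
```

The point of the normalization is the kernel-free population expression above: the limit of the statistic is the conditional mean square contingency, a quantity defined by the distribution alone.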
Conditional Independence Test
Background
– Few good methods exist for testing conditional independence of non-Gaussian continuous variables; a common workaround is to discretize all variables.
Permutation test with the kernel measure
  $T_N = \mathrm{HSNCIC}_{emp}$ or $T_N = \mathrm{HSCIC}_{emp}$
– Partition the values of Z into C_1, ..., C_L, and define $A_\ell = \{i \mid Z_i \in C_\ell\}$ $(\ell = 1, ..., L)$.
– Resampling (for b = 1, 2, ...):
  1. Generate a pseudo conditionally independent sample D^(b) by permuting the X values within each $A_\ell$.
  2. Compute $T_N^{(b)}$ for the sample D^(b).
  Approximate the null distribution by these samples.
– Set the threshold for the significance level (e.g. 5%).
[Diagram: samples (X_i, Y_i) grouped by the class of Z_i into C_1, ..., C_L; X values permuted within each class]
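A compact sketch of this permutation test (our code; the quantile binning of a one-dimensional Z and the number of permutations B are illustrative choices). Any conditional dependence measure can be plugged in, e.g. the hsncic sketch above.

```python
import numpy as np

def permutation_ci_test(X, Y, Z, stat, n_classes=4, B=200, seed=0):
    """Permutation test for X indep. of Y given Z.
    stat(X, Y, Z): a conditional dependence measure. Z (1-D here) is
    split into classes C_1..C_L by quantiles, and X is permuted within
    each class to generate pseudo conditionally independent samples."""
    rng = np.random.default_rng(seed)
    T = stat(X, Y, Z)
    edges = np.quantile(Z[:, 0], np.linspace(0, 1, n_classes + 1)[1:-1])
    labels = np.digitize(Z[:, 0], edges)  # class index of each Z_i
    null = np.empty(B)
    for b in range(B):
        Xp = X.copy()
        for l in np.unique(labels):
            idx = np.flatnonzero(labels == l)
            Xp[idx] = X[rng.permutation(idx)]  # permute X within class
        null[b] = stat(Xp, Y, Z)
    pval = (1 + np.count_nonzero(null >= T)) / (B + 1)
    return T, pval

# Reject conditional independence at the 5% level if pval < 0.05:
# T, pval = permutation_ci_test(X, Y, Z, stat=hsncic)
```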
Application to Graphical Modeling
– Three continuous variables of medical measurements, N = 35 (Edwards 2000, Sec. 3.1.4):
  Creatinine clearance (C), Digoxin clearance (D), Urine flow (U).

              Kernel method (permutation test)   Linear method
              HSN(C)IC   P-value                 (partial) cor.          P-value
  D ⊥ U | C   1.458      0.924                   Parcor(D,U|C) = 0.4847  0.0037
  C ⊥ D       0.776      <0.001                  Cor(C,D) = 0.7754       0.0000
  C ⊥ U       0.194      0.117                   Cor(C,U) = 0.3092       0.0707
  D ⊥ U       0.343      0.023                   Cor(D,U) = 0.5309       0.0010

– Undirected graphical model suggested by the kernel method:
  [Graph: D – C – U, with no direct edge between D and U]
  The conditional independence D ⊥ U | C coincides with the medical knowledge.
III. Application to causal inference
Causal Inference
With manipulation (intervention): is X a cause of Y?
– Manipulate X and observe Y. Easier (do-calculus, Pearl 1995).
No manipulation, with temporal information:
– X(t), Y(t): observed time series. Are X(1), ..., X(t) a cause of Y(t+1)?
No manipulation, no temporal information:
– Causal inference is harder.
Causality of Time Series
Causality by conditional independence
– Extended notion of Granger causality (linear AR):
  X is NOT a cause of Y if
  $p(Y_t \mid Y_{t-1}, ..., Y_{t-p}, X_{t-1}, ..., X_{t-p}) = p(Y_t \mid Y_{t-1}, ..., Y_{t-p})$,
  i.e. $Y_t \perp X_{t-1}, ..., X_{t-p} \mid Y_{t-1}, ..., Y_{t-p}$.
– Kernel measures for causality:
  $\mathrm{HSCIC} = \bigl\|\hat{\Sigma}^{(N-p+1)}_{Y \ddot{X}_p \mid Y_p}\bigr\|_{HS}^2$, \quad $\mathrm{HSNCIC} = \bigl\|\hat{W}^{(N-p+1)}_{Y \ddot{X}_p \mid Y_p}\bigr\|_{HS}^2$,
  where
  $X_p = \{(X_{t-1}, X_{t-2}, ..., X_{t-p}) \in \mathbb{R}^p \mid t = p+1, ..., N\}$,
  $Y_p = \{(Y_{t-1}, Y_{t-2}, ..., Y_{t-p}) \in \mathbb{R}^p \mid t = p+1, ..., N\}$.
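Building the lagged samples X_p, Y_p is straightforward; a small helper of ours that produces the lag arrays to feed into the kernel measures above:

```python
import numpy as np

def lagged(series, p):
    """Rows are the lag vectors (s_{t-1}, ..., s_{t-p}) for t = p+1,...,N,
    i.e. the sets X_p / Y_p from the slide, as an (N - p, p) array."""
    s = np.asarray(series)
    N = len(s)
    return np.column_stack([s[p - 1 - j : N - 1 - j] for j in range(p)])

# Usage: test "X does not cause Y", i.e.
#   Y_t indep. of (X_{t-1},...,X_{t-p}) given (Y_{t-1},...,Y_{t-p}):
# Yt, Xp, Yp = y[p:, None], lagged(x, p), lagged(y, p)
# then use, e.g., hsncic(Xp, Yt, Yp) as the test statistic
```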
Example
Coupled Hénon map
– X, Y: coupled with strength γ,
  $x_1(t+1) = 1.4 - x_1(t)^2 + 0.3\, x_2(t)$
  $x_2(t+1) = x_1(t)$
  $y_1(t+1) = 1.4 - \bigl(\gamma\, x_1(t)\, y_1(t) + (1 - \gamma)\, y_1(t)^2\bigr) + 0.1\, y_2(t)$
  $y_2(t+1) = y_1(t)$
[Scatter plots: x1 vs x2, and x1 vs y1 for γ = 0, 0.25, and 0.8]
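The map is easy to simulate; a short sketch (ours; the initial conditions and transient length are arbitrary choices intended to keep the orbit on the attractor):

```python
import numpy as np

def coupled_henon(N, gamma, discard=500):
    """Simulate the coupled Henon map above; X drives Y with coupling
    strength gamma (gamma = 0: the two systems evolve independently)."""
    x = np.array([0.1, 0.1])  # (x1, x2), chosen near the attractor
    y = np.array([0.2, 0.2])  # (y1, y2)
    X, Y = np.empty(N), np.empty(N)
    for t in range(N + discard):
        # update both systems from the *old* states
        x_new = np.array([1.4 - x[0] ** 2 + 0.3 * x[1], x[0]])
        y_new = np.array([1.4 - (gamma * x[0] * y[0]
                                 + (1 - gamma) * y[0] ** 2) + 0.1 * y[1],
                          y[0]])
        x, y = x_new, y_new
        if t >= discard:  # drop the transient
            X[t - discard], Y[t - discard] = x[0], y[0]
    return X, Y

# Dependence between x1 and y1 appears for gamma > 0, e.g.:
# X, Y = coupled_henon(100, gamma=0.25)
```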
Causality of the coupled Hénon map
– X is a cause of Y if γ > 0; Y is not a cause of X for any γ.
– Tested hypotheses:
  $Y_t \perp X_{t-1}, ..., X_{t-p} \mid Y_{t-1}, ..., Y_{t-p}$ (X does not cause Y)
  $X_t \perp Y_{t-1}, ..., Y_{t-p} \mid X_{t-1}, ..., X_{t-p}$ (Y does not cause X)
– Permutation tests for non-causality with $\mathrm{HSNCIC} = \bigl\|\hat{W}^{(N-p+1)}_{Y \ddot{X}_p \mid Y_p}\bigr\|_{HS}^2$.
– N = 100, significance level α = 5%.
[Plots: ratio of accepting non-causality (out of 100 experiments) vs γ ∈ [0, 0.6], for X → Y (causal for γ > 0) and Y → X (non-causal), comparing HSNCIC with the Granger test]
Causal Inference from Non-experimental Data
Why is it possible?
– V-structure: $X \to Z \leftarrow Y$ is distinguishable from the other orientations,
  $X \to Z \to Y$, $X \leftarrow Z \leftarrow Y$, and $X \leftarrow Z \to Y$,
  which are not mutually distinguishable: they all factorize the same joint density,
  $p(x|z)\, p(y|z)\, p(z) = p(x|z)\, p(z|y)\, p(y) = p(z|x)\, p(y|z)\, p(x) = p(x, y, z)$,
  and all imply $X \perp Y \mid Z$.
– Constraint-based causal learning
  • Determine the conditional independences of the underlying probability.
  • Markov assumption: the data are generated by a DAG.
Causal Learning
Inductive causation (IC, Verma & Pearl 1990)
– Basic idea:
  • List all conditional independence/dependence relations among the variables.
  • Build an undirected graph under the Markov assumption.
  • Orient edges by finding V-structures.
– PC algorithm (Peter Spirtes & Clark Glymour 1991)
  • Efficient implementation of IC (skeleton search sketched below).
  • Gaussian or discrete assumptions for the conditional independence tests.
Kernel Causal Learning (KCL, Sun et al. ICML 2007)
– Kernel test of conditional independence for both continuous and discrete variables.
– Orients edges by voting.
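A simplified sketch of the PC skeleton phase (our illustrative code; `ci_test` is any conditional independence oracle, e.g. the kernel permutation test above):

```python
import itertools

def pc_skeleton(variables, ci_test, max_cond=2):
    """Skeleton phase of the PC algorithm (simplified).
    ci_test(i, j, S) -> True if variable i is judged independent of j
    given the conditioning set S. Start from the complete undirected
    graph; remove edge i-j whenever a separating set S is found among
    the current neighbours of i. Separating sets are recorded because
    the orientation phase uses them to detect V-structures."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    for size in range(max_cond + 1):
        for i, j in itertools.combinations(variables, 2):
            if j not in adj[i]:
                continue  # edge already removed
            for S in itertools.combinations(sorted(adj[i] - {j}), size):
                if ci_test(i, j, set(S)):
                    adj[i].discard(j)
                    adj[j].discard(i)
                    sepset[frozenset((i, j))] = set(S)
                    break
    return adj, sepset
```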
Experiment: Montana Economic Outlook Poll
– Data: 7 discrete variables, N = 209.
  AGE (3), SEX (2), INCOME (3), POLITICAL (3), AREA (3),
  FINANCIAL status (3: better/same/worse than a year ago), OUTLOOK (2)
[Three learned graphs over the seven variables: KCL, FCI, and BN-PC]
BN-PC is a constraint-based method using mutual information (Cheng et al. 2002).
FCI is the fast IC algorithm, which allows hidden variables (Spirtes et al. 1993).
Summary of Part 3
Conditional independence with kernels
– The conditional covariance operator on an RKHS characterizes conditional independence.
– Its HS-norm, estimated from a finite sample, gives a kernel measure of conditional independence.
– The kernel method gives a unified conditional independence test for continuous and discrete variables.
Causal inference with kernels
– Kernel conditional independence tests are applied to causal inference, such as
  • causality of time series (extension of Granger causality)
  • causal inference from non-experimental data (constraint-based causal learning).
References
Berlinet, A. and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers (2004).
Cheng, J., R. Greiner, J. Kelly, D. A. Bell, and W. Liu. Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence Journal 137:43-90 (2002).
Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in NIPS 20:489-496 (2008).
Fukumizu, K., F. Bach, and M. Jordan. Kernel dimension reduction in regression. Tech. Report 715, Dept. of Statistics, University of California, Berkeley (2006).
Granger, C. W. J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37:424-438 (1969).
Spirtes, P. and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review 9:62-72 (1991).
Spirtes, P., C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, New York (1993).
Sun, X., D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proc. 24th Intern. Conf. Machine Learning (ICML 2007), pp. 855-862 (2007).
Verma, T. and J. Pearl. Equivalence and synthesis of causal models. Proc. 6th Conf. Uncertainty in Artificial Intelligence (UAI 1990), pp. 220-227 (1990).
Pearl, J. Causality. Cambridge University Press (2000).
Edwards, D. Introduction to Graphical Modelling. Springer-Verlag, New York (2000).