THE MATHEMATICS OF CAUSE AND EFFECT
Judea Pearl, UCLA, November 8, 2012

OUTLINE
1. From the Turing test to Bayes networks
2. From Bayes networks to do-calculus
3. From messy science to counterfactuals
4. From counterfactuals to practical victories
   a) policy evaluation
   b) attribution
   c) mediation
   d) generalizability

CAN MACHINES THINK?
Alan M. Turing (1912–1954), "Computing Machinery and Intelligence" (1950): the Turing Test.
Turing: Yes, if it acts like it thinks.
Acts = it answers non-trivial questions about a story, a topic, or a situation.

HOW TURING ENVISIONED THE TEST CONVERSATION
Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34,957 and 70,764.
A: (Pause about 30 seconds and then give an answer) 105,721.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces. You have only K at K6 and R at R1. It is your move. What do you play?
A: (After a pause of 15 seconds) R-R8: mate!

A CHILD MACHINE AND EVOLUTION
• The survival of the fittest is a slow method for measuring advantages.
• The experimenter, by exercise of intelligence, should be able to speed it up.
• If he can trace a cause for some weakness he can probably think of the kind of mutation which will improve it.
(A. M. Turing, 1950)

A "MINI" TURING TEST IN CAUSAL CONVERSATION
The Story: the sprinkler example (image adapted from Saygin, 2000).
Input: a story. Questions: What is? What if? Why? Answers: "I believe that..."
Q1: If the season is dry and the pavement is slippery, did it rain?
A1: Unlikely; it is more likely the sprinkler was ON, with a very slight possibility that the pavement is not even wet.
Q2: But what if we SEE that the sprinkler is OFF?
A2: Then it is more likely that it rained.
Q3: Do you mean that if we actually turn the sprinkler ON, the rain will be less likely?
A3: No, the likelihood of rain would remain the same, but the pavement will surely get wet.
Q4: Suppose we SEE that the sprinkler is ON and the pavement is wet. What if the sprinkler were OFF?
A4: The pavement would be dry, because the season is likely dry.
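The first two exchanges above can be reproduced mechanically once the story is encoded as a Bayes network over Season, Sprinkler, Rain, Wet, and Slippery. The sketch below does this by brute-force enumeration; the network structure follows the sprinkler story, but the numerical conditional probabilities are illustrative assumptions, not values given in the talk.

```python
from itertools import product

# Conditional probability tables (illustrative numbers, assumed for this sketch).
P_dry = 0.5                                   # P(season = dry)
P_sprinkler = {True: 0.40, False: 0.05}       # P(sprinkler ON | season dry?)
P_rain = {True: 0.10, False: 0.60}            # P(rain | season dry?)
P_wet = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.01}   # P(wet | sprinkler, rain)
P_slip = {True: 0.90, False: 0.01}            # P(slippery | wet)

def joint(dry, spr, rain, wet, slip):
    """Probability of one complete assignment under the network factorization."""
    p = P_dry if dry else 1 - P_dry
    p *= P_sprinkler[dry] if spr else 1 - P_sprinkler[dry]
    p *= P_rain[dry] if rain else 1 - P_rain[dry]
    p *= P_wet[(spr, rain)] if wet else 1 - P_wet[(spr, rain)]
    p *= P_slip[wet] if slip else 1 - P_slip[wet]
    return p

def prob(query, evidence):
    """P(query | evidence) by summing the joint over all assignments."""
    num = den = 0.0
    for vals in product([True, False], repeat=5):
        world = dict(zip(["dry", "spr", "rain", "wet", "slip"], vals))
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(**world)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

# Q1: dry season, slippery pavement; did it rain?
print(prob({"rain": True}, {"dry": True, "slip": True}))   # low
print(prob({"spr": True},  {"dry": True, "slip": True}))   # higher: sprinkler is the better explanation
# Q2: we now SEE the sprinkler is OFF; rain becomes the likely explanation.
print(prob({"rain": True}, {"dry": True, "slip": True, "spr": False}))
```

Q3 and Q4, by contrast, cannot be answered by conditioning alone; they require the interventional and counterfactual machinery introduced below.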
SEARLE'S CHINESE ROOM ARGUMENT
What's in Searle's rule book? Searle's oversight: there are not enough molecules in the universe to make the book, even for the sprinkler example. Hence the focus on causal conversation.

IS PARSIMONY NECESSARY (SUFFICIENT) FOR UNDERSTANDING?
Understanding requires translating world constraints into a grammar (constraints over symbol strings) and harnessing it to answer queries swiftly and reliably.
Parsimony can only be achieved by exploiting the constraints in the world to beat the combinatorial explosion.

THE PLURALITY OF MINI TURING TESTS
The Turing Test splits into many mini tests: poetry, arithmetic, chess, the stock market, robotics, scientific thinking, data-intensive scientific applications, thousands of hungry and aimless customers, human cognition and ethics, and, among them, causal reasoning.

CAUSAL EXPLANATION
"She handed me the fruit and I ate."
"The serpent deceived me, and I ate."

COUNTERFACTUALS AND OUR SENSE OF JUSTICE
Abraham: Are you about to smite the righteous with the wicked? What if there were fifty righteous men in the city?
And the Lord said, "If I find in the city of Sodom fifty good men, I will pardon the whole place for their sake." (Genesis 18:26)

WHY PHYSICS IS COUNTERFACTUAL
Scientific equations (e.g., Hooke's law) are non-algebraic.
Example: the length Y equals a constant (2) times the weight X.
Correct notation: Y := 2X (an assignment, not an algebraic equation).
Process information: X = 1. The solution: Y = 2.
Had X been 3, Y would be 6; if we raise X to 3, Y will be 6. We must "wipe out" X = 1.
The algebraic alternatives X = ½Y, or Y = X + 1 (which also fits the solution X = 1, Y = 2), carry the same solution but give the wrong answer to "Had X been 3".
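A minimal sketch of the "wipe out X = 1" operation, assuming the two-equation model above (X set by an exogenous input, Y := 2X): intervening on X propagates to Y, while intervening on Y leaves X untouched, which is exactly the asymmetry the algebraic reading of Y = 2X cannot express. The variable names and the dictionary representation are mine, for illustration only.

```python
def solve(x_equation, y_equation):
    """Solve the structural equations in causal order: first X, then Y."""
    x = x_equation()
    y = y_equation(x)
    return x, y

# The model: X := 1 (process information), Y := 2 * X (Hooke-like law).
model = {"X": lambda: 1, "Y": lambda x: 2 * x}

print(solve(model["X"], model["Y"]))          # (1, 2): the observed solution

# Counterfactual "Had X been 3": wipe out the equation for X, keep the one for Y.
do_x3 = dict(model, X=lambda: 3)
print(solve(do_x3["X"], do_x3["Y"]))          # (3, 6)

# Intervening on Y instead (clamping the length) does NOT change X:
x = model["X"]()
y = 4                                          # Y set by fiat; the Y-equation is wiped out
print((x, y))                                  # (1, 4): X keeps its own equation
```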
CAUSATION AS A PROGRAMMER'S NIGHTMARE
Input:
1. "If the grass is wet, then it rained."
2. "If we break this bottle, the grass will get wet."
Output: "If we break this bottle, then it rained."

WHAT KIND OF QUESTIONS SHOULD THE ROBOT ANSWER?
• Observational questions: "What if we see A?" (What is?)
• Action questions: "What if we do A?" (What if?)
• Counterfactual questions: "What if we did things differently?" (Why?)
• Options: "With what probability?"
This is the causal hierarchy.

THE FIVE NECESSARY STEPS OF CAUSAL ANALYSIS
Define: Express the target quantity Q as a property of the model M.
Assume: Express causal assumptions in structural or graphical form.
Identify: Determine whether Q is identifiable.
Estimate: Estimate Q if it is identifiable; approximate it if it is not.
Test: If M has testable implications, test them against the data.
The same five steps apply whatever the target quantity Q is, for example:
• Effect estimation: Q = P(Y_x = y) = P(y | do(x))
• Average treatment effect: ATE = E[Y | do(x1)] − E[Y | do(x0)]
• Dynamic policy analysis: Q = P(y | do(X = g(Z)))
• Time-varying policy analysis: Q = P(y | do(X = x, Z = z, W = w))
• Effect of treatment on the treated: ETT = P(Y_x = y | X = x')
• Indirect effects: IE = E[Y_{x, Z(x')}] − E[Y_x]

THE LOGIC OF CAUSAL ANALYSIS
Causal assumptions A define a causal model M_A, with logical implications A*. Causal inference maps the queries of interest Q into identified estimands Q(P), and M_A into testable implications T(M_A). Statistical inference then takes data D to estimates Q̂ = Q̂(D, A) of Q(P) (provisional claims) and to goodness-of-fit tests g(T) of the model.

STRUCTURAL CAUSAL MODELS: THE WORLD AS A COLLECTION OF SPRINGS
Definition: A structural causal model is a 4-tuple ⟨V, U, F, P(u)⟩, where
• V = {V1,...,Vn} are endogenous variables,
• U = {U1,...,Um} are background variables,
• F = {f1,..., fn} are functions determining V: v_i = f_i(v, u), e.g., y = x + u_Y,
• P(u) is a distribution over U.
P(u) and F induce a distribution P(v) over the observable variables.

COUNTERFACTUALS ARE EMBARRASSINGLY SIMPLE
Definition: The sentence "Y would be y (in situation u), had X been x", denoted Y_x(u) = y, means: the solution for Y in the mutilated model M_x (the equations for X replaced by X = x), with input U = u, is equal to y.
The Fundamental Equation of Counterfactuals: Y_x(u) = Y_{M_x}(u)
Joint probabilities of counterfactuals:
P(Y_x = y, Z_w = z) = Σ_{u: Y_x(u) = y, Z_w(u) = z} P(u)
In particular:
P(y | do(x)) = P(Y_x = y) = Σ_{u: Y_x(u) = y} P(u)
P(Y_x' = y' | x, y) = Σ_{u: Y_x'(u) = y'} P(u | x, y)

THE MIRACLE OF UNIVERSAL CONSTRAINTS (E PLURIBUS UNUM: OUT OF MANY, ONE)
C (Climate):   C = f_C(U_C)
S (Sprinkler): S = f_S(C, U_S)
R (Rain):      R = f_R(C, U_R)
W (Wetness):   W = f_W(S, R, U_W)
Each function summarizes millions of micro processes. Still, if the U's are independent, the observed distribution P(C, R, S, W) must satisfy certain constraints that are (1) independent of the f's and of P(U) and (2) can be read from the structure of the graph.

D-SEPARATION: NATURE'S LANGUAGE FOR COMMUNICATING ITS STRUCTURE
Every missing arrow advertises an independency, conditional on a separating set, e.g.,
C ⊥ W | (S, R)
S ⊥ R | C
Applications:
1. Structure learning
2. Model testing
3. Reducing "what if I do" questions to symbolic calculus
4. Reducing scientific questions to symbolic calculus
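A minimal sketch of the fundamental equation Y_x(u) = Y_{M_x}(u) on the sprinkler model above: fix the background variables u, replace the equation of the intervened variable, and re-solve. The particular functions f and background values are assumptions invented for the illustration, not the talk's.

```python
def solve(model, u):
    """Solve the structural equations in causal order C, S, R, W for background u."""
    v = {}
    v["C"] = model["C"](u)
    v["S"] = model["S"](v["C"], u)
    v["R"] = model["R"](v["C"], u)
    v["W"] = model["W"](v["S"], v["R"], u)
    return v

# Structural causal model M for the sprinkler story (illustrative functions).
M = {
    "C": lambda u: u["uc"],                        # climate: dry (1) or wet (0) season
    "S": lambda c, u: int(c == 1 or u["us"]),      # sprinkler ON mostly in the dry season
    "R": lambda c, u: int(c == 0 and u["ur"]),     # rain mostly in the wet season
    "W": lambda s, r, u: int(s or r or u["uw"]),   # pavement wet if sprinkler or rain
}

u = {"uc": 1, "us": 0, "ur": 0, "uw": 0}           # one particular situation u
print(solve(M, u))                                 # factual world: sprinkler ON, pavement wet

# Counterfactual "had the sprinkler been OFF": mutilate M by replacing the S-equation.
M_s0 = dict(M, S=lambda c, u: 0)
print(solve(M_s0, u))                              # W_{S=0}(u) = 0: pavement dry, as in answer A4
```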
SEEING VS. DOING
P(x1,..., xn) = Π_i P(x_i | pa_i)
For the sprinkler network:
P(x1, x2, x3, x4, x5) = P(x1) P(x2 | x1) P(x3 | x1) P(x4 | x2, x3) P(x5 | x4)
Effect of turning the sprinkler ON (truncated factorization):
P_{X3 = ON}(x1, x2, x4, x5) = P(x1) P(x2 | x1) P(x4 | x2, X3 = ON) P(x5 | x4)
≠ P(x1, x2, x4, x5 | X3 = ON)

THE MACHINERY OF CAUSAL CALCULUS
Rule 1 (ignoring observations): P(y | do(x), z, w) = P(y | do(x), w) whenever (Y ⊥ Z | X, W) holds in the graph obtained by deleting all arrows pointing into X.
Rule 2 (action/observation exchange): P(y | do(x), do(z), w) = P(y | do(x), z, w) whenever (Y ⊥ Z | X, W) holds in the graph obtained by deleting all arrows into X and all arrows out of Z.
Rule 3 (ignoring actions): P(y | do(x), do(z), w) = P(y | do(x), w) whenever (Y ⊥ Z | X, W) holds in the graph obtained by deleting all arrows into X and into Z(W), where Z(W) is the set of Z-nodes that are not ancestors of any W-node in the X-pruned graph.
Completeness theorem (Shpitser, 2006).

"WHAT IF I SMOKE?" REDUCED TO CALCULUS
Model: Smoking → Tar → Cancer, with an unobserved Genotype affecting both Smoking and Cancer.
P(c | do(s))
= Σ_t P(c | do(s), t) P(t | do(s))                        [probability axioms]
= Σ_t P(c | do(s), do(t)) P(t | do(s))                    [Rule 2]
= Σ_t P(c | do(s), do(t)) P(t | s)                        [Rule 2]
= Σ_t P(c | do(t)) P(t | s)                               [Rule 3]
= Σ_{s'} Σ_t P(c | do(t), s') P(s' | do(t)) P(t | s)      [probability axioms]
= Σ_{s'} Σ_t P(c | t, s') P(s' | do(t)) P(t | s)          [Rule 2]
= Σ_{s'} Σ_t P(c | t, s') P(s') P(t | s)                  [Rule 3]
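The end result of the derivation above, the front-door formula P(c | do(s)) = Σ_t P(t | s) Σ_{s'} P(c | t, s') P(s'), can be evaluated from purely observational data. The sketch below does so on a simulated smoking/tar/cancer population; the data-generating numbers are invented for the illustration, and the hidden genotype is used only to generate the data, never to estimate the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000

# Ground-truth data generator (genotype G stays hidden; numbers are illustrative).
g = rng.random(N) < 0.5
s = rng.random(N) < np.where(g, 0.8, 0.2)            # genotype pushes people to smoke
t = rng.random(N) < np.where(s, 0.95, 0.05)          # smoking deposits tar
c = rng.random(N) < 0.1 + 0.5 * t + 0.3 * g          # cancer from tar and genotype

def P(event, given=None):
    """Empirical (conditional) probability from the observational sample."""
    sel = given if given is not None else np.ones(N, dtype=bool)
    return (event & sel).sum() / sel.sum()

def front_door(s_val):
    """Front-door estimand, using only the observed variables S, T, C."""
    total = 0.0
    for t_val in (False, True):
        p_t_given_s = P(t == t_val, s == s_val)
        inner = sum(P(c, (t == t_val) & (s == s_prime)) * P(s == s_prime)
                    for s_prime in (False, True))
        total += p_t_given_s * inner
    return total

print("P(cancer | do(smoke))    ~", front_door(True))
print("P(cancer | do(no smoke)) ~", front_door(False))
# For comparison, the truth by intervening directly in the generator:
t_do = rng.random(N) < 0.95                          # force S = smoke
print("simulated do(smoke)      ~", (rng.random(N) < 0.1 + 0.5 * t_do + 0.3 * g).mean())
```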
EFFECT OF WARM-UP ON INJURY
(Graphical example after Shrier & Platt, 2008.)

DETERMINING CAUSES OF EFFECTS: A COUNTERFACTUAL VICTORY
• "Your Honor! My client (Mr. A) died BECAUSE he used that drug."
• The court must decide whether it is MORE PROBABLE THAN NOT that A would be alive BUT FOR the drug:
PN = P(A would be alive | A is dead, took the drug) > 0.50

THE ATTRIBUTION PROBLEM
Definition: What is the meaning of PN(x, y), the "probability that event y would not have occurred if it were not for event x, given that x and y did in fact occur"?
Answer: PN(x, y) = P(Y_x' = y' | x, y), computable from M.
Identification: Under what conditions can PN(x, y) be learned from statistical data, i.e., observational, experimental, or combined?

ATTRIBUTION MATHEMATIZED (Tian and Pearl, 2000)
Bounds, given combined nonexperimental and experimental data (P(y, x) and P(y_x) for all y and x):
max{0, [P(y) − P(y_x')] / P(x, y)} ≤ PN ≤ min{1, [P(y'_x') − P(x', y')] / P(x, y)}
Identifiability under monotonicity (combined data):
PN = [P(y | x) − P(y | x')] / P(y | x) + [P(y | x') − P(y_x')] / P(x, y)

CAN FREQUENCY DATA DECIDE LEGAL RESPONSIBILITY?
                    Experimental           Nonexperimental
                    do(x)      do(x')      x          x'
Deaths (y)          16         14          2          28
Survivals (y')      984        986         998        972
Total               1,000      1,000       1,000      1,000
• Nonexperimental data: drug usage predicts longer life.
• Experimental data: the drug has a negligible effect on survival.
• Plaintiff: Mr. A is special. 1. He actually died. 2. He used the drug by choice.
• Court to decide (given both data sets): is it more probable than not that A would be alive but for the drug?
PN = P(Y_x' = y' | x, y) > 0.50?

SOLUTION TO THE ATTRIBUTION PROBLEM
• WITH PROBABILITY ONE: 1 ≤ P(y'_x' | x, y) ≤ 1
• Combined data tell more than each study alone.
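The Tian-Pearl bounds can be evaluated directly on the frequency table above. The sketch below does so, treating the two nonexperimental columns as a 50/50 split of the population (an assumption suggested by the equal column totals, not stated in the slides); it reproduces the "with probability one" conclusion.

```python
# Frequency table from the slides: x = took the drug, y = death.
P_yx_prime = 14 / 1000                                       # P(y_{x'}) from the experimental do(x') arm
P_x = 0.5                                                    # assumed share of drug takers
P_y_given_x, P_y_given_xp = 2 / 1000, 28 / 1000              # nonexperimental arms

P_xy = P_x * P_y_given_x                                     # P(x, y)
P_y = P_x * P_y_given_x + (1 - P_x) * P_y_given_xp           # P(y)
P_xp_yp = (1 - P_x) * (1 - P_y_given_xp)                     # P(x', y')

lower = max(0.0, (P_y - P_yx_prime) / P_xy)
upper = min(1.0, ((1 - P_yx_prime) - P_xp_yp) / P_xy)
print(f"{lower:.3f} <= PN <= {upper:.3f}")                   # 1.000 <= PN <= 1.000
```

So, even though the experiment shows a negligible average effect, the combination of the two data sets implies that the drug was responsible for Mr. A's death with probability one.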
MEDIATION: ANOTHER COUNTERFACTUAL TRIUMPH
Why decompose effects?
1. To understand how Nature works.
2. To comply with legal requirements.
3. To predict the effects of new types of interventions: signal re-routing and mechanism deactivation, rather than variable fixing.

COUNTERFACTUAL DEFINITION OF INDIRECT EFFECTS
Model: X → Z → Y with z = f(x, u), y = g(x, z, u). There is no "controlled" indirect effect.
Indirect effect of X on Y, IE(x0, x1; Y): the expected change in Y when we keep X constant, say at x0, and let Z change to whatever value it would have attained had X changed to x1:
IE = E[Y_{x0, Z_{x1}} − Y_{x0}]
In linear models, IE = TE − DE.

POLICY IMPLICATIONS OF INDIRECT EFFECTS
What is the indirect effect of X on Y? For example, the effect of Gender (X) on Hiring (Y) if sex discrimination is eliminated, with Qualification (Z) as the mediator. Deactivating a link is a new type of intervention.

MEDIATION FORMULAS IN UNCONFOUNDED MODELS
Model: X → Z → Y with z = f(x, u1), y = g(x, z, u2), u1 independent of u2.
DE = Σ_z [E(Y | x1, z) − E(Y | x0, z)] P(z | x0)
IE = Σ_z E(Y | x0, z) [P(z | x1) − P(z | x0)]
TE = E(Y | x1) − E(Y | x0)
TE ≠ DE + IE (in general)
IE / TE: fraction of responses explained by mediation (sufficient)
(TE − DE) / TE: fraction of responses owed to mediation (necessary)
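A minimal sketch of the mediation formula on a small, unconfounded binary example (all numbers invented for the illustration): it computes DE, IE, and TE from the two estimable quantities E(Y | x, z) and P(z | x), exactly as defined above.

```python
# Illustrative, unconfounded binary model: X -> Z -> Y plus a direct X -> Y path.
P_z1_given_x = {0: 0.3, 1: 0.7}                      # P(Z = 1 | x)
E_y_given_xz = {(0, 0): 0.1, (0, 1): 0.4,            # E(Y | x, z)
                (1, 0): 0.3, (1, 1): 0.8}

def P_z(z, x):
    return P_z1_given_x[x] if z == 1 else 1 - P_z1_given_x[x]

x0, x1 = 0, 1
DE = sum((E_y_given_xz[(x1, z)] - E_y_given_xz[(x0, z)]) * P_z(z, x0) for z in (0, 1))
IE = sum(E_y_given_xz[(x0, z)] * (P_z(z, x1) - P_z(z, x0)) for z in (0, 1))
TE = sum(E_y_given_xz[(x1, z)] * P_z(z, x1) for z in (0, 1)) - \
     sum(E_y_given_xz[(x0, z)] * P_z(z, x0) for z in (0, 1))

print(f"DE = {DE:.3f}, IE = {IE:.3f}, TE = {TE:.3f}")          # note TE != DE + IE here
print(f"explained by mediation (IE/TE)   = {IE / TE:.2f}")
print(f"owed to mediation ((TE - DE)/TE) = {(TE - DE) / TE:.2f}")
```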
TRANSPORTABILITY OF KNOWLEDGE ACROSS DOMAINS (with E. Bareinboim)
1. A theory of causal transportability: when can causal relations learned from experiments be transferred to a different environment in which no experiment can be conducted?
2. A theory of statistical transportability: when can statistical information learned in one domain be transferred to a different domain in which (a) only a subset of variables can be observed, or (b) only a few samples are available?

MOTIVATION: WHAT CAN EXPERIMENTS IN LA TELL US ABOUT NYC?
Experimental study in LA: X (intervention) → Y (outcome), Z (age); measured P(x, y, z) and P(y | do(x), z).
Observational study in NYC: measured P*(x, y, z), with P*(z) ≠ P(z).
Needed: P*(y | do(x)) = ? = Σ_z P(y | do(x), z) P*(z)
Transport formula (calibration): F(P, P_do, P*)

TRANSPORT FORMULAS DEPEND ON THE STORY
(S marks the factors producing differences between the two populations.)
a) If Z represents age: P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z)
b) If Z represents language skill: P*(y | do(x)) = P(y | do(x))
c) If Z represents a bio-marker: P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z | x)

GOAL: AN ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
Input: an annotated causal graph, with S-nodes marking the factors creating differences.
Output:
1. Transportable or not?
2. Measurements to be taken in the experimental study.
3. Measurements to be taken in the target population.
4. A transport formula P*(y | do(x)) = f[P(y, v, z, w, t, u | do(x)); P*(y, v, z, w, t, u)].

TRANSPORTABILITY REDUCED TO CALCULUS
Theorem: A causal relation R is transportable from Π to Π* if and only if it is reducible, using the rules of do-calculus, to an expression in which S is separated from do(·).
Example:
R(Π*) = P*(y | do(x)) = P(y | do(x), s)
= Σ_w P(y | do(x), s, w) P(w | do(x), s)
= Σ_w P(y | do(x), w) P(w | s)
= Σ_w P(y | do(x), w) P*(w)

RESULT: AN ALGORITHM TO DETERMINE IF AN EFFECT IS TRANSPORTABLE
Same input and output as above, plus:
5. Completeness (Bareinboim, 2012).
Example transport formula:
P*(y | do(x)) = Σ_z P(y | do(x), z) Σ_w P*(z | w) Σ_t P(w | do(x), t) P*(t)

WHICH MODEL LICENSES THE TRANSPORT OF THE CAUSAL EFFECT X → Y?
(A gallery of six selection diagrams, (a) through (f), in which the S-node, the external factor creating disparities, enters at different places; some license transport and some do not.)

STATISTICAL TRANSPORTABILITY (TRANSFER LEARNING)
Why should we transport statistical information, i.e., why not re-learn things from scratch?
1. Measurements are costly. Limit measurements to a subset V* of variables, called the "scope".
2. Samples are scarce. Pooling samples from diverse populations will improve precision, if differences can be filtered out.

STATISTICAL TRANSPORTABILITY
Definition (statistical transportability): A statistical relation R(P) is said to be transportable from Π to Π* over V* if R(P*) is identified from P, P*(V*), and D, where P*(V*) is the marginal distribution of P* over a subset of variables V*.
Example: R = P*(y | x) is transportable over V* = {X, Z}, i.e., R is estimable without re-measuring Y:
R = Σ_z P*(z | x) P(y | z)
Transfer learning: if only a few samples (N2) are available from Π* and many samples (N1) from Π, then estimating R = P*(y | x) by
R̂ = Σ_z P*(y | x, z) P(z | x)
achieves a much higher precision.

META-ANALYSIS OR MULTI-SOURCE LEARNING
Target population Π*, target relation R = P*(y | do(x)).
(A collection of nine studies, (a) through (i), each conducted on a different population and represented by its own selection diagram.)

CAN WE GET A BIAS-FREE ESTIMATE OF THE TARGET QUANTITY?
Target population Π*, R = P*(y | do(x)). Is R identifiable from studies (d) and (h)?
R = Σ_w P*(y | do(x), w) P*(w | do(x))
= Σ_w P_(h)(y | do(x), w) P_(d)(w | do(x))
= Σ_w P_(h)(y | do(x), w) P_(d)(w | x)
R(Π*) is identifiable from studies (d) and (h); it is not identifiable from studies (d) and (i).

FROM META-ANALYSIS TO META-SYNTHESIS
The problem: how to combine the results of several experimental and observational studies, each conducted on a different population and under a different set of conditions, so as to construct an aggregate measure of effect size that is "better" than any one study in isolation.

META-SYNTHESIS REDUCED TO CALCULUS
Theorem: Let {Π1, Π2,..., ΠK} be a set of studies and {D1, D2,..., DK} their selection diagrams (relative to Π*). A relation R(Π*) is "meta-estimable" if it can be decomposed into terms of the form Q_k = P(V_k | do(W_k), Z_k) such that each Q_k is transportable from D_k.
Open problem: systematic decomposition.

BIAS VS. PRECISION IN META-SYNTHESIS
Principle 1: Calibrate estimands before pooling (to minimize bias).
Principle 2: Decompose into sub-relations before calibrating (to improve precision).
For the target R(Π*) = P*(y | do(x)):
• Calibrate each study's estimate, e.g., P*_(g)(y | do(x)) and P*_(d)(y | do(x)), and pool them.
• Better: pool the sub-relations first, e.g., P*_(i)(w | do(x)) and P*_(d)(w | do(x)) into P*_(i,d)(w | do(x)), then compose with P*_(h)(y | w, do(x)):
P*_(i,d,h)(y | do(x)) = Σ_w P*_(h)(y | w, do(x)) P*_(i,d)(w | do(x))
followed by a final pooling into P*_(all)(y | do(x)).
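The calibration step above is an instance of the simplest transport formula, P*(y | do(x)) = Σ_z P(y | do(x), z) P*(z): re-weight the z-specific causal effects measured in the study population by the covariate distribution of the target population. A minimal sketch, with all numbers invented for the illustration:

```python
# Z-specific effects measured in the experimental study (e.g., LA), by age group.
P_y_do_x_given_z = {"young": 0.30, "middle": 0.50, "old": 0.70}   # P(y | do(x), z)

P_z_study  = {"young": 0.50, "middle": 0.30, "old": 0.20}          # P(z) in the study population
P_z_target = {"young": 0.20, "middle": 0.30, "old": 0.50}          # P*(z) in the target population

def transported_effect(effects, target_dist):
    """Transport formula: sum_z P(y | do(x), z) * P*(z)."""
    return sum(effects[z] * target_dist[z] for z in effects)

print("effect in the study population:", transported_effect(P_y_do_x_given_z, P_z_study))    # 0.44
print("calibrated effect for target  :", transported_effect(P_y_do_x_given_z, P_z_target))   # 0.56
```

When the differing factor does not affect the outcome (case (b) above) no re-weighting is needed, and when it is a post-treatment bio-marker (case (c)) the weights become P*(z | x); the selection diagram, not the data, tells which formula applies.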
CONCLUSIONS
• Counterfactuals are the building blocks of scientific thought, free will, and moral behavior.
• The algorithmization of counterfactuals has benefited several problem areas in the empirical sciences, including policy evaluation, mediation analysis, generalizability, and credit/blame determination.
• This brings us a step closer to achieving cooperative behavior among computers and humans.

CONCLUSIONS (cont.)
What is "understanding"? Harnessing the grammars of science to answer questions that scientists wish to ask and do not know how.
What is fun? Seeing your intuition amplified through the microscope of formal analysis.
Even more fun: watching with amazement how you can do things today that you couldn't yesterday.
Thank you.

Rumelhart (1976), Figures 3 and 10; Rumelhart (1976), p. 35.
Pearl (1982): distributed belief updating (message-passing formula). Kim & Pearl (1983): explaining away.
BELIEF PROPAGATION IN POLYTREES: Bayes net (1985).
BELIEF PROPAGATION WHEN THERE ARE LOOPS: breaking a loop.

APPLICATIONS OF BAYESIAN NETWORKS
1. Medical Diagnosis
2. Clinical Decision Support
3. Complex Genetic Models
4. Crime Risk Factors Analysis
5. Spatial Dynamics in Geography
6. Inference Problems in Forensic Science
7. Conservation of a Threatened Bird
8. Classifiers for Modelling of Mineral Potential
9. Student Modelling
10. Sensor Validation
11. An Information Retrieval System
12. Reliability Analysis of Systems
13. Terrorism Risk Management
14. Credit-Rating of Companies
15. Classification of Wines
16. Pavement and Bridge Management
17. Complex Industrial Process Operation
18. Probability of Default for Large Corporates
19. Risk Management in Robotics

LESSON #1
Do not underestimate what we can learn from fallible humans, and from the AI paradigm that emulating humans is healthy and doable.

BEYOND EVIDENCE, BELIEF, AND STATISTICS
Data → joint distribution P → inference about Q(P) (aspects of P).
e.g., infer whether customers who bought product A would also buy product B: Q = P(B | A).

STATISTICS' 1ST LIMITATION: INTERVENTION
e.g., infer whether customers who bought product A would buy product B if we double the price:
Q = P(B | A, do(price = 2p1)), which is not an aspect of P.

STATISTICS' 2ND LIMITATION: RETROSPECTION
e.g., infer whether Joe, who bought product A, would have bought A had we doubled the price:
Q = P(A_{p2} | A_{p1}), which is not an aspect of P.

THE CAUSAL HIERARCHY
1. Associational (statistical, evidential): e.g., What if I see X = x?
2. Interventional (experimental, causal): e.g., What if I do X = x?
3. Retrospective (counterfactual, token): e.g., What if I hadn't done X = x?
No mixing: no claim at layer i without assumptions from layer i or higher.

THE STRUCTURAL MODEL PARADIGM
Data → data-generating model M → inference about Q(M) (aspects of M).
M is the invariant strategy (mechanism, recipe, law, protocol) by which Nature assigns values to the variables in the analysis.
• "Think Nature, not experiment!"

PHYSICS AND COUNTERFACTUALS, OR WHY PHYSICS DESERVES A NEW ALGEBRA
As above: Hooke's law should be written Y := 2X, not Y = 2X. Given the process information X = 1 and the solution Y = 2, the counterfactual "Had X been 3, Y would be 6" requires wiping out X = 1.

FAMILIAR CAUSAL MODEL AS AN ORACLE FOR COUNTERFACTUALS
(A causal model over X, Y, Z serving as an input/output oracle for counterfactual queries.)

THE FUNDAMENTAL THEOREM OF CAUSAL INFERENCE
Causal Markov Theorem: Any distribution generated by a Markovian structural model M (recursive, with independent disturbances) can be factorized as
P(v1, v2,..., vn) = Π_i P(v_i | pa_i),
where pa_i are the (values of the) parents of V_i in the causal diagram associated with M.
Corollary (truncated factorization, Manipulation Theorem): The distribution generated by an intervention do(X = x) in a Markovian model M is given by the truncated factorization
P(v1, v2,..., vn | do(x)) = Π_{i: V_i ∉ X} P(v_i | pa_i), evaluated at X = x.

THE EVOLUTION OF CAUSAL CALCULUS
• Haavelmo's surgery (1943): modify the equation for r_i (a function of u_i and v_i) by adding an adjustable force g_i.
• Strotz and Wold's surgery (1960): "wipe out" the equation for r_i and replace it with r_i = constant.
• Graphical surgery (Spirtes et al., 1993; Pearl, 1993): wipe out the incoming arrows to r in the graph u, v → r → y, whose pre-surgery distribution factorizes as P(u, v, r, y) = P(u) P(v) P(r | u, v) P(y | r).
• do-calculus (Pearl, 1994): P(Y = y | do(r)), a new operator.
• Structural counterfactuals (Balke and Pearl, 1995): Y_r(u) = Y(u) in the r-mutilated model.
• Unification with the Neyman-Rubin Y_x(u) and with Lewis (1973).

HISTORICAL OBSERVATIONS
"Development of Western science is based on two great achievements: the invention of the formal logical system (in Euclidean geometry) by the Greek philosophers, and the discovery of the possibility to find out causal relationships by systematic experiment (during the Renaissance)." (Albert Einstein, 1953)
Inspired by Turing, I have tried to put the two together and base causal inference on a formal system that is reducible to algorithmic implementation. Mission largely accomplished – more to be done.