Introduction to Probabilistic Models for Computational Biology
Lecture 2, Oct 3, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022

Review: Gene Regulation
Gene expression: DNA is transcribed into RNA, RNA is translated into protein, and RNA is degraded.
A transcription factor binding site acts as a switch for gene regulation.
Genes regulate each other's expression and activity, forming a genetic regulatory network.
[Figure: a DNA sequence containing several genes, with the RNA (e.g., AUGUGGAUUGUU) and the protein (e.g., MWIV) produced from each gene]

Review: Variations in the DNA
"Single nucleotide polymorphism (SNP)"
Sequence variations perturb the regulatory network.
[Figure: the same DNA sequence with single-nucleotide changes marked, together with the affected RNAs, proteins, and genetic regulatory network]

Outline
Probabilistic models in biology
  Model selection problems
Mathematical foundations
  Bayesian networks
  (Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press)
Learning from data
  Maximum likelihood estimation
  Expectation and maximization

Example 1
How are a nucleotide change in DNA, blood pressure, and heart disease related?
There can be several "models".
[Figure: alternative graphs over DNA alteration, Blood pressure, and Heart disease]

Example 2
How do genes A, B, and C regulate each other's expression levels (mRNA levels)?
There can be several models.
[Figure: alternative regulatory graphs over genes A, B, and C]

Probabilistic graphical models
A graphical representation of statistical dependencies.
Data: N instances (Exp 1, Exp 2, ..., Exp N), each measuring the expression levels of Gene A, Gene B, and Gene C.
What are the statistical dependencies between the expression levels of genes A, B, and C?
[Figure: candidate structures Model I, Model II, and Model III over A, B, and C]
Model selection: argmax_x P(model x is true | Data)
  P(model x is true | Data) is the probability that model x is true given the data.

Outline
Probabilistic models in biology
  Model selection problem
Mathematical foundations
  Bayesian networks
Learning from data
  Maximum likelihood estimation
  Expectation and maximization

Probability Theory Review
Assume random variables A and B with Val(A) = {a1, a2, a3}, Val(B) = {b1, b2}
Conditional probability
  Definition: $P(A \mid B) = P(A, B) / P(B)$, for $P(B) > 0$
  Chain rule: $P(A, B) = P(A) \, P(B \mid A)$
  Bayes' rule: $P(A \mid B) = P(B \mid A) \, P(A) / P(B)$
Probabilistic independence: $P(A, B) = P(A) \, P(B)$

Probabilistic Representation
Joint distribution P over {x1, …, xn}
  If each xi is binary, the joint distribution has 2^n - 1 independent entries.
  If the x's are independent, P(x1, …, xn) = p(x1) … p(xn), which needs only n parameters for binary variables.

Conditional Parameterization
The Diabetes example
  Genetic risk (G), Diabetes (D)
  Val(G) = {g1, g0}, Val(D) = {d1, d0}
  P(G,D) = P(G) P(D|G)
  P(G): prior distribution
  P(D|G): conditional probability distribution (CPD)
[Figure: network Genetic risk → Diabetes]

Naïve Bayes Model - Example
Elaborating the diabetes example: Genetic risk (G), Diabetes (D), Hypertension (H)
  Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
  The full joint distribution P(G,D,H) has 8 entries (7 independent parameters).
  If D and H are independent given G, P(G,D,H) = P(G) P(D|G) P(H|G)
  This needs 5 independent parameters, so it is more compact than the joint.
[Figure: network with edges Genetic risk → Diabetes and Genetic risk → Hypertension]

Naïve Bayes Model
A class variable C where Val(C) = {c1, …, ck}
Finding variables x1, …, xn
Naïve Bayes assumption: the findings are conditionally independent given the individual's class.
The model factorizes as: $P(C, x_1, \ldots, x_n) = P(C) \prod_{i=1}^{n} P(x_i \mid C)$
The Diabetes example: the class is Genetic risk; the findings are Diabetes and Hypertension.

Naïve Bayes Model - Example
Medical diagnosis system
  Class C: disease
  Findings X: symptoms
Computing the confidence in class c1 relative to class c2:
  $\frac{P(C = c_1 \mid x_1, \ldots, x_n)}{P(C = c_2 \mid x_1, \ldots, x_n)} = \frac{P(C = c_1)}{P(C = c_2)} \prod_{i=1}^{n} \frac{P(x_i \mid C = c_1)}{P(x_i \mid C = c_2)}$
Drawbacks: strong assumptions

Bayesian Network
Directed acyclic graph (DAG)
  Node: a random variable
  Edge: direct influence of one node on another
The Diabetes example revisited
  Genetic risk (G), Diabetes (D), Hypertension (H)
  Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
[Figure: network over Genetic risk, Diabetes, and Hypertension]

Bayesian Network Semantics
A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1, …, Xn.
  Pa_Xi: parents of Xi in G
  NonDescendants_Xi: variables in G that are not descendants of Xi
G encodes the following set of conditional independence assumptions, called the local Markov assumptions and denoted by I_L(G):
  For each variable Xi: $(X_i \perp \mathrm{NonDescendants}_{X_i} \mid \mathrm{Pa}_{X_i})$
[Figure: example DAG over nodes x1, …, x11]

The Genetics Example
Variables
  B: blood type (a phenotype)
  G: genotype of the gene that encodes a person's blood type; one of <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>

Bayesian Network Joint Distribution
Let G be a Bayesian network graph over the variables X1, …, Xn. We say that a distribution P factorizes according to G if P can be expressed as:
  $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$
A Bayesian network is a pair (G, P) where P factorizes over G, and where P is specified as a set of CPDs associated with G's nodes.
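
As a minimal sketch of the factorization above, the following Python snippet builds the diabetes network P(G,D,H) = P(G) P(D|G) P(H|G) and computes the posterior over genetic risk given both findings. All probability values, and the names `P_G`, `P_D_given_G`, `P_H_given_G`, `joint`, and `posterior_G`, are hypothetical and chosen only for illustration; none of them come from the lecture.

```python
from itertools import product

# Hypothetical CPD values (illustrative assumptions, not from the lecture).
P_G = {"g1": 0.2, "g0": 0.8}                      # prior P(G)
P_D_given_G = {"g1": {"d1": 0.5, "d0": 0.5},      # CPD P(D | G)
               "g0": {"d1": 0.1, "d0": 0.9}}
P_H_given_G = {"g1": {"h1": 0.6, "h0": 0.4},      # CPD P(H | G)
               "g0": {"h1": 0.2, "h0": 0.8}}


def joint(g, d, h):
    """P(G=g, D=d, H=h) = P(g) * P(d | g) * P(h | g)."""
    return P_G[g] * P_D_given_G[g][d] * P_H_given_G[g][h]


def posterior_G(d, h):
    """P(G | D=d, H=h), obtained by normalizing the joint over G (Bayes' rule)."""
    unnormalized = {g: joint(g, d, h) for g in P_G}
    z = sum(unnormalized.values())
    return {g: p / z for g, p in unnormalized.items()}


# The 8 joint entries are generated from only 5 independent parameters.
total = sum(joint(g, d, h)
            for g, d, h in product(P_G, ("d1", "d0"), ("h1", "h0")))
post = posterior_G("d1", "h1")
print("sum of joint entries:", total)
print("P(G | d1, h1):", post)
```

The ratio `post["g1"] / post["g0"]` equals the prior ratio times the product of the per-finding likelihood ratios, which is the naïve Bayes confidence computation for two classes.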
The Student Example
More complex scenario
  Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G)
  Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1, i0}, Val(S) = {s1, s0}, Val(G) = {g1, g2, g3}
  The joint distribution requires 47 independent parameters (2 × 2 × 2 × 2 × 3 = 48 entries, minus one because they sum to 1).

The Student Bayesian Network
Joint distribution: P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)
[Figure: the student network and its CPDs, from Koller & Friedman]

Parameter Estimation
Assumptions
  Fixed network structure
  Fully observed instances of the network variables: D = {d[1], …, d[M]}
  For example, one instance is {i0, d1, g1, l0, s0}.
Maximum likelihood estimation (MLE) of the "parameters" of the Bayesian network
[Figure: the student network, from Koller & Friedman]

Outline
Probabilistic models in biology
  Model selection problem
Mathematical foundations
  Bayesian networks
Learning from data
  Maximum likelihood estimation
  Expectation and maximization

Acknowledgement
Profs. Daphne Koller & Nir Friedman, "Probabilistic Graphical Models"
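
Returning to the Parameter Estimation slide: with a fixed structure and fully observed instances, the maximum likelihood estimate of each CPD entry is a relative frequency, i.e. the count of a child value together with a parent assignment divided by the count of that parent assignment. The sketch below illustrates this for the edge Intelligence → SAT of the student network. The helper `mle_cpd` and the toy instances in `data` are hypothetical, made up purely for illustration.

```python
from collections import Counter


def mle_cpd(data, child, parents):
    """MLE of P(child | parents) from fully observed instances.

    Returns a dict mapping (parent_values, child_value) to
    count(parent_values, child_value) / count(parent_values).
    Combinations never observed in the data are absent from the result.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint_counts[(pa, row[child])] += 1
        parent_counts[pa] += 1
    return {(pa, x): n / parent_counts[pa]
            for (pa, x), n in joint_counts.items()}


# Toy instances restricted to I and S (made-up data, for illustration only).
data = [
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s1"},
    {"I": "i1", "S": "s1"},
    {"I": "i1", "S": "s1"},
    {"I": "i1", "S": "s0"},
    {"I": "i1", "S": "s1"},
]

cpd_S = mle_cpd(data, "S", ["I"])   # estimates of P(S | I)
prior_I = mle_cpd(data, "I", [])    # estimates of P(I); the parent tuple is ()
print(cpd_S)
print(prior_I)
```

A root node such as I is handled by passing an empty parent list, so the same counting routine covers every CPD in the network.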