Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Author: David Heckerman Presented By: Yan Zhang - 2006 Jeremy Gould – 2013 1 Outline Bayesian Approach Bayesian vs. classical probability methods Examples Bayesian Network Structure Inference Learning Probabilities Learning the Network Structure Two coin toss – an example Conclusions Exam Questions 2 Bayesian vs. the Classical Approach The Bayesian probability of an event x, represents the person’s degree of belief or confidence in that event’s occurrence based on prior and observed facts. Classical probability refers to the true or actual probability of the event and is not concerned with observed behavior. 3 Example – Is this Man a Martian Spy? 4 Example We start with two concepts: 1. Hypothesis (H) – He either is or is not a Martian spy. 2. Data (D) – Some set of information about the subject. Perhaps financial data, phone records, maybe we bugged his office… 5 Example Frequentist Says Given a hypothesis (He IS a Martian) there is a probability P of seeing this data: P( D | H ) (Considers absolute ground truth, the uncertainty/noise is in the data.) Bayesian Says Given this data there is a probability P of this hypothesis being true: P( H | D ) (This probability indicates our level of belief in the hypothesis.) 6 Bayesian vs. the Classical Approach Bayesian approach restricts its prediction to the next (N+1) occurrence of an event given the observed previous (N) events. Classical approach is to predict likelihood of any given event regardless of the number of occurrences. NOTE: The Bayesian approach can be updated as new data is observed. 7 Bayes Theorem where 𝑃 𝐵 = 𝑃 𝐵 = 𝑃 𝐵 𝐴 𝑃(𝐴) 𝑃 𝐴𝐵 = 𝑃(𝐵) 𝑃 𝐵 𝐴 𝑃 𝐴 𝑑𝐴 𝑖 𝑃 𝐵 𝐴𝑖 𝑃(𝐴𝑖 ) Continuous Realm Discrete Realm For the continuous case imagine an infinite number of infinitesimally small partitions. 8 Example – Coin Toss I want to toss a coin n = 100 times. Let’s denote the random variable X as the outcome of one flip: p(X=head) = θ p(X=tail) =1- θ Before doing this experiment we have some belief in our mind: the prior probability ξ. Let’s assume that this event will have a Beta distribution (a common assumption): Sample Beta Distributions: Example – Coin Toss 𝑃 𝜃|𝜉 = 𝐵𝑒𝑡𝑎 𝜃|𝛼, 𝛽 = 𝛼 𝐸𝜃 = 𝛼+𝛽 Γ 𝛼+𝛽 𝜃 𝛼−1 1 − 𝜃 Γ 𝛼 Γ 𝛽 𝑉𝐴𝑅 𝜃 = 𝛽−1 𝛼𝛽 𝛼+𝛽 2+ 𝛼+𝛽+1 If we assume a 50-50 coin we can use α = β = 5 which gives: 𝛼 5 𝐸𝜃 = = = 0.5 𝛼+𝛽 10 (Hopefully, what you were expecting!) Example – Coin Toss Now I can run my experiment. As I go I can update my beliefs based on the observed heads (h) and tails (t) by applying Bayes Law to the Beta Distribution: 𝑃 𝜃|𝜉 𝑃 𝐷|𝜃, 𝜉 𝑃 𝜃|𝜉 𝜃 ℎ 1 − 𝜃 𝑃 𝜃|𝐷, 𝜉 = = 𝑃 𝐷|𝜉 𝑃 𝐷|𝜉 𝑡 11 Example – Coin Toss Since we’re assuming a Beta distribution this becomes: 𝑃 𝜃|𝐷, 𝜉 = 𝐵𝑒𝑡𝑎 𝜃|𝛼 + ℎ, 𝛽 + 𝑡 = Γ 𝛼 + 𝛽 + ℎ + 𝑡 𝛼+ℎ−1 𝜃 1−𝜃 Γ 𝛼+ℎ Γ 𝛽+𝑡 𝛽+𝑡−1 …our posterior probability. Supposing that we observed h = 65, t = 35, we would get: 𝛼+ℎ 5 + 65 𝐸 𝜃|𝐷, 𝜉 = = = 0.64 𝛼+ℎ+𝛽+𝑡 6 + 65 + 5 + 35 12 Example – Coin Toss 13 Integration To find the probability that Xn+1= heads, we could also integrate over all possible values of θ to find the average value of θ which yields: 𝑃 𝑋𝑛+1 = ℎ𝑒𝑎𝑑𝑠 | 𝐷, 𝜉 = = 𝑃 𝑋𝑛+1 = ℎ𝑒𝑎𝑑𝑠 | 𝐷, 𝜉 𝑃 𝜃 𝐷, 𝜉 𝑑𝜃 𝜃𝑃 𝜃|𝐷, 𝜉 𝑑𝜃 = 𝐸 𝜃|𝐷, 𝜉 This might be necessary if we were working with a distribution with a less obvious Expected Value. 14 More than Two Outcomes In the previous example, we used a Beta distribution to encode the states of the random variable. This was possible because there were only 2 states/outcomes of the variable X. In general, if the observed variable X is discrete, having r possible states {1,…,r}, the likelihood function is given by: 𝑃 𝑋 = 𝑥𝑘 | 𝜃, 𝜉 = 𝜃𝑘 𝑤ℎ𝑒𝑟𝑒: 𝑘 = 1, 2, … . , 𝑟 𝜃 ∈ 𝜃1 , … . , 𝜃𝑟 and 𝜃𝑘 = 1 In this general case we can use a Dirichlet distribution instead: Γ 𝛼 𝛼𝑘 −1 𝑟 𝜃 𝑘=1 𝑘 Prior 𝑃 𝜃|𝜉 = Dir 𝜃| 𝛼1 , 𝛼2 , … , 𝛼𝑟 = Posterior 𝑃 𝜃|𝐷, 𝜉 = Dir 𝜃| 𝛼1 + 𝑛1 , 𝛼2 + 𝑛2 , … , 𝛼𝑟 + 𝑛𝑟 𝑟 𝑘=1 Γ 𝛼𝑘 15 Vocabulary Review Prior Probability, P( θ | ξ ): Prior Probability of a particular value of θ given no observed data (our previous “belief”) Posterior Probability, P(θ | D, ξ): Probability of a particular value of θ given that D has been observed (our final value of θ). Observed Probability or “Likelihood”, P(D|θ, ξ): Likelihood of sequence of coin tosses D being observed given that θ is a particular value. P(D|ξ): Raw probability of D 16 Outline Bayesian Approach Bayes Therom Bayesian vs. classical probability methods coin toss – an example Bayesian Network Structure Inference Learning Probabilities Learning the Network Structure Two coin toss – an example Conclusions Exam Questions 17 OK, But So What? That’s great but this is Data Mining not Philosophy of Mathematics. Why should we care about all of this ugly math? 18 Bayesian Advantages It turns out that the Bayesian technique permits us to do some very useful things from a mining perspective! 1. We can use the Chain Rule with Bayesian Probabilities: 𝑛 𝑛 𝑃 𝐴𝑘 = 𝑘=1 Ex. 𝑘−1 𝑃 𝐴𝑘 | 𝑘=1 𝑗=1 𝐴𝑗 𝑃 𝐴, 𝐵, 𝐶 = 𝑃 𝐴 𝐵, 𝐶 𝑃 𝐵|𝐶 𝑃 𝐶 This isn’t something we can’t easily do with classical probability! 2. As we’ve already seen using the Bayesian model permits us to update our beliefs based on new data. 19 Example Network To create a Bayesian network we will ultimately need 3 things: A set of Variables X={X1,…, Xn} A Network Structure Conditional Probability Table (CPT) Note that when we start we may not have any of these things or a given element may be incomplete! 20 Let’s start with a simple case where we are given all three things: a credit fraud network designed to determine the probability of credit fraud. 21 Set of Variables Each node represents a random variable. (Let’s assume discrete for now.) 22 Network Structure Each edge represents a conditional dependence between variables. 23 Conditional Probability Table Each rule represents the quantification of a conditional dependency. 24 Since we’ve been given the network structure we can easily see the conditional dependencies: P(A|F,A,S,G) = P(A) P(S|F,A,S,G) = P(S) P(G|F,A,S,G) = P(G|F) P(J|F,A,S,G) = P(J|F,A,S) 25 Note that the absence of an edge indicates conditional independence: P(A|G) = P(A) 26 Important Note: The presence of a of cycle will render one or more of the relationships intractable! 27 Inference Now suppose we want to calculate (infer) our confidence level in a hypothesis on the fraud variable f given some knowledge about the other variables. This can be directly calculated via: 𝑃 𝑓, 𝑎, 𝑠, 𝑔, 𝑗 𝑃 𝑓|𝑎, 𝑠, 𝑔, 𝑗 = = 𝑃 𝑎, 𝑠, 𝑔, 𝑗 𝑃 𝑓, 𝑎, 𝑠, 𝑔, 𝑗 ′ , 𝑎, 𝑠, 𝑔, 𝑗 𝑃 𝑓 𝑓′ (Kind of messy…) 28 Inference Fortunately, we can use the Chain Rule to simplify! 𝑃 𝑓|𝑎, 𝑠, 𝑔, 𝑗 = 𝑃 𝑓 𝑃 𝑎 𝑃 𝑠 𝑃 𝑔|𝑓 𝑃 𝑗|𝑓,𝑎,𝑠 𝑓′ 𝑃 𝑓′ 𝑃 𝑎 𝑃 𝑠 𝑃 𝑔|𝑓′ 𝑃 𝑗|𝑓′,𝑎,𝑠 = 𝑃 𝑓 𝑃 𝑔|𝑓 𝑃 𝑗|𝑓,𝑎,𝑠 𝑓′ 𝑃 𝑓′ 𝑃 𝑔|𝑓′ 𝑃 𝑗|𝑓′,𝑎,𝑠 This Simplification is especially powerful when the network is sparse which is frequently the case in real world problems. This shows how we can use a Bayesian Network to infer a probability not stored directly in the model. 29 Now for the Data Mining! So far we haven’t added much value to the data. So let’s take advantage of the Bayesian model’s ability to update our beliefs and learn from new data. First we’ll rewrite our joint probability distribution in a more compact form: 𝑛 𝑃 𝑥|𝜃𝑠 , 𝑆 ℎ = 𝑃 𝑥𝑖 |𝑝𝑎𝑖, 𝜃𝑖 , 𝑆 ℎ 𝑖=1 𝑆 ℎ 𝑖𝑠 𝑡ℎ𝑒 𝑐𝑢𝑟𝑟𝑒𝑛𝑡 ℎ𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠 (𝐵𝑎𝑠𝑖𝑐𝑎𝑙𝑙𝑦 𝑙𝑖𝑘𝑒 𝜉 𝑓𝑟𝑜𝑚 𝑏𝑒𝑓𝑜𝑟𝑒) 𝜃𝑖 𝑖𝑠 𝑡ℎ𝑒 𝑣𝑒𝑐𝑡𝑜𝑟 𝑜𝑓 𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠 𝑓𝑜𝑟 𝑃 𝑥𝑖 |𝑝𝑎𝑖, 𝜃𝑖 , 𝑆 ℎ 𝑥 ∈ 𝑥1 , 𝑥2 , … , 𝑥𝑛 𝜃𝑠 = 𝜃1 , 𝜃2 , … , 𝜃𝑛 𝑝𝑎𝑖 𝑖𝑠 𝑡ℎ𝑒 𝑐𝑜𝑛𝑓𝑖𝑔𝑢𝑟𝑎𝑡𝑖𝑜𝑛 𝑣𝑒𝑐𝑡𝑜𝑟: 𝑗 𝑃 𝑥𝑖𝑘 |𝑝𝛼𝑖 , 𝜃𝑖 , 𝑆 ℎ = 𝜃𝑖𝑗𝑘 > 0 𝑭𝒓𝒐𝒎 𝒉𝒆𝒓𝒆 𝒘𝒆 𝒄𝒂𝒏 𝒇𝒊𝒏𝒅 𝑷 𝜽𝒔 |𝑺𝒉 30 Learning Probabilities in a Bayesian Network First we need to make two assumptions: There is no missing data (i.e. the data accurately describes the distribution) The parameter vectors are independent (generally a good assumption, at least locally). 31 Learning Probabilities in a Bayesian Network If these assumptions hold we can express the probabilities as: Prior 𝑃 𝜃𝑠 |𝑆 ℎ = 𝑛 𝑖=1 Posterior 𝑃 𝜃𝑠 |𝐷, 𝑆 ℎ = 𝑞𝑖 𝑗=1 𝑃 𝑛 𝑖=1 𝜃𝑖𝑗 |𝑆 ℎ 𝑞𝑖 𝑗=1 𝑃 𝜃𝑖𝑗 |𝐷, 𝑆 ℎ This means we can update each vector of parameters θij independently, just as one-variable case! • If each vector θij has the prior distribution Dir(θij |aij1,…, aijri)… • …Then the posterior distribution is: 𝑃 𝜃𝑠 |𝐷, 𝑆 ℎ = 𝐷𝑖𝑟 𝜃𝑖𝑗 |𝛼𝑖𝑗1 + 𝑛𝑖𝑗1 , … , 𝛼𝑖𝑗𝑟𝑖 + 𝑛𝑖𝑗𝑟𝑖 Where nijk is the number of cases in D in which Xi=xik and Pai=paij 32 Dealing with Unknowns Whew! Now we know how to use our network to infer conditional relationships and how to update our network with new data. But what if we aren’t given a well defined network? We could start with missing or incomplete: 1. Set of Variables 2. Conditional Relationship Data 3. Network Structure 33 Unknown Variable Set Our goal when choosing variables is to: “Organize…into variables having mutually exclusive and collectively exhaustive states.” This is a problem shared by all data mining algorithms: What should we measure and why? There is not and probably cannot be an algorithmic solution to this problem as arriving at any solution requires intelligent and creative thought. 34 Unknown Conditional Relationships This can be easy. So long as we can generate a plausible initial belief about a conditional relationship we can simply start with our assumption and let our data refine our model via the mechanism shown in the Learning Probabilities in a Bayesian Network slide. 35 Unknown Conditional Relationships However, when our ignorance becomes serious enough that we no longer even know what is dependent on what we segue into the Unknown Structure scenario. 36 Learning the Network Structure Sometimes the conditional relationships are not obvious. In this case we are uncertain with the network structure: we don’t know where the edges should be. 37 Learning the Network Structure Theoretically, we can use a Bayesian approach to get the posterior distribution of the network structure: 𝑃 𝑆 ℎ |𝐷 = 𝑃 𝐷|𝑆 ℎ 𝑃 𝑆 ℎ 𝑚 ℎ ℎ𝑖 𝑃 𝑆ℎ𝑖 𝑖=1 𝑃 𝐷|𝑆 = 𝑆 Unfortunately, the number of possible network structure increase exponentially with n – the number of nodes. We’re basically asking ourselves to consider every possible graph with n nodes! 38 Learning the Network Structure Heckerman describes two main methods for shortening the search for a network model: Model Selection To select a “good” model (i.e. the network structure) from all possible models, and use it as if it were the correct model. Selective Model Averaging To select a manageable number of good models from among all possible models and pretend that these models are exhaustive. The math behind both techniques is quite involved so I’m afraid we’ll have to content ourselves with a toy example today. 39 Two Coin Toss Example S h 1 X1 X2 p(H)=p(T)=0.5 S h 2 X1 p(H)=p(T)=0.5 X2 P(X2|X1) p(H|H) p(T|H) p(H|T) p(T|T) = = = = 0.1 0.9 0.9 0.1 Experiment: flip two coins and observe the outcome Propose two network structures: Sh1 or Sh2 Assume P(Sh1)=P(Sh2)=0.5 After observing some data, which model is more accurate for this collection of data? 40 Two Coin Toss Example X1 X2 𝑃 𝐷|𝑆 ℎ 𝑃 𝑆 ℎ ℎ 𝑃 𝑆 |𝐷 = 1 T T 2 T H 3 H T 4 H T 2 𝑖=1 𝑃 5 T H 10 6 H T 7 T H 8 T H 9 H T 10 H T 𝑃 𝑆 ℎ |𝐷 = 𝑚 𝑖=1 𝑃 𝐷|𝑆ℎ = 𝑆𝑖ℎ 𝑃 𝑆𝑖ℎ 𝑃 𝐷|𝑆 ℎ 𝑃 𝑆 ℎ 𝐷|𝑆ℎ = 𝑆𝑖ℎ 𝑃 𝑆𝑖ℎ 2 𝑃 𝐷|𝑆 ℎ = 𝑃 𝑋𝑑𝑖 |𝑃𝛼𝑖 , 𝑆 ℎ 𝑑=1 𝑖=1 41 Two Coin Toss Example X1 X2 1 T T 2 T H 3 H T 4 H T 5 T H 6 H T 7 T H 8 T H 9 H T 10 H T 𝐹𝑜𝑟 𝑆1ℎ : 10 2 𝑃 𝐷|𝑆 ℎ = 𝑃 𝑋𝑑𝑖 |𝑃𝛼𝑖 , 𝑆 ℎ 𝑑=1 𝑖=1 = 𝑃 𝑋1 = 𝑇 𝑃 𝑋2 = 𝑇 = 0.5 0.5 = 0.52 𝑃 𝑋1 = 𝑇 𝑃 𝑋2 = 𝐻 …. 0.5 0.5 …. 10 42 Two Coin Toss Example 10 𝐹𝑜𝑟 𝑆2ℎ : 2 𝑃 𝐷|𝑆 ℎ = 𝑃 𝑋𝑑𝑖 |𝑃𝛼𝑖 , 𝑆 ℎ 𝑑=1 𝑖=1 X1 X2 1 T T 𝑃 𝑋1 = 𝑇 𝑃 𝑋2 = 𝑇|𝑋1 = 𝑇 = 0.5 0.1 2 T H 𝑃 𝑋1 = 𝑇 𝑃 𝑋2 = 𝐻|𝑋1 = 𝑇 = 0.5 0.9 3 H T 𝑃 𝑋1 = 𝐻 𝑃 𝑋2 = 𝑇|𝑋1 = 𝐻 = 0.5 0.9 4 H T 5 T H 6 H T 7 T H 8 T H 9 H T 10 H T . . . 10 2 𝑃 𝐷|𝑆 ℎ = 𝑃 𝑋𝑑𝑖 |𝑃𝛼𝑖 , 𝑆 ℎ = 0.5 10 0.1 1 0.9 9 𝑑=1 𝑖=1 43 Two Coin Toss Example 𝑃 𝑆1ℎ |𝐷 = 𝑃 𝑆1ℎ |𝐷 = 𝑃 𝐷|𝑆1ℎ 𝑃 𝑆1ℎ 𝑃 𝐷|𝑆1ℎ 𝑃 𝑆1ℎ + 𝑃 𝐷|𝑆2ℎ 𝑃 𝑆2ℎ 0.52 10 0.52 10 0.5 0.5 + 0.5 10 0.1 10 0.5 𝑃 𝑆1ℎ |𝐷 = 0.5 10 + 0.1 1 0.9 1 0.9 9 0.5 9 𝑃 𝑆1ℎ |𝐷 ≈ 2.5% 44 Two Coin Toss Example 𝑃 𝑆2ℎ |𝐷 = 𝑃 𝑆2ℎ |𝐷 = 𝑃 𝐷|𝑆2ℎ 𝑃 𝑆2ℎ 𝑃 𝐷|𝑆1ℎ 𝑃 𝑆1ℎ + 𝑃 𝐷|𝑆2ℎ 𝑃 𝑆2ℎ 0.52 10 0.5 10 0.1 1 0.9 9 0.5 0.5 + 0.5 10 0.1 1 0.9 1 9 0.1 0.9 𝑃 𝑆2ℎ |𝐷 = 0.5 10 + 0.1 1 0.9 9 0.5 9 𝑃 𝑆2ℎ |𝐷 ≈ 97.5% 45 Two Coin Toss Example 𝑃 𝑆2ℎ |𝐷 ≈ 97.5% > 2.5% ≈ 𝑃 𝑆1ℎ |𝐷 𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒 𝑤𝑒 𝑠ℎ𝑜𝑢𝑙𝑑 𝑝𝑟𝑒𝑓𝑒𝑟 𝑆2ℎ 𝑏𝑦 𝑠𝑐𝑜𝑟𝑒. 46 Outline Bayesian Approach Bayes Therom Bayesian vs. classical probability methods coin toss – an example Bayesian Network Structure Inference Learning Probabilities Learning the Network Structure Two coin toss – an example Conclusions Exam Questions 47 Conclusions Bayesian method Bayesian network Structure Inference Learn parameters and structure Advantages 48 Question1: What is Bayesian Probability? A person’s degree of belief in a certain event Your own degree of certainty that a tossed coin will land “heads” A degree of confidence in an outcome given some data. 49 Question 2: Compare the Bayesian and classical approaches to probability (any one point). Bayesian Approach: Classical Probability: +Reflects an expert’s +Objective and unbiased - Generally not available It takes a long time to measure the object’s physical characteristics Wants P( D | H ) knowledge +The belief is kept updating when new data item arrives - Arbitrary (More subjective) Wants P( H | D ) 50 Question 3: Mention at least 1 Advantage of Bayesian analysis Handle incomplete data sets Learning about causal relationships Combine domain knowledge and data Avoid over fitting 51 The End Any Questions? 52