PGM Tirgul 10: Parameter Learning and Priors

Why Learning?
- Knowledge acquisition bottleneck:
  - Knowledge acquisition is an expensive process
  - Often we do not have an expert
- Data is cheap:
  - Vast amounts of data become available to us
  - Learning allows us to build systems based on the data

Learning Bayesian Networks
[Figure: the Inducer takes data plus prior information and outputs a Bayesian network over E, B, A, C, R (E, B -> A, A -> C, E -> R) together with its CPTs, e.g. P(A | E, B):
  E    B    P(a)  P(not a)
  e    b    .9    .1
  e    ~b   .7    .3
  ~e   b    .8    .2
  ~e   ~b   .99   .01 ]

Known Structure, Complete Data
- Example data over (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
- The network structure is specified; the Inducer only needs to estimate the parameters (e.g. the entries of P(A | E, B))
- The data does not contain missing values

Unknown Structure, Complete Data
- The network structure is not specified; the Inducer needs to select the arcs and estimate the parameters
- The data does not contain missing values

Known Structure, Incomplete Data
- The network structure is specified
- The data contains missing values, e.g. <Y,?,Y>, <N,Y,?>, <?,Y,Y>
- We must consider assignments to the missing values

Known Structure / Complete Data
- Given:
  - a network structure G
  - a choice of parametric family for P(Xi | Pai)
- Learn the parameters of the network
- Goal: construct a network that is "closest" to the distribution that generated the data

Example: Binomial Experiment (Statistics 101)
- A thumbtack, when tossed, can land in one of two positions: Head or Tail
- We denote by θ the (unknown) probability P(H)
- Estimation task: given a sequence of toss samples x[1], x[2], ..., x[M], estimate the probabilities P(H) = θ and P(T) = 1 - θ

Statistical Parameter Fitting
- Consider instances x[1], x[2], ..., x[M] such that:
  - the set of values that x can take is known
  - each is sampled from the same distribution
  - each is sampled independently of the rest
- These are i.i.d. samples
- The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ)
  - This depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
  - For now, we focus on multinomial distributions
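As a concrete illustration of the i.i.d. sampling assumption above, the following minimal Python sketch draws thumbtack tosses from a fixed (but in practice unknown) parameter θ and tallies the outcomes that the next slides use as sufficient statistics. The function name `sample_tosses` and the particular value of `theta_true` are illustrative choices, not part of the lecture.

```python
import random

def sample_tosses(theta, M, seed=0):
    """Draw M i.i.d. thumbtack tosses, each landing 'H' with probability theta."""
    rng = random.Random(seed)
    return ['H' if rng.random() < theta else 'T' for _ in range(M)]

# Illustrative run: theta_true plays the role of the unknown P(H) we want to estimate.
theta_true = 0.3
data = sample_tosses(theta_true, M=20)
print(data)
print("N_H =", data.count('H'), " N_T =", data.count('T'))
```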
The Likelihood Function
- How good is a particular θ? It depends on how likely it is to generate the observed data:
  L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)
- The likelihood for the sequence H, T, T, H, H is
  L(\theta : D) = \theta (1-\theta)(1-\theta)\theta\theta
[Figure: plot of L(θ : D) as a function of θ over [0, 1].]

Sufficient Statistics
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):
  L(\theta : D) = \theta^{N_H} (1-\theta)^{N_T}
- N_H and N_T are sufficient statistics for the binomial distribution

Sufficient Statistics (cont.)
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
- Formally, s(D) is a sufficient statistic if for any two datasets D and D':
  s(D) = s(D') \implies L(\theta \mid D) = L(\theta \mid D')

Maximum Likelihood Estimation
- MLE principle: choose the parameters that maximize the likelihood function
- This is one of the most commonly used estimators in statistics
- It is intuitively appealing

Example: MLE in Binomial Data
- Applying the MLE principle we get
  \hat{\theta} = \frac{N_H}{N_H + N_T}
  (which coincides with what one would expect)
- Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6
[Figure: L(θ : D) for this example, peaking at θ = 0.6.]

Learning Parameters for a Bayesian Network
- Consider the network E, B -> A, A -> C
- The training data has the form
  D = \{ \langle E[1], B[1], A[1], C[1] \rangle, \ldots, \langle E[M], B[M], A[M], C[M] \rangle \}

Learning Parameters for a Bayesian Network (cont.)
- Since we assume i.i.d. samples, the likelihood function is
  L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
- By the definition of the network, we get
  L(\Theta : D) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)
- Rewriting terms, we get
  L(\Theta : D) = \Big[\prod_m P(E[m] : \Theta)\Big] \Big[\prod_m P(B[m] : \Theta)\Big] \Big[\prod_m P(A[m] \mid B[m], E[m] : \Theta)\Big] \Big[\prod_m P(C[m] \mid A[m] : \Theta)\Big]

General Bayesian Networks
- Generalizing to any Bayesian network:
  L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta)            (i.i.d. samples)
                = \prod_m \prod_i P(x_i[m] \mid Pa_i[m] : \theta_i)      (network factorization)
                = \prod_i \prod_m P(x_i[m] \mid Pa_i[m] : \theta_i)
                = \prod_i L_i(\theta_i : D)
- The likelihood decomposes according to the structure of the network

General Bayesian Networks (cont.)
- Decomposition implies independent estimation problems
- If the parameters of each family are not related, then they can be estimated independently of each other
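As a quick numerical check of the MLE formula above, the sketch below (plain Python; the grid search is only for illustration) evaluates the binomial likelihood L(θ : D) = θ^{N_H} (1-θ)^{N_T} on a grid and compares the grid maximizer with the closed-form estimate N_H / (N_H + N_T) for the (N_H, N_T) = (3, 2) example.

```python
def binomial_likelihood(theta, n_heads, n_tails):
    """L(theta : D) = theta^N_H * (1 - theta)^N_T."""
    return theta ** n_heads * (1.0 - theta) ** n_tails

n_heads, n_tails = 3, 2          # the (N_H, N_T) = (3, 2) example from the slides
grid = [i / 1000.0 for i in range(1001)]
theta_grid = max(grid, key=lambda t: binomial_likelihood(t, n_heads, n_tails))
theta_mle = n_heads / (n_heads + n_tails)
print(theta_grid, theta_mle)     # both are (approximately) 0.6
```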
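The decomposition L(Θ : D) = ∏_i L_i(θ_i : D) means that each family only needs its own counts from the complete data. Below is a minimal sketch (Python; the tuple-based data layout is my own choice) that collects the per-family sufficient statistics N(x_i, pa_i) for the toy E/B/A records used earlier.

```python
from collections import Counter

# Complete data over (E, B, A); structure: E and B are roots, A has parents {E, B}.
data = [("Y", "N", "N"), ("Y", "Y", "Y"), ("N", "N", "Y"),
        ("N", "Y", "Y"), ("N", "Y", "Y")]

# Per-family sufficient statistics: each factor of the likelihood depends only on these.
counts_E = Counter(e for e, b, a in data)             # N(e) for the family of E
counts_B = Counter(b for e, b, a in data)             # N(b) for the family of B
counts_A = Counter((a, (e, b)) for e, b, a in data)   # N(a, (e, b)) for A given (E, B)

print(counts_E, counts_B, counts_A, sep="\n")
```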
From Binomial to Multinomial
- For example, suppose X can take the values 1, 2, ..., K
- We want to learn the parameters θ_1, θ_2, ..., θ_K
- Sufficient statistics: N_1, N_2, ..., N_K, the number of times each outcome is observed
- Likelihood function:
  L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}
- MLE:
  \hat{\theta}_k = \frac{N_k}{N}

Likelihood for Multinomial Networks
- When we assume that P(Xi | Pai) is multinomial, we get a further decomposition:
  L_i(\theta_i : D) = \prod_m P(x_i[m] \mid Pa_i[m] : \theta_i)
                    = \prod_{pa_i} \prod_{m : Pa_i[m] = pa_i} P(x_i[m] \mid pa_i : \theta_i)
                    = \prod_{pa_i} \prod_{x_i} P(x_i \mid pa_i : \theta_i)^{N(x_i, pa_i)}
                    = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}

Likelihood for Multinomial Networks (cont.)
- For each value pa_i of the parents of X_i we get an independent multinomial problem
- The MLE is
  \hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}

Maximum Likelihood Estimation: Consistency
- The estimate converges to the best possible value as the number of examples grows
- To make this formal, we need to introduce some definitions

KL-Divergence
- Let P and Q be two distributions over X
- A measure of distance between P and Q is the Kullback-Leibler divergence:
  KL(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
- KL(P||Q) = 1 (when logs are in base 2) means that, on average, the probability Q assigns to an instance is half the probability P assigns to it
- KL(P||Q) >= 0
- KL(P||Q) = 0 iff P and Q are equal

Consistency
- Let P(X | θ) be a parametric family (we need various regularity conditions that we will not go into now)
- Let P*(X) be the distribution that generates the data
- Let \hat{\theta}_D be the MLE estimate given a dataset D
- Theorem: as N -> infinity, \hat{\theta}_D -> \theta^* with probability 1, where
  \theta^* = \arg\min_{\theta} KL\big(P^*(X) \,\|\, P(X : \theta)\big)

Consistency -- Geometric Interpretation
[Figure: within the space of probability distributions, the subset of distributions representable by P(X | θ); P* is projected onto the closest member P(X | θ*).]

Is MLE All We Need?
- Suppose that after 10 observations, the MLE is P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
- Suppose now that after 10 observations, the MLE is P(H) = 0.7 for a coin. Would you place the same bet?
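The conditional-multinomial MLE above is just a count normalized per parent configuration. A self-contained sketch (Python; the helper name `mle_cpt` and the record format are assumptions for illustration):

```python
from collections import Counter, defaultdict

def mle_cpt(records):
    """MLE for a conditional multinomial: theta_hat[x | pa] = N(x, pa) / N(pa).

    `records` is a list of (x_value, parent_assignment) pairs.
    """
    joint = Counter(records)                      # N(x, pa)
    parent = Counter(pa for _, pa in records)     # N(pa)
    cpt = defaultdict(dict)
    for (x, pa), n in joint.items():
        cpt[pa][x] = n / parent[pa]
    return cpt

# A-family records from the toy data: value of A paired with its parents (E, B).
records = [("N", ("Y", "N")), ("Y", ("Y", "Y")), ("Y", ("N", "N")),
           ("Y", ("N", "Y")), ("Y", ("N", "Y"))]
print(dict(mle_cpt(records)))   # e.g. P(A=Y | E=N, B=Y) = 2/2 = 1.0
```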
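The consistency theorem above can also be checked numerically: as the number of samples grows, the KL divergence between the data-generating distribution P* and the estimated distribution shrinks toward 0. A small sketch (Python; the choice P* = {H: 0.3, T: 0.7} and the add-one smoothing are my own additions, the smoothing only keeps the divergence finite for very small samples):

```python
import math, random

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p_star = {"H": 0.3, "T": 0.7}                # assumed data-generating distribution P*
rng = random.Random(0)
for m in (10, 100, 10000):
    sample = ["H" if rng.random() < p_star["H"] else "T" for _ in range(m)]
    n_h = sample.count("H")
    # Add-one smoothing (my addition) keeps the divergence finite for tiny samples.
    p_hat = {"H": (n_h + 1) / (m + 2), "T": (m - n_h + 1) / (m + 2)}
    print(m, round(kl(p_star, p_hat), 5))    # divergence shrinks as m grows
```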
Bayesian Inference
- Frequentist approach:
  - Assumes there is an unknown but fixed parameter θ
  - Estimates θ with some confidence
  - Prediction uses the estimated parameter value
- Bayesian approach:
  - Represents uncertainty about the unknown parameter θ
  - Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
  - Prediction follows from the rules of probability: an expectation over the unknown parameters

Bayesian Inference (cont.)
- We can represent our uncertainty about the sampling process using a Bayesian network: θ is a parent of the observed data X[1], X[2], ..., X[M] and of the query X[M+1]
- The values of X are independent given θ
- The conditional probabilities P(x[m] | θ) are the parameters in the model
- Prediction is now inference in this network

Bayesian Inference (cont.)
- Prediction as inference in this network:
  P(x[M+1] \mid x[1], \ldots, x[M]) = \int P(x[M+1] \mid \theta, x[1], \ldots, x[M])\, P(\theta \mid x[1], \ldots, x[M])\, d\theta
                                    = \int P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta
- where, by Bayes' rule,
  P(\theta \mid x[1], \ldots, x[M]) = \frac{P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)}{P(x[1], \ldots, x[M])}
  (posterior = likelihood x prior / probability of the data)

Example: Binomial Data Revisited
- Prior: uniform for θ in [0, 1], i.e. P(θ) = 1
- Then P(θ | D) is proportional to the likelihood L(θ : D):
  P(\theta \mid x[1], \ldots, x[M]) \propto P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)
- With (N_H, N_T) = (4, 1), the MLE for P(X = H) is 4/5 = 0.8
- The Bayesian prediction is
  P(x[M+1] = H \mid D) = \int \theta \, P(\theta \mid D)\, d\theta = \frac{5}{7} \approx 0.7142

Bayesian Inference and MLE
- In our example, the MLE and the Bayesian prediction differ
- But if the prior is well-behaved (does not assign 0 density to any "feasible" parameter value), then both the MLE and the Bayesian prediction converge to the same value
- Both are consistent

Dirichlet Priors
- Recall that the likelihood function is
  L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}
- A Dirichlet prior with hyperparameters α_1, ..., α_K is defined as
  P(\theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}    for legal θ_1, ..., θ_K
- Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:
  P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}

Dirichlet Priors (cont.)
- We can compute the prediction on a new event in closed form
- If P(θ) is Dirichlet with hyperparameters α_1, ..., α_K, then
  P(X[1] = k) = \int \theta_k\, P(\theta)\, d\theta = \frac{\alpha_k}{\sum_{\ell} \alpha_\ell}
- Since the posterior is also Dirichlet, we get
  P(X[M+1] = k \mid D) = \int \theta_k\, P(\theta \mid D)\, d\theta = \frac{\alpha_k + N_k}{\sum_{\ell} (\alpha_\ell + N_\ell)}

Dirichlet Priors -- Example
[Figure: densities of Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2) and Dirichlet(5, 5) over θ in [0, 1].]

Prior Knowledge
- The hyperparameters α_1, ..., α_K can be thought of as "imaginary" counts from our prior experience
- Equivalent sample size = α_1 + ... + α_K
- The larger the equivalent sample size, the more confident we are in our prior

Effect of Priors
- Prediction of P(X = H) after seeing data with N_H = 0.25 N_T, for different sample sizes
[Figure: two panels -- fixed ratio α_H / α_T with different strengths α_H + α_T, and fixed strength α_H + α_T with different ratios α_H / α_T.]

Effect of Priors (cont.)
- On real data, Bayesian estimates are less sensitive to noise in the data
[Figure: P(X = 1 | D) as a function of the number of samples N for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5) and Dirichlet(10,10) priors, shown together with the toss results.]

Conjugate Families
- The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
- The Dirichlet prior is a conjugate family for the multinomial likelihood
- Conjugate families are useful since:
  - for many distributions we can represent the posterior with hyperparameters
  - they allow sequential updating within the same representation
  - in many cases we have a closed-form solution for prediction
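The closed-form Dirichlet prediction above is easy to verify on the binomial example: with a uniform Dirichlet(1, 1) prior and (N_H, N_T) = (4, 1), the predictive probability of heads is 5/7, roughly 0.714, while the MLE is 0.8. A minimal sketch (Python; the dictionary-based representation is an illustrative choice):

```python
def dirichlet_predict(alpha, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(alpha[k] + counts[k] for k in alpha)
    return {k: (alpha[k] + counts[k]) / total for k in alpha}

alpha = {"H": 1, "T": 1}                   # uniform prior over theta, i.e. Dirichlet(1, 1)
counts = {"H": 4, "T": 1}                  # the (N_H, N_T) = (4, 1) example
print(dirichlet_predict(alpha, counts))    # {'H': 5/7 ~ 0.714, 'T': 2/7 ~ 0.286}
print(counts["H"] / sum(counts.values()))  # MLE: 0.8
```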
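The role of the equivalent sample size α_H + α_T can also be seen numerically: with data in which N_H = 0.25 N_T (so the MLE is 0.2), stronger priors with a fixed ratio α_H / α_T = 1 pull the prediction further toward the prior mean 0.5. A short sketch (Python; the specific counts and prior strengths are illustrative):

```python
def predict_heads(alpha_h, alpha_t, n_h, n_t):
    """Dirichlet predictive probability of heads."""
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

n_t = 80
n_h = n_t // 4                               # data with N_H = 0.25 * N_T (MLE = 0.2)
for strength in (2, 10, 50):                 # equivalent sample size alpha_H + alpha_T
    a = strength / 2                         # fixed ratio alpha_H / alpha_T = 1
    print(strength, round(predict_heads(a, a, n_h, n_t), 3))
```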
Bayesian Networks and Bayesian Prediction
[Figure: plate notation -- the parameters θ_X and θ_Y|X are parents of the observed instances X[1..M], Y[1..M] and of the query X[M+1], Y[M+1].]
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters

Bayesian Networks and Bayesian Prediction (cont.)
- We can also "read" from the network: complete data implies that the posteriors on the parameters are independent

Bayesian Prediction (cont.)
- Since posteriors on parameters for each family are independent, we can compute them separately
- Posteriors for parameters within a family are also independent
[Figure: refined plate model in which θ_Y|X is split into θ_Y|X=0 and θ_Y|X=1.]
- Complete data implies independent posteriors on θ_Y|X=0 and θ_Y|X=1

Bayesian Prediction (cont.)
- Given these observations, we can compute the posterior for each multinomial θ_{Xi | pai} independently
- The posterior is Dirichlet with parameters α(Xi=1 | pai) + N(Xi=1 | pai), ..., α(Xi=k | pai) + N(Xi=k | pai)
- The predictive distribution is then represented by the parameters
  \tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}

Assessing Priors for Bayesian Networks
- We need α(x_i, pa_i) for each node X_i
- We can use initial parameters Θ_0 as prior information
- We also need an equivalent sample size parameter M_0
- Then we let α(x_i, pa_i) = M_0 · P(x_i, pa_i | Θ_0)
- This allows us to update a network using new data

Learning Parameters: Case Study
- Experiment:
  - Sample a stream of instances from the alarm network
  - Learn parameters using:
    - the MLE estimator
    - a Bayesian estimator with uniform priors of different strengths

Learning Parameters: Case Study (cont.)
[Figure: KL divergence between the true and learned distributions as a function of the number of samples M, for the MLE and for Bayesian estimation with a uniform prior of strength M' = 5, 10, 20, 50.]

Learning Parameters: Summary
- Estimation relies on sufficient statistics; for multinomials these are of the form N(x_i, pa_i)
- Parameter estimation:
  MLE:                  \hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}
  Bayesian (Dirichlet): \tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}
- Bayesian methods also require a choice of priors
- Both the MLE and the Bayesian estimates are asymptotically equivalent and consistent
- Both can be implemented in an online manner by accumulating sufficient statistics
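The summary notes that both estimators can be implemented online by accumulating sufficient statistics. The sketch below (Python) does this for a single family X_i | Pa_i; the class name, the record format, and the use of one shared Dirichlet hyperparameter per value are illustrative assumptions rather than part of the lecture.

```python
class FamilyEstimator:
    """Accumulates sufficient statistics N(x_i, pa_i) online for one family X_i | Pa_i.

    A single shared hyperparameter alpha is used for every (x, pa); this uniform
    choice is an assumption for the sketch, not prescribed by the slides.
    """
    def __init__(self, values, alpha=1.0):
        self.values = values
        self.alpha = alpha
        self.counts = {}                       # (x, pa) -> N(x, pa)

    def observe(self, x, pa):
        self.counts[(x, pa)] = self.counts.get((x, pa), 0) + 1

    def mle(self, x, pa):
        n_pa = sum(self.counts.get((v, pa), 0) for v in self.values)
        return self.counts.get((x, pa), 0) / n_pa if n_pa else None

    def bayes(self, x, pa):
        n_pa = sum(self.counts.get((v, pa), 0) for v in self.values)
        a_pa = self.alpha * len(self.values)   # alpha(pa_i) = sum_x alpha(x, pa_i)
        return (self.alpha + self.counts.get((x, pa), 0)) / (a_pa + n_pa)

est = FamilyEstimator(values=("Y", "N"), alpha=1.0)
for x, pa in [("Y", ("N", "Y")), ("Y", ("N", "Y")), ("N", ("Y", "N"))]:
    est.observe(x, pa)
print(est.mle("Y", ("N", "Y")), est.bayes("Y", ("N", "Y")))   # 1.0 vs 0.75
```

Feeding the same instance stream to one such estimator per family yields exactly the MLE and Bayesian (Dirichlet) estimates shown in the summary.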