E. Santovetti
lesson 4
Maximum likelihood
Interval estimation
1
Extended Maximum Likelihood
Sometimes the total number of events n measured by the experiment is not fixed but is itself a random variable, for example a Poisson r.v. with mean ν.
The extended likelihood function is then:
L(ν, θ) = (ν^n / n!) e^(−ν) ∏ᵢ f(xᵢ; θ)
If ν is a function of θ we have:
ln L(θ) = −ν(θ) + Σᵢ ln[ν(θ) f(xᵢ; θ)] + const.
Example: the expected number of events of a certain process
Extended ML uses more information, so the errors on the parameters θ will be smaller than in the case in which n is treated as independent.
In case ν does not depend on θ we recover the usual likelihood.
2
Extended ML example
Consider two types of events (e.g., signal and background) each of
which predict a given pdf for the variable x: fs(x) and fb(x).
We observe a mixture of the two event types, signal fraction = θ,
expected total number = ν, observed total number = n
Let s =  and b = (1- be the number of signal and background that
we want to evaluate
3
Extended ML example (2)
Consider for signal a Gaussian pdf
and for background an exponential
Maximize log L to find s and b
Here errors reflect total Poisson fluctuation as well
as that in proportion of signal/background
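As a concrete illustration, a fit of this kind can be sketched numerically. All shapes and numbers below (Gaussian signal at 5.0 with width 0.5, exponential background with scale 4.0, s = 100, b = 200) are illustrative assumptions, not values from the slides:

```python
# Sketch of an extended ML fit: Gaussian signal + exponential background.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
s_true, b_true = 100.0, 200.0

# Toy data: Poisson-fluctuated numbers of signal and background events.
n_s = rng.poisson(s_true)
n_b = rng.poisson(b_true)
x = np.concatenate([rng.normal(5.0, 0.5, n_s),       # signal events
                    rng.exponential(4.0, n_b)])      # background events

f_s = lambda x: stats.norm.pdf(x, 5.0, 0.5)          # signal pdf
f_b = lambda x: stats.expon.pdf(x, scale=4.0)        # background pdf

def neg_log_L(p):
    s, b = p
    # Extended likelihood: Poisson term (s+b) minus sum of logs of the mixture.
    return (s + b) - np.sum(np.log(s * f_s(x) + b * f_b(x)))

res = optimize.minimize(neg_log_L, x0=[50.0, 100.0],
                        bounds=[(1e-9, None), (1e-9, None)])
s_hat, b_hat = res.x
print(s_hat, b_hat)   # close to the generated numbers of events
```

The errors returned by such a fit include the total Poisson fluctuation as well as the fluctuation in the signal/background proportion, as stated above.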
4
Unphysical values for estimators
Here the unphysical estimator is unbiased and should nevertheless be reported, since the average of a large number of unbiased estimates converges to the true value (cf. PDG).
Repeat the entire MC experiment many times, allowing unphysical estimates.
5
Extended ML example II
The likelihood does not provide any information on the goodness of the fit; this has to be checked separately:
● Simulate toy MC samples according to the estimated pdf (using the fit results from data as “true” parameter values) and compare the maximum likelihood value in the toys to the one in data.
● Draw the data in a (binned) histogram and “compare” the distribution with the result of the ML fit.
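The first check can be sketched as follows; the exponential model, sample size, and all numbers are illustrative assumptions:

```python
# Sketch of a toy-MC goodness-of-fit check: generate toys from the fitted
# model, refit each, and compare the data's minimum NLL to the toy distribution.
import numpy as np

rng = np.random.default_rng(2)

def min_nll(sample):
    lam_hat = 1.0 / sample.mean()            # ML estimate for an exponential rate
    return -np.sum(np.log(lam_hat) - lam_hat * sample)

data = rng.exponential(2.0, 500)             # stands in for the real data
nll_data = min_nll(data)

# Toys generated from the fitted pdf, each refit the same way.
lam_fit = 1.0 / data.mean()
toys = [min_nll(rng.exponential(1.0 / lam_fit, data.size)) for _ in range(500)]
p_value = np.mean([t >= nll_data for t in toys])
print(p_value)   # should not be extreme if the model describes the data
```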
6
Extended ML example II
Again we want to distinguish (and count) signal events with respect to background events.
The signal is the process
The background is combinatorial: the vertex (and then the particle) is reconstructed with the wrong tracks.
To select signal from background we can use two main handles:
1) the invariant mass of the two daughter particles has to peak at the B meson mass;
2) the time of flight of the B meson candidate has to be of the order of the B meson lifetime.
These two variables behave in a completely different way for the two event categories.
Let us look at the distributions of these variables.
7
Extended ML example II
A first look at the distributions (mass and time) allows us to state:
● pdf for signal mass: double Gaussian
● pdf for signal time: exponential (negative)
● pdf for background mass: exponential (almost flat)
● pdf for background time: exponential + Lorentzian
We build the pdf and the likelihood as:
By maximizing the likelihood, we can estimate the numbers of signal and background events as well as the B meson mass and lifetime.
8
Extended ML example II
[Figure: mass and lifetime distributions of the data, with the signal, background, and total fit components overlaid. The fit is done with the RooFit package (ROOT).]
9
Weighted maximum likelihood
Suppose we want to measure the polarization of the J/Ψ meson (1--).
The measurement can be done by looking at the angular distribution of
the decay product of the meson itself:
 and  are respectively the polar and azimuthal angles of the positive
muon, in the decay
in the J/Ψ rest frame, measured choosing the J/Ψ direction in the lab
frame as polar axis.

θ
10
Weighted likelihood – polarization measurement
We have to measure the angular distribution and fit with the function
There are two main problems to face:
1) when we select our signal, there is an unavoidable amount of background events (evident from the mass distribution);
2) the angular distribution of the background events is unknown and also very difficult to parametrize.
The likelihood function is:
where ε is the total detection efficiency, P is the angular function above, and Norm is a normalization function that keeps the probability normalized to 1.
11
Weighted likelihood – polarization measurement
The efficiency term in the denominator does not depend on the λ parameters and is therefore a constant in the maximization procedure.
In order to take the background events into account, the likelihood sum is extended to all events, but with proper weights:
The background mass distribution is linear. This hypothesis is well satisfied; otherwise we can always take the deviation into account by readjusting the weights in a proper way.
The combinatorial background angular distributions are the same in the signal and sideband (left and right) regions; this can be demonstrated by shifting the three regions 300 MeV up.
The background events' contribution cancels out if:
12
Weighted likelihood – polarization measurement
How do we evaluate the Norm function (which depends on the detector efficiency)?
We can again use the MC simulation, considering an unpolarized sample (P = 1), and the sum is over the MC events.
13
Weighted likelihood – polarization measurement
Then from MC events we can compute the function:
14
The sPlot technique
15
Relationship between ML and Bayesian estimators
In Bayesian statistics, both θ and x are random variables.
In the Bayes approach, if θ is a certain hypothesis:
posterior θ pdf (conditional pdf for θ given x)
prior θ probability
Purist Bayesian: p(θ|x) contains all the information about θ.
Pragmatist Bayesian: p(θ|x) can be a complicated function: summarize it by using a new estimator.
Looking at p(θ|x): what do we use for π(θ)? No golden rule (subjective!), often
represent ‘prior ignorance’ by π(θ) = constant, in which case
16
But... we could have used a different parameter, e.g., λ = 1/θ, and if prior π(θ) is
constant, then π(λ) is not! ‘Complete prior ignorance’ is not well defined.
Relationship between ML and Bayesian estimators
The main concern expressed by frequentist statisticians regarding the use of
Bayesian probability is its intrinsic dependence on a prior probability that can
be chosen in an arbitrary way. This arbitrariness makes Bayesian probability to
some extent subjective.
Adding more measurements increases one’s knowledge of the unknown parameter, hence the posterior probability will depend less on, and be less sensitive to, the choice of the prior probability. In cases where a large number of measurements is available, the results of Bayesian calculations tend, in most cases, to be identical to those of frequentist calculations.
Many interesting statistical problems arise in cases of low statistics, i.e. a small number of measurements. In those cases, Bayesian and frequentist methods usually lead to different results; with the Bayesian approach, the choice of the prior probabilities then plays a crucial role and has great influence on the results.
One main difficulty is how to choose a PDF that models one’s complete ignorance of an unknown parameter. One could naively choose a uniform (“flat”) PDF in the interval of validity of the parameter. But it is clear that if we change the parametrization from x to a function of x (say log x or 1/x), the transformed parameter will no longer have a uniform prior PDF.
17
The Jeffreys prior
One possible approach has been proposed by Harold Jeffreys: adopt a prior PDF that is invariant under parameter transformation. This choice is:
π(θ) ∝ √det I(θ)
where I(θ) is the Fisher information matrix:
I(θ)ᵢⱼ = −E[∂² ln L / ∂θᵢ ∂θⱼ]
Examples of Jeffreys prior distributions for some important parameters
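As a check of this definition: for a single Poisson observation with mean μ, the Fisher information is I(μ) = 1/μ, so the Jeffreys prior is π(μ) ∝ 1/√μ. A numerical sketch (the function name and summation cutoff are our own choices):

```python
# Numerical check that the Fisher information of a Poisson mean mu is 1/mu,
# so the Jeffreys prior is pi(mu) ∝ sqrt(I(mu)) = 1/sqrt(mu).
import numpy as np
from scipy import stats

def fisher_info_poisson(mu, k_max=200):
    k = np.arange(k_max)
    p = stats.poisson.pmf(k, mu)
    score = k / mu - 1.0            # d/dmu of ln P(k; mu)
    return np.sum(p * score**2)     # E[score^2]

for mu in (0.5, 2.0, 10.0):
    print(mu, fisher_info_poisson(mu), 1.0 / mu)   # the two numbers agree
```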
18
Interval estimation,
setting limits
19
Interval estimation — introduction
In addition to a ‘point estimate’ of a parameter we should report an
interval reflecting its statistical uncertainty.
Desirable properties of such an interval may include:
● communicate objectively the result of the experiment;
● have a given probability of containing the true parameter;
● provide the information needed to draw conclusions about the parameter, possibly incorporating stated prior beliefs.
Often we use ± the estimated standard deviation of the estimator. In some cases, however, this is not adequate:
● estimate near a physical boundary, e.g., an observed event rate consistent with zero.
We will look briefly at Frequentist and Bayesian intervals.
20
Neyman confidence intervals
A rigorous procedure to get confidence intervals in the frequentist approach.
Consider an estimator for a parameter (measurable).
We also need the pdf of the estimator.
Specify upper and lower tail probabilities, e.g., α = 0.05, β = 0.05, then find functions uα(θ) and vβ(θ) such that
P(estimator ≥ uα(θ)) = α,  P(estimator ≤ vβ(θ)) = β
(each probability is an integral over all the possible estimator values).
We obtain a confidence interval (CL = 1 − α − β) for the estimator as a function of the true parameter value θ. This is an interval for the estimator.
There is no unique way to define this interval with the same CL.
21
Confidence interval from the confidence belt
The confidence belt region is a function of the parameter.
Find the points where the observed estimate intersects the confidence belt.
This gives the confidence interval for the true parameter.
Confidence level = 1 − α − β = probability for the interval to cover the true value of the parameter (holds for any possible true θ).
22
Confidence intervals by inverting a test
Confidence intervals for a parameter θ can be found by defining a test of the hypothesized value θ (do this for all θ):
● Define values of the data that are ‘disfavored’ by θ (critical region) such that P(data in critical region) ≤ γ for a specified γ, e.g., 0.05 or 0.1.
● If the data are observed in the critical region, reject the value θ.
Now invert the test to define a confidence interval as:
● the set of θ values that would not be rejected in a test of size γ (the confidence level is 1 − γ). We have to collect many data...
The interval will cover the true value of θ with probability ≥ 1 − γ.
This is equivalent to the confidence belt construction; the confidence belt is the acceptance region of the test.
23
Relation between confidence interval and p-value
Equivalently, we can consider a significance test for each hypothesized value of θ, resulting in a p-value, pθ.
The confidence interval at CL = 1 − γ consists of those values of θ that are not rejected.
E.g., an upper limit on θ is the greatest value for which pθ ≥ γ.
In practice, we find it by setting pθ = γ and solving for θ.
24
Confidence intervals in practice
In practice, in order to find the interval [a, b] we have to solve:
setting uα(a) equal to the observed estimate gives a;
setting vβ(b) equal to the observed estimate gives b.
a is the hypothetical value of θ such that P(estimator ≥ observed value; a) = α;
b is the hypothetical value of θ such that P(estimator ≤ observed value; b) = β.
25
Meaning of a confidence interval
Important to keep in mind:
● The interval is random.
● The true θ is an unknown constant.
Often we report this interval as:
This does not mean
but: repeat the measurement many times, build the interval according to the same prescription each time, and in a fraction 1 − α − β of the experiments the interval will contain θ.
26
Central vs. one-sided confidence intervals
For a fixed CL, the choice of α and β is not unique; in the literature this is the so-called ordering rule.
Sometimes only α or β is specified: one-sided interval (limit).
Often α = β = γ/2, with coverage probability 1 − γ: a central confidence interval.
● N.B.: a central confidence interval does not mean an interval symmetric around the estimate of θ.
In HEP the convention to quote the error is:
α = β = γ/2 with 1 − γ = 68.3% (1σ)
27
Intervals from the likelihood function
In the large sample limit it can be shown for ML estimators:
(N-dimensional Gaussian with covariance V)
This defines a hyper-ellipsoidal confidence region, valid if the estimator follows a multi-dimensional Gaussian.
28
Approximate confidence regions from L(θ)
So the recipe to find the confidence region with CL = 1 − γ is:
For finite samples, these are approximate confidence regions:
● the coverage probability is not guaranteed to be exactly equal to 1 − γ;
● there is no simple theorem to say by how far off it will be (use MC).
Remember: here the interval is random, not the parameter.
29
Example of interval from ln L(θ)
For n = 1, CL = 1 − γ = 0.683 and Q = 1.
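The Q = 1 rule (the interval where ln L drops by 1/2 from its maximum) can be sketched for a single Poisson observation; the observed count n = 100 is an illustrative assumption:

```python
# Sketch: 68.3% interval from the points where ln L drops by 1/2 from its
# maximum (Q = 2*Delta(ln L) = 1), for a single Poisson observation n.
import numpy as np
from scipy import optimize

n = 100  # observed count (illustrative)

def delta_lnL(mu):
    # ln L(mu) = n ln mu - mu (up to a constant); the maximum is at mu_hat = n.
    return (n * np.log(mu) - mu) - (n * np.log(n) - n)

lo = optimize.brentq(lambda mu: delta_lnL(mu) + 0.5, 1e-6, n)
hi = optimize.brentq(lambda mu: delta_lnL(mu) + 0.5, n, 10 * n)
print(lo, hi)   # roughly n - sqrt(n) and n + sqrt(n), slightly asymmetric
```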
30
Setting limits on Poisson parameter
Consider again the case in which we have a sample of events containing signal and background (with means s and b), both Poisson variables.
Suppose that we can say how many background events we expect.
Unfortunately we observe:
There is clearly no evidence of signal. This means that we cannot exclude s = 0; we can anyway put an upper limit on the number of signal events.
31
Upper limit for Poisson parameter
We have to find the hypothetical s such that there is a given small probability, say γ = 0.05, to find as few events as we did or fewer.
Solving numerically for s gives an upper limit at a confidence level of 1 − γ (usually 0.95).
Suppose b = 0 and we find n = 0
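The numerical solution described above can be sketched as follows (the function name is our own; the condition is P(n ≤ n_obs; s + b) = γ):

```python
# Sketch: classical upper limit on s from P(n <= n_obs; s + b) = gamma,
# solved numerically.
from scipy import stats, optimize

def upper_limit(n_obs, b, gamma=0.05):
    # Find s such that observing n_obs or fewer events has probability gamma.
    f = lambda s: stats.poisson.cdf(n_obs, s + b) - gamma
    return optimize.brentq(f, 0.0, 100.0)

print(upper_limit(n_obs=0, b=0.0))   # -ln(0.05), about 3.0
```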
32
Calculating Poisson parameter limits
To find the lower and upper limits we can use the relation to the χ² distribution, through its quantiles; it can be found that:
s_lo = ½ F⁻¹_χ²(α; 2n) − b
s_up = ½ F⁻¹_χ²(1 − γ; 2(n + 1)) − b
where F⁻¹_χ² is the χ² quantile (inverse CDF).
For a low fluctuation of n this can give a negative result for s_up, i.e. the confidence interval is empty.
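A sketch of this quantile formula; it reproduces both the b = 0 limit and, with b = 2.5 and n = 0 at CL = 0.9, the negative result discussed on the next slide:

```python
# Sketch of the chi-square quantile form of the Poisson upper limit:
# s_up = 0.5 * F_chi2^{-1}(1 - gamma; 2(n+1)) - b
from scipy.stats import chi2

def s_up(n, b, gamma):
    return 0.5 * chi2.ppf(1.0 - gamma, 2 * (n + 1)) - b

print(s_up(n=0, b=0.0, gamma=0.05))   # about 3.0, as in the direct solution
print(s_up(n=0, b=2.5, gamma=0.10))   # negative: the interval is empty
```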
33
Limits near a physical boundary
Suppose e.g. b = 2.5 and we observe n = 0. If we choose CL = 0.9, we
find from the formula for sup
negative!?
Physicist:
We already knew s ≥ 0 before we started; can’t use negative upper limit
to report result of expensive experiment!
Statistician:
The interval is designed to cover the true value only 90% of the time; this was clearly not one of those times.
This is not an uncommon dilemma when the limit on a parameter is close to a physical boundary.
34
Expected limit for s = 0
Physicist: I should have used CL = 0.95, then sup = 0.496.
Even better: for CL = 0.917923 we get sup = 10⁻⁴!
We are not taking into account the background fluctuation
Reality check: with b = 2.5, typical Poisson fluctuation in n is at least √2.5
= 1.6. How can the limit be so low?
Look at the mean limit for the no-signal hypothesis (s = 0) (sensitivity).
[Figure: distribution of 95% CL upper limits with b = 2.5, s = 0; the mean upper limit is 4.44. From N MC experiments with n drawn from a Poisson with μ = 2.5, sup is evaluated at 95% CL for each.]
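This mean expected limit can be reproduced with a quick toy MC, here using the χ² quantile form of the limit (the number of toys and the seed are arbitrary choices):

```python
# Toy-MC estimate of the mean 95% CL upper limit for s = 0, b = 2.5.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
b, gamma = 2.5, 0.05
n = rng.poisson(b, size=20000)                       # n under the no-signal hypothesis
limits = 0.5 * chi2.ppf(1 - gamma, 2 * (n + 1)) - b  # upper limit for each toy
print(limits.mean())                                 # about 4.4, cf. the slide's 4.44
```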
35
The “flip-flopping” problem
In order to determine confidence intervals, a consistent choice of ordering rule
has to be adopted.
Feldman and Cousins demonstrated that the ordering rule choice must not
depend on the outcome of the measurements, otherwise the quoted confidence
intervals or upper limits could be incorrect.
In some cases, experiments searching for a rare signal choose, when quoting their result, to switch from a central interval to an upper limit depending on the outcome of the measurement.
A typical choice is to quote an upper limit if the significance of the observed signal is smaller than 3σ, and a central value otherwise.
We then have to quote the error at a fixed CL, say 90%.
If x ≥ 3σ we choose a symmetric interval (5% on each side), while if x < 3σ an upper limit implies a completely asymmetric interval.
36
The “flip-flopping” problem
From a single measurement of x we can decide to quote an interval with a certain CL if x > 3σ, or we can decide to quote only an upper limit if our measurement gives x < 3σ.
37
The “flip-flopping” problem
The choice to switch from a central interval to a fully asymmetric interval (upper limit) based on the observed x clearly spoils the statistical coverage.
Looking at the figure: depending on the value of μ, for the interval [x1, x2] obtained by crossing the confidence belt with a horizontal line, one may have cases where the coverage decreases from 90% to 85%, which is lower than the desired CL.
To avoid flip-flopping, decide before the measurement whether you will quote a limit or a two-sided interval, and stick to it. Or use Feldman-Cousins.
38
The Feldman Cousins method
The ordering rule proposed by Feldman and Cousins provides a Neyman
confidence belt that smoothly changes from a central or quasi-central interval to
an upper limit in the case of low observed signal yield.
The ordering rule is based on a likelihood ratio. Given a value θ0 of the unknown parameter under a Neyman construction, the chosen interval on the variable x is defined from the ratio of two PDFs of x: one under the hypothesis that θ is equal to the considered fixed value θ0, the other under the hypothesis that θ is equal to the maximum-likelihood estimate θbest(x) corresponding to the given measurement x.
39
Feldman Cousins: Gaussian case
Let us apply the Feldman-Cousins method to a Gaussian distribution.
When we divide by f(x|μbest) we obtain:
This is an asymmetric function with a longer tail towards negative x values.
Using the Feldman-Cousins approach, for large x we have the usual symmetric confidence interval.
Going to small x (close to the boundary), the interval becomes more and more asymmetric, and at a certain point it becomes a completely asymmetric interval (an upper limit).
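The construction can be sketched for this Gaussian case, x ~ N(μ, 1) with μ ≥ 0; the grid range and step size are arbitrary choices:

```python
# Sketch of the Feldman-Cousins construction for x ~ N(mu, 1) with mu >= 0.
# For each mu, x values are accepted in decreasing order of the likelihood
# ratio R(x) = f(x|mu) / f(x|mu_best), with mu_best = max(0, x).
import numpy as np
from scipy.stats import norm

def fc_interval(mu, cl=0.90, step=0.005):
    x = np.arange(-10.0, 15.0, step)
    r = norm.pdf(x, mu) / norm.pdf(x, np.maximum(x, 0.0))  # likelihood ratio
    order = np.argsort(-r)                                  # highest ratio first
    p = norm.pdf(x[order], mu) * step                       # probability of each cell
    accepted = order[np.cumsum(p) <= cl]                    # accumulate up to the CL
    return x[accepted].min(), x[accepted].max()

print(fc_interval(3.0))   # far from the boundary: about (3 - 1.645, 3 + 1.645)
```

Evaluating `fc_interval` for small μ shows the lower edge sticking to the boundary, i.e. the smooth transition to an upper limit described above.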
40
The Bayesian approach
In Bayesian statistics we need to start with a prior pdf π(θ); this reflects our degree of belief about θ before doing the experiment.
Bayes’ theorem tells us how our beliefs should be updated in light of the data x:
Then we have to integrate the posterior probability up to the desired confidence level.
For the Poisson case, with a 95% CL, we have:
41
Bayesian prior for Poisson parameter
Include the knowledge that s ≥ 0 by setting the prior π(s) = 0 for s < 0. Often we try to reflect ‘prior ignorance’ with e.g. a flat prior.
This is not normalized, but that is OK as long as L(s) dies off for large s.
It is not invariant under a change of parameter: if we had used a flat prior for, say, the mass of the Higgs boson, this would imply a non-flat prior for the expected number of Higgs events.
It does not really reflect a reasonable degree of belief, but it is often used as a point of reference; or it can be viewed as a recipe for producing an interval whose frequentist properties can be studied (the coverage will depend on the true s).
42
Bayesian interval with flat prior for s
Solve numerically to find the limit sup.
For the special case b = 0, the Bayesian upper limit with a flat prior is numerically the same as the classical one (a ‘coincidence’).
Otherwise the Bayesian limit is everywhere greater than the classical one (‘conservative’).
It never goes negative and does not depend on b if n = 0.
43