E. Santovetti
lesson 4
Maximum likelihood
Interval estimation
1
Extended Maximum Likelihood
Sometimes the total number of events n measured by the experiment is not fixed but is itself a random variable, for example a Poisson r.v. with mean ν.
The extended likelihood function is then:
L(ν, θ) = (ν^n / n!) e^(−ν) ∏ᵢ f(xᵢ; θ)
If ν is a function of θ we have:
ln L(θ) = −ν(θ) + Σᵢ ln[ν(θ) f(xᵢ; θ)] + const.
Example: the expected number of events of a certain process
Extended ML uses more information, so the errors on the parameters θ will be smaller than in the case in which n is treated as independent.
In case ν does not depend on θ we recover the usual likelihood.
2
Extended ML example
Consider two types of events (e.g., signal and background) each of
which predict a given pdf for the variable x: fs(x) and fb(x).
We observe a mixture of the two event types, signal fraction = θ,
expected total number = ν, observed total number = n
Let s =  and b = (1- be the number of signal and background that
we want to evaluate
3
Extended ML example (2)
Consider for signal a Gaussian pdf
and for background an exponential
Maximize log L to find s and b
Here errors reflect total Poisson fluctuation as well
as that in proportion of signal/background
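As a concrete illustration, a fit of this kind can be sketched numerically. All shapes and numbers below (Gaussian signal at 5.0 with width 0.5, exponential background with scale 4.0, s = 100, b = 200) are illustrative assumptions, not values from the slides:

```python
# Sketch of an extended ML fit: Gaussian signal + exponential background.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
s_true, b_true = 100.0, 200.0

# Toy data: Poisson-fluctuated numbers of signal and background events.
n_s = rng.poisson(s_true)
n_b = rng.poisson(b_true)
x = np.concatenate([rng.normal(5.0, 0.5, n_s),       # signal events
                    rng.exponential(4.0, n_b)])      # background events

f_s = lambda x: stats.norm.pdf(x, 5.0, 0.5)          # signal pdf
f_b = lambda x: stats.expon.pdf(x, scale=4.0)        # background pdf

def neg_log_L(p):
    s, b = p
    # Extended likelihood: Poisson term (s+b) minus sum of logs of the mixture.
    return (s + b) - np.sum(np.log(s * f_s(x) + b * f_b(x)))

res = optimize.minimize(neg_log_L, x0=[50.0, 100.0],
                        bounds=[(1e-9, None), (1e-9, None)])
s_hat, b_hat = res.x
print(s_hat, b_hat)   # close to the generated numbers of events
```

The errors returned by such a fit include the total Poisson fluctuation as well as the fluctuation in the signal/background proportion, as stated above.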
4
Unphysical values for estimators
Here the unphysical estimator is unbiased and should nevertheless be reported, since the average of a large number of unbiased estimates converges to the true value (cf. PDG).
Repeat the entire MC experiment many times, allowing unphysical estimates.
5
Extended ML example II
The likelihood does not provide any information on the goodness of the fit; this has to be checked separately:
● Simulate toy MC samples according to the estimated pdf (using the fit results from data as “true” parameter values) and compare the maximum likelihood value in the toys to the one in data.
● Draw the data in a (binned) histogram and “compare” the distribution with the result of the ML fit.
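The first check can be sketched as follows; the exponential model, sample size, and all numbers are illustrative assumptions:

```python
# Sketch of a toy-MC goodness-of-fit check: generate toys from the fitted
# model, refit each, and compare the data's minimum NLL to the toy distribution.
import numpy as np

rng = np.random.default_rng(2)

def min_nll(sample):
    lam_hat = 1.0 / sample.mean()            # ML estimate for an exponential rate
    return -np.sum(np.log(lam_hat) - lam_hat * sample)

data = rng.exponential(2.0, 500)             # stands in for the real data
nll_data = min_nll(data)

# Toys generated from the fitted pdf, each refit the same way.
lam_fit = 1.0 / data.mean()
toys = [min_nll(rng.exponential(1.0 / lam_fit, data.size)) for _ in range(500)]
p_value = np.mean([t >= nll_data for t in toys])
print(p_value)   # should not be extreme if the model describes the data
```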
6
Extended ML example II
Again we want to distinguish (and count) signal events with respect to background events.
The signal is the process
The background is combinatorial: the vertex (and then the particle) is reconstructed with the wrong tracks.
To select signal from background we can use two main handles:
1) the invariant mass of the two daughter particles has to peak at the B meson mass;
2) the time of flight of the B meson candidate has to be of the order of the B meson lifetime.
These two variables behave in a completely different way for the two event categories.
Let us look at the distributions of these variables.
7
Extended ML example II
A first look at the distributions (mass and time) allows us to state:
● pdf for signal mass: double Gaussian
● pdf for signal time: exponential (negative)
● pdf for background mass: exponential (almost flat)
● pdf for background time: exponential + Lorentzian
We build the pdf and the likelihood as:
By maximizing the likelihood, we can estimate the numbers of signal and background events as well as the B meson mass and lifetime.
8
Extended ML example II
[Figure: mass and lifetime distributions of the data, with the signal, background, and total fit components overlaid. The fit is done with the RooFit package (ROOT).]
9
Weighted maximum likelihood
Suppose we want to measure the polarization of the J/Ψ meson (1--).
The measurement can be done by looking at the angular distribution of
the decay product of the meson itself:
 and  are respectively the polar and azimuthal angles of the positive
muon, in the decay
in the J/Ψ rest frame, measured choosing the J/Ψ direction in the lab
frame as polar axis.

θ
10
Weighted likelihood – polarization measurement
We have to measure the angular distribution and fit with the function
There are two main problems to face:
1) when we select our signal, there is an unavoidable amount of background events (evident from the mass distribution);
2) the angular distribution of the background events is unknown and also very difficult to parametrize.
The likelihood function is:
where ε is the total detection efficiency, P is the angular function above, and Norm is a normalization function that keeps the probability normalized to 1.
11
Weighted likelihood – polarization measurement
The efficiency term in the denominator does not depend on the λ parameters and is therefore a constant in the maximization procedure.
In order to take the background events into account, the likelihood sum is extended to all events, but with proper weights:
The background mass distribution is linear. This hypothesis is well satisfied; otherwise we can always take the deviation into account by readjusting the weights in a proper way.
The combinatorial background angular distributions are the same in the signal and sideband (left and right) regions; this can be demonstrated by shifting the three regions 300 MeV up.
The background events' contribution cancels out if:
12
Weighted likelihood – polarization measurement
How do we evaluate the Norm function (which depends on the detector efficiency)?
We can again use the MC simulation, considering an unpolarized sample (P = 1), and the sum is over the MC events.
13
Weighted likelihood – polarization measurement
Then from MC events we can compute the function:
14
The sPlot technique
15
Relationship between ML and Bayesian estimators
In Bayesian statistics, both θ and x are random variables.
In the Bayes approach, if θ is a certain hypothesis:
posterior θ pdf (conditional pdf for θ given x)
prior θ probability
Purist Bayesian: p(θ|x) contains all the information about θ.
Pragmatist Bayesian: p(θ|x) can be a complicated function: summarize it by using a new estimator.
Looking at p(θ|x): what do we use for π(θ)? No golden rule (subjective!), often
represent ‘prior ignorance’ by π(θ) = constant, in which case
16
But... we could have used a different parameter, e.g., λ = 1/θ, and if prior π(θ) is
constant, then π(λ) is not! ‘Complete prior ignorance’ is not well defined.
Relationship between ML and Bayesian estimators
The main concern expressed by frequentist statisticians regarding the use of
Bayesian probability is its intrinsic dependence on a prior probability that can
be chosen in an arbitrary way. This arbitrariness makes Bayesian probability to
some extent subjective.
Adding more measurements increases one’s knowledge of the unknown parameter, hence the posterior probability will depend less on, and be less sensitive to, the choice of the prior probability. In cases where a large number of measurements is available, the results of Bayesian calculations tend, in most cases, to be identical to those of frequentist calculations.
Many interesting statistical problems arise in cases of low statistics, i.e. a small number of measurements. In those cases, Bayesian and frequentist methods usually lead to different results; with the Bayesian approach, the choice of the prior probabilities then plays a crucial role and has great influence on the results.
One main difficulty is how to choose a PDF that models one’s complete ignorance of an unknown parameter. One could naively choose a uniform (“flat”) PDF in the interval of validity of the parameter. But it is clear that if we change the parametrization from x to a function of x (say log x or 1/x), the transformed parameter will no longer have a uniform prior PDF.
17
The Jeffreys prior
One possible approach has been proposed by Harold Jeffreys: adopt a prior PDF that is invariant under parameter transformation. This choice is:
π(θ) ∝ √det I(θ)
where I(θ) is the Fisher information matrix:
I(θ)ᵢⱼ = −E[∂² ln L / ∂θᵢ ∂θⱼ]
Examples of Jeffreys prior distributions for some important parameters
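As a check of this definition: for a single Poisson observation with mean μ, the Fisher information is I(μ) = 1/μ, so the Jeffreys prior is π(μ) ∝ 1/√μ. A numerical sketch (the function name and summation cutoff are our own choices):

```python
# Numerical check that the Fisher information of a Poisson mean mu is 1/mu,
# so the Jeffreys prior is pi(mu) ∝ sqrt(I(mu)) = 1/sqrt(mu).
import numpy as np
from scipy import stats

def fisher_info_poisson(mu, k_max=200):
    k = np.arange(k_max)
    p = stats.poisson.pmf(k, mu)
    score = k / mu - 1.0            # d/dmu of ln P(k; mu)
    return np.sum(p * score**2)     # E[score^2]

for mu in (0.5, 2.0, 10.0):
    print(mu, fisher_info_poisson(mu), 1.0 / mu)   # the two numbers agree
```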
18
Interval estimation,
setting limits
19
Interval estimation — introduction
In addition to a ‘point estimate’ of a parameter we should report an
interval reflecting its statistical uncertainty.
Desirable properties of such an interval may include:
● communicate objectively the result of the experiment;
● have a given probability of containing the true parameter;
● provide the information needed to draw conclusions about the parameter, possibly incorporating stated prior beliefs.
Often we use ± the estimated standard deviation of the estimator. In some cases, however, this is not adequate:
● estimate near a physical boundary, e.g., an observed event rate consistent with zero.
We will look briefly at Frequentist and Bayesian intervals.
20
Neyman confidence intervals
A rigorous procedure to get confidence intervals in the frequentist approach.
Consider an estimator for a parameter (measurable).
We also need the pdf of the estimator.
Specify upper and lower tail probabilities, e.g., α = 0.05, β = 0.05, then find functions uα(θ) and vβ(θ) such that
P(estimator ≥ uα(θ)) = α,  P(estimator ≤ vβ(θ)) = β
(each probability is an integral over all the possible estimator values).
We obtain a confidence interval (CL = 1 − α − β) for the estimator as a function of the true parameter value θ. This is an interval for the estimator.
There is no unique way to define this interval with the same CL.
21
Confidence interval from the confidence belt
The confidence belt region is a function of the parameter.
Find the points where the observed estimate intersects the confidence belt.
This gives the confidence interval for the true parameter.
Confidence level = 1 − α − β = probability for the interval to cover the true value of the parameter (holds for any possible true θ).
22
Confidence intervals by inverting a test
Confidence intervals for a parameter θ can be found by defining a test of the hypothesized value θ (do this for all θ):
● Define values of the data that are ‘disfavored’ by θ (critical region) such that P(data in critical region) ≤ γ for a specified γ, e.g., 0.05 or 0.1.
● If the data are observed in the critical region, reject the value θ.
Now invert the test to define a confidence interval as:
● the set of θ values that would not be rejected in a test of size γ (the confidence level is 1 − γ). We have to collect many data...
The interval will cover the true value of θ with probability ≥ 1 − γ.
This is equivalent to the confidence belt construction; the confidence belt is the acceptance region of the test.
23
Relation between confidence interval and p-value
Equivalently, we can consider a significance test for each hypothesized value of θ, resulting in a p-value, pθ.
The confidence interval at CL = 1 − γ consists of those values of θ that are not rejected.
E.g., an upper limit on θ is the greatest value for which pθ ≥ γ.
In practice, we find it by setting pθ = γ and solving for θ.
24
Confidence intervals in practice
In practice, in order to find the interval [a, b] we have to solve:
setting uα(a) equal to the observed estimate gives a;
setting vβ(b) equal to the observed estimate gives b.
a is the hypothetical value of θ such that P(estimator ≥ observed value; a) = α;
b is the hypothetical value of θ such that P(estimator ≤ observed value; b) = β.
25
Meaning of a confidence interval
Important to keep in mind:
● The interval is random.
● The true θ is an unknown constant.
Often we report this interval as:
This does not mean
but: repeat the measurement many times, build the interval according to the same prescription each time, and in a fraction 1 − α − β of the experiments the interval will contain θ.
26
Central vs. one-sided confidence intervals
For a fixed CL, the choice of α and β is not unique; in the literature this is the so-called ordering rule.
Sometimes only α or β is specified: one-sided interval (limit).
Often α = β = γ/2, with coverage probability 1 − γ: a central confidence interval.
● N.B.: a central confidence interval does not mean an interval symmetric around the estimate of θ.
In HEP the convention to quote the error is:
α = β = γ/2 with 1 − γ = 68.3% (1σ)
27
Intervals from the likelihood function
In the large sample limit it can be shown for ML estimators:
(N-dimensional Gaussian with covariance V)
This defines a hyper-ellipsoidal confidence region, valid if the estimator follows a multi-dimensional Gaussian.
28
Approximate confidence regions from L(θ)
So the recipe to find the confidence region with CL = 1 − γ is:
For finite samples, these are approximate confidence regions:
● the coverage probability is not guaranteed to be exactly equal to 1 − γ;
● there is no simple theorem to say by how far off it will be (use MC).
Remember: here the interval is random, not the parameter.
29
Example of interval from ln L(θ)
For n = 1, CL = 1 − γ = 0.683 and Q = 1.
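The Q = 1 rule (the interval where ln L drops by 1/2 from its maximum) can be sketched for a single Poisson observation; the observed count n = 100 is an illustrative assumption:

```python
# Sketch: 68.3% interval from the points where ln L drops by 1/2 from its
# maximum (Q = 2*Delta(ln L) = 1), for a single Poisson observation n.
import numpy as np
from scipy import optimize

n = 100  # observed count (illustrative)

def delta_lnL(mu):
    # ln L(mu) = n ln mu - mu (up to a constant); the maximum is at mu_hat = n.
    return (n * np.log(mu) - mu) - (n * np.log(n) - n)

lo = optimize.brentq(lambda mu: delta_lnL(mu) + 0.5, 1e-6, n)
hi = optimize.brentq(lambda mu: delta_lnL(mu) + 0.5, n, 10 * n)
print(lo, hi)   # roughly n - sqrt(n) and n + sqrt(n), slightly asymmetric
```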
30
Setting limits on Poisson parameter
Consider again the case in which we have a sample of events containing signal and background (with means s and b), both Poisson variables.
Suppose that we can say how many background events we expect.
Unfortunately we observe:
There is clearly no evidence of signal. This means that we cannot exclude s = 0; we can anyway put an upper limit on the number of signal events.
31
Upper limit for Poisson parameter
We have to find the hypothetical s such that there is a given small probability, say γ = 0.05, to find as few events as we did or fewer.
Solving numerically for s gives an upper limit at a confidence level of 1 − γ (usually 0.95).
Suppose b = 0 and we find n = 0
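The numerical solution described above can be sketched as follows (the function name is our own; the condition is P(n ≤ n_obs; s + b) = γ):

```python
# Sketch: classical upper limit on s from P(n <= n_obs; s + b) = gamma,
# solved numerically.
from scipy import stats, optimize

def upper_limit(n_obs, b, gamma=0.05):
    # Find s such that observing n_obs or fewer events has probability gamma.
    f = lambda s: stats.poisson.cdf(n_obs, s + b) - gamma
    return optimize.brentq(f, 0.0, 100.0)

print(upper_limit(n_obs=0, b=0.0))   # -ln(0.05), about 3.0
```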
32
Calculating Poisson parameter limits
To find the lower and upper limits we can use the relation to the χ² distribution, through its quantiles; it can be found that:
s_lo = ½ F⁻¹_χ²(α; 2n) − b
s_up = ½ F⁻¹_χ²(1 − γ; 2(n + 1)) − b
where F⁻¹_χ² is the χ² quantile (inverse CDF).
For a low fluctuation of n this can give a negative result for s_up, i.e. the confidence interval is empty.
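A sketch of this quantile formula; it reproduces both the b = 0 limit and, with b = 2.5 and n = 0 at CL = 0.9, the negative result discussed on the next slide:

```python
# Sketch of the chi-square quantile form of the Poisson upper limit:
# s_up = 0.5 * F_chi2^{-1}(1 - gamma; 2(n+1)) - b
from scipy.stats import chi2

def s_up(n, b, gamma):
    return 0.5 * chi2.ppf(1.0 - gamma, 2 * (n + 1)) - b

print(s_up(n=0, b=0.0, gamma=0.05))   # about 3.0, as in the direct solution
print(s_up(n=0, b=2.5, gamma=0.10))   # negative: the interval is empty
```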
33
Limits near a physical boundary
Suppose e.g. b = 2.5 and we observe n = 0. If we choose CL = 0.9, we
find from the formula for sup
negative!?
Physicist:
We already knew s ≥ 0 before we started; can’t use negative upper limit
to report result of expensive experiment!
Statistician:
The interval is designed to cover the true value only 90% of the time; this was clearly not one of those times.
This is not an uncommon dilemma when the limit on a parameter is close to a physical boundary.
34
Expected limit for s = 0
Physicist: I should have used CL = 0.95, then sup = 0.496.
Even better: for CL = 0.917923 we get sup = 10⁻⁴!
We are not taking into account the background fluctuation
Reality check: with b = 2.5, typical Poisson fluctuation in n is at least √2.5
= 1.6. How can the limit be so low?
Look at the mean limit for the no-signal hypothesis (s = 0) (sensitivity).
[Figure: distribution of 95% CL upper limits with b = 2.5, s = 0; the mean upper limit is 4.44. From N MC experiments with n drawn from a Poisson with μ = 2.5, sup is evaluated at 95% CL for each.]
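This mean expected limit can be reproduced with a quick toy MC, here using the χ² quantile form of the limit (the number of toys and the seed are arbitrary choices):

```python
# Toy-MC estimate of the mean 95% CL upper limit for s = 0, b = 2.5.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
b, gamma = 2.5, 0.05
n = rng.poisson(b, size=20000)                       # n under the no-signal hypothesis
limits = 0.5 * chi2.ppf(1 - gamma, 2 * (n + 1)) - b  # upper limit for each toy
print(limits.mean())                                 # about 4.4, cf. the slide's 4.44
```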
35
The “flip-flopping” problem
In order to determine confidence intervals, a consistent choice of ordering rule
has to be adopted.
Feldman and Cousins demonstrated that the ordering rule choice must not
depend on the outcome of the measurements, otherwise the quoted confidence
intervals or upper limits could be incorrect.
In some cases, experiments searching for a rare signal choose, when quoting their result, to switch from a central interval to an upper limit depending on the outcome of the measurement.
A typical choice is to quote an upper limit if the significance of the observed signal is smaller than 3σ, and a central value otherwise.
We then have to quote the error at a fixed CL, say 90%.
If x ≥ 3σ we choose a symmetric interval (5% on each side), while if x < 3σ an upper limit implies a completely asymmetric interval.
36
The “flip-flopping” problem
From a single measurement of x we can decide to quote an interval with a certain CL if x > 3σ, or we can decide to quote only an upper limit if our measurement gives x < 3σ.
37
The “flip-flopping” problem
The choice to switch from a central interval to a fully asymmetric interval (upper limit) based on the observed x clearly spoils the statistical coverage.
Looking at the figure: depending on the value of μ, for the interval [x1, x2] obtained by crossing the confidence belt with a horizontal line, one may have cases where the coverage decreases from 90% to 85%, which is lower than the desired CL.
To avoid flip-flopping, decide before the measurement whether you will quote a limit or a two-sided interval, and stick to it. Or use Feldman-Cousins.
38
The Feldman Cousins method
The ordering rule proposed by Feldman and Cousins provides a Neyman
confidence belt that smoothly changes from a central or quasi-central interval to
an upper limit in the case of low observed signal yield.
The ordering rule is based on a likelihood ratio. Given a value θ0 of the unknown parameter under a Neyman construction, the chosen interval on the variable x is defined from the ratio of two PDFs of x: one under the hypothesis that θ is equal to the considered fixed value θ0, the other under the hypothesis that θ is equal to the maximum-likelihood estimate θbest(x) corresponding to the given measurement x.
39
Feldman Cousins: Gaussian case
Let us apply the Feldman-Cousins method to a Gaussian distribution.
When we divide by f(x|μbest) we obtain:
This is an asymmetric function with a longer tail towards negative x values.
Using the Feldman-Cousins approach, for large x we have the usual symmetric confidence interval.
Going to small x (close to the boundary), the interval becomes more and more asymmetric, and at a certain point it becomes a completely asymmetric interval (an upper limit).
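The construction can be sketched for this Gaussian case, x ~ N(μ, 1) with μ ≥ 0; the grid range and step size are arbitrary choices:

```python
# Sketch of the Feldman-Cousins construction for x ~ N(mu, 1) with mu >= 0.
# For each mu, x values are accepted in decreasing order of the likelihood
# ratio R(x) = f(x|mu) / f(x|mu_best), with mu_best = max(0, x).
import numpy as np
from scipy.stats import norm

def fc_interval(mu, cl=0.90, step=0.005):
    x = np.arange(-10.0, 15.0, step)
    r = norm.pdf(x, mu) / norm.pdf(x, np.maximum(x, 0.0))  # likelihood ratio
    order = np.argsort(-r)                                  # highest ratio first
    p = norm.pdf(x[order], mu) * step                       # probability of each cell
    accepted = order[np.cumsum(p) <= cl]                    # accumulate up to the CL
    return x[accepted].min(), x[accepted].max()

print(fc_interval(3.0))   # far from the boundary: about (3 - 1.645, 3 + 1.645)
```

Evaluating `fc_interval` for small μ shows the lower edge sticking to the boundary, i.e. the smooth transition to an upper limit described above.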
40
The Bayesian approach
In Bayesian statistics we need to start with a prior pdf π(θ); this reflects our degree of belief about θ before doing the experiment.
Bayes’ theorem tells us how our beliefs should be updated in light of the data x:
Then we have to integrate the posterior probability up to the desired confidence level.
For the Poisson case, with a 95% CL, we have:
41
Bayesian prior for Poisson parameter
Include the knowledge that s ≥ 0 by setting the prior π(s) = 0 for s < 0. Often we try to reflect ‘prior ignorance’ with e.g. a flat prior.
This is not normalized, but that is OK as long as L(s) dies off for large s.
It is not invariant under a change of parameter: if we had used a flat prior for, say, the mass of the Higgs boson, this would imply a non-flat prior for the expected number of Higgs events.
It does not really reflect a reasonable degree of belief, but it is often used as a point of reference; or it can be viewed as a recipe for producing an interval whose frequentist properties can be studied (the coverage will depend on the true s).
42
Bayesian interval with flat prior for s
Solve numerically to find the limit sup.
For the special case b = 0, the Bayesian upper limit with a flat prior is numerically the same as the classical one (a ‘coincidence’).
Otherwise the Bayesian limit is everywhere greater than the classical one (‘conservative’).
It never goes negative and does not depend on b if n = 0.
43