Model Averaging
Model Averaging: Benchmark Case
CES LMU München
Gernot Doppelhofer
Norwegian School of Economics (NHH)
15-17 March 2011
Outline

1. Motivation
2. Statistical Framework
   - Decision Theory
   - Unconditional Distribution
   - Bayesian Hypothesis Testing
3. Linear Regression Model
   - Normal Linear Model
   - Likelihood Function
   - Prior Distributions
   - Posterior Analysis
   - Model Space
4. Numerical Methods
Model Uncertainty
Model uncertainty is a problem of empirical (non-experimental) research:
- theory uncertainty: "open-endedness"
- specification uncertainty: functional form, heterogeneity
- data issues: outliers, measurement error

Ignoring model uncertainty leads to biased estimation and misleading inference and prediction.

Goal: robust inference, prediction and policy evaluation.
Model Averaging: Principle
Model averaging fully addresses model uncertainty by "integrating out" the uncertainty from the distribution of the parameters of interest.

Model averaging is conceptually straightforward:
- combine sample (likelihood) information with relative (posterior) model weights
- Bayesian, Empirical Bayes, and Frequentist approaches

Implementation issues: choice of prior, numerical simulations.
Decision Theory
Suppose a policymaker chooses action (policy) a to maximize expected utility (or minimize expected loss) after observing data Y:

$$\max_a \; E[u(a,\theta)\mid Y] = \int u(a,\theta)\, p(\theta\mid Y)\, d\theta \qquad (1)$$

The parameter (vector) of interest θ represents the source of uncertainty for the policymaker:
- effect of an economic variable: θ > 0 or θ ≤ 0
- prediction of future observations y^f

The preferred action a* depends on the objective function (1) and the probability distribution p(θ|Y).
Contrast with Conventional Policy Analysis
1. The entire distribution of the parameter θ is relevant. Uncertainty cannot necessarily be reduced to an expected value (mean) and associated variance.
2. Policy evaluation is not necessarily the same as hypothesis testing.
3. Distinguish parameters θ from estimates of parameters θ̂:
   - a standard Bayesian objection to the Frequentist approach
   - in many cases Bayesian and maximum likelihood estimates converge, so the issue is of second-order importance in large samples
Unconditional Distribution
The unconditional or posterior distribution p(θ|Y) (by Bayes' rule):

$$p(\theta\mid Y) = \frac{L(Y\mid\theta)\,p(\theta)}{p(Y)} \propto L(Y\mid\theta)\,p(\theta) \qquad (2)$$

- L(Y|θ) = likelihood function summarizing all information about θ contained in the data Y
- p(θ) = prior distribution
- p(Y) = normalizing factor, the marginal likelihood of the data

Note: classical econometricians view the parameter θ as fixed (non-random), but the estimator θ̂ as a random variable.
Model Weights
Suppose there are K candidate models M_1, ..., M_K to explain the data Y. Model M_j is described by a probability distribution p(Y|θ_j, M_j) with model-specific parameter vector θ_j.

Unconditional posterior distribution:

$$p(\theta\mid Y) = \sum_{j=1}^{K} w_j \cdot p(\theta_j \mid M_j, Y) \qquad (3)$$

- w_j proportional to fit in explaining the observed data
- thick modelling: equal weights w_j = 1/K
Bayesian Model Weights
In the Bayesian context, the weights are posterior model probabilities (by Bayes' rule):

$$w_j = p(M_j\mid Y) = \frac{L(Y\mid M_j)\,p(M_j)}{p(Y)} \propto L(Y\mid M_j)\,p(M_j) \qquad (4)$$

- posterior weights are proportional to the model-specific marginal likelihood and the prior model probability

The marginal (model-specific) likelihood is the likelihood integrated with respect to θ_j:

$$L(Y\mid M_j) = \int_{\theta} L(Y\mid\theta_j, M_j)\, p(\theta_j\mid M_j)\, d\theta_j \qquad (5)$$
Bayesian Interpretation: Hierarchical Mixture Model
The Bayesian approach to the multi-model setup assigns a prior probability distribution p(θ_j|M_j) to the parameters of each model, and a prior probability p(M_j) to each model.

The prior setup induces a joint distribution over the data, parameters and models:

$$p(Y, \theta_j, M_j) = L(Y\mid\theta_j, M_j)\, p(\theta_j\mid M_j)\, p(M_j) \qquad (6)$$

The priors embed the separate models within a large hierarchical mixture model. Through conditioning and marginalization, we obtain the posterior quantities of interest.
Unconditional Mean and Variance
$$E(\theta\mid Y) = \sum_{j=1}^{K} p(M_j\mid Y)\cdot E(\theta_j \mid M_j, Y) \qquad (7)$$

$$
\begin{aligned}
V(\theta\mid Y) &= E(\theta^2\mid Y) - [E(\theta\mid Y)]^2 \\
&= \sum_{j=1}^{K} p(M_j\mid Y)\cdot\left\{ V(\theta_j\mid M_j,Y) + \left[E(\theta_j\mid M_j,Y)\right]^2 \right\} - [E(\theta\mid Y)]^2 \\
&= \sum_{j=1}^{K} p(M_j\mid Y)\cdot\left\{ V(\theta_j\mid M_j,Y) + \left[E(\theta_j\mid M_j,Y) - E(\theta\mid Y)\right]^2 \right\}
\end{aligned}
\qquad (8)
$$

Note: the unconditional variance (8) exceeds the weighted sum of conditional variances, reflecting uncertainty about the mean (Draper 1995).
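A direct Python transcription of the moment formulas (7)-(8), assuming the posterior model weights and the model-specific conditional moments are given (all numbers below are illustrative):

```python
import numpy as np

def averaged_moments(weights, cond_means, cond_vars):
    """Unconditional mean (7) and variance (8) from model-specific moments."""
    w = np.asarray(weights, dtype=float)
    m = np.asarray(cond_means, dtype=float)
    v = np.asarray(cond_vars, dtype=float)
    mean = np.sum(w * m)                        # eq. (7)
    var = np.sum(w * (v + (m - mean) ** 2))     # eq. (8), last line
    return mean, var

# within-model variance plus between-model disagreement about the mean
print(averaged_moments([0.6, 0.3, 0.1], [1.2, 0.8, -0.1], [0.05, 0.07, 0.04]))
```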
Predictive Distribution
The predictive distribution for Y^f is obtained from the model-weighted conditional predictive distributions:

$$p(Y^f\mid Y) = \sum_{j=1}^{K} p(Y^f \mid M_j, Y)\cdot p(M_j\mid Y) \qquad (9)$$

where $p(Y^f\mid M_j, Y) = \int p(Y^f\mid\theta_j, M_j)\, p(\theta_j\mid M_j, Y)\, d\theta_j$.

By averaging over unknown models, the predictive distribution incorporates the model uncertainty embedded in the priors.

A natural point prediction of Y^f is the mean of p(Y^f|Y):

$$E(Y^f\mid Y) = \sum_{j=1}^{K} E(Y^f\mid M_j, Y)\cdot p(M_j\mid Y) \qquad (10)$$
Bayesian Hypothesis Testing
Compare two models M_i and M_j by posterior odds:

$$\frac{p(M_i\mid Y)}{p(M_j\mid Y)} = \frac{L(Y\mid M_i)}{L(Y\mid M_j)} \times \frac{p(M_i)}{p(M_j)} \qquad (11)$$

Posterior odds combine sample information, summarized in the so-called Bayes factor L(Y|M_i)/L(Y|M_j), with the prior odds p(M_i)/p(M_j).

Similarly, the weight of model M_i relative to the K models is given by (4), where the normalizing factor is $\sum_{j=1}^{K} L(Y\mid M_j)\, p(M_j)$.
Model Averaging vs. Model Selection
Starting from the priors p(θ_j|M_j) and p(M_j), the posterior model probabilities give a complete representation of post-data model uncertainty, which can be used for a variety of inferences and decisions.

The optimal strategy depends on the underlying loss function:
- Selecting the highest posterior probability model corresponds to 0-1 loss for correct selection.
- The model-averaged point prediction (10) minimizes quadratic loss.
- The predictive distribution (9) minimizes Kullback-Leibler loss w.r.t. the actual predictive distribution p(Y^f|θ_j, M_j).
Implementation Issues
The Bayesian approach to model averaging offers (i) generality, (ii) explicit treatment of model uncertainty, and (iii) ready integration with decision-making.

Remark (Implementation of Bayesian Model Averaging)
1. prior distribution p(θ): assumptions about hyperparameters and functional form
2. prior model probabilities p(M_j)
3. model space: the number of models K can be very large, requiring numerical simulation of the posterior distribution (2)
Linear Regression Model
Example
Consider the linear regression:

$$y_i = \alpha + x_{i1}\beta_1 + \dots + x_{ik}\beta_k + \varepsilon_i \qquad (12)$$

- y_i = observations of the dependent variable, i = 1, ..., N
- x_{i1}, ..., x_{ik} = observations of the k explanatory variables, i = 1, ..., N
- α, β_1, ..., β_k = coefficients on the constant term and the regressors
- ε_i = residuals, i = 1, ..., N
Multivariate Regression Model
Write the multivariate linear regression model in matrix form:

$$y = X\beta + \varepsilon \qquad (13)$$

N × 1 vectors of the dependent variable y and residuals ε:

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix}$$

N × (k+1) matrix of regressors X = (1, x_1, ..., x_k) with coefficient vector β = (α, β_1, ..., β_k)′:

$$X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \dots & x_{Nk} \end{pmatrix}$$

Note: each regression model also includes a constant term and the corresponding intercept α.
Benchmark: Normal Linear Model
Benchmark assumptions about residuals and regressors:

1. Residuals are independently and identically normally distributed: ε_i ~ i.i.d. N(0, σ²); residuals are conditionally exchangeable and homoscedastic, with diagonal variance matrix σ²I.
2. Alternative assumptions about the regressors (see Poirier 1995):
   - Case I: regressors are fixed, i.e. not random variables.
   - Case II: regressors are predetermined (weakly exogenous), with prior distribution independent of the parameters [α, β′, σ²].
Uncertainty about Regressors
For a given variable of interest y, the analyst is uncertain which explanatory variables (regressors) X_j, out of a total number k, to include.

Examples: linear regression model, nonparametric regression with unknown regression function E(y|X), forecasting in a data-rich environment.

Model M_j is described by a (k × 1) binary vector γ = (γ_1, ..., γ_k)′, where γ_j = 1 (γ_j = 0) means x_j is included in (excluded from) regression (12).

The model space is large: all combinations of the k regressors, K = 2^k (a small enumeration sketch follows below).
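A small Python sketch of the model space, representing each model M_j by its binary inclusion vector γ; the brute-force enumeration below is only feasible for small k (for large k the space must be sampled, e.g. by the MCMC methods discussed later):

```python
import numpy as np
from itertools import product

def enumerate_models(k):
    """All K = 2**k inclusion vectors gamma = (gamma_1, ..., gamma_k)."""
    return [np.array(g) for g in product((0, 1), repeat=k)]

models = enumerate_models(3)
print(len(models))     # 8 = 2**3 candidate models
print(models[5])       # [1 0 1]: x1 and x3 included, x2 excluded
```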
Likelihood Function
For the normal linear regression model, the likelihood function can be written (see Koop 2003, section 2.2):

$$L(y\mid\beta_j, \sigma^2) \propto \exp\left\{ -\frac{1}{2\sigma^2}\left(\beta_j - \hat\beta_j\right)' X_j' X_j \left(\beta_j - \hat\beta_j\right) \right\} \times \sigma^{-(v_j+1)} \exp\left\{ -\frac{v_j s_j^2}{2\sigma^2} \right\} \qquad (14)$$

where the OLS estimates are the slope $\hat\beta_j = (X_j'X_j)^{-1}X_j'y$ and variance $s_j^2 = (y - X_j\hat\beta_j)'(y - X_j\hat\beta_j)/v_j$, with degrees of freedom $v_j = N - k_j - 1$.

- the likelihood (14) is the product of a normal distribution for the slope β_j and an inverse-Gamma distribution for the variance σ².
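A Python sketch of the OLS quantities entering the likelihood (14); the function and variable names are illustrative, and X_j is assumed to already carry a leading column of ones for the intercept:

```python
import numpy as np

def ols_quantities(y, Xj):
    """OLS inputs to likelihood (14): beta_hat_j, s_j^2 and v_j.

    Xj includes a leading column of ones, so the number of slope
    coefficients is kj = Xj.shape[1] - 1 and vj = N - kj - 1.
    """
    N, cols = Xj.shape
    kj = cols - 1
    beta_hat, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    resid = y - Xj @ beta_hat
    vj = N - kj - 1                         # degrees of freedom
    s2 = float(resid @ resid) / vj          # OLS variance estimate
    return beta_hat, s2, vj
```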
Prior Distributions
The prior distribution for the parameters can take any form. Classes of prior distribution are chosen for (i) analytic tractability, (ii) ease of interpretation and (iii) computational reasons (see also Lecture 1):

- Bayesian conjugate priors: lead to a posterior distribution of the same class when combined with the likelihood.
- Noninformative priors: introduce no informative prior information. Idea: "let the data speak".
- Empirical Bayes (EB) priors: use sample information to specify prior parameters and limit prior information.
Bayesian Conjugate Priors
For the normal regression model (12), the natural conjugate prior is a normal distribution for β and an inverse-Gamma distribution for σ² (the so-called Normal-Gamma family):

$$p(\beta_j\mid\sigma^2, M_j) \sim N(\beta_{0j}, \sigma^2 V_{0j}) \qquad (15)$$
$$p(\sigma^2\mid M_j) = p(\sigma^2) \sim IG(s_0^2, v_0)$$

- interpretation: a "fictitious sample" with the same properties as the data.
- drawback of the Bayesian approach: marginal likelihood and posterior model weights depend on the unknown hyperparameters (β_0, V_0, s_0, v_0).
Noninformative Priors
Non-Bayesians criticize arbitrary priors: non-data information should be minimized, since it is "not scientific".

Jeffreys (1946, 1961) proposes a "noninformative" prior proportional to the square root of the information matrix. Jeffreys priors can be motivated using Shannon's information criterion as a distance between densities (see Zellner 1971).

In multi-parameter and/or hierarchical setups, noninformative priors can give highly undesirable results (for discussion, see Poirier 1995). Poirier (1995) recommends using noninformative priors with great care and conducting Bayesian sensitivity analysis.
Empirical Bayes Priors
To represent diffuse prior information, let the prior be dominated by sample information. The posterior distribution then essentially reflects the sample information embodied in the likelihood.

Remark (Koop's (2003) "rule of thumb")
Use noninformative priors for parameters common to all models (α, σ²), but proper priors for all other parameters (the β_j's):

$$p(\alpha) \propto 1 \qquad (16)$$
$$p(\sigma^2) \propto \frac{1}{\sigma^2}$$
Zellner's g-prior

Zellner (1986) proposes choosing the prior covariance (inverse of the prior precision) in (15) equal to the so-called g-prior:

$$V_{0j} = \left( g_0\, X_j' X_j \right)^{-1} \qquad (17)$$

- g_0 is the factor of proportionality of the prior to the sample precision.
- extremes: g_0 = 0 implies a noninformative prior; g_0 = 1 implies that prior and data receive equal weight.

Fernandez, Ley and Steel (2001a) recommend "benchmark" values:

$$g_0 = \begin{cases} 1/k^2, & \text{if } N \le k^2 \\ 1/N, & \text{if } N > k^2 \end{cases} \qquad (18)$$
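A short Python sketch of (17) and (18); the function names are illustrative:

```python
import numpy as np

def fls_benchmark_g0(N, k):
    """FLS (2001a) benchmark choice of g0, eq. (18)."""
    return 1.0 / k**2 if N <= k**2 else 1.0 / N

def g_prior_covariance(Xj, g0):
    """Zellner g-prior covariance, eq. (17): V0j = (g0 Xj'Xj)^(-1)."""
    return np.linalg.inv(g0 * Xj.T @ Xj)
```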
Bayesian Posterior Analysis
Remember: the posterior distribution (2) is proportional to the likelihood times the prior distribution.

Using Bayesian conjugate prior distributions, standard textbook results apply (see, for example, Koop 2003, sections 3.5 & 3.6):

1. the posterior distribution p(β, σ²|y) is Normal-inverse-Gamma.
2. the slope coefficient β has a marginal t-distribution, with posterior mean and variance incorporating both prior and sample information.
3. the posterior odds comparing two models M_1 and M_2 depend on (i) the prior odds ratio, (ii) model fit, (iii) coherence between prior and sample information, and (iv) parsimony.
Empirical Bayes (EB) Approach
FLS (2001a): relative posterior model weights are proportional to the prior p(M_j), the likelihood, and a degrees-of-freedom penalty ((1 + g_0)/g_0)^(-k_j/2):

$$p(M_j\mid Y) \propto p(M_j) \cdot \left( \frac{1+g_0}{g_0} \right)^{-k_j/2} \cdot SSE_j^{-(N-1)/2} \qquad (19)$$

- weights are normalized by ∑_j p(M_j|Y); constants drop out.
- lose one observation and one regressor from estimating the intercept α = ȳ.
- the weights penalize adding regressors k_j and a large sum of squared errors SSE_j ≡ (y − Xβ)′(y − Xβ) in model M_j.
Frequentist Approach
Consider a "sample-dominated" proper prior, assuming the g-prior (17) is dominated by sample information as N → ∞ (see Leamer 1978):

$$p(M_j\mid Y) \propto p(M_j) \cdot N^{-k_j/2} \cdot SSE_j^{-N/2} \qquad (20)$$

- weights proportional to the (exponentiated) Schwarz (1978) model selection criterion, or Bayesian Information Criterion (BIC)
- model weights are (by definition) consistent
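Putting the pieces together, a self-contained Python sketch of the BIC-style weights (20) over all 2^k subsets of regressors, under a uniform model prior; this is an illustrative implementation, not code from the cited papers:

```python
import numpy as np
from itertools import product

def bic_model_weights(y, X):
    """Model weights (20) with a uniform prior over all 2**k subsets.

    log p(M_j|Y) = const - (kj/2) ln N - (N/2) ln SSE_j; computed in
    logs and normalized at the end. A constant is always included.
    """
    N, k = X.shape
    gammas = list(product((0, 1), repeat=k))
    log_w = np.empty(len(gammas))
    for j, gamma in enumerate(gammas):
        Xj = np.column_stack([np.ones(N), X[:, np.array(gamma, dtype=bool)]])
        beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
        sse = float(np.sum((y - Xj @ beta) ** 2))
        kj = sum(gamma)
        log_w[j] = -0.5 * kj * np.log(N) - 0.5 * N * np.log(sse)
    w = np.exp(log_w - log_w.max())
    return gammas, w / w.sum()

# toy data: only the first of four candidate regressors matters
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 1.0 + 0.8 * X[:, 0] + rng.normal(size=100)
gammas, w = bic_model_weights(y, X)
print(gammas[int(np.argmax(w))])   # most weight on models that include x1
```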
Comparison of Approaches
Compare the posterior model weights from FLS (19) and BIC (20):

- the FLS (2001a) degrees-of-freedom penalty is stricter for k² > N, since (1 + g_0)/g_0 = 1 + k² > N
- intuition: the prior variance V_{0j} is more diffuse if k² > N
- FLS (2001a) gives the same weights as BIC if N > k²

Feldkircher and Zeugner (2011) warn of a "supermodel" effect (a few models dominate) implied by an overly diffuse prior if k² ≫ N.
Prior Over Model Space
Regressors X = (x_1, ..., x_k) are in general not independent; even if the variables are orthogonal, inference is affected. Options:

1. Ignore – independence prior
2. Dilution prior
3. Hierarchical prior
4. Integrate out hyperparameters
Independence Prior
Treat the regressors x_i as if independent, with prior inclusion probability π_i = p(β_i ≠ 0). This implies the following model prior:

$$p(M_j) = \prod_{i=1}^{k} \pi_i^{\gamma_i} (1 - \pi_i)^{1 - \gamma_i}$$

- uniform prior: π_i^U = 0.5 for all x_i
- BACE prior: π_i^{BACE} = k̄/k, with expected prior model size k̄ (see Sala-i-Martin, Doppelhofer and Miller, SDM 2004)
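A compact Python version of the independence prior, here evaluated with the BACE choice π_i = k̄/k (the numbers are illustrative):

```python
import numpy as np

def independence_prior(gamma, pi):
    """p(M_j) = prod_i pi_i^gamma_i (1 - pi_i)^(1 - gamma_i)."""
    gamma = np.asarray(gamma, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.prod(pi**gamma * (1.0 - pi)**(1.0 - gamma)))

k, kbar = 10, 3
pi_bace = np.full(k, kbar / k)             # BACE: expected model size kbar
gamma = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]     # a model with two regressors
print(independence_prior(gamma, pi_bace))  # 0.3**2 * 0.7**8, about 0.0052
```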
Prior Probabilities by Model Size
[Figure: prior model probabilities by model size]
Dilution Prior
George (1999) suggests diluting the prior weight of "similar" models containing correlated variables. For example, the modified independence prior (George 2001):

$$\Pr(M_j) = |R_j| \prod_{i=1}^{k} \pi_i^{\gamma_i} (1 - \pi_i)^{1 - \gamma_i}$$

- R_j = correlation matrix proportional to X_j'X_j
- |R_j| = 1 when the columns of X_j are orthogonal; |R_j| → 0 as the columns of X_j become more "redundant"
Hierarchical Priors
Partition the model space or the X_j's hierarchically into regions/trees. For example, Brock, Durlauf and West (BPEA 2003) propose the hierarchy:

- Theory uncertainty: assume theories are independent.
- Specification uncertainty: lag length in dynamics, nonlinearities, or numerous empirical proxies for a similar theory (cf. George's dilution priors).
- Heterogeneity uncertainty: parameter heterogeneity.
Integrating out Hyperparameters
The prior model size k̄ is unknown. Standard approaches:

1. Sensitivity analysis over prior model size: SDM (2004)
2. Treat k̄ as an unknown nuisance parameter and integrate it out. See Brown, Vannucci and Fearn (1998, 2002), Stone and Weeks (2001), Ley and Steel (2009).
Interpretation of Model Space
The specification of the model space raises important methodological issues, in particular the assumption about the "true model". Bernardo and Smith (1994) distinguish two polar cases:

- M-closed view: the true model is unknown, but included in the model space.
- M-open view: no model under consideration is true.

Another important aspect is the local vs. global approach to model uncertainty (see BDW 2003). Alternative model weights can be given an information-theoretic foundation.
Alternative Model Averaging Approaches
- Akaike Information Criterion (AIC): minimizes distance from the true distribution in M-open environments.
  $$AIC_j = N \ln SSE_j + 2 k_j$$
- Bayesian Information Criterion (BIC): consistent in M-closed environments.
  $$BIC_j = N \ln SSE_j + k_j \ln N$$
- Mallows' Criterion (MC): asymptotically minimizes classical squared error.
  $$MC_j = SSE_j + 2 \hat\sigma^2 k_j$$

Relative performance depends on sample size and the stability of the estimated model (see Hansen 2007).
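The three criteria side by side in Python; the function name is illustrative. AIC and BIC sit on a deviance-like scale, so exponentiating −criterion/2 and normalizing gives approximate model weights (Mallows-based averaging chooses its weights differently, see Hansen 2007):

```python
import numpy as np

def criteria(sse_j, k_j, N, sigma2_hat):
    """AIC, BIC and Mallows' criterion for model M_j, as on the slide."""
    aic = N * np.log(sse_j) + 2 * k_j
    bic = N * np.log(sse_j) + k_j * np.log(N)
    mallows = sse_j + 2 * sigma2_hat * k_j
    return aic, bic, mallows

print(criteria(sse_j=42.0, k_j=3, N=100, sigma2_hat=0.5))
```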
Numerical Methods
The computational burden of calculating posterior quantities of interest is an important challenge for the practical implementation of model averaging:

- the model space can be large – need to approximate the posterior distribution
- analytic (closed-form) expressions are often not available
- good news: computing time keeps falling, and numerical methods continue to advance

Here: only a brief overview of Markov Chain Monte Carlo (MCMC) techniques (see Chib (2001) or Geweke (2005) for introductions).
MCMC Simulations
Simulate a stochastic process of draws θ^(s) such that its stationary distribution is the target distribution p(θ|Y):

1. Conjugate problems: draw directly from the known posterior distribution.
2. Non-conjugate, but analytic conditional distributions for the parameters: Gibbs sampler.
3. Non-conjugate, unknown distribution: Metropolis-Hastings algorithm; draw from an approximating (known) distribution.
Example (Monte Carlo Integration)
Suppose we want to calculate

$$E[g(\theta)\mid Y] \propto \int g(\theta)\, p(\theta\mid Y)\, d\theta \qquad (21)$$

Under mild regularity conditions, one can show that the sample counterpart $\bar{g}_S = \frac{1}{S} \sum_{s=1}^{S} g(\theta^{(s)})$ converges almost surely to E[g(θ)|Y], and a central limit theorem applies.

Bottom line: the object of interest can be calculated with arbitrary precision, and convergence can be checked using the numerical standard error of the resulting Markov chain.
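A toy Monte Carlo integration in Python, with i.i.d. draws standing in for MCMC output; the numerical standard error formula below assumes independent draws (actual MCMC output would need an autocorrelation-adjusted version):

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. draws standing in for MCMC output theta^(s) from p(theta|Y);
# toy posterior: theta | Y ~ N(1, 0.5^2)
draws = rng.normal(loc=1.0, scale=0.5, size=100_000)

g = draws**2                              # g(theta) = theta^2
g_bar = g.mean()                          # sample counterpart of (21)
nse = g.std(ddof=1) / np.sqrt(g.size)     # numerical standard error (i.i.d.)

print(g_bar, nse)   # true value E[theta^2] = 1.0**2 + 0.5**2 = 1.25
```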
Conclusion
Model averaging provides a consistent and general treatment of model uncertainty.

+ integration of decision theory
+ flexible policy analysis
− computational burden
− specification of priors over the parameter distribution and the model space
Appendix
Doppelhofer G. 2008. Model Averaging. In: The New Palgrave Dictionary of Economics, 2nd edition.

Fernandez C, Ley E, Steel MFJ. 2001a. Benchmark Priors for Bayesian Model Averaging. Journal of Econometrics 100(2): 381-427.

Hansen BE. 2007. Least Squares Model Averaging. Econometrica 75(4): 1175-89.

Hoeting JA, Madigan D, Raftery AE, Volinsky CT. 1999. Bayesian Model Averaging: A Tutorial. Statistical Science 14(4): 382-417.

Sala-i-Martin X, Doppelhofer G, Miller RI. 2004. Determinants of Long-Term Growth: A Bayesian Averaging of Classical Estimates (BACE) Approach. American Economic Review 94(4): 813-35.
Bayesian Conjugate Priors
The Bayesian conjugate (Normal-inverse-Gamma) prior for p(β_j, σ²|M_j) and p(σ²|M_j) leads to the posterior odds (see Koop 2003, section 3.6):

$$\frac{p(M_j\mid y)}{p(M_l\mid y)} = \frac{p(M_j)}{p(M_l)} \times \left( \frac{|V_{0j}^{-1}| \,/\, |V_{0j}^{-1} + X_j'X_j|}{|V_{0l}^{-1}| \,/\, |V_{0l}^{-1} + X_l'X_l|} \right)^{1/2} \left( \frac{SSE_j + Q_j}{SSE_l + Q_l} \right)^{-N/2} \qquad (22)$$

- p(M_i) = prior model probability, i = j, l
- SSE_i = sum of squared errors
- Q_i = quadratic form in OLS estimates and prior parameters

Note: the posterior odds depend on the prior odds and relative model fit, including coherence between prior and data, and parsimony.
Diffuse Priors
Assuming g-priors $V_{0i} = (g_0 X_i'X_i)^{-1}$ and taking the limit as g_0 → 0 implies the posterior odds (see Leamer 1978, ch. 4):

$$\frac{p(M_j\mid y)}{p(M_l\mid y)} = \frac{p(M_j)}{p(M_l)} \times \left( \frac{SSE_j}{SSE_l} \right)^{-N/2} \qquad (23)$$

- the second factor equals the likelihood ratio of the two models
- problematic, since larger models are always preferred