Model Averaging: Benchmark Case
Gernot Doppelhofer, Norwegian School of Economics (NHH)
CES, LMU München, 15-17 March 2011

Outline
1 Motivation
2 Statistical Framework: Decision Theory; Unconditional Distribution; Bayesian Hypothesis Testing
3 Linear Regression Model: Normal Linear Model; Likelihood Function; Prior Distributions; Posterior Analysis; Model Space
4 Numerical Methods

Model Uncertainty
Model uncertainty is a problem of empirical (non-experimental) research:
- theory uncertainty: "open-endedness" of theories
- specification uncertainty: functional form, heterogeneity
- data issues: outliers, measurement error
Ignoring model uncertainty leads to biased estimation and misleading inference and prediction. Goal: robust inference, prediction and policy evaluation.

Model Averaging: Principle
Model averaging fully addresses model uncertainty by "integrating out" the uncertainty from the distribution of the parameters of interest. Model averaging is conceptually straightforward:
- combine sample (likelihood) information with relative (posterior) model weights
- Bayesian, Empirical Bayes, and Frequentist approaches
- implementation issues: choice of prior, numerical simulation

Decision Theory
Suppose a policymaker chooses an action (policy) $a$ to maximize expected utility (or minimize loss) after observing the data $Y$:

$$\max_a \; E[u(a, \theta | Y)] = \int u(a, \theta | Y)\, p(\theta | Y)\, d\theta \qquad (1)$$

The parameter (vector) of interest $\theta$ represents the source of uncertainty for the policymaker:
- effect of an economic variable: $\theta > 0$ or $\theta \leq 0$
- prediction of future observations $y^f$
The preferred action $a^*$ depends on the objective function (1) and the probability distribution $p(\theta|Y)$.

Contrast with Conventional Policy Analysis
1 The entire distribution of the parameter $\theta$ is relevant. Uncertainty cannot necessarily be reduced to an expected value (mean) and associated variance.
2 Policy evaluation is not necessarily the same as hypothesis testing.
3 Distinguish parameters $\theta$ from estimates of parameters $\hat{\theta}$:
- the standard Bayesian objection to the Frequentist approach
- in many cases, Bayesian and maximum likelihood estimates converge, so the issue is of second-order importance in large samples

Unconditional Distribution
The unconditional or posterior distribution $p(\theta|Y)$ follows by Bayes' rule:

$$p(\theta|Y) = \frac{L(Y|\theta)\, p(\theta)}{p(Y)} \propto L(Y|\theta)\, p(\theta) \qquad (2)$$

- $L(Y|\theta)$ = likelihood function, summarizing all information about $\theta$ contained in the data $Y$
- $p(\theta)$ = prior distribution
- $p(Y)$ = normalizing factor or data distribution (marginal likelihood)
Note: classical econometricians view the parameter $\theta$ as fixed (non-random), but the estimator $\hat{\theta}$ as a random variable.
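To make Bayes' rule (2) concrete, here is a minimal sketch (not part of the original slides) that evaluates a posterior on a grid for the mean of normal data with known variance. The simulated data, grid bounds, and N(0, 1) prior are illustrative assumptions.

```python
import numpy as np

# Simulated data: 50 draws from N(0.5, 1); values are illustrative only.
rng = np.random.default_rng(0)
y = rng.normal(loc=0.5, scale=1.0, size=50)

# Grid over theta and a N(0, 1) prior (an arbitrary choice for the sketch).
theta = np.linspace(-2.0, 2.0, 1001)
log_prior = -0.5 * theta**2

# Log-likelihood of the sample at each grid point (sigma^2 = 1 known).
log_lik = np.array([-0.5 * np.sum((y - t) ** 2) for t in theta])

# Posterior by Bayes' rule (2): proportional to likelihood times prior;
# normalize numerically so the grid posterior integrates to one.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
dtheta = theta[1] - theta[0]
post /= post.sum() * dtheta

print("posterior mean:", (theta * post).sum() * dtheta)
```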
Model Weights
Suppose there are $K$ candidate models $M_1, \ldots, M_K$ to explain the data $Y$. Model $M_j$ is described by a probability distribution $p(Y|\theta_j, M_j)$ with model-specific parameter vector $\theta_j$. The unconditional posterior distribution is

$$p(\theta|Y) = \sum_{j=1}^{K} w_j \cdot p(\theta_j | M_j, Y) \qquad (3)$$

- $w_j$ is proportional to the model's fit in explaining the observed data
- thick modelling: equal weights $w_j = 1/K$

Bayesian Model Weights
In the Bayesian context, the weights are posterior model probabilities (by Bayes' rule):

$$w_j = p(M_j|Y) = \frac{L(Y|M_j)\, p(M_j)}{p(Y)} \propto L(Y|M_j)\, p(M_j) \qquad (4)$$

The posterior weights are proportional to the model-specific marginal likelihood and the prior model probability. The marginal (model-specific) likelihood is integrated with respect to $\theta_j$:

$$L(Y|M_j) = \int_{\theta} L(Y|\theta_j, M_j)\, p(\theta_j|M_j)\, d\theta_j \qquad (5)$$

Bayesian Interpretation: Hierarchical Mixture Model
The Bayesian approach to the multi-model setup assigns a prior probability distribution $p(\theta_j|M_j)$ to the parameters of each model, and a prior probability $p(M_j)$ to each model. This prior setup induces a joint distribution over the data, parameters and models:

$$p(Y, \theta_j, M_j) = L(Y|\theta_j, M_j)\, p(\theta_j|M_j)\, p(M_j) \qquad (6)$$

The priors embed the separate models within a large hierarchical mixture model. Through conditioning and marginalization, we obtain the posterior quantities of interest.

Unconditional Mean and Variance

$$E(\theta|Y) = \sum_{j=1}^{K} p(M_j|Y) \cdot E(\theta_j|M_j, Y) \qquad (7)$$

$$V(\theta|Y) = E(\theta^2|Y) - [E(\theta|Y)]^2 = \sum_{j=1}^{K} p(M_j|Y) \cdot \left\{ V(\theta_j|M_j, Y) + [E(\theta_j|M_j, Y)]^2 \right\} - [E(\theta|Y)]^2 = \sum_{j=1}^{K} p(M_j|Y) \cdot \left\{ V(\theta_j|M_j, Y) + \left[ E(\theta_j|M_j, Y) - E(\theta|Y) \right]^2 \right\} \qquad (8)$$

Note: the unconditional variance (8) exceeds the weighted sum of the conditional variances, reflecting uncertainty about the mean (Draper 1995).

Predictive Distribution
The predictive distribution for $Y^f$ is obtained from the model-weighted conditional predictive distributions:

$$p(Y^f|Y) = \sum_{j=1}^{K} p(Y^f|M_j, Y) \cdot p(M_j|Y) \qquad (9)$$

where $p(Y^f|M_j, Y) = \int p(Y^f|\theta_j, M_j)\, p(\theta_j|M_j, Y)\, d\theta_j$. By averaging over the unknown models, the predictive distribution incorporates the model uncertainty embedded in the priors. A natural point prediction of $Y^f$ is the mean of $p(Y^f|Y)$:

$$E(Y^f|Y) = \sum_{j=1}^{K} E(Y^f|M_j, Y) \cdot p(M_j|Y) \qquad (10)$$

Bayesian Hypothesis Testing
Compare two models $M_i$ and $M_j$ by their posterior odds:

$$\frac{p(M_i|Y)}{p(M_j|Y)} = \frac{L(Y|M_i)}{L(Y|M_j)} \times \frac{p(M_i)}{p(M_j)} \qquad (11)$$

The posterior odds combine the sample information, summarized in the so-called Bayes factor $L(Y|M_i)/L(Y|M_j)$, with the prior odds $p(M_i)/p(M_j)$. Similarly, the weight of model $M_i$ relative to $K$ models is given by (4), where the normalizing factor is $\sum_{j=1}^{K} L(Y|M_j)\, p(M_j)$.
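The model-averaged moments (7) and (8) are easy to compute once the per-model posterior summaries are available. A minimal sketch, with hypothetical posterior model probabilities and conditional moments chosen purely for illustration:

```python
import numpy as np

# Hypothetical per-model posterior summaries for a parameter theta:
# posterior model probabilities p(M_j|Y), conditional means and variances.
w = np.array([0.50, 0.30, 0.20])   # must sum to one
m = np.array([0.80, 0.20, 0.00])   # E(theta | M_j, Y)
v = np.array([0.04, 0.09, 0.01])   # V(theta | M_j, Y)

# Unconditional (model-averaged) mean, equation (7).
mean = np.sum(w * m)

# Unconditional variance, equation (8): within-model variance plus the
# between-model spread of the conditional means around the overall mean.
var = np.sum(w * (v + (m - mean) ** 2))

print(f"E(theta|Y) = {mean:.3f}, V(theta|Y) = {var:.3f}")
print("weighted sum of conditional variances:", np.sum(w * v))  # smaller
```

The gap between the two printed variances is the between-model term in (8): the extra uncertainty contributed by disagreement across models about the mean.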
Model Averaging vs. Model Selection
Starting with priors $p(\theta_j|M_j)$ and $p(M_j)$, the posterior model probabilities provide a complete representation of post-data model uncertainty, which can be used for a variety of inferences and decisions. The optimal strategy depends on the underlying loss function:
- Selecting the highest posterior probability model corresponds to a 0-1 loss for correct selection.
- The model-averaged point prediction (10) minimizes a quadratic loss function.
- The predictive distribution (9) minimizes Kullback-Leibler loss with respect to the actual predictive distribution $p(Y^f|\theta_j, M_j)$.

Implementation Issues
The Bayesian approach to model averaging offers (i) generality, (ii) explicit treatment of model uncertainty, and (iii) ready integration with decision-making.
Remark (Implementation of Bayesian Model Averaging):
1 prior distribution $p(\theta)$: assumptions about hyperparameters and functional form
2 prior model probabilities $p(M_j)$
3 model space: the number of models $K$ can be very large, requiring numerical simulation of the posterior distribution (2)

Linear Regression Model
Example: consider the linear regression

$$y_i = \alpha + x_{i1}\beta_1 + \ldots + x_{ik}\beta_k + \varepsilon_i \qquad (12)$$

- $y_i$ = observations of the dependent variable, $i = 1, \ldots, N$
- $x_{i1}, \ldots, x_{ik}$ = observations of the $k$ explanatory variables, $i = 1, \ldots, N$
- $\alpha, \beta_1, \ldots, \beta_k$ = coefficients associated with the constant term and the regressors
- $\varepsilon_i$ = residuals, $i = 1, \ldots, N$

Multivariate Regression Model
In matrix form, the multivariate linear regression model is

$$y = X\beta + \varepsilon \qquad (13)$$

with $N \times 1$ vectors of the dependent variable $y = (y_1, \ldots, y_N)'$ and residuals $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_N)'$, the $N \times (k+1)$ matrix of regressors

$$X = (1, x_1, \ldots, x_k) = \begin{pmatrix} 1 & x_{11} & \ldots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \ldots & x_{Nk} \end{pmatrix},$$

and slope coefficients $\beta = (\alpha, \beta_1, \ldots, \beta_k)'$. Note: each regression model also includes a constant term and the corresponding intercept $\alpha$.

Benchmark: Normal Linear Model
Benchmark assumptions about residuals and regressors:
1 Residuals are independently and identically normally distributed, $\varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2)$: residuals are conditionally exchangeable and homoskedastic, with diagonal variance matrix $\sigma^2 I$.
2 Alternative assumptions about the regressors (see Poirier 1995):
- Case I: regressors are fixed, i.e. not random variables.
- Case II: regressors are predetermined (weakly exogenous), with a prior distribution independent of the parameters $[\alpha, \beta', \sigma^2]$.

Uncertainty about Regressors
For a given variable of interest $y$, the analyst is uncertain which explanatory variables (regressors) $X_j$, out of a total number $k$, to include. Examples: the linear regression model; nonparametric regression with unknown regression function $E(y|X)$; forecasting in a data-rich environment.
Model $M_j$ is described by a $(k \times 1)$ binary vector $\gamma = (\gamma_1, \ldots, \gamma_k)'$, where $\gamma_j = 1$ ($\gamma_j = 0$) means that $x_j$ is included in (excluded from) regression (12). The model space is large: all combinations of the $k$ regressors, $K = 2^k$.

Likelihood Function
For the normal linear regression model, the likelihood function can be written as (see Koop 2003, section 2.2):

$$L(y|\beta, \sigma^2) \propto \exp\left\{ -\frac{1}{2\sigma^2} (\beta_j - \hat{\beta}_j)' X_j'X_j (\beta_j - \hat{\beta}_j) \right\} \times \sigma^{-(v_j+1)} \exp\left\{ -\frac{v_j s_j^2}{2\sigma^2} \right\} \qquad (14)$$

where the OLS estimates are $\hat{\beta}_j = (X_j'X_j)^{-1}X_j'y$ for the slope and $s_j^2 = (y - X_j\hat{\beta}_j)'(y - X_j\hat{\beta}_j)/v_j$ for the variance, with degrees of freedom $v_j = N - k_j - 1$. The likelihood (14) is the product of a normal distribution for the slope $\beta_j$ and an inverse-Gamma distribution for the variance $\sigma^2$.
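The quantities entering the likelihood (14) are just OLS outputs for the included regressors. A small sketch, assuming simulated data and a hypothetical inclusion vector gamma (both are illustrative, not from the slides):

```python
import numpy as np

# Illustrative data: N observations on k candidate regressors (assumption).
rng = np.random.default_rng(1)
N, k = 100, 5
x = rng.normal(size=(N, k))
y = 1.0 + 0.7 * x[:, 0] + rng.normal(size=N)   # only x1 matters here

def ols_summaries(gamma):
    """OLS quantities entering the likelihood (14) for the model that
    includes regressor j iff gamma[j] == 1; a constant is always included."""
    Xj = np.column_stack([np.ones(N), x[:, np.array(gamma, bool)]])
    kj = Xj.shape[1] - 1                        # regressors excl. constant
    beta_hat, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    resid = y - Xj @ beta_hat
    vj = N - kj - 1                             # degrees of freedom
    sj2 = resid @ resid / vj                    # OLS variance estimate
    return beta_hat, sj2, vj

beta_hat, sj2, vj = ols_summaries([1, 0, 0, 1, 0])
print(beta_hat, sj2, vj)
```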
Prior Distributions
The prior distribution for the parameters can take any form. Classes of prior distributions are chosen for (i) analytic tractability, (ii) ease of interpretation, and (iii) computational reasons (see also Lecture 1):
- Bayesian conjugate priors: lead to a posterior distribution of the same class when combined with the likelihood.
- Noninformative priors: introduce no informative prior information. Idea: "let the data speak".
- Empirical Bayes (EB) priors: use sample information to specify the prior parameters and limit the prior information.

Bayesian Conjugate Priors
For the normal regression model (12), the natural conjugate prior is a normal distribution for $\beta$ and an inverse-Gamma distribution for $\sigma^2$ (the so-called Normal-Gamma family):

$$p(\beta_j | \sigma^2, M_j) \sim N(\beta_{0j}, \sigma^2 V_{0j}), \qquad p(\sigma^2|M_j) = p(\sigma^2) \sim IG(s_0^2, v_0) \qquad (15)$$

- interpretation: a "fictitious sample" with the same properties as the data
- drawback of the Bayesian approach: the marginal likelihood and the posterior model weights depend on the unknown hyperparameters $(\beta_0, V_0, s_0, v_0)$

Noninformative Priors
Non-Bayesians criticize arbitrary priors: non-data information should be minimized, since it is "not scientific". Jeffreys (1946, 1961) proposes a "noninformative" prior proportional to the square root of the determinant of the information matrix. Jeffreys priors can be motivated using Shannon's information criterion as a distance between densities (see Zellner 1971). In multi-parameter and/or hierarchical setups, noninformative priors can give highly undesirable results (for discussion, see Poirier 1995). Poirier (1995) recommends using noninformative priors with great care and conducting Bayesian sensitivity analysis.

Empirical Bayes Priors
To represent diffuse prior information, let the prior be dominated by the sample information. The posterior distribution then essentially reflects the sample information embodied in the likelihood.
Remark (Koop's (2003) "rule of thumb"): use noninformative priors for the parameters common to all models ($\alpha$, $\sigma^2$), but proper priors for all other parameters (the $\beta_j$'s):

$$p(\alpha) \propto 1, \qquad p(\sigma^2) \propto \frac{1}{\sigma^2} \qquad (16)$$

Zellner's g-prior
Zellner (1986) proposes choosing the prior covariance (the inverse of the prior precision) in (15) equal to the so-called g-prior:

$$V_{0j} = \left( g_0\, X_j'X_j \right)^{-1} \qquad (17)$$

- $g_0$ is the factor of proportionality of prior to sample precision
- extremes: $g_0 = 0$ implies a noninformative prior; $g_0 = 1$ implies that prior and data receive equal weight
Fernandez, Ley and Steel (2001a) recommend the "benchmark" values

$$g_0 = \begin{cases} 1/k^2 & \text{if } N \leq k^2 \\ 1/N & \text{if } N > k^2 \end{cases} \qquad (18)$$

Bayesian Posterior Analysis
Remember: the posterior distribution (2) is proportional to the likelihood times the prior distribution. Using Bayesian conjugate prior distributions, standard textbook results apply (see, for example, Koop 2003, sections 3.5 and 3.6):
1 The posterior distribution $p(\beta, \sigma^2|y)$ is Normal-inverse-Gamma.
2 The slope coefficient $\beta$ has a marginal t-distribution, with posterior mean and variance incorporating both prior and sample information.
3 The posterior odds comparing two models $M_1$ and $M_2$ depend on (i) the prior odds ratio, (ii) model fit, (iii) coherence between the prior and the sample information, and (iv) parsimony.
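Under the g-prior (17), if one additionally assumes a zero prior mean $\beta_{0j} = 0$ (an assumption made here for the sketch, not stated in (15)), the conditional posterior mean of $\beta_j$ collapses to $\hat{\beta}_j / (1 + g_0)$: a pure shrinkage of the OLS estimate toward the prior mean. A sketch on simulated data:

```python
import numpy as np

# Sketch: conditional posterior mean of beta under the g-prior (17) with
# prior mean zero, so E(beta | sigma^2, y) = beta_hat / (1 + g0).
# Data are simulated for illustration; this is not FLS's code.
rng = np.random.default_rng(2)
N, kj = 100, 3
X = rng.normal(size=(N, kj))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

for g0 in (1.0, 1.0 / N, 0.0):         # equal weight, FLS benchmark, diffuse
    post_mean = beta_hat / (1.0 + g0)   # shrinkage toward the zero prior mean
    print(f"g0 = {g0:.3g}: posterior mean = {np.round(post_mean, 3)}")
```

With the FLS benchmark $g_0 = 1/N$ the shrinkage is negligible in moderate samples, consistent with the "sample-dominated" interpretation above.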
Empirical Bayes (EB) Approach
FLS (2001a): relative posterior model weights are proportional to the prior $p(M_j)$, the likelihood, and a degrees-of-freedom penalty $((1+g_0)/g_0)^{-k_j/2}$:

$$p(M_j|Y) \propto p(M_j) \cdot \left( \frac{1+g_0}{g_0} \right)^{-k_j/2} \cdot SSE_j^{-(N-1)/2} \qquad (19)$$

- weights are normalized by $\sum_j p(M_j|Y)$, so constants drop out
- one observation and one regressor are lost to estimating the intercept $\alpha = \bar{y}$
- the weights penalize adding regressors $k_j$ and a large sum of squared errors $SSE_j \equiv (y - X\beta)'(y - X\beta)$ in model $M_j$

Frequentist Approach
Consider a "sample-dominated" proper prior, assuming the g-prior (17) is dominated by the sample information as $N \to \infty$ (see Leamer 1978):

$$p(M_j|Y) \propto p(M_j) \cdot N^{-k_j/2} \cdot SSE_j^{-N/2} \qquad (20)$$

- weights are proportional to the (exponentiated) Schwarz (1978) model selection criterion or Bayesian Information Criterion (BIC)
- the model weights are (by definition) consistent

Comparison of Approaches
Compare the posterior model weights from FLS (19) and BIC (20):
- the FLS (2001a) degrees-of-freedom penalty is stricter for $k^2 > N$, since $(1+g_0)/g_0 = 1 + k^2 > N$
- intuition: the prior variance $V_{0j}$ is more diffuse if $k^2 > N$
- FLS (2001a) gives the same weights as BIC if $N > k^2$
- Feldkircher and Zeugner (2011) warn of a "supermodel" effect (a few models dominate) implied by an overly diffuse prior if $k^2 \gg N$

Prior Over Model Space
The regressors $X = (x_1, \ldots, x_k)$ are in general not independent; even if the variables are orthogonal, inference is affected. Options:
1 Ignore: independence prior
2 Dilution prior
3 Hierarchical prior
4 Integrate out hyperparameters

Independence Prior
Treat the regressors $x_i$ as if independent, with prior inclusion probability $\pi_i = p(\beta_i \neq 0)$. This implies the following model prior:

$$p(M_j) = \prod_{i=1}^{k} \pi_i^{\gamma_i} (1 - \pi_i)^{1-\gamma_i}$$

- uniform prior: $\pi_i^U = 0.5$ for all $x_i$
- BACE prior: $\pi_i^{BACE} = \bar{k}/k$, with expected prior model size $\bar{k}$ (see Sala-i-Martin, Doppelhofer and Miller, SDM 2004)

[Figure: prior probabilities by model size]

Dilution Prior
George (1999) suggests diluting the prior weight of "similar" models containing correlated variables. For example, a modified independence prior (George 2001):

$$\Pr(M_j) = |R_j| \prod_{i=1}^{k} \pi_i^{\gamma_i} (1 - \pi_i)^{1-\gamma_i}$$

- $R_j$ = correlation matrix proportional to $X_j'X_j$
- $|R_j| = 1$ when the columns of $X_j$ are orthogonal; $|R_j| \to 0$ as the columns of $X_j$ become more "redundant"

Hierarchical Priors
Partition the model space or the $X_j$'s hierarchically into regions/trees. For example, Brock, Durlauf and West (BPEA 2003) propose the hierarchy:
- theory uncertainty: assume theories are independent
- specification uncertainty: lag length in dynamics, nonlinearities, or numerous empirical proxies for a similar theory (cf. George's dilution priors)
- heterogeneity uncertainty: parameter heterogeneity

Integrating out Hyperparameters
The prior model size $\bar{k}$ is unknown. Standard approaches (a numerical sketch of the weighting machinery follows below):
1 Sensitivity analysis over prior model size: SDM (2004)
2 Treat $\bar{k}$ as an unknown nuisance parameter and integrate it out: see Brown, Vannucci and Fearn (1998, 2002), Stone and Weeks (2001), Ley and Steel (2009).
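Putting the pieces together: the sketch below enumerates all $2^k$ models, weights them with the BIC-style formula (20) under a BACE independence prior with expected model size $\bar{k}$, and reports posterior inclusion probabilities. The data are simulated and $k$ is kept small so full enumeration is feasible; this is an illustration of the formulas, not the BACE code itself.

```python
import itertools
import numpy as np

# Simulated data: k candidate regressors, only the first two matter.
rng = np.random.default_rng(3)
N, k, kbar = 100, 6, 2
x = rng.normal(size=(N, k))
y = 0.8 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(size=N)

pi = kbar / k                               # BACE prior inclusion probability
log_weights, gammas = [], []
for gamma in itertools.product([0, 1], repeat=k):
    cols = np.flatnonzero(gamma)
    Xj = np.column_stack([np.ones(N), x[:, cols]])
    resid = y - Xj @ np.linalg.lstsq(Xj, y, rcond=None)[0]
    sse = resid @ resid
    kj = len(cols)
    log_prior = kj * np.log(pi) + (k - kj) * np.log(1 - pi)
    # log of p(M_j) * N^{-kj/2} * SSE_j^{-N/2}, equation (20)
    log_weights.append(log_prior - 0.5 * kj * np.log(N) - 0.5 * N * np.log(sse))
    gammas.append(gamma)

# Normalize: constants drop out, as noted for (19) and (20).
w = np.exp(np.array(log_weights) - max(log_weights))
w /= w.sum()                                # posterior model probabilities
pip = (w[:, None] * np.array(gammas)).sum(axis=0)
print("posterior inclusion probabilities:", np.round(pip, 3))
```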
Interpretation of Model Space
The specification of the model space raises important methodological issues, in particular the assumption about the "true model". Bernardo and Smith (1994) distinguish two polar cases:
- M-closed view: the true model is unknown, but included in the model space.
- M-open view: no model under consideration is true.
Another important aspect is the local vs. global approach to model uncertainty (see BDW 2003). Alternative model weights can be given an information-theoretic foundation.

Alternative Model Averaging Approaches
- Akaike Information Criterion (AIC): minimizes the distance from the true distribution in M-open environments:
$$AIC_j = N \ln SSE_j + 2k_j$$
- Bayesian Information Criterion (BIC): consistent in an M-closed environment (equivalent to the weights in (20)):
$$BIC_j = N \ln SSE_j + \ln(N)\, k_j$$
- Mallows' Criterion (MC): asymptotically minimizes the classical squared error:
$$MC_j = SSE_j + 2\hat{\sigma}^2 k_j$$
Relative performance depends on the sample size and the stability of the estimated model (see Hansen 2007).

Numerical Methods
The computational burden of calculating posterior quantities of interest is an important challenge for the practical implementation of model averaging:
- the model space can be large, so the posterior distribution needs to be approximated
- analytic (closed-form) expressions are often not available
- good news: computing time is much lower than it used to be, and numerical methods continue to advance
Here, only a brief overview of Markov Chain Monte Carlo (MCMC) techniques is given (see Chib 2001 or Geweke 2005 for introductions).

MCMC Simulations
Simulate a stochastic process $g(\theta^{(s)})$ such that its stationary distribution $g(\theta)$ is the target distribution:
1 Conjugate problems: draw directly from the known posterior distribution
2 Non-conjugate, but analytic conditional distributions for the parameters: Gibbs sampler
3 Non-conjugate, unknown distribution: Metropolis-Hastings algorithm; draw from an approximating (known) distribution

Example (Monte Carlo Integration)
Suppose we want to calculate

$$E[g(\theta|Y)] \propto \int g(\theta)\, p(\theta|Y)\, d\theta \qquad (21)$$

Under mild regularity conditions, one can show that the sample counterpart $g_S = \frac{1}{S} \sum_{s=1}^{S} g(\theta^{(s)})$ converges almost surely to $E[g(\theta|Y)]$ and that a central limit theorem applies. Bottom line: the object of interest can be calculated with arbitrary precision, and convergence can be checked using the numerical standard error of the resulting Markov chain.
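A minimal illustration of (21), not from the slides: with i.i.d. draws from a known posterior (here simply a normal, standing in for MCMC output), the sample mean of $g(\theta^{(s)})$ approximates the integral and the numerical standard error is $sd/\sqrt{S}$. For genuine Markov chain output one would additionally correct the standard error for autocorrelation.

```python
import numpy as np

# Sketch of Monte Carlo integration (21): approximate E[g(theta)|Y] by the
# sample mean of g over posterior draws. The N(0.4, 0.2^2) "posterior" is a
# purely illustrative stand-in, so no MCMC machinery is needed here.
rng = np.random.default_rng(4)
S = 100_000
theta = rng.normal(loc=0.4, scale=0.2, size=S)    # stand-in posterior draws

g = (theta > 0).astype(float)                     # e.g. Pr(theta > 0 | Y)
g_S = g.mean()                                    # sample counterpart g_S
nse = g.std(ddof=1) / np.sqrt(S)                  # numerical standard error
print(f"estimate = {g_S:.4f}, numerical s.e. = {nse:.4f}")
```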
Conclusion
Model averaging provides a consistent and general treatment of model uncertainty.
+ integration of decision theory
+ flexible policy analysis
- computational burden
- specification of priors over the distribution and model space

References
Doppelhofer G. 2008. Model Averaging. The New Palgrave Dictionary of Economics, 2nd edition.
Fernandez C, Ley E, Steel MFJ. 2001a. Benchmark Priors for Bayesian Model Averaging. Journal of Econometrics 100(2): 381-427.
Hansen BE. 2007. Least Squares Model Averaging. Econometrica 75(4): 1175-89.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT. 1999. Bayesian Model Averaging: A Tutorial. Statistical Science 14(4): 382-417.
Sala-i-Martin X, Doppelhofer G, Miller RI. 2004. Determinants of Long-Term Growth: A Bayesian Averaging of Classical Estimates (BACE) Approach. American Economic Review 94(4): 813-35.

Appendix: Bayesian Conjugate Priors
The Bayesian conjugate (Normal-inverse-Gamma) prior for $p(\beta_j|\sigma^2, M_j)$ and $p(\sigma^2|M_j)$ leads to the posterior odds (see Koop 2003, section 3.6):

$$\frac{p(M_j|y)}{p(M_l|y)} = \frac{p(M_j)}{p(M_l)} \times \left( \frac{|V_{0j}^{-1}| / |V_{0j}^{-1} + X_j'X_j|}{|V_{0l}^{-1}| / |V_{0l}^{-1} + X_l'X_l|} \right)^{1/2} \left( \frac{SSE_j + Q_j}{SSE_l + Q_l} \right)^{-N/2} \qquad (22)$$

- $p(M_i)$ = prior model probability, $i = j, l$
- $SSE_i$ = sum of squared errors
- $Q_i$ = quadratic form in the OLS estimates and the prior parameters
Note: the posterior odds depend on the prior odds and the relative model fit, including coherence between prior and data, and parsimony.

Appendix: Diffuse Priors
Assuming g-priors $V_{0i} = (g_0 X_i'X_i)^{-1}$ and taking the limit as $g_0 \to 0$ implies the posterior odds (see Leamer 1978, ch. 4):

$$\frac{p(M_j|y)}{p(M_l|y)} = \frac{p(M_j)}{p(M_l)} \times \left( \frac{SSE_j}{SSE_l} \right)^{-N/2} \qquad (23)$$

- the second factor equals the likelihood ratio of the two models
- problematic, since larger models are always preferred
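The pathology in (23) can be seen numerically: for nested models, the OLS residual sum of squares can only fall when a regressor is added, so the factor $(SSE_j/SSE_l)^{-N/2}$ always favors the larger model, even when the added regressor is pure noise. A sketch under equal prior odds, with simulated data (an illustration, not from the slides):

```python
import numpy as np

# Sketch of why the diffuse-prior odds (23) are problematic: a nested model
# plus a pure-noise regressor has (weakly) smaller SSE, so the larger model
# always wins the limit posterior odds.
rng = np.random.default_rng(5)
N = 100
x1 = rng.normal(size=N)
y = 1.0 + 0.5 * x1 + rng.normal(size=N)
noise = rng.normal(size=N)                 # irrelevant regressor

def sse(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

small = sse(np.column_stack([np.ones(N), x1]))
large = sse(np.column_stack([np.ones(N), x1, noise]))
log_odds = -0.5 * N * (np.log(large) - np.log(small))  # log of (23), equal priors
print(f"SSE small = {small:.2f}, SSE large = {large:.2f}")
print(f"log posterior odds (large vs. small) = {log_odds:.3f} >= 0")
```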