Misspecification in Infinite-Dimensional Bayesian Statistics
Author(s): B. J. K. Kleijn and A. W. van der Vaart
Source: The Annals of Statistics, Vol. 34, No. 2 (Apr., 2006), pp. 837-877
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/25463439
DOI: 10.1214/009053606000000029
© Institute of Mathematical Statistics, 2006

MISSPECIFICATION IN INFINITE-DIMENSIONAL BAYESIAN STATISTICS

By B. J. K. Kleijn and A. W. van der Vaart
Vrije Universiteit Amsterdam

We consider the asymptotic behavior of posterior distributions if the model is misspecified. Given a prior distribution and a random sample from a distribution P0, which may not be in the support of the prior, we show that the posterior concentrates its mass near the points in the support of the prior that minimize the Kullback-Leibler divergence with respect to P0. An entropy condition and a prior-mass condition determine the rate of convergence. The method is applied to several examples, with special interest for infinite-dimensional models. These include Gaussian mixtures, nonparametric regression and parametric models.

Received March 2003; revised March 2005.
AMS 2000 subject classifications. 62G07, 62G08, 62G20, 62F05, 62F15.
Key words and phrases. Misspecification, infinite-dimensional model, posterior distribution, rate of convergence.

1. Introduction. Of all criteria for statistical estimation procedures, asymptotic consistency is among the least disputed. Consistency requires that the estimation procedure come arbitrarily close to the true, underlying distribution, if enough observations are used. It is of a frequentist nature, because it presumes a notion of an underlying, true distribution for the observations. If applied to posterior distributions, it is also considered a useful property by many Bayesians, as it could warn one away from prior distributions with undesirable, or unexpected, consequences. Priors which lead to undesirable posteriors have been documented, in particular, for non- or semiparametric models (e.g., [4, 5]), in which case it is also difficult to motivate a particular prior on purely intuitive, subjective grounds. In the present paper we consider the situation where the posterior distribution cannot possibly be asymptotically consistent, because the model, or the prior, is misspecified.
From a frequentist point of view, the relevance of studying misspecification is clear, because the assumption that the model contains the true, underlying distribution may lack realistic motivation in many practical situations. From an objective Bayesian point of view, the question is of interest because, in principle, the Bayesian paradigm allows an unrestricted choice of a prior and, hence, we must allow for the possibility that the fixed distribution of the observations does not belong to the support of the prior. In this paper we show that in such a case the posterior will concentrate near a point in the support of the prior that is closest to the true sampling distribution as measured through the Kullback-Leibler divergence, and we give a characterization for the rate of concentration near this point.

Throughout the paper we assume that X1, X2, ... are i.i.d. observations, each distributed according to a probability measure P0. Given a model 𝒫 and a prior Π supported on 𝒫, the posterior mass of a measurable subset B ⊂ 𝒫 is given by

(1.1)  Πn(B | X1, ..., Xn) = ∫_B ∏_{i=1}^n p(Xi) dΠ(P) / ∫ ∏_{i=1}^n p(Xi) dΠ(P).

Here it is assumed that the model is dominated by a σ-finite measure μ, and the density of a typical element P ∈ 𝒫 relative to the dominating measure is written p and assumed appropriately measurable. If we assume that the model is well specified, that is, P0 ∈ 𝒫, then posterior consistency means that the posterior distributions concentrate an arbitrarily large fraction of their total mass in arbitrarily small neighborhoods of P0, if the number of observations used to determine the posterior is large enough. To formalize this, we let d be a metric on 𝒫 and say that the Bayesian procedure for the specified prior is consistent if, for every ε > 0, Πn({P : d(P, P0) > ε} | X1, ..., Xn) → 0 in P0-probability. More specific information concerning the asymptotic behavior of an estimator is given by its rate of convergence. Let εn be a sequence that decreases to zero and suppose that, for any constants Mn → ∞,

(1.2)  Πn(P ∈ 𝒫 : d(P, P0) > Mn εn | X1, ..., Xn) → 0,

in P0-probability. The sequence εn corresponds to a decreasing sequence of neighborhoods of P0, the d-radius of which goes to zero with n, while still capturing most of the posterior mass. If (1.2) is satisfied, then we say that the rate of convergence is at least εn.
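For a dominated model with finitely many elements, the posterior (1.1) reduces to a weighted sum and can be computed directly, which makes the consistency and rate statements above easy to visualize. The following is a minimal numerical sketch along these lines; the grid model, the Laplace data distribution and all numbers are our illustrative choices, not part of the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Finite "model": normal location densities N(theta, 1) on a grid,
# with a uniform prior Pi on the grid, so (1.1) is a finite sum.
thetas = np.linspace(-3.0, 3.0, 121)
log_prior = np.full(len(thetas), -np.log(len(thetas)))

# Data from P0, which need NOT lie in the model (misspecified case):
# here P0 is a Laplace distribution centered at 0.7.
n = 500
X = rng.laplace(loc=0.7, scale=1.0, size=n)

# Posterior (1.1): Pi_n({theta} | X) is proportional to
# Pi({theta}) * prod_i p_theta(X_i).
loglik = stats.norm.logpdf(X[:, None], loc=thetas[None, :]).sum(axis=0)
log_post = log_prior + loglik
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Posterior mass outside shrinking balls around theta* = 0.7, the
# Kullback-Leibler projection of this P0 onto the normal location model.
for eps in (0.5, 0.2, 0.1):
    print(eps, post[np.abs(thetas - 0.7) > eps].sum())
```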
If P0 is at a positive distance from the model 𝒫 and the prior concentrates all its mass on 𝒫, then the posterior is inconsistent, as it will concentrate all its mass on 𝒫 as well. However, in this paper we show that the posterior will still settle down near a given measure P* ∈ 𝒫, and we shall characterize the sequences εn such that the preceding display is valid with d(P, P*) taking the place of d(P, P0). One would expect the posterior to concentrate its mass near minimum Kullback-Leibler points, since asymptotically the likelihood ∏_{i=1}^n p(Xi) is maximal near points of minimal Kullback-Leibler divergence. The integrand in the numerator of (1.1) is the likelihood, so subsets of the model in which the (log-)likelihood is small account for a small fraction of the total posterior mass. Hence, it is no great surprise that the appropriate point of convergence P* is a minimum Kullback-Leibler point in 𝒫, but the general issue of rates (and which metric d to use) turns out to be more complicated than expected. We follow the work by Ghosal, Ghosh and van der Vaart [8] for the well-specified situation, but need to adapt, change or extend many steps.

After deriving general results, we consider in some detail several examples, including Gaussian mixtures using Dirichlet priors on the mixing distribution, the regression problem and fitting parametric models. Our results on the regression problem allow one, for instance, to conclude that a Bayesian approach to the nonparametric regression problem that uses a prior on the regression function, but employs a normal distribution for the errors, will lead to consistent estimation of the regression function, even if the regression errors are non-Gaussian. This result, which is the Bayesian counterpart of the well-known fact that least squares estimators (the maximum likelihood estimators if the errors are Gaussian) perform well even if the errors are non-Gaussian, is important to validate the Bayesian approach to regression, but appears to have received little attention in the literature.

1.1. Notation and organization. Let L1(𝒳, 𝒜) denote the set of all finite signed measures on (𝒳, 𝒜) and let conv(𝒬) be the convex hull of a set of measures 𝒬: the set of all finite linear combinations Σi λi Qi with Qi ∈ 𝒬 and λi ≥ 0 with Σi λi = 1. For a measurable function f, let Qf denote the integral ∫ f dQ. The paper is organized as follows. Section 2 contains the main results of the paper, in increasing generality. Sections 3, 4 and 5 concern the three classes of examples that we consider: mixtures, the regression model and parametric models. Sections 6 and 7 contain the proofs of the main results, where the necessary results on tests are developed in Section 6 and are of independent interest. The final section is a technical appendix.

2. Main results. Let X1, X2, ... be an i.i.d. sample from a distribution P0 on a measurable space (𝒳, 𝒜). Given a collection 𝒫 of probability distributions on (𝒳, 𝒜) and a prior probability measure Π on 𝒫, the posterior measure is defined as in (1.1) (where 0/0 = 0 by definition). Here it is assumed that the "model" 𝒫 is dominated by a σ-finite measure μ and that p is a density of P ∈ 𝒫 relative to μ such that the map (x, P) ↦ p(x) is measurable relative to the product of 𝒜 and an appropriate σ-field on 𝒫, so that the right-hand side of (1.1) is measurable as a function of (X1, ..., Xn) and a probability measure as a function of B, for every X1, ..., Xn such that the denominator is positive. The "true" distribution P0 may or may not belong to the model 𝒫. For simplicity of notation, we assume that P0 possesses a density p0 relative to μ as well.

Informally we think of the model 𝒫 as the "support" of the prior Π, but we shall not make this precise in a topological sense. At this point we only assume that the prior concentrates on 𝒫 in the sense that Π(𝒫) = 1 (but we note later that this too can be relaxed). Further requirements are made in the statements of the main results. Our main theorems implicitly assume the existence of a point P* ∈ 𝒫 minimizing the Kullback-Leibler divergence of P0 to the model 𝒫.
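When the divergence −P0 log(p/p0) is available in closed form or can be estimated by Monte Carlo, the minimizing point P* can be located numerically. A hedged sketch for a toy parametric model (the exponential data distribution and the normal location family are our choices for illustration, and the Monte Carlo average only approximates the P0-integral):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Monte Carlo stand-in for P0 (standard exponential, not in the model).
sample = rng.exponential(size=20_000)
thetas = np.linspace(-1.0, 4.0, 201)

# Since -P0 log(p_theta/p0) = -P0 log p_theta + P0 log p0, and the second
# term does not involve theta, minimizing the Monte Carlo estimate of
# -P0 log p_theta over the grid approximates the KL projection P*.
neg_kl_part = -stats.norm.logpdf(sample[:, None], loc=thetas[None, :]).mean(axis=0)
theta_star = thetas[np.argmin(neg_kl_part)]
print(theta_star)  # close to 1.0: here P* sits at the mean of P0
```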
In particular, the minimal Kullback-Leibler divergence is assumed to be finite; that is, P* satisfies

(2.1)  −P0 log (p*/p0) < ∞.

By the convention that log 0 = −∞, the above implies that P0 ≪ P* and, hence, we assume without loss of generality that the density p* is strictly positive at the observations.

Our theorems give sufficient conditions for the posterior distribution to concentrate in neighborhoods of P* at a rate that is determined by the amount of prior mass "close to" the minimal Kullback-Leibler point P* and the "entropy" of the model. To specify the terms between quotation marks, we make the following definitions.

We define the entropy and the neighborhoods in which the posterior is to concentrate its mass relative to a semi-metric d on 𝒫. The general results are formulated relative to an arbitrary semi-metric and next the conditions will be simplified for more specific choices. Whether or not these simplifications can be made depends on the model 𝒫, convexity being an important special case (see Lemma 2.2). Unlike in the case of well-specified priors, considered, for example, in [8], the Hellinger distance is not always appropriate in the misspecified situation. The general entropy bound is formulated in terms of a covering number for testing under misspecification, defined as follows: for ε > 0 we define Nt(ε, 𝒫, d; P0, P*) as the minimal number N of convex sets B1, ..., BN of probability measures on (𝒳, 𝒜) needed to cover the set {P ∈ 𝒫 : ε ≤ d(P, P*) ≤ 2ε} such that, for every i,

(2.2)  inf_{P ∈ Bi} sup_{0 < α < 1} −log P0 (p/p*)^α ≥ ε²/4.

If there is no finite covering of this type, we define the covering number to be infinite. We refer to the logarithms log Nt(ε, 𝒫, d; P0, P*) as entropy numbers for testing under misspecification. These numbers differ from ordinary metric entropy numbers in that the covering sets Bi are required to satisfy the preceding display rather than to be balls of radius ε. We insist that the sets Bi be convex and that (2.2) hold for every P ∈ Bi. This implies that (2.2) may involve measures P that do not belong to the model 𝒫 if this is not convex itself.

For ε > 0, we define a specific kind of Kullback-Leibler neighborhood of P* by

(2.3)  B(ε, P*; P0) = {P ∈ 𝒫 : −P0 log (p/p*) ≤ ε², P0 (log (p/p*))² ≤ ε²}.

THEOREM 2.1. For a given model 𝒫, prior Π on 𝒫 and some P* ∈ 𝒫, assume that −P0 log(p*/p0) < ∞ and P0(p/p*) < ∞ for all P ∈ 𝒫. Suppose that there exist a sequence of strictly positive numbers εn with εn → 0 and nεn² → ∞ and a constant L > 0, such that, for all n,

(2.4)  Π(B(εn, P*; P0)) ≥ e^{−L n εn²},
(2.5)  Nt(ε, 𝒫, d; P0, P*) ≤ e^{n εn²}  for all ε > εn.

Then for every sufficiently large constant M, as n → ∞,

(2.6)  Πn(P ∈ 𝒫 : d(P, P*) ≥ M εn | X1, ..., Xn) → 0  in L1(P0^n).
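The neighborhood (2.3) is defined through two P0-moments of log(p/p*), both of which are easy to estimate by Monte Carlo in concrete models, so membership in B(ε, P*; P0), and thereby the prior mass condition (2.4), can at least be explored numerically. A sketch, continuing the toy normal-location-model-with-Laplace-data setting used above (all numerical choices are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.laplace(loc=0.7, size=100_000)  # Monte Carlo draws from P0

def in_B(theta, theta_star, eps):
    """Check the two inequalities defining B(eps, P*; P0) in (2.3) for
    p = N(theta, 1), p* = N(theta*, 1), with P0-integrals replaced by
    Monte Carlo averages."""
    log_ratio = (stats.norm.logpdf(sample, loc=theta)
                 - stats.norm.logpdf(sample, loc=theta_star))
    return (-log_ratio.mean() <= eps**2) and ((log_ratio**2).mean() <= eps**2)

theta_star = 0.7  # the Kullback-Leibler projection of this P0 on the model
for theta in (0.7, 0.75, 1.0):
    print(theta, in_B(theta, theta_star, eps=0.2))
```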
The proof of this theorem is given in Section 7. The theorem does not explicitly require that P* be a point of minimal Kullback-Leibler divergence, but this is implied by the conditions (see Lemma 6.4 below). The theorem is extended to the case of nonunique minimal Kullback-Leibler points in Section 2.4.

The two main conditions of Theorem 2.1 are a prior mass condition (2.4) and an entropy condition (2.5), which can be compared to Schwartz' conditions for posterior consistency in the well-specified situation (see [13]), or the two main conditions in [8]. Below we discuss the background of these conditions in turn.

The prior mass condition (2.4) reduces to the corresponding condition in [8] in the correctly specified case P* = P0. Because −P0 log(p*/p0) < ∞, we may rewrite the first inequality in the definition (2.3) of the set B(ε, P*; P0) as

−P0 log (p/p0) ≤ −P0 log (p*/p0) + ε².

Therefore, the set B(ε, P*; P0) contains only P ∈ 𝒫 that are within ε² of the minimal Kullback-Leibler divergence with respect to P0 over the model. The lower bound (2.4) on the prior mass of B(ε, P*; P0) requires that the prior assign a certain share of its total mass to neighborhoods of the minimal Kullback-Leibler point P*. As argued in [8], a rough understanding of the exact form of (2.4) for the "optimal" rate εn is that an optimal prior spreads its mass "uniformly" over 𝒫. In the proof of Theorem 2.1, the prior mass condition serves to lower-bound the denominator involved in the expression for the posterior.

The background of the entropy condition (2.5) is more involved, but it can be compared to a corresponding condition in the well-specified situation given in Theorem 2.1 of [8]. The purpose of the entropy condition is to measure the complexity of the model, a larger entropy leading to a slower rate of convergence. The entropy used in [8] is either the ordinary metric entropy log N(ε, 𝒫, d), or the local entropy log N(ε/2, {P ∈ 𝒫 : ε ≤ d(P, P0) ≤ 2ε}, d). For d the Hellinger distance, the minimal εn satisfying log N(εn, 𝒫, d) ≤ nεn² is roughly the fastest rate of convergence for estimating a density in the model 𝒫 relative to d obtainable by any method of estimation (cf. [2]). We are not aware of a concept of "optimal rate of convergence" if the model is misspecified, but a rough interpretation of (2.5) given (2.4) would be that in the misspecified situation the posterior concentrates near the closest Kullback-Leibler point at the optimal rate pertaining to the model 𝒫.

Misspecification requires that the complexity of the model be measured in a different, somewhat complicated way. In examples, depending on the semi-metric d, the covering numbers Nt(ε, 𝒫, d; P0, P*) can be related to ordinary metric covering numbers N(ε, 𝒫, d). For instance, we show below (see Lemmas 2.1-2.3) that, if the model 𝒫 is convex, then the numbers Nt(ε, 𝒫, d; P0, P*) are bounded by the covering numbers N(ε, 𝒫, d) if the distance d(P1, P2) equals the Hellinger distance between the measures Qi defined by dQi = (p0/p*) dPi, that is, a weighted Hellinger distance between P1 and P2.

In the well-specified situation we have P* = P0, and the entropy numbers for testing can be bounded above by ordinary entropy numbers for the Hellinger distance. Thus, Theorem 2.1 becomes a refinement of the main theorem of [8]. To see this, we first note that, since −log x ≥ 1 − x for x > 0,

−log P0 √(p/p0) ≥ 1 − ∫ √(p p0) dμ = ½ h²(p, p0).

It follows that the left-hand side of (2.2) with α = 1/2 and p0 = p* is bounded below by inf_{P ∈ Bi} ½ h²(p, p0). Because Hellinger balls are convex, they are eligible candidates for the sets Bi required in the definition of the covering numbers for testing. If we cover the set {P ∈ 𝒫 : 2ε ≤ h(P, P0) ≤ 4ε} by a minimal set of balls of radius ε/2, then these balls automatically satisfy (2.2), by the triangle inequality. It follows that

Nt(ε, 𝒫, h; P0, P0) ≤ N(ε/2, {P ∈ 𝒫 : 2ε ≤ h(P, P0) ≤ 4ε}, h),

a local covering number of the type used by [8].
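The criterion sup_α −log P0(p/p*)^α appearing in (2.2) is a one-dimensional optimization over the Hellinger transform and can be evaluated by Monte Carlo; in the well-specified normal location case the lower bound ½h² from the display above can then be checked numerically. A sketch (the grid over α and all parameters are our choices, and the α-grid only approximates the supremum):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = rng.normal(size=200_000)  # well-specified toy case: P0 = P* = N(0, 1)

def sup_neg_log_transform(theta, alphas=np.linspace(0.01, 0.99, 99)):
    """Monte Carlo estimate of sup_alpha -log P0 (p_theta/p*)^alpha."""
    log_ratio = stats.norm.logpdf(X, loc=theta) - stats.norm.logpdf(X, loc=0.0)
    return max(-np.log(np.mean(np.exp(a * log_ratio))) for a in alphas)

theta = 0.5
lhs = sup_neg_log_transform(theta)
half_h2 = 1 - np.exp(-theta**2 / 8)  # (1/2) h^2 for N(0,1) versus N(theta,1)
print(lhs, half_h2)                  # lhs should dominate h^2/2
```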
Because the covering numbers for testing allow general convex sets rather than balls of a given diameter, in general they appear genuinely smaller than local covering numbers. (Notably, convex sets need not satisfy a size restriction.) However, we do not know of any examples of gains in the settings we are interested in, so that in the well-specified case there appears to be no use for the extended covering numbers as defined by (2.2). In the general, misspecified situation the entropy numbers are essential, even for standard parametric models, such as the one-dimensional normal location model.

At a more technical level, the entropy condition of [8] ensures the existence of certain tests of the measures P versus the true measure P0. In the misspecified case it is necessary to compare the measures P to the minimal Kullback-Leibler point P* rather than to P0. It turns out that the appropriate comparison is not a test of the measures P versus the measure P* in the ordinary sense of testing, but a test of the measures P0 versus the measures Q(P) defined by dQ(P) = (p0/p*) dP [see (7.4)]. With Q(P) ranging over the set 𝒬 of measures Q(P) where P ranges over 𝒫, this leads to consideration of minimax testing risks of the type

inf_φ sup_{Q ∈ 𝒬} (P0^n φ + Q^n(1 − φ)),

where the infimum is taken over all measurable functions φ taking values in [0, 1]. A difference with the usual results on minimax testing risks is that the measures Q(P) may in general not be probability measures (and may in fact be infinite).

Extending arguments of Le Cam and Birgé, we show in Section 6 that, for a convex set B, the minimax testing risk in the preceding display is bounded above by

(2.7)  inf_{0<α<1} sup_{Q ∈ 𝒬} ρα(P0, Q)^n,

where α ↦ ρα(P0, Q) is the Hellinger transform ρα(P, Q) = ∫ p^α q^{1−α} dμ. For Q = Q(P), the Hellinger transform reduces to the map α ↦ ρ_{1−α}(Q(P), P0) = P0(p/p*)^α. If the inequality in (2.2) is satisfied, then P0(p/p*)^α ≤ e^{−ε²/4} and, hence, the set of measures Q(P) with P ranging over Bi can be tested with error probabilities bounded by e^{−nε²/4}. For ε bounded away from zero, or converging slowly to zero, these probabilities are exponentially small, ensuring that the posterior does not concentrate on the "unlikely alternatives" Bi.

The testing bound (2.7) is valid for convex alternatives 𝒬, but the alternatives of interest {P ∈ 𝒫 : d(P, P*) ≥ Mε} are complements of balls and, hence, typically not convex. A test function for nonconvex alternatives can be constructed using a covering of 𝒫 by convex sets. The entropy condition (2.5) controls the size of this cover and, hence, the rate of convergence in misspecified situations is determined by the covering numbers Nt(ε, 𝒫, d; P0, P*). Because the validity of the theorem relies only on the existence of suitable tests, the entropy condition (2.5) could be replaced by a testing condition. To be precise, condition (2.5) can be replaced by the condition that the conclusion of Theorem 6.3 is satisfied with D(ε) = e^{nεn²}.

2.1. Distances and entropies for testing. Because the entropies for testing are somewhat abstract, it is useful to relate them to ordinary entropy numbers. For our examples, the bound given by the following lemma is useful. We assume that, for some fixed constants c, C > 0 and for every m ∈ ℕ, λ1, ..., λm ≥ 0 with Σi λi = 1 and every P, P1, ..., Pm ∈ 𝒫 with d(P, Pi) ≤ c d(P, P*) for all i,

(2.8)  Σi λi d²(Pi, P*) − C Σi λi d²(Pi, P) ≤ sup_{0<α<1} −log P0 (Σi λi pi / p*)^α.

LEMMA 2.1. If (2.8) holds, then there exists a constant A > 0 depending on c and C only such that, for all ε > 0, Nt(ε, 𝒫, d; P0, P*) ≤ N(Aε, {P ∈ 𝒫 : ε ≤ d(P, P*) ≤ 2ε}, d). [Any constant A ≤ (1/8) ∧ (1/(4√C)) ∧ (½c) works.]

PROOF. For a given constant A > 0, we can cover the set 𝒫ε := {P ∈ 𝒫 : ε ≤ d(P, P*) ≤ 2ε} with N = N(Aε, 𝒫ε, d) balls of radius Aε. If the centers of these balls are not contained in 𝒫ε, then we can replace these N balls by N balls of radius 2Aε with centers in 𝒫ε whose union also covers the set 𝒫ε. It suffices to show that (2.2) is valid for this cover. Choose 2A ≤ c. If Bi is equal to the convex hull of a typical ball B in this cover, P ∈ 𝒫ε is the center of B and Pi ∈ B for every i, then d(Pi, P*) ≥ d(P, P*) − 2Aε by the triangle inequality and, hence, by assumption (2.8), the left-hand side of (2.2) with Bi = conv(B) is bounded below by Σi λi ((ε − 2Aε)² − C(2Aε)²). This is bounded below by ε²/4 for sufficiently small A. □

The logarithms log N(Aε, {P ∈ 𝒫 : ε ≤ d(P, P*) ≤ 2ε}, d) of the covering numbers in the preceding lemma are called "local entropy numbers" and also the "Le Cam dimension" of the model 𝒫 relative to the semi-metric d. They are bounded above by the simpler ordinary entropy numbers log N(Aε, 𝒫, d). The preceding lemma shows that the entropy condition (2.5) can be replaced by the ordinary entropy condition log N(εn, 𝒫, d) ≤ nεn² whenever the semi-metric d satisfies (2.8).

If we evaluate (2.8) with m = 1 and P1 = P, then we obtain, for every P ∈ 𝒫,

(2.9)  d²(P, P*) ≤ sup_{0<α<1} −log P0 (p/p*)^α.

(Up to a factor 16, this inequality is also implied by finiteness of the covering numbers for testing.) This simpler condition gives an indication about the metrics d that may be used in combination with ordinary entropy. In Lemma 2.2 we show that if d and the model 𝒫 are convex, then the simpler condition (2.9) is equivalent to (2.8).

Because minus the logarithm satisfies −log x ≥ 1 − x for every x > 0, we can further simplify by bounding the −log in the right-hand side by 1 − P0(p/p*)^α. This yields the bound

d²(P, P*) ≤ sup_{0<α<1} [1 − P0 (p/p*)^α].

In the well-specified situation we have P0 = P* and the right-hand side for α = 1/2 becomes 1 − ∫ √(p p0) dμ, which is ½ times the squared Hellinger distance between P and P0. In misspecified situations this method of lower bounding can be useless, as 1 − P0(p/p*)^α may be negative for α = 1/2. On the other hand, a small value of α may be appropriate, as it can be shown that as α ↓ 0 the expression 1 − P0(p/p*)^α is proportional to the difference of Kullback-Leibler divergences P0 log(p*/p), which is positive by the definition of P*. If the semi-metric d is bounded above by the Kullback-Leibler divergence, then this can be used in the main theorem. We discuss this further in Section 6 and use it in the examples of Sections 4 and 5.

The case of convex models 𝒫 is of interest, in particular, for non- or semiparametric models, and permits some simplification. For a convex model, the point of minimal Kullback-Leibler divergence (if it exists) is automatically unique (up to redefinition on a null-set of P0). Moreover, the expectations P0(p/p*) are automatically finite, as required in Theorem 2.1, and condition (2.8) is satisfied for a weighted Hellinger metric. We show this in Lemma 2.3, after first showing that the validity of the simpler lower bound (2.9) on the convex hull of 𝒫 (if the semi-metric d is defined on this convex hull) implies the bound (2.8).

LEMMA 2.2. If d is defined on the convex hull of 𝒫, the maps P ↦ d²(P, P′) are convex on conv(𝒫) for every P′ ∈ 𝒫 and (2.9) is valid for every P in the convex hull of 𝒫, then (2.8) is satisfied for ½d instead of d.

LEMMA 2.3. If 𝒫 is convex and P* ∈ 𝒫 is a point at minimal Kullback-Leibler divergence with respect to P0, then P0(p/p*) ≤ 1 for every P ∈ 𝒫 and (2.8) is satisfied with

d²(P1, P2) = ½ ∫ (√p1 − √p2)² (p0/p*) dμ.

PROOF OF LEMMA 2.2. For the proof of Lemma 2.2, we apply the triangle inequality repeatedly to find

Σi λi d²(Pi, P*) ≤ 2 Σi λi d²(Pi, P) + 2 d²(P, P*)
  ≤ 2 Σi λi d²(Pi, P) + 4 d²(P, Σi λi Pi) + 4 d²(Σi λi Pi, P*)
  ≤ 6 Σi λi d²(Pi, P) + 4 d²(Σi λi Pi, P*),

by the convexity of d². It follows that d²(Σi λi Pi, P*) ≥ (1/4) Σi λi d²(Pi, P*) − (3/2) Σi λi d²(Pi, P). If (2.9) holds for P = Σi λi Pi, then we obtain (2.8) with d² replaced by d²/4 and C = 6. □

PROOF OF LEMMA 2.3. For P ∈ 𝒫, define Pλ = λP + (1 − λ)P*, a family of convex combinations {Pλ : λ ∈ [0, 1]} ⊂ 𝒫. For all values of λ ∈ [0, 1],

(2.10)  0 ≤ f(λ) := −P0 log (pλ/p*) = −P0 log (1 + λ((p/p*) − 1)),

since P* is at minimal Kullback-Leibler divergence with respect to P0 in 𝒫 by assumption. For every fixed y > 0, the function λ ↦ log(1 + λy)/λ is nonnegative and increases monotonically to y as λ ↓ 0. For y ∈ [−1, 0] the function is bounded in absolute value by 2 for λ ≤ ½ and monotone. Therefore, by the monotone and dominated convergence theorems applied to the positive and negative parts of the integrand in the right-hand side of (2.10),

f′(0+) = 1 − P0(p/p*).

Combining the fact that f(0) = 0 with (2.10), we see that f′(0+) ≥ 0 and, hence, we find P0(p/p*) ≤ 1. The first assertion of Lemma 2.3 now follows.

For the proof that (2.9) is satisfied, we first note that −log x ≥ 1 − x, so that it suffices to show that 1 − P0 √(p/p*) ≥ d²(P, P*). Now

∫ (√(p/p*) − 1)² dP0 = 1 + P0 (p/p*) − 2 P0 √(p/p*) ≤ 2 − 2 P0 √(p/p*),

by the first part of the proof, so that d²(P, P*) ≤ 1 − P0 √(p/p*), as desired. □

2.2. Extensions. In this section we give some generalizations of Theorem 2.1. Theorem 2.2 enables us to prove that optimal rates are achieved in parametric models. Theorem 2.3 extends Theorem 2.1 to situations in which the model, the prior and the point P* are dependent on n. Third, we consider the case in which the priors Πn assign a mass slightly less than 1 to the models 𝒫n.

Theorem 2.1 does not give the optimal rate of convergence 1/√n for finite-dimensional models 𝒫, both because the choice εn = 1/√n is excluded (by the condition nεn² → ∞) and because the prior mass condition is too restrictive. The following theorem remedies this, but is more complicated. The adapted prior mass condition takes the following form: for all natural numbers n and j,

(2.11)  Π(P ∈ 𝒫 : jεn < d(P, P*) ≤ 2jεn) / Π(B(εn, P*; P0)) ≤ e^{nεn² j²/8}.

THEOREM 2.2. For a given model 𝒫, prior Π on 𝒫 and some P* ∈ 𝒫, assume that −P0 log(p*/p0) < ∞ and P0(p/p*) < ∞ for all P ∈ 𝒫. If εn are strictly positive numbers with εn → 0 and lim inf_{n→∞} nεn² > 0, such that (2.11) and (2.5) are satisfied, then, for every sequence Mn → ∞, as n → ∞,

(2.12)  Πn(P ∈ 𝒫 : d(P, P*) ≥ Mn εn | X1, ..., Xn) → 0  in L1(P0^n).

There appears to be no compelling reason to choose the model 𝒫 and the prior the same for every n. The validity of the preceding theorems does not depend on this, and the following theorem formalizes that fact. For each n, we let 𝒫n be a set of probability measures on (𝒳, 𝒜) given by densities pn relative to a σ-finite measure μn, and Πn a prior measure on this space. Given an appropriate σ-field, we define the posterior by (1.1) with P* and Π replaced by Pn* and Πn.

THEOREM 2.3. The preceding theorems remain valid if 𝒫, Π, P* and d depend on n, but satisfy the given conditions for each n (for a single constant L).

As a final extension, we note that the assertion P0^n Πn(P ∈ 𝒫n : dn(P, Pn*) ≥ Mnεn | X1, ..., Xn) → 0 of the preceding theorems remains valid even if the priors Πn do not put all their mass on the "models" 𝒫n (but the models 𝒫n do satisfy the entropy condition). Of course, in such cases the posterior puts mass outside the model and it is desirable to complement the above assertion with the assertion that Πn(𝒫n | X1, ..., Xn) → 1 in L1(P0^n). The latter is certainly true if the priors put only very small fractions of their mass outside the models 𝒫n. More precisely, the latter assertion is true if

(2.13)  (1 / Πn(B(εn, Pn*; P0))) P0^n ∫_{𝒫n^c} ∏_{i=1}^n (p/pn*)(Xi) dΠn(P) = o(e^{−2nεn²}).

This observation is not relevant for the examples in the present paper. However, in certain situations the entropy conditions in the preceding theorems seem unreasonably severe: they limit the complexity of the models, and it may be more reasonable to allow a trade-off between complexity and prior mass. Condition (2.13) allows a crude form of such a trade-off: a small part 𝒫n^c of the model may be more complex, provided that it receives a negligible amount of prior mass.

2.3. Consistency. The preceding theorems yield a rate of convergence εn → 0, expressed as a function of prior mass and model entropy. In certain situations prior mass and entropy may be hard to quantify. In contrast, for inferring consistency of the posterior, such quantification is unnecessary. This could be proved directly, as [13] achieved in the well-specified situation, but it can also be inferred from the preceding rate theorems. [A direct proof might actually give the same theorem with a slightly bigger set B(ε, P*; P0).] We consider this for the situation of Theorem 2.1 only.

COROLLARY 2.1. For a given model 𝒫, prior Π on 𝒫 and some P* ∈ 𝒫, assume that −P0 log(p*/p0) < ∞ and P0(p/p*) < ∞ for all P ∈ 𝒫. Suppose that, for every ε > 0,

(2.14)  Π(B(ε, P*; P0)) > 0,   sup_{η > ε} Nt(η, 𝒫, d; P0, P*) < ∞.

Then for every ε > 0, as n → ∞,

(2.15)  Πn(P ∈ 𝒫 : d(P, P*) > ε | X1, ..., Xn) → 0  in L1(P0^n).

PROOF. Define functions f and g by f(ε) = Π(B(ε, P*; P0)) and g(ε) = sup_{η>ε} Nt(η, 𝒫, d; P0, P*). We shall show that there exists a sequence εn → 0 such that f(εn) ≥ e^{−nεn²} and g(εn) ≤ e^{nεn²} for all sufficiently large n. This implies that the conditions of Theorem 2.1 are satisfied for this choice of εn and, hence, the posterior converges with rate at least εn. To show the existence of εn, define functions hn by

hn(ε) = e^{−nε²} (g(ε) + 1/f(ε)).

This is well defined and finite by the assumptions and hn(ε) → 0 as n → ∞, for every fixed ε > 0. Therefore, there exists εn ↓ 0 with hn(εn) → 0 [e.g., choose n1 < n2 < ⋯ such that hn(1/k) ≤ 1/k for all n ≥ nk; next define εn = 1/k for nk ≤ n < nk+1]. In particular, there exists N such that hn(εn) ≤ 1 for n ≥ N. This implies that f(εn) ≥ e^{−nεn²} and g(εn) ≤ e^{nεn²} for every n ≥ N. □
2.4. Multiple points of minimal Kullback-Leibler divergence. In this section we extend Theorem 2.1 to the situation where there exists a set 𝒫* of minimal Kullback-Leibler points. Multiple minimal points can arise in two very different ways. First consider the situation where the true distribution P0 and the elements of the model 𝒫 possess different supports. Because the observations are sampled from P0, they fall in the set where p0 > 0 with probability one and, hence, the exact nature of the elements of the model on the set {p0 = 0} is irrelevant. Clearly, multiple minima arise if the model contains multiple extensions of P* to the set on which p0 = 0. In this case the observations do not provide the means to distinguish between these extensions and, consequently, no such distinction can be made by the posterior either. Theorems 2.1 and 2.2 may apply under this type of nonuniqueness, as long as we fix one of the minimal points, and the assertion of the theorem will be true for any of the minimal points as soon as it is true for one of them. This follows because, under the conditions of the theorem, d(P1*, P2*) = 0 whenever P1* and P2* agree on the set p0 > 0, in view of (2.9).

Genuine multiple points of minimal Kullback-Leibler divergence may occur as well. For instance, one might fit a model consisting of normal distributions with means in (−∞, −1] ∪ [1, ∞) and variance one, in a situation where the true distribution is normal with mean 0. The normal distributions with means −1 and 1 both have the minimal Kullback-Leibler divergence. This situation is somewhat artificial and we are not aware of more interesting examples, because it appears that the situation does not arise in the nonparametric or semiparametric case that interests us most in the present paper. Nevertheless, since the situation might arise, we give a brief discussion of an extension of Theorem 2.1. Rather than to a single measure P* ∈ 𝒫, the extension refers to a finite subset 𝒫* ⊂ 𝒫 of points at minimal Kullback-Leibler divergence. We give conditions under which the posterior distribution asymptotically concentrates near this set of points. We redefine our "covering numbers for testing under misspecification" Nt(ε, 𝒫, d; P0, 𝒫*) as the minimal number N of convex sets B1, ..., BN of probability measures on (𝒳, 𝒜) needed to cover the set {P ∈ 𝒫 : ε ≤ d(P, 𝒫*) ≤ 2ε} such that, for every i,

(2.16)  inf_{P ∈ Bi} sup_{P* ∈ 𝒫*} sup_{0<α<1} −log P0 (p/p*)^α ≥ ε²/4.

This roughly says that, for every P ∈ 𝒫, there exists some minimal point to which we can apply arguments as before.

THEOREM 2.4. For a given model 𝒫, prior Π on 𝒫 and some finite subset 𝒫* ⊂ 𝒫, assume that −P0 log(p*/p0) < ∞ and P0(p/p*) < ∞ for all P ∈ 𝒫 and P* ∈ 𝒫*. Suppose that there exist a sequence of strictly positive numbers εn with εn → 0 and nεn² → ∞ and a constant L > 0, such that, for all n and all ε > εn,

(2.17)  inf_{P* ∈ 𝒫*} Π(B(εn, P*; P0)) ≥ e^{−Lnεn²},
(2.18)  Nt(ε, 𝒫, d; P0, 𝒫*) ≤ e^{nεn²}.

Then for every sufficiently large constant M > 0, as n → ∞,

(2.19)  Πn(P ∈ 𝒫 : d(P, 𝒫*) ≥ Mεn | X1, ..., Xn) → 0  in L1(P0^n).

3. Mixtures. Let μ denote the Lebesgue measure on ℝ. For each z ∈ ℝ, let x ↦ p(x|z) be a fixed μ-probability density on a measurable space (𝒳, 𝒜) that depends measurably on (x, z), and for a distribution F on ℝ, define the μ-density

pF(x) = ∫ p(x|z) dF(z).

Let PF be the corresponding probability measure. In this section we consider mixture models 𝒫 = {PF : F ∈ ℱ}, where ℱ is the set of all probability distributions on a given compact interval [−M, M]. We consider consistency for general mixtures and derive a rate of convergence in the special case that the family p(·|z) is the normal location family, that is, with φ the standard normal density,

(3.1)  pF(x) = ∫ φ(x − z) dF(z).

The observations X1, ..., Xn are an i.i.d. sample drawn from a distribution P0 on (𝒳, 𝒜) with μ-density p0, which is not necessarily of the mixture form. As a prior on ℱ, we use a Dirichlet process distribution (see [6, 7]).

3.1. General mixtures. We say that the model is P0-identifiable if, for all pairs F1, F2 ∈ ℱ,

F1 ≠ F2  ⟹  P0(pF1 ≠ pF2) > 0.
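For a mixing distribution with finitely many atoms, the mixture density (3.1) is a finite sum of shifted normal densities and can be evaluated directly; this is also the form in which realizations of a Dirichlet prior are typically represented. A small sketch (the atoms and weights below are arbitrary illustrations, not quantities from the paper):

```python
import numpy as np
from scipy import stats

def mixture_density(x, atoms, weights):
    """p_F(x) = integral of phi(x - z) dF(z) for a discrete mixing
    distribution F with the given atoms (in [-M, M]) and point masses."""
    x = np.asarray(x, dtype=float)
    # sum_j w_j * phi(x - z_j), vectorized over the evaluation points x
    return stats.norm.pdf(x[:, None] - np.asarray(atoms)[None, :]) @ np.asarray(weights)

atoms = [-1.5, 0.0, 1.0]   # support points of F in [-M, M], here M = 2
weights = [0.2, 0.5, 0.3]  # masses of F, summing to 1
print(mixture_density(np.linspace(-4, 4, 5), atoms, weights))
```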
Imposing this condition on the model excludes the first way in which nonuniqueness of P* may occur (as discussed in Section 2.4).

LEMMA 3.1. Assume that P0 log(pF/p0) < ∞ for some F ∈ ℱ. If the map z ↦ p(x|z) is continuous for all x, then there exists an F* ∈ ℱ that minimizes F ↦ −P0 log(pF/p0) over ℱ. If the model is P0-identifiable, then this F* is unique.

PROOF. If Fn is a sequence in ℱ with Fn → F for the weak topology on ℱ, then pFn(x) → pF(x) for every x, since the kernel is continuous in z (and, hence, also bounded as a result of the compactness of [−M, M]) and by use of the portmanteau lemma. Consequently, pFn → pF in L1(μ) by Scheffé's lemma. It follows that the map F ↦ pF is continuous for the weak topology from ℱ into L1(μ). The set ℱ is compact for this topology, by Prohorov's theorem. The Kullback-Leibler divergence p ↦ −P0 log(p/p0) is lower semi-continuous as a map from L1(μ) to ℝ (see Lemma 8.2). Therefore, the map F ↦ −P0 log(pF/p0) on the compactum ℱ is lower semi-continuous and, hence, attains its infimum on ℱ.

The map p ↦ −P0 log(p/p0) is also convex. By the strict convexity of the function x ↦ −log x we have, for any λ ∈ (0, 1),

−P0 log ((λp1 + (1 − λ)p2)/p0) < −λ P0 log (p1/p0) − (1 − λ) P0 log (p2/p0),

unless P0(p1 = p2) = 1. This shows that the point of minimum Kullback-Leibler divergence is unique if ℱ is P0-identifiable. □

Thus, a minimal point PF* exists and is unique under mild conditions on the kernel p. Because the model is convex, Lemma 2.3 next shows that (2.9) is satisfied for the weighted Hellinger distance, whose square is equal to

(3.2)  d²(PF1, PF2) = ½ ∫ (√pF1 − √pF2)² (p0/pF*) dμ.

If p0/pF* is bounded, then this expression is bounded by a multiple of the squared Hellinger distance between the measures PF1 and PF2. Because ℱ is compact for the weak topology and the map F ↦ pF from ℱ to L1(μ) is continuous (cf. the proof of Lemma 3.1), the model 𝒫 = {PF : F ∈ ℱ} is compact relative to the total variation distance. Because the Hellinger and total variation distances define the same uniform structure, the model is also compact relative to the Hellinger distance H and, hence, it is totally bounded; that is, the covering numbers N(ε, 𝒫, H) are finite for all ε. Combined with the result of the preceding paragraph and Lemmas 2.2 and 2.3, this yields the result that the entropy condition of Corollary 2.1 is satisfied for d as in (3.2) if p0/pF* ∈ L∞(μ), and we obtain the following theorem.

THEOREM 3.1. If p0/pF* ∈ L∞(μ) and Π(B(ε, PF*; P0)) > 0 for every ε > 0, then Πn(F : d(PF, PF*) > ε | X1, ..., Xn) → 0 in L1(P0^n) for d given by (3.2).
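Because the mixture model is convex, the minimizing point F* of Lemma 3.1 can be approximated numerically by fixing a fine grid of candidate atoms in [−M, M] and maximizing the empirical log-likelihood over the simplex of weights, for instance with EM-type multiplicative updates. This is a hedged sketch only: the fixed-grid EM scheme, the data law and the iteration count are our choices, and the empirical criterion is merely a surrogate for the population divergence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
X = rng.laplace(size=2000)               # data from a P0 not of mixture form

M = 2.0
atoms = np.linspace(-M, M, 41)           # fixed candidate support points for F
w = np.full(len(atoms), 1.0 / len(atoms))
K = stats.norm.pdf(X[:, None] - atoms[None, :])   # kernel matrix phi(X_i - z_j)

# EM updates for the weights: ascent on (1/n) sum_i log sum_j w_j phi(X_i - z_j),
# an empirical surrogate for minimizing F -> -P0 log(p_F/p_0).
for _ in range(500):
    pF = K @ w                            # current p_F(X_i)
    w *= (K / pF[:, None]).mean(axis=0)   # multiplicative (EM) update
    w /= w.sum()

print([(round(z, 2), round(p, 3)) for z, p in zip(atoms, w) if p > 1e-3])
```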
As a result of Lemma 3.3 g?, d) < Nis2, in [9], there exists a constant C2 > 0 such that Proof. (3.3) Because |/Y, -PF2\\{ <C2\\PFx -PFJ^maxjl,M,^log+? from which it follows that /V(C^log(l/?)'/2, &, j, j|-|loo) for see that there exists in [9], we 3.2 the help of Lemma enough s. With constant C3 > 0 such that, for all 0 < s < e~', small J^^ ||-||i) < N(e,&>, a . logW(e,?M|-||oo)<C3(logi) all of the above, we note Combining / / that, for small enough 1\1/4 logW(c,C2?l/2(tog-J s > 0, \ , &,d) <\ogN{e, &,\\-\\oo) <C3(logi)2. So if we sn such that, for all n > can find a sequence 1, there exists an s > 0 such that j x 1/4 / <sn C3flog-J C1C2^1/2^log-j then we have demonstrated / and <C3Hog-J that this is the case shows easily = case we which choose, for fixed n, s (in We the are now location <ns2n, that logA^(^,^,J)<log^^CiC26:,/2('log-N) One 1\2 ,g?,d\ <ns2n. sn = max{CiC2, C^ilogn/^/n) l/n), if n is taken large enough. for to apply Theorem 2.1. We consider, in position with the normal density standard mixtures (3.1) for given M > 0, 0 as the kernel. B. J.K. KLEIJN AND A. W VAN DER VAART 852 We the prior n equal to a Dirichlet prior on & specified by a finite a with compact continuous support and positive, Lebesgue density choose base measure on [-M,M]. on R dominated Let Po be a distribution by Lebesgue mea ? ?00 0>0- Then the posterior concen that po/Pf* distribution at the rate logn/y/n around PF* asymptotically to the relative d on g? given by (3.2). THEOREM 3.2. sure p. Assume trates its mass distance PROOF. below The set of mixture densities by the upper and lower envelope = U(x) (f>(x+ M)t{x<-M] +(/>(* pF with functions F e& is bounded above and M)t{x>M] +(t>(0)t{-M<x<M}, L(x) = 4>{x M)t[x<o) +(/>(x + M)t{x>0]. So for any F e &, \PF*J L 0(0) + Po(e-2MXt{x<-M} +e2MXt{x>M}) < 00, of pF* and PF* has sub by a multiple 2.2 and 2.3, the covering number for testing in (2.5) is bounded above by the ordinary metric Po, P*) Nt(s, g*,d; covering some constant for A. Then Lemma 3.2 demonstrates that number N(As, g*, d), a of logn/^/n. the entropy condition (2.5) is satisfied for sn large multiple because Gaussian bounded po is essentially tails. In view of Lemmas to verify the prior mass condition (2.4). Let s be given such that a discrete distribution in Lemma 3.2 there exists function [9], By F! e D[?M, zi,..., M] supported on at most N < C2 log(l/e) {z\, zn] C points ? < > are constants 0 where that that such C2 C\s, C\, [?M, M] \\pF* Pf'Woo = we on Without loss of M write We F' generality, only. depend Pj^zj= 1,...,J2f=\ not assume if set is this is that the N} may Namely, 2^-separated. {zj :j = 1,..., : of and subset the case, we may choose a maximal N} 2^-separated {zj j in this shift the weights pj to the nearest point in the subset. A discrete F" obtained ? < virtue of the So fashion satisfies \\pF> p/Hloo 2?||0/||oo. triangle inequal by of the standard normal kernel (p is bounded, ity and the fact that the derivative a given F' may be replaced by a 2?>separated F" if the constant C\ is changed It suffices 0 < s < e~l. accordingly. By Lemma difference 3.3 in [9], there exists a constant D\ such that the Li-norm satisfies / \\PF*-PF'\\\<D\s\\og-J lxl/2 , of the 853 BAYESIAN MISSPECIFICATION for small enough s. Using Lemma 3.6 in [9], we note, moreover, a constant D2 such that, for any F e&, \\Pf - Pf>\\\ < a constant So there exists < s, then Pj\ + J2 D2(s \Pizj D > 0 such that, if F ~ - e, zj+s] satisfies that there exists Pj\\. ~ ZI7L1 ( W/2 \F\Zj ?> Zj +?] ? . 
\\PF-PF*h<De\log-) be the measure Let QiP) is essentially Po/Pf* \\QiPFl) - bounded < PF2111for all Fx,F2e &. QiPF2)111 K\\PFl a constant that there exists it follows ? The assumption that ipQ/pF*)dP. by dQiP) > 0 such that a constant K exists that there implies defined r yv Since g(PF.) = Pq, D' > 0 such that, for small enough 1 s > 0, \Fe^:J2\FlZj-s,Zj+s]-Pj\<s\ C J. jpe^:||(2(PF)-/>olli<(^)2^log^ = < We have that dQiPF)/dP0 /?f/pf* PoiU/L) < 00. The and Po(pf/Pf*) is bounded by the square root of the L x-distance. Therefore, we see that the set of Lemma 8.1 with r\? r)is) = D^1/2(log(l/?))1//4, on set F in the side of with the the last Pp right-hand display is contained distance Hellinger applying measures in the set \pp.Fe &, -Po\og?< where f is) = D"r)is)ilogil/r)is))) constant D", and small enough I Pf* C2(^), Poflog?) V J Pf^/ < < $2is)\ for an appropriate D"D'sl/2ilogil/s))5/4, s. It follows that Fe^:J^\F[zj-s,Zj-{-s]~pj\<SK {N [8] (Lemma 6.1) or Lemma A.2 in [9], we see that the prior measure Following side of the previous display is lower bounded by the right-hand at ciexp(-C2^1og(l/?))>exp(-L(log(l/^))2)>exp(-L/(log(l/C(?)))2), where quence and L = C2C2 > 0. So if we can find a se 1, c2 > 0 are constants sn such that, for each sufficiently large n, there exists an ? > 0 such that c\ > en>S(e\ ns2n> , (log-?) B. J.K. KLEIJN AND A. W VAN DER VAART 854 Yl(B(sn, PF*; P0)) > Yl(B($(s), PF*; P0)) > exp(-L'nsl) and, is shows satisfied. One and ?(s) = that, for sn = logn/^/n (2.4) easily are fulfilled for sufficiently the two requirements large n. then hence, l/^/n, Let Po be the distribution of a random vector (X, Y) satis for random X and variables eo eo taking val independent ues in a measurable a measurable and in and IR, respectively, space (3?,srf) -> function fo: 3E E. The variables X and eo have given marginal distributions, which may be unknown, but are fixed throughout the following. is The purpose 4. Regression. ? fo(X) fying Y + the regression function fo based on a random sample of variables as (X, Y). with the same distribution (X\, Y\),..., (Xn, Yn) to this problem might A Bayesian start from the specification of a approach -> a on of If distribution class the measurable functions R. & JT /: prior given a posterior. to determine of X and eo are known, distributions this is sufficient If one are not to introduce priors for these distributions then known, might proceed to estimate as well. The approach we take here is to fix the distribution of eo or Laplace distribution, while aware of the fact that its true distribution the consequences of the resulting model misspec may be different. We investigate of the error distribution does not ification. We shall show that misspecification these unknowns to a normal of the regression In this sense function. a nonparametric to misspecifi the same robustness Bayesian approach possesses or contrast estimation least minimum absolute cation as minimum squares using have serious deviation. conditions consequences for estimation see that the use of the Laplace distribution requires no on the tail of the distribution of the errors, whereas the normal distrib We shall also appears to give good results only if these tails are not too big. Thus, the tail of the method absolute deviation versus the nonrobustness robustness of minimum of least squares also extends to Bayesian regression. We build the posterior based on a regression model Y ? f(X) + e for X and e on the true distribution as is the assumption of (X, Y). 
If we assume independent, cancels out that the distribution Px of X has a known form, then this distribution for the posterior on /. If, instead, we put independent priors on / of the expression ution then the prior on Px would disappear upon marginalization and Px, respectively, the posterior for /, of the posterior of (/, Px) relative to /. Thus, for investigating we may assume without distribution of X is that the marginal loss of generality into the dominating measure p for the model. It can be absorbed of the random variable For / e gP, let Pf be the distribution (X, Y) satisfying = e e X Y and for + variables, X having the same distribution f(X) independent a given density p, possibly different from the density of as before and e possessing the cases that p is normal and Laplace. Given the true error eo. We shall consider known. a prior n on J?\ the posterior distribution for / is given ^ /Bn?=iPW-/(x?))<*n(/) fYYUp(Yi-f(Xi))dn(f) by BAYESIAN MISSPECIFICATION 855 near fo + Eeo in the case that p concentrates near if if these translates and + P is Laplace, median(eo) fo density are true in function contained the model of the g?. If the prior is fo regression or also in the sense that fo + p?g? (where p is the expectation misspecified this remains true with fo replaced of eo), then, under some conditions, median We shall show that this distribution is a normal In agreement /* of fo on &. by a "projection" the paper, we shall denote the true distribution with the notation in the rest of of an observation (X, Y) by Po = as is with in different from The model & that, 0). / (stressing general, Po Pf on 3tg x R in the statement of the main results is the set of all distributions Pf with / e&. 4.1. Normal normal (4.1) that the density p is equal regression. Suppose = Then, with p = Eeo, p(z) (27r)-1/2exp(?\z2). density - = log -^-(X, Y) -\(f2 Pfo j { = ?L -Polog Pfo - -P0(f 2 2 to the standard fo)(X), f0)2(X) + eo(f fo - - p)2 -p2. i-> ? is mini Polog(pf/po) ? ? mized for / = /* e g? minimizing the map / i-> Po(f fo l^)2 In particular, if fo + p e gfi, then the minimizer is fo + p and is the point P/0+/x in the model that is closest to Po in the Kullback-Leibler sense. If also p = 0, then, even though the posterior on near asymptotically Pf will concentrate P/0, which is typically not equal to Po, the induced posterior on / will concentrate near the It follows that the Kullback-Leibler divergence / true regression function fo. This favorable property of Bayesian is anal estimation to of error that least also for nonnormal distributions. ogous squares estimators, If /o + p is not contained in the model, then the posterior for / will in general not be consistent. We assume that there exists a unique /* e J?~ that minimizes ? ? is a closed, convex subset / h-> Po(/ fo p)2, as is the case, for instance, if& some conditions of L2(Po). Under we shall show that the posterior concentrates near /*. If p = 0, then /* is the projection of fo into & and the asymptotically posterior still behaves that Eo<?o = 0. The following L2(Po)-distance LEMMA in a desirable lemma shows manner. that (2.8) For simplicity is satisfied of notation, for we assume a multiple of the on &. be a class of uniformly bounded functions f :3? ?> R convex is and in closed that fo fo Z.2(Po). Assume is uniformly bounded, that Eo^o = 0 and that Eo^M'^0' < oo for every M > 0. Then there exist positive constants C\,Ci, C3 such that, for all m e N, /, f\,..., such 4.1. Let ^ that either & or & B. J.K. KLEIJN AND A. W. 
VAN DER VAART 856 with J2i A./= and kx,...,km>0 fme^ 1, Polog^<-^P0(/-r)2, 2 Pf* (4.2) <ClP0(/-r)2, P0^log^ 0<a<l We Proof. (43) / V Pf* - C2J^ki(PQifi n2 > sup -logPof^i^)" C3Po(/ - f)2). i have - Y) = -I[(/0 2 iog?L(X, Pf* f)2 - - ifo D2]iX) e0(f* - f)(X). The first zero by assumption. side has mean second term on the right-hand ? ? ? as is if side has expectation term on the right-hand fo /*, f)2 |Po(/* of -Furthermore, the minimizing is convex, if& /* & property the case if /o The ? ? /*)(/* implies that Poifo /) > 0 for every f e & side of the first term on the right-hand in both cases (4.2) holds. Therefore, From (4.3) we also have, with M is bounded a uniform and then the expectation above upper bound by ~ ? f)2 \Poif* on & and fQ, 9 <P0[(/* (-BL) \Pf*/ Poflog-^-) \ Pf*/ - < P0[(f* - /)2(2/o -??-) Poflog V Pf*/ x - /)2(2/o- / / ~ - /*)2] + 2P0e2P0(f* - ~ P)2 + 2e2if* f)2] f)2, ^ e2a(M2+M\e0\) ? times Po(/ sides can be further bounded by a constant /*)2, right-hand of eQ only. where the constant depends on M and the distribution = we see that there ex In view of Lemma 4.3 (below) with p = pf* and #; /?/., on M only such that, for all A,;> 0 with J^t A.j= 1, ists a constant C > 0 depending Both (4.4) ll- Po(^^)" V Pf* I By Lemma a = 4.3 with <2a2c^Ai.Po(/;. ) -?Polog^-| i T,i*-iP/i\ 1 and p ? Pf and similar arguments, we for any / e&, \l-Po(^^)-Polog-^-\<2Cj:xiPo(fi-f)2. V Pf ) I Y,ihPf,\ For A, = i 1 this becomes ll I - Po(^-) \PfJ Polog^-I Pf,\ < 2CPo(fi - f)2. _ r)2 also have that, BAYESIAN MISSPECIFICATION differences, Taking we obtain 857 that Polog-^-?>Polog^ EihPfi Pfi i <4C?A,Po(//-/)2. i = By the fact that log(ab) log a + logb for every a, b > 0, this inequality remains on true if / the left is replaced by /*. Combine the resulting inequality with (4.4) to find that \ pp / Pfi i ~ -2a2Cj2^iPo(f* i > ^ _ 2a2c) ~ fi)2-4C^iPo(fi f)2 i ~ 5>,ft(/* fi)2-4C^iPo(fi - f)2, we have used small a > 0 and suitable con (4.2). For sufficiently stants C2, C3, the right-hand side is bounded below by the right-hand side of the lemma. Finally the left-hand side of the lemma can be bounded the supremum by ? over a 6 (0, 1) of the left-hand side of the last > 1? x for display, since log x every x > 0. where In view of the preceding of the quantities involved lemma, the estimation can be based on the L2(Po) distance. The "neighborhoods" B(s, Pf*; Po) involved in the prior mass conditions and (2.11) can be interpreted in the form main in the theorems (2.4) < P0(/* < s2}. B(s, Pf.; Po) = {feg?: P0(f fo)2 fo)2 + e2, P0(f f)2 If P0(f - /*)(/* - /o) = 0 for every / e & (as is the case if /* = f0 or if /* lies in the interior of &\ then this reduces to an L2(Po)-ball around /* by Pythagoras' theorem. In view of the preceding lemma and Lemma 2.1, the entropy for testing in (2.5) can be replaced by the local The rate of entropy of g? for the Z>2(Po)-metric. of the posterior distribution Theorem 2.1 is then also convergence guaranteed by relative to the L2(Po)-distance. These observations the theorem. yield following THEOREM Po(f ? /*)(/* 4.1. ? Assume the assertions of Lemma 4.1 and, in addition, that fo) = Ofor every f e g?. If sn is a sequence of strictly positive B. J. K. KLEIJN AND A. W. VAN DER VAART 858 numbers ? sn ?> 0 and ns? with (4.5) oc such that, for - e^:PQif n(/ < pf L > 0 and all n, a constant > s2n) e~Ln?n, (4.6) N(sn,^,\\-\\Po,2)<en?n, - then nnife^: > P0(/ /*)2 large constant M. 
These observations yield the following theorem.

THEOREM 4.1. Assume the assertions of Lemma 4.1 and, in addition, that P0(f − f*)(f* − f0) = 0 for every f ∈ ℱ. If εn is a sequence of strictly positive numbers with εn → 0 and nεn² → ∞ such that, for a constant L > 0 and all n,

(4.5)  Π(f ∈ ℱ : P0(f − f*)² ≤ εn²) ≥ e^{−Lnεn²},
(4.6)  N(εn, ℱ, ‖·‖_{P0,2}) ≤ e^{nεn²},

then Πn(f ∈ ℱ : P0(f − f*)² ≥ M εn² | X1, ..., Xn) → 0 in L1(P0^n), for every sufficiently large constant M.

There are many special cases of interest of this theorem and of the more general results that can be obtained from Theorems 2.1 and 2.2 using the preceding reasoning. Some of these are considered in the context of the well-specified regression model in [14]. The necessary estimates on the prior mass and the entropy are not different for problems other than the regression model. Entropy estimates can also be found in work on rates of convergence of minimum contrast estimators. For these reasons we exclude a discussion of concrete examples.

The following pair of lemmas was used in the proof of the preceding results.

LEMMA 4.2. There exists a universal constant C such that, for any probability measure P0 and any finite measures P and Q and any 0 < α ≤ 1,

|1 − P0 (q/p)^α − α P0 log (p/q)| ≤ α²C P0 [(log (q/p))² ((q/p) 1{q>p} + 1{q≤p})].

PROOF. The function R defined by R(x) = (e^x − 1 − x)/(x²e^x) for x ≥ 0 and R(x) = (e^x − 1 − x)/x² for x < 0 is uniformly bounded on ℝ by a constant C. We can write

P0 (q/p)^α = 1 + α P0 log (q/p) + P0 [R(α log (q/p)) (α log (q/p))² ((q/p)^α 1{q>p} + 1{q≤p})].

Since (q/p)^α ≤ q/p for q > p and α ≤ 1, the lemma follows. □

LEMMA 4.3. There exists a universal constant C such that, for any probability measure P0 and all finite measures P, Q1, ..., Qm and constants λ1, ..., λm ≥ 0 with Σi λi = 1 and 0 < α ≤ 1,

|1 − P0 (Σi λi qi / p)^α − α P0 log (p/Σi λi qi)| ≤ 2α²C Σi λi P0 [(log (qi/p))² ((qi/p)² + 1)].

PROOF. In view of Lemma 4.2 with q = Σi λi qi, it suffices to bound

P0 [(log (Σi λi qi / p))² ((Σi λi qi / p) 1{Σi λi qi > p} + 1{Σi λi qi ≤ p})]

by the right-hand side of the lemma. We bound the two terms corresponding to the decomposition by indicators separately. If Σi λi qi > p, then the left-hand side of the elementary inequality obtained from the convexity of the map x ↦ x log x is positive and the inequality is preserved when we square on both sides; convexity of the map x ↦ x² then allows us to bound the square of the right-hand side as in the lemma. On the set Σi λi qi ≤ p, the concavity of the logarithm gives −log (Σi λi qi / p) ≤ −Σi λi log (qi/p); the left-hand side is positive and we can again take squares on both sides and preserve the inequality. □

4.2. Laplace regression. Suppose that the error density p is equal to the Laplace density p(x) = ½ exp(−|x|). Then

log (pf/pf0)(X, Y) = −(|e0 + f0(X) − f(X)| − |e0|),
−P0 log (pf/pf0) = P0 Φ(f − f0),

for Φ(v) = E0(|e0 − v| − |e0|). The function Φ is minimized over v ∈ ℝ at the median of e0. It follows that if f0 + m, for m the median of e0, is contained in ℱ, then the Kullback-Leibler divergence f ↦ −P0 log(pf/p0) over f ∈ ℱ is minimized at f = f0 + m. If ℱ is a compact, convex subset of L1(P0), then in any case there exists f* ∈ ℱ that minimizes the Kullback-Leibler divergence, but it appears difficult to determine this concretely in general. For simplicity of notation, we shall assume that m = 0.

If the distribution of e0 is smooth, then the function Φ will be smooth too. Because it is minimal at v = m = 0, it is reasonable to expect that, for v in a neighborhood of 0 and some positive constant C0,

(4.7)  Φ(v) = E0(|e0 − v| − |e0|) ≥ C0 v².

Because Φ is convex, it is also reasonable to expect that its second derivative, if it exists, is strictly positive.

LEMMA 4.4. Let ℱ be a class of uniformly bounded functions f : 𝒳 → ℝ and let f0 be uniformly bounded. Assume that either f0 ∈ ℱ and (4.7) holds, or that ℱ is convex and compact in L1(P0) and that Φ is twice continuously differentiable with strictly positive second derivative. Then there exist positive constants C0, C1, C2, C3 such that, for all m ∈ ℕ, f, f1, ..., fm ∈ ℱ and λ1, ..., λm ≥ 0 with Σi λi = 1,

       P0 log (pf/pf*) ≤ −C0 P0(f − f*)²,
(4.8)  P0 (log (pf/pf*))² ≤ C1 P0(f − f*)²,
       sup_{0<α<1} −log P0 (Σi λi pfi / pf*)^α ≥ C2 Σi λi (P0(fi − f*)² − C3 P0(f − fi)²).

PROOF. First suppose that f0 ∈ ℱ, so that f* = f0. As Φ is monotone on (0, ∞) and (−∞, 0), inequality (4.7) is automatically also satisfied for v in a given compactum (with C0 depending on the compactum). Choosing the compactum large enough such that (f − f*)(X) is contained in it with probability one, we conclude that (4.8) holds (with f0 = f*).

If f0 is not contained in ℱ but ℱ is convex, we obtain a similar inequality with f* replacing f0, as follows. Because f* minimizes f ↦ P0Φ(f − f0) over ℱ, the right derivative of the map t ↦ P0Φ(ft − f0) for ft = (1 − t)f* + tf ∈ ℱ and t ∈ [0, 1] is nonnegative at t = 0. This yields P0Φ′(f* − f0)(f − f*) ≥ 0. By a Taylor expansion,

−P0 log (pf/pf*) = P0(Φ(f − f0) − Φ(f* − f0)) = P0Φ′(f* − f0)(f − f*) + ½ P0Φ″(f̃ − f0)(f − f*)²,

for some f̃ between f and f*. The first term on the right-hand side is nonnegative and the function Φ″ is bounded away from zero on compacta by assumption. Thus, the right-hand side is bounded below by a constant times P0(f − f*)² and (4.8) follows again.

Because log(pf/pf*) is bounded in absolute value by |f − f*|, we also have, with M a uniform upper bound on ℱ and f0,

P0 (log (pf/pf*))² ≤ P0(f* − f)².

As in the proof of Lemma 4.1, we can combine these inequalities, (4.8) and Lemma 4.3 to obtain the result. □

As in the case of regression using the normal density for the error distribution, the preceding lemma reduces the entropy calculations for the application of Theorem 2.1 to estimates of the L2(P0)-entropy of the class of regression functions ℱ. The resulting rate of convergence is the same as in the case where a normal distribution is used for the error. A difference with the normal case is that presently no tail conditions of the type E0 e^{M|e0|} < ∞ are necessary. Instead the lemma assumes a certain smoothness of the true distribution of the error e0.
An example is fitting an exponential family, such as the Gaussian model, to observations that are not sampled from an element of the family. If the minimal then we cannot expect (5.1) to hold for the natural point 0* is not interior to 0, and different distance rates of convergence may arise. We is somewhat surprising. include a simple example of the latter type, which Example. that Po is the standard normal distribution and the model Suppose > of all N(0, 0 with 1. The minimal Kullback-Leibler 1)-distributions = 1. If the a density on [1, 00) that is bounded point is 0* prior possesses away near from 0 and infinity near 0* at the rate l/n. 1, then the posterior concentrates consists One easily shows that -Polog^ (5.6) = Po* -logPot^Y \po*' This shows y/\0\ -021 ^(0-0*)(0+0*), 2 =l-a(0-0*){0+0*-a(0-0*)). 2 = that (2.9) is satisfied for a multiple of the metric d(pox,po2) on 0 = [1, 00). Its strengthening (2.8) can be verified by the same B. J.K. KLEIJN AND A. W. VAN DER VAART 864 as before, or, alternatively, the existence of suitable tests can be estab lished directly based on the special nature of the normal location family. [A suit able test for an interval (0\, 02) can be obtained from a suitable test for its left end methods as in regular parametric mod point.] The entropy and prior mass can be estimated can to els and conditions be be shown satisfied for sn a large multiple (2.5)?(2.11) ? rate to of l/y/n. This yields the the metric Vl#i relative #21 and, hence, \/y/n the rate l/n Theorem in the natural metric. In the only gives an upper bound on the rate of convergence. on a to be sharp. For instance, for uniform prior [1, 2], present situation this appears the posterior mass of the interval [c, 2] can be seen to be, with Zn = y/nXn, 2.2 <S>(2jil-Zn)-<S>(c^-Zn) ~ ^ we c e c(-i/2)(<?-i)n+Zn(c-l)JZ cJTi-Zn Q(2jn-Zn)-&(y/n-Zn) where y/E-Zn ? ? if xn, yn -> ratio to see that <?>(yn) ?(xn) (l/xn)<f>(xn) c = cn = zero from for that xn/yn -> 0. This is bounded away use Mills' (0, 1) such 1+ C/n and fixed C. LEMMA measure exists a universal There 5.2. Po and any finite measures constant C such that for any probability P and Q and any 0 < a < 1, ( |l Fo(-) -<xPolog-|<a2CP0 /--l) (log-) Mq<P]l There exists a universal constant C such that, for any probability and Xm > 0 with P, Q\,..., Po any finite measures Qm and any X\,..., holds: 1 and 0 < a < 1, the following inequality LEMMA measure = J2i h + lte>p> 5.3. l-^o I \ P -otPolog?-? J ?/*i#l <2a^EA,/>o[(y|-l) +(log|)2]. 1 The function R defined by R(x) = (ex PROOFS. I)2 x)/a2(ex/2a ? 1? bounded on R by for x < 0 is uniformly for x > 0 and R(x) = (ex x)/x2 a constant C, independent of a e (0,1]. [This may be proved by noting that the ? ? ? ? ? 1 x)/(ex^2 and functions 1) (ex I)2 are bounded, where (ex l)/a(eax in their power series.] For the exponentials this follows for the first by developing the proof of the first lemma, we can proceed as in the proof of Lemma 4.2. For the as in the proof of Lemma 4.3, this time proof of the second lemma, we proceed ? use of the of the convexity also making map x h> \y/x 112 on [0, oo). BAYESIAN MISSPECIFICATION The proofs of Theorems 6. Existence of tests. versus the positive, finite measures obtained QiP) 865 2.1 and 2.2 rely on tests of Po from points P that are at pos Kullback-Leibler divergence. 
(i.e., not necessarily probabil itive distance from the set of points with minimal we need to test Po against finite measures known results on tests using the Hellinger ity measures), Because such as in [10] distance, or [8], do not apply. It turns out that in this situation the Hellinger distance may not be appropriate and instead we use the full Hellinger transform. The aim of this section is to prove the existence of suitable tests and give upper bounds on their first the in a general notation and then specialize We formulate to results power. the application inmisspecified models. on a measurable Let P be a probability measure setup. space a role and on i3?, srf) the of let be measures finite class ? of Po) iX, srf) (playing = Q with dQ [playing the role of the measures (po/p*) dP\ We wish to bound the minimax risk for testing P versus J2, defined by 6.1. General tt(P, B) = inf sup (P0 + Qil (/))), where the infimum conv(i?) denote LEMMA is taken over the convex all measurable If there exists a a-finite 6.1. functions -> (p: 3? [0, 1]. Let hull of the set B. TtiP,B) = sup measure that dominates all Qe B, then > (Pip <q) + Qip q)). geconv(^) Moreover, there exists a test 0 that attains the infimum in the definition of TtiP, B). If p' is a measure that dominates both ?? and P PROOF. then a a-finite measure p exists dominating B, ? Let and be + of p P). p pi q (e.g., //-densities P and Q, for every Q e B. The set of test-functions can be identified with (j) the positive unit ball O of L^iX, is dual to Lxi3?', srf, p), since srf, p), which If equipped with the weak-* the positive unit ball <J>is p is a-finite. topology, Hausdorff and compact by the Banach-Alaoglu theorem (see, e.g., [11], Theo rem 2.6.18, and note that the positive functions form a closed and convex subset of the unit ball). The convex hull conv(i?) (or rather the corresponding is a convex subset of L\ i3?, srf, //). The map set of //-densities) L^iSC,s^,p) x Lxi$r,?f,p)-+R, (0,g)H>0P is concave + (l-0)G in Q and convex in 0. [Note that in the current context we write 0P instead of P0, in accordance with the fact that we consider 0 as a bounded lin ear functional on Lxi&, the map is weak-*-continuous in (f> srf, p).] Moreover, for every fixed Q [note that every net (f)a ^-X cf)by definition weak-*-converging B. J. K. KLEIJN AND A. W VAN DER VAART 866 for application srf, p)]. The conditions cpaQ -> (j)Q for all Q eL\(36', the minimax theorem (see, e.g., [15], page 239) are satisfied and we conclude of satisfies inf ^^ (0P + (1-0)G)= sup inf(0P + (l-(/))Q). sup <t>e? Qeconv(g) Qeconv(?) on the left-hand side is the minimax expression testing risk ix(P,g2g). The infimum on the right-hand side is attained at the point (p= t{p < q}, which leads to the first assertion of the lemma upon substitution. The second assertion of the lemma follows because the function 0 h? sup{0P + ? e conv(J2)} is a supremum of weak-*-continuous functions and, (1 (p)Q:Q The hence, attains its minimum on the compactum <?>. to express It is possible the right-hand side of the preceding lemma in the P and Q, but this is not useful for the following. Instead, L\ -distance between we use a bound in terms of the Hellinger transform pa(P, Q) defined by, for 0<cx < 1, Pa(P,Q) = Paql-adp. f inequality, this quantity is finite for all finite measures By Holder's of the choice of dominating measure p. is independent definition For any pair (P, Q) and every a e (0, 1), we can bound P(p <q) + Q(p>q)= pdp+ Jp<q (6.1) I P and Q. 
The qdp Jp>q <[Jp<q paq{-adix+Jp>q f paqX-adix = Pa(P,Q) lemma is bounded by the right-hand side of the preceding Hence, supg pa(P, Q) if is the fact that it factorizes of this bound for all a e (0, 1). The advantage For ease of notation, define P and Q are product measures. = <&) pa(g*, LEMMA 6.2. sup{pa(P, e conv(^), Q):P For any 0 < a < 1 and classes Q e conv(^)}. g?\, g^2, &\, ^2 of finite mea sures, Pa(g?l where g?\ x g?2 X &>2, ?X denotes X Bi) the class < Pa(&l,?l)pa(&2, of product measures ?22), {P\ x P2'P\ ^i, P2 ^2}. x ?2) x g?2) and Q e conv(i?i be given. Since Let P e con\(g?\ a measures -finite both are (finite) convex combinations, p\ and P2 can always be Proof. 867 BAYESIAN MISSPECIFICATION found such that both P and Q have px x p2 densities as convex combination which both can be written in the form of a finite pix,y) qix,y) = = i|i\ > 0, ^Vy-= 1, ^Kjq\jix)q2jiy), k} for px x p2-almost-all pairs to&\ for measures belonging are /^-densities for measures paq{-adipx 1, ki>0, ./ f J^A./= ^kipuix)p2iiy), j x JT. Here pi/ and ix, y) e X qXj are //j-densities and Bx, respectively p% and q2j (and, analogously, in g?2 and J22). This implies that we can write xp2) / V LjKjqijix) E/^/Pi/C*) / J (where, as usual, the integrand of the inner integral is taken equal to zero whenever the /xj-density equals zero). The inner integral is bounded by p(Xi&)2, J22) for x e fixed After this the X. bound, every upper substituting remaining integral is bounded by pai^x,B]). Combining 6.2 and 6.1, we obtain (6.1) with Lemmas measure If P is a probability on i3?', srf), then, for nated set of finite measures -> : 3?n such that, 1] [0, 4>n for all 0 < a < 1, THEOREM 6.1. sup (P>? + Qnil The bound measures given by the theorem P and Q, we have PMliP,G) (f>n)) is useful = 1| < is a domi (X', srf) and B > 1, there exists a test every n J2)n. if pa (P, ?}) - theorem. on PaiP, only f{VP the following < 1. For probability Jqfdp use the bound with a = 1/2 if the Hellinger of distance and, hence, we might to a measure P is For finite the conv(J2) positive. Q, general quantity pX/2iP, Q) on Q, the Hellinger transform paiP, Q) may may be bigger than 1 and, depending even 1 for every a. The the (generalized) Kullback-Leibler lie above lemma shows that this is controlled ? P logiq/p). divergence following by B. J.K. KLEIJN AND A. W. VAN DER VAART 868 LEMMA 6.3. tion a h> Pa(Q, measure P and a finite measure Q, the func ?> > [0, 1] with pa(Q, P) 0) as a | 0, P(q For a probability is convex on P) Pa(Q, P) -> Q(P > 0) as a f 1 and D1 (q\ dPa(Q,P)\ dot (which may to be equal ? |a=o \pJ oo). The function a h-> eay is convex on (0, 1) for all y e [?oo, oo), on (0, 1). The function of a \-+ pa(Q, P) = P(q/p)a the convexity implying = > y-> on [0, 1] for any y a for y < 1, ealogy is continuous 0, is decreasing ya = as a | 0, 1. By monotone convergence, increasing for y > 1 and constant for y PROOF. = Q(0<p<q). Mo<P<q} l{0<?}te(-) q(^) the dominated theorem, with dominating convergence < < -> 0, for a 1/2, we have, as a Mp>q) (p/q)l/2^{p>q) By !{*>*>-?g(-) e(-) Combining Q(p/q)a the - two displays preceding function 1{^} = g(p>^). above, we see that x (p/q)a pi-a(Q, P) = Q(p > 0) as a | 0. ? of the function a \-+ eay, the map or i-?-/^(y) = (ea:y By the convexity = < as a \, 0, to (d/da)|a=o/a(.y) 0, we y, for every y. For y decreases, > 0, < 0, while, for y by Taylor's formula, fa(y) fa(y)< \)/ot have sup^<y^<-^(a+?)>;. ^ 0<a/<a < 0v the quotient that fa(y) conclude Consequently, 6^le^a+?^yty>o. 
_ as a and is above to 0 bounded decreases | by 0 v ^ log(q/p) a-i(eaiog(q/p) < > 1. We a 2s for conclude which is for small 0, P-integrable s~l(q/p)2??q>p that we Hence, ~ P) ~ Po(Q,P))= ^(p?(G. lp((-) as a I 0, by the monotone Two typical graphs convergence of the Hellinger i l)h>o Plog(^yq>o, theorem. \-+ pa (Q, P) are shown in Fig in a situation normal location model transform a to fitting a unit variance ure 1 [corresponding For P a proba are sampled from a N(0, the observations 2)-distribution]. a = 0, but will 1 at to is transform measure P the with < Q, equal Hellinger bility = > near a 1 a if level that is above 1 increase to 0) > 1. Unless Q(p eventually where BAYESIAN MISSPECIFICATION o J co 10 J c\i oJ cvi FlG. O "1J I?I-1-1-1-1-T"^ 0.2 1. ^J Ifi 1 The Hellinger 0.8 transforms measure defined by dQ = (dN(3/2, with Intercepts (right). Q(p > 0), respectively. slope ^I-1-1-1-1-r-1 0.2 0.0 \-> pa(Q, l)/dN(0, axis the vertical The J "I 1.0 a at at 0 equals / / | j o O 0.6 oJ c\i d 0.0 0.4 j cvi "I // J ol CO // I I IT) d "1 o / / I 869 P), far \))dP the right (minus) P = N(0,2) 0.4 and 0.8 0.61.0 Q, (left) and dQ = (dN(3/2, and equal left of the graphs the Kullback-Leibler divergence the respectively, l)/dN(\, P(q P > \))dP 0) and \og(p/q). itwill never decrease below the level 1. For prob is negative, the slope P logip/q) P and Q, this slope equals minus the Kullback-Leibler distance ability measures = and, hence, is strictly negative unless P Q. In that case, the graph is strictly be choice to work with. For a general low 1 on (0, 1) and pX/2iP, Q) is a convenient to assume val finite measure transform the paiQ, P) is guaranteed Q, Hellinger ues strictly less than 1 near a = 0, provided that the Kullback-Leibler divergence is negative, which is not automatically the case. For testing a compos P logip/q) in Q e conv(i?). i?, we shall need that this is the case uniformly 6.1 guarantees of tests based alternative the existence i?, Theorem 2 on n observations if with error probabilities bounded by e~n? ite alternative For a convex s 2< sup sup 0<a<lQe? 1 i log-. PaiQ,P) In some of the examples we can achieve of this type by bounding the inequalities ? \-^ a a of (uniform) Taylor expansion right-hand side below by log pa (P, Q) in a near a ? 0. Such arguments are not mere technical generalizations: they can be to to relative standard necessary prove posterior consistency already misspecified parametric models. If Piq = 0) > 0, then the Hellinger hence, good tests exist, even though is strictly less than 1 at a = 0 and, be true that pi/2(P, Q) > 1. The exis transform itmay B. J.K. KLEIJN AND A. W. VAN DER VAART 870 tence of good tests is obvious in this case, since we can reject Q if the observations land in the set q = 0. is dominated. If this is not the case, then In the above we have assumed that ? we use tests [10], that Le Cam's generalized that the results go through, provided is, we define tt(P, ?) = inf sup (4>P+ (1 0)g), linear maps the infimum is taken over the set of all continuous, positive < 1 for all h> measures P. This collec R that such Lx i3?, &/) </>P 0: probability includes the linear maps that arise from integration of measur tion of functionals able functions 0: 3? i-> [0, 1], but may be larger. Such tests would be good enough where for our purposes, but the generality to misspecified models. application to have appears little additional value for our The next step is to extend the upper bound to alternatives ?H that are possibly of that are complements not convex. 
We are particularly interested in alternatives on be the set of finite measures balls around P in some metric. Let g/) L\i3?', h->R be such that r (P, .):.Sh>R si) */) x L+(^, (#\ */) and let r :L\i3?, function (written in a notation so as to suggest a distance from P is a nonnegative to Q). For QeB, set (6.2) r2(P,?)= For s > 0, define NT(e, L\i%\s/):f(P, J2) sup log 0<cx<l PaiP, to be the minimal of convex Q) > s/2} needed to cover {QeB'.s assume that J2 is such that this number that these convex subsets have f-distance theorem number Q) is finite s/2 to P subsets of {Q e < r(P, Q) < 2s} and for all s > 0. (The requirement is essential.) Then the following applies. set of measure and J2 be a dominated Let P be a probability THEOREM 6.2. > 1, there exists a test (j)n > n s and 0 all measures Then all oni3?, &/). for finite such that, for all J e N, oo j=1 (6.3) sup {Q:r(P,Q)>Je} Qn(l-4>n)<e'nj2?/4. PROOF. Fix n > 1 and s > 0 and define ?>j = {Q e ?:je there exists for every j > 1 a finite (j + 1)?}. By assumption, = of finite measures, convex sets ?!) NT(js, Cyj,..., Nj C/.w,property that (6.4) inf f(P,Q)>J-^-, l<i<Nj. < r(P, Q) < cover with of J27 by the further 871 BAYESIAN MISSPECIFICATION 6.1, for all n > 1 and for each such that, for all a e (0, 1), we have to Theorem According test 4>njj set there exists Cjj, a Pn(t>n,j,i<Pa(PXjj)\ SUp Q"(l-<l>nJj)<Pa(P,Cjj)n QeCjj (6.4), we have By inf sup sup pa(P,Q)= For fixed P continuously main L\(3?, the left-hand srf) is concave. By side of the preceding inf define Q) is convex and can be extended on [0, 1]. The function Q \-+ p?(P, Q) with do the minimax theorem (see, e.g., [15], page 239), display inf pa(P,Cjj). sup pa(P,Q)= 0<a<l that sup Q*cu a new test function Qn(\-4>njj)<e-nj2e2/4. (j)nby max (j)n =sup j>\ Then, equals Q^c Pnc/>njJv Now <e~j2e2/4. \-^ pa(P, a and Q, the function to a convex function 0<a<l It follows e-f2(P>Q) QeCjj QeCj,i0<a<l for every J > 0n,/\i. \<i<Nj 1, oo Nj oo pn*n <E E pn^u< E ^v^'V/4> = = = 7 sup Gn(l -0?) 1' < sup max ? ^ j>ji<Nj 1 7 1 ' sup Qn(\ -fajj) 7>i QeCjj where ^ = {e:T(P,G)>7e} = < sup^"^V/4 - ^^yV/4, Uy>y^y. to misspecification. 6.2. Application When applying the above in the proof for in the problem is to test the true distribution models, consistency Po misspecified = = for P e g?. In this Q Q(P) against measures taking the form d Q (po/p*)dP transform takes the form pa(Q, Po) ? Po(p/p*)a case, the Hellinger and its right at a = 0 is equal to Polog(/?//?*). derivative This is negative for every P e g? if and only if P* is the point in& at minimal to Po. Kullback-Leibler divergence This observation of minimal point sumed. We illustrates that the measure Kullback-Leibler formalize P* divergence, this in the following lemma. in Theorem even if this a 2.1 is necessarily as is not explicitly B. J. K. KLEIJN AND A. W. VAN DER VAART 872 < oo and Poip/ that Polog(/?o/P*) < oo and the right-hand side of (2.9) is positive, then Pologipo/p*) p*) the covering numbers for testing Nt is, ??, d; Po, P*) Po logipo/p). Consequently, in Theorem 2.1 can be finite only if P* is a point of minimal Kullback-Leibler di LEMMA If P* 6.4. and are P such < to Po. relative vergence = = 1. If The assumptions 0) > 0, then P0(p imply that Po(p* > 0) = oo and there is we assume to that p is prove. Thus, may nothing Polog(/?o/p) also strictly positive under Po. Then, in view of Lemma 6.3, the function g defined PROOF. 
by g(a) = Po(p/pT = PaiQ, Po) is continuous on [0, 1] with g(0) = P0(p > = side of (2.9) can be positive only if gia) < 1 for some 1 and the right-hand 0) a e [0, 1]. By convexity of g and the fact that g(0) = 1, this can happen only if the In view of Lemma 6.3, this derivative right derivative of g at zero is nonpositive. iss'(0+) = Finiteness P0log(/?/p*). of that numbers for testing for some s > 0 implies > e P is positive with ?? 0, as J(P, P*) every of in one of the sets P; in the definition be contained > case the right-hand side of (2.9) for some s 0, in which the covering of (2.9) side the right-hand must P such every Ntis, ??, d; Po, P*) is bounded below by s2/4. < 1 for every If Poip/p*) a is (po/p*)dP subprobability e &, measure P then the measure Q defined by dQ ? the Hellinger and, hence, by convexity, \-> transform a paiPo, Q) is never above the level 1 and is strictly less than 1 at = = a Q. In such a case there appears to be no loss in generality 1/2 unless Po to work with the choice a = 1/2 only, leading to the distance d as in Lemma 2.3. This lemma The shows following of our main proof Ntis, &,d',P0,P*) that this situation theorem translates results. Recall arises is convex. if 2? Theorem the definition 6.2 into the form of the covering needed numbers for the for testing in i2.2). < oo for all P e ??. Assume 6.3. Suppose P* e ?? and Poip/p*) D such that, for some sn > 0 and every that there exists a nonincreasing function THEOREM s> sn, (6.5) Ntis,&>,d;PQ,P*)<Dis). Then for every every J e N, s > sn there exists a test 4>n idepending on s > 0) such e-ne2/4 ^^Dis)^-^^, (6.6) sup {Pe&>:d(P,P*)>Je} QiP)ni\-d>n)<e-nj2el\ that, for BAYESIAN MISSPECIFICATION Proof. J2 as Define the set of all finite 873 measures as P Q(P) ranges (where po/p* = 0 if po = 0) and define x(Q\, Q2) = d(P\, P2). over g* Then Q(P*) rem 6.2 with have NT(s, test function = P0 and, hence, d(P, P*) P0). Identify P of Theo x(Q(P), the the present measure definitions Po. By (2.2) and (6.2), we < < > s for the sn. Therefore, Po, P*) D(s) every <2) Nt(s, g?,d; to exist by Theorem 6.2 satisfies guaranteed = 00 00 < PS4>n E < D(s) ? D(js)e-^2fA e-n^2l\ 7=1 7=1 is nonincreasing. This can be bounded further (as in the assertion) since, < < x for all 0 1, J2n>\ xH <x/(l? *) The second line in the assertion is simply the second line in (6.3). D because 7. Proofs of the main theorems. 8.1 in [8] and can be proved Lemma LEMMA define B(s, on g*, 7.1. For P*; Po) given s > 0 and P* by (2.3). Then for pon(fU jtmdTKP) of The following in the same manner. is analogous to < 00, that Polog(po/p*) measure Yl every C > 0 and probability e g? such < < Tl(B(e,/>*; />o))o-2(,+C)) -^. In view of (2.5), the conditions of Theorem 6.3 2 are satisfied, with the function D(s) = en6n (i.e., constant in s > sn). Let <f>n be the test as in the assertion of this theorem for s = Msn to be and M a large constant, determined later in the proof. For C > 0, also to be determined later in the proof, let Qn be the event Proof C7-1) Then Theorem lemma / P%(3?n Set 2.2. > fl -^(Xt)dYl(P) ^(1+C)^2n(P(^, P*; P0)). < \ Qn) = nn(P by Lemma 7.1. > s\Xu...,Xn). &>:d(P,P*) l/(C2ns2n), Yln(s) J e N, we can decompose P%Yln(JMsn) = (7.2) For P$(fln(JMen)4>n) + P0"(n?(/M??)(l every n > 1 and 0n)l^) + P^{Yln(JMsn)(l-(t>n)t^n). < 1, the three terms on the right-hand side separately. Because Yln(s) -> zero as to 00 for This converges ns2 by l/(C2ns2). fixed C and/or can be made arbitrarily small by choosing a large constant C if ns2 is bounded away from zero. 
We estimate the middle term is bounded B. J. K. KLEIJN AND A. W. VAN DER VAART 874 By the first inequality the first term on the right-hand in (6.6), side of (7.2) is bounded by < P^hniJMen)4>n) Pfa < (l-M2/4)ns2 [_e_nMle:iA. on the right-hand side is bounded above sufficiently large M, the expression /8 for sufficiently by 2e~n?"M large n and, hence, can be made arbitrarily small by to 0 for fixed M if ne2 ?> oo. choice of M, or converges of the third term on the right-hand Estimation side of (7.2) is more involved. = 1, we can write > Because 0) Po(p* For PS(nn(JMen)(l-4>n)lQn) (?'3) -pan-* [fd(p,pn>JMenni=i(p/p*)(Xi)dnjP)^ o( ^4 where we have written UUUip/p^x^diiiP) the arguments J' for clarity. By the definition of Qn, the is bounded below by e~^+c^nenTl(B( n, P*; Po)). integral in the denominator = measure the this for defined bound, QiP) writing Inserting by dQiP) and using Fubini's the right-hand side of theorem, we can bound ipo/P*)dP, the preceding display by , e{\+C)nel (7'4) 777^7-^-^P*; UiBiSn, Xt Po)) / Jd(P,P*)>JMen Q(P)na-4>n)dTl(P). = < diP, P*) < Men(j + 1)}, we can decompose {P e 0?: Msnj = > JMsn} tests (pn have been chosen to satisfy The Uj>j &nj< e^1 1^!* uniformly in P e &>nJ. the inequality [Cf. the sec <2(P)n(l 4>n) is bounded in (6.6).] It follows that the preceding ond inequality display by Setting <?nj [P :diP, P*) {\+C)nel njJ e-nJ2M2^4Ui^n YliBisn, P*; Po))-!-Y\ f^j <V j) e(l+C)ne*+nelM2j2/Z-nj2M2e*/4^ large M, by (2.11). For fixed C and sufficiently bounded away from zero and J = Jn -> oo. this converges to zero if ns2 is the numera is a probability measure, mass the condition (2.11) is prior by asser mass We that the the conclude condition (2.4). prior implied (for large j) by ? = 2.2. That in tion of Theorem oo, follows from Theorem 2.1, but with M Mn fact it suffices that M is sufficiently large follows by inspection of the preceding Proof of tor in (2.11) proof. Theorem 2.1. is bounded above Because n 1. Therefore, BAYESIAN MISSPECIFICATION of Proof 2.4. Theorem 875 The proof of this theorem is that we cannot difference follows the same steps to the prepara as the preceding appeal proofs. A to in the theorems lemmas and separate steps. The shells gPnj proof tory split = < < must e be covered by sets BnJj + d(P, gP*) l)sn} {P M(j P\Mjsn we use as in the definition set the appropriate element (2.16), and for each such P* ,, e g?* to define a test 'J' <j>n,; and to rewrite the left-hand side of (7.3). We n,j,i omit the details. Lemma 8.1 is used to upper bound the Kullback the of and the expectation squared logarithm by a function of divergence A similar lemma was presented in [16], where both p and q were the Li-norm. 8. Technical lemmas. Leibler assumed to be densities of probability distributions. We generalize this result to we a measure are use to is and the L finite forced instead of the \ q distance. the case where Hellinger LEMMA for 8.1. For every probability s\> > 0 such that, every b > 0, there exists a constant < measure measure P and finite Q with 0 < h2(p,q) sbP(p/q)b, r: (0, oo) -> R defined PROOF. The function implicitly ? ? the following 1) l)2 possesses r(x)(*J~x properties: r is nonnegative and decreasing. ~ asx \, 0, whence there exists s' r(x) log(l/x) on [0, e']. (A computer graph indicates that sr = > 0 such that For every b > 0, there exists s'^ ? > 1, we may take s'^ (For b 1, but for b close In view of the definition of r and the first property, = Plog^ Q -2p(/i-lUpr(i)(^-l)2 \VP I \P/\S P by ? 
= log Jt > 0 such that r(x) 0.4 will do.) < 2(*J~x 21og(l/x) is increasing on xbr(x) to zero, must be very s^ we can write I J--A < h2(p, q) + \\p-q\\\+ r(s)h2(p, q) +M~\\ ~ < el, [0, s^]. small.) B. J.K. KLEIJN AND A. W. VAN DER VAART 876 for any 0 < s < 4, where we use the fact that \yfqfp ? 11< 1 if q/p < 4. Next we choose s < and use the third property to bound the last term on the right-hand si side by P(p/q)bsbr(s). the resulting bound with the second property, Combining we then obtain, for s < s1 A A 4, s^ Plog^</z2(p,^) + + ||p-^||1+21og-/z2(p,^) sq sb = . 2^1og-pf^ s \qj and third terms on the right-hand side < a for small sb, then this q) sufficiently s\)P(p/q)b we can choice is eligible and the first inequality of the lemma follows. Specifically, A 4)b. choose Sb < (s' A s'l To prove the second inequality, we first note that, since | logx| < 2\y/x ? 1| for For the second h2(p,q)/P(p/q)b, take the same form. If h2(p, x>l, / 2 P(k)g^ ij?>lJ<4P( Next, with I? /i-lj \ =4h2(p,q). r as in the first part of the proof, < , Sh2(p, q) + 2r2(s)h2(p, q) + 2sbr2(s)P^j for s < of the third property of r. (The power of 4 in the first line ? < 1.)We can use the second of the array can be lowered to 2 or 0, as \y/q/p 11 = r to finish the next to and choose of bound sb r(s) property h2(p, q)/P(p/q)b < a we can ^ choose sb (sf proof. Specifically, s^>2)b' sfl>2, in view aw LEMMA 8.2. If p,pn,Poo Pn -> Poo as n -> oo, then liminf^oo If Xn = pn/p can write mean. in We in L\(p) densities probability > P P log(p/p?) log(p/poo). ? such that then Xn -> X in P-probability Poo/p, sum of P(logX?)lxn>i as and the and Plog(pn/p) < < is Because 0 x, the sequence (logx)l^>i (logXw)lxn>i P(logX?)lxn<i. in is in absolute value by the sequence dominated hence, and, |Xn|, uniformly theorem, we have convergence tegrable. By a suitable version of the dominated are -> Because the variables (logZn)lXn<i P(logX00)lx00>i. P(logX?)lLxn>i < we can apply Fatou's lemma to see that limsupP(logXw)lxn<i nonnegative, Proof. P(logXoo)lXoo<i. and X 877 BAYESIAN MISSPECIFICATION REFERENCES [1] Berk, R. H. (1966). Limiting of posterior distributions behavior Ann. Math. Statist. 37 51-58. MR0189176 [Corrigendum L. [2] BlRGE, (1983). Z. Wahrsch. O. [3] BUNKE, [4] DiACONis, models. Ann. [6] P. and l'estimation. de under estimates possibly MR1626075 the consistency On (1986). theorie of Bayes behavior Asymptotic 26 617-644. D. et metriques MR0722129 (1998). D. Freedman, of Bayes estimates Bayes estimates of problems. Ann. (with discus On (1986). inconsistent location. MR0829556 A Bayesian (1973). espaces is incorrect. 1-67. MR0829555 14 68-87. Statist. 181-238. Statist. 14 Statist. T. S. Ferguson, Ann. P. and FREEDMAN, sion). Ann. [5] DIACONIS, X. and MlLHAUD, incorrect les dans Approximation Verw. Gebiete 65 the model when 37 745-746.] of some analysis non-parametric Statist. 1 209-230. MR0350949 [7] FERGUSON, T. S. Prior (1974). on distributions of probability spaces measures. Ann. 2 Statist. 615-629. MR0438568 [8] Ghosal, J. K. and van S., Ghosh, Ann. distributions. S. and van [9] Ghosal, imum Statist. der likelihood der A. W. Vaart, estimation and Bayes A. W. Vaart, 28 500-531. (2000). rates of posterior Convergence MR1790007 and rates of convergence (2001). Entropies for mixtures of normal Ann. densities. for max Statist. 29 1233-1263. MR1873329 [10] Le Cam, L. M. (1986). Asymptotic in Statistical Methods Decision Theory. Springer, New York. MR0856411 R. E. [11] MEGGINSON, (1998). An Introduction to Banach Space Theory. 
New Springer, York. MR 1650235 [12] of maximum likelihood estimators for certain nonparamet Consistency in particular: Mixtures. J. Statist. Plann. 19 137-158. MR0944202 Inference Z Wahrsch. Verw. Gebiete 4 10-26. MR0184378 (1965). On Bayes procedures. J. (1988). Pfanzagl, ric families, [13] Schwartz, [14] Shen, L. X. Ann. [15] STRASSER, [16] Wong, W. gence and Wasserman, Statist. H. H. L. 29 687-714. (1985). Mathematical and Shen, X. rates of sieve MLE's. (2001). Rates of of convergence distributions. posterior MR1865337 Berlin. MR0812467 de Gruyter, Theory of Statistics, for likelihood ratios and conver Probability inequalities Ann. Statist. MR 1332570 23 339-362. (1995). of Mathematics Department Vrije Universiteit De Boelelaan Amsterdam 1081a 1081 HV Amsterdam The Netherlands E-mail: aad@cs.vu.nl kleijn@cs.vu.nl