Theoretical Statistics. Lecture 1. Peter Bartlett

1. Organizational issues.
2. Overview.
3. Stochastic convergence.

Organizational Issues

• Lectures: Tue/Thu 11am–12:30pm, 332 Evans.
• Instructor: Peter Bartlett. bartlett@stat. Office hours: Tue 1–2pm, Wed 1:30–2:30pm (Evans 399).
• GSI: Siqi Wu. siqi@stat. Office hours: Mon 3:30–4:30pm, Tue 3:30–4:30pm (Evans 307).
• Course website: http://www.stat.berkeley.edu/∼bartlett/courses/210b-spring2013/ Check it for announcements, homework assignments, ...
• Texts: Asymptotic Statistics, Aad van der Vaart. Cambridge, 1998. Convergence of Stochastic Processes, David Pollard. Springer, 1984. Available online at http://www.stat.yale.edu/∼pollard/1984book/
• Assessment: Homework assignments (60%), posted on the website. Final exam (40%), scheduled for Thursday, 5/16/13, 8–11am.
• Required background: Stat 210A, and either Stat 205A or Stat 204.

Asymptotics: Why?

Example: We have a sample of size n from a density $p_\theta$. Some estimator gives $\hat\theta_n$.
• Consistent? That is, does $\hat\theta_n \to \theta$? Stochastic convergence.
• Rate? Is it optimal? There are often no finite-sample optimality results; is it asymptotically optimal?
• Variance of the estimate? Optimal? Asymptotically?
• Distribution of the estimate? Confidence region? Asymptotically?

Asymptotics: Approximate Confidence Regions

Example: We have a sample of size n from a density $p_\theta$, $\theta \in \mathbb{R}^k$. The maximum likelihood estimator gives $\hat\theta_n$. Under mild conditions, $\sqrt{n}(\hat\theta_n - \theta)$ is asymptotically $N(0, I_\theta^{-1})$, where $I_\theta$ is the Fisher information. Thus
$$\sqrt{n}\, I_\theta^{1/2}(\hat\theta_n - \theta) \rightsquigarrow N(0, I), \qquad n(\hat\theta_n - \theta)^T I_\theta (\hat\theta_n - \theta) \rightsquigarrow \chi^2_k,$$
so we have an approximate $1-\alpha$ confidence region for $\theta$:
$$\left\{ \theta : (\theta - \hat\theta_n)^T I_{\hat\theta_n} (\theta - \hat\theta_n) \le \frac{\chi^2_{k,\alpha}}{n} \right\}.$$
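The following simulation sketch (not part of the slides) checks the coverage of the approximate confidence region above. The model, parameter values, and use of NumPy/SciPy are illustrative choices: $X_1,\ldots,X_n$ i.i.d. exponential with rate $\theta$, so the MLE is $\hat\theta_n = 1/\bar X_n$ and the Fisher information is $I_\theta = 1/\theta^2$.

```python
# Minimal sketch: empirical coverage of the approximate chi-squared confidence region.
# Assumed (illustrative) model: X_i ~ Exponential(rate=theta), MLE = 1/Xbar, I_theta = 1/theta^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, n, alpha, reps = 2.0, 200, 0.05, 5000
chi2_crit = stats.chi2.ppf(1 - alpha, df=1)        # chi^2_{k,alpha} with k = 1

covered = 0
for _ in range(reps):
    x = rng.exponential(scale=1 / theta, size=n)
    theta_hat = 1 / x.mean()                        # MLE
    info_hat = 1 / theta_hat**2                     # plug-in Fisher information I_{theta_hat}
    # theta is in the region iff (theta - theta_hat)^2 * I_{theta_hat} <= chi2_crit / n
    covered += (theta - theta_hat) ** 2 * info_hat <= chi2_crit / n

print(f"empirical coverage ≈ {covered / reps:.3f}  (nominal {1 - alpha})")
```

For moderately large n, the empirical coverage should be close to the nominal level $1-\alpha$, reflecting the asymptotic chi-squared approximation.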
Overview of the Course

1. Tools for consistency, rates, asymptotic distributions:
   • Stochastic convergence.
   • Concentration inequalities.
   • Projections.
   • U-statistics.
   • Delta method.
2. Tools for richer settings (e.g., function spaces versus $\mathbb{R}^k$):
   • Uniform laws of large numbers.
   • Empirical process theory.
   • Metric entropy.
   • Functional delta method.
3. Tools for asymptotics of likelihood ratios:
   • Contiguity.
   • Local asymptotic normality.
4. Asymptotic optimality:
   • Efficiency of estimators.
   • Efficiency of tests.
5. Applications:
   • Nonparametric regression.
   • Nonparametric density estimation.
   • M-estimators.
   • Bootstrap estimators.

Convergence in Distribution

$X_1, X_2, \ldots$ and $X$ are random vectors.

Definition: $X_n$ converges in distribution (or weakly converges) to $X$ (written $X_n \rightsquigarrow X$) means that their distribution functions satisfy $F_n(x) \to F(x)$ at all continuity points $x$ of $F$.

Review: Other Types of Convergence

$d$ is a distance on $\mathbb{R}^k$ (for which the Borel $\sigma$-algebra is the usual one).

Definition: $X_n$ converges almost surely to $X$ (written $X_n \xrightarrow{as} X$) means that $d(X_n, X) \to 0$ a.s.

Definition: $X_n$ converges in probability to $X$ (written $X_n \xrightarrow{P} X$) means that, for all $\epsilon > 0$, $P(d(X_n, X) > \epsilon) \to 0$.

Theorem:
$$X_n \xrightarrow{as} X \implies X_n \xrightarrow{P} X \implies X_n \rightsquigarrow X, \qquad X_n \xrightarrow{P} c \iff X_n \rightsquigarrow c.$$

NB: For $X_n \xrightarrow{as} X$ and $X_n \xrightarrow{P} X$, the $X_n$ and $X$ must be functions on the sample space of the same probability space; this is not required for convergence in distribution.

Convergence in Distribution: Equivalent Definitions

Theorem [Portmanteau]: The following are equivalent:
1. $P(X_n \le x) \to P(X \le x)$ for all continuity points $x$ of $P(X \le \cdot)$.
2. $Ef(X_n) \to Ef(X)$ for all bounded, continuous $f$.
3. $Ef(X_n) \to Ef(X)$ for all bounded, Lipschitz $f$.
4. $E e^{it^T X_n} \to E e^{it^T X}$ for all $t \in \mathbb{R}^k$. (Lévy's continuity theorem)
5. $t^T X_n \rightsquigarrow t^T X$ for all $t \in \mathbb{R}^k$. (Cramér–Wold device)
6. $\liminf Ef(X_n) \ge Ef(X)$ for all nonnegative, continuous $f$.
7. $\liminf P(X_n \in U) \ge P(X \in U)$ for all open $U$.
8. $\limsup P(X_n \in F) \le P(X \in F)$ for all closed $F$.
9. $P(X_n \in B) \to P(X \in B)$ for all continuity sets $B$ (i.e., sets with $P(X \in \partial B) = 0$).

Example [Why do we need continuity?]: Consider $f(x) = 1[x > 0]$ and $X_n = 1/n$. Then $X_n \to 0$ and $f(X_n) \to 1$, but $f(0) = 0$.

Example [Why do we need boundedness?]: Consider $f(x) = x$ and
$$X_n = \begin{cases} n & \text{w.p. } 1/n, \\ 0 & \text{w.p. } 1 - 1/n. \end{cases}$$
Then $X_n \rightsquigarrow 0$ and $Ef(X_n) \to 1$, but $f(0) = 0$.

Relating Convergence Properties

Theorem:
• $X_n \rightsquigarrow X$ and $d(X_n, Y_n) \xrightarrow{P} 0$ imply $Y_n \rightsquigarrow X$.
• $X_n \rightsquigarrow X$ and $Y_n \xrightarrow{P} c$ imply $(X_n, Y_n) \rightsquigarrow (X, c)$.
• $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$ imply $(X_n, Y_n) \xrightarrow{P} (X, Y)$.

Example: NB: it is NOT true that $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ imply $(X_n, Y_n) \rightsquigarrow (X, Y)$ (joint convergence versus marginal convergence in distribution). Consider $X, Y$ independent $N(0,1)$, $X_n \sim N(0,1)$, and $Y_n = -X_n$. Then $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$, but $(X_n, Y_n) \rightsquigarrow (X, -X)$, which has a very different distribution from that of $(X, Y)$.

Relating Convergence Properties: Continuous Mapping

Suppose $f : \mathbb{R}^k \to \mathbb{R}^m$ is "almost surely continuous" (i.e., for some $S$ with $P(X \in S) = 1$, $f$ is continuous on $S$).

Theorem [Continuous mapping]:
• $X_n \rightsquigarrow X \implies f(X_n) \rightsquigarrow f(X)$.
• $X_n \xrightarrow{P} X \implies f(X_n) \xrightarrow{P} f(X)$.
• $X_n \xrightarrow{as} X \implies f(X_n) \xrightarrow{as} f(X)$.

Example: For $X_1, \ldots, X_n$ i.i.d. with mean $\mu$ and variance $\sigma^2$, we have
$$\frac{\sqrt{n}(\bar X_n - \mu)}{\sigma} \rightsquigarrow N(0, 1), \qquad \text{so} \qquad \frac{n(\bar X_n - \mu)^2}{\sigma^2} \rightsquigarrow (N(0,1))^2 = \chi^2_1.$$

Example: We also have $\bar X_n - \mu \rightsquigarrow 0$, hence $(\bar X_n - \mu)^2 \rightsquigarrow 0$. Consider $f(x) = 1[x > 0]$. Then $f((\bar X_n - \mu)^2) \rightsquigarrow 1 \ne f(0)$. (The problem is that $f$ is not continuous at 0, and $P_X(\{0\}) > 0$ for the $X$ satisfying $(\bar X_n - \mu)^2 \rightsquigarrow X$.)

Relating Convergence Properties: Slutsky's Lemma

Theorem: $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$ imply
$$X_n + Y_n \rightsquigarrow X + c, \qquad Y_n X_n \rightsquigarrow cX, \qquad Y_n^{-1} X_n \rightsquigarrow c^{-1} X.$$
(Why do $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ not imply $X_n + Y_n \rightsquigarrow X + Y$?)

Relating Convergence Properties: Examples

Theorem: For i.i.d. $Y_i$ with $EY_1 = \mu$ and $\mathrm{Var}(Y_1) = \sigma^2 < \infty$,
$$\sqrt{n}\,\frac{\bar Y_n - \mu}{S_n} \rightsquigarrow N(0, 1), \qquad \text{where } \bar Y_n = n^{-1}\sum_{i=1}^n Y_i, \quad S_n^2 = (n-1)^{-1}\sum_{i=1}^n (Y_i - \bar Y_n)^2.$$

Proof: Write
$$S_n^2 = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^n Y_i^2 - \bar Y_n^2\right).$$
Here $n/(n-1) \to 1$, while $\frac{1}{n}\sum_i Y_i^2 \xrightarrow{P} EY_1^2$ and $\bar Y_n \xrightarrow{P} EY_1$ (weak law of large numbers), so
$$S_n^2 \xrightarrow{P} EY_1^2 - (EY_1)^2 = \sigma^2$$
(continuous mapping theorem, Slutsky's lemma). Also
$$\sqrt{n}(\bar Y_n - \mu) \rightsquigarrow N(0, \sigma^2) \quad \text{(central limit theorem)}, \qquad \frac{1}{S_n} \xrightarrow{P} \frac{1}{\sigma},$$
so
$$\sqrt{n}\,\frac{\bar Y_n - \mu}{S_n} \rightsquigarrow N(0, 1)$$
(continuous mapping theorem, Slutsky's lemma).

Showing Convergence in Distribution

Recall that the characteristic function characterizes weak convergence:
$$X_n \rightsquigarrow X \iff E e^{it^T X_n} \to E e^{it^T X} \text{ for all } t \in \mathbb{R}^k.$$

Theorem [Lévy's continuity theorem]: If $E e^{it^T X_n} \to \phi(t)$ for all $t \in \mathbb{R}^k$, and $\phi : \mathbb{R}^k \to \mathbb{C}$ is continuous at 0, then $X_n \rightsquigarrow X$, where $E e^{it^T X} = \phi(t)$.

Special case: $X_n = Y$ for every $n$. So $X$ and $Y$ have the same distribution iff $\phi_X = \phi_Y$.

Theorem [Weak law of large numbers]: Suppose $X_1, \ldots, X_n$ are i.i.d. Then $\bar X_n \xrightarrow{P} \mu$ iff $\phi'_{X_1}(0) = i\mu$.

Proof: We'll show that $\phi'_{X_1}(0) = i\mu$ implies $\bar X_n \xrightarrow{P} \mu$. Indeed,
$$E e^{it\bar X_n} = \phi_{X_1}(t/n)^n = \left(1 + \frac{it\mu}{n} + o(1/n)\right)^n \to e^{it\mu} = \phi_\mu(t).$$
Lévy's theorem implies $\bar X_n \rightsquigarrow \mu$, hence $\bar X_n \xrightarrow{P} \mu$.

e.g., $X \sim N(\mu, \Sigma)$ has characteristic function
$$\phi_X(t) = E e^{it^T X} = e^{it^T \mu - t^T \Sigma t / 2}.$$

Theorem [Central limit theorem]: Suppose $X_1, \ldots, X_n$ are i.i.d. with $EX_1 = 0$ and $EX_1^2 = 1$. Then $\sqrt{n}\,\bar X_n \rightsquigarrow N(0, 1)$.

Proof: $\phi_{X_1}(0) = 1$, $\phi'_{X_1}(0) = iEX_1 = 0$, $\phi''_{X_1}(0) = i^2 EX_1^2 = -1$. Hence
$$E e^{it\sqrt{n}\bar X_n} = \phi_{X_1}(t/\sqrt{n})^n = \left(1 + 0 - \frac{t^2}{2n} + o(1/n)\right)^n \to e^{-t^2/2} = \phi_{N(0,1)}(t).$$
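The following sketch (not in the slides) numerically illustrates the studentized-mean result above, which combines the weak law of large numbers, the central limit theorem, the continuous mapping theorem, and Slutsky's lemma. The choice of an exponential population and the values of n and the replication count are illustrative.

```python
# Minimal sketch: sqrt(n)*(Ybar_n - mu)/S_n is approximately N(0,1) even for a skewed
# population.  Assumed (illustrative) model: Y_i ~ Exponential(1), so mu = 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, mu = 500, 10000, 1.0

y = rng.exponential(scale=1.0, size=(reps, n))
ybar = y.mean(axis=1)
s = y.std(axis=1, ddof=1)                 # S_n, the (n-1)-normalized sample std dev
t = np.sqrt(n) * (ybar - mu) / s          # the studentized mean

# Compare empirical tail probabilities of t with standard normal tail probabilities.
for z in (1.0, 1.645, 1.96):
    print(f"P(T > {z:5.3f}): empirical {np.mean(t > z):.3f}, "
          f"N(0,1) {1 - stats.norm.cdf(z):.3f}")
```

The empirical tail probabilities should be close to the standard normal ones for moderately large n, despite the skewness of the exponential population.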
Uniformly Tight

Definition: $X$ is tight means that for all $\epsilon > 0$ there is an $M$ for which $P(\|X\| > M) < \epsilon$.

$\{X_n\}$ is uniformly tight (or bounded in probability) means that for all $\epsilon > 0$ there is an $M$ for which
$$\sup_n P(\|X_n\| > M) < \epsilon$$
(so there is a compact set that contains each $X_n$ with high probability).

Theorem [Prohorov's theorem]:
1. $X_n \rightsquigarrow X$ implies $\{X_n\}$ is uniformly tight.
2. $\{X_n\}$ uniformly tight implies that for some $X$ and some subsequence, $X_{n_j} \rightsquigarrow X$.

Notation for rates: $o_P$, $O_P$

Definition:
$$X_n = o_P(1) \iff X_n \xrightarrow{P} 0, \qquad\quad X_n = o_P(R_n) \iff X_n = Y_n R_n \text{ and } Y_n = o_P(1),$$
$$X_n = O_P(1) \iff \{X_n\} \text{ uniformly tight}, \qquad X_n = O_P(R_n) \iff X_n = Y_n R_n \text{ and } Y_n = O_P(1).$$
(That is, $o_P$ and $O_P$ specify rates of growth of a sequence: $o_P$ means strictly slower (the sequence $Y_n$ converges in probability to zero); $O_P$ means within some constant (the sequence $Y_n$ lies in a ball with high probability).)

Relations between rates:
$$o_P(1) + o_P(1) = o_P(1), \qquad o_P(1) + O_P(1) = O_P(1), \qquad o_P(1)\, O_P(1) = o_P(1),$$
$$(1 + o_P(1))^{-1} = O_P(1), \qquad o_P(O_P(1)) = o_P(1),$$
$$X_n \xrightarrow{P} 0,\ R(h) = o(\|h\|^p) \implies R(X_n) = o_P(\|X_n\|^p),$$
$$X_n \xrightarrow{P} 0,\ R(h) = O(\|h\|^p) \implies R(X_n) = O_P(\|X_n\|^p).$$
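As a final sketch (not in the slides), the $O_P$ notation above can be checked numerically: for i.i.d. $X_i$ with finite variance, $\bar X_n - \mu = O_P(n^{-1/2})$, i.e., $Y_n = \sqrt{n}(\bar X_n - \mu)$ is uniformly tight. The uniform population, the tightness radius M, and the grid of n values are illustrative choices.

```python
# Minimal sketch: Y_n = sqrt(n)*(Xbar_n - mu) is uniformly tight (bounded in probability),
# so Xbar_n - mu = O_P(n^{-1/2}).  Assumed (illustrative) model: X_i ~ Uniform(0,1), mu = 0.5.
import numpy as np

rng = np.random.default_rng(2)
mu, reps, M = 0.5, 5000, 0.6              # M is an illustrative tightness radius

for n in (10, 100, 1000):
    x = rng.uniform(size=(reps, n))
    y = np.sqrt(n) * (x.mean(axis=1) - mu)     # the rescaled sequence Y_n
    # P(|Y_n| > M) stays uniformly small in n, consistent with sup_n P(|Y_n| > M) < eps.
    print(f"n={n:5d}   P(|Y_n| > {M}) ≈ {np.mean(np.abs(y) > M):.4f}")
```

The printed probabilities should remain small and roughly constant across n; by contrast, the unrescaled quantity $n(\bar X_n - \mu)$ would not be uniformly tight.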