Scandinavian Journal of Statistics, Vol. 36: 671–685, 2009
doi: 10.1111/j.1467-9469.2009.00651.x
© 2009 Board of the Foundation of the Scandinavian Journal of Statistics. Published by Blackwell Publishing Ltd.

Empirical Likelihood Confidence Intervals for Response Mean with Data Missing at Random

LIUGEN XUE
College of Applied Sciences, Beijing University of Technology

ABSTRACT. A kernel regression imputation method for missing response data is developed. A class of bias-corrected empirical log-likelihood ratios for the response mean is defined. It is shown that any member of our class of ratios is asymptotically chi-squared, and the corresponding empirical likelihood confidence interval for the response mean is constructed. Our ratios share some of the desired features of the existing methods: they are self-scale invariant, and no plug-in estimators for the adjustment factor and asymptotic variance are needed; when estimating the non-parametric function in the model, undersmoothing to ensure root-n consistency of the estimator for the parameter is avoided. Since the range of bandwidths contains the optimal bandwidth for estimating the regression function, the existing data-driven algorithm is valid for selecting an optimal bandwidth. We also study the normal approximation-based method. A simulation study is undertaken to compare the empirical likelihood with the normal approximation method in terms of coverage accuracies and average lengths of confidence intervals.

Key words: bandwidth, confidence interval, empirical likelihood, kernel regression imputation method, missing at random, response mean

1. Introduction

Missing response data often arise in various experimental settings, including market research surveys, medical studies, opinion polls and socioeconomic investigations. Statistical analysis with missing data is a very difficult task, since in most cases the missing data themselves contain little or no information about the missing data mechanism (MDM).
The fundamental and most widely used assumption about the MDM is that it is a missing at random (MAR) model (Rubin, 1976). The basic idea of MAR is that the probability that a response variable is observed can depend only on the values of those other variables that have been observed. This concept has been extensively studied, and effective computational methods for handling missing data under the MAR assumption have been well developed. Let X be a d-dimensional vector of factors and let Y be a response variable influenced by X. In practice, one often obtains a random sample of incomplete data {(X_i, Y_i, δ_i); 1 ≤ i ≤ n}, where all the X_i are observed and δ_i = 0 if Y_i is missing, δ_i = 1 otherwise. This class of missing data can arise because of a double or two-phase sampling scheme first proposed by Neyman (1938). The data may also arise from other distinctive sources. Typically, they may occur in any experimental situation where the treatment is susceptible to contamination or subject mortality. To estimate the mean of Y, say θ, from the incomplete data {(X_i, Y_i, δ_i); 1 ≤ i ≤ n}, a common method is to impute (i.e. fill in) a plausible value for each missing datum and then construct an estimator from the imputed data as if they were complete data. Cheng (1994) applied kernel regression imputation to estimate θ. He gave an estimator of θ, say θ̂_C, and established the asymptotic normality of a modified version of θ̂_C under the assumption that the Y values are MAR. Hahn (1998) established the semi-parametric efficiency bound for the estimation of θ, and constructed an estimator based on the propensity score p(x) that achieves the bound. These results can be used to perform interval estimation and hypothesis testing on θ. Other works include Wang & Rao (2002a,b) and Chen et al. (2006). A competitive method for constructing a confidence interval for θ is the empirical likelihood method, introduced by Owen (1988, 1990).
It has many advantages over other methods, such as those based on normal approximations or the bootstrap (Hall & La Scala, 1990). Many authors have developed empirical likelihood methods for non- and semi-parametric regression models. Related works include Chen & Hall (1993), Kitamura (1997), Chen & Sitter (1999), Peng (2004), Wang et al. (2004), Zhu & Xue (2006), Xue & Zhu (2006, 2007a,b) and Stute et al. (2007), among others. Qin & Zhang (2007) employed an empirical likelihood method to obtain a constrained empirical likelihood estimator of the response mean under the assumption that responses are MAR. With the non-parametric kernel regression imputation scheme, Wang & Rao (2002a) developed imputed empirical likelihood approaches for constructing confidence intervals for θ. Their main idea is to first impute the missing Y-values by kernel regression imputation and then construct a complete-data empirical likelihood for θ from the imputed data set as if the imputed values were independent and identically distributed (i.i.d.) observations. However, the imputed data are not i.i.d. because a plug-in estimator is used. As a consequence, the empirical log-likelihood ratio under imputation is asymptotically distributed as a scaled chi-square variable, and therefore cannot be applied directly to make a statistical inference on θ. This motivated them to adjust the empirical log-likelihood ratio so that the adjusted ratio is asymptotically chi-squared; the adjustment multiplies the ratio by an estimated adjustment factor. However, there are two issues: first, the adjustment factor is very complicated and contains several unknowns to be estimated; second, the undersmoothing required for estimating the unknown function makes bandwidth selection difficult. In addition, we need to point out that theorem 2.1 in Hjort et al.
(2009) cannot be applied directly in practice, although they provided a general framework for ratios based on plug-in estimation, because their theorem does not answer how to construct an auxiliary random vector for a particular model. In this paper, we construct a weight-corrected empirical log-likelihood ratio for θ such that the ratio is asymptotically chi-squared. With auxiliary information, we also construct a weight-corrected empirical log-likelihood ratio for θ, and it is shown that this ratio too has an asymptotic chi-squared distribution. To compare the empirical likelihood method with the normal approximation method, we also construct a weighted estimator and a maximum empirical likelihood estimator of θ, and derive their asymptotic behaviour. Our results can be used directly to construct confidence intervals for θ. Zhu & Xue (2006) proposed the bias-corrected method for constructing the empirical likelihood ratio. One main feature of our approach is to directly calibrate the empirical likelihood ratio so that the resulting ratio is asymptotically chi-squared. As the ratio does not need to be multiplied by an adjustment factor, the difficulty of estimating an unknown adjustment factor is avoided. This is especially attractive when the adjustment factor is difficult to estimate efficiently. In addition, we do not need undersmoothing when selecting the bandwidth: the range of admissible bandwidths contains the optimal bandwidth for estimating the regression function, so the existing data-driven algorithm is valid for selecting an optimal bandwidth. The rest of this paper is organized as follows. In section 2, our methods are elaborated and our main results are given. In section 3, a simulation study is conducted to compare the empirical likelihood with the normal approximation method in terms of coverage accuracies and average lengths of confidence intervals. In section 4, concluding remarks are given.
Proofs of the theorems are relegated to the Appendix.

2. Methods and results

Throughout this paper, we make the MAR assumption for the Y values. The MAR assumption implies that δ and Y are conditionally independent given X; that is, P(δ = 1 | Y, X) = P(δ = 1 | X), denoted by p(x). p(·) is called the selection probability function.

2.1. Weight-corrected empirical likelihood

To construct the empirical likelihood ratio function for θ, Wang & Rao (2002a) applied kernel regression imputation to introduce the auxiliary random variables

\tilde{Y}_i = \delta_i Y_i + (1 - \delta_i)\hat{m}_b(X_i), \quad i = 1, \ldots, n,  (1)

where m̂_b(x) is a truncated version of the estimator of m(x) = E(Y | X = x); that is,

\hat{m}_b(x) = \frac{(nh^d)^{-1}\sum_{i=1}^n \delta_i Y_i K_h(X_i - x)}{\max\{b, (nh^d)^{-1}\sum_{i=1}^n \delta_i K_h(X_i - x)\}}.  (2)

Here each of h = h_n and b = b_n is a sequence of positive constants tending to zero, K_h(·) = K(·/h), and K(·) is a kernel function. Using the Ỹ_i, Wang & Rao (2002a) constructed an estimated empirical log-likelihood ratio function, say l̃(θ). However, the asymptotic distribution of l̃(θ) is not standard chi-squared; l̃(θ) is asymptotically distributed as a scaled chi-square variable with one degree of freedom. Thus, l̃(θ) must be adjusted because it cannot be used directly to make a statistical inference on θ; the adjustment multiplies the ratio by an estimated factor. We now directly construct a weight-corrected empirical log-likelihood ratio statistic for θ such that the statistic is asymptotically chi-square distributed without the need for an adjustment factor. Since Ỹ_i contains the estimator m̂_b(X_i), there is a bias m̂_b(X_i) − m(X_i) in Ỹ_i. To reduce the bias, we use the approach of weighted imputation. Therefore, a new auxiliary variable Ŷ_i, i = 1, . .
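As a concrete illustration, the truncated estimator m̂_b of equation (2) can be sketched in the univariate case (d = 1) as follows. This is only a sketch: the Epanechnikov kernel matches the one used later in the simulations, but the function names and the particular parameter values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def epanechnikov(u):
    """Second-order kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1)

def m_hat_b(x, X, Y, delta, h, b):
    """Truncated kernel regression estimate of m(x) = E(Y | X = x), eq. (2), d = 1.
    Only observed responses (delta_i = 1) contribute; the max(b, .) in the
    denominator bounds the estimate where the local sample is sparse."""
    w = delta * epanechnikov((X - x) / h)        # delta_i * K_h(X_i - x)
    y = np.where(delta == 1, Y, 0.0)             # missing Y_i never enter the sum
    num = np.sum(w * y) / (len(X) * h)           # (n h)^{-1} sum delta_i Y_i K_h
    den = max(b, np.sum(w) / (len(X) * h))       # truncation at b
    return num / den
```

For a linear regression function the estimate at an interior point converges to the truth as n grows, which is the behaviour the truncation is designed not to disturb when the density of X is bounded away from zero.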
., n, depending on the estimated response probabilities p̂(X_i), is defined by

\hat{Y}_i = \frac{\delta_i}{\hat{p}(X_i)} Y_i + \left(1 - \frac{\delta_i}{\hat{p}(X_i)}\right)\hat{m}_b(X_i),  (3)

where m̂_b(x) is defined in (2), and p̂(x) is the estimator of p(x); that is,

\hat{p}(x) = \frac{\sum_{i=1}^n \delta_i L_a(X_i - x)}{\max\{1, \sum_{i=1}^n L_a(X_i - x)\}}.  (4)

Here a = a_n is a sequence of positive constants tending to zero, L_a(·) = L(·/a), and L(·) is a kernel function. A weight-corrected empirical log-likelihood ratio function for θ is then defined as

\hat{l}(\theta) = -2\max \sum_{i=1}^n \log(np_i),

where the maximum is taken over all sets of non-negative numbers p_1, ..., p_n that sum to 1 and such that \sum_{i=1}^n p_i \hat{Y}_i = \theta. By the Lagrange multiplier method, when min_{1≤i≤n} Ŷ_i < θ < max_{1≤i≤n} Ŷ_i, the ratio l̂(θ) can be represented as

\hat{l}(\theta) = 2\sum_{i=1}^n \log\{1 + \lambda(\hat{Y}_i - \theta)\},  (5)

where λ = λ(θ) is the solution of the equation

\sum_{i=1}^n \frac{\hat{Y}_i - \theta}{1 + \lambda(\hat{Y}_i - \theta)} = 0.  (6)

Since the bias in (3) is corrected, it can be derived that l̂(θ) has an asymptotic chi-squared distribution. This result is given in theorem 1. Denote by f(x) and F(x) the probability density and distribution functions of X, respectively. Let g(x) = p(x)f(x). Assume that ‖Z‖ = \sum_{i=1}^d |z_i| for any vector Z = (z_1, ..., z_d)^T. The following conditions are needed for our results.

(C1) The selection probability function p(x), the X-density f(x) and m(x) all have bounded partial derivatives up to order r with r ≥ 2 and r > d/2, and inf_x p(x) > 0.
(C2) sup_x E(Y² | X = x) < ∞.
(C3) √n E[|m(X)| I{g(X) < 2b}] → 0, where b is defined as in m̂_b(x).
(C4) √n P{‖X‖ > M_n} → 0, where 0 < M_n → ∞.
(C5) K(·) is a non-negative and bounded kernel function of order r with compact support, where r ≥ 2 and r > d/2.
(C6) L(·) is a bounded kernel function of order r with r ≥ 2 and r > d/2, and c_1 I{‖u‖ ≤ ρ} ≤ L(u) ≤ c_2 I{‖u‖ ≤ ρ} for some finite constants ρ > 0 and c_2 ≥ c_1 > 0.
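Equations (3)–(6) can be sketched numerically as follows (univariate case). The uniform kernel for L, the root-bracketing strategy for λ, and all function names are illustrative assumptions; the construction of the bias-corrected Ŷ_i and the Lagrange-multiplier equation follow the display equations directly.

```python
import numpy as np
from scipy.optimize import brentq

def p_hat(x, X, delta, a):
    """Selection-probability estimate of eq. (4) with L(u) = 0.5*I(|u| <= 1);
    the max(1, .) in the denominator avoids division by zero."""
    L = 0.5 * (np.abs((X - x) / a) <= 1)
    return np.sum(delta * L) / max(1.0, np.sum(L))

def bias_corrected_responses(X, Y, delta, p_of, m_of):
    """Auxiliary variables of eq. (3):
    Yhat_i = (delta_i / p(X_i)) Y_i + (1 - delta_i / p(X_i)) m_b(X_i)."""
    p = np.array([p_of(x) for x in X])
    m = np.array([m_of(x) for x in X])
    y = np.where(delta == 1, Y, 0.0)
    return delta * y / p + (1.0 - delta / p) * m

def el_ratio(theta, Y_hat):
    """Weight-corrected empirical log-likelihood ratio l_hat(theta), eq. (5):
    solve eq. (6) for the Lagrange multiplier lambda, then plug it in."""
    z = Y_hat - theta
    if z.min() >= 0.0 or z.max() <= 0.0:
        return np.inf                      # theta outside (min Yhat, max Yhat)
    eps = 1e-10
    lo = (-1.0 + eps) / z.max()            # keep all 1 + lam * z_i > 0
    hi = (-1.0 + eps) / z.min()
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * z))
```

At θ equal to the sample mean of the Ŷ_i the multiplier is λ = 0 and the ratio is zero; it grows as θ moves towards either end of the data range, which is what makes the chi-square calibration of the next theorem usable.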
(C7) nh^{2d}b^{4} → ∞ and nh^{4r}b^{−4} → 0, where r is the order of the kernel K.
(C8) na^{2d}M_n^{−2d} → ∞ and na^{4r} → 0, where r is the order of the kernel L.

Remark 1. Conditions (C1), (C2), (C5) and (C6) are standard assumptions for non-parametric regression problems. In particular, p(x) being bounded away from zero in (C1) implies that data cannot be missing with probability 1 anywhere in the domain of the X variable. Conditions (C3) and (C4) are commonly used to avoid the boundary problem. Condition (C3) has been used by Zhu & Fang (1996) and Wang & Rao (2002a). Condition (C4) is satisfied in the following three cases: (a) the distribution of X has compact support; (b) X has a density function f(x), and there exist positive constants c and γ such that f(x) ≤ c exp(−γ‖x‖) when ‖x‖ is large enough; (c) X has a density function f(x), and there exist positive constants c and γ such that f(x) ≤ c‖x‖^{−γ} when ‖x‖ is large enough. For example, the uniform distribution satisfies (a), the normal and exponential distributions satisfy (b), and the Cauchy distribution satisfies (c). Also, conditions (C3) and (C4) are simultaneously satisfied in the following two cases used in the simulation study: (i) X follows a truncated normal distribution; (ii) X follows a standard exponential distribution, and m(X) is proportional to exp(−cX) for c > 0 such that √n b^{1+c} → 0 and M_n = 2 ln n. In condition (C7), nh^{4r}b^{−4} → 0 is required to control the bias induced by kernel smoothing, whereas nh^{2d}b^{4} → ∞ leads to consistent estimation of m(x). When we assume that b = O(n^{−τ}) for some small 0 < τ < 1/4, condition (C7) means that the convergence rate of h has a range between c n^{−(1−4τ)/(2d)} and c̄ n^{−(1+4τ)/(4r)} for some positive constants c < c̄. Thus, when estimating the regression function, the optimal convergence rate n^{−1/(2r+d)} is within this range.
For instance, in the univariate case with d = 1, r = 2 and τ = 1/12, the range is between c n^{−1/3} and c̄ n^{−1/6}, and the optimal bandwidth c_0 n^{−1/5} lies within this range, where c_0 is a positive constant. Therefore, the optimal bandwidth can be chosen by the cross-validation method. Condition (C8) can be explained similarly. It is worth pointing out that condition (C7) relaxes condition (C.h_n) in Wang & Rao (2002a), and overcomes the difficulty in selecting bandwidths.

Let →_D denote convergence in distribution, and let χ²_r be a chi-square variable with r degrees of freedom. Theorem 1 shows that l̂(θ) is asymptotically chi-square distributed with one degree of freedom.

Theorem 1
Suppose that conditions (C1)–(C8) hold. If θ is the true parameter, then l̂(θ) →_D χ²_1.

Let χ²_1(1−α) be the 1−α quantile of χ²_1 for 0 < α < 1. Using theorem 1, we obtain an approximate 1−α confidence interval for θ, defined by

I(α) = {θ̃ : l̂(θ̃) ≤ χ²_1(1−α)}.

Theorem 1 can also be used to test the hypothesis H_0: θ = θ_0; one rejects H_0 at level α if l̂(θ_0) > χ²_1(1−α).

For the confidence interval I(α), the imputed file should provide the response probabilities p̂(X_i) in order to compute the Ŷ_i given by (3). In this way, the confidence interval I(α) can be computed. The curse of dimensionality is an issue with the kernel estimators m̂_b(x) and p̂(x) when the dimension d of X is high. Since the target of the inference is the finite-dimensional θ rather than m(x) and p(x), the curse of dimensionality affects only the small- to moderate-sample performance of the proposed estimator, as long as the biases of the kernel estimators are controlled.
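The interval I(α) from theorem 1 can be computed by inverting the χ²_1 cutoff numerically. The following sketch bundles a compact version of the ratio routine with a root search from the maximum-EL point (the mean of the Ŷ_i, where l̂ = 0); the bracketing scheme and all names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_ratio(theta, Y_hat):
    """l_hat(theta) of eqs (5)-(6), compact version."""
    z = Y_hat - theta
    if z.min() >= 0.0 or z.max() <= 0.0:
        return np.inf
    eps = 1e-10
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)),
                 (-1.0 + eps) / z.max(), (-1.0 + eps) / z.min())
    return 2.0 * np.sum(np.log1p(lam * z))

def el_confint(Y_hat, alpha=0.05):
    """Invert theorem 1's calibration: {theta : l_hat(theta) <= chi2_1(1-alpha)}.
    l_hat vanishes at the mean of Y_hat and increases towards either end of
    the data range, so each endpoint is a one-dimensional root."""
    cut = chi2.ppf(1.0 - alpha, df=1)
    center = np.mean(Y_hat)
    eps = 1e-6 * (Y_hat.max() - Y_hat.min())
    g = lambda t: el_ratio(t, Y_hat) - cut
    lower = brentq(g, Y_hat.min() + eps, center)
    upper = brentq(g, center, Y_hat.max() - eps)
    return lower, upper
```

For a well-behaved sample the resulting interval is close in length to the Wald interval of subsection 2.3, but its endpoints are determined by the data rather than forced to be symmetric about the point estimate.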
When d ≥ 4, controlling the bias requires a kernel of order r > 2, the so-called high-order kernel, so that nh^{4r}b^{−4} → 0 and na^{4r} → 0 hold, instead of nh^{8}b^{−4} → 0 and na^{8} → 0 when a conventional second-order kernel is used. Using a high-order kernel may occasionally cause p̂(x) not to be a proper selection probability function, as the kernel function L(·) may be negative. In this case, we can re-adjust the weights in p̂(x) by a method similar to that used by Hall & Murison (1993) for high-order kernel density estimators. Suppose that we have an auxiliary parametric model for p(x), re-denoted by p(x, β), where β is a q × 1 unknown parameter vector. Then we can use a parametric estimation method to obtain an estimator of β, say β̂. Hence we can obtain an auxiliary variable Y̌_i by substituting p(X_i, β̂) for p̂(X_i) in (3), and obtain a weight-corrected empirical likelihood ratio ľ(θ) by substituting Y̌_i for Ŷ_i in l̂(θ). It can be shown that ľ(θ) and l̂(θ) have the same asymptotic chi-square distribution. The choice of p(x, β) can be the logistic regression function. Similarly, we can also assume that p(x) is a semi-parametric regression function, and the corresponding results can be derived. In either of these cases, the method does not require high-dimensional smoothing operations.

2.2. Weight-corrected empirical likelihood with auxiliary information

We assume that auxiliary information on X of the form E{A(X)} = 0 is available, where A(X) = (A_1(X), ..., A_q(X))^T, q > 0, is a known vector (or scalar) function; for example, the mean or median of X may be known in the scalar X case. Using the auxiliary information, an empirical log-likelihood ratio for θ is defined as

\hat{l}_{AI}(\theta) = -2\max \sum_{i=1}^n \log(n\tilde{p}_i),

where the maximum is taken over all sets of non-negative numbers p̃_1, ..., p̃_n that sum to 1 and such that \sum_{i=1}^n \tilde{p}_i A(X_i) = 0 and \sum_{i=1}^n \tilde{p}_i \hat{Y}_i = \theta. Denote η_i(θ) = (A^T(X_i), Ŷ_i − θ)^T. Provided that the origin is inside the convex hull of the points η_1(θ), . .
., η_n(θ), the method of Lagrange multipliers leads to the representation

\hat{l}_{AI}(\theta) = 2\sum_{i=1}^n \log\{1 + \lambda^T \eta_i(\theta)\},  (7)

where λ satisfies

\frac{1}{n}\sum_{i=1}^n \frac{\eta_i(\theta)}{1 + \lambda^T \eta_i(\theta)} = 0.  (8)

We have the following result.

Theorem 2
Suppose that conditions (C1)–(C8) hold, and let E{A(X)A^T(X)} be a positive definite matrix. If θ is the true parameter, then l̂_{AI}(θ) →_D χ²_{q+1}.

Similar to theorem 1, the result of theorem 2 can be used to construct a confidence interval for θ. It may be noted that l̂_{AI}(θ) reduces to l̂(θ) of subsection 2.1 in the case of no auxiliary information.

2.3. Normal approximation-based method

We now turn to the estimation of θ. The practical motivation for imputation is to provide users with a completed (or imputed) data file with each missing Y_i replaced by m̂_b(X_i). The user then computes the estimate of θ from the imputed file {(Ỹ_i, δ_i); 1 ≤ i ≤ n}, where Ỹ_i is defined in (1). Note that X_i may be available only to the imputer and not reported on the data file. Using the imputed data file, Wang & Rao (2002a) proposed the following estimator of θ:

\hat{\theta}_{WR} = \frac{1}{n}\sum_{i=1}^n \tilde{Y}_i.  (9)

This estimator is similar to the estimator θ̂_C proposed in Cheng (1994); θ̂_WR and θ̂_C have the same asymptotic variance (Wang & Rao, 2002a). We propose a weighted imputation estimator of θ, defined by

\hat{\theta}_{WI} = \frac{1}{n}\sum_{i=1}^n \hat{Y}_i,  (10)

where the Ŷ_i are defined in (3). Alternatively, a variant of θ̂_WI previously considered by Cheng (1990) is the sample average of all the regression estimates; that is,

\tilde{\theta} = \frac{1}{n}\sum_{i=1}^n \hat{m}(X_i),

where m̂(·) is the Nadaraya–Watson kernel estimator of m(·) based on (X_i, Y_i) for i ∈ {i : δ_i = 1}. Compared with the estimator θ̃, our estimator θ̂_WI fully employs the information in the sample {(X_i, Y_i, δ_i); 1 ≤ i ≤ n}.

Remark 2.
If the analyst is using the incomplete data file {(X_i, Y_i, δ_i); 1 ≤ i ≤ n}, then imputation is not needed, and in this case the objective is to give an efficient estimator of θ using the auxiliary variable X_i observed on all the units in the sample. Under this scenario, our estimator θ̂_WI is simply a difference estimator under two-phase sampling as used in the survey context, where simple random sampling is used in the first phase and Poisson sampling with probabilities p̂(X_i) in the second phase.

We may also maximize {−l̂(θ)} to obtain an estimator of the parameter θ, say θ̂_ME, called the maximum empirical likelihood estimator. It can be shown that

\hat{\theta}_{ME} = \hat{\theta}_{WI} + o_P(n^{-1/2}),  (11)

that is, θ̂_WI and θ̂_ME are asymptotically equal. The asymptotic normality of θ̂_WI and θ̂_ME is given in theorem 3.

Theorem 3
Suppose that conditions (C1)–(C8) hold. Then

\sqrt{n}(\hat{\theta} - \theta) \to_D N(0, V),

where θ̂ can be taken to be θ̂_WI or θ̂_ME, and V = E{σ²(X)/p(X)} + var(m(X)) with σ²(x) = var(Y | X = x).

It has been shown by Robins et al. (1995) and Hahn (1998) that V is the lower bound for the asymptotic variance of any regular estimator in a semi-parametric missing data problem. From theorem 3 and lemma A.1 in Wang & Rao (2002a), it follows that θ̂_WR and θ̂_WI defined by (9) and (10) have the same asymptotic variance, and hence theorem 3 is also valid for the estimator θ̂_WR. Therefore, we recommend θ̂_WR as a point estimator because it can be computed from the imputed data file. By the 'plug-in' method, we can define a consistent estimator of the asymptotic variance V of θ̂, say V̂; that is,

\hat{V} = \frac{1}{n}\sum_{i=1}^n (\hat{Y}_i - \hat{\theta})^2,

where θ̂ is taken to be θ̂_WR or θ̂_WI. The estimator V̂ is simpler than the estimator V̂_n(θ̂) defined by Wang & Rao (2002a). From theorem 3, we obtain

\sqrt{n}(\hat{\theta} - \theta)/\hat{V}^{1/2} \to_D N(0, 1).
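The plug-in variance V̂ and the resulting large-sample interval can be sketched as follows; the function name and the use of scipy for the normal quantile are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def normal_approx_ci(Y_hat, alpha=0.05):
    """Wald interval based on theorem 3:
    theta_hat = mean(Y_hat); V_hat = n^{-1} sum (Yhat_i - theta_hat)^2;
    interval theta_hat +/- z_{1-alpha/2} * sqrt(V_hat / n)."""
    n = len(Y_hat)
    theta_hat = np.mean(Y_hat)
    V_hat = np.mean((Y_hat - theta_hat) ** 2)
    half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(V_hat / n)
    return theta_hat - half, theta_hat + half
```

Unlike the empirical likelihood interval, this interval is always symmetric about θ̂, which is one source of the tail-imbalance seen in the simulations below.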
Using this result, we obtain a normal approximation-based confidence interval for θ, namely

(\hat{\theta} - z_{1-\alpha/2}\sqrt{\hat{V}/n},\ \hat{\theta} + z_{1-\alpha/2}\sqrt{\hat{V}/n}),

where z_{1−α/2} is the 1−α/2 quantile of the standard normal distribution, and θ̂ is taken to be θ̂_WR or θ̂_WI.

3. Simulations

In this section, we present a simulation study to compare five methods in terms of coverage accuracies and average lengths of the confidence intervals based on them. The five methods are: the weight-corrected empirical likelihood (WCEL) proposed in subsection 2.1; the adjusted empirical likelihood (AEL) suggested in Wang & Rao (2002a); the weight-corrected empirical likelihood with auxiliary information (WCELA) introduced in subsection 2.2; the adjusted empirical likelihood with auxiliary information (AELA) proposed in Wang & Rao (2002a); and the normal approximation (NA) methods based on θ̂_WI and θ̂_WR. For convenience, in what follows, NA(θ̂_WI) and NA(θ̂_WR) denote the corresponding normal approximation confidence intervals based on θ̂_WI and θ̂_WR. The first regression model is

Y = (X - 1)^2 + |X|\varepsilon,  (12)

where X follows the truncated normal distribution with truncation constant 4, in which the normal distribution has mean 1 and variance 1, and ε follows the normal distribution with mean 0 and variance 0.16. The kernel functions K(x) and L(x) were taken to be 0.75(1 − x²)I{|x| ≤ 1} and 0.5I{|x| ≤ 1}, respectively, where I{·} is the indicator function. We used the cross-validation method to select the optimal bandwidth h. However, since such a selection involves the value of b, we have to consider the selection of b when we choose h. Looking back at b defined in m̂_b(x), its function is to avoid technical problems at the boundary of the support of X. Clearly, if the density function f(x) of X is bounded away from zero, b can be selected as a small positive value.
In practice, when we have a dataset, the values of the density at the data points are non-zero in many cases. Therefore, the selection of b is less important than that of h. This observation leads us to the following approach: specify a value of b, say b̃ = n^{−1/8}. The cross-validation criterion is then

cv(h) = \frac{1}{n}\sum_{i=1}^n \delta_i\{Y_i - \hat{m}^{(-i)}_{\tilde{b}}(X_i; h)\}^2,

where m̂^{(−i)}_{b̃}(X_i; h) is the Nadaraya–Watson estimator of m(X_i) computed with the ith observation deleted. A cross-validation bandwidth h_cv is then obtained by minimizing cv(h) with respect to h; that is, h_cv = arg min_{h>0} cv(h). The optimal bandwidth a_cv can also be selected by the cross-validation criterion. M_n was taken to be 2 ln n. It is easily shown that h_cv, a_cv, b̃ and M_n selected by this approach satisfy conditions (C3), (C4), (C7) and (C8). Therefore, we use the bandwidths h_cv and a_cv to compute the WCEL and WCELA ratios, and use the bandwidth h_cv n^{−2/15} to compute the AEL and AELA ratios, because the AEL and AELA methods require undersmoothing of the regression estimate. We generated 5000 Monte Carlo random samples of size n = 30, 60 and 100 based on the following three selection probability functions proposed by Wang & Rao (2002a).

Case 1. p_1(x) = 0.8 + 0.2|x − 1| if |x − 1| ≤ 1, and 0.95 elsewhere.
Case 2. p_2(x) = 0.9 − 0.2|x − 1| if |x − 1| ≤ 4.5, and 0.1 elsewhere.
Case 3. p_3(x) = 0.6 for all x.

The auxiliary information E(X) = 1 was used when we calculated the empirical coverage and average lengths of the confidence intervals for WCELA and AELA. To assess whether or not coverage errors are symmetric between the two tails, we also report the percentage P_L of intervals whose lower limit is greater than the true value of θ, and the percentage P_R of intervals whose upper limit is smaller than the true value of θ.
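The leave-one-out criterion cv(h) above can be sketched as a grid search; the grid, the Epanechnikov kernel and the way the truncation b enters the denominator are illustrative assumptions.

```python
import numpy as np

def cv_bandwidth(X, Y, delta, h_grid, b):
    """Pick h from h_grid minimizing the leave-one-out criterion
    cv(h) = n^{-1} sum_i delta_i {Y_i - m_hat^{(-i)}(X_i; h)}^2,
    with m_hat the truncated Nadaraya-Watson estimator of eq. (2)."""
    n = len(X)
    y = np.where(delta == 1, Y, 0.0)

    def loo_fit(i, h):
        u = (X - X[i]) / h
        w = delta * 0.75 * (1.0 - u**2) * (np.abs(u) <= 1)
        w[i] = 0.0                                # delete the ith observation
        return np.sum(w * y) / max(b * n * h, np.sum(w))

    def cv(h):
        obs = np.flatnonzero(delta == 1)          # only observed responses enter
        # averaging over observed points rescales cv(h) by a constant factor,
        # so the minimizing h is unchanged
        return np.mean([(Y[i] - loo_fit(i, h)) ** 2 for i in obs])

    scores = [cv(h) for h in h_grid]
    return h_grid[int(np.argmin(scores))]
```

A continuous minimization over h > 0 could replace the grid, but a coarse grid is usually adequate because cv(h) is flat near its minimum.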
The empirical coverage in percentage, (P_L, P_R), and the average lengths of the confidence intervals, with nominal level 1 − α = 0.95, were computed over the 5000 simulation runs. The simulation results are reported in Table 1.

Table 1. For model (12), the empirical coverage (EC) in percentage, indicators of symmetry (P_L, P_R) and average lengths (AL) of the confidence intervals for θ under different selection probability functions p(x) and different sample sizes n when the nominal level is 0.95

p(x)   n    Feature    WCEL          AEL           WCELA         AELA          NA(θ̂_WI)     NA(θ̂_WR)
p_1(x) 30   EC         91.52         91.42         90.56         90.48         90.20         90.16
            (P_L,P_R)  (2.58, 5.90)  (2.64, 5.94)  (1.86, 7.58)  (1.80, 7.72)  (1.06, 8.74)  (1.08, 8.76)
            AL         0.9865        0.9871        1.0575        1.0585        0.9774        0.9775
       60   EC         93.10         93.04         93.18         93.14         91.44         91.38
            (P_L,P_R)  (2.04, 4.86)  (2.08, 4.88)  (1.80, 5.02)  (1.80, 5.06)  (1.02, 7.54)  (1.04, 7.58)
            AL         0.7107        0.7110        0.8003        0.8010        0.7103        0.7103
       100  EC         94.12         94.08         95.48         95.46         93.12         93.10
            (P_L,P_R)  (2.06, 3.82)  (2.10, 3.82)  (1.46, 3.06)  (1.54, 3.00)  (1.06, 5.82)  (1.06, 5.84)
            AL         0.5542        0.5543        0.6495        0.6502        0.5540        0.5440
p_2(x) 30   EC         90.92         90.72         89.16         89.06         88.98         88.52
            (P_L,P_R)  (2.06, 7.02)  (1.76, 7.52)  (1.36, 9.48)  (1.38, 9.56)  (0.82, 10.2)  (0.92, 10.56)
            AL         0.9881        0.9883        1.0641        1.0650        0.9796        0.9797
       60   EC         92.62         92.56         92.52         92.50         91.12         90.72
            (P_L,P_R)  (1.72, 5.66)  (1.58, 5.86)  (1.58, 5.9)   (1.46, 6.04)  (0.94, 7.94)  (0.92, 8.36)
            AL         0.7226        0.7227        0.8108        0.8112        0.7148        0.7148
       100  EC         93.66         93.58         94.68         94.60         92.62         91.68
            (P_L,P_R)  (1.78, 4.56)  (1.48, 4.94)  (1.38, 3.94)  (1.18, 4.22)  (0.90, 6.48)  (0.80, 7.52)
            AL         0.5628        0.5629        0.6535        0.6539        0.5583        0.5584
p_3(x) 30   EC         90.50         90.20         89.12         88.98         88.88         88.64
            (P_L,P_R)  (2.38, 7.12)  (2.54, 7.26)  (1.92, 8.96)  (1.98, 9.04)  (0.94, 10.18) (1.30, 10.06)
            AL         1.0164        1.0169        1.0944        1.0953        0.9806        0.9807
       60   EC         92.52         92.50         92.48         92.46         90.92         90.14
            (P_L,P_R)  (2.12, 5.36)  (2.04, 5.46)  (1.74, 5.78)  (1.86, 5.68)  (1.04, 8.04)  (1.24, 8.62)
            AL         0.7390        0.7393        0.8301        0.8308        0.7171        0.7172
       100  EC         93.58         93.52         94.58         94.56         92.32         91.82
            (P_L,P_R)  (1.94, 4.48)  (1.80, 4.68)  (1.46, 3.96)  (1.50, 3.94)  (0.82, 6.86)  (1.12, 7.06)
            AL         0.5751        0.5755        0.6673        0.6675        0.5603        0.5603

WCEL, weight-corrected empirical likelihood; AEL, adjusted empirical likelihood; WCELA, WCEL with auxiliary information; AELA, AEL with auxiliary information.

From Table 1, we have the following observations.

(1) In the case where no auxiliary information is available, WCEL performs better than AEL: the associated confidence intervals have uniformly shorter average lengths and higher coverage accuracies. In addition, WCEL and AEL have slightly longer interval lengths, but higher coverage probabilities, than NA(θ̂_WI) and NA(θ̂_WR). The size of the Monte Carlo error is √(0.95 × 0.05/5000) ≈ 0.00308 for α = 0.05. The coverage probabilities for the empirical likelihood are close to the claimed confidence levels when the sample size is 100.

(2) When auxiliary information is available, the empirical coverage levels of the confidence intervals based on WCELA are uniformly higher than those based on AELA, and the average lengths of the confidence intervals based on WCELA are uniformly shorter than those based on AELA. Also, when n = 100, WCELA clearly outperforms WCEL, which does not use the auxiliary information, and hence also all the NA methods, in terms of coverage accuracy. When n = 30 or 60, WCELA performed poorly, because its ratio has a higher dimension than the WCEL ratio.

(3) The empirical likelihood confidence intervals have more balanced tail error rates than the normal approximation confidence intervals, as shown by the values of (P_L, P_R) for all the cases considered. The normal approximation-based confidence intervals produced larger differences between P_L and P_R.

(4) All the empirical coverage accuracies increase and the average lengths decrease as n increases. Also, the coverage accuracies and average lengths depend on the selection probability function p(x). In case 1, all the methods generally perform better than in the other two cases, because the missing rate of case 1 is lower than those of cases 2 and 3; the average missing rates of the three cases are approximately 0.09, 0.26 and 0.40, respectively. Generally, for every fixed sample size, the empirical coverage accuracies decrease and the average lengths increase as the missing rate increases. These findings basically agree with those of Wang & Rao (2002a).

Our simulation results for the case of normal X agree with those for the truncated normal X case. Our regularity conditions are satisfied for the latter case, and it is interesting that the results remain valid for the normal X case, although it is not easy for us to prove that the normal distribution satisfies conditions (C3) and (C4) for the polynomial model considered before.
The choice of the trimming constant b seems not so sensitive with regard to coverage accuracy and interval length, although it matters. When b was taken to be n^{−1/7} or n^{−1/9}, results similar to those in Table 1 were obtained.

We now consider a regression model with a two-dimensional covariate (X_1, X_2); that is,

Y = 5\exp(-0.3X_1 - 1.2X_2) + \varepsilon,  (13)

where ε ∼ N(0, 0.3²), and X_1 and X_2 are independent standard exponential variables with mean 1 and variance 1. The selection probability is taken as

P(\delta = 1 \mid X) = \frac{\exp(1 + 0.7X_1 + 3X_2)}{1 + \exp(1 + 0.7X_1 + 3X_2)}.  (14)

The kernel functions were taken to be the product kernels K(x_1, x_2) = K_0(x_1)K_0(x_2) and L(x_1, x_2) = L_0(x_1)L_0(x_2), where K_0(x) = (15/16)(1 − x²)²I{|x| ≤ 1} and L_0(x) = 0.5I{|x| ≤ 1}. The optimal bandwidths h_opt and a_opt were selected by the cross-validation method. We also used the auxiliary information E(X_i) = 1 when we calculated the empirical coverage and average lengths of the confidence intervals for WCELA and AELA. We generated 5000 Monte Carlo random samples of size n = 30, 50, 100 and 200. The empirical coverage in percentage, (P_L, P_R), and the average lengths of the confidence intervals, with nominal level 1 − α = 0.95, were computed over the 5000 simulation runs. The simulation results are reported in Table 2.

From Table 2, we obtain the following conclusions. When no auxiliary information is available, WCEL performs slightly better than AEL. Also, WCEL and AEL clearly outperform all the NA methods in terms of coverage accuracy, and WCEL and AEL provide more balanced tail error rates than NA, though with slightly longer interval lengths. When auxiliary information is available, the average lengths of the confidence intervals based on WCELA are shorter than those based on AELA, and the empirical coverage levels based on WCELA are slightly higher than those based on AELA.
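One Monte Carlo draw from model (13) with the logistic selection probability (14) can be sketched as follows; the function name, seed handling and NaN encoding of the missing responses are illustrative assumptions.

```python
import numpy as np

def simulate_model_13(n, rng):
    """One sample: Y = 5 exp(-0.3 X1 - 1.2 X2) + eps with eps ~ N(0, 0.3^2),
    X1, X2 i.i.d. standard exponential, and delta_i drawn with the logistic
    selection probability of eq. (14)."""
    X1 = rng.exponential(1.0, n)
    X2 = rng.exponential(1.0, n)
    Y = 5.0 * np.exp(-0.3 * X1 - 1.2 * X2) + rng.normal(0.0, 0.3, n)
    p = 1.0 / (1.0 + np.exp(-(1.0 + 0.7 * X1 + 3.0 * X2)))   # eq. (14)
    delta = (rng.uniform(size=n) < p).astype(int)
    Y_obs = np.where(delta == 1, Y, np.nan)                  # mask missing responses
    return X1, X2, Y_obs, delta
```

Since 1 + 0.7X_1 + 3X_2 ≥ 1, every selection probability exceeds e/(1 + e) ≈ 0.73, so this design produces a fairly low missing rate.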
Also, when n = 100 and 200, WCELA and AELA perform better than the other four methods, because the associated confidence intervals have uniformly shorter average lengths and higher coverage accuracies.

Table 2. For model (13) and selection probability function (14), the empirical coverage (EC) in percentage, indicators of symmetry (P_L, P_R) and average lengths (AL) of the confidence intervals for θ under different sample sizes n, when the nominal level is 0.95

n    Feature      WCEL          AEL           WCELA         AELA          NA(σ̂_WI)      NA(σ̂_WR)
30   EC           93.70         93.58         90.62         90.18         91.94         91.94
     (P_L, P_R)   (2.24, 4.06)  (2.18, 4.24)  (1.4, 7.98)   (1.32, 8.5)   (2.2, 5.86)   (2.16, 5.9)
     AL           0.8643        0.8649        0.6190        0.6197        0.8787        0.8788
50   EC           94.40         94.38         93.40         93.38         93.54         93.52
     (P_L, P_R)   (2.02, 3.58)  (2.02, 3.6)   (1.28, 5.32)  (1.22, 5.4)   (1.94, 4.52)  (1.9, 4.58)
     AL           0.6907        0.6910        0.5076        0.5090        0.6904        0.6904
100  EC           94.54         94.52         95.34         95.32         94.18         94.14
     (P_L, P_R)   (2.12, 3.34)  (2.1, 3.38)   (1.04, 3.62)  (1.0, 3.68)   (2.02, 3.8)   (2.02, 3.84)
     AL           0.4933        0.4935        0.3765        0.3774        0.4925        0.4926
200  EC           94.78         94.68         96.04         96.02         94.52         94.50
     (P_L, P_R)   (1.92, 3.3)   (1.92, 3.4)   (0.96, 3.0)   (0.9, 3.08)   (1.72, 3.76)  (1.72, 3.78)
     AL           0.3501        0.3502        0.2694        0.2700        0.3499        0.3499

WCEL, weight-corrected empirical likelihood; AEL, adjusted empirical likelihood; WCELA, WCEL with auxiliary information; AELA, AEL with auxiliary information.

In addition, all the empirical coverage accuracies increase and the average lengths decrease as n increases.

4. Concluding remarks

In this paper, we have proposed a bias-correction technique for constructing an empirical likelihood ratio when the response might be missing at random. A bias-corrected empirical likelihood approach to inference for the mean of the response variable was developed.
A non-parametric version of Wilks' theorem was proved for the weight-corrected empirical log-likelihood ratio by showing that it has an asymptotic chi-squared distribution. Also, with auxiliary information, a weight-corrected empirical log-likelihood ratio was derived, and it was shown that this ratio is also asymptotically chi-squared. In addition, a normal approximation-based method was considered. The advantage of the empirical likelihood method was demonstrated in a simulation study. Our method for the response mean is distinguished from those of Wang & Rao (2002a) and Qin & Zhang (2007), which focus on constructing weight-corrected empirical likelihood ratios. Also, the bias-correction technique proposed in this paper might be used to study a class of semi-parametric regression models. The methodology presented here can also be generalized to estimate other marginal parameters or functions, such as var(Y), the cumulative distribution function F(y) of Y and the quantiles of F(y). These problems obviously merit further study.

Acknowledgements

The author thanks the editor, the associate editor and two referees for their thoughtful and constructive comments and suggestions. The research was supported by the National Natural Science Foundation of China (10871013), the Beijing Natural Science Foundation (1072004) and the PhD Program Foundation of Ministry of Education of China (20070005003).

References

Chen, J., Fan, J., Li, K. H. & Zhou, H. (2006). Local quasi-likelihood estimation with data missing at random. Statist. Sinica 16, 1044–1070.
Chen, S. X. & Hall, P. (1993). Smoothed empirical likelihood confidence intervals for quantiles. Ann. Statist. 21, 1166–1181.
Chen, J. & Sitter, R. R. (1999). A pseudo-empirical likelihood approach to the effective use of auxiliary information in complex surveys. Statist. Sinica 9, 385–406.
Cheng, P. E. (1990). Applications of kernel regression estimation: a survey. Commun. Statist. A 19, 4103–4134.
Cheng, P. E.
(1994). Nonparametric estimation of mean functionals with data missing at random. J. Amer. Statist. Assoc. 89, 81–87.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.
Hall, P. & La Scala, B. (1990). Methodology and algorithms of empirical likelihood. Int. Statist. Rev. 58, 109–127.
Hall, P. & Murison, R. D. (1993). Correcting the negativity of high-order kernel density estimators. J. Multivariate Anal. 47, 103–122.
Hjort, N. L., McKeague, I. W. & Van Keilegom, I. (2009). Extending the scope of empirical likelihood. Ann. Statist. 37, 1039–1111.
Kitamura, Y. (1997). Empirical likelihood methods with weakly dependent processes. Ann. Statist. 25, 2084–2102.
Neyman, J. (1938). Contribution to the theory of sampling human population. J. Amer. Statist. Assoc. 33, 101–106.
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single function. Biometrika 75, 237–249.
Owen, A. B. (1990). Empirical likelihood ratio confidence regions. Ann. Statist. 18, 90–120.
Peng, L. (2004). Empirical-likelihood-based confidence interval for the mean with a heavy-tailed distribution. Ann. Statist. 32, 1192–1214.
Qin, J. & Zhang, B. (2007). Empirical-likelihood-based inference in missing response problems and its application in observational studies. J. R. Statist. Soc. Ser. B Stat. Methodol. 69, 101–122.
Robins, J. M., Rotnitzky, A. & Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Amer. Statist. Assoc. 90, 106–121.
Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–592.
Spiegelman, C. & Sacks, J. (1980). Consistent window estimation in nonparametric regression. Ann. Statist. 5, 595–620.
Stute, W., Xue, L. G. & Zhu, L. X. (2007).
Empirical likelihood inference in nonlinear errors-in-covariables models with validation data. J. Amer. Statist. Assoc. 102, 332–346.
Wang, Q. H., Linton, O. & Härdle, W. (2004). Semiparametric regression analysis with missing response at random. J. Amer. Statist. Assoc. 99, 334–345.
Wang, Q. H. & Rao, J. N. K. (2002a). Empirical likelihood-based inference under imputation for missing response data. Ann. Statist. 30, 896–924.
Wang, Q. H. & Rao, J. N. K. (2002b). Empirical likelihood-based inference in linear models with missing data. Scand. J. Statist. 29, 563–576.
Xue, L. G. & Zhu, L. X. (2006). Empirical likelihood for single-index models. J. Multivariate Anal. 97, 1295–1312.
Xue, L. G. & Zhu, L. X. (2007a). Empirical likelihood for a varying coefficient model with longitudinal data. J. Amer. Statist. Assoc. 102, 642–654.
Xue, L. G. & Zhu, L. X. (2007b). Empirical likelihood semiparametric regression analysis for longitudinal data. Biometrika 94, 921–937.
Zhu, L. X. & Fang, K. T. (1996). Asymptotics for kernel estimate of sliced inverse regression. Ann. Statist. 34, 1053–1069.
Zhu, L. X. & Xue, L. G. (2006). Empirical likelihood confidence regions in a partially linear single-index model. J. R. Statist. Soc. B 68, 549–570.

Received April 2007, in final form March 2009

Liugen Xue, College of Applied Sciences, Beijing University of Technology, Beijing 100124, P.R. China.
E-mail: lgxue@bjut.edu.cn

Appendix

In this Appendix, we provide proofs of theorems 1–3. The following lemmas are useful for proving the theorems.

Lemma 1
Suppose that conditions (C1)–(C3) and (C5) hold. Then, uniformly over 1 ≤ i ≤ n,

E{m̂_b(X_i) - m(X_i)}^2 = O((n h^d b^2)^{-1}) + O(h^{2r} b^{-2}) + o(n^{-1/2}),

where m̂_b(·) is defined in (2).

Proof. Denote g_b(x) = max{b, g(x)} and m_b(x) = m(x) g(x)/g_b(x). We have, for all 1 ≤ i ≤ n,

E{m_b(X_i) - m(X_i)}^2 ≤ c E[|m(X)| I{g(X) < b}] = o(n^{-1/2}),

where c is a positive constant.
Therefore, to prove lemma 1, we only need to show that, for all 1 ≤ i ≤ n,

E{m̂_b(X_i) - m_b(X_i)}^2 = O((n h^d b^2)^{-1}) + O(h^{2r} b^{-2}) + o(n^{-1/2}).  (A1)

Let

ĝ(x) = (n h^d)^{-1} Σ_{j=1}^n δ_j K_h(X_j - x),  ĝ_b(x) = max{b, ĝ(x)},
Φ_n(x) = (n h^d)^{-1} Σ_{j=1}^n δ_j {Y_j - m(X_j)} K_h(X_j - x),
Ψ_n(x) = (n h^d)^{-1} Σ_{j=1}^n δ_j {m(X_j) - m(x)} K_h(X_j - x),
Q_n(x) = m(x) {ĝ(x) g_b(x) - g(x) ĝ_b(x)} / {g_b(x) ĝ_b(x)},

and denote T_n(x) = m̂_b(x) - m_b(x). By direct calculation, it can be verified that T_n(x) = {Φ_n(x) + Ψ_n(x)}/ĝ_b(x) + Q_n(x). Consequently, since ĝ_b(x) ≥ b, for all 1 ≤ i ≤ n we have

E{T_n^2(X_i)} ≤ 3 b^{-2} E{Φ_n^2(X_i)} + 3 b^{-2} E{Ψ_n^2(X_i)} + 3 E{Q_n^2(X_i)}.  (A2)

It can be shown that, for all 1 ≤ i ≤ n,

E{Φ_n^2(X_i)} = O((n h^d)^{-1}),  (A3)
E{Ψ_n^2(X_i)} = O((n h^{d-2})^{-1}) + O(h^{2r}),  (A4)
E{Q_n^2(X_i)} = O((n h^d b^2)^{-1}) + O(h^{2r} b^{-2}) + o(n^{-1/2}).  (A5)

Substituting (A3)–(A5) into (A2) proves (A1), and hence lemma 1 is proved.

Lemma 2
Suppose that conditions (C1), (C4) and (C6) hold, except for the condition on m(x) in (C1). Then, uniformly over 1 ≤ i ≤ n,

E{p̂(X_i) - p(X_i)}^2 = O((n a^d)^{-1} M_n^d) + O(a^{2r}) + o(n^{-1/2}),

where p̂(·) is defined in (4).

Proof. Following the lines of Spiegelman & Sacks (1980), we can prove that, uniformly over 1 ≤ i ≤ n,

E{1/C_n(X_i)} = O((n a^d)^{-1} M_n^d) + o(n^{-1/2}),  (A6)

where C_n(X_i) = max{1, c_1 Σ_{k≠i} I{‖X_k - X_i‖ ≤ a}}. Denote

W_{nj}(x) = L_a(X_j - x) / max{1, Σ_{k=1}^n L_a(X_k - x)}.

By direct calculation, we obtain

E{p̂(X_i) - p(X_i)}^2 ≤ 3 E[Σ_{j=1}^n W_{nj}(X_i) {δ_j - p(X_j)}]^2 + 3 E[Σ_{j=1}^n W_{nj}(X_i) {p(X_j) - p(X_i)}]^2
    + 3 E[{Σ_{j=1}^n W_{nj}(X_i) - 1} p(X_i)]^2 ≡ J_1 + J_2 + J_3.  (A7)

By orthogonality and (A6), we can derive that

J_1 = O((n a^d)^{-1} M_n^d) + o(n^{-1/2}),  (A8)
J_2 = O((n a^d)^{-1} M_n^d) + o(n^{-1/2}) + O(a^{2r}),  (A9)
J_3 = O((n a^d)^{-1} M_n^d) + o(n^{-1/2})
(A10)

Substituting (A8)–(A10) into (A7) completes the proof of lemma 2.

Lemma 3
Suppose that conditions (C1)–(C8) hold. If θ is the true parameter, then

n^{-1/2} Σ_{i=1}^n (Ŷ_i - θ) →_D N(0, V),

where V is defined in theorem 3.

Proof. We use the notation of lemmas 1 and 2. It is straightforward to obtain

n^{-1/2} Σ_{i=1}^n (Ŷ_i - θ) = T_1 + T_2 + T_3,

where

T_1 = n^{-1/2} Σ_{i=1}^n [δ_i {Y_i - m(X_i)}/p(X_i) + {m(X_i) - θ}],
T_2 = n^{-1/2} Σ_{i=1}^n {1/p̂(X_i) - 1/p(X_i)} δ_i {Y_i - m(X_i)},
T_3 = n^{-1/2} Σ_{i=1}^n {1 - δ_i/p̂(X_i)} {m̂_b(X_i) - m(X_i)}.

Since √n T_1 is a sum of i.i.d. random variables, by the central limit theorem we get that T_1 →_D N(0, V). We can prove that T_ℓ = o_P(1) for ℓ = 2, 3, from which lemma 3 follows.

Lemma 4
Suppose that conditions (C1)–(C8) hold. If θ is the true parameter, then

n^{-1} Σ_{i=1}^n (Ŷ_i - θ)^2 →_P V,

where V is defined in theorem 3.

Lemma 5
Suppose that conditions (C1)–(C8) hold. Then

max_{1≤i≤n} |Ŷ_i| = o_P(n^{1/2}) and λ = O_P(n^{-1/2}).

By lemmas 1 and 2, and using arguments similar to those in the proof of lemma 3, we can prove lemma 4. Lemma 5 can be proved similarly to lemmas A.3 and A.4 in Wang & Rao (2002a). Their proofs are therefore omitted.

Proof of theorem 1. Using (5), (6) and lemmas 3–5, and arguing as in the proof of theorem 1 in Wang & Rao (2002a), we obtain

ℓ̂(θ) = {n^{-1/2} Σ_{i=1}^n (Ŷ_i - θ)}^2 {n^{-1} Σ_{i=1}^n (Ŷ_i - θ)^2}^{-1} + o_P(1).

This together with lemmas 3 and 4 proves theorem 1.

Proof of theorem 2. By (7) and (8), and similar to the proof of theorem 1, it can be shown that

ℓ̂_{n,AI}(θ) = {n^{-1/2} Σ_{i=1}^n η_i(θ)}^T V_{n,AI}^{-1} {n^{-1/2} Σ_{i=1}^n η_i(θ)} + o_P(1),  (A11)

where

V_{n,AI} = \begin{pmatrix} V_{n1} & V_{n2} \\ V_{n2}^T & V_n \end{pmatrix},

with

V_n = n^{-1} Σ_{i=1}^n (Ŷ_i - θ)^2,
V_{n1} = n^{-1} Σ_{i=1}^n A(X_i) A^T(X_i),
V_{n2} = n^{-1} Σ_{i=1}^n A(X_i)(Ŷ_i - θ).

Similar to the proof of theorem 3.2 in Wang & Rao (2002a), it can be shown that

n^{-1/2} Σ_{i=1}^n η_i(θ) →_D N(0, V_AI),  (A12)

where

V_AI = \begin{pmatrix} V_1 & V_2 \\ V_2^T & V \end{pmatrix},

with V_1 = E{A(X) A^T(X)}, V_2 = E{A(X)(m(X) - θ)} and V defined in theorem 3. Similar to the proof of (A.55) in Wang & Rao (2002a), we can prove that

V_{n,AI} →_P V_AI.  (A13)

Therefore, theorem 2 follows immediately from (A11)–(A13).

Proof of theorem 3. From (10) and (11), we get that

√n (θ̂ - θ) = n^{-1/2} Σ_{i=1}^n (Ŷ_i - θ) + o_P(1).

This together with lemma 3 proves theorem 3.
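To make the quantities in these proofs concrete, the following one-dimensional numerical sketch forms the bias-corrected imputed values Ŷ_i = δ_i Y_i/p̂(X_i) + {1 - δ_i/p̂(X_i)} m̂(X_i), a form consistent with the decomposition T_1 + T_2 + T_3 of lemma 3, and then builds the normal-approximation interval of theorem 3 from their sample mean and variance. The data-generating model, the Gaussian kernel and the fixed bandwidths are illustrative assumptions, not the paper's choices.

```python
# Sketch: bias-corrected imputation and the normal-approximation CI of
# theorem 3 in one dimension. Illustrative assumptions throughout.
import numpy as np

rng = np.random.default_rng(1)

def nw(x_eval, x_obs, y_obs, h):
    # Nadaraya-Watson estimate with a Gaussian kernel; the denominator is
    # bounded away from zero, playing the role of the trimming constant b.
    w = np.exp(-0.5 * ((x_eval[:, None] - x_obs[None, :]) / h) ** 2)
    denom = np.maximum(w.sum(axis=1), 1e-3)
    return w @ y_obs / denom

n = 2000
x = rng.uniform(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.1, n)          # true response mean E(Y) = 1
p = 1.0 / (1.0 + np.exp(-(1.0 + x)))           # MAR selection probability p(x)
delta = rng.uniform(size=n) < p                # delta = 1 means Y observed

m_hat = nw(x, x[delta], y[delta], h=0.1)       # kernel regression of Y on X
p_hat = nw(x, x, delta.astype(float), h=0.1)   # kernel estimate of p(x)
p_hat = np.clip(p_hat, 0.05, 1.0)              # guard against small p_hat

# Bias-corrected imputed values and the normal-approximation interval.
y_imp = delta * y / p_hat + (1.0 - delta / p_hat) * m_hat
theta_hat = y_imp.mean()                       # estimate of the response mean
v_hat = ((y_imp - theta_hat) ** 2).mean()      # sample analogue of V (lemma 4)
half = 1.96 * np.sqrt(v_hat / n)
print(f"estimate {theta_hat:.3f}, 95% CI [{theta_hat - half:.3f}, {theta_hat + half:.3f}]")
```

Note that for observed cases (δ_i = 1) with p̂ close to 1, Ŷ_i is essentially Y_i itself, while for missing cases (δ_i = 0) it reduces to the imputed value m̂(X_i); the 1/p̂ weighting corrects the bias that naive imputation would otherwise leave.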