THE SUITABILITY OF ALTERNATIVE SURVEY-DERIVED SMALL AREA STATISTICS IN INTERNATIONAL COMPARISONS: FINDINGS FROM THE EURAREA PROJECT

Patrick Heady and Martin Ralphs
Spatial Analysis and Modelling Branch (Research), Methodology Group, Office for National Statistics, 1 Drummond Gate, London SW1V 2QQ.

Abstract

In the use of survey-derived statistics for small areas (using model-assisted and model-based methods), the predominant focus has been on the ability of alternative methods to provide estimates for individual small areas. The quality of the estimates for a set of areas has generally been thought of in terms of the average value of such area-specific measures as bias or mean square error. Recently, however, there has been growing interest in estimators which conserve key properties of the overall set of area characteristics. Perhaps the most important characteristic of this kind is how well the estimators represent the true extent of geographical inequality. In general, “design-based” estimators can be expected to overstate the extent of geographical inequality, while model-based estimators are likely to understate it. A consequence of this is that comparisons of spatial inequality between different states may be powerfully affected by differences in the small area estimation methods they use.

The Eurarea project, carried out by NSIs and academics, used simulation methods, based on simulated samples from 100-percent census and register databases for six European countries, to measure the practical performance of a range of small area estimators. These included direct estimators, GREGs, regression synthetic estimators, composite estimators, and estimators that borrowed strength from patterns of temporal and spatial autocorrelation.
In this paper, we use some of these results to illustrate the performance of different types of estimators for estimating area-specific totals and explore the extent of the distortion to spatial distributions that can occur from using small area estimation methods that are optimal in other respects, relating this particularly to the issue of international comparisons of spatial inequality.

1 INTRODUCTION

The provision of high quality small area statistics is a growing priority for European governments, primarily so that resource allocation can be optimally directed to tackle problems such as poor housing and health, unemployment and low pay. A problem arises because the availability of small area statistics for key variables of interest is limited by the cost of data collection. It is prohibitively expensive to carry out detailed local surveys with comprehensive coverage for very small areas, and while sample surveys are carried out to collect information about key themes of local interest, these are designed primarily for efficient estimation at national level.

Small area estimation techniques (henceforth abbreviated to SAE) can help to overcome these problems by using local sample information, usually coupled with ancillary data from secondary sources such as population censuses or administrative data, to provide area-specific estimates with higher precision than a direct estimate based solely on the sample. The Eurarea project (see Heady and Hennell, 2001, for a full description) has investigated the performance of standard and innovative methods for SAE in the European context, with the objective of providing advice to Eurostat and to European NSIs on the appropriate use of SAE methods in the context of official statistics.
The full range of results obtained, including methodological findings, specimen programs and recommendations regarding statistical policy, will be published in the Eurarea project reference volume and made available on the project website later this year (http://www.statistics.gov.uk/eurarea, which will be available for public access in summer 2004).

The most significant experimental capability developed within Eurarea has been the facility to evaluate estimator performance through large-scale simulations based on population register and census datasets in six different European countries. Repeated samples are drawn in realistic ways from these bases, estimation strategies are applied to them, and the results are compared with the true values contained within the databases across many replications. This has enabled us to assess empirically the predictive power of the estimators and the reliability of their in-built error prediction procedures.

In this paper, we introduce small area estimation approaches and use some of the empirical results produced by the Eurarea Consortium to illustrate the performance of several commonly applied SAE methods in respect of two key user requirements – the provision of reliable area-specific estimates and, of particular concern to policy makers responsible for resource allocation, the modelling of the distribution of area values. We move on to consider the implications of our findings in the light of these two objectives.

SMALL AREA ESTIMATION METHODS

Since the most pressing issues facing European NSIs at the moment are whether to use model-based approaches at all, and how to make the best use of the data provided by the different national statistical systems, it was appropriate to concentrate on relatively straightforward methods in Eurarea. The SAE methods that we present here are therefore not comprehensive.
Rao (2003) and Pfeffermann (2002) present comprehensive reviews of the methods that are available and the interested reader is directed to these references for more information. In Eurarea, our primary interest was in comparing the effectiveness of estimators drawn from both the design-based and model-based families (Särndal, 1984) and in considering how performance was affected by a range of external factors such as sampling methods, the sizes of the small areas for which estimates were required and the treatment of both binary and continuous target variables.

For the purposes of this paper, we introduce four of the basic estimators from Eurarea. Our main objective is to demonstrate how the predictive power of our estimation can be enhanced through the deployment of different approaches, and to illustrate the limitations that apply when we use particular SAE strategies based on design-based and model-based approaches or combinations of the two.

The types of estimate produced by design-based and model-based estimation procedures differ fundamentally. In the case of a design-based estimator, the estimate produced is unique to each individual small area under consideration. The estimate is unbiased for that area, in the sense that, under repeated sampling, the mean of successive estimates will tend towards the true value.

For model-based estimators, the situation is somewhat different. A model-based estimator utilises ancillary information to produce an estimate of the target variable that is applicable to all small areas that share similar characteristics. Thus, if two small areas have exactly the same ancillary information, exactly the same estimate will be produced for each by the model-based procedure. Unlike the situation with design-based estimators, the discrepancies between the true value for a particular area and the model estimates generated from successive samples will not tend to average out over the long run.
The estimators

We considered the performance of four basic estimator types, which we define and discuss below. We use the following standard notation in all our equations:

1. $Y$ denotes the survey variable of interest; $X$ denotes ancillary data;
2. Lowercase letters refer to sample statistics and uppercase to population statistics;
3. Indices $i$ and $d$ refer to individuals and small areas (domains), respectively;
4. $w$ refers to the sampling weights of individuals, based on their sample inclusion probabilities ($w_{id}$ is the weight for individual $i$ in area $d$);
5. $n$ is the sample size and $N$ is the population size;
6. $s$ refers to the sample;
7. A bar above a variable refers to the mean – e.g. $\bar{y}_d$ is the sample mean of $y$ for area $d$.
8. A hat above a variable refers to an estimate – e.g. $\hat{Y}_d$ is an estimate of $Y_d$.
9. $u$ and $e$ refer to area- and unit-level random effects.
10. In order to simplify the notation we sometimes use $\theta_d$ to refer to an area-specific quantity. Thus we might write $\theta_d$ for $Y_d$, and $\hat{\theta}_d$ for $\hat{Y}_d$.
11. $I_d(\cdot)$ is an indicator variable, taking the value 1 if the condition inside the bracket applies to area $d$, and 0 if it does not.

1. Direct Estimator

The direct estimator is defined as the $\pi$-weighted Horvitz-Thompson estimator for each area (Särndal et al., 1992), where $\pi$ is the probability of inclusion in the sample. This design-based estimator is the weighted local average value of the target variable for sampled units in each area and is given by the formula:

$$\hat{Y}_d^{DIRECT} = \frac{1}{\hat{N}_d} \sum_{i \in s_d} w_{id}\, y_{id}, \quad \text{where } \hat{N}_d = \sum_{i \in s_d} w_{id}$$

In practice, the direct estimator is highly vulnerable to sample size and coverage, and can only be computed for areas which are sampled. We include it here to provide a benchmark against which we can compare the performance of more sophisticated model-based and composite approaches.

2. Generalised Regression Estimator (GREG)

The design-based GREG (Generalised REGression estimator) is obtained by adjusting the direct estimator for an area for differences between the sample and population area means of covariates. The adjustments are calculated by using a model relating $y$ and $X$. As a standard, the ordinary regression model is used and this has been applied in Eurarea. The formula for the GREG estimator is:

$$\hat{Y}_d^{GREG} = \frac{1}{\hat{N}_d} \sum_{i \in s_d} w_{id}\, y_{id} + \left( \bar{\mathbf{X}}_d - \frac{1}{\hat{N}_d} \sum_{i \in s_d} w_{id}\, \mathbf{x}_{id} \right)^T \hat{\boldsymbol{\beta}}$$

where $\bar{\mathbf{X}}_d = (\bar{X}_{d,1}, \ldots, \bar{X}_{d,p})^T$ is a vector of $p$ population mean covariates.

3. Area-level Synthetic Estimator

Synthetic estimators assume a model that describes the relationship between the target variable $y$ and a set of ancillary data $X$. Through this modelled relationship, the ancillary data can be used to predict the mean of $y$ for all target areas. The estimator and its variance are developed on the assumption that the model used accurately describes the population. If the model can draw sufficient power from the available ancillary data, the method can provide substantial gains over direct estimates, which rely solely upon survey data.

In practice, area models have been used more extensively than other synthetic estimators for SAE work. In this example, a linear model with area-level covariates is fitted to the sample area means of the target variable. The model is

$$\bar{y}_{.d} = \bar{\mathbf{X}}_d^T \boldsymbol{\beta} + u_d + \bar{e}_{.d}$$

and the estimator is

$$\hat{Y}_d^{SYNTH} = \bar{\mathbf{X}}_d^T \hat{\boldsymbol{\beta}}$$

where $u_d$ and $e_{id}$ are independent variables with mean 0 and variances $\sigma_u^2$ and $\sigma_e^2$, and $\bar{e}_{.d}$ is the mean of $e_{id}$ over the sampled units in area $d$. The variance term $\sigma_u^2$ is the main component of the MSE.

4. Composite Estimator

Composite estimators attempt to improve performance by combining the strengths of synthetic and design-based estimators. The example given here is a weighted combination of the area-synthetic and direct estimators.
The estimator is given by the following formula:

$$\hat{Y}_d^{COMP} = \gamma_d\, \hat{Y}_d^{DIRECT} + (1 - \gamma_d)\, \bar{\mathbf{X}}_d^T \hat{\boldsymbol{\beta}}^{SYNTHETIC}, \quad \text{where } \gamma_d = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2 / n_d}$$

and $n_d$ is the sample size of area $d$. The gamma term $\gamma_d$ is a weight based on the modelled variability of the areas and the sampling variability of the data collected in each area. It is used to adjust the contribution of the direct and synthetic components of the estimator. When sampling variability is high (and the reliability of the direct estimator is therefore questionable), the composite estimator is weighted in favour of its synthetic component. When the variance of the direct estimator is low, the estimator is weighted in favour of its direct component. In situations where an area is unsampled, the estimate will be based wholly on the synthetic component of the estimator.

ESTIMATING AREA-SPECIFIC VALUES

The primary goal of SAE is the precise estimation of area-specific parameters to produce optimal local estimates. We now consider how well each of the above estimators performs when predicting individual area values. For this we need some criterion of what counts as a good predictor. The criterion we adopt here is the minimisation of squared error loss for each area. A good estimate for area $d$ is thus one which minimises

$$MSE_d = E\left[ \left( \hat{\theta}_d - \theta_d \right)^2 \right]$$

The results we present are summaries derived from Eurarea simulation studies of estimator performance for three target variables: equivalised household income, proportion of single person households and the ILO-definition unemployment rate. The simulation process consisted of drawing samples, using approximations to the sampling designs that would be used in practice, applying the various estimation procedures, and comparing their estimates to the true values for all areas in the study dataset. The process was repeated a large number of times (typically 500).
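The simulation procedure and the direct, synthetic and composite estimators described above can be sketched on a toy population as follows. This is only a minimal illustration, not the Eurarea code: the population, the equal-probability sample drawn in every area, the sizes `D`, `N`, `n`, `K`, and the crude moment estimates of the variance components are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy population (all sizes and parameters are illustrative).
D, N, n, K = 30, 400, 10, 200          # areas, units per area, sample per area, replicates
x = rng.normal(size=D)                  # one area-level covariate
u = rng.normal(scale=0.5, size=D)       # area-level random effects u_d
y = 2.0 + x[:, None] + u[:, None] + rng.normal(scale=2.0, size=(D, N))
theta = y.mean(axis=1)                  # true area means
X = np.column_stack([np.ones(D), x])    # design matrix with intercept


def one_replicate():
    """Draw one sample and return direct, synthetic and composite estimates."""
    # Simple random sample of n units without replacement in every area.
    idx = rng.random((D, N)).argsort(axis=1)[:, :n]
    ys = np.take_along_axis(y, idx, axis=1)
    direct = ys.mean(axis=1)            # equal weights, so a plain sample mean
    # Area-level synthetic estimator: OLS fit of the sample area means on X.
    beta, *_ = np.linalg.lstsq(X, direct, rcond=None)
    synth = X @ beta
    # Crude moment estimates of the variance components (an assumption here).
    s2e = ys.var(axis=1, ddof=1).mean()
    s2u = max(np.sum((direct - synth) ** 2) / (D - X.shape[1]) - s2e / n, 0.0)
    gamma = s2u / (s2u + s2e / n)       # weight on the direct component
    comp = gamma * direct + (1.0 - gamma) * synth
    return direct, synth, comp


# est[k, j, d]: replicate k, estimator j, area d
est = np.array([one_replicate() for _ in range(K)])
mse_d = ((est - theta) ** 2).mean(axis=0)   # empirical MSE per estimator and area
avg_mse = dict(zip(["direct", "synthetic", "composite"], mse_d.mean(axis=1)))
print(avg_mse)
```

With realistic designs many areas would contain no sample at all, in which case only the synthetic component would be available for them; the sketch sidesteps this by sampling every area.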
This enabled us to produce an empirical summary of the MSE properties of each estimator, in the form of the average empirical mean squared error, which we define thus:

$$AEMSE\left( \hat{Y}_d \right) = \frac{1}{DK} \sum_{d=1}^{D} \sum_{k=1}^{K} \left( \hat{Y}_d^{(k)} - Y_d \right)^2$$

where $\hat{Y}_d^{(k)}$ is the estimate of the target variable for area $d$ in simulation $k$, $Y_d$ is the true mean of $Y$ for area $d$, $D$ is the number of areas and $K$ is the number of replicates in the simulation. The smaller this quantity is, the better the estimator has performed over the whole set of areas.

In Figure 1 we summarise this information for six countries in European NUTS5 areas. We do not show the MSE results themselves in the graph; instead we show the mean value of the rank achieved by each estimator across all of the simulation runs, since this allows us to compare results from different countries concisely. The graph shows that model-based approaches (the synthetic and composite estimators) consistently exhibit improved MSE performance over the design-based estimators (the Direct and GREG) for these small geographical areas.

[Figure 1: bar chart titled “Estimator Performance by Mean Rank - NUTS5”. Vertical axis: mean rank; horizontal axis: target variable (Income, Household, Unemployment); series: Direct, GREG, Area Synthetic, Composite.]

Figure 1 – Estimator performance by mean rank based on average MSE across simulation runs at NUTS5 level. Lower ranks indicate better performance.

Estimator performance is influenced by a range of contributory factors. We will now consider some of the most significant ones and then discuss their impact upon the results that we achieve.

We have stated that design-based estimators are particularly vulnerable to sample size and consistency and that these estimates can only be produced for areas that contain a sample. This is a problem in small area estimation, since many national surveys feature clustered sampling designs which provide data for only a subset of the areas of interest.
To produce stable estimates with an acceptable level of variability, these estimators require large samples. Once again, this is problematic, since sample sizes will typically be very small, particularly at geographical levels below NUTS3.

Model-based estimators are much less vulnerable to sample size than design-based estimators. Instead, their predictive power is reliant on the use of ancillary information in the form of a set of X variables (covariates) to which the sampled values of the target variable are related. In this case, it is the choice of an appropriate set of ancillary data that is of critical importance – such data must be available for all of the small areas for which estimates are required and, ideally, will be strongly related to the target variable for maximum effectiveness.

The main limitation of the model-based approach is estimator bias. Because synthetic estimators apply a globally fitted model consistently to all target areas, they tend to underestimate extreme values, “shrinking” these towards the global mean. Additionally, they may systematically underestimate or overestimate values for particular subtypes of area if the causes of variation in the target variable particular to these subtypes are not captured by the ancillary information in the model.

ESTIMATING THE DISTRIBUTION OF AREA VALUES

While the performance of estimators for particular areas is a relevant criterion when the estimates will be used to decide on resource allocation to particular areas, there are other policy applications for which it is more important that the set of estimates produced by SAE reflect the overall distribution of area values over the different areas in the country.
This is important if the government wishes to assess the overall extent of geographic inequality for the variable concerned, or if the applications for funding by some higher-level institution (such as the European Community) depend on the number of areas in a country which fall below some specified threshold.

From this point of view, a reasonably good set of estimates might be one for which the empirical standard deviation of the true area values was close to the empirical standard deviation of the estimated area values – i.e. one for which

$$\frac{1}{D} \sum_{d=1}^{D} \left( \hat{\theta}_d - \bar{\hat{\theta}} \right)^2 \approx \frac{1}{D} \sum_{d=1}^{D} \left( \theta_d - \bar{\theta} \right)^2 \qquad (1)$$

where $\{\theta_d\}$ is the set of true area values; $\{\hat{\theta}_d\}$ is a set of estimated area values generated by applying the estimator concerned on any one occasion; $\bar{\theta}$ is the mean of the true values of the $D$ areas, and $\bar{\hat{\theta}}$ is the corresponding mean of the estimated values.

Of course the standard deviation does not fully specify the distribution, since two distributions with the same standard deviation could still have different shapes, in the sense of being differently skewed, or having a different degree of kurtosis. Ideally the empirical distribution functions of the two sets of estimates should resemble each other – i.e. the value of $\int_0^{\infty} \left| F(s) - F^*(s) \right| ds$ should be as small as possible, where

$$F(s) = \frac{1}{D} \sum_{d} I_d\left( \theta_d \le s \right) \qquad (2)$$

and

$$F^*(s) = \frac{1}{D} \sum_{d} I_d\left( \hat{\theta}_d \le s \right) \qquad (3)$$

If the value of the integral actually was zero, i.e. if $F(s)$ and $F^*(s)$ were equal over their whole range, it would mean that any functions calculated on the whole set of estimates – such as the proportion of areas for which $\theta$ was below some critical value, or a measure of inequality such as the Gini Coefficient – would also apply to the set of true values. However, since the expressions given in (2) and (3) are rather intractable, we will base the analysis that follows on the comparison of standard deviations.
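The two criteria described above, matching the spread of the area values and matching their empirical distribution functions, can both be computed directly for any pair of true and estimated value sets. The sketch below is illustrative only: the function names are hypothetical, and the distribution criterion is taken to be the integral of the absolute difference between the two empirical distribution functions, which is an assumption about the exact form of the integrand.

```python
import numpy as np


def variance_ratio(true_vals, est_vals):
    """Ratio of the between-area variance of the estimates to that of the truth."""
    return float(np.var(est_vals) / np.var(true_vals))


def ecdf_distance(true_vals, est_vals):
    """Integral of |F(s) - F*(s)| over s for the two empirical distribution functions.

    Both functions are step functions that only change at observed values,
    so the integral is an exact sum over the intervals between sorted values.
    """
    t = np.sort(np.asarray(true_vals, dtype=float))
    e = np.sort(np.asarray(est_vals, dtype=float))
    grid = np.sort(np.concatenate([t, e]))
    # Value of each step function on the interval [grid[j], grid[j + 1]).
    F = np.searchsorted(t, grid[:-1], side="right") / t.size
    F_star = np.searchsorted(e, grid[:-1], side="right") / e.size
    return float(np.sum(np.abs(F - F_star) * np.diff(grid)))


# Two hypothetical areas whose estimates are shrunk towards the common mean.
true_vals = np.array([0.0, 10.0])
shrunk = np.array([4.0, 6.0])
print(variance_ratio(true_vals, shrunk))   # well below 1: spread is understated
print(ecdf_distance(true_vals, shrunk))    # positive: the distributions differ
```

A variance ratio below 1 indicates that the estimates understate the spread of the true values, and a ratio above 1 that they overstate it; the distance is zero only when the two sets of values have identical empirical distributions.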
Since the equivalence of the true and estimated standard deviations is a necessary condition for the equivalence of the two distributions, any difficulties revealed by the comparison of standard deviations will apply even more strongly to comparisons of the estimated and true distributions.

Rather than being based on a single application of each estimation method, the comparisons are derived from the average over many simulations. I.e.

$$\frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{D} \sum_{d=1}^{D} \left( \hat{\theta}_d^{(k)} - \bar{\hat{\theta}}^{(k)} \right)^2 \right] \quad \text{is compared with} \quad \frac{1}{D} \sum_{d=1}^{D} \left( \theta_d - \bar{\theta} \right)^2$$

where $k$ indexes the simulation runs and $K$, as before, is usually 500.

[Figure 2: bar chart titled “Standard Deviation of True and Estimated Area Means - UK Income NUTS 3”. Horizontal axis: estimator type (Area-level Synthetic, Composite (Area Synthetic / Direct), True Population Mean, GREG, Direct).]

Figure 2 – Comparing the true standard deviation of area means with that produced by different estimation strategies for income at NUTS3 in Northwest England and North Wales.

In Figure 2, we compare the true standard deviation of area means for NUTS3 areas in Northwest England and North Wales with the standard deviations of estimates of these means produced using the Direct, GREG, area synthetic and composite methods described above. The figure shows that the direct estimator tends to overestimate extremes in the distribution, and as a result the standard deviation of area values is over-inflated. The area-level synthetic estimator has the opposite effect, and tends to “shrink” the estimates towards the centre of the distribution. The result is understatement of extreme values, often referred to as “over-shrinkage” in this context, which is equally problematic when our goal is the description of the overall distribution. These results are in line with the findings of Spjøtvoll and Thomsen (1987), who demonstrated similar results for direct and model-based estimators of participation in the Norwegian Labour Force and industrial groupings.
The underlying reason for this pattern is that, in design-based estimation, the sampling process adds an additional layer of variability to the underlying variability of the true area values, and therefore

$$var\left( \hat{\theta}^{design} \right) = var(\theta) + var(\text{sampling residual}) > var(\theta)$$

On the other hand, the synthetic estimator approximates the value produced by regressing the true area values on the covariates, and so in this case

$$var(\theta) = var\left( \hat{\theta}^{synth} \right) + var(\text{regression residual})$$

which implies that

$$var\left( \hat{\theta}^{synth} \right) = var(\theta) - var(\text{regression residual}) < var(\theta)$$

Composite estimators combine the strengths of design-based and synthetic estimators. In the Eurarea simulations, they usually produced results that were closer to the true distribution than other estimators (Figure 2 supports this), but even here the results tended to be less spread than the true values.

We can visualise these findings more fully using maps. The series of maps in Figure 3 shows the true set of area means for the proportion of single person households in NUTS5 areas in Northwest England and North Wales, followed by maps of the predicted values of this target variable produced by a direct estimator, the area-level synthetic estimator and the composite estimator.

Figure 3 – Comparing estimation results for the proportion of single person households in Northwest England and North Wales at NUTS5 level. The true area proportions are at top left, with the direct estimates at top right, area-level synthetic estimates at bottom left and composite estimates at bottom right. All estimates are based on a single sample.

Figure 3 illustrates how well the direct, synthetic and composite estimators capture the distribution of area values. Comparing the true distribution at top left with the distribution produced by the direct estimator at top right, it is clear that the direct estimates massively exaggerate differences between areas.
If we then compare the area-level synthetic estimates at bottom left and the composite estimates at bottom right, we see that these provide much closer approximations to the true distribution of area proportions. This demonstrates the gain that can be achieved through the use of model-assisted methods over design-based approaches, particularly when sample sizes are small or non-existent and high quality ancillary information is available.

However, we also see the effects of shrinkage on the distribution. The synthetic estimator moderates extremes so that the low proportions of single person households in the south-central area of the map are overestimated and the very high proportions in the south-western area are correspondingly underestimated. The composite estimator performs slightly better, but differences between composite and synthetic are relatively minor in this particular case.

This behaviour is problematic if our goal is an international or regional comparison of areas for allocation purposes such as the distribution of European structural funds. The issues raised become particularly serious if the methods used to produce small area estimates are inconsistent between countries. Consider a case of two countries with the same level of regional inequality, but where regional statistics are produced in one country using direct estimates and in the other using synthetic estimates. The country that employs direct estimates will appear to exhibit significant regional stratification, while the country that uses synthetic estimates will appear to be comparatively more egalitarian. This will affect funding allocation in both countries, with one gaining at the expense of the other despite the fact that the underlying distributions in both countries are identical. A consistent approach to the provision of regional estimates across member states, or appropriate procedures to adjust local estimates, is needed to overcome this problem.
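The two-country example above can be illustrated with a small numerical sketch. Here a single hypothetical set of true area values stands in for both countries, one of which publishes direct estimates and the other area-level synthetic estimates. The population model, the number of areas and the noise levels are all assumptions chosen for the illustration, not Eurarea results.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 2000                                   # hypothetical number of areas
x = rng.normal(size=D)                     # one area-level covariate
theta = x + rng.normal(scale=1.0, size=D)  # true area values: model part plus area effect

# "Country A": direct estimates, unbiased for each area but with sampling noise.
direct = theta + rng.normal(scale=1.0, size=D)

# "Country B": synthetic estimates, i.e. fitted values from regressing the
# direct estimates on the covariate.
X = np.column_stack([np.ones(D), x])
beta, *_ = np.linalg.lstsq(X, direct, rcond=None)
synth = X @ beta

sd = {
    "true": float(np.std(theta)),
    "direct": float(np.std(direct)),
    "synthetic": float(np.std(synth)),
}
print(sd)
```

Although the underlying distribution is the same in both cases, the direct estimates show a larger standard deviation than the true values and the synthetic estimates a smaller one, so the two countries would appear to differ in regional inequality purely because of the estimation method.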
Adjustments for over-shrinkage

It is not possible to produce optimal estimates of both local values and the distribution of area values simultaneously. However, given an underlying model approach, it is possible to adjust the estimates so that they have a between-area spread that is closer to the true distribution. It is important to note that this gain can only be made at the cost of local precision.

Adjustment methods are proposed by both Spjøtvoll and Thomsen (1987) and Shen and Louis (1998). However, so far as we know, the amount of empirical work on the practicability and performance of the adjusted estimators is rather limited. This paper has not attempted to assess these suggestions. However, the extent of the over-shrinkage and over-inflation problems documented here suggests that there is a strong case for further empirical investigations of the performance of estimators that aim to preserve the underlying distribution of area values – with a view to identifying best practice in the context of European official statistics.

ACKNOWLEDGEMENTS

The results described in this paper were obtained by the Eurarea Consortium within the EURAREA information societies technology project (IST 2000-26290), which has benefited from the financial support of the European Union. We present these findings on behalf of the consortium. However, the opinions presented in this paper are the sole responsibility of the authors.

REFERENCES

Heady, P. and Hennell, S., 2001, Enhancing small area estimation techniques to meet European needs. Statistics in Transition 5: 195-203.

Pfeffermann, D., 2002, Small area estimation – new developments and directions. International Statistical Review 70: 125-143.

Rao, J.N.K., 2003, Small area estimation. Wiley.

Särndal, C.E., 1984, Design-consistent versus model-dependent estimation for small domains. JASA 79: 624-631.

Särndal, C.E., Swensson, B. and Wretman, J., 1992, Model Assisted Survey Sampling. Springer-Verlag, New York (corrected 4th printing, 1997).

Shen, W. and Louis, T.A., 1998, Triple-goal estimates in two-stage hierarchical models. JRSS(B) 60: 455-471.

Spjøtvoll, E. and Thomsen, I., 1987, Application of some Empirical Bayes methods to small area statistics. Bulletin of the International Statistical Institute, Vol. 4, p. 435-450.
