THE SUITABILITY OF ALTERNATIVE
SURVEY-DERIVED SMALL AREA
STATISTICS IN INTERNATIONAL
COMPARISONS: FINDINGS FROM THE
EURAREA PROJECT
Patrick Heady and Martin Ralphs
Spatial Analysis and Modelling Branch (Research)
Methodology Group,
Office for National Statistics
1 Drummond Gate, London SW1V 2QQ.
Abstract
In the use of survey-derived statistics for small areas (using model-assisted and model-based methods) the predominant focus has been on alternative methods’ ability to provide
estimates for individual small areas. The quality of the estimates for a set of areas has
generally been thought of in terms of the average value of such area-specific measures as
bias or mean square error.
Recently, however, there has been growing interest in estimators which conserve key
properties of the overall set of area characteristics. Perhaps the most important
characteristic of this kind is how well the estimators represent the true extent of
geographical inequality. In general “design-based” estimators can be expected to
overstate the extent of geographical inequality, while model-based estimators are likely to
understate it. A consequence of this is that comparisons of spatial inequality between
different states may be powerfully affected by differences in the small area estimation
methods they use.
The Eurarea project, carried out by NSIs and academics, used simulation methods, based
on simulated samples from 100-percent census and register data-bases for six European
countries, to measure the practical performance of a range of small area estimators. These
included direct estimators, GREGs, regression synthetic estimators, composite estimators,
and estimators that borrowed strength from patterns of temporal and spatial autocorrelation.
In this paper, we use some of these results to illustrate the performance of different types
of estimators for estimating area-specific totals and explore the extent of the distortion to
spatial distributions that can occur from using small area estimation methods that are
optimal in other respects, relating this particularly to the issue of international
comparisons of spatial inequality.
INTRODUCTION
The provision of high quality small area statistics is a growing priority for European
governments, primarily so that resource allocation can be optimally directed to tackle
problems such as poor housing and health, unemployment and low pay. A problem arises
because the availability of small area statistics for key variables of interest is limited by
the cost of data collection. It is prohibitively expensive to carry out detailed local surveys
with comprehensive coverage for very small areas, and while sample surveys are carried
out to collect information about key themes of local interest, these are designed primarily
for efficient estimation at national level.
Small area estimation techniques (henceforth abbreviated to SAE) can help to overcome
these problems by using local sample information, usually coupled with ancillary data
from secondary sources such as population censuses or administrative data, to provide
area-specific estimates with higher precision than a direct estimate based solely on the
sample. The Eurarea project (see Heady and Hennell, 2001, for a full description) has
investigated the performance of standard and innovative methods for SAE in the
European context, with the objective of providing advice to Eurostat and to European
NSIs on the appropriate use of SAE methods in the context of official statistics. The full
range of results obtained, including methodological findings, specimen programs and
recommendations regarding statistical policy will be published in the Eurarea project
reference volume and made available on the project website later this year.¹
The most significant experimental capability developed within Eurarea has been the
facility to evaluate estimator performance through large-scale simulations based on
population register and census datasets in six different European countries. Repeated
samples are drawn in realistic ways from these bases, estimation strategies applied to
them and the results compared with the true values contained within the databases across
many replications. This has enabled us to assess empirically the predictive power of the
estimators and the reliability of their in-built error prediction procedures.
In this paper, we introduce small area estimation approaches and use some of the
empirical results produced by the Eurarea Consortium to illustrate the performance of
several commonly applied SAE methods in respect of two key user requirements – the
provision of reliable area-specific estimates and, of particular concern to policy makers
responsible for resource allocation, the modelling of the distribution of area values. We
move on to consider the implications of our findings in the light of these two objectives.
SMALL AREA ESTIMATION METHODS
Since the most pressing issues facing European NSIs at the moment are whether to use
model-based approaches at all, and how to make the best use of the data provided by the
different national statistical systems, it was appropriate to concentrate on relatively
straightforward methods in Eurarea. The SAE methods that we present here are therefore
not comprehensive. Rao (2003) and Pfeffermann (2002) present comprehensive reviews
of the methods that are available and the interested reader is directed to these references
for more information.
¹ http://www.statistics.gov.uk/eurarea, which will be available for public access in summer 2004.
In Eurarea, our primary interest was in comparing the effectiveness of estimators drawn
from both the design-based and model-based families (Särndal 1984) and in considering
how performance was affected by a range of external factors such as sampling methods,
the sizes of the small areas for which estimates were required and the treatment of both
binary and continuous target variables. For the purposes of this paper, we introduce four
of the basic estimators from Eurarea. Our main objective is to demonstrate how the
predictive power of our estimation can be enhanced through the deployment of different
approaches, and to illustrate the limitations that apply when we use particular SAE
strategies based on design and model-based approaches or combinations of the two.
The types of estimate produced by design and model-based estimation procedures differ
fundamentally. In the case of a design-based estimator, the estimate produced is unique
to each individual small area under consideration. The estimate is unbiased for that area,
in the sense that, under repeated sampling the mean of successive estimates will tend
towards the true value. For model-based estimators, the situation is somewhat different.
A model-based estimator utilises ancillary information to produce an estimate of the
target variable that is applicable to all small areas that share similar characteristics. Thus,
if two small areas have exactly the same ancillary information, exactly the same estimate
will be produced for each by the model-based procedure. Unlike the situation with
design-based estimators, the discrepancies between the true value for a particular area
and the model-estimates generated from successive samples will not tend to average out
over the long run.
The estimators
We considered the performance of four basic estimator types, which we define and
discuss below. We use the following standard notation in all our equations:
1. $Y$ denotes the survey variable of interest; $X$ denotes ancillary data;
2. Lowercase letters refer to sample statistics and uppercase to population statistics;
3. Indices $i$ and $d$ refer to individuals and small areas (domains), respectively;
4. $w$ refers to the sample inclusion probabilities of individuals ($w_{id}$ is the weight for individual $i$ in area $d$);
5. $n$ is the sample size and $N$ is the population size;
6. $s$ refers to the sample;
7. A bar above a variable refers to the mean – e.g. $\bar{y}_d$ is the sample mean of $y$ for area $d$;
8. A hat above a variable refers to an estimate – e.g. $\hat{Y}_d$ is an estimate of $Y_d$;
9. $u$ and $e$ refer to area- and unit-level random effects;
10. In order to simplify the notation we sometimes use $\theta_d$ to refer to an area-specific quantity. Thus we might write $\theta_d$ for $Y_d$, and $\hat{\theta}_d$ for $\hat{Y}_d$;
11. $I_d(\cdot)$ is an indicator variable, taking the value 1 if the condition inside the bracket applies to area $d$, and 0 if it does not.
1. Direct Estimator
The direct estimator is defined as the $\pi$-weighted Horvitz-Thompson estimator for each
area (Särndal et al., 1992), where $\pi$ is the probability of inclusion in the sample. This
design-based estimator is the local average value of the target variable for sampled units
in each area and is given by the formula:
$$\hat{Y}_d^{DIRECT} = \frac{1}{\hat{N}_d} \sum_{i \in s_d} w_{id} y_{id}, \quad \text{where } \hat{N}_d = \sum_{i \in s_d} w_{id}$$
In practice, the direct estimator is highly vulnerable to sample size and coverage, and can
only be computed for areas which are sampled. We include it here to provide a
benchmark against which we can compare the performance of more sophisticated model-based and composite approaches.
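As an illustration, the direct estimator for a single area can be sketched in a few lines of Python. This is a minimal sketch, not the Eurarea implementation: the function name `direct_estimate` and the sample values are hypothetical, and the weights are assumed to be supplied already.

```python
def direct_estimate(y, w):
    """Weighted mean of the sampled values of y for one area.

    y: sampled values y_id for area d
    w: matching weights w_id (assumed to be supplied by the survey design)
    """
    if len(y) != len(w):
        raise ValueError("y and w must have the same length")
    if not y:
        # No sampled units: the direct estimator cannot be computed.
        raise ValueError("the direct estimator is undefined for an unsampled area")
    n_hat = sum(w)  # N-hat_d, the sum of the weights in the area
    return sum(wi * yi for wi, yi in zip(w, y)) / n_hat


# Hypothetical area with three sampled units.
print(direct_estimate([10.0, 20.0, 30.0], [1.0, 1.0, 2.0]))  # 22.5
```

The explicit error for an empty sample mirrors the point made above: the estimator exists only for areas that contain sampled units.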
2. Generalised Regression Estimator (GREG):
The design-based GREG (Generalised REGression estimator) is obtained by adjusting the
direct estimator for an area for differences between the sample and population area means
of covariates. The adjustments are calculated by using a model relating y and X. As a
standard, the ordinary regression model is used and this has been applied in Eurarea. The
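formula for the GREG estimator is:
$$\hat{Y}_d^{GREG} = \frac{1}{\hat{N}_d} \sum_{i \in s_d} w_{id} y_{id} + \left( \bar{\mathbf{X}}_d - \frac{1}{\hat{N}_d} \sum_{i \in s_d} w_{id} \mathbf{x}_{id} \right)^T \hat{\boldsymbol{\beta}}$$
where $\bar{\mathbf{X}}_d = (\bar{X}_{d,1}, \ldots, \bar{X}_{d,p})^T$ is a vector of $p$ population mean covariates.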

3. Area-level Synthetic Estimator:
Synthetic estimators assume a model that describes the relationship between the target
variable y and set of ancillary data X. Through this modelled relationship, the ancillary
data can be used to predict the mean of y for all target areas. The estimator and its
variance are developed on the assumption that the model used accurately describes the
population. If the model can draw sufficient power from the available ancillary data, the
method can provide substantial gains over direct estimates which rely solely upon survey
data. In practice, area models have been used more extensively than other synthetic
estimators for SAE work.
In this example, a linear model with area-level covariates is fitted to the sample area
means of the target variable. The model is
$$\bar{y}_{\cdot d} = \bar{\mathbf{X}}_d^T \hat{\boldsymbol{\beta}} + u_d + e_{id}$$
and the estimator is
$$\hat{Y}_d^{SYNTH} = \bar{\mathbf{X}}_d^T \hat{\boldsymbol{\beta}}$$
where $u_d$ and $e_{id}$ are independent variables with mean 0 and variances
$\sigma_u^2$ and $\sigma_e^2$. The variance term $\sigma_u^2$ is the main component of the MSE.
4. Composite Estimator:
Composite estimators attempt to improve performance by combining the strengths of
synthetic and design-based estimators. The example given here is a weighted
combination of the area-synthetic and direct estimators. The estimator is given by the
following formula:
$$\hat{Y}_d^{COMP} = \gamma_d \hat{Y}_d^{DIRECT} + (1 - \gamma_d) \bar{\mathbf{X}}_d^T \hat{\boldsymbol{\beta}}^{SYNTHETIC}, \quad \text{where } \gamma_d = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2 / n_d}$$
and $n_d$ is the sample size
of area d. The gamma term is a weight based on the modelled variability of the areas and
the sampling variability of the data collected in each area. It is used to adjust the
contribution of the direct and synthetic components of the estimator. When sampling
variability is high (and the reliability of the direct estimator is therefore questionable), the
composite estimator is weighted in favour of its synthetic component. When the variance
of the direct estimator is low, the estimator is weighted in favour of its direct component.
In situations where an area is unsampled, the estimate will be based wholly on the
synthetic component of the estimator.
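The weighting behaviour described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the variance components are assumed to have been estimated elsewhere, and the numbers are invented for the example.

```python
def gamma_weight(sigma_u2, sigma_e2, n_d):
    """Weight on the direct estimate: sigma_u2 / (sigma_u2 + sigma_e2 / n_d)."""
    if n_d == 0:
        return 0.0  # unsampled area: all weight goes to the synthetic part
    return sigma_u2 / (sigma_u2 + sigma_e2 / n_d)


def composite_estimate(direct, synthetic, sigma_u2, sigma_e2, n_d):
    """Weighted combination of the direct and synthetic estimates for one area."""
    if n_d == 0:
        return synthetic  # no direct estimate exists for an unsampled area
    g = gamma_weight(sigma_u2, sigma_e2, n_d)
    return g * direct + (1.0 - g) * synthetic


# Hypothetical variance components: sigma_u2 = 4, sigma_e2 = 16.
print(composite_estimate(30.0, 20.0, 4.0, 16.0, n_d=4))  # 25.0 (gamma = 0.5)
print(composite_estimate(None, 20.0, 4.0, 16.0, n_d=0))  # 20.0 (synthetic only)
print(gamma_weight(4.0, 16.0, n_d=100) > gamma_weight(4.0, 16.0, n_d=4))  # True
```

The last line shows the direction of the effect: as the area sample size grows, the sampling variance term shrinks and more weight moves to the direct component.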
ESTIMATING AREA-SPECIFIC VALUES
The primary goal of SAE is the precise estimation of area-specific parameters to produce
optimal local estimates. We now consider how well each of the above estimators
performs when predicting individual area values. For this we need some criterion of
what counts as a good predictor. The criterion we adopt here is the minimisation of
squared area loss for each area. A good estimate for area d is thus one which minimises
$$MSE_d = E\left[ \left( \hat{\theta}_d - \theta_d \right)^2 \right]$$
The results we present are summaries derived from Eurarea simulation studies of
estimator performance for three target variables: equivalised household income,
proportion of single person households and the ILO-definition unemployment rate. The
simulation process consisted of drawing samples, using approximations to the sampling
designs that would be used in practice, applying the various estimation procedures, and
comparing their estimates to the true values for all areas in the study data-set. The process
was repeated a large number of times (typically 500).
This enabled us to produce an empirical summary of the MSE properties of each
estimator, in the form of the average empirical mean squared error, which we define
thus:
$$AEMSE(\hat{Y}_d) = \frac{1}{DK} \sum_{d=1}^{D} \sum_{k=1}^{K} \left( \hat{Y}_d^{(k)} - Y_d \right)^2$$
where $\hat{Y}_d^{(k)}$ is the estimate of the target variable for area d in simulation k, $Y_d$ is the true
mean of Y for area d, D is the number of areas and K is the number of replicates in the simulation. The smaller this
quantity is, the better the estimator has performed over the whole set of areas.
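The calculation of this summary can be sketched in Python. The layout of the inputs (a list of replicates, each holding one estimate per area) is an assumption made for the example, and the numbers are hypothetical rather than Eurarea results.

```python
def aemse(estimates, truth):
    """Average empirical mean squared error.

    estimates[k][d]: estimate for area d in replicate k
    truth[d]: true mean for area d
    """
    K = len(estimates)
    D = len(truth)
    total = sum(
        (estimates[k][d] - truth[d]) ** 2
        for k in range(K)
        for d in range(D)
    )
    return total / (D * K)


# Two hypothetical areas, two hypothetical replicates.
truth = [10.0, 20.0]
estimates = [[11.0, 19.0], [9.0, 22.0]]
print(aemse(estimates, truth))  # 1.75
```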
In Figure 1 we summarise this information for six countries in European NUTS5 areas.
We do not show the MSE results themselves in the graph; instead we show the mean
value of the rank achieved by each estimator across all of the simulation runs, since this
allows us to compare results from different countries concisely. The graph shows that
model-based approaches (the synthetic and composite estimators) consistently exhibit
improved MSE performance over the design-based estimators (the Direct and GREG) for
these small geographical areas.
[Figure 1: bar chart titled "Estimator Performance by Mean Rank - NUTS5". Horizontal axis: target variable (Income, Household, Unemployment). Vertical axis: mean rank (0.00 to 4.50). Series: Direct, GREG, Area Synthetic, Composite.]
Figure 1 – Estimator performance by mean rank based on average MSE across
simulation runs at NUTS5 level. Lower ranks indicate better performance.
Estimator performance is influenced by a range of contributory factors. We will now
consider some of the most significant ones and then discuss their impact upon the results
that we achieve.
We have stated that design-based estimators are particularly vulnerable to sample size
and consistency and that these estimates can only be produced for areas that contain a
sample. This is a problem in small area estimation, since many national surveys feature
clustered sampling designs which provide data for only a subset of the areas of interest.
To produce stable estimates with an acceptable level of variability, these estimators
require large samples. Once again, this is problematic, since sample sizes will typically
be very small, particularly at geographical levels below NUTS3.
Model-based estimators are much less vulnerable to sample size than design-based
estimators. Instead, their predictive power is reliant on the use of ancillary information in
the form of a set of X variables (covariates) to which the sampled values of the target
variable are related. In this case, it is the choice of an appropriate set of ancillary data
that is of critical importance – such data must be available for all of the small areas for
which estimates are required and, ideally, will be strongly related to the target variable
for maximum effectiveness.
The main limitation of the model-based approach is estimator bias. Because synthetic
estimators apply a globally fitted model consistently to all target areas, they tend to
underestimate extreme values, “shrinking” these towards the global mean. Additionally,
they may systematically underestimate or overestimate values for particular subtypes of
area if the causes of variation in the target variable particular to these subtypes are not
captured by the ancillary information in the model.
ESTIMATING THE DISTRIBUTION OF AREA VALUES
While the performance of estimators for particular areas is a relevant criterion when the
estimates will be used to decide on resource allocation to particular areas, there are other
policy applications for which it is more important that the set of estimates produced by
SAE reflect the overall distribution of area values over the different areas in the country.
This is important if the government wishes to assess the overall extent of geographic
inequality for the variable concerned, or if the applications for funding by some higher-level institution (such as the European Community) depend on the number of areas in a
country which fall below some specified threshold.
From this point of view, a reasonably good set of estimates might be one for which the
empirical standard deviation of the true area values was close to the empirical standard
deviation of the estimated area values – i.e. one for which
$$\sqrt{\frac{1}{D} \sum_{d=1}^{D} \left( \theta_d - \bar{\theta} \right)^2} \approx \sqrt{\frac{1}{D} \sum_{d=1}^{D} \left( \hat{\theta}_d - \bar{\hat{\theta}} \right)^2} \qquad (1)$$
where
$\{\theta_d\}$ is the set of true area values;
$\{\hat{\theta}_d\}$ is a set of estimated area values generated by applying the estimator concerned on
any one occasion;
$\bar{\theta}$ is the mean of the true values of the D areas, and $\bar{\hat{\theta}}$ is the corresponding mean of the
estimated values.
Of course the standard deviation does not fully specify the distribution, since two
distributions with the same standard deviation could still have different shapes, in the
sense of being differently skewed, or having a different degree of kurtosis. Ideally the
empirical distribution functions of the two sets of estimates should resemble each other –
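i.e. the value of
$$\int_0^{\infty} \left| F(s) - F^*(s) \right| \, ds$$
should be as small as possible, where
$$F(s) = \frac{1}{D} \sum_d I_d\left( \theta_d \le s \right) \qquad (2)$$
and
$$F^*(s) = \frac{1}{D} \sum_d I_d\left( \hat{\theta}_d \le s \right) \qquad (3)$$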
If the value of the integral actually was zero, i.e. if $F(s)$ and $F^*(s)$ were equal over their
whole range, it would mean that any functions calculated on the whole set of estimates –
such as the proportion of areas for which $\theta$ was below some critical value, or a measure
of inequality such as the Gini Coefficient – would also apply to the set of true values.
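For finite sets of area values the two distribution functions are step functions, so the area between them can be computed exactly. The following Python sketch shows one way to do this; it assumes the distance is measured as the integral of the absolute difference, and the function names and example values are hypothetical.

```python
def edf(values, s):
    """Empirical distribution function: share of areas with value <= s."""
    return sum(1 for v in values if v <= s) / len(values)


def edf_distance(true_values, estimates):
    """Area between the two empirical distribution functions.

    Both functions are constant between consecutive observed values,
    so the integral reduces to a finite sum of rectangle areas.
    """
    points = sorted(set(true_values) | set(estimates))
    total = 0.0
    for left, right in zip(points, points[1:]):
        gap = abs(edf(true_values, left) - edf(estimates, left))
        total += gap * (right - left)
    return total


# Hypothetical example: estimates that are fully shrunk to a single value.
true_values = [1.0, 2.0, 3.0]
shrunk = [2.0, 2.0, 2.0]
print(edf_distance(true_values, shrunk))       # about 0.667
print(edf_distance(true_values, true_values))  # 0.0
```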
However, since the expressions given in (2) and (3) are rather intractable, we will base
the analysis that follows on the comparison of standard deviations. Since the equivalence
of the true and estimated standard deviations is a necessary condition for the equivalence
of the two distributions, any difficulties revealed by the comparison of standard
deviations will apply even more strongly to comparisons of the estimated and true
distributions. Rather than being based on a single application of each estimation method,
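the comparisons are derived from the average over many simulations, i.e.
$$\frac{1}{K} \sum_{k=1}^{K} \sqrt{\frac{1}{D} \sum_{d=1}^{D} \left( \hat{\theta}_d^{(k)} - \bar{\hat{\theta}}^{(k)} \right)^2} \quad \text{is compared with} \quad \sqrt{\frac{1}{D} \sum_{d=1}^{D} \left( \theta_d - \bar{\theta} \right)^2}$$
where k indexes the simulation runs and K, as before, is usually 500.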
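This comparison can be sketched in Python using the population standard deviation from the standard library. The input layout (one list of area estimates per replicate) and the numbers are assumptions made for illustration.

```python
from statistics import pstdev  # population SD, i.e. divides by D


def mean_sd_of_estimates(estimates):
    """Average, over replicates, of the between-area SD of the estimates.

    estimates[k][d]: estimate for area d in replicate k
    """
    return sum(pstdev(rep) for rep in estimates) / len(estimates)


# Hypothetical true values and two hypothetical replicates of estimates.
truth = [10.0, 20.0, 30.0]
estimates = [[5.0, 20.0, 35.0], [8.0, 21.0, 34.0]]

print(pstdev(truth))                    # SD of the true area values
print(mean_sd_of_estimates(estimates))  # average SD of the estimated values
```

A ratio of the second figure to the first above 1 would indicate over-dispersed estimates, and a ratio below 1 would indicate shrinkage.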
[Figure 2: bar chart titled "Standard Deviation of True and Estimated Area Means - UK Income NUTS 3". Horizontal axis: estimator type (Area-level Synthetic, Composite (Area Synthetic / Direct), True Population Mean, GREG, Direct). Vertical axis: standard deviation (0.000 to 30.000).]
Figure 2 – Comparing the true standard deviation of area means with that produced
by different estimation strategies for income at NUTS3 in Northwest England and
North Wales.
In Figure 2, we compare the true standard deviation of area means for NUTS3 areas in
Northwest England and North Wales with the standard deviations of estimates of these
means produced using Direct, GREG, area synthetic and composite methods described
above. The figure shows that the direct estimator tends to overestimate extremes in the
distribution, and as a result the standard deviation of area values is over-inflated. The
area level synthetic estimator has the opposite effect, and tends to “shrink” the estimates
towards the centre of the distribution. The result is understatement of extreme values,
often referred to as “over-shrinkage” in this context, which is equally problematic when
our goal is the description of the overall distribution. These results are in line with the
findings of Spjøtvoll and Thomsen (1987), who demonstrated similar results for direct
and model-based estimators of participation in the Norwegian Labour Force and
industrial groupings. The underlying reason for this pattern is that, in design-based
estimation, the sampling process adds an additional layer of variability to the underlying
variability of the true area values, and therefore
$$\mathrm{var}\left( \hat{\theta}^{design} \right) = \mathrm{var}(\theta) + \mathrm{var}(\text{sampling residual}) > \mathrm{var}(\theta)$$
On the other hand the synthetic estimator approximates the value produced by regressing
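the true area values on the covariates, and so in this case
$$\mathrm{var}(\theta) = \mathrm{var}\left( \hat{\theta}^{synth} \right) + \mathrm{var}(\text{regression residual})$$
which implies that
$$\mathrm{var}\left( \hat{\theta}^{synth} \right) = \mathrm{var}(\theta) - \mathrm{var}(\text{regression residual}) < \mathrm{var}(\theta)$$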
Composite estimators combine the strengths of design-based and synthetic estimators. In
the Eurarea simulations, they usually produced results that were closer to the true
distribution than other estimators (Figure 2 supports this), but even here the results tended
to be less spread than the true values.
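The pattern described above can be reproduced with a small artificial simulation. The sketch below is not based on Eurarea data: the population model, the parameter values, the single covariate and the use of known (rather than estimated) variance components are all assumptions made purely for illustration.

```python
import random
from statistics import pstdev


def simulate(D=1000, sigma_u=5.0, sigma_e=12.0, n_d=4, seed=1):
    """Compare the between-area spread of true values and three estimators."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(D)]  # one area-level covariate
    theta = [50.0 + 5.0 * xd + rng.gauss(0.0, sigma_u) for xd in x]  # true values

    # Direct estimates: true value plus sampling noise with variance sigma_e^2 / n_d.
    samp_sd = sigma_e / n_d ** 0.5
    direct = [t + rng.gauss(0.0, samp_sd) for t in theta]

    # Synthetic estimates: ordinary least squares fit of the direct estimates on x.
    mx, md = sum(x) / D, sum(direct) / D
    slope = (sum((xi - mx) * (di - md) for xi, di in zip(x, direct))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = md - slope * mx
    synth = [intercept + slope * xd for xd in x]

    # Composite estimates, using the known variance components for simplicity.
    gamma = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2 / n_d)
    comp = [gamma * di + (1.0 - gamma) * si for di, si in zip(direct, synth)]

    return {
        "true": pstdev(theta),
        "direct": pstdev(direct),
        "synthetic": pstdev(synth),
        "composite": pstdev(comp),
    }


for name, value in simulate().items():
    print(f"{name:10s} {value:6.2f}")
```

Under these assumptions the direct estimates should show the widest spread and the synthetic estimates the narrowest, with the composite estimates in between but still below the true spread.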
We can visualise these findings more fully using maps. The series of maps in Figure 3
shows the true set of area means for the proportion of single person households in
NUTS5 areas in northwest England and north Wales, followed by maps of the predicted
values of this target variable produced by a direct estimator, the area-level synthetic
estimator and the composite estimator.
Figure 3 – Comparing estimation results for the proportion of single person
households in Northwest England and North Wales at NUTS5 level. The true area
proportions are at top left, with the direct estimates at top right, area-level synthetic
estimates at bottom left and composite estimates at bottom right. All estimates are
based on a single sample.
Figure 3 illustrates how well the direct, synthetic and composite estimators capture the
distribution of area values. Comparing the true distribution at top left with the
distribution produced by the direct estimator at top right, it is clear that the direct
estimates massively exaggerate differences between areas. If we then compare the area-level synthetic estimates at bottom left and composite estimates at bottom right, we see
that these provide much closer approximations to the true distribution of area proportions.
This demonstrates the gain that can be achieved through the use of model-based
methods over design-based approaches, particularly when sample sizes are small or non-existent and high quality ancillary information is available. However, we also see the
effects of shrinkage on the distribution. The synthetic estimator moderates extremes so
that the low proportions of single person households in the south-central area of the map
are overestimated and the very high proportions in the south-western area are
correspondingly underestimated. The composite estimator performs slightly better, but
differences between composite and synthetic are relatively minor in this particular case.
This behaviour is problematic if our goal is an international or regional comparison of
areas for allocation purposes such as the distribution of European structural funds. The
issues raised become particularly serious if the methods used to produce small area
estimates are inconsistent between countries. Consider a case of two countries with the
same level of regional inequality, but where regional statistics are produced in one
country using direct estimates and in the other using synthetic estimates. The country
that employs direct estimates will appear to exhibit significant regional stratification,
while the country that uses synthetic estimates will appear to be comparatively more
egalitarian. This will affect funding allocation in both countries, with one gaining at the
expense of the other despite the fact that the underlying distributions in both countries are
identical. A consistent approach to the provision of regional estimates across member
states or appropriate procedures to adjust local estimates is needed to overcome this
problem.
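The two-country case can be illustrated with a few lines of Python. All values below are hypothetical: both countries are given the same true area values, and the two sets of estimates are invented to mimic over-dispersion and over-shrinkage.

```python
def count_below(values, threshold):
    """Number of areas whose value falls below the threshold."""
    return sum(1 for v in values if v < threshold)


# Hypothetical values for four areas, identical in both countries.
true_values = [48.0, 52.0, 55.0, 60.0]
country_a_direct = [44.0, 49.0, 58.0, 63.0]     # spread wider than the truth
country_b_synthetic = [51.0, 53.0, 54.0, 57.0]  # pulled towards the centre

threshold = 50.0  # hypothetical eligibility threshold
print(count_below(true_values, threshold))          # 1
print(count_below(country_a_direct, threshold))     # 2
print(count_below(country_b_synthetic, threshold))  # 0
```

Although the underlying distributions are identical, the country using direct estimates appears to have two eligible areas and the country using synthetic estimates appears to have none.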
Adjustments for over-shrinkage
It is not possible to produce optimal estimates of both local values and the distribution of
area values simultaneously. However, given an underlying model approach, it is possible
to adjust the estimates so that they do have a between-area spread that is closer to the true
distribution. It is important to note that this gain can only be made at the cost of local
precision. Adjustment methods are proposed by both Spjøtvoll and Thomsen (1987), and
Shen and Louis (1998). However, so far as we know the amount of empirical work on the
practicability and performance of the adjusted estimators is rather limited. This paper has
not attempted to assess these suggestions. However, the extent of the over-shrinkage, and
over-inflation, problems documented here suggests that there is a strong case for further
empirical investigations of the performance of estimators that aim to preserve the
underlying distribution of area values – with a view to identifying best practice in the
context of European official statistics.
ACKNOWLEDGEMENTS
The results described in this paper were obtained by the Eurarea Consortium within the
EURAREA information societies technology project (IST 2000-26290) which has
benefited from the financial support of the European Union. We present these findings
on behalf of the consortium. However, the opinions presented in this paper are the sole
responsibility of the authors.
REFERENCES
Heady, P. and Hennell, S., 2001, Enhancing small area estimation techniques to meet
European needs. Statistics in Transition 5: 195-203.
Pfeffermann, D., 2002, Small area estimation – new developments and directions.
International Statistical Review 70: 125-143.
Rao, J.N.K., 2003, Small area estimation. Wiley.
Särndal C.E., 1984, Design-consistent versus model-dependent estimation for small
domains. JASA (79) 624-631.
Särndal, Swensson and Wretman, 1997, Model Assisted Survey Sampling, Springer-Verlag, New York, Corrected 4th printing.
Shen W, Louis T.A. (1998) Triple-goal estimates in two-stage hierarchical models.
JRSS(B) 60: 455-471.
Spjøtvoll E. and Thomsen, I. 1987, Application of some Empirical Bayes methods to
small area statistics, Bulletin of the International Statistical Institute, Vol. 4, p. 435 – 450.