Robust Statistics
Osnat Goren-Peyser, 25.3.08

Outline
1. Introduction
2. Motivation
3. Measuring robustness
4. M estimators
5. Order statistics approaches
6. Summary and conclusions

1. Introduction

Problem definition
- Let $x = (x_1, x_2, \ldots, x_n)$ be a set of i.i.d. variables with distribution $F$, sorted in increasing order so that $x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}$; $x_{(m)}$ is the value of the m'th order statistic.
- An estimator $\hat{\theta} = \hat{\theta}(x_1, x_2, \ldots, x_n)$ is a function of the observations. We look for estimators such that $\hat{\theta} \approx \theta$.

Asymptotic value of an estimator
- Define $\hat{\theta}_\infty = \hat{\theta}_\infty(F)$ such that $\hat{\theta}_n \to_p \hat{\theta}_\infty$; $\hat{\theta}_\infty(F)$ is the asymptotic value of the estimator $\hat{\theta}_n$ at $F$.
- The estimator $\hat{\theta}_n$ is consistent for $\theta$ if $\hat{\theta}_\infty(F) = \theta$.
- We say that $\hat{\theta}_n$ is asymptotically normal with parameters $\theta$, $V(\theta)$ if $\sqrt{n}\,(\hat{\theta}_n - \theta) \to_d N(0, V(\theta))$.

Efficiency
- An unbiased estimator is efficient if $I(\theta)\,\mathrm{var}(\hat{\theta}) = 1$, where $I(\theta)$ is the Fisher information.
- An unbiased estimator is asymptotically efficient if $\lim_{n \to \infty} I(\theta)\,\mathrm{var}(\hat{\theta}) = 1$.

Relative efficiency
- For a fixed underlying distribution, consider two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of $\theta$. We say $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$ if $\mathrm{var}(\hat{\theta}_1) \le \mathrm{var}(\hat{\theta}_2)$.
- The relative efficiency (RE) of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$ is defined as the ratio of their variances: $\mathrm{RE}(\hat{\theta}_2; \hat{\theta}_1) = \mathrm{var}(\hat{\theta}_1) / \mathrm{var}(\hat{\theta}_2)$.

Asymptotic relative efficiency
- The asymptotic relative efficiency (ARE) is the limit of the RE as the sample size $n \to \infty$.
- For two estimators $\hat{\theta}_1$, $\hat{\theta}_2$ which are each consistent for $\theta$ and also asymptotically normal [1]: $\mathrm{ARE}(\hat{\theta}_2; \hat{\theta}_1) = \lim_{n \to \infty} \mathrm{var}(\hat{\theta}_1) / \mathrm{var}(\hat{\theta}_2)$.

The location model
- $x_i = \mu + u_i$, $i = 1, 2, \ldots, n$, where $\mu$ is the unknown location parameter, the $u_i$ are the errors, and the $x_i$ are the observations.
- The errors $u_i$ are i.i.d. random variables, each with the same distribution function $F_0$.
- The observations $x_i$ are i.i.d. random variables with common distribution function $F(x) = F_0(x - \mu)$.

Normality
- Classical statistical methods rely on the assumption that $F$ is exactly known.
- The assumption that $F = N(\mu, \sigma^2)$ is a normal distribution is commonly used.
- But normality often holds only approximately → robust methods.

Approximate normality
- The majority of observations are normally distributed, while some observations follow a different pattern (not normal) or no pattern at all.
- Suggested model: a mixture model.

A mixture model
- Formalizes the idea of $F$ being approximately normal.
- Assume that a proportion $1 - \varepsilon$ of the observations is generated by the normal model, while a proportion $\varepsilon$ is generated by an unknown model.
- The mixture model: $F = (1 - \varepsilon) G + \varepsilon H$.
- $F$ is a contamination "neighborhood" of $G$, also called the gross error model.
- $F$ is called a normal mixture model when both $G$ and $H$ are normal.

2. Motivation

Outliers
- An outlier is an atypical observation that is well separated from the bulk of the data.
- Statistics derived from data sets that include outliers will often be misleading.
- Even a single outlier can have a large distorting influence on classical statistical methods.
- [Figure: histograms of a sample without and with outliers; the outliers lie far from the bulk of the data.]
- Estimators not sensitive to outliers are said to be robust.

Mean and standard deviation
- The sample mean, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, is a classical estimate of the location (center) of the data. For $N(\mu, \sigma^2)$, the sample mean is unbiased and distributed $N(\mu, \sigma^2/n)$.
- The sample standard deviation (SD), $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$, is a classical estimate of the dispersion of the data.
- How much influence can a single outlier have on these classical estimators?
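To make the mixture model and the question above concrete, here is a minimal sketch in Python (NumPy assumed available; the sample size, ε, and contaminant scale are illustrative choices, not from the slides). It draws from the gross error model $F = (1-\varepsilon)G + \varepsilon H$ with normal $G$ and $H$ and compares the sample mean and SD of a clean and a contaminated sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def gross_error_sample(n, eps, scale=1.0, h_scale=10.0):
    """Draw n points from F = (1 - eps) * N(0, scale^2) + eps * N(0, h_scale^2)."""
    from_h = rng.random(n) < eps                      # points generated by the contaminant H
    x = rng.normal(0.0, scale, size=n)
    x[from_h] = rng.normal(0.0, h_scale, size=from_h.sum())
    return x

clean = gross_error_sample(1000, eps=0.0)
mixed = gross_error_sample(1000, eps=0.1)

for name, x in [("clean", clean), ("10% contaminated", mixed)]:
    print(f"{name:17s}  mean = {x.mean():6.3f}   sd = {x.std(ddof=1):6.3f}")
# The contaminated SD is far larger, even though 90% of the data is unchanged.
```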
Example 1 – the flour data
- Consider the following 24 determinations of the copper content in wholemeal flour (in parts per million), sorted in ascending order [6]:
  2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95
- The value 28.95 is considered an outlier. Two cases:
  - Case A: use the whole data.
  - Case B: delete the suspicious outlier.

Example 1 – PDFs
- Case A (using the whole data for estimation): $\bar{x} = 4.28$, $s = 5.30$.
- Case B (deleting the outlier): $\bar{x} = 3.21$, $s = 0.69$.
- [Figure: empirical densities for both cases, with the sample mean marked; in Case A the outlier drags the mean away from the bulk of the data.]

Example 1 – the arising question
- Question: how much influence can a single outlier have on the sample mean and the sample SD?
- Assume the outlier value 28.95 is replaced by an arbitrary value varying from −∞ to +∞:
  - the value of the sample mean changes from −∞ to +∞;
  - the value of the sample SD grows without bound as the outlier moves toward ±∞.
- Conclusion: a single outlier has an unbounded influence on these two classical estimators! This is related to the sensitivity curve and the influence function, as we will see later.

Approaches to handling outliers
- Detect and remove outliers from the data set:
  - manual screening;
  - the normal Q–Q plot;
  - the "three-sigma edit" rule.
- Robust estimators!

Manual screening
Why is screening the data and removing outliers not sufficient?
- Users do not always screen the data.
- Outliers are not always errors! Outliers may be correct, and very important for seeing the whole picture, including extreme cases.
- It can be very difficult to spot outliers in multivariate or highly structured data.
- It is a subjective decision: without a unified criterion, different users reach different results, and it is difficult to determine the statistical behavior of the complete procedure.

The normal Q–Q plot
- A manual screening tool for an underlying normal distribution.
- A quantile–quantile plot of the sample quantiles of $X$ versus the theoretical quantiles of a normal distribution.
- If the distribution of $X$ is normal, the plot will be close to linear.

The "three-sigma edit" rule
- An outlier detection tool for an underlying normal distribution.
- Define the ratio between the distance of $x_i$ from the sample mean and the sample SD: $t_i = (x_i - \bar{x}) / s$.
- The "three-sigma edit" rule: observations with $|t_i| > 3$ are deemed suspicious.
- Example 1: the largest observation in the flour data has $t_i = 4.65$, and so is suspicious.
- Disadvantages:
  - in very small samples the rule is ineffective;
  - masking: when there are several outliers, their effects may interact in such a way that some or all of them remain unnoticed.

Example 2 – velocity of light
- Consider the following 20 determinations of the time (in microseconds) needed for light to travel a distance of 7442 m [6]:
  28, 26, 33, 24, 34, −44, 27, 16, 40, −2, 29, 22, 24, 21, 25, 30, 23, 29, 31, 19
- The actual times are the table values × 0.001 + 24.8.
- The values −2 and −44 are suspicious as outliers.

Example 2 – Q–Q plot
- [Figure: normal Q–Q plot of the sample; the observations −44 and −2 fall well below the line through the bulk of the data.]

Example 2 – masking
- Results: $|t_i| = 1.35$ for $x_i = -2$ and $|t_i| = 3.73$ for $x_i = -44$.
- Based on the three-sigma edit rule, the value of $|t_i|$ for the observation −2 does not indicate that it is an outlier: the value −44 "masks" the value −2.
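The masking effect is easy to reproduce. A minimal sketch of the rule on the velocity-of-light data above (plain NumPy; the helper name is ours):

```python
import numpy as np

def three_sigma_edit(x, cutoff=3.0):
    """Return t_i = (x_i - mean) / sd and a flag for |t_i| > cutoff."""
    t = (x - x.mean()) / x.std(ddof=1)
    return t, np.abs(t) > cutoff

light = np.array([28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
                  29, 22, 24, 21, 25, 30, 23, 29, 31, 19], dtype=float)

t, flagged = three_sigma_edit(light)
print(f"|t| at -44: {abs(t[light == -44][0]):.2f}")   # ~3.73, flagged as suspicious
print(f"|t| at  -2: {abs(t[light ==  -2][0]):.2f}")   # ~1.35, masked by -44
```

Dropping −44 and rerunning the rule shrinks the sample SD so much that −2 is then flagged as well, which is exactly the masking story told above.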
Detect and remove outliers
- There are many other methods for detecting outliers.
- Deleting an outlier poses a number of problems:
  - it affects the distribution theory → the data variability is underestimated;
  - it depends on the user's subjective decisions → it is difficult to determine the statistical behavior of the complete procedure.
- Robust estimators provide automatic ways of detecting and removing outliers.

Example 1 – comparing the sample median to the sample mean
- Case A: med = 3.385; Case B: med = 3.370.
- The sample median fits the bulk of the data in both cases.
- The value of the sample median does not change from −∞ to +∞ as was the case for the sample mean.
- [Figure: densities for Cases A and B with the sample mean and sample median marked; the two medians nearly coincide while the Case A mean is pulled toward the outlier.]
- The sample median is a good robust alternative to the sample mean.

A robust alternative to the mean
- The sample median is a very old method for estimating the "middle" of the data.
- The sample median is defined, for an integer $m$, by
  $\mathrm{Med}(x) = \begin{cases} x_{(m)} & \text{if } n \text{ is odd}, \; n = 2m - 1 \\ \big(x_{(m)} + x_{(m+1)}\big)/2 & \text{if } n \text{ is even}, \; n = 2m \end{cases}$
- For large $n$ and $F = N(\mu, \sigma^2)$, the sample median is approximately $N\big(\mu, \pi \sigma^2 / (2n)\big)$.
- At the normal distribution: ARE(median; mean) = $2/\pi \approx 64\%$.

The effect of a single outlier
- The sample mean can be upset completely by a single outlier; the sample median is little affected by one.
- The median is resistant to gross errors whereas the mean is not: the median will tolerate up to 50% gross errors before it can be made arbitrarily large.
- Breakdown point: median 50%, mean 0%.

Mean & median – robustness vs. efficiency
- For the mixture model $F = (1 - \varepsilon) N(\mu, 1) + \varepsilon N(\mu, \tau^2)$, the sample mean variance is $(1 - \varepsilon + \varepsilon \tau^2)/n$, while at the normal ($\varepsilon = 0$) the sample median variance is approximately $\pi/(2n) \approx 1.57/n$:

         ε = 0              ε = 0.05           ε = 0.10
  τ    n·var   n·var      n·var   n·var      n·var   n·var
       (mean)  (med)      (mean)  (med)      (mean)  (med)
  3      1      1.57       1.40    1.68       1.80    1.80
  4      1      1.57       1.75    1.70       2.50    1.84
  5      1      1.57       2.20    1.70       3.40    1.86
  6      1      1.57       2.75    1.71       4.50    1.87
  10     1      1.57       5.95    1.72      10.90    1.90
  20     1      1.57      20.90    1.73      40.90    1.92

- The gain in robustness from using the median is paid for by a loss in efficiency when $F$ is very close to normal.

So why not always use the sample median?
- If the data do not contain outliers, the sample median has poorer statistical performance than the classical sample mean.
- The goal of robust estimation: "the best of both worlds".
- We shall develop estimators which combine the low variance of the mean at the normal with the robustness of the median under contamination.

3. Measuring robustness

Analysis tools
- Sensitivity curve (SC)
- Influence function (IF)
- Breakdown point (BP)

Sensitivity curve
- The SC measures the effect of an outlier at different locations on the estimate.
- The sensitivity curve of an estimator $\hat{\theta}$ for the sample $x_1, x_2, \ldots, x_n$ is
  $SC(x_0) = \hat{\theta}(x_1, x_2, \ldots, x_n, x_0) - \hat{\theta}(x_1, x_2, \ldots, x_n)$,
  where $x_0$ is the location of a single added outlier.
- A bounded $SC(x_0)$ → high robustness!

SC of mean & median
- [Figure: sensitivity curves for n = 200, F = N(0,1); the SC of the sample mean is an unbounded straight line in $x_0$, while the SC of the sample median is bounded.]

Standardized sensitivity curve
- The standardized sensitivity curve is defined by
  $SC_n(x_0) = \dfrac{\hat{\theta}(x_1, \ldots, x_n, x_0) - \hat{\theta}(x_1, \ldots, x_n)}{1/(n+1)}$.
- It answers: what happens if we add one more observation to a very large sample?
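A minimal sketch of the sensitivity curve for the mean and the median (plain NumPy; the sample size and outlier grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)              # base sample, F = N(0, 1)

def sensitivity_curve(estimator, x, x0):
    """SC(x0): the change in the estimate when one outlier at x0 is appended."""
    return estimator(np.append(x, x0)) - estimator(x)

for x0 in np.linspace(-10.0, 10.0, 5):
    sc_mean = sensitivity_curve(np.mean, x, x0)
    sc_med = sensitivity_curve(np.median, x, x0)
    print(f"x0 = {x0:6.1f}   SC(mean) = {sc_mean:8.4f}   SC(median) = {sc_med:8.4f}")
# The mean's SC grows linearly in x0 (unbounded); the median's SC stays bounded.
```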
Influence function
- The influence function of an estimator $\hat{\theta}$ (Hampel, 1974) is an asymptotic version of its sensitivity curve.
- It is an approximation to the behavior of $\hat{\theta}_\infty$ when the sample contains a small fraction $\varepsilon$ of identical outliers. It is defined as
  $IF_{\hat{\theta}}(x_0, F) = \lim_{\varepsilon \downarrow 0} \dfrac{\hat{\theta}_\infty\big((1-\varepsilon)F + \varepsilon \delta_{x_0}\big) - \hat{\theta}_\infty(F)}{\varepsilon}$,
  where $\delta_{x_0}$ is the point mass at $x_0$ and $\varepsilon \downarrow 0$ stands for the limit from the right.

Main uses of the IF
- $\hat{\theta}_\infty\big((1-\varepsilon)F + \varepsilon \delta_{x_0}\big)$ is the asymptotic value of the estimate when the underlying distribution is $F$ and a fraction $\varepsilon$ of outliers equals $x_0$.
- The IF has two main uses:
  - assessing the relative influence of individual observations on the value of the estimate (an unbounded IF → less robustness);
  - allowing a simple heuristic assessment of the asymptotic variance of an estimate.

The IF as a limit version of the SC
- The SC is a finite-sample version of the IF.
- If $\varepsilon$ is small, $\hat{\theta}_\infty\big((1-\varepsilon)F + \varepsilon \delta_{x_0}\big) \approx \hat{\theta}_\infty(F) + \varepsilon\, IF_{\hat{\theta}}(x_0, F)$, so the bias satisfies $\hat{\theta}_\infty\big((1-\varepsilon)F + \varepsilon \delta_{x_0}\big) - \hat{\theta}_\infty(F) \approx \varepsilon\, IF_{\hat{\theta}}(x_0, F)$.
- For large $n$: $SC_n(x_0) \approx IF_{\hat{\theta}}(x_0, F)$, with $\varepsilon = 1/(n+1)$.

Breakdown point
- The BP is the proportion of arbitrarily large observations an estimator can handle before giving an arbitrarily large result.
- The maximum possible BP is 50%; a high BP → more robustness!
- As seen before: mean 0%, median 50%.

Summary of the measures
- The SC measures the effect of different outliers on the estimate; the IF is the asymptotic behavior of the SC.
- The IF and the BP consider extreme situations in the study of contamination: the IF deals with "infinitesimal" values of $\varepsilon$, while the BP deals with the largest $\varepsilon$ an estimate can tolerate.

4. M estimators

Maximum likelihood estimation of μ
- Consider the location model, and assume that $F_0$ has density $f_0$.
- The likelihood function is $L(x_1, x_2, \ldots, x_n; \mu) = \prod_{i=1}^{n} f_0(x_i - \mu)$.
- The maximum likelihood estimate (MLE) of $\mu$ is $\hat{\mu} = \arg\max_{\mu} L(x_1, x_2, \ldots, x_n; \mu)$.

M estimators of location (μ)
- MLE-like estimators: a generalization of ML estimators.
- If the density $f_0$ is everywhere positive, the MLE solves
  (*) $\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^{n} \rho(x_i - \mu)$, with $\rho = -\log f_0$.
- Let $\psi = \rho'$; if it exists, then
  (**) $\sum_{i=1}^{n} \psi(x_i - \hat{\mu}) = 0$.
- An M estimator can almost equivalently be described by $\rho$ or $\psi$:
  - if $\rho$ is everywhere differentiable and $\psi$ is monotonic, the forms (*) and (**) are equivalent [6];
  - if $\psi$ is continuous and increasing, the solution is unique [6].

Special cases
- The sample mean: $\rho(x) = x^2/2$, $\psi(x) = x$, so $\sum_{i} (x_i - \hat{\mu}) = 0 \Rightarrow \hat{\mu} = \frac{1}{n} \sum_{i} x_i$.
- The sample median: $\rho(x) = |x|$, $\psi(x) = \mathrm{sign}(x) = I(x > 0) - I(x < 0)$, so $\sum_{i} \mathrm{sign}(x_i - \hat{\mu}) = 0 \Rightarrow \#\{x_i > \hat{\mu}\} = \#\{x_i < \hat{\mu}\} \Rightarrow \hat{\mu} = \mathrm{Med}(x)$.
- [Figure: $\rho$ and $\psi$ for squared errors (the sample mean) and absolute errors (the sample median).]

Asymptotic behavior of location M estimators
- For a given distribution $F$, assume $\rho$ is differentiable and $\psi$ is increasing, and define $\mu_0 = \mu_0(F)$ as the solution of $E_F\,\psi(x - \mu_0) = 0$.
- For large $n$, $\hat{\mu} \to_p \mu_0$, and the distribution of the estimator is approximately $N(\mu_0, v/n)$ with
  (***) $v = \dfrac{E_F\,\psi(x - \mu_0)^2}{\big(E_F\,\psi'(x - \mu_0)\big)^2}$.
- If $\hat{\mu}$ is uniquely defined, then it is consistent at $F$ [3].

Desirable properties
- M estimators are robust to large proportions of outliers: when $\psi$ is odd, bounded and monotonically increasing, the BP is 0.5.
- The IF is proportional to $\psi$, so the $\psi$ function may be chosen to bound the influence of outliers and achieve high efficiency.
- M estimators are asymptotically normal, and can also be consistent for $\mu$.
- M estimators can be chosen to completely reject outliers (so-called redescending M estimators) while maintaining a large BP and high efficiency.

Disadvantages
- They are in general only implicitly defined and must be found by iterative search (see the sketch below).
- They are in general not scale equivariant.
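As a concrete illustration of that iterative search, here is a minimal sketch (NumPy and SciPy assumed; the function name is ours) that solves the estimating equation (**) with a scalar root-finder. For a nondecreasing ψ, $\sum_i \psi(x_i - \mu)$ is nonincreasing in μ, so a bracketing method applies:

```python
import numpy as np
from scipy.optimize import brentq

def m_estimate_location(x, psi):
    """Solve sum_i psi(x_i - mu) = 0 for mu.
    For a nondecreasing psi the sum is nonincreasing in mu, so it changes
    sign on [min(x), max(x)] and a bracketing root-finder applies."""
    f = lambda mu: np.sum(psi(x - mu))
    return brentq(f, x.min(), x.max())

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, size=100)

# psi(x) = x reproduces the sample mean, as in the "special cases" above.
mu_hat = m_estimate_location(x, psi=lambda t: t)
print(mu_hat, x.mean())   # the two agree up to solver tolerance
```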
Huber functions
- A popular family of M estimators (Huber, 1964).
- Its $\psi$ is an odd, nondecreasing function which minimizes the asymptotic variance among all estimators whose influence function is bounded, $|IF(x)| \le c$, for a given constant $c$.
- Advantages:
  - it combines the behavior of the sample mean for small errors with that of the sample median for gross errors;
  - boundedness of $\psi$.

Huber ρ and ψ functions
- $\rho_k(x) = \begin{cases} x^2 & \text{if } |x| \le k \\ 2k|x| - k^2 & \text{if } |x| > k \end{cases}$
  with derivative $2\psi_k(x)$, where
  $\psi_k(x) = \begin{cases} x & \text{if } |x| \le k \\ k\,\mathrm{sign}(x) & \text{if } |x| > k \end{cases}$
- [Figure: Huber $\rho$ and $\psi$; from Maronna, Martin and Yohai, Robust Statistics: Theory and Methods, John Wiley & Sons, 2006.]

Huber functions – robustness & efficiency tradeoff
- Special cases: $k = 0$ → the sample median; $k \to \infty$ → the sample mean.
- Increasing $k$ moves the estimator from the median toward the mean, decreasing robustness.
- Asymptotic variances at the normal mixture model with $G = N(0, 1)$ and $H = N(0, 10)$: the larger the asymptotic variance, the less efficient the estimator, but the more robust. Efficiency comes at the expense of robustness.
- [Table of asymptotic variances from Maronna, Martin and Yohai (2006).]

Redescending M estimators
- Redescending M estimators have $\psi$ functions which are nondecreasing near the origin but then decrease toward the axis far from the origin.
- They usually satisfy $\psi(x) = 0$ for all $|x| \ge r$, where $r$ is the minimum rejection point.
- Beyond the ability to reject outliers completely, they:
  - do not suffer from the masking effect;
  - have the potential for a high BP;
  - can have $\psi$ functions chosen to redescend smoothly to zero, so information in moderately large outliers is not ignored completely → improved efficiency!
- A popular family of redescending M estimators (Tukey) is called the bisquare or biweight.

Bisquare ρ and ψ functions
- $\rho_k(x) = \begin{cases} 1 - \big(1 - (x/k)^2\big)^3 & \text{if } |x| \le k \\ 1 & \text{if } |x| > k \end{cases}$
  with derivative $\frac{6}{k^2}\,\psi_k(x)$, where
  $\psi_k(x) = x\,\big(1 - (x/k)^2\big)^2\, I(|x| \le k)$
- [Figure: bisquare $\rho$ and $\psi$; from Maronna, Martin and Yohai (2006).]

Bisquare function – efficiency
- ARE(bisquare; MLE) at the normal for selected tuning constants $k$:

  ARE   0.80   0.85   0.90   0.95
  k     3.14   3.44   3.88   4.68

- An ARE close to 1 is achievable.

Choice of ψ and ρ
- In practice, the choice of the $\rho$ and $\psi$ functions is not critical to obtaining a good robust estimate (Huber, 1981).
- Redescending and bounded $\psi$ functions are to be preferred, as are bounded $\rho$ functions.
- The bisquare function is a popular choice; both families are sketched in code below.
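Both ψ families can be written directly from the formulas above (names and defaults ours; the k values are tuning constants quoted on the slides). The Huber ψ is monotone, so the root-finding solver from the earlier sketch applies to it directly; the bisquare ψ is redescending, so the estimating equation may have several roots, and a standard remedy is an iteratively reweighted mean started from the median:

```python
import numpy as np

def huber_psi(t, k=1.37):
    """Huber psi: identity inside [-k, k], clipped to +/-k outside."""
    return np.clip(t, -k, k)

def bisquare_psi(t, k=4.68):
    """Tukey bisquare psi: t * (1 - (t/k)^2)^2 inside [-k, k], zero outside."""
    return np.where(np.abs(t) <= k, t * (1.0 - (t / k) ** 2) ** 2, 0.0)

def irls_location(x, psi, iters=50):
    """Weighted-mean iteration with weights w(t) = psi(t)/t, started at the median."""
    mu = np.median(x)
    for _ in range(iters):
        t = x - mu
        w = np.ones_like(t)          # psi(t)/t -> 1 as t -> 0 for both families
        nz = t != 0
        w[nz] = psi(t[nz]) / t[nz]
        mu = np.sum(w * x) / np.sum(w)
    return mu

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(size=95), rng.normal(0.0, 10.0, size=5)])  # 5% gross errors
print("huber   :", irls_location(x, huber_psi))
print("bisquare:", irls_location(x, bisquare_psi))
print("mean    :", x.mean())
```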
5. Order statistics approaches

The β-trimmed mean
- Let $\beta \in [0, 0.5)$ and $m = [\beta(n+1)]$, where $[\cdot]$ stands for the integer part and $x_{(i)}$ denotes the $i$-th order statistic.
- The β-trimmed mean is defined by
  $\bar{x}_\beta = \dfrac{1}{n - 2m} \sum_{i=m+1}^{n-m} x_{(i)}$.
- $\bar{x}_\beta$ is the sample mean after the $m$ largest and the $m$ smallest observations have been discarded.
- Limit cases: $\beta = 0$ → the sample mean; $\beta \to 0.5$ → the sample median.

Distribution of the trimmed mean
- The exact distribution is intractable.
- For large $n$, the distribution under the location model is approximately normal.
- BP of the β-trimmed mean = β.

Example 1 – trimmed mean

  Estimator           All data   Outlier deleted
  Mean                  4.28        3.21
  Median                3.38        3.37
  10% trimmed mean      3.20        3.11
  25% trimmed mean      3.17        3.17

- The median and the trimmed means are less sensitive to the presence of outliers.

The β-Winsorized mean
- The β-Winsorized mean is defined by
  $\bar{x}_{w,\beta} = \dfrac{1}{n} \Big[ (m+1)\, x_{(m+1)} + \sum_{i=m+2}^{n-m-1} x_{(i)} + (m+1)\, x_{(n-m)} \Big]$
- The $m$ smallest observations are replaced by the $(m+1)$-st smallest observation, and the $m$ largest observations are replaced by the $(m+1)$-st largest observation.

Trimmed and Winsorized means – properties
- They use more information from the sample than the sample median.
- Unless the underlying distribution is symmetric, they are unlikely to produce an unbiased estimate of either the mean or the median.
- Their distribution is not normal.

L estimators
- The trimmed and Winsorized means are special cases of L estimators.
- An L estimator is defined as a linear combination of order statistics:
  $\hat{\theta} = \sum_{i=1}^{n} \alpha_i\, x_{(i)}$, where the $\alpha_i$ are given constants.
- For the β-trimmed mean: $\alpha_i = \frac{1}{n-2m}\, I(m+1 \le i \le n-m)$.

L vs. M estimators
- M estimators are more flexible, can be generalized straightforwardly to multi-parameter problems, and have a high BP.
- L estimators are less efficient because they completely ignore part of the data.

6. Summary & conclusions

SC of location M estimators
- [Figure: sensitivity curves for n = 20, $x_i \sim N(0,1)$; trimmed mean with α = 25%, Huber with k = 1.37, bisquare with k = 4.68.]

The effect of increasing contamination on a sample
- Replace $m$ points by a fixed value $x_0 = 1000$ → a biased SC:
  $SC(m) = \hat{\theta}(x_0, \ldots, x_0, x_{m+1}, \ldots, x_n) - \hat{\theta}(x_1, x_2, \ldots, x_n)$.
- [Figure: n = 20, $x_i \sim N(0,1)$; trimmed mean with α = 8.5%, Huber with k = 1.37, bisquare with k = 4.68.]

IF of location M estimators
- The IF is proportional to $\psi$ (Huber, 1981). In general,
  $IF_{\hat{\mu}}(x_0, F) = \dfrac{\psi(x_0 - \hat{\mu}_\infty)}{E_F\, \psi'(x - \hat{\mu}_\infty)}$.

BP of location M estimators
- When $\psi$ is odd, bounded and monotonically increasing, the BP is 50%.
- In general, assume $k_1 = -\psi(-\infty)$ and $k_2 = \psi(+\infty)$ are finite; then the BP is $\dfrac{\min(k_1, k_2)}{k_1 + k_2}$.
- Special cases: sample mean 0%, sample median 50%.

Comparison between different location estimators

  Estimator         BP    SC/IF/ψ           Redescending ψ   Efficiency in mixture model
  Mean              0%    unbounded         no               low
  Median            50%   bounded           no               low
  Huber             50%   bounded           no               high
  Bisquare          50%   bounded, → 0      yes              high
  x% trimmed mean   x%    bounded           no

Conclusions
- Robust statistics provides an alternative approach to classical statistical methods.
- It seeks methods that emulate classical methods but are not unduly affected by outliers or other small departures from model assumptions.
- In order to quantify the robustness of a method, it is necessary to define measures of robustness.

Efficiency vs. robustness
- Efficiency can be achieved by taking $\psi$ proportional to the derivative of the log-likelihood defined by the density $f$ of $F$: $\psi(x) = -c\,(f'/f)(x)$, where $c \ne 0$ is a constant.
- Robustness is achieved by choosing a $\psi$ that is smooth and bounded, to reduce the influence of a small proportion of observations.
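As a closing worked sketch, the trimmed and Winsorized means of the flour data can be computed straight from the definitions in section 5 (plain NumPy; the trimming rule m = [β(n+1)] follows the slide, and other conventions for m would shift the values slightly):

```python
import numpy as np

flour = np.array([2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90,
                  3.03, 3.03, 3.10, 3.37, 3.40, 3.40, 3.40, 3.50,
                  3.60, 3.70, 3.70, 3.70, 3.70, 3.77, 5.28, 28.95])

def trimmed_mean(x, beta):
    """Mean after discarding the m smallest and m largest order statistics."""
    x = np.sort(x)
    m = int(beta * (len(x) + 1))        # m = [beta * (n + 1)], as on the slide
    return x[m:len(x) - m].mean()

def winsorized_mean(x, beta):
    """As above, but trimmed points are replaced by the nearest kept value."""
    x = np.sort(x)                      # np.sort returns a copy, safe to modify
    m = int(beta * (len(x) + 1))
    x[:m] = x[m]                        # m smallest -> the (m+1)-st smallest
    x[len(x) - m:] = x[len(x) - m - 1]  # m largest  -> the (m+1)-st largest
    return x.mean()

print(f"mean = {flour.mean():.2f}, median = {np.median(flour):.3f}")
for beta in (0.10, 0.25):
    print(f"beta = {beta:.2f}:  trimmed = {trimmed_mean(flour, beta):.3f},  "
          f"winsorized = {winsorized_mean(flour, beta):.3f}")
```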
References
1. Robert G. Staudte and Simon J. Sheather, Robust Estimation and Testing, Wiley, 1990.
2. Elvezio Ronchetti, "The Historical Development of Robust Statistics", ICOTS-7, University of Geneva, Switzerland, 2006.
3. P. Huber, Robust Statistics, Wiley, New York, 1981.
4. F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions, Wiley, New York, 1986.
5. Tukey.
6. Ricardo A. Maronna, R. Douglas Martin and Víctor J. Yohai, Robust Statistics: Theory and Methods, John Wiley & Sons, 2006.
7. B. D. Ripley, Robust Statistics, M.Sc. in Applied Statistics MT2004, 1992–2004.
8. "Robust statistics", Wikipedia.