Statistical Methods for Interval Censored Complex Survey Data
C. M. Suchindran
Department of Biostatistics, Carolina Population Center

Motivating Example
Outcome of interest: time (age at onset) to the first transition to obesity.
Data: Add Health Study, Waves I-IV (complex survey design).
Recorded data on age at first transition to obesity: obesity status at each wave.

Motivating Example: Add Health Data Format
Wave (age range): I (12-20), II (13-21), III (18-26), IV (24-32).
Case 1: age 19 at Wave I; obese at Wave I.
Case 2: age 16 at Wave I; not obese at Wave I, obese at Wave II (age 17).
Case 3: age 15 at Wave II; not obese at Wave II, obese at Wave III (age 20).
Case 4: age 18 at Wave III; not obese at Wave III, obese at Wave IV (age 24).
Case 5: age 28 at Wave IV; not obese at Wave IV.

The resulting interval-censored data:

 Case  Ltime  Rtime
   1     .     19
   2    16     17
   3    15     20
   4    18     24
   5    28      .

The analysis needs to take the sampling weights (design) into account, and subject information falls in overlapping intervals.

Special Cases
1. Status determination at one time point only (current status data): obesity status ascertained at Wave IV only, so the time to obesity is only known to be smaller or larger than the age at Wave IV.
2. Observations in non-overlapping intervals: all subjects observed at fixed intervals (baseline, one year, two years, etc.).
In this talk we examine interval-censored data from an arbitrary number of observation times with overlapping intervals.

Available Software
SAS: macros %EMICM and %ICSTEST; PROC LIFETEST, LIFEREG, MCMC, NLMIXED.
Stata: stset, streg, intcens, stpm (parametric models), gllamm; ml for svy (tools for programmers of new survey commands).
R: packages Icens, intcox, smoothSurv, bayesSurv (the procedures need to be manipulated to include the survey design).
Mplus.

Conventional Survival Analysis
Uses right-censored data (the exact event time or censoring time is known).
1. Descriptive analysis: Kaplan-Meier estimate of survival (survival estimated at the observed event times; nonparametric estimation).
2. Hazards modeling:
   log λ_i(t) = log λ_0(t) + β*_0 + β*_1 x_i1 + ... + β*_p x_ip
3. Accelerated failure time (AFT) models:
   log T_i = β_0 + β_1 x_i1 + ... + β_p x_ip + σ ε_i
When the distribution is Weibull, log λ_0(t) = α log t, with β*_j = -β_j/σ and α = 1/σ - 1. Choice of the log-logistic distribution results in a proportional odds model.

Descriptive Analysis: Estimation of the Survival Function
Reference: Ying So, Gordon Johnston and Se Hee Kim (2010), "Analyzing Interval-Censored Survival Data," SAS Global Forum, Paper 257-2010.
Outcome variable of interest: time to event, denoted T.
Descriptive measures:
1. Survival function S(t) = P(T > t), the probability that the time to event exceeds t.
2. Cumulative distribution function F(t) = P(T ≤ t) = 1 - S(t), the cumulative probability of failure by time t.
3. Probability of the event occurring in the interval (L, R): P(L < T < R) = S(L) - S(R) = F(R) - F(L).

Nonparametric Estimation of the Survival Function
Notation:
T: time to event.
Survival function S(t) = P(T > t), the probability that the event time exceeds t.
Cumulative distribution function F(t) = 1 - S(t), the probability that the event occurs before t.
Probability of the event occurring in an interval (t1, t2]: S(t1) - S(t2).
(L_i, R_i]: the interval in which the event is known to occur, with left endpoint L_i (may be zero or missing) and right endpoint R_i (may be missing, i.e., ∞). Usually (L_i, R_i] is what is observed.
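As a numerical illustration of item 3 above, the interval probability S(L) - S(R) is the quantity each interval-censored observation contributes to the likelihoods used later in the talk. A minimal DATA step sketch; the Weibull form of S(t) and the values of lambda, sigma, ltime and rtime are purely illustrative, not estimates from these data.

* Sketch: interval probability P(L < T <= R) = S(L) - S(R) under an assumed Weibull model;
data interval_prob;
   lambda = 0.05;        * hypothetical Weibull rate parameter;
   sigma  = 0.75;        * hypothetical AFT scale parameter;
   ltime  = 7;           * left endpoint of one interval-censored observation;
   rtime  = 16;          * right endpoint;
   s_l = exp(-(lambda*ltime)**(1/sigma));   * S(L) = P(T > L);
   s_r = exp(-(lambda*rtime)**(1/sigma));   * S(R) = P(T > R);
   p_int = s_l - s_r;                       * P(L < T <= R), the likelihood contribution;
run;
proc print data=interval_prob; run;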
Example Data 1: Estimation of Turnbull Intervals

 id  ltime  rtime  scw
  1     .      7     5
  2     .      8    10
  3     6     10    15
  4     7     16    15
  5     7     14    25
  6    17      .    33
  7    37     44    33
  8    45      .    34
  9    46      .     8
 10    46      .    12

The Turnbull intervals I = {(q_1, p_1], (q_2, p_2], ..., (q_m, p_m]} are formed from the set of all left and right endpoints in such a way that q_j is a left endpoint, p_j is a right endpoint, and there is no other left or right endpoint between q_j and p_j.

Possible event times (L = left endpoint, R = right endpoint):
 0(L)  6(L)  7(R, L)  8(R)  10(R)  14(R)  16(R)  17(L)  37(L)  44(R)  45(L)  46(L)  ∞(R)

Possible probability estimation intervals: (6, 7], (7, 8], (37, 44], (46, ∞).

Contribution of Each Individual to the Turnbull Intervals

 ID  Observed interval  (6,7]  (7,8]  (37,44]  (46,∞)
  1   (0, 7)              1      0       0        0
  2   (0, 8)              1      1       0        0
  3   (6, 10)             1      1       0        0
  4   (7, 16)             0      1       0        0
  5   (7, 14)             0      1       0        0
  6   (17, .)             0      0       1        1
  7   (37, 44)            0      0       1        0
  8   (45, .)             0      0       0        1
  9   (46, .)             0      0       0        1
 10   (46, .)             0      0       0        1

EM Algorithm (Iterative Procedure) for Nonparametric Estimation
Step 1: Start with an initial probability distribution P over the Turnbull intervals (uniform): [.25 .25 .25 .25].
Step 2 (E step): For each individual i, compute
   P̂_i(t_j) = P(t_j) I{t_j ∈ (L_i, R_i)} / Σ_{t_k ∈ (L_i, R_i)} P̂(t_k)
For ID 2: P̂(t_j) = [.25/(.25+.25)  .25/(.25+.25)  0  0]
Step 3 (M step): Update P̂ for each Turnbull interval:
   P̂(t_j) = (1/n) Σ_{i=1}^{n} P̂_i(t_j)
Iterate until convergence.

PROC IML implementation (unweighted):

proc iml;
Q = {1 1 1 1};
N = 10;
* a[i,j] = 1 if Turnbull interval j lies inside subject i's observed interval;
a = {1 0 0 0,
     1 1 0 0,
     1 1 0 0,
     0 1 0 0,
     0 1 0 0,
     0 0 1 1,
     0 0 1 0,
     0 0 0 1,
     0 0 0 1,
     0 0 0 1};
p = {0.2 0.3 0.15 0.35};
x = J(4,1,1);
EPS = .0001;
y = J(1,10,1);
do until (abs(Q-p) < EPS);
   Q = p;
   c = a#p;              * E step: spread each subject's mass over its intervals;
   rowsum = c*x;
   astar = c/rowsum;
   p = y*astar/10;       * M step: average over the 10 subjects;
end;
print p;

Results:
 Lower  Upper  Probability  Survival
   6      7      0.1667      0.8333
   7      8      0.3333      0.5000
  37     44      0.1250      0.3750
  46      +      0.3750      0.0000

SAS EMICM Macro

%inc "H:/cpcnew2/xmacro.txt";
%inc "H:/cpcnew2/emicm.txt";
%EMICM(data=example, left=ltime, right=rtime, group=group,
       options=plot, title="NPMLE", title2="example",
       timelabel="time to event");

                              Cumulative    Survival    Standard Error
 Lower  Upper  Probability   Probability  Probability    of Survival
   6      7      0.1667        0.1667       0.8333          0.1362
   7      8      0.3333        0.5000       0.5000          0.1581
  37     44      0.1250        0.6250       0.3750          0.1575
  46      .      0.3750        1.0000       0.0000          0.0000

LIFEREG Procedure in SAS

ods graphics on;
proc lifereg;
   model (ltime rtime) = / d=weibull;
   probplot pupper=10 printprobs maxitem=(1000, 25) ppout;
   inset;
run;

The LIFEREG Procedure: Cumulative Probability Estimates

 Lower     Upper     Cumulative   Pointwise 95% Confidence Limits   Standard
 Lifetime  Lifetime  Probability       Lower        Upper             Error
    .         6        0.0000          0.0000       0.0000            0.0000
    7         7        0.1667          0.0249       0.6106            0.1459
    8        37        0.5000          0.2245       0.7755            0.1581
   44        46        0.6250          0.3032       0.8645            0.1606

Introducing Sampling Weights in the IML Macro

proc iml;
step1 = J(10,4,1);
weight = {5, 10, 15, 15, 25, 33, 33, 34, 8, 12};
pweight = step1#weight;       * each subject's weight repeated across the 4 intervals;
N = sum(weight);
Q = {1 1 1 1};
anw = {1 0 0 0,
       1 1 0 0,
       1 1 0 0,
       0 1 0 0,
       0 1 0 0,
       0 0 1 1,
       0 0 1 0,
       0 0 0 1,
       0 0 0 1,
       0 0 0 1};
a = anw#pweight;
p = {0.25 0.25 0.25 0.25};
x = J(4,1,1);
EPS = .0001;
y = J(1,10,1);
do until (abs(Q-p) < EPS);
   Q = p;
   c = a#p;
   rowsum = c*x;
   astar = c/rowsum;
   p = weight`*astar/N;       * M step: weighted average over subjects;
end;
print p;

 Interval   Probability
  6-7         0.0409
  7-8         0.3274
  37-44       0.2396
  46+         0.3920

Sampling Weights and the SAS EMICM Macro: Replicate Observations
Each record of Example Data 1 is replicated "weight" times so that the unweighted macro reproduces the weighted analysis. Partial listing of the expanded data (the 5 copies of id 1 and the 10 copies of id 2):

 id  ltime  rtime  group  weight
  1    0      7      1      5
  1    0      7      1      5
  1    0      7      1      5
  1    0      7      1      5
  1    0      7      1      5
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10
  2    0      8      1     10

                              Cumulative    Survival    Standard Error
 Lower  Upper  Probability   Probability  Probability    of Survival
   6      7      0.0409        0.0409       0.9591          0.0163
   7      8      0.3275        0.3684       0.6316          0.0350
  37     44      0.2396        0.6080       0.3920          0.0396
  46      .      0.3920        1.0000       0.0000          0.0000
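The replicate data set shown above can be built from Example Data 1 with a short DATA step. A minimal sketch, assuming the weight variable scw takes integer values as in this example; the output data set name example_rep is illustrative:

* Sketch: expand each record into scw identical copies so that an unweighted;
* analysis of the expanded data reproduces the weighted NPMLE;
data example_rep;
   set example;
   do rep = 1 to scw;
      output;
   end;
   drop rep;
run;

For non-integer weights this device does not apply directly, which is one reason to prefer the weighted IML iteration or a WEIGHT statement.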
Estimation of the Survival Function with PROC LIFEREG Incorporating Sampling Weights

ods graphics on;
proc lifereg data=example;
   model (ltime rtime) = / d=weibull;
   weight scw;
   probplot pupper=10 printprobs maxitem=(1000, 25) ppout;
   inset;
run;
quit;
ods graphics off;

The LIFEREG Procedure: Cumulative Probability Estimates

 Lower     Upper     Cumulative   Pointwise 95% Confidence Limits   Standard
 Lifetime  Lifetime  Probability       Lower        Upper             Error
    .         6        0.0000          0.0000       0.0000            0.0000
    7         7        0.0409          0.0173       0.0936            0.0177
    8        37        0.3684          0.3028       0.4392            0.0350
   44        46        0.6080          0.5287       0.6819            0.0394

Data Set 2 (partial listing)

 id  ltime  rtime  group
  1    45     .      1
  2    25    37      1
  3    37     0      1
  4     6    10      1
  5    46     .      1
  6     0     5      1
  7     0     7      1
  8    26    40      1
  9    18     .      1
 10    46     .      1
 47     8    12      0
 48     0     5      0
 49    30    34      0
 50     0    22      0
 51     5     8      0
 52    13     .      0

Unweighted Analysis of Data Set 2 with the SAS EMICM Macro

                                          Cumulative    Survival    Std Error
 Obs  group  Lower  Upper  Probability   Probability  Probability   Survival
   1      0      4      5     0.0433        0.0433       0.9567      0.0332
   2      0      5      8     0.0433        0.0866       0.9134      0.0415
   3      0      8      9     0.0000        0.0866       0.9134      0.0415
   4      0     11     12     0.0692        0.1558       0.8442      0.0564
   5      0     12     13     0.0000        0.1558       0.8442      0.0564
   6      0     16     17     0.1454        0.3012       0.6988      0.0776
   7      0     18     19     0.1411        0.4423       0.5577      0.0825
   8      0     19     20     0.1157        0.5580       0.4420      0.0763
   9      0     21     22     0.0000        0.5580       0.4420      0.0763
  10      0     22     23     0.0000        0.5580       0.4420      0.0763
  11      0     23     24     0.0000        0.5580       0.4420      0.0763
  12      0     24     25     0.0999        0.6579       0.3421      0.0756
  13      0     30     31     0.0709        0.7288       0.2712      0.0670
  14      0     31     32     0.0000        0.7288       0.2712      0.0670
  15      0     33     34     0.0000        0.7288       0.2712      0.0670
  16      0     34     35     0.0000        0.7288       0.2712      0.0670
  17      0     35     36     0.1608        0.8896       0.1104      0.0483
  18      0     44     48     0.0552        0.9448       0.0552      0.0358
  19      0     48     60     0.0552        1.0000       0.0000      0.0000
  20      1      4      5     0.0526        0.0526       0.9474      0.0368
  21      1      6      7     0.0343        0.0870       0.9130      0.0483
  22      1      7      8     0.1066        0.1935       0.8065      0.0608
  23      1     11     12     0.0673        0.2609       0.7391      0.0647
  24      1     15     16     0.0000        0.2609       0.7391      0.0647
  25      1     17     18     0.0000        0.2609       0.7391      0.0647
  26      1     24     25     0.0900        0.3509       0.6491      0.0715
  27      1     25     26     0.0000        0.3509       0.6491      0.0715
  28      1     33     34     0.0794        0.4303       0.5697      0.0742
  29      1     34     35     0.0000        0.4303       0.5697      0.0742

Generalized Log-Rank Test I (Zhao & Sun, 2004)
Test statistic and covariance matrix:
 group      U                cov(U)
   0      8.5121      12.0650   -12.0650
   1     -8.5121     -12.0650    12.0650
Chi-square = 6.0055, DF = 1, Pr > chi-square = 0.0143

Generalized Log-Rank Test II (Sun, Zhao & Zhao, 2005), with ξ(x) = x log(x)
Test statistic and covariance matrix:
 group      U                cov(U)
   0      9.2978      13.8302   -13.8302
   1     -9.2978     -13.8302    13.8302
Chi-square = 6.2507, DF = 1, Pr > chi-square = 0.0124
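The reported chi-square statistics can be reproduced from the score vector U and its covariance matrix as U' V⁻ U, using a generalized inverse because cov(U) is singular. A minimal PROC IML check using the Test I numbers above; the degrees-of-freedom and p-value lines are added here for completeness:

proc iml;
U = {8.5121, -8.5121};                 /* group scores from Test I */
V = {12.0650 -12.0650,
    -12.0650  12.0650};                /* cov(U), a singular matrix */
chisq = U` * ginv(V) * U;              /* generalized log-rank chi-square */
df = round(trace(ginv(V)*V));          /* rank of cov(U) = 1 */
pvalue = 1 - probchi(chisq, df);
print chisq df pvalue;                 /* 6.0055, 1, 0.0143 */
quit;

Running the same lines with the Test II values (U = ±9.2978, covariance entries ±13.8302) gives 6.2507 and p = 0.0124.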
Regression Analysis
Approach 1 (Marginal models): Treat the cluster correlation as a nuisance parameter and maximize a weighted likelihood. Obtain variance estimates using any one of the following methods: robust variance (Huber method), jackknife, or bootstrap.
Approach 2 (Random effects models): Allow explicit estimation of the within- and between-cluster variance using random effects models. Variance estimation uses robust "sandwich" estimators, the bootstrap, or jackknife procedures. (For multilevel models sandwich estimators are difficult to compute; the jackknife may be unreliable; bootstrap ...)
Approach 3 (Bayesian models).

Partial List of Example Data (94 observations, 8 clusters)

 id  group  village  ctype    w1   w2       w   scw   weight2
  1      1        1      2  2268   88  199584  0.86     0.54
  2      1        1      4   684   88   60192  0.26     0.16
  3      1        1      2  5809   88  511192  2.21     1.38
  4      1        1      4   436   88   38368  0.17     0.10
  5      1        1      2  3949   88  347512  1.50     0.94
  6      1        2      3  1666   76  126616  0.67     0.55
  7      1        2      3  1775   76  134900  0.72     0.58
  8      1        2      4  1962   76  149112  0.79     0.64
  9      1        2      2  4519   76  343444  1.82     1.48
 10      1        3      2  2196   64  140544  0.64     0.53
 11      1        3      2  1635   64  104640  0.48     0.40
 12      1        3      2  4895   64  313280  1.43     1.18
 13      1        3      2  5072   64  324608  1.48     1.23
 14      1        3      4  5303   64  339392  1.55     1.28
 15      1        3      2  1440   64   92160  0.42     0.35
 16      1        3      4  3388   64  216832  0.99     0.82

ltime values in the listing: 45 25 37 6 46 26 18 46 46 24 46 27 36 7
rtime values in the listing: 37 10 5 7 40 34 16
scw = scaled level 1 weight, Method I; weight2 = scaled level 1 weight, Method II.

Weibull Regression (Marginal Model)
Accelerated failure time model (PROC LIFEREG):
   log(T_ij) = β_0 + β_1 (Group)_i + σ ε
where ε has an extreme value distribution.
Hazards model:
   log λ(t) = α log t + β*_0 + β*_1 (Group)
Note that β*_j = -β_j/σ and α = 1/σ - 1.

Jackknife (with no stratification)
A single replicate is created by removing from the sample all units associated with a given PSU and inflating the remaining original weights by a factor that keeps the sum of the weights the same. The variability of the estimates derived after removing one PSU at a time gives an estimate of the variance (typically the average squared deviation from the parameter estimate based on all the data).

PROC LIFEREG

proc lifereg data=onelevel;
   model (ltime rtime) = group / d=weibull;
   weight w;
run;

The LIFEREG Procedure: Model Information
 Data Set                    WORK.ONELEVEL
 Dependent Variables         Log(ltime), Log(rtime)
 Weight Variable             w
 Number of Observations      94
 Noncensored Values          0
 Right Censored Values       37
 Left Censored Values        6
 Interval Censored Values    51
 Distribution                Weibull
 Log Likelihood              -135.0600573

Analysis of Maximum Likelihood Parameter Estimates
                                Standard     95% Confidence
 Parameter      DF  Estimate      Error          Limits        Chi-Square  Pr > ChiSq
 Intercept       1   3.4272      0.2063     3.0228   3.8316      275.96      <.0001
 group           1   0.5785      0.2453     0.0976   1.0594        5.56      0.0184
 Scale           1   0.7306      0.1028     0.5544   0.9627
 Weibull Shape   1   1.3688      0.1927     1.0387   1.8036

Jackknife standard error of the group parameter estimate = 0.1914.
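The jackknife standard error quoted above comes from delete-one-PSU replicates of the weighted fit. A minimal macro sketch under the setup of these slides (8 clusters identified by village, data set onelevel); the macro name, the weight-rescaling step, and the OUTEST= bookkeeping are illustrative, not the author's code:

%macro jkreps;
   proc sql noprint;
      select sum(w) into :wtot from onelevel;     /* total weight in the full sample */
   quit;
   %do v = 1 %to 8;
      proc sql noprint;
         select sum(w) into :wrem from onelevel
         where village ne &v;                      /* weight remaining after dropping PSU &v */
      quit;
      data jk;
         set onelevel(where=(village ne &v));
         w_jk = w * (&wtot / &wrem);               /* inflate so the weight total is unchanged */
      run;
      proc lifereg data=jk outest=est&v(keep=group);
         model (ltime rtime) = group / d=weibull;
         weight w_jk;
      run;
   %end;
   data jkest;
      set est1-est8;                               /* eight replicate estimates of the group effect */
   run;
%mend jkreps;
%jkreps;

The jackknife variance is then obtained from the squared deviations of these replicate estimates around the full-sample estimate (0.5785), as described above.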
Random Effects Models
Reference: L. Grilli and M. Pratesi (2004), "Weighted Estimation in Multilevel Ordinal and Binary Models in the Presence of Informative Sampling Designs," Survey Methodology, 30(1): 93-103.
Maximize the weighted log-likelihood of the model using the SAS procedure NLMIXED. This provides unbiased parameter estimates, but re-sampling methods such as the bootstrap or jackknife are needed to get proper variance estimates.
The procedure requires sampling weights at each level. (For Add Health data, see Chantala, Blanchette and Suchindran (2006), "Software to Compute Sampling Weights for Multilevel Analysis," CPC website.)

Multilevel Model: Choice of Weights (in a two-level setup)
Level 2 (cluster-level) weights adjust for differential probabilities of cluster selection; denote them ω_j.
Level 1 weights: the weight for individual i in cluster j, denoted ω_{i|j}.
The final weight is ω_j · ω_{i|j}.

Scaling of Weights: Method I (Pfeffermann et al., 1998)
For multilevel modeling, scaling of the level 1 weights is recommended:
   ω*_{i|j} = ω_{i|j} · n_j / Σ_{i=1}^{n_j} ω_{i|j}
Note that the scaled weights in cluster j sum to the number of subjects in cluster j (n_j). Method I is recommended when the outcome variable depends on the selection of subjects at level 1.

Scaling of Weights: Method II (Pfeffermann et al., 1998)
   ω*_{i|j} = ω_{i|j} · Σ_{i=1}^{n_j} ω_{i|j} / Σ_{i=1}^{n_j} ω²_{i|j}
Method II is recommended when the outcome depends on sampling at both levels. (For Add Health data: Kim Chantala, Dan Blanchette and C. M. Suchindran, "Software to Compute Sampling Weights in Multilevel Analysis," CPC website.)

Scaling of Weights: Method III (Korn and Graubard, 2003)
Level 1 weights are set to 1, and the level 2 weights become
   ω*_j = Σ_{i=1}^{n_j} ω_j · ω_{i|j}
Use Method III when the level 1 weights and the outcome are not correlated.
References:
Pfeffermann, D., et al. (1998), "Weighting for Unequal Selection Probabilities in Multilevel Models," JRSS-B, 60: 23-40.
Korn, E. L. and Graubard, B. I. (2003), "Estimating Variance Components by Using Survey Data," JRSS-B, 65: 175-190.
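The scaled level 1 weights in the example data (scw for Method I, weight2 for Method II) can be computed from the raw level 1 weights within each cluster. A minimal PROC SQL sketch, assuming the raw level 1 weight w1 and the cluster identifier village are available in the data set combweight; the output data set and the names scw_1 and scw_2 are illustrative:

proc sql;
   create table combweight2 as
   select *,
          w1 * (count(*) / sum(w1))    as scw_1,   /* Method I: scaled weights sum to the cluster size n_j */
          w1 * (sum(w1) / sum(w1*w1))  as scw_2    /* Method II: scaling by the sum of squared weights */
   from combweight
   group by village;
quit;

Applied to the partial listing shown earlier (for example, village 1 with w1 = 2268, 684, 5809, 436, 3949), these formulas reproduce the scw and weight2 columns.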
Weibull Regression (Random Effects) Model
Accelerated failure time model (NLMIXED):
   log(T_ij) = β_0 + β_1 (Group)_i + σ ε_i
where ε is distributed as an extreme value distribution.
Random effects model (NLMIXED): add a random effect b to the linear predictor, with b assumed to be distributed as Normal(0, θ).

PROC NLMIXED with weights at both levels

proc nlmixed data=combweight qpoints=10;
   parms beta0=0 beta1=0 p=.92 theta=.01;
   bounds p > 0, theta > 0;
   ebetaxb = exp(-(beta0 + beta1*group + b));
   lamda = exp(-beta0);
   s_1 = exp(-(ltime*ebetaxb)**(1/p));          /* S(L), survival at the left endpoint */
   s_u = exp(-(rtime*ebetaxb)**(1/p));          /* S(R), survival at the right endpoint */
   f_t = (lamda*p)*(lamda*ltime)**(p-1)*ebetaxb**1/p;   /* density; exact event times do not occur in these data */
   if (ltime ^= . and rtime ^= . and ltime = rtime) then lik = f_t;   /* exact event time */
   else if (ltime ^= . and rtime = .) then lik = s_1;                 /* right censored */
   else if (ltime = . and rtime ^= .) then lik = 1 - s_u;             /* left censored */
   else lik = s_1 - s_u;                                              /* interval censored */
   llik = scw*log(lik);                         /* level 1 log-likelihood weighted by the scaled weight */
   model ctype ~ general(llik);
   random b ~ normal(0, theta) subject=village;
   replicate w2;                                /* level 2 (cluster) weights */
run;

Fit Statistics
 -2 Log Likelihood            9245.6
 AIC (smaller is better)      9253.6
 AICC (smaller is better)     9254.0
 BIC (smaller is better)      9269.1

Parameter Estimates
                         Standard
 Parameter   Estimate      Error     DF   t Value   Pr > |t|
 beta0        3.4697      0.04048   357    85.72     <.0001
 beta1        0.5625      0.05186   357    10.85     <.0001
 p            0.6159      0.01394   357    44.18     <.0001
 theta        0.09658     0.01932   357     5.00     <.0001

Use of the CFACTOR Option to Make the Level 2 Weights Integers

proc nlmixed data=combweight qpoints=10 cfactor=100;
   parms beta0=0 beta1=0 p=.92 theta=.01;
   bounds p > 0, theta > 0;
   ebetaxb = exp(-(beta0 + beta1*group + b));
   lamda = exp(-beta0);
   s_1 = exp(-(ltime*ebetaxb)**(1/p));
   s_u = exp(-(rtime*ebetaxb)**(1/p));
   f_t = (lamda*p)*(lamda*ltime)**(p-1)*ebetaxb**1/p;
   if (ltime ^= . and rtime ^= . and ltime = rtime) then lik = f_t;
   else if (ltime ^= . and rtime = .) then lik = s_1;
   else if (ltime = . and rtime ^= .) then lik = 1 - s_u;
   else lik = s_1 - s_u;
   llik = scw*log(lik);
   model ctype ~ general(llik);
   random b ~ normal(0, theta) subject=village;
   replicate w2;
run;

Original Weights versus the CFACTOR Option
With the original weights (no multiplier to make the weights integers):
 Parameter   Estimate   Standard Error
 beta0        3.4697       0.04048
 beta1        0.5625       0.05186
 p            0.6159       0.01394
 theta        0.09659      0.01932

With the CFACTOR option (original weights × 100):
 Parameter   Estimate   Standard Error
 beta0        3.4697       0.04048
 beta1        0.5625       0.05186
 p            0.6159       0.01394
 theta        0.09659      0.01932

Jackknife variance = 0.1826

Alternate Level 1 Weights (Method II)
Fit Statistics
 -2 Log Likelihood            7898.7
 AIC (smaller is better)      7906.7
 AICC (smaller is better)     7907.1
 BIC (smaller is better)      7922.2

Parameter Estimates
                         Standard
 Parameter   Estimate      Error     DF   t Value   Pr > |t|
 beta0        3.4851      0.03044   357   114.51     <.0001
 beta1        0.4589      0.04293   357    10.69     <.0001
 p            0.6109      0.01481   357    41.26     <.0001
 theta        0.03480     0.01230   357     2.83     0.0049

Markov Chain Monte Carlo (MCMC)
A simulation procedure designed to fit Bayesian models; inference is based on the posterior distributions of the parameters. The MCMC procedure generates samples from the desired posterior distributions and makes inference from them. Vague priors are used for the parameters, with a gamma prior (positive support) for the variance.

SAS PROC MCMC

ods graphics on;
proc mcmc outpost=postout seed=1234 nbi=6000 nmc=60000 ntu=3000
          missing=AC DIC statistics=(summary interval);
   ods select PostSummaries PostIntervals DIC;
   array var1[8];
   parms p=.92 beta0=3.3582 beta1=.5213 var1: 0;
   parms s2g .1239;
   prior beta: ~ normal(0, sd=10000);
   prior var1: ~ normal(0, var=s2g);
   prior p ~ gamma(shape=0.001, is=.001);
   prior s2g: ~ general(-log(s2g));
   ebetaxb = exp(-(beta0 + beta1*group + var1[village]));
   lamda = exp(-beta0);
   s_1 = exp(-(ltime*ebetaxb)**(1/p));
   s_u = exp(-(rtime*ebetaxb)**(1/p));
   f_t = (lamda*p)*(lamda*ltime)**(p-1)*ebetaxb**1/p;
   if (ltime ^= . and rtime ^= . and ltime = rtime) then lik = f_t;
   else if (ltime ^= . and rtime = .) then lik = s_1;
   else if (ltime = . and rtime ^= .) then lik = 1 - s_u;
   else lik = s_1 - s_u;
   llik = scw*log(lik);
   model ctype ~ general(llik);
run;

The MCMC Procedure: Posterior Summaries
                              Standard              Percentiles
 Parameter     N      Mean    Deviation      25%        50%       75%
 p           60000   0.6096    0.0793      0.5555     0.6041    0.6542
 beta0       60000   3.3764    0.1280      3.2958     3.3782    3.4646
 beta1       60000   0.5476    0.2235      0.3868     0.5303    0.6937
 s2g         60000   0.0206    0.0191      0.00143    0.0123    0.0377

Posterior Intervals
 Parameter   Alpha    Equal-Tail Interval         HPD Interval
 p           0.050    0.4745     0.7893        0.4730     0.7859
 beta0       0.050    3.1165     3.6251        3.1118     3.6191
 beta1       0.050    0.1769     1.0318        0.1737     1.0022
 s2g         0.050    6.374E-6   0.0567        2.009E-6   0.0530

Summary of Results (testing the group variable)

 Method                                            Estimate   Std. Error                  P value / interval
 Log-rank test                                                                            0.0124
 NLMIXED (unweighted)                               0.5585     0.2088                      0.0318
 LIFEREG (weighted level 1, no random effect)       0.5374     0.1783                      0.0026
 NLMIXED (both level weights; level 1 Method I)     0.5625     0.0518 (jackknife 0.1826)   <.0001
 NLMIXED (level 1 scaling Method II)                0.4589     0.0429 (jackknife ?)        <.0001
 PROC MCMC                                          0.5476     0.1280                      credible interval (0.1769, 1.0318)

Questions?