Can someone explain my Cox model to me in plain English?
I fitted the following Cox regression model to all of my data using the
cph
function. My data are saved in an object calledData
. The variablesw
,x
, andy
are continuous;z
is a factor of two levels. Time is measured in months. Some of my patients are missing data for variablez
(NB: I have duly noted Dr. Harrell’s suggestion, below, that I impute these values so as to avoid biasing my model, and will do so in the future).> fit <- cph(formula = Surv(time, event) ~ w + x + y + z, data = Data, x = T, y = T, surv = T, time.inc = 12) Cox Proportional Hazards Model Frequencies of Missing Values Due to Each Variable Surv(time, event) w x y z 0 0 0 0 14 Model Tests Discrimination Indexes Obs 152 LR chi2 8.33 R2 0.054 Events 64 d.f. 4 g 0.437 Center 0.7261 Pr(> chi2) 0.0803 gr 1.548 Score chi2 8.07 Pr(> chi2) 0.0891 Coef S.E. Wald Z Pr(>|Z|) w -0.0133 0.0503 -0.26 0.7914 x -0.0388 0.0351 -1.11 0.2679 y -0.0363 0.0491 -0.74 0.4600 z=1 0.3208 0.2540 1.26 0.2067
I also tried to test the assumption of proportional hazards by using the
cox.zph
command, below, but do not know how to interpret its results. Puttingplot()
around the command gives an error message.cox.zph(fit, transform="km", global=TRUE) rho chisq p w -0.1125 1.312 0.2520 x 0.0402 0.179 0.6725 y 0.2349 4.527 0.0334 z=1 0.0906 0.512 0.4742 GLOBAL NA 5.558 0.2347
First Problem
- Can someone explain the results of the above output to me in plain English? I have a medical background and no formal training in statistics.
Second Problem
As suggested by Dr. Harrell, I would like to internally validate my model by performing 100 iterations of 10-fold cross-validation using the
rms
package (from what I understand, this would entail building100 * 10 = 1000
different models and then asking them to predict the survival times of patients that they had never seen).I tried using the
validate
function, as shown.> v1 <- validate(fit, method="crossvalidation", B = 10, dxy=T) > v1 index.orig training test optimism index.corrected n Dxy -0.2542 -0.2578 -0.1356 -0.1223 -0.1320 10 R2 0.0543 0.0565 0.1372 -0.0806 0.1350 10 Slope 1.0000 1.0000 0.9107 0.0893 0.9107 10 D 0.0122 0.0128 0.0404 -0.0276 0.0397 10 U -0.0033 -0.0038 0.0873 -0.0911 0.0878 10 Q 0.0155 0.0166 -0.0470 0.0636 -0.0481 10 g 0.4369 0.4424 0.6754 -0.2331 0.6700 10
How do you perform the 100x resampling? I think my above code only performs the cross-validation once.
I then wanted to know how good my model was at prediction. I tried the following:
> c_index <- abs(v1[1,5])/2 + 0.5 > c_index [1] 0.565984
Does this mean that my model is only very slightly better than flipping a coin?
Third Problem
Dr. Harrell points out that I have assumed linearity for the covariate effects, and that the number of events in my sample is just barely large enough to fit a reliable model if all covariate effects happen to be linear.
- Does this mean that I should include some sort of interaction term in my model? If so, any advice as to what to put?
Answer
To get started, consider a few things. First, you are excluding too many observations with missing data and this will cause a bias. Consider multiple imputation. Second, there is a plot method for cox.zph
which is useful in assessing proportional hazards. Third, you have assumed linearity for the covariate effects. Fourth, the number of events in your training sample is just barely large enough to fit a reliable model if all covariate effects happen to be linear (which is rare). And your test sample would have to have perhaps 400 events before it would yield a reliable assessment of prediction accuracy. It is not clear that you had enough data to split the data into two parts. Resampling validation (100 repeats of 10-fold cross-validation, or use the bootstrap) is a better solution. Both your original external validation (functions rcorr.cens
and val.surv
) and resampling internal validation (functions validate
, calibrate
) are implemented in the R rms
package. Case studies for the rms
package are found in my course notes at http://biostat.mc.vanderbilt.edu/rms (and I have a 3-day course on this in Nashville next month). Note that $2\times 2$ tables are not appropriate for use with continuous data.
Attribution
Source : Link , Question Author : Alexander , Answer Author : Frank Harrell