# Calibration of Cox regression survival analysis

1. To perform calibration of a Cox regression model (i.e. assessing the agreement between the predicted and the observed outcome), what is the best way to present the accuracy of the model in predicting the actual event?

2. As far as I understand, we can calculate the actual outcome probability by observing the number of events that occur among subjects with similar/identical predicted probabilities from the Cox model. To perform this calculation, do we stratify the predicted risk into several groups (<15%, 15–30%, 30–45%, etc.) and, within each risk group, use the number of subjects as the denominator when calculating the actual outcome?

3. What method do we use to compare the predicted outcome with the actual outcome? Is it sufficient to simply present the predicted and actual risk (%) for each risk group in a table? Can the `rms` package in R do all of the calibration for you?

4. Can we use `pec::predictSurvProb()` to obtain the absolute risk of the event for each individual? Can we specify the time point of the risk/hazard function for each individual to be the END of follow-up?

5. When interpreting the results, do we use the mean follow-up period (in years) as the time point on which the predicted and actual risks are based? (E.g. individual A has a 30% risk of the event at 6.5 years, the mean follow-up period.)

6. Is the goodness-of-fit test for Cox regression (the Grønnesby and Borgan test) simply a means of calibration for Cox regression, or does it mean something else?

7. To compare models with the net reclassification improvement, how many subjects and outcomes do we need for such a method to be valid?

1. Cox models do not predict outcomes! The “best” method depends on whether you obtain a risk score (as with Framingham) or an absolute risk (as with the Gail breast cancer risk model). You need to tell us exactly what you’re fitting.

2. With absolute risk prediction, you can split subjects into groups by their risk deciles and calculate the proportions of observed vs. expected outcome frequencies. This is basically the Hosmer–Lemeshow test. But in order to use this test, you need to have an absolute risk prediction! You cannot, say, split the groups by risk-score deciles and use the empirical risk as the risk prediction; that strips out too much information and leads to some counterintuitive results.
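The decile-grouping idea can be sketched numerically. The snippet below is a minimal illustration (in Python with numpy, not the R packages mentioned in the question) on hypothetical data: predicted absolute risks and binary event indicators at a fixed horizon. It deliberately ignores censoring; with real survival data the observed proportion per group would come from a Kaplan–Meier estimate instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: predicted absolute risks from some fitted model, and
# 0/1 event indicators at a fixed horizon (censoring ignored for simplicity).
pred_risk = rng.uniform(0.01, 0.60, size=1000)
event = rng.binomial(1, pred_risk)  # simulate perfectly calibrated outcomes

# Split subjects into deciles of predicted risk.
cuts = np.quantile(pred_risk, np.linspace(0, 1, 11))
group = np.digitize(pred_risk, cuts[1:-1])  # group labels 0..9

for g in range(10):
    mask = group == g
    observed = event[mask].mean()      # observed event proportion in decile
    expected = pred_risk[mask].mean()  # mean predicted risk in decile
    print(f"decile {g}: n={mask.sum():4d}  "
          f"observed={observed:.3f}  expected={expected:.3f}")
```

Because the simulated outcomes are drawn from the predicted risks, observed and expected proportions agree up to sampling noise in every decile, which is what a well-calibrated model should show.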

3. Bioconductor hosts a suite of tools related to ROC analyses, predictiveness curves, etc.

4. Nowhere in Ulla’s package is mention made of estimating a smoothed baseline hazard. That is necessary to obtain risk predictions from survival models… because of censoring! Here’s an example of that method being applied. I would accept no less from the package.
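To make the baseline-hazard point concrete, here is a minimal sketch (in Python with numpy, on made-up data) of the unsmoothed Breslow estimator of the cumulative baseline hazard, which is one standard route from a fitted Cox model's linear predictors to an absolute risk at a chosen time point:

```python
import numpy as np

# Hypothetical survival data: follow-up times, event indicators (1=event,
# 0=censored), and linear predictors beta'x from an already-fitted Cox model.
time  = np.array([2., 3., 3., 5., 7., 8., 10., 12.])
event = np.array([1,  1,  0,  1,  0,  1,  1,   0])
lp    = np.array([0.8, 0.5, -0.2, 0.1, 0.4, -0.5, 0.0, -0.3])

def breslow_cumhaz(t, time, event, lp):
    """Breslow estimate of the cumulative baseline hazard H0(t)."""
    risk = np.exp(lp)
    H0 = 0.0
    for ti in np.unique(time[event == 1]):  # sorted distinct event times
        if ti > t:
            break
        d = np.sum((time == ti) & (event == 1))  # events at ti
        at_risk = risk[time >= ti].sum()         # risk set at ti
        H0 += d / at_risk
    return H0

# Absolute risk of the event by t=10 for a new subject with linear predictor lp_new,
# via S(t | x) = exp(-H0(t) * exp(lp)).
lp_new = 0.3
H0 = breslow_cumhaz(10.0, time, event, lp)
risk_10 = 1.0 - np.exp(-H0 * np.exp(lp_new))
print(f"predicted absolute risk by t=10: {risk_10:.3f}")
```

The step-function estimate above is what smoothing (e.g. kernel methods) would then refine; the point is simply that some baseline hazard estimate is required before a Cox model can yield absolute risks at all.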

5. No, don’t use mean follow-up. You should report total person-years of follow-up, along with the censoring rate and the event rate. The Kaplan–Meier curve more or less shows you all of that.
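These summaries are cheap to compute; a minimal sketch on hypothetical follow-up data (Python with numpy, purely illustrative):

```python
import numpy as np

# Hypothetical follow-up times (years) and event indicators (1=event, 0=censored).
time  = np.array([1.5, 4.0, 6.5, 2.0, 8.0, 3.5, 5.0, 7.5])
event = np.array([1,   0,   1,   0,   0,   1,   0,   1])

person_years   = time.sum()               # total person-years of follow-up
n_events       = int(event.sum())
event_rate     = n_events / person_years  # events per person-year
censoring_rate = 1 - event.mean()         # fraction of subjects censored

print(f"total person-years : {person_years:.1f}")
print(f"events             : {n_events}")
print(f"event rate         : {event_rate:.3f} per person-year")
print(f"censored           : {censoring_rate:.0%}")
```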

6. I’m sure Sir David Cox is not fond of Grønnesby and Borgan’s test. The power of the Cox model is that it can give consistent inference without necessarily having predictive accuracy: a tough concept for many to grasp. Tsiatis’ book on semiparametric inference has a lot to say about this. However, if you aim to take the Cox model one step further and create predictions from it, then I think the Grønnesby–Borgan test is well suited to that purpose.

7. Reclassification indices are the proportions of individuals shuffled into different (more discriminating) risk categories when comparing two competing risk prediction models (see Pencina). It’s important to realize (Kerr 2011) that you can calculate confidence intervals for this value… not with the ordinary bootstrap (or any limit theory that treats the model as fixed) but with the double bootstrap (draw a bootstrap sample, refit the model, bootstrap again, and recalibrate the models).
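The point estimate itself is straightforward. The sketch below (Python with numpy, on hypothetical category assignments) computes the categorical net reclassification improvement in the Pencina sense: net upward moves among events plus net downward moves among non-events.

```python
import numpy as np

# Hypothetical risk categories (0=low, 1=mid, 2=high) assigned to the same
# subjects by an old and a new model, plus the observed 0/1 outcome.
old_cat = np.array([0, 0, 1, 1, 2, 0, 1, 2, 2, 1])
new_cat = np.array([1, 1, 2, 1, 2, 0, 0, 1, 2, 2])
outcome = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 1])

def nri(old_cat, new_cat, outcome):
    """Categorical net reclassification improvement for two models."""
    up   = new_cat > old_cat
    down = new_cat < old_cat
    ev, ne = outcome == 1, outcome == 0
    nri_events    = up[ev].mean() - down[ev].mean()   # net up among events
    nri_nonevents = down[ne].mean() - up[ne].mean()   # net down among non-events
    return nri_events + nri_nonevents

print(f"NRI = {nri(old_cat, new_cat, outcome):+.3f}")
```

The uncertainty around this number is the hard part: as noted above, a valid interval needs resampling that re-derives the models themselves, not a bootstrap that treats the fitted categories as fixed.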