I am trying to run a Cox regression on a sample dataset of 2,000,000 rows, using only R, as follows. It is a direct translation of a PHREG call in SAS. The sample is representative of the structure of the original dataset.
```r
library(survival)

## Replace 100000 with 2000000 for the full-size run
test <- data.frame(start      = runif(100000, 1, 100),
                   stop       = runif(100000, 101, 300),
                   censor     = round(runif(100000, 0, 1)),
                   testfactor = round(runif(100000, 1, 11)))
test$testfactorf <- as.factor(test$testfactor)
summ <- coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2), test)
# summary(summ)

##    user  system elapsed
##   9.400   0.090   9.481
```
The main challenge is the compute time on the original dataset (2M rows). As far as I understand, SAS could take up to a day on it, but at least it finishes.
Running the example with only 100,000 observations takes just 9 seconds. Thereafter, the time increases almost quadratically with every additional 100,000 observations.
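For what it's worth, the scaling claim above is easy to check empirically with a small timing loop. This is just an illustrative sketch; the sizes in `sizes` and the helper names are my own choices, not from the original post:

```r
library(survival)

## Time coxph() on simulated data of increasing size to see how the fit
## scales with the number of observations (sizes are illustrative).
sizes <- c(10000, 20000, 40000)
times <- sapply(sizes, function(n) {
  d <- data.frame(start      = runif(n, 1, 100),
                  stop       = runif(n, 101, 300),
                  censor     = round(runif(n, 0, 1)),
                  testfactor = round(runif(n, 1, 11)))
  d$testfactorf <- as.factor(d$testfactor)
  system.time(
    coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2), d)
  )["elapsed"]
})
## e.g. plot(sizes, times) to eyeball the growth curve
```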
I have not found any way to parallelize the operation (e.g., we could leverage a 48-core machine if that were possible).
Neither biglm nor any package from Revolution Analytics is available for Cox regression, so I cannot leverage those.
Is there a way to represent this problem as a logistic regression (for which Revolution does have packages), or is there any other alternative? I know the two models are fundamentally different, but it's the closest possibility I can think of given the circumstances.
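One alternative worth noting (my suggestion, not from the post): the Cox model can be approximated by a piecewise-exponential Poisson GLM after splitting follow-up time into intervals, and Poisson GLMs are exactly the kind of model that chunked/out-of-memory tools such as bigglm can fit. A minimal sketch, with illustrative cut points and names (`cuts`, `tgroup`, `risktime` are my own):

```r
## Sketch only: approximate the Cox model with a piecewise-exponential
## Poisson GLM.  `cuts`, `tgroup`, and `risktime` are illustrative names.
library(survival)

set.seed(1)
n <- 10000                                  # small stand-in for 2,000,000
test <- data.frame(start      = runif(n, 1, 100),
                   stop       = runif(n, 101, 300),
                   censor     = round(runif(n, 0, 1)),
                   testfactor = round(runif(n, 1, 11)))
test$testfactorf <- relevel(as.factor(test$testfactor), 2)

## Split each record at a grid of cut points; every resulting piece gets
## an interval index `tgroup` and its own at-risk time.
cuts <- seq(100, 300, by = 25)
long <- survSplit(Surv(start, stop, censor) ~ testfactorf,
                  data = test, cut = cuts, episode = "tgroup")
long$risktime <- long$stop - long$start

## Poisson GLM with a log(at-risk-time) offset and one baseline rate per
## interval; the testfactorf coefficients approximate the Cox log-hazard
## ratios, and this model form is amenable to chunked fitting.
fit <- glm(censor ~ factor(tgroup) + testfactorf + offset(log(risktime)),
           family = poisson, data = long)
```

The trade-off is coarseness: the approximation improves as the cut grid gets finer, at the cost of a longer split dataset.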
I run Cox regressions on a 7,000,000-observation dataset using R, and this is not a problem. Indeed, on bivariate models I get the estimates in 52 seconds. I suspect that it is, as is often the case with R, a problem related to the available RAM. You may need at least 12 GB to run the model smoothly.