I am trying to run a Cox regression on a sample dataset of 2,000,000 rows, using only R. It is a direct translation of a SAS PHREG procedure, and the sample is representative of the structure of the original dataset.

```r
library(survival)

## Replace 100000 by 2,000,000 for the full dataset
test <- data.frame(start = runif(100000, 1, 100),
                   stop = runif(100000, 101, 300),
                   censor = round(runif(100000, 0, 1)),
                   testfactor = round(runif(100000, 1, 11)))
test$testfactorf <- as.factor(test$testfactor)
summ <- coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2), test)
# summary(summ)

##    user  system elapsed
##   9.400   0.090   9.481
```

The main challenge is the compute time on the original dataset (2 million rows). As far as I understand, SAS could take up to a day for this … but at least it finishes.

Running the example with only 100,000 observations takes just 9 seconds. Beyond that, the time increases almost quadratically for every additional 100,000 observations.
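For illustration, the growth in runtime can be checked empirically by timing the same model at a few sample sizes (a rough sketch; the sizes here are smaller than in the question, and absolute timings are machine-dependent):

```r
library(survival)

set.seed(1)

## Time the same model at a few sample sizes to see how the cost grows
sizes <- c(10000, 20000, 40000)
times <- sapply(sizes, function(n) {
  d <- data.frame(start = runif(n, 1, 100),
                  stop = runif(n, 101, 300),
                  censor = round(runif(n, 0, 1)),
                  testfactor = round(runif(n, 1, 11)))
  d$testfactorf <- as.factor(d$testfactor)
  system.time(
    coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2), d)
  )["elapsed"]
})
names(times) <- sizes
times
```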

I have not found any means to parallelize the operation (e.g., we could leverage a 48-core machine if this were possible).
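A divide-and-conquer sketch can at least use multiple cores, at the cost of an approximate answer: fit `coxph` on row chunks via `parallel::mclapply` and pool the per-chunk coefficients by inverse-variance weighting. The chunk count of 8 and the pooling scheme here are illustrative assumptions, not an exact replacement for the full-data fit:

```r
library(survival)
library(parallel)

set.seed(1)
n <- 100000
test <- data.frame(start = runif(n, 1, 100),
                   stop = runif(n, 101, 300),
                   censor = round(runif(n, 0, 1)),
                   testfactor = round(runif(n, 1, 11)))
test$testfactorf <- as.factor(test$testfactor)

## Assign rows round-robin to 8 chunks and fit each chunk on its own core
## (mclapply forks, so it needs a Unix-alike; use parLapply on Windows)
chunks <- split(test, (seq_len(n) - 1) %% 8)
fits <- mclapply(chunks, function(d)
  coxph(Surv(start, stop, censor) ~ relevel(testfactorf, 2), data = d),
  mc.cores = 8)

## Pool per-chunk coefficients by inverse-variance weighting
coefs <- sapply(fits, coef)                       # one column per chunk
vars  <- sapply(fits, function(f) diag(vcov(f)))  # per-chunk variances
w      <- 1 / vars
pooled <- rowSums(coefs * w) / rowSums(w)
pooled
```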

Neither `biglm` nor any package from Revolution Analytics is available for Cox regression, so I cannot leverage those.

Is there a means to reformulate this as a logistic regression (for which Revolution does provide packages), or are there any other alternatives? I know the two models are fundamentally different, but it's the closest approximation I can think of given the circumstances.
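For concreteness, here is the kind of reformulation I have in mind: pooled logistic regression on person-period data, where each subject contributes one row per discrete time interval at risk and interval dummies stand in for the baseline hazard. With short intervals and rare events per interval, the log odds ratios approximate the Cox log hazard ratios. The 25-unit interval width and the manual expansion below are illustrative assumptions:

```r
set.seed(1)
n <- 2000
test <- data.frame(start = runif(n, 1, 100),
                   stop = runif(n, 101, 300),
                   censor = round(runif(n, 0, 1)),
                   testfactor = round(runif(n, 1, 11)))
test$testfactorf <- as.factor(test$testfactor)

## Person-period expansion: one row per subject per discrete interval at
## risk; the event can only occur in the last interval of follow-up
breaks <- seq(0, 300, by = 25)
expand_one <- function(i) {
  s <- test$start[i]; e <- test$stop[i]
  idx <- which(breaks[-length(breaks)] < e & breaks[-1] > s)
  data.frame(id = i,
             interval = idx,
             event = c(rep(0, length(idx) - 1), test$censor[i]),
             testfactorf = test$testfactorf[i])
}
long <- do.call(rbind, lapply(seq_len(n), expand_one))

## Pooled logistic regression with interval dummies as the baseline
## hazard; coefficients on testfactorf approximate the Cox estimates
fit <- glm(event ~ relevel(testfactorf, 2) + factor(interval),
           family = binomial, data = long)
```

Whether this scales better depends on the expanded dataset size, which grows with the number of intervals per subject.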

**Answer**

I run Cox regressions on a 7,000,000-observation dataset using R, and this is not a problem. Indeed, on bivariate models I get the estimates in 52 seconds. I suggest that it is, as often with R, a problem related to the available RAM. You may need at least 12 GB to run the model smoothly.

**Attribution**
*Source : Link , Question Author : xbsd , Answer Author : Mesozoik*