Fast ways in R to get the first row of a data frame grouped by an identifier [closed]

Sometimes I need to get only the first row of a data set grouped by an identifier, as when retrieving age and gender when there are multiple observations per individual. What's a fast (or the fastest) way to do this in R? I used aggregate() below and suspect there are better ways. Before posting this question I searched a bit on Google, then found and tried ddply. I was surprised that it was extremely slow and gave me memory errors on my dataset (400,000 rows x 16 cols, 7,000 unique IDs), whereas the aggregate() version was reasonably fast.

(dx <- data.frame(ID = factor(c(1,1,2,2,3,3)), AGE = c(30,30,40,40,35,35), FEM = factor(c(1,1,0,0,1,1))))
# ID AGE FEM
#  1  30   1
#  1  30   1
#  2  40   0
#  2  40   0
#  3  35   1
#  3  35   1
ag <- data.frame(ID=levels(dx$ID))
ag <- merge(ag, aggregate(AGE ~ ID, data=dx, function(x) x[1]), "ID")
ag <- merge(ag, aggregate(FEM ~ ID, data=dx, function(x) x[1]), "ID")
ag
# ID AGE FEM
#  1  30   1
#  2  40   0
#  3  35   1
#same result:
library(plyr)
ddply(.data = dx, .variables = "ID", .fun = function(x) x[1, ])

UPDATE: See Chase’s answer and Matt Parker’s comment for what I consider to be the most elegant approach. See @Matthew Dowle’s answer for the fastest solution which uses the data.table package.

Answer

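Is your ID column really a factor? If it is in fact numeric, I think you can use the diff function to your advantage: when the rows are sorted by ID, diff is nonzero exactly where a new ID begins, so those positions mark the first row of each group. If ID is a factor, you could also coerce it to numeric with as.numeric(), which returns the integer level codes; that is enough here because only the changes between consecutive rows matter.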

dx <- data.frame(
    ID = sort(sample(1:7000, 400000, TRUE))
    , AGE = sample(18:65, 400000, TRUE)
    , FEM = sample(0:1, 400000, TRUE)
)

dx[ diff(c(0,dx$ID)) != 0, ]
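
To make the diff idea concrete, here is a minimal sketch on the small data frame from the question, where ID is a factor. It assumes the rows are already grouped by ID, as in that example. The names id_codes, first_by_diff and first_by_dup are only illustrative, and the duplicated() line is a base R alternative added for comparison, not part of the answer above; it does not need the rows to be sorted.

```r
# Toy data from the question: ID is a factor and rows are already grouped by ID.
dx <- data.frame(
    ID  = factor(c(1, 1, 2, 2, 3, 3)),
    AGE = c(30, 30, 40, 40, 35, 35),
    FEM = factor(c(1, 1, 0, 0, 1, 1))
)

# as.numeric() on a factor returns its integer level codes (1, 2, 3, ...),
# which is all we need to detect where the ID changes.
id_codes <- as.numeric(dx$ID)

# Prepending 0 makes the first row count as a change; this assumes no code is 0,
# which always holds for factor level codes.
first_by_diff <- dx[diff(c(0, id_codes)) != 0, ]

# Alternative (an assumption of this sketch, not from the answer):
# keep the first occurrence of each ID, regardless of row order.
first_by_dup <- dx[!duplicated(dx$ID), ]

print(first_by_diff)
```

On sorted data both approaches select the same rows. If the rows are not grouped by ID, the diff version returns the first row of every run of equal IDs rather than one row per ID, so sorting first (or using duplicated()) is the safer choice.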

Attribution
Source: Link, Question Author: lockedoff, Answer Author: Chase
