How do I remove all but one specific duplicate record in an R data frame? [closed]

I have a data frame that contains some duplicate ids. I want to remove records with duplicate ids, keeping only the row with the maximum value.

So for structured like this (other variables not shown):

id var_1
1 2
1 4
2 1
2 3
3 5
4 2

I want to generate this:

id var_1
1 4
2 3
3 5
4 2

I know about unique() and duplicated(), but I can’t figure out how to incorporate the maximization rule…

Answer

One way is to reverse-sort the data and use duplicated to drop all the duplicates.
For me, this method is conceptually simpler than those that use apply. I think it should be very fast as well.

# Some data to start with:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
#  1   2
#  1   4
#  2   1
#  2   3
#  3   5
#  4   2

# Reverse sort
z <- z[order(z$id, z$var, decreasing=TRUE),]
# id var
#  4   2
#  3   5
#  2   3
#  2   1
#  1   4
#  1   2

# Keep only the first row for each duplicate of z$id; this row will have the
# largest value for z$var
z <- z[!duplicated(z$id),]

# Sort so it looks nice
z <- z[order(z$id, z$var),]
# id var
#  1   4
#  2   3
#  3   5
#  4   2

Edit: I just realized that the reverse sort above doesn’t even need to sort on id at all. You could just use z[order(z$var, decreasing=TRUE),] instead and it will work just as well.

One more thought… If the var column is numeric, then there’s a simple way to sort so that id is ascending, but var is descending. This eliminates the need for the sort at the end (assuming you even wanted it to be sorted).

z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))

# Sort: id ascending, var descending
z <- z[order(z$id, -z$var),]

# Remove duplicates
z <- z[!duplicated(z$id),]
# id var
#  1   4
#  2   3
#  3   5
#  4   2

Attribution
Source : Link , Question Author : Abe , Answer Author : wch

Leave a Comment