Is there a function in R that takes the centers of clusters that were found and assigns clusters to a new data set

I have two parts of a multidimensional data set, let’s call them train and test. And I want to built a model based on the train data set and then validate it on the test data set.
The number of clusters is known.

I tried to apply k-means clustering in R and I got an object that contains the centers of clusters:

kClust <- kmeans(train, centers=N, nstart=M)

Is there a function in R that takes the centers of clusters that were found and assigns clusters to my test data set?

What are the other methods/algorithms that I can try?

Answer

You can compute the cluster assignments for a new data set with the following function:

clusters <- function(x, centers) {
  # compute squared euclidean distance from each sample to each cluster center
  tmp <- sapply(seq_len(nrow(x)),
                function(i) apply(centers, 1,
                                  function(v) sum((x[i, ]-v)^2)))
  max.col(-t(tmp))  # find index of min distance
}

# create a simple data set with two clusters
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
x_new <- rbind(matrix(rnorm(10, sd = 0.3), ncol = 2),
               matrix(rnorm(10, mean = 1, sd = 0.3), ncol = 2))
colnames(x_new) <- c("x", "y")

cl <- kmeans(x, centers=2)

all.equal(cl[["cluster"]], clusters(x, cl[["centers"]]))
# [1] TRUE
clusters(x_new, cl[["centers"]])
# [1] 2 2 2 2 2 1 1 1 1 1

plot(x, col=cl$cluster, pch=3)
points(x_new, col= clusters(x_new, cl[["centers"]]), pch=19)
points(cl[["centers"]], pch=4, cex=2, col="blue")

cluster assignment

or you could use the flexclust package, which has an implemented predict method for k-means:

library("flexclust")
data("Nclus")

set.seed(1)
dat <- as.data.frame(Nclus)
ind <- sample(nrow(dat), 50)

dat[["train"]] <- TRUE
dat[["train"]][ind] <- FALSE

cl1 = kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, kccaFamily("kmeans"))
cl1    
#
# call:
# kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
#
# cluster sizes:
#
#  1   2   3   4 
#130 181  98  91 

pred_train <- predict(cl1)
pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])

image(cl1)
points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")

flexclust plot

There are also conversion methods to convert the results from cluster functions like stats::kmeans or cluster::pam to objects of class kcca and vice versa:

as.kcca(cl, data=x)
# kcca object of family ‘kmeans’ 
#
# call:
# as.kcca(object = cl, data = x)
#
# cluster sizes:
#
#  1  2 
#  50 50 

Attribution
Source : Link , Question Author : user2598356 , Answer Author : rcs

Leave a Comment