I have two variables, X and Y, and I need to group them into at most 5 clusters (5 is also the ideal number). Suppose an ideal plot of the variables looks like the following:
I would like to make 5 clusters of this. Something like this:
So I think this is a mixture model with 5 clusters. Each cluster has a center point and a confidence circle around it.
The clusters are not always as tidy as this. They can look like the following, where sometimes two clusters are close together or one or two clusters are missing completely.
How can I fit a mixture model and perform classification (clustering) effectively in this situation?
Example:
set.seed(1234)
X <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3), rnorm(200, 65, 3), rnorm(200, 80, 5))
Y <- c(rnorm(1000, 30, 2))
plot(X, Y, ylim = c(10, 60), pch = 19, col = "gray40")
Answer
Here is a script that fits a mixture model using the mclust package.
X <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3), rnorm(200,65, 3), rnorm(200,80,5))
Y <- c(rnorm(1000, 30, 2))
plot(X,Y, ylim = c(10, 60), pch = 19, col = "gray40")
require(mclust)
xyMclust <- Mclust(data.frame (X,Y))
plot(xyMclust)
In a situation where there are fewer than 5 clusters:
X1 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3), rnorm(200,80,5))
Y1 <- c(rnorm(800, 30, 2))
xyMclust <- Mclust(data.frame (X1,Y1))
plot(xyMclust)
xyMclust3 <- Mclust(data.frame (X1,Y1), G=3)
plot(xyMclust3)
In this case we are fitting 3 clusters. What if we fit 5 clusters?
xyMclust5 <- Mclust(data.frame (X1,Y1), G=5)
plot(xyMclust5)
Setting G=5 forces the fit to produce 5 clusters, even though this data set was generated from only 4 groups.
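Since the question asks for at most 5 clusters rather than exactly 5, one option is to restrict the candidate number of components and let BIC choose among them. The following is a minimal sketch, assuming the same simulated X1/Y1 data as above (regenerated here with a seed so the block stands alone); the names dat and fitMax5 are only illustrative.

```r
library(mclust)

# Same simulated data as above: 4 groups along X, one band along Y
set.seed(1234)
X1 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3), rnorm(200, 80, 5))
Y1 <- rnorm(800, 30, 2)
dat <- data.frame(X1, Y1)

# G = 1:5 lets BIC pick any number of components from 1 to 5
# instead of forcing a fixed count
fitMax5 <- Mclust(dat, G = 1:5)

fitMax5$G                      # number of components selected by BIC
table(fitMax5$classification)  # cluster sizes (hard assignments)
head(fitMax5$uncertainty)      # per-point classification uncertainty
summary(fitMax5)
```

The classification element gives the cluster label for each observation, which is what the question calls classification.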
Also let’s introduce some random noise:
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3), rnorm(200,80,5), runif(50,1,100 ))
Y2 <- c(rnorm(850, 30, 2))
xyMclust1 <- Mclust(data.frame (X2,Y2))
plot(xyMclust1)
mclust allows model-based clustering with noise, that is, outlying observations that do not belong to any cluster. mclust also lets you specify a prior distribution to regularize the fit to the data. A function priorControl is provided in mclust for specifying the prior and its parameters. When called with its defaults, it invokes another function called defaultPrior, which can serve as a template for specifying alternative priors. To include noise in the modeling, an initial guess of the noise observations must be supplied via the noise component of the initialization argument in Mclust or mclustBIC.
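As a rough illustration of the noise and prior options, here is a sketch on the noisy X2/Y2 data. The initial noise guess is an assumption made for this example only: points whose distance to their 5th nearest neighbour falls in the top 5% are flagged as noise. Any other reasonable first guess could be supplied instead, and the names dat2, noiseInit, fitNoise and fitPrior are illustrative.

```r
library(mclust)

# Simulated data with 50 uniformly scattered points along X as noise
set.seed(1234)
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
        rnorm(200, 80, 5), runif(50, 1, 100))
Y2 <- rnorm(850, 30, 2)
dat2 <- data.frame(X2, Y2)

# Crude initial noise guess (assumption for illustration):
# flag the 5% of points that are farthest from their 5th nearest neighbour
d <- as.matrix(dist(dat2))
knnDist <- apply(d, 1, function(r) sort(r)[6])  # element 1 is the point itself
noiseInit <- knnDist > quantile(knnDist, 0.95)

# Fit with a noise component, allowing 1 to 5 Gaussian clusters
fitNoise <- Mclust(dat2, G = 1:5, initialization = list(noise = noiseInit))
table(fitNoise$classification)  # label 0 marks observations assigned to noise

# Separate fit using the default prior to regularize the estimates
fitPrior <- Mclust(dat2, G = 1:5, prior = priorControl())
summary(fitPrior)
```

The initial guess only starts the algorithm; the final noise assignment is re-estimated during fitting and can differ from noiseInit.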
Another alternative is the mixtools package, which lets you specify starting values for the mean and sigma of each component.
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25,3), rnorm(200,35,3),
rnorm(200,80,5), rpois(50,30))
Y2 <- c(rnorm(800, 30, 2), rpois(50,30))
df <- cbind (X2, Y2)
require(mixtools)
out <- mvnormalmixEM(df, lambda = NULL, mu = NULL, sigma = NULL,
k = 5,arbmean = TRUE, arbvar = TRUE, epsilon = 1e-08, maxit = 10000, verb = FALSE)
plot(out, density = TRUE, alpha = c(0.01, 0.05, 0.10, 0.12, 0.15), marginal = TRUE)
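To get an actual cluster label for each point from a mixtools fit, one approach is to assign every observation to the component with the highest posterior probability. The sketch below also shows how starting means can be supplied. The values in muStart are hypothetical guesses chosen for this simulated data, and the looser epsilon is only there to keep the run short.

```r
library(mixtools)

# Simulated data with 4 groups along X (no noise, to keep the sketch simple)
set.seed(1234)
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3), rnorm(200, 80, 5))
Y2 <- rnorm(800, 30, 2)
df <- cbind(X2, Y2)

# Hypothetical starting means, one (X, Y) vector per component
muStart <- list(c(10, 30), c(25, 30), c(35, 30), c(80, 30))

out4 <- mvnormalmixEM(df, mu = muStart, k = 4, epsilon = 1e-04)

# Hard assignment: component with the largest posterior probability per row
cluster <- apply(out4$posterior, 1, which.max)
table(cluster)
```

The rows of out4$posterior sum to 1, so they can also be used directly as soft memberships when two clusters overlap.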
Attribution
Source : Link , Question Author : rdorlearn , Answer Author : John