# How to fit a mixture model for clustering

I have two variables, X and Y, and I need to group them into at most 5 clusters (5 is also the expected optimum). In the ideal case the data would split cleanly into 5 clusters, so I think this is a mixture model with 5 components. Each cluster has a center point and a confidence circle around it.

The clusters are not always this clean. Sometimes two clusters are close together, or one or two clusters are missing entirely. How can I fit a mixture model and perform classification (clustering) effectively in this situation?

Example:

``````
set.seed(1234)
X <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
       rnorm(200, 65, 3), rnorm(200, 80, 5))
Y <- rnorm(1000, 30, 2)
plot(X, Y, ylim = c(10, 60), pch = 19, col = "gray40")
``````

Here is a script that fits a mixture model using `mclust`.

``````
# Five Gaussian groups along X, one common spread in Y
X <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
       rnorm(200, 65, 3), rnorm(200, 80, 5))
Y <- rnorm(1000, 30, 2)
plot(X, Y, ylim = c(10, 60), pch = 19, col = "gray40")

library(mclust)
# With no G given, Mclust picks the number of clusters by BIC
xyMclust <- Mclust(data.frame(X, Y))
plot(xyMclust)
``````

In a situation where there are fewer than 5 clusters:

``````
# Only four groups this time (the one centered at 65 is missing)
X1 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
        rnorm(200, 80, 5))
Y1 <- rnorm(800, 30, 2)
xyMclust <- Mclust(data.frame(X1, Y1))
plot(xyMclust)
``````

``````
# Fix the number of clusters at 3
xyMclust3 <- Mclust(data.frame(X1, Y1), G = 3)
plot(xyMclust3)
``````

In this case we are fitting 3 clusters. What if we fit 5 clusters?

``````
# Fix the number of clusters at 5
xyMclust5 <- Mclust(data.frame(X1, Y1), G = 5)
plot(xyMclust5)
``````

Setting `G = 5` forces the fit to use 5 clusters. Now let's also introduce some random noise:

``````
# Four Gaussian groups plus 50 uniform noise points along X
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
        rnorm(200, 80, 5), runif(50, 1, 100))
Y2 <- rnorm(850, 30, 2)
xyMclust1 <- Mclust(data.frame(X2, Y2))
plot(xyMclust1)
``````
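The default call above has no explicit noise component. As a minimal sketch, the search can be capped at 5 clusters with `G = 1:5` while an initial noise guess is passed through `initialization = list(noise = ...)`. The nearest-neighbour heuristic used for the guess, including the 10% cutoff and the choice of the 10th neighbour, is an assumption made for illustration and is not part of `mclust`.

```r
library(mclust)

set.seed(1234)
# Same layout as the noisy example: four Gaussian groups plus uniform noise in X
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
        rnorm(200, 80, 5), runif(50, 1, 100))
Y2 <- rnorm(850, 30, 2)
xy <- data.frame(X2, Y2)

# ASSUMPTION: crude initial noise guess. Flag the 10% of points whose
# 10th-nearest neighbour is farthest away (both numbers are arbitrary).
d <- as.matrix(dist(scale(xy)))
knn_dist <- apply(d, 1, function(row) sort(row)[11])  # position 1 is the point itself
noise_guess <- as.vector(knn_dist > quantile(knn_dist, 0.90))

# Cap the search at 5 components and let BIC choose within that range
fit <- Mclust(xy, G = 1:5, initialization = list(noise = noise_guess))

summary(fit)
table(fit$classification)  # class 0 holds the points assigned to noise
plot(fit, what = "classification")
```

Passing a range for `G` rather than a single value lets BIC settle on fewer than 5 clusters when some are missing, which matches the "at most 5" requirement. The quality of the noise classification depends on the initial guess, so it is worth trying more than one.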

`mclust` supports model-based clustering with noise, that is, outlying observations that do not belong to any cluster. It also lets you specify a prior distribution to regularize the fit to the data: the function `priorControl` is provided for specifying the prior and its parameters, and when called with its defaults it invokes another function, `defaultPrior`, which can serve as a template for alternative priors. To include noise in the model, an initial guess of the noise observations must be supplied via the `noise` component of the `initialization` argument in `Mclust` or `mclustBIC`.

Another alternative is the `mixtools` package, which lets you specify the mean and sigma for each component.

``````
# Four Gaussian groups plus 50 Poisson-distributed noise points in both X and Y
X2 <- c(rnorm(200, 10, 3), rnorm(200, 25, 3), rnorm(200, 35, 3),
        rnorm(200, 80, 5), rpois(50, 30))
Y2 <- c(rnorm(800, 30, 2), rpois(50, 30))
xy <- cbind(X2, Y2)

library(mixtools)
# lambda, mu and sigma are left NULL here, so starting values are generated internally
out <- mvnormalmixEM(xy, lambda = NULL, mu = NULL, sigma = NULL, k = 5,
                     arbmean = TRUE, arbvar = TRUE, epsilon = 1e-08,
                     maxit = 10000, verb = FALSE)
plot(out, density = TRUE, alpha = c(0.01, 0.05, 0.10, 0.12, 0.15),
     marginal = TRUE)
`````` 
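Since the question also asks about classification, here is a minimal sketch of turning a `mixtools` fit into hard cluster labels by taking the most probable component for each point. The two-group data, `k = 2` and the looser `epsilon` are assumptions chosen to keep the example small and quick. They are not the settings used above.

```r
library(mixtools)

set.seed(1234)
# ASSUMPTION: two well-separated bivariate Gaussian groups, for illustration only
xy <- rbind(cbind(rnorm(150, 10, 2), rnorm(150, 30, 2)),
            cbind(rnorm(150, 40, 2), rnorm(150, 30, 2)))

fit <- mvnormalmixEM(xy, k = 2, epsilon = 1e-04)

# posterior is an n x k matrix of membership probabilities;
# assign each point to its most probable component
cluster <- apply(fit$posterior, 1, which.max)

table(cluster)
fit$mu      # list of estimated component means (cluster centers)
fit$lambda  # estimated mixing proportions
plot(xy, col = cluster, pch = 19, xlab = "X", ylab = "Y")
```

The same `apply(..., which.max)` step should carry over to the 5-component fit `out`. The rows of `out$posterior` can also be inspected directly to see how confident each assignment is.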