# Finding category with maximum likelihood method

Let's say we have height data for a group of men and women.

R code:

``````
set.seed(1)
Women=rnorm(80, mean=168, sd=6)
Men=rnorm(120, mean=182, sd=7)
par(mfrow=c(2,1))
hist(Men, xlim=c(150, 210), col="skyblue")
hist(Women, xlim=c(150, 210), col="pink")
``````

Unfortunately, something happened and we lost the information about who is a woman and who is a man.

R code:

``````
heights=c(Men, Women)
par(mfrow=c(1,1))
hist(heights, col="gray70")
rm(Women, Men)  # R is case-sensitive: the objects are Women and Men
``````

Could we somehow estimate the mean and standard deviation of women's and men's heights using the maximum likelihood method?

We know that men's and women's heights are each normally distributed.

This is a classic unsupervised learning problem that has a simple maximum likelihood solution. The solution is a motivating example for the expectation maximization algorithm. The process is:

1. Initialize the group assignments.
2. Estimate the group-wise means and standard deviations.
3. Calculate the likelihood of each observation under each group.
4. Assign each observation to the group under which its likelihood is highest.

Repeat steps 2-4 until convergence, i.e. until no observation is reassigned.
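As a small illustration of steps 3 and 4 for a single observation, here is a minimal sketch. The height `x` is made up, and the parameters are the simulation's true values used as stand-ins; in the algorithm itself they would be the current estimates from step 2.

```r
# Illustrative values only: a hypothetical height of 175 cm, with the
# simulation's true parameters standing in for the current estimates.
x <- 175
mu <- c(F = 168, M = 182)
sigma <- c(F = 6, M = 7)

# Step 3: log-density of x under each group's normal distribution
ll <- dnorm(x, mean = mu, sd = sigma, log = TRUE)

# Step 4: choose the group with the larger log-density
label <- names(mu)[which.max(ll)]
print(ll)
print(label)
```

Working on the log scale does not change which group wins, and it avoids very small numbers when densities are tiny.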

WLOG, I can assume I know that 80 of the 200 people are women. Another thing to note: if we don't build in the assumption that women are shorter than men, a clustering algorithm has no way of knowing which group should carry which label, so the cluster labels can come out reversed.
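Because the labels are arbitrary, a clustering can look badly wrong when scored literally even though it separates the groups well. One way to score it, sketched here with made-up toy labels (not the simulated heights), is to take the better of the two possible labelings:

```r
# Toy labels, invented purely for illustration.
truth <- c("F", "F", "F", "M", "M", "M")
pred  <- c("M", "M", "F", "F", "F", "F")

# Accuracy with the cluster names taken literally
acc_raw <- mean(pred == truth)

# Same clusters with the two names exchanged
swapped <- ifelse(pred == "F", "M", "F")
acc_swapped <- mean(swapped == truth)

# Score the clustering by whichever labeling agrees better with the truth
acc <- max(acc_raw, acc_swapped)
print(c(raw = acc_raw, swapped = acc_swapped, best = acc))
```

With two groups the two accuracies always sum to 1, which is why a run with reversed labels can report 15% correct where the other reports 85%.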

``````
set.seed(1)
Women=rnorm(80, mean=168, sd=6)
Men=rnorm(120, mean=182, sd=7)
AllHeight <- c(Women, Men)
trueMF <- rep(c('F', 'M'), c(80, 120))

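## case 1: assume women are shorter, so label the
## 80 lowest heights 'F' (rank() gives each height's position
## in sorted order; order() would return a sorting permutation)
MF <- ifelse(rank(AllHeight) <= 80, 'F', 'M')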

## case 2: try randomly allocating the labels
# MF <- sample(trueMF, replace = FALSE)

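steps <- 0

repeat {
  steps <- steps + 1
  ## step 2: per-group means and standard deviations
  mu <- tapply(AllHeight, MF, mean)
  sigma <- tapply(AllHeight, MF, sd)  # not named `sd`, to avoid masking stats::sd
  ## step 3: 200 x 2 matrix of log-densities, columns ordered F, M
  logLik <- mapply(dnorm, x = list(AllHeight), mean = mu, sd = sigma,
                   log = TRUE)
  ## step 4: relabel each height by its more likely group
  MFnew <- c('F', 'M')[apply(logLik, 1, which.max)]
  if (all(MF == MFnew)) break
  MF <- MFnew
}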

## case 1:
# 85% correct
# 2 steps
# Means
# F        M
# 168.7847 183.5424

## case 2:
## 15% correct
## 7 steps
# F        M
# 183.5424 168.7847

## what else?
``````
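One answer to "what else?" is the soft-assignment version of the same idea, i.e. the EM algorithm proper. The sketch below is not the code above: it swaps the hard labels for membership probabilities and also estimates the mixing weight instead of assuming 80 women. The variable names (`x`, `p`, `r`, `s`), the median-split starting values, the tolerance, and the iteration cap are all my own illustrative choices.

```r
# Soft-assignment EM for a two-component normal mixture (illustrative sketch).
set.seed(1)
Women <- rnorm(80, mean = 168, sd = 6)
Men <- rnorm(120, mean = 182, sd = 7)
x <- c(Women, Men)

# Assumed starting values: split the data at the median.
lower <- x <= median(x)
p  <- 0.5                                    # mixing weight of the shorter group
mu <- c(mean(x[lower]), mean(x[!lower]))
s  <- c(sd(x[lower]), sd(x[!lower]))

loglik_old <- -Inf
for (iter in 1:500) {
  # E-step: probability that each height belongs to the shorter group
  d1 <- p * dnorm(x, mu[1], s[1])
  d2 <- (1 - p) * dnorm(x, mu[2], s[2])
  r  <- d1 / (d1 + d2)

  # Observed-data log-likelihood under the current parameters
  loglik <- sum(log(d1 + d2))
  if (loglik - loglik_old < 1e-8) break      # stop once it no longer improves
  loglik_old <- loglik

  # M-step: probability-weighted updates of weight, means, and sds
  p  <- mean(r)
  mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  s  <- sqrt(c(sum(r * (x - mu[1])^2) / sum(r),
               sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
}

estimates <- data.frame(group = c("shorter", "taller"),
                        weight = c(p, 1 - p), mean = mu, sd = s)
print(estimates)
```

Here each height contributes to both groups in proportion to `r`, so observations in the overlap region are not forced into one group. As before, nothing in the fit says the shorter group is the women; that interpretation comes from outside knowledge.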