Suppose we had height data for men and women.
R code:
set.seed(1)
Women=rnorm(80, mean=168, sd=6)
Men=rnorm(120, mean=182, sd=7)
par(mfrow=c(2,1))
hist(Men, xlim=c(150, 210), col="skyblue")
hist(Women, xlim=c(150, 210), col="pink")
Unfortunately, something happened and we lost the information about who is a woman and who is a man.
R code:
heights=c(Men, Women)
par(mfrow=c(1,1))
hist(heights, col="gray70")
rm(Women, Men)
Could we somehow estimate the mean heights and standard deviations of women and men using the maximum likelihood method?
We know that men's and women's heights are normally distributed.
Answer
This is a classic unsupervised learning problem that has a simple maximum likelihood solution. The solution is a motivating example for the expectation maximization algorithm. The process is:
1. Initialize the group assignment.
2. Estimate the group-wise means and standard deviations.
3. Calculate the likelihood of membership in either group for each observation.
4. Assign each observation to the group with the higher likelihood.
Repeat steps 2-4 until convergence, i.e. until no observation is reassigned.
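One way to write down what the steps above are doing, in notation that is mine rather than the original answer's: let $z_i \in \{F, M\}$ be the unknown label of height $x_i$, and let $\phi(x; \mu, \sigma)$ be the normal density. Alternating between parameter updates and label updates can be read as (approximately) increasing a classification log-likelihood, with each label set to the group under which the observation is most likely:

$$
\ell_C(z, \mu, \sigma) = \sum_{i=1}^{200} \log \phi\left(x_i; \mu_{z_i}, \sigma_{z_i}\right),
\qquad
z_i \leftarrow \arg\max_{g \in \{F, M\}} \log \phi\left(x_i; \mu_g, \sigma_g\right).
$$

For comparison, the usual mixture log-likelihood, which soft-assignment EM maximizes, averages over the two labels with a mixing proportion $\pi$ instead of fixing them:

$$
\ell(\pi, \mu, \sigma) = \sum_{i=1}^{200} \log\left[\pi\, \phi(x_i; \mu_F, \sigma_F) + (1 - \pi)\, \phi(x_i; \mu_M, \sigma_M)\right].
$$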
Without loss of generality, I can assume I know that 80 of the 200 people are women. Also note that unless we build in the assumption that women are shorter than men, a clustering algorithm cannot tell which group should get which label, so the cluster labels can come out reversed.
set.seed(1)
Women=rnorm(80, mean=168, sd=6)
Men=rnorm(120, mean=182, sd=7)
AllHeight <- c(Women, Men)
trueMF <- rep(c('F', 'M'), c(80, 120))
## case 1: assume women are shorter, so label the
## 80 lowest heights as 'F'
MF <- ifelse(rank(AllHeight) <= 80, 'F', 'M')
## case 2: try randomly allocating the labels
# MF <- sample(trueMF, replace = FALSE)
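steps <- 0
repeat {
  steps <- steps + 1
  ## group-wise estimates (tapply orders the groups as F, M)
  mu <- tapply(AllHeight, MF, mean)
  sigma <- tapply(AllHeight, MF, sd)
  ## 200 x 2 matrix of log densities, one column per group
  logLik <- mapply(dnorm, x=list(AllHeight), mean=mu, sd=sigma,
                   log=TRUE)
  ## relabel each observation with its most likely group
  MFnew <- c('F', 'M')[apply(logLik, 1, which.max)]
  if (all(MF==MFnew)) break
  else MF <- MFnew
}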
## case 1:
# 85% correct
# 2 steps
# Means
# F M
# 168.7847 183.5424
## case 2:
## 15% correct
## 7 steps
# F M
# 183.5424 168.7847
## what else?
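The loop above makes hard assignments: each height is given entirely to one group. As a complement, here is a minimal sketch of the soft-assignment EM for a two-component normal mixture, where each height gets a membership probability instead of a label. The function name `em_two_normals`, the median-split starting values, and the stopping tolerance are my own illustrative choices, not part of the original answer.

```r
# Soft-assignment EM for a two-component normal mixture (illustrative sketch).
set.seed(1)
Women <- rnorm(80, mean = 168, sd = 6)
Men <- rnorm(120, mean = 182, sd = 7)
AllHeight <- c(Women, Men)

em_two_normals <- function(x, max_iter = 500, tol = 1e-8) {
  # Starting values (an arbitrary choice): split the data at the median.
  lower <- x <= median(x)
  p <- 0.5                                  # mixing proportion of component 1
  mu <- c(mean(x[lower]), mean(x[!lower]))
  sigma <- c(sd(x[lower]), sd(x[!lower]))
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: probability that each height belongs to component 1
    d1 <- p * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
    r <- d1 / (d1 + d2)
    # Mixture log-likelihood at the current parameters
    loglik <- sum(log(d1 + d2))
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik
    # M-step: weighted maximum likelihood updates
    p <- mean(r)
    mu <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
               sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
  }
  # loglik is the value computed at the last E-step
  list(p = p, mu = mu, sigma = sigma, loglik = loglik, iterations = iter)
}

fit <- em_two_normals(AllHeight)
print(fit$mu)     # estimated group means
print(fit$sigma)  # estimated group standard deviations
print(fit$p)      # estimated share of component 1
```

The estimates in `fit$mu` and `fit$sigma` can be compared with the simulation values (168 and 182, 6 and 7). As with the hard-assignment loop, which component ends up as "women" depends only on the starting values, so the labels could just as well come out swapped.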
Attribution
Source : Link , Question Author : john hamilton , Answer Author : kjetil b halvorsen