Finding category with maximum likelihood method

Let’s say we have height data for men and women.

R code:

set.seed(1) 
Women=rnorm(80, mean=168, sd=6) 
Men=rnorm(120, mean=182, sd=7) 
par(mfrow=c(2,1)) 
hist(Men, xlim=c(150, 210), col="skyblue") 
hist(Women, xlim=c(150, 210), col="pink")

Unfortunately, something happened and we lost the information about who is a woman and who is a man.

R code:

heights=c(Men, Women) 
par(mfrow=c(1,1)) 
hist(heights, col="gray70") 
rm(Women, Men) 

Could we somehow estimate the mean and standard deviation of women’s and men’s heights using the maximum likelihood method?

We know that men’s and women’s heights are normally distributed.

Answer

This is a classic unsupervised learning problem that has a simple maximum likelihood solution. The solution is a motivating example for the expectation maximization algorithm. The process is:

  1. Initialize the group assignment.
  2. Estimate the group-wise means and standard deviations.
  3. Calculate the likelihood of each observation under each group.
  4. Assign each observation to the group under which it is most likely.

Repeat steps 2-4 until convergence, i.e. until no observation changes group.
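
Written out, one pass of steps 2-4 might look as follows. This is a sketch in my own notation, not taken from the original answer: $x_i$ is the $i$-th height, $z_i \in \{F, M\}$ its current label, and $n_g$ the number of observations currently labelled $g$. The $n_g - 1$ divisor matches what R's `sd()` computes; the strict maximum likelihood estimate would divide by $n_g$.

```latex
% Step 2: group-wise estimates under the current labels
\hat\mu_g = \frac{1}{n_g} \sum_{i:\, z_i = g} x_i ,
\qquad
\hat\sigma_g^2 = \frac{1}{n_g - 1} \sum_{i:\, z_i = g} (x_i - \hat\mu_g)^2

% Step 3: normal log-likelihood of observation i under group g
\ell_{ig} = -\log \hat\sigma_g - \tfrac{1}{2}\log(2\pi)
            - \frac{(x_i - \hat\mu_g)^2}{2\hat\sigma_g^2}

% Step 4: reassign each observation to its most likely group
z_i \leftarrow \arg\max_{g \in \{F, M\}} \ell_{ig}
```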

For simplicity, I assume we know that 80 of the 200 people are women. Also note that unless we build in the assumption that women are shorter than men, a clustering algorithm has no way of knowing which group should get which label, so the cluster labels can come out reversed.
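
If the reversed labels are a nuisance, one option is to relabel after the fact so that the group with the smaller mean is always called 'F'. The helper below is a hypothetical illustration, not part of the original answer: the name `relabel_by_mean` and the toy data are mine, and it builds in the assumption that women are shorter on average.

```r
# Hypothetical helper: relabel two clusters so that the cluster with the
# smaller mean is called 'F' (assumes women are shorter on average).
relabel_by_mean <- function(x, labels) {
  mu <- tapply(x, labels, mean)        # group means, named by current label
  ordered_labels <- names(sort(mu))    # current labels, shortest group first
  lookup <- setNames(c('F', 'M'), ordered_labels)
  unname(lookup[labels])
}

# Toy data in which the labels came out reversed
x <- c(160, 165, 170, 180, 185, 190)
flipped <- c('M', 'M', 'M', 'F', 'F', 'F')
relabel_by_mean(x, flipped)
```

Labels that already follow the convention are returned unchanged, so the helper can be applied unconditionally once the clustering has converged.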

    set.seed(1) 
    Women=rnorm(80, mean=168, sd=6) 
    Men=rnorm(120, mean=182, sd=7) 
    AllHeight <- c(Women, Men)
    trueMF <- rep(c('F', 'M'), c(80, 120))
    
    ## case 1: assume women are shorter, so label the 
    ## 80 shortest people 'F'
    MF <- ifelse(rank(AllHeight) <= 80, 'F', 'M')
    
    ## case 2: try a random initial allocation instead
    # MF <- sample(trueMF, replace = FALSE)
    
    steps <- 0
    
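    repeat {
      steps <- steps + 1
      ## step 2: group-wise means and standard deviations
      mu <- tapply(AllHeight, MF, mean)
      sigma <- tapply(AllHeight, MF, sd)  # named sigma so sd() is not masked
      ## step 3: log-likelihood of every height under each group
      ## (200 x 2 matrix; columns follow tapply's order: F, M)
      logLik <- mapply(dnorm, x=list(AllHeight), mean=mu, sd=sigma, 
                       MoreArgs=list(log=TRUE))
      ## step 4: assign each height to the more likely group
      MFnew <- c('F', 'M')[apply(logLik, 1, which.max)]
      if (all(MF==MFnew)) break
      MF <- MFnew
    }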
    
    ## case 1: 
    ## 85% correct
    ## 2 steps
    ## Means
    ##        F        M 
    ## 168.7847 183.5424 
    
    ## case 2:
    ## 15% correct
    ## 7 steps
    ## Means
    ##        F        M 
    ## 183.5424 168.7847 
    ## what else?
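
The loop above makes hard assignments. Since the answer presents it as a motivating example for expectation maximization, here is a minimal sketch of the soft version for a two-component normal mixture, which also estimates the mixing proportion instead of assuming it. This is not the original answer's code: the starting values (a split at the median, equal weights), the tolerance, and the iteration cap are my own assumptions.

```r
set.seed(1)
Women <- rnorm(80, mean = 168, sd = 6)
Men   <- rnorm(120, mean = 182, sd = 7)
AllHeight <- c(Women, Men)

# Starting values (assumptions): split at the median, equal mixing weights
lower <- AllHeight <= median(AllHeight)
p     <- 0.5                        # mixing proportion of component 1
mu    <- c(mean(AllHeight[lower]), mean(AllHeight[!lower]))
sigma <- c(sd(AllHeight[lower]), sd(AllHeight[!lower]))

loglik_old <- -Inf
for (iter in 1:1000) {
  # E-step: posterior probability that each height belongs to component 1
  d1 <- p       * dnorm(AllHeight, mu[1], sigma[1])
  d2 <- (1 - p) * dnorm(AllHeight, mu[2], sigma[2])
  w  <- d1 / (d1 + d2)

  # Observed-data log-likelihood at the current parameters
  loglik <- sum(log(d1 + d2))
  if (loglik - loglik_old < 1e-8) break
  loglik_old <- loglik

  # M-step: weighted maximum likelihood updates
  p     <- mean(w)
  mu    <- c(weighted.mean(AllHeight, w), weighted.mean(AllHeight, 1 - w))
  sigma <- c(sqrt(sum(w * (AllHeight - mu[1])^2) / sum(w)),
             sqrt(sum((1 - w) * (AllHeight - mu[2])^2) / sum(1 - w)))
}

round(c(p = p, mu1 = mu[1], mu2 = mu[2],
        sigma1 = sigma[1], sigma2 = sigma[2]), 2)
```

Here each height contributes to both groups in proportion to `w` and `1 - w`, so observations in the overlap region are not forced into one group. As with the hard version, which component ends up as "women" depends on the starting values.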

Attribution
Source : Link , Question Author : john hamilton , Answer Author : kjetil b halvorsen
