Why is hierarchical softmax better for infrequent words, while negative sampling is better for frequent words?

I wonder why hierarchical softmax is better for infrequent words, while negative sampling is better for frequent words, in word2vec’s CBOW and skip-gram models. I have read this claim on https://code.google.com/p/word2vec/.

Answer

I’m not an expert in word2vec, but after reading Rong, X. (2014), word2vec Parameter Learning Explained, and drawing on my own NN experience, I’d summarize the reasoning as follows:

  • Hierarchical softmax improves training efficiency because the
    output is computed by a tree-like traversal of the output layer: a
    given training sample only has to evaluate/update $O(\log N)$
    network units instead of $O(N)$. This essentially spreads the
    output weights over a tree so a large vocabulary can be supported:
    a given word is tied to only a few output units, and vice versa
    (see the first sketch after this list).
  • Negative sampling is a way of sampling the training data, similar
    in spirit to stochastic gradient descent, but the key is that you
    also generate negative training examples. Intuitively, it trains on
    places where the model might have expected a word but didn’t find
    one, which is faster than evaluating the entire vocabulary at every
    step and makes sense for common words (see the second sketch after
    this list).
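
To make the $O(\log N)$ point concrete, here is a minimal Python sketch of hierarchical-softmax scoring next to a plain softmax, just to show how much of the output layer each one touches. This is not the word2vec C code: the tree path for the toy word and all array names are made up for illustration.

```python
import numpy as np

def hs_log_prob(hidden, word_path, inner_vectors):
    """Log-probability of one word: touches only the ~log2(N) inner
    nodes on its tree path, instead of all N output vectors."""
    logp = 0.0
    for node_id, direction in word_path:      # direction: +1 = left, -1 = right
        score = direction * np.dot(inner_vectors[node_id], hidden)
        logp += -np.log1p(np.exp(-score))     # log sigmoid of the signed score
    return logp

def full_softmax_log_prob(hidden, word_id, output_vectors):
    """For contrast: the plain softmax evaluates every output vector."""
    scores = output_vectors @ hidden          # O(N * dim) work per example
    scores -= scores.max()                    # numerical stability
    return scores[word_id] - np.log(np.exp(scores).sum())

# Toy usage: vocabulary of 8 words, so tree paths are only ~3 nodes deep.
rng = np.random.default_rng(0)
dim, vocab = 5, 8
hidden = rng.normal(size=dim)
inner = rng.normal(size=(vocab - 1, dim))     # one vector per inner tree node
path_for_word_3 = [(0, +1), (1, -1), (4, +1)] # hypothetical Huffman path
print(hs_log_prob(hidden, path_for_word_3, inner))
```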
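
And here is a comparable sketch of a single negative-sampling update. The unigram^(3/4) noise distribution follows the word2vec paper, but the function name, learning rate, and toy sizes are invented for the example; a real implementation would also loop over a corpus rather than a single pair.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(center_vec, output_vecs, context_id, noise_probs,
                           k=5, lr=0.025, rng=None):
    """Update only the observed context word and k sampled 'noise' words,
    rather than all N output vectors as a full softmax would."""
    if rng is None:
        rng = np.random.default_rng()
    negative_ids = rng.choice(len(noise_probs), size=k, p=noise_probs)
    grad_center = np.zeros_like(center_vec)
    pairs = [(context_id, 1.0)] + [(int(n), 0.0) for n in negative_ids]
    for word_id, label in pairs:
        pred = sigmoid(np.dot(output_vecs[word_id], center_vec))
        err = pred - label                    # gradient of the logistic loss
        grad_center += err * output_vecs[word_id]
        output_vecs[word_id] -= lr * err * center_vec
    center_vec -= lr * grad_center            # in-place update of the input vector

# Toy usage: negatives are drawn from a unigram^(3/4) distribution,
# so frequent words are picked as negatives more often.
counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0])
noise = counts ** 0.75
noise /= noise.sum()

rng = np.random.default_rng(1)
dim, vocab = 5, len(counts)
in_vecs = rng.normal(scale=0.1, size=(vocab, dim))
out_vecs = np.zeros((vocab, dim))
negative_sampling_step(in_vecs[0], out_vecs, context_id=2,
                       noise_probs=noise, k=3, rng=rng)
```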

The two methods don’t seem to be mutually exclusive in theory, but at any rate that seems to be why they’d work better for frequent and infrequent words, respectively.

Attribution
Source: Link, Question Author: Franck Dernoncourt, Answer Author: Antoine
