I wonder why hierarchical softmax is better for infrequent words, while negative sampling is better for frequent words, in word2vec’s CBOW and skip-gram models. I have read the claim on https://code.google.com/p/word2vec/.
I’m not an expert in word2vec, but upon reading Rong, X. (2014). word2vec Parameter Learning Explained and from my own NN experience, I’d simplify the reasoning to this:
- Hierarchical softmax improves training efficiency because the output
  is computed by traversing a binary tree over the vocabulary rather
  than a flat output layer; a given training sample only has to
  evaluate/update $O(\log N)$ network units instead of $O(N)$. This
  essentially spreads the weights out to support a large vocabulary – a
  given word is related to fewer neurons, and vice versa.
- Negative sampling is a way to sample the training objective, similar
  in spirit to stochastic gradient descent, but the key is that you also
  look for negative training examples. Intuitively, it trains based on
  sampling places where it might have expected a word but didn't find
  one, which is faster than scoring the entire vocabulary at every step
  and makes sense for common words.
The two methods don’t seem to be mutually exclusive, theoretically, but in any case that seems to be why they’d be better for infrequent and frequent words respectively.
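To make the cost difference concrete, here's a rough NumPy sketch of a single training step under each scheme. This is a toy illustration, not the actual word2vec implementation: the tree path and negative samples are drawn at random rather than from a real Huffman tree or the $\text{freq}^{3/4}$ noise distribution, and the vocabulary size, embedding dimension, and number of negatives are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 10_000, 100, 5              # vocab size, embedding dim, negative samples

h = rng.standard_normal(D)            # hidden (context) vector for one step
W_out = rng.standard_normal((V, D))   # output-side word vectors

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Full softmax: every step scores all V output vectors -> O(V) units.
scores = np.exp(W_out @ h)
p_full = scores / scores.sum()

# Hierarchical softmax: the target word is a leaf of a binary tree over
# the vocabulary; only the ~log2(V) inner nodes on its root-to-leaf path
# are evaluated/updated. (Here the path is faked with random node ids.)
path_len = int(np.ceil(np.log2(V)))               # ~14 for V = 10,000
path_nodes = rng.integers(0, V - 1, path_len)     # stand-in for the tree path
sides = rng.integers(0, 2, path_len)              # left/right turn at each node
p_hs = np.prod([sigmoid((1 - 2 * s) * (W_out[n] @ h))
                for n, s in zip(path_nodes, sides)])

# Negative sampling: the target word plus K sampled "noise" words -> O(K)
# units. word2vec draws negatives proportional to frequency^0.75.
target = 42                                       # arbitrary target word id
negatives = rng.integers(0, V, K)
loss = -np.log(sigmoid(W_out[target] @ h)) \
       - np.log(sigmoid(-W_out[negatives] @ h)).sum()

print(f"units touched: full={V}, hierarchical={path_len}, negative={K + 1}")
```

So per training example the flat softmax touches all 10,000 output vectors, hierarchical softmax about 14, and negative sampling just 6, which is where the speedup in both tricks comes from.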