I’m modeling 15000 tweets for sentiment prediction using a single-layer LSTM with 128 hidden units and a word2vec-like representation with 80 dimensions. I get a decent accuracy (38%, against a 20% chance level) after 1 epoch. With more training, the validation accuracy starts declining while the training accuracy keeps climbing, a clear sign of overfitting.
I’m therefore thinking of ways to regularize the model. I’d prefer not to reduce the number of hidden units (128 seems a bit low already). I currently use dropout with a probability of 50%, but this could perhaps be increased. The optimizer is Adam with the default parameters for Keras (http://keras.io/optimizers/#adam).
What are some effective ways of reducing overfitting for this model on my dataset?
You could try:
- Reduce the number of hidden units. I know you said it already seems low, but given that the input layer only has 80 features, 128 may actually be too many. A rule of thumb is to pick a number of hidden units between the number of input units (80) and the number of output classes (5);
- Alternatively, you could increase the dimension of the input representation space beyond 80 (however, this may overfit as well if the representation is already too narrow for any given word).
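To put rough numbers on the first point, here is a small sketch in plain Python. It assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias), and the helper names `lstm_param_count` and `sizes_between` are made up for this illustration. It only counts the weights in the LSTM layer itself, not the embedding or the output layer.

```python
def lstm_param_count(input_dim, hidden_units):
    """Trainable weights in one standard LSTM layer: 4 gates, each with
    input weights, recurrent weights and a bias vector."""
    return 4 * (hidden_units * (input_dim + hidden_units) + hidden_units)


def sizes_between(n_inputs, n_outputs, n_candidates=4):
    """Evenly spaced hidden sizes strictly between n_outputs and n_inputs,
    largest first (one way to read the rule of thumb, not the only one)."""
    step = (n_inputs - n_outputs) / (n_candidates + 1)
    return [round(n_inputs - step * i) for i in range(1, n_candidates + 1)]


# Figures taken from the question: 80-dim inputs, 5 classes, 15000 tweets.
INPUT_DIM, N_CLASSES, N_TWEETS = 80, 5, 15000

for units in [128] + sizes_between(INPUT_DIM, N_CLASSES):
    params = lstm_param_count(INPUT_DIM, units)
    print(f"{units:>4} units: {params:>7} LSTM weights, "
          f"{params / N_TWEETS:.1f} per tweet")
```

With 128 units the LSTM layer alone has 107008 weights, roughly 7 per training tweet, which gives some intuition for why it can memorize the training set so quickly.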
A good way to fit a network is to begin with an overfitting network and then reduce capacity (hidden units and embedding space) until it no longer overfits.
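That procedure could be sketched roughly like this. Note the assumptions: `train_and_evaluate` is a hypothetical callback you would write yourself (it trains a fresh model with the given number of hidden units and returns the training and validation accuracy), and the 0.05 gap used to decide "no longer overfits" is an arbitrary threshold picked for illustration, not a recommended value.

```python
def shrink_until_no_overfit(train_and_evaluate, hidden_sizes, max_gap=0.05):
    """Try hidden sizes from largest to smallest and return the first one
    whose train/validation accuracy gap is at most max_gap.

    train_and_evaluate(units) is a hypothetical, user-supplied callback that
    trains a fresh model and returns (train_accuracy, val_accuracy).
    Returns (units, train_accuracy, val_accuracy), or None if every
    candidate still overfits by this criterion.
    """
    for units in sorted(hidden_sizes, reverse=True):
        train_acc, val_acc = train_and_evaluate(units)
        if train_acc - val_acc <= max_gap:
            return units, train_acc, val_acc
    return None
```

You would call it with something like `shrink_until_no_overfit(my_training_function, [128, 65, 50, 35, 20])`. The same loop works for the embedding dimension if the callback takes that as its argument instead.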