Why does my LSTM take so much time to train?

I am trying to train a bidirectional LSTM on a sequential text-tagging task (specifically, automatic punctuation).

I use letters as the building blocks: I represent each input letter with a 50-dimensional embedding vector, fed into a single 100-dimensional hidden layer, fed into a 100-dimensional output layer, which is fed into an MLP.
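For scale, the recurrent part of a model like this is tiny. A back-of-envelope parameter count, assuming "100-dimensional hidden layer" means 100 units per direction (the post does not say whether it is per direction or total):

```python
# Rough parameter count for the described BiLSTM (hypothetical sizes:
# 100 hidden units per direction, 50-dim input embeddings).
emb_dim, hidden = 50, 100

# A standard LSTM has 4 gates, each with an input weight matrix,
# a recurrent weight matrix, and a bias vector.
per_direction = 4 * (hidden * emb_dim + hidden * hidden + hidden)
bilstm_total = 2 * per_direction

print(per_direction)  # 60400
print(bilstm_total)   # 120800
```

Roughly 120K recurrent parameters is a very small model by deep-learning standards, which is worth keeping in mind when judging the training times below.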

For training, I have a corpus with about 18 million letters – there are about 70,000 sentences with about 250 letters in each sentence.
(I use DyNet with Python 3 on an Ubuntu 16.04 system.)

The main problem is that training is awfully slow: each iteration of training takes about half a day. Since training usually takes about 100 iterations, it means I will have to wait over a month to get reasonable results.
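To put those numbers in perspective, the implied throughput is only a few hundred letters per second:

```python
# Sanity-check the reported speed using the numbers from the question.
letters = 70_000 * 250        # ~17.5M letters per pass over the corpus
epoch_seconds = 12 * 3600     # "half a day" per iteration
letters_per_sec = letters / epoch_seconds

print(letters)                 # 17500000
print(round(letters_per_sec))  # 405
print(100 * 0.5)               # 50.0 days for 100 iterations
```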

I asked some other people who do deep learning, and they told me, “deep learning is slow, you have to get used to it.” Still, waiting over a month for training seems horribly slow.

Are these times common for training LSTM models? If not, what am I doing wrong, and what can I do to speed up the training?


However much it pains me to say this: deep learning is slow, get used to it.

There are some things you can do to speed up your training, though:

  • What GPU are you using? A friend of mine was doing some research on LSTMs last year and training them on her NVIDIA GTX7?? GPU. Since this was going painfully slow, she tried to train the network on a more modern CPU, which actually led to a speed-up by a non-trivial factor.

  • What framework are you using? While most frameworks are somewhat comparable, there are benchmarks (https://arxiv.org/pdf/1608.07249.pdf) suggesting that some frameworks are noticeably slower than others. It might be worthwhile to switch frameworks if you’re going to be doing a lot of training.

  • Is it possible to train your network on your company/university hardware? Universities and research companies usually have some powerful hardware at their disposal. If this is not an option, maybe you can look into using some cloud-computing power.
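Since the question mentions DyNet specifically: DyNet reads a few runtime flags from the command line that can matter a lot for speed. A sketch, assuming a hypothetical training script `train.py` and a CUDA-enabled build of DyNet:

```shell
# Run on the GPU and enable automatic minibatching, which groups
# independent per-sentence computation graphs into batched operations.
python train.py --dynet-gpu --dynet-autobatch 1

# Pre-allocate more working memory (in MB) up front to avoid
# repeated reallocations during training.
python train.py --dynet-gpu --dynet-autobatch 1 --dynet-mem 2048
```

Autobatching in particular tends to help recurrent models like this one, where hand-writing batched code is awkward.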

All these solutions obviously assume your model itself is as optimal as it can be (in terms of training time and accuracy), which is also something you need to consider, but that is outside the scope of this answer.

Source: Link, Question Author: Erel Segal-Halevi, Answer Author: shmoo6000
