I have a question and would like to hear what the community has to say. Suppose you are training a deep neural network. The implementation details are not relevant to my question. I know very well that if you choose a learning rate that is too large, the cost function may become NaN (if, for example, you use the sigmoid activation function). Suppose I am using cross-entropy as the cost function, in a typical binary classification problem (or even multi-class with softmax). I also know why this happens. I often observe the following behaviour: my cost function decreases nicely, but after a certain number of epochs it becomes NaN. Reducing the learning rate makes this happen later (that is, after more epochs). Is this really because gradient descent, for example, cannot stabilize itself after getting very close to the minimum and starts bouncing around wildly? I thought that the algorithm would not converge exactly to the minimum but should oscillate around it, remaining more or less stable there… Thoughts?

**Answer**

Well, if you get NaN values in your cost function, it means that an input is outside the function's domain, e.g. the logarithm of 0. Or the input could be inside the domain analytically, but numerical errors cause the same problem (e.g. a small value gets rounded to 0).

It has nothing to do with an inability to “settle”.
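
To make the rounding case concrete, here is a minimal sketch in plain Python. It assumes a textbook sigmoid followed by binary cross-entropy, which may not match your actual setup. `log_ieee` is a hypothetical helper that mimics what array libraries typically do (return `-inf` for the logarithm of 0), because Python's own `math.log(0.0)` raises an error instead.

```python
import math

def sigmoid(z):
    # Textbook logistic function; saturates to exactly 1.0 for large z.
    return 1.0 / (1.0 + math.exp(-z))

def log_ieee(x):
    # Hypothetical helper: return -inf for log(0), as array libraries
    # typically do, instead of raising like math.log.
    return float("-inf") if x == 0.0 else math.log(x)

def naive_cross_entropy(y, z):
    # Binary cross-entropy for one example with label y and logit z.
    p = sigmoid(z)
    return -(y * log_ieee(p) + (1.0 - y) * log_ieee(1.0 - p))

# Analytically sigmoid(40) < 1, but exp(-40) is smaller than the float64
# spacing around 1.0, so the result is rounded to exactly 1.0.
saturated = sigmoid(40.0) == 1.0

loss_moderate = naive_cross_entropy(1.0, 5.0)    # finite and small
loss_saturated = naive_cross_entropy(1.0, 40.0)  # 0 * -inf gives nan

print(saturated, loss_moderate, math.isnan(loss_saturated))
```

Note that in this sketch the NaN shows up on a correctly classified example: the network has simply become confident enough that `1 - p` rounds to 0. That would be consistent with a cost that decreases nicely and then suddenly turns into NaN, although it is only one possible cause.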

So, you have to determine what the non-allowed input values for your given cost function are. Then, you have to determine why your cost function is receiving those inputs. You may have to change the scaling of the input data and the weight initialization. Or you may just need an adaptive learning rate, as suggested by Avis, as the cost function landscape may be quite chaotic. Or it could be because of something else, like numerical issues with some layer in your architecture.

It is very difficult to say with deep networks, but I suggest you start looking at the progression of the input values to your cost function (the output of your activation layer), and try to determine a cause.
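
One way to follow that progression is sketched below, assuming you can record the activation outputs that feed the cost function once per epoch. The function name `first_bad_epoch` and the sample values are made up for illustration; the bounds assume a sigmoid or softmax output, whose values must stay strictly between 0 and 1 for the logarithms in cross-entropy to be finite.

```python
import math

def first_bad_epoch(history, low=0.0, high=1.0):
    """Return (epoch, index, value) for the first recorded value that is
    non-finite or not strictly inside (low, high); None if all are fine."""
    for epoch, outputs in enumerate(history):
        for i, v in enumerate(outputs):
            if not math.isfinite(v) or v <= low or v >= high:
                return epoch, i, v
    return None

# Made-up sigmoid outputs over four epochs, drifting towards saturation.
history = [
    [0.62, 0.41, 0.55],
    [0.91, 0.12, 0.80],
    [0.999, 0.001, 0.97],
    [1.0, 0.0, 0.999],
]

report = first_bad_epoch(history)
print(report)
```

Once you know the epoch and the offending value, you can work backwards through the layers to see where the saturation or overflow first appears.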

**Attribution**

*Source: Link, Question Author: Umberto, Answer Author: andfor*