I apologize in advance for the fact that I’m still coming up to speed on this. I’m trying to understand the pros and cons of using tanh (output range -1 to 1) vs. sigmoid (output range 0 to 1) as my neuron activation function. From my reading, it sounded like a minor choice with marginal differences. In practice, for my problems, I find that the sigmoid is easier to train and, strangely, seems to find a general solution better. By this I mean that when the sigmoid version is done training, it does well on the reference data set (data not used in training), whereas the tanh version gets the correct answers on the training data while doing poorly on the reference set. This is for the same network architecture.
One intuition I have is that with the sigmoid, it’s easier for a neuron to almost fully turn off, thus providing no input to subsequent layers. The tanh has a harder time here since it needs to perfectly cancel its inputs, else it always gives a value to the next layer. Maybe this intuition is wrong though.
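To make that intuition concrete, here is a minimal sketch using only the Python standard library. The pre-activation values `z` are made up for illustration; they stand in for a neuron's weighted input sum plus bias. It shows that the sigmoid gets close to 0 for any sufficiently negative `z`, while tanh is 0 only at `z = 0` and saturates toward -1 instead. It also checks the identity tanh(z) = 2*sigmoid(2z) - 1, i.e. the two functions differ only by a shift and rescale.

```python
import math

def sigmoid(z):
    """Logistic sigmoid; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative pre-activation values (weighted input sum plus bias).
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={math.tanh(z):+.4f}")

# tanh is a shifted, rescaled sigmoid: tanh(z) = 2*sigmoid(2z) - 1.
z = 0.7
identity_gap = abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0))
print("identity gap:", identity_gap)
```

So "off" for a sigmoid neuron is a whole region of strongly negative inputs, whereas for tanh the output passes through 0 at a single point.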
Long post. Bottom line: what’s the trade-off, and should it make a big difference?
Simon Haykin’s book “Neural Networks: A Comprehensive Foundation” gives the following explanation, which I quote:
For the learning time to be minimized, the use of non-zero mean inputs should be avoided. Now, insofar as the signal vector x applied to a neuron in the first hidden layer of a multilayer perceptron is concerned, it is easy to remove the mean from each element of x before its application to the network. But what about the signals applied to the neurons in the remaining hidden and output layers of the network? The answer to this question lies in the type of activation function used in the network. If the activation function is non-symmetric, as in the case of the sigmoid function, the output of each neuron is restricted to the interval [0,1]. Such a choice introduces a source of systematic bias for those neurons located beyond the first layer of the network. To overcome this problem we need to use an antisymmetric activation function such as the hyperbolic tangent function. With this latter choice, the output of each neuron is permitted to assume both positive and negative values in the interval [−1,1], in which case it is likely for its mean to be zero. If the network connectivity is large, back-propagation learning with antisymmetric activation functions can yield faster convergence than a similar process with non-symmetric activation functions, for which there is also empirical evidence (LeCun et al. 1991).
The cited reference is:
- Y. LeCun, I. Kanter, and S. A. Solla: “Second-order properties of error surfaces: learning time and generalization”, Advances in Neural Information Processing Systems, vol. 3, pp. 918-924, 1991.
Another interesting reference is the following:
- Y. LeCun, L. Bottou, G. Orr, and K. Muller: “Efficient BackProp”, in G. Orr and K. Muller (Eds.), Neural Networks: Tricks of the Trade, Springer, 1998.
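The zero-mean argument in the Haykin quote can be checked numerically. The following is only a rough sketch under stated assumptions: the pre-activations are modeled as zero-mean, unit-variance Gaussian samples (a made-up stand-in for a layer that receives mean-removed inputs), and the sample size and seed are arbitrary. Under those assumptions, the sigmoid outputs average close to 0.5, so the next layer sees a systematic positive offset, while the tanh outputs average close to 0.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for reproducibility

def sigmoid(z):
    """Logistic sigmoid; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed pre-activations: zero-mean, unit-variance Gaussian samples.
zs = [random.gauss(0.0, 1.0) for _ in range(10000)]

# Mean output that the next layer would receive from each activation.
mean_sigmoid = sum(sigmoid(z) for z in zs) / len(zs)
mean_tanh = sum(math.tanh(z) for z in zs) / len(zs)

print(f"mean sigmoid output: {mean_sigmoid:.3f}")
print(f"mean tanh output:    {mean_tanh:+.3f}")
```

This only illustrates the bias in the layer outputs. It does not by itself show which activation trains faster or generalizes better on a given problem.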