How deep is the connection between the softmax function in ML and the Boltzmann distribution in thermodynamics?

The softmax function, commonly used in neural networks to convert real numbers into probabilities, is the same function as the Boltzmann distribution: the probability distribution over energy states for an ensemble of particles in thermal equilibrium at a given temperature T.
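To make the correspondence explicit, here is a sketch of the two formulas side by side (writing $x_i$ for the network's inputs/logits and $E_i$ for the energy of state $i$; setting $k_B = 1$ is a unit convention, not part of either original definition):

$$\text{softmax:}\quad p_i = \frac{e^{x_i/T}}{\sum_j e^{x_j/T}} \qquad\qquad \text{Boltzmann:}\quad p_i = \frac{e^{-E_i/(k_B T)}}{Z},\quad Z = \sum_j e^{-E_j/(k_B T)}$$

The two are term-by-term identical under the identification $x_i = -E_i$ (and $k_B = 1$): large logits correspond to low energies, and the softmax normalizer plays the role of the partition function $Z$.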

I can see some clear heuristic reasons why this choice is practical:

  • Even when some inputs are negative, softmax outputs positive values that sum to one.
  • It’s always differentiable, which is handy for backpropagation.
  • It has a ‘temperature’ parameter controlling how lenient the network is toward small values (when T is very large, all outcomes become equally likely; when T is very small, essentially only the largest input is selected).
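The temperature behaviour described in the last bullet is easy to verify numerically. A minimal sketch (the function name and example logits are my own choices, not from the question):

```python
import numpy as np

def softmax(x, T=1.0):
    """Softmax with temperature T, i.e. the Boltzmann weights exp(x_i/T) / Z."""
    # Subtract the max before exponentiating for numerical stability;
    # this shifts all logits equally and does not change the result.
    z = (np.asarray(x, dtype=float) - np.max(x)) / T
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 5.0])

# High temperature: the distribution approaches uniform.
print(softmax(logits, T=100.0))

# Low temperature: nearly all probability mass goes to the largest input.
print(softmax(logits, T=0.01))
```

At T = 1 this reduces to the ordinary softmax used in classification layers; the stability trick of subtracting the maximum is standard practice and corresponds to shifting the energy zero point in the thermodynamic picture.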

Is the Boltzmann function only used as softmax for practical reasons, or is there a deeper connection to thermodynamics/statistical physics?


To my knowledge there is no deeper reason, apart from the fact that a lot of the people who took ANNs beyond the Perceptron stage were physicists.

Apart from the benefits already mentioned, this particular choice has a further advantage: it has a single parameter that determines the output behaviour, which can itself be optimized or tuned.

In short, it is a very handy and well-known function that achieves a kind of ‘regularization’, in the sense that even the largest input values are mapped to bounded outputs.

Of course there are many other possible functions that fulfill the same requirements, but they are less well known in the world of physics. And most of the time, they are harder to use.

Source: Link, Question Author: bjarkemoensted, Answer Author: cherub