What does “degree of freedom” mean in neural networks?

In Bishop’s book “Pattern Recognition and Machine Learning”, there is a section describing a regularization technique in the context of neural networks. However, I don’t understand a paragraph which says that, during the training process, the number of degrees of freedom increases along with the model complexity. The relevant quote is the following:

An alternative to regularization as a way of controlling the effective
complexity of a network is the procedure of early stopping. The
training of nonlinear network models corresponds to an iterative
reduction of the error function defined with respect to a set of
training data. For many of the optimization algorithms used for
network training, such as conjugate gradients, the error is a
nonincreasing function of the iteration index. However, the error
measured with respect to independent data, generally called a
validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be
stopped at the point of smallest error with respect to the validation
data set, as indicated in Figure 5.12, in order to obtain a network
having good generalization performance. The behaviour of the network
in this case is sometimes explained qualitatively in terms of the
effective number of degrees of freedom in the network, in which this
number starts out small and then grows during the training process,
corresponding to a steady increase in the effective complexity of the
model.
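The early-stopping procedure the quote describes can be sketched in a few lines. This is a minimal illustration, not Bishop's code: the `early_stop` helper and the synthetic validation curve are invented for the example, standing in for a real training loop.

```python
def early_stop(val_errors, patience=3):
    """Return the index of the best iteration, stopping once the
    validation error has failed to improve for `patience` steps."""
    best_i, best_err, since_best = 0, float("inf"), 0
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_i, best_err, since_best = i, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_i

# Synthetic validation curve: decreases at first, then rises as
# the (imaginary) network starts to over-fit.
curve = [1.0, 0.7, 0.5, 0.45, 0.48, 0.55, 0.65, 0.8]
print(early_stop(curve))  # prints 3, the iteration with smallest validation error
```

In practice one would checkpoint the weights at each improvement and restore the checkpoint from the best iteration once training is stopped.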

It also says that the number of parameters grows during the course of training. I was assuming that by “parameters” it refers to the number of weights in the network’s hidden units. Maybe I’m wrong, because the weights are prevented from increasing in magnitude by the regularization process, but they don’t change in number. Could it be referring to the process of finding a good number of hidden units?

What’s a degree of freedom in a neural network? What parameters increase during training?

Answer

I suspect this is what Bishop means:

If you think of a neural net as a function that maps inputs to an output, then when you first initialize a neural net with small random weights, it looks a lot like a linear function. The sigmoid activation function is close to linear around zero (just do a Taylor expansion), and small incoming weights guarantee that the effective domain of each hidden unit is just a small interval around zero. So the entire neural net, regardless of how many layers it has, will look very much like a linear function, and you can heuristically describe it as having a small number of degrees of freedom (equal to the dimension of the input).

As you train the neural net, the weights can become arbitrarily large, and the net can better approximate arbitrary non-linear functions. So as training progresses, you can heuristically describe that change as an increase in the number of degrees of freedom, or, more specifically, an increase in the size of the class of functions that the neural net can closely approximate.
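To make the near-linearity concrete, here is a small sketch (plain Python, no ML library) comparing a sigmoid unit to its first-order Taylor expansion around zero, σ(x) ≈ 1/2 + x/4. The specific weight values are just illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_approx(x):
    # First-order Taylor expansion of the sigmoid around 0:
    # sigma(0) = 1/2, sigma'(0) = 1/4, so sigma(x) ~ 1/2 + x/4.
    return 0.5 + x / 4.0

x = 1.0

# With a small incoming weight the unit stays in its near-linear
# region and the linear approximation is extremely accurate.
small_w = 0.1
print(abs(sigmoid(small_w * x) - linear_approx(small_w * x)))  # about 2e-5

# With a large weight the unit is pushed into its saturating,
# strongly non-linear region and the linear fit breaks down.
large_w = 5.0
print(abs(sigmoid(large_w * x) - linear_approx(large_w * x)))  # about 0.76
```

This mirrors the argument above: while the weights are small, every hidden unit behaves like an affine function, so the whole network does too; as training grows the weights, the units reach their non-linear regime and the class of functions the net can represent expands.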

Attribution
Source: Link, Question Author: Robert Smith, Answer Author: Marc Shivers