In Bishop’s book *Pattern Recognition and Machine Learning*, he describes a technique for regularization in the context of neural networks. However, I don’t understand a paragraph claiming that, during the training process, the effective number of degrees of freedom increases along with the model complexity. The relevant quote is the following:

> An alternative to regularization as a way of controlling the effective complexity of a network is the procedure of early stopping. The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data. For many of the optimization algorithms used for network training, such as conjugate gradients, the error is a nonincreasing function of the iteration index. However, the error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set, as indicated in Figure 5.12, in order to obtain a network having good generalization performance. The behaviour of the network in this case is sometimes explained qualitatively in terms of the effective number of degrees of freedom in the network, in which this number starts out small and then grows during the training process, corresponding to a steady increase in the effective complexity of the model.

It also says that the number of parameters grows during the course of training. I was assuming that by “parameters”, it refers to the number of weights controlled by the network’s hidden units. Maybe I’m wrong, because the weights are prevented from increasing in magnitude by the regularization process, but they don’t change in number. Could it be referring to the process of finding a good number of hidden units?
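For reference, the early-stopping procedure the quote describes can be sketched in a few lines. This is my own toy illustration, not Bishop's setup: an over-parameterized polynomial regression trained by gradient descent, with invented data, learning rate, and patience values, where we simply keep the weights that achieved the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy sine curve, split into training and validation sets.
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=60)
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

# Over-parameterized polynomial features (degree 12) invite over-fitting.
def features(x, degree=12):
    return np.vander(x, degree + 1, increasing=True)

A_tr, A_va = features(x_tr), features(x_va)
w = np.zeros(A_tr.shape[1])

best_w, best_err, patience = w.copy(), np.inf, 0
for step in range(20000):
    # One gradient-descent step on the training MSE.
    grad = 2.0 * A_tr.T @ (A_tr @ w - y_tr) / len(y_tr)
    w -= 0.05 * grad
    # Monitor the error on the independent validation set.
    val_err = np.mean((A_va @ w - y_va) ** 2)
    if val_err < best_err:
        best_err, best_w, patience = val_err, w.copy(), 0
    else:
        patience += 1
        if patience >= 500:  # validation error stopped improving: stop early
            break

print(f"best validation MSE {best_err:.3f} vs final {val_err:.3f}")
```

The key point is that training minimizes the *training* error, but the weights we keep (`best_w`) are chosen by the *validation* error, which is exactly the stopping criterion the quote describes.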

What’s a degree of freedom in a neural network? What parameters increase during training?

**Answer**

I suspect this is what Bishop means:

If you think of a neural net as a function that maps inputs to an output, then when you first initialize a neural net with small random weights, the neural net looks a lot like a linear function. The sigmoid activation function is close to linear around zero (just do a Taylor expansion), and small incoming weights will guarantee that the effective domain of each hidden unit is just a small interval around zero, so the entire neural net, regardless of how many layers you have, will look very much like a linear function. So you can heuristically describe the neural net as having a small number of degrees of freedom (equal to the dimension of the input).

As you train the neural net, the weights can become arbitrarily large, and the neural net can better approximate arbitrary non-linear functions. So as training progresses, you can heuristically describe that change as an increase in the number of degrees of freedom or, more specifically, an increase in the size of the class of functions that the neural net can closely approximate.
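This near-linearity argument is easy to check numerically. Below is a minimal sketch of my own (not from the answer) using a one-hidden-layer network with tanh units, a zero-centered sigmoid: the same random weights are used at two scales, and we measure what fraction of the network's output variance a straight line fails to explain. With a tiny weight scale the network is essentially linear; with a large scale it is not.

```python
import numpy as np

rng = np.random.default_rng(1)

# One hidden layer of 50 tanh units; only the weight *scale* varies.
def net(x, scale, W1, b1, W2):
    h = np.tanh(scale * (x @ W1 + b1))  # hidden activations
    return h @ W2                        # linear output layer

W1 = rng.normal(size=(1, 50))
b1 = rng.normal(size=50)
W2 = rng.normal(size=(50, 1))

x = np.linspace(-1, 1, 200).reshape(-1, 1)

nonlin = {}
for scale in (0.01, 10.0):
    y = net(x, scale, W1, b1, W2).ravel()
    # Best straight-line fit to the network's input-output map.
    coef = np.polyfit(x.ravel(), y, 1)
    resid = y - np.polyval(coef, x.ravel())
    # Fraction of output variance NOT explained by the line.
    nonlin[scale] = np.var(resid) / np.var(y)
    print(f"scale={scale}: non-linear fraction = {nonlin[scale]:.6f}")
```

With `scale=0.01` every pre-activation stays in the near-linear region of tanh, so the line fit is almost perfect; with `scale=10.0` the units saturate at different inputs and the residual fraction is orders of magnitude larger, which is the "growing effective degrees of freedom" picture in miniature.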

**Attribution**

*Source: Link, Question Author: Robert Smith, Answer Author: Marc Shivers*