What is the connection between regularization and the method of Lagrange multipliers?

To prevent overfitting, people add a regularization term (proportional to the squared sum of the parameters of the model) with a regularization parameter $\lambda$ to the cost function of linear regression. Is this parameter $\lambda$ the same as a Lagrange multiplier? So is regularization the same as the method of Lagrange multipliers? Or how are these methods connected?

Say we are optimizing a model with parameters $$\vec{\theta}$$ by minimizing some criterion $$f(\vec{\theta})$$ subject to a constraint on the magnitude of the parameter vector (for instance, to implement a structural risk minimization approach by constructing a nested set of models of increasing complexity). We would then need to solve:

$$\min_{\vec{\theta}} f(\vec{\theta}) \quad \mathrm{s.t.} \quad \|\vec{\theta}\|^2 < C$$

The Lagrangian for this problem is (caveat: I think, it's been a long day… 😉):

$$\Lambda(\vec{\theta},\lambda) = f(\vec{\theta}) + \lambda\|\vec{\theta}\|^2 - \lambda C.$$

So it can easily be seen that a regularized cost function is closely related to a constrained optimization problem, with the regularization parameter $$\lambda$$ related to the constant governing the constraint ($$C$$); $$\lambda$$ is essentially the Lagrange multiplier. The $$-\lambda C$$ term is just an additive constant, so omitting it doesn't change the solution of the optimization problem, only the value of the objective function.
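To make this concrete, here is a small numerical sketch for ridge regression, where $$f(\vec{\theta}) = \|y - X\vec{\theta}\|^2$$. The data, the value of `lam`, and the variable names are all illustrative assumptions, not anything from the question. The point is that the closed-form ridge solution makes the gradient of the penalized objective vanish, i.e. it is a stationary point of the Lagrangian above (the $$-\lambda C$$ term contributes nothing to the gradient in $$\vec{\theta}$$):

```python
import numpy as np

# Hypothetical toy regression data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 0.7  # the regularization parameter / Lagrange multiplier

# Ridge regression minimizes ||y - X theta||^2 + lam * ||theta||^2.
# Closed-form minimizer: theta = (X^T X + lam I)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Gradient of the penalized objective at theta; it should be ~0,
# confirming theta is a stationary point of the Lagrangian
# (the constant -lam*C has zero gradient in theta, so it drops out).
grad = 2 * X.T @ (X @ theta - y) + 2 * lam * theta
print(np.max(np.abs(grad)))  # ~0 up to floating-point error
```

The corresponding constraint radius is simply $$C = \|\vec{\theta}(\lambda)\|^2$$: the penalized solution solves the constrained problem for that particular $$C$$.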

This illustrates why, e.g., ridge regression implements structural risk minimization: regularization is equivalent to putting a constraint on the magnitude of the weight vector, and if $$C_1 > C_2$$, then every model that can be made while obeying the constraint that

$$\|\vec{\theta}\|^2 < C_2$$

will also be available under the constraint

$$\|\vec{\theta}\|^2 < C_1$$.

Hence reducing $$\lambda$$ generates a sequence of hypothesis spaces of increasing complexity.
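This nesting can be checked numerically. In the sketch below (same illustrative ridge setup as above; data and names are assumptions), the squared norm of the ridge solution, which plays the role of the constraint radius $$C$$, grows as $$\lambda$$ shrinks, so each smaller $$\lambda$$ admits every model allowed at a larger $$\lambda$$ plus more:

```python
import numpy as np

# Hypothetical toy data (purely illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge_norm_sq(lam):
    """Squared norm of the ridge solution, i.e. the implied constraint C."""
    theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    return float(theta @ theta)

# Decreasing lambda -> growing C -> nested hypothesis spaces of
# increasing complexity.
norms = [ridge_norm_sq(lam) for lam in (10.0, 1.0, 0.1, 0.01)]
print(norms)  # monotonically increasing
```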