Modelling with more variables than data points

I’m fairly new to Machine Learning/Modelling and I’d like some background to this problem.
I have a dataset where the number of observations is $n<200$ however the number of variables is $p\sim 8000$.
Firstly does it even make sense to consider building a model on a dataset like this or should one consider a variable selection technique to start with such as ridge regression or Lasso? I’ve read that this situation can lead to over-fitting. Is that the case for all ML techniques or do some techniques handle this better than others? Without too much maths a simple explanation on why the maths start to breakdown for $p>n$ would be appreciated.


It’s certainly possible to fit good models when there are more variables than data points, but this must be done with care.

When there are more variables than data points, the problem may not have a unique solution unless it’s further constrained. That is, there may be multiple (perhaps infinitely many) solutions that fit the data equally well. Such a problem is called ‘ill-posed’ or ‘underdetermined’. For example, when there are more variables than data points, standard least squares regression has infinitely many solutions that achieve zero error on the training data.

Such a model would certainly overfit because it’s ‘too flexible’ for the amount of training data. As model flexibility increases (e.g. more variables in a regression model) and the amount of training data shrinks, it becomes increasingly likely that the model will be able to achieve a low error by fitting random fluctuations in the training data that don’t represent the true, underlying distribution. Performance will therefore be poor when the model is run on future data drawn from the same distribution.

The problems of ill-posedness and overfitting can both be addressed by imposing constraints. This can take the form of explicit constraints on the parameters, a penalty/regularization term, or a Bayesian prior. Training then becomes a tradeoff between fitting the data well and satisfying the constraints. You mentioned two examples of this strategy for regression problems: 1) LASSO constrains or penalizes the $\ell_1$ norm of the weights, which is equivalent to imposing a Laplacian prior. 2) Ridge regression constrains or penalizes the $\ell_2$ norm of the weights, which is equivalent to imposing a Gaussian prior.

Constraints can yield a unique solution, which is desirable when we want to interpret the model to learn something about the process that generated the data. They can also yield better predictive performance by limiting the model’s flexibility, thereby reducing the tendency to overfit.

However, simply imposing constraints or guaranteeing that a unique solution exists doesn’t imply that the resulting solution will be good. Constraints will only produce good solutions when they’re actually suited to the problem.

A couple miscellaneous points:

  • The existence of multiple solutions isn’t necessarily problematic. For example, neural nets can have many possible solutions that are distinct from each other but near equally good.
  • The existence of more variables than data points, the existence of multiple solutions, and overfitting often coincide. But, these are distinct concepts; each can occur without the others.

Source : Link , Question Author : PaulB. , Answer Author : user20160

Leave a Comment