A bit more info. Suppose that
- you know beforehand how many variables to select, and you set the complexity penalty in the LARS procedure so that exactly that many variables have nonzero coefficients,
- computational cost is not an issue (the total number of variables is small, say 50),
- all the variables (y, x) are continuous.
In what setting would the LARS model (i.e., the OLS fit of the variables with nonzero coefficients in the LARS fit) differ most from a model with the same number of coefficients found through exhaustive search (à la regsubsets())?
Edit: I’m using 50 variables and 250 observations, with the true coefficients drawn from a standard Gaussian except for 10 variables whose true coefficients are 0 (and with all the features strongly correlated with one another). These settings are evidently not good, as the differences between the two sets of selected variables are minute. This is really a question about what type of data configuration one should simulate to get the largest differences.
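For concreteness, here is a sketch of the comparison I'm running, in Python with scikit-learn standing in for the R setup (the design, the correlation level rho, and the reduced dimensions are my choices; p is kept small here only so the exhaustive search stays cheap, unlike the 50-variable case):

```python
import itertools
import numpy as np
from sklearn.linear_model import Lars, LinearRegression

rng = np.random.default_rng(0)
n, p, k = 250, 8, 3  # small p so all C(p, k) subsets can be enumerated

# equicorrelated design: every pair of features has correlation rho
rho = 0.6
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# true model: k nonzero Gaussian coefficients, the rest exactly 0
beta = np.zeros(p)
beta[:k] = rng.standard_normal(k)
y = X @ beta + rng.standard_normal(n)

# LARS stopped once k variables have entered the active set
lars = Lars(n_nonzero_coefs=k).fit(X, y)
lars_set = frozenset(np.flatnonzero(lars.coef_))

# exhaustive search over all size-k subsets (regsubsets()-style):
# pick the subset whose OLS fit has the smallest residual sum of squares
def best_subset(X, y, k):
    best, best_rss = None, np.inf
    for S in itertools.combinations(range(X.shape[1]), k):
        fit = LinearRegression().fit(X[:, S], y)
        rss = np.sum((y - fit.predict(X[:, S])) ** 2)
        if rss < best_rss:
            best, best_rss = frozenset(S), rss
    return best

exh_set = best_subset(X, y, k)
print("LARS subset:      ", sorted(lars_set))
print("Exhaustive subset:", sorted(exh_set))
```

With a strong signal like this the two subsets usually agree, which matches what I'm seeing; the question is what configuration makes them disagree.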
Here is a description of the LARS algorithm: http://www-stat.stanford.edu/~tibs/lasso/simple.html It does not explicitly account for the correlations among the regressors, so I would venture to guess that it might miss the best-fitting subset in the case of multicollinearity.
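One way to probe that guess is to build a design where a variable has essentially zero marginal correlation with y but matters jointly (a suppressor-style setup: two nearly collinear predictors whose difference drives y), and inspect the order in which LARS brings variables into the active set. A sketch, using scikit-learn's lars_path (the specific construction and noise levels are my own, not from the link above):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
n = 250

# x1 and x2 are nearly collinear; y depends on their *difference*,
# so x1's marginal correlation with y is close to zero even though
# the pair (x1, x2) fits y well jointly
x1 = rng.standard_normal(n)
x2 = x1 + 0.3 * rng.standard_normal(n)
noise_vars = rng.standard_normal((n, 4))  # irrelevant predictors
X = np.column_stack([x1, x2, noise_vars])
y = 3.0 * (x1 - x2) + rng.standard_normal(n)

# active lists the variable indices in their order of entry
alphas, active, coefs = lars_path(X, y, method="lar")
print("order of entry into the LARS active set:", active)
```

Since LARS re-computes correlations against the evolving residual at every step, it can still pick up x1 after x2 has entered; whether the first k entries match the exhaustive size-k winner in such configurations is exactly what I'd like to understand.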