I’m currently working on asymptotic properties of penalized regression. I’ve read a myriad of papers by now, but there is an essential issue that I cannot get my head around.

To keep things simple, I’m going to look at the minimization of

−l(β,X,Y)+nλp(β)

for some reasonable penalty function p (with l the log-likelihood). Theorems on the asymptotic properties of the resulting estimator typically impose a requirement on λ, or more precisely two requirements: an upper and a lower bound on its behaviour for large n (e.g. λ→0 and √nλ→∞ as n→∞). This requirement shows up in papers by Fan and Li (SCAD), Zou (Adaptive Lasso) and some others.

My issue with this is that it is never specified how to impose such bounds. In practice, you have a single dataset and try to find the best possible value for the tuning parameter λ, but of course in this case the sample size doesn’t change and is definitely not approaching infinity.
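To make the two bounds concrete, here is a worked example under an assumed power-law form for the sequence. The constants c and γ are illustrative and do not come from any of the cited papers:

```latex
\lambda_n = c\,n^{-\gamma}, \quad c > 0,\ 0 < \gamma < \tfrac{1}{2}
\;\Longrightarrow\;
\lambda_n \to 0
\quad\text{and}\quad
\sqrt{n}\,\lambda_n = c\,n^{1/2-\gamma} \to \infty .
```

At the boundaries one condition fails: γ = 1/2 gives √nλ = c for every n, so the lower bound is violated, and γ = 0 gives λ = c, so the upper bound is violated.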

My guess is that it means that your method for selecting the best value of λ (e.g. cross-validation, AIC, BIC or similar) should be such that the limiting behaviour is as required, but no one ever proves this, or at least I have not been able to find a proof.

So, in short: can any of you explain to me how to work with these requirements for λ, point me to papers, books, …, or suggest a simulation experiment or anything else that makes these issues clear? I’m hoping to prove similar asymptotic properties in settings beyond maximum likelihood, but then I need to understand the state of the art to its fullest.

EDIT:

Reading, re-reading and re-re-reading some of the papers, I finally realised that the asymptotic properties (of interest to me: consistent model selection and, by extension, the oracle properties) perhaps *do not* require the tuning parameter selection to uphold the limiting behaviour on the parameter itself. The theorems typically show that any sequence of λ values that satisfies the limiting conditions will result in an estimator with the properties of interest.

As such, I just have to pick the λ that performs best, and “virtually promise” that if I were to redo the analysis on a larger/smaller dataset, I would scale that λ accordingly.
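A small simulation can illustrate what "any sequence satisfying the conditions" buys you. This is only a sketch under strong simplifying assumptions, not the setting of any particular paper: a single mean parameter with y_i ~ N(β, 1), so the penalized estimator reduces to the SCAD thresholding rule of Fan and Li applied to the sample mean, with their suggested a = 3.7. The function names, the two sequences n^(-1/4) and n^(-1/2), and the value β = 0.5 are all hypothetical choices for illustration.

```python
import math
import random


def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule (Fan and Li) for a single coefficient."""
    az = abs(z)
    if az <= 2 * lam:
        # Soft-thresholding region; gives exactly 0 when |z| <= lam.
        return math.copysign(max(az - lam, 0.0), z)
    if az <= a * lam:
        return ((a - 1) * z - math.copysign(a * lam, z)) / (a - 2)
    return z  # large effects are left unshrunk


def selection_rate(beta, n, lam, reps=2000, seed=0):
    """Fraction of replications where (beta_hat == 0) agrees with (beta == 0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # The mean of n draws from N(beta, 1) is N(beta, 1/n); sample it directly.
        ybar = rng.gauss(beta, 1 / math.sqrt(n))
        beta_hat = scad_threshold(ybar, lam)
        hits += (beta_hat == 0.0) == (beta == 0.0)
    return hits / reps


for n in (100, 10_000, 1_000_000):
    good = n ** -0.25  # lambda -> 0 and sqrt(n) * lambda -> infinity
    bad = n ** -0.5    # sqrt(n) * lambda stays at 1: lower bound violated
    print(
        n,
        selection_rate(0.0, n, good),  # true zero, valid sequence
        selection_rate(0.5, n, good),  # true nonzero, valid sequence
        selection_rate(0.0, n, bad),   # true zero, invalid sequence
    )
```

In this toy model the estimate is zero exactly when |ȳ| ≤ λ. With λ = n^(-1/2) the probability of correctly returning zero is P(|Z| ≤ 1) ≈ 0.68 for every n, so it never approaches one, whereas with λ = n^(-1/4) it is P(|Z| ≤ n^(1/4)), which does.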

If this is correct, this only leaves me with my classical problem around cross-validation: here, the effectiveness of the model is evaluated on (e.g.) 9/10 of the data. Even if I scale λ the right way, what guarantees that whichever criterion I’m using scales along with it? This appears to be less of a problem with other methods of choosing the tuning parameter. Can anybody shed some light on this (I’m still trying to get my head around @Stefan Wager’s comment, so maybe it’s in there already)?
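To put a number on the 9/10 issue, suppose (purely as an assumption) that the sequence has the power-law form λ_n = c·n^(-γ). The hypothetical helper below computes how much larger λ is at the fold size (k-1)n/k than at the full size n:

```python
def fold_rescale_factor(k, gamma):
    """Ratio lambda_m / lambda_n for m = (k - 1) * n / k, assuming lambda_n = c * n**(-gamma)."""
    return (k / (k - 1)) ** gamma


# 10-fold cross-validation with the illustrative exponent gamma = 1/4.
print(fold_rescale_factor(10, 0.25))
```

The factor does not depend on n, and multiplying a sequence by a constant leaves both λ→0 and √nλ→∞ intact. So under this assumption the difference in sample size alone does not break the rate conditions. Whether the λ that the criterion actually selects follows such a sequence is a separate question that this sketch does not address.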

**Answer**

**Attribution** *Source: Link, Question Author: Nick Sabbe, Answer Author: Community*