High-dimensional, correlated data and top features / covariates discovered; multiple hypothesis testing?

I have a dataset with about 5,000 often correlated features / covariates and a binary response. The data was given to me, I didn’t collect it. I use Lasso and gradient boosting to build models. I use iterated, nested cross-validation. I report Lasso’s largest (absolute) 40 coefficients and the 40 most important features in the gradient boosted trees (there was nothing special about 40; it just seemed to be a reasonable amount of information). I also report the variance of these quantities over the folds and iterations of CV.
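As a rough sketch of the reporting scheme described above, the following uses repeated cross-validation, an L1-penalized logistic regression (the Lasso for a binary response) with scaling inside each fold, and the mean and variance of each coefficient across the CV fits. The dataset, penalty strength, and fold counts are illustrative placeholders, not values from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the real data (5,000 correlated features).
X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=10, random_state=0)

coefs = []
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for train_idx, _ in cv.split(X, y):
    model = make_pipeline(
        StandardScaler(),  # scale before applying the L1 penalty
        LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
    )
    model.fit(X[train_idx], y[train_idx])
    coefs.append(model[-1].coef_.ravel())

coefs = np.array(coefs)          # shape: (n_fits, n_features)
mean_coef = coefs.mean(axis=0)
var_coef = coefs.var(axis=0)

# Report the 40 largest coefficients by mean absolute value,
# together with their variance across the CV fits.
top = np.argsort(-np.abs(mean_coef))[:40]
for j in top[:5]:
    print(f"feature {j}: mean={mean_coef[j]:+.3f}, var={var_coef[j]:.4f}")
```

The same loop can collect `feature_importances_` from a gradient boosted model in place of the Lasso coefficients; the variance across fits is what conveys how stable the "top 40" list actually is.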

I kind of muse over the “important” features, making no statements about p-values or causality or anything, but instead considering this process a kind of—albeit imperfect and sort of random—insight into some phenomenon.

Assuming I have done all this correctly (e.g., executed cross-validation correctly, scaled the features for the Lasso), is this approach reasonable? Are there issues with, e.g., multiple hypothesis testing, post hoc analysis, false discovery? Or other problems?


Predict the probability of an adverse event

  • Foremost, estimate the probability accurately.

  • More minor: as a sanity check, but also to perhaps reveal some novel predictors that could be investigated further, inspect the coefficients and importances as mentioned above.

Who the audience is

  • Researchers interested in predicting this event, and the people who end up having to fix the event if it occurs.

What I want them to get out of it

  • Give them the ability to predict the event, if they wish to repeat the modeling process, as described, with their own data.

  • Shed some light on unexpected predictors. For example, it might turn out that something completely unexpected is the best predictor. Modelers elsewhere therefore might give more serious consideration to said predictor.


There are no problems with the accuracy of the predictions: the uncertainty in your predictions is estimated well by cross-validation. One caveat is that if you test a lot of parameter settings, you will overestimate the accuracy, so you should use a separate validation set to estimate the accuracy of your final model. Also, your data should be representative of the data you are going to make predictions on.
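A minimal sketch of that caveat, assuming a scikit-learn workflow: hold out a final validation set before any tuning, tune freely on the rest, and report accuracy on the hold-out set only once, for the chosen model. The dataset, model, and parameter grid below are illustrative, not from the original exchange.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative data.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Hold out a final validation set before any tuning happens.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Tune over many settings on the development data only; the best
# cross-validated score here tends to be optimistically biased.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=5,
)
search.fit(X_dev, y_dev)

# The hold-out score is the honest estimate for the final model.
holdout_acc = search.score(X_hold, y_hold)
print(f"CV best: {search.best_score_:.3f}, hold-out: {holdout_acc:.3f}")
```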

It is clear to you, and it should be clear to the reader, that your predictors are not causes of the effect; they are just predictors that make good predictions and work well empirically. While I completely agree with your caution, inferring any causation from observational data is problematic in any case. Concepts like significance are “valid” in well-designed, controlled studies; outside of that, they are merely tools that you, and others, should interpret wisely and with caution. Common causes, spurious effects, masking, and other things can be going on in an ordinary linear regression with reported confidence intervals, just as in a Lasso model or a gradient boosted tree model.

Source: Link, Question Author: sjw, Answer Author: Gijs