I’ve heard that when many regression model specifications (say, in OLS) are considered as possibilities for a dataset, this causes multiple comparison problems and the p-values and confidence intervals are no longer reliable. One extreme example of this is stepwise regression.

When can I use the data itself to help specify the model, and when is this not a valid approach? Do you always need to have a subject-matter-based theory to form the model?

**Answer**

Variable selection techniques, in general (whether stepwise, backward, forward, all subsets, AIC, etc.), capitalize on chance or random patterns in the sample data that do not exist in the population. The technical term for this is over-fitting and it is especially problematic with small datasets, though it is not exclusive to them. By using a procedure that selects variables based on best fit, all of the random variation that looks like fit ** in this particular sample** contributes to estimates and standard errors. This is a problem for

**prediction and interpretation of the model.**

*both*Specifically, r-squared is too high and parameter estimates are biased (they are too far from 0), standard errors for parameters are too small (and thus p-values and intervals around parameters are too small/narrow).

The best line of defense against these problems is to build models thoughtfully and include the predictors that make sense based on theory, logic, and previous knowledge. If a variable selection procedure is necessary, you should select a method that penalizes the parameter estimates (shrinkage methods) by adjusting the parameters and standard errors to account for over-fitting. Some common shrinkage methods are Ridge Regression, Least Angle Regression, or the lasso. In addition, cross-validation using a training dataset and a test dataset or model-averaging can be useful to test or reduce the effects of over-fitting.

Harrell is a great source for a detailed discussion of these problems. *Harrell (2001). “Regression Modeling Strategies.”*

**Attribution***Source : Link , Question Author : Statisfactions , Answer Author : Brett*