# Confounding variables in machine learning predictions?

In classical statistics, the confounding variable is a critical concept because it can distort our view of the relationship between the input variables and the outcome variable. Statisticians use many forms of control and adjustment to eliminate, avoid, or minimize the effect of confounding. For example, expected confounding variables (e.g., age and sex) are often included in the analysis; in the final model, the coefficient of the explanatory variable of interest (e.g., treatment) is then adjusted for the confounders (age and sex).

Confounding is not a topic that frequently shows up in machine learning and predictive analysis, and I wonder what role it may (or may not) play in machine learning algorithms. Does confounding potentially affect out-of-sample accuracy? Is including or excluding an expected confounding variable an important consideration when selecting features in machine learning?

Confounding is not as big a problem when performing prediction, because we are not concerned with identifying the exact effect of one variable on another. We are simply looking for the most likely value of the dependent variable given a set of predictors.
Suppose, for example, that we predict a person's salary from their age alone:

$$\text{Salary} = \beta_0 + \beta_1 \,\text{Age} + u$$

It is very likely that $\beta_1$ in this equation will be positive and fairly large, because older people tend to have more education and more work experience. So if we wish to pin-point the link between age and salary, we should probably control for these confounders, estimating the model:

$$\text{Salary} = \beta_0^* + \beta_1^* \,\text{Age} + \beta_2 \,\text{Educ} + \beta_3 \,\text{Exper} + u$$
It is very likely that $\beta_1^* < \beta_1$, and that $\beta_1^*$ will be a much better estimator of the pure effect of age on one's earnings, i.e., in the sense of 'change someone's age and keep everything else fixed'. However, since age is highly correlated with education and experience, the first model might just be good enough for predicting a person's salary.
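The point can be demonstrated with a small simulation. This is a sketch with entirely hypothetical numbers (the coefficients, noise levels, and variable ranges below are made up for illustration): education and experience rise with age and also raise salary, so the naive age coefficient absorbs their effects, while the adjusted coefficient recovers the direct effect; yet both models predict salary about equally well.

```python
# Minimal sketch with simulated (hypothetical) data: education and
# experience rise with age and also raise salary, so they confound the
# naive age -> salary regression.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

age = rng.uniform(25, 60, n)
educ = 10 + 0.15 * age + rng.normal(0, 1.5, n)    # confounder 1
exper = 0.6 * (age - 22) + rng.normal(0, 2.0, n)  # confounder 2
# True direct effect of age on salary is a modest 0.2 per year.
salary = 5 + 0.2 * age + 1.5 * educ + 0.8 * exper + rng.normal(0, 3.0, n)

def fit_ols(X, y):
    """OLS coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    """In-sample R^2 for the fitted coefficients."""
    X = np.column_stack([np.ones(len(y)), X])
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

X_naive = age.reshape(-1, 1)                 # Salary ~ Age
X_adj = np.column_stack([age, educ, exper])  # Salary ~ Age + Educ + Exper

b_naive = fit_ols(X_naive, salary)
b_adj = fit_ols(X_adj, salary)

# The naive beta_1 absorbs the confounders' effects and overstates the
# pure age effect; the adjusted beta_1* lands near the true 0.2.
print("naive    beta_1 :", round(b_naive[1], 2))
print("adjusted beta_1*:", round(b_adj[1], 2))

# For prediction, both models fit well, because age proxies for the
# confounders it is correlated with.
print("R^2 naive   :", round(r_squared(X_naive, salary, b_naive), 2))
print("R^2 adjusted:", round(r_squared(X_adj, salary, b_adj), 2))
```

This is exactly the asymmetry in the answer above: for causal interpretation the two coefficients differ sharply, but for prediction the two R² values are close, so omitting the confounders costs little accuracy.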