Doing correct statistics in a working environment?

I am not sure where this question belongs: Cross Validated or The Workplace. But my question is at least loosely related to statistics.

This question (or, I guess, these questions) arose while I was working as a “data science intern”. I was building a linear regression model and examining the residual plot, and I saw clear signs of heteroskedasticity. I remember that heteroskedasticity distorts the usual standard errors, and with them inference such as confidence intervals and t-tests. So I used weighted least squares, following what I had learned in college. My manager saw that and advised me not to do it because “I was making things complicated”, which was not a convincing reason to me at all.
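To make the heteroskedasticity point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not the model from the internship: the data-generating process (noise standard deviation proportional to `x`), the weights `1 / x**2`, and the helper names `fit_wls` and `simulate_slopes` are all made up. It fits the same data by ordinary and by weighted least squares many times and records how much the slope estimate varies.

```python
import numpy as np

def fit_wls(X, y, weights):
    """Weighted least squares: minimizes sum(weights * residual**2).

    With all weights equal to 1 this is ordinary least squares.
    Returns the coefficients and their nominal standard errors.
    """
    sw = np.sqrt(weights)
    Xw = X * sw[:, None]          # rescale each row by sqrt(weight)
    yw = y * sw
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    resid = yw - Xw @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Xw.T @ Xw)
    return beta, np.sqrt(np.diag(cov))

def simulate_slopes(n_sims=2000, n=100, seed=0):
    """Repeatedly simulate y = 1 + 2x + noise, where sd(noise) = 0.5 * x."""
    rng = np.random.default_rng(seed)
    x = np.linspace(1.0, 10.0, n)
    X = np.column_stack([np.ones(n), x])
    ones = np.ones(n)
    inv_var_weights = 1.0 / x**2  # assumed known variance structure
    ols_slopes, wls_slopes, ols_nominal_se = [], [], []
    for _ in range(n_sims):
        y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)
        b_ols, se_ols = fit_wls(X, y, ones)
        b_wls, _ = fit_wls(X, y, inv_var_weights)
        ols_slopes.append(b_ols[1])
        wls_slopes.append(b_wls[1])
        ols_nominal_se.append(se_ols[1])
    return {
        "ols_mean": float(np.mean(ols_slopes)),
        "ols_sd": float(np.std(ols_slopes, ddof=1)),
        "wls_sd": float(np.std(wls_slopes, ddof=1)),
        "ols_nominal_se_mean": float(np.mean(ols_nominal_se)),
    }

if __name__ == "__main__":
    for name, value in simulate_slopes().items():
        print(f"{name}: {value:.4f}")
```

In this setup both estimators are centered on the true slope, but the weighted fit varies less from sample to sample. The averaged nominal OLS standard error can also be compared with the empirical spread `ols_sd` to see how well the textbook formula tracks reality when the constant-variance assumption fails. In practice the variance structure is rarely known exactly, so the weights here are an idealization.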

Another example would be “removing an explanatory variable because its p-value is not significant”. To me, this advice just does not make sense from a logical point of view. According to what I have learned, a non-significant p-value can arise for different reasons: chance, using the wrong model, violating the assumptions, etc.
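A small simulation can illustrate why a non-significant p-value is weak evidence that a variable is irrelevant. This is only a sketch under assumed settings: the sample size of 30, the true coefficient of 0.3, and the function name `miss_rate` are made up for illustration. The variable `x2` really does affect `y` in every simulated dataset, and the code counts how often its t-test still fails to reach the 5% level.

```python
import numpy as np

N_OBS = 30
# Two-sided 5% critical value of Student's t with 27 degrees of freedom
# (30 observations minus 3 coefficients). Only valid for N_OBS = 30.
T_CRIT_DF27 = 2.052

def miss_rate(true_effect=0.3, n_sims=2000, seed=1):
    """Fraction of simulated datasets in which x2 looks 'insignificant'
    even though its true coefficient is `true_effect` (nonzero)."""
    rng = np.random.default_rng(seed)
    misses = 0
    for _ in range(n_sims):
        x1 = rng.normal(size=N_OBS)
        x2 = rng.normal(size=N_OBS)
        y = 1.0 + 0.5 * x1 + true_effect * x2 + rng.normal(size=N_OBS)
        X = np.column_stack([np.ones(N_OBS), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (N_OBS - X.shape[1])
        se_x2 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
        if abs(beta[2] / se_x2) < T_CRIT_DF27:
            misses += 1
    return misses / n_sims

if __name__ == "__main__":
    print(f"share of runs where the real effect is not significant: {miss_rate():.2f}")
```

Under these assumed settings a substantial share of runs fail to flag `x2`, purely because the sample is small relative to the effect size. Dropping the variable in those runs would remove a real predictor from the model.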

Yet another example: I used k-fold cross-validation to evaluate my models. According to the results, $CV_{model 1}$ is far better than $CV_{model 2}$. But model 1 has a lower $R^2$, and the reason has something to do with the intercept. My supervisor, though, seems to prefer model 2 because it has the higher $R^2$. His reasons (such as “$R^2$ is robust”, or “cross-validation is a machine learning approach, not a statistical approach”) are not convincing enough to change my mind.
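The post does not say exactly how the intercept was involved, so the following is only one possible scenario, sketched with made-up data and hypothetical helper names (`fit`, `kfold_mse`, `compare`). It assumes the model with the higher $R^2$ was fitted without an intercept and that its $R^2$ was computed in the uncentered form $1 - RSS / \sum y_i^2$, which is the convention some software uses when the intercept is dropped. In that case $R^2$ is not comparable across the two models, while k-fold cross-validated error is.

```python
import numpy as np

def fit(X, y):
    """Least-squares coefficients for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def kfold_mse(X, y, k=5, seed=0):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_errors = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        beta = fit(X[train_mask], y[train_mask])
        sq_errors.append((y[test_idx] - X[test_idx] @ beta) ** 2)
    return float(np.mean(np.concatenate(sq_errors)))

def compare(n=200, seed=2):
    """Simulate y = 10 + 0.2x + noise and compare both models two ways."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    y = 10.0 + 0.2 * x + rng.normal(size=n)
    X_with = np.column_stack([np.ones(n), x])  # intercept + slope
    X_without = x[:, None]                     # slope only, forced through 0
    rss_with = np.sum((y - X_with @ fit(X_with, y)) ** 2)
    rss_without = np.sum((y - X_without @ fit(X_without, y)) ** 2)
    return {
        # Centered R^2: baseline is predicting the mean of y.
        "r2_with_intercept": float(1 - rss_with / np.sum((y - y.mean()) ** 2)),
        # Uncentered R^2: baseline is predicting 0, which inflates the value.
        "r2_no_intercept_uncentered": float(1 - rss_without / np.sum(y ** 2)),
        "cv_mse_with_intercept": kfold_mse(X_with, y),
        "cv_mse_no_intercept": kfold_mse(X_without, y),
    }

if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.3f}")
```

In this simulated case the no-intercept model reports the higher $R^2$ yet predicts held-out data much worse, because its $R^2$ is measured against a different baseline. Whether that matches the actual situation at work is an assumption.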

As someone who just graduated from college, I am very confused. I am very passionate about applying correct statistics to solve real-world problems, but I don’t know which of the following is true:

  1. The statistics I learned is just wrong, so I am the one making mistakes.
  2. There is a huge difference between theoretical statistics and building models in companies. And although statistical theory is right, people just don’t follow it.
  3. The manager is not using statistics correctly.

Update on 4/17/2017: I have decided to pursue a Ph.D. in statistics. Thank you all for your replies.

Answer

In a nutshell, you’re right and he’s wrong. The tragedy of data analysis is that a lot of people do it, but only a minority of people do it well, partly due to a weak education in data analysis and partly due to apathy. Turn a critical eye to most any published research article that doesn’t have a statistician or a machine-learning expert on the author list and you’ll quickly spot such elementary mistakes as interpreting $p$-values as the probability that the null hypothesis is true.

I think the only thing to do, when confronted with this kind of situation, is to carefully explain what’s wrong about the wrongheaded practice, with an example or two.

Attribution
Source: Link, Question Author: 3x89g2, Answer Author: Kodiologist
