Does regression work on data that isn’t normally distributed?

histogram of my data

I’m trying to see if variables x and y, together or separately, significantly affect Q_7 (whose histogram is above). I’ve run a Shapiro-Wilk normality test and got the following:

shapiro.test(Q_7)
## data:  Q_7
## W = 0.68439, p-value < 2.2e-16

With this distribution, will the following regression work? Or is there another test I should be doing?

lm(Q_7 ~ x*y)

Answer

A regression analysis assumes that the data is normally distributed conditioned on the variables in the regression model. That is, if this is the regression model:
y = Xβ + ε
where X is your matrix of regressor variables, y is the (vector of) data to be explained, β is a vector of coefficients on the regressors, and ε is random variability (typically considered noise), then the assumption of Normality applies strictly to ε, not to y. (Edit: strictly speaking, it applies to the conditional distribution y|X, which is the distribution of ε shifted by Xβ, but not to the marginal distribution of y.) In other words, the data should be Normally distributed once the effects of the regressors have been accounted for, but not (necessarily) before.
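To make the marginal-versus-conditional distinction concrete, here is a small simulation sketch. Everything in it is made up for illustration (the binary regressor x, the coefficients 1 and 5, the sample size); it is not the asker's data. The noise is Normal by construction, yet y as a whole is bimodal and fails Shapiro-Wilk, while y within each level of x is just Normal noise around a shifted mean.

```r
set.seed(1)
n <- 500

# Hypothetical binary regressor with a large effect on the response
x   <- rbinom(n, 1, 0.3)
eps <- rnorm(n)            # Normal noise by construction
y   <- 1 + 5 * x + eps     # marginal distribution of y is a two-peaked mixture

# Shapiro-Wilk on the marginal distribution of y: expect a very small p-value
p_marginal <- shapiro.test(y)$p.value

# Shapiro-Wilk on y conditional on x (one test per level of x)
p_by_group <- tapply(y, x, function(v) shapiro.test(v)$p.value)

print(p_marginal)
print(p_by_group)
```

The marginal test rejects Normality even though the model's assumption holds exactly here, which is why a Shapiro-Wilk test on the raw response says little about whether the regression assumptions are met.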

What you’re testing here is the distribution of y, whereas what you want to test is the distribution of ε. Of course you don’t know ε, but you can estimate it by running the regression and examining the distribution of the residuals ε̂ = y − Xβ̂ (where β̂ are the estimated coefficients from the regression). These residuals ε̂ are an estimate of ε, so their distribution approximates the distribution of ε.
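In R, the residual check described above might look like the sketch below. The data frame dat is a simulated stand-in (an assumption, since the real data isn't shown); with the real data you would keep only the lm(), residuals() and shapiro.test() lines and point them at your own Q_7, x and y.

```r
set.seed(42)

# Stand-in data, purely illustrative: replace `dat` with the real data frame
dat <- data.frame(x = rnorm(200), y = rnorm(200))
dat$Q_7 <- 2 + 0.5 * dat$x - 0.3 * dat$y + 0.2 * dat$x * dat$y + rnorm(200)

# Fit the model from the question (main effects plus interaction)
fit <- lm(Q_7 ~ x * y, data = dat)

# Residuals: observed response minus fitted values
res <- residuals(fit)

# Test the residuals, not the raw response, for Normality
sw <- shapiro.test(res)
print(sw)
```

A Q-Q plot of the residuals (qqnorm(res); qqline(res)) is a useful visual companion to the test, since with large samples Shapiro-Wilk can flag departures too small to matter in practice.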

Attribution
Source : Link , Question Author : Community , Answer Author : Ruben van Bergen
