Linear regression with log transformed data – large error [duplicate]

I have a set of data which has a very large positive skew and has been transformed using a logarithm. I wish to predict one variable from another using the lm function in R. Since both variables have been transformed, I am well aware that my regression will output the equation:

ln(y) = b*ln(x) + a, where a and b are the coefficients.

The model fit is good, with an R squared of almost 0.6, producing a range of predicted y values.

Now, I have ‘back-transformed’ the predicted values using the following equation:

y_predicted = exp(a)*x^b

However, the predicted values for the larger x and y are significantly lower than they should be. Since I am going to be comparing the mean and sum of all of the y_predicted values with the y_actual values, this makes my model underpredict by around 75%.
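For illustration, here is a minimal R sketch that reproduces the effect with simulated data (the distribution and parameter values are made up for the example, not my actual data):

```r
set.seed(1)
n <- 500
x <- exp(rnorm(n, mean = 2, sd = 1))             # positively skewed predictor
y <- exp(0.5 + 0.8 * log(x) + rnorm(n, sd = 1))  # power law with lognormal noise

fit <- lm(log(y) ~ log(x))                       # regression in the log domain
a <- coef(fit)[1]                                # intercept
b <- coef(fit)[2]                                # slope
y_predicted <- exp(a) * x^b                      # naive back-transformation
sum(y_predicted) / sum(y)                        # well below 1: underprediction
```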

Due to the logarithmic scale, a small deviation from the line of best fit in the log domain results in a very large deviation when back-transformed.

My question is how to deal with this adequately. I could come up with my own regression coefficients, chosen so that the line of best fit over-predicts some of these larger values and brings the sum more into line. However, this would defeat the point of using a linear model in the first place, which is that it finds the optimal coefficients for me.

Also, I am not sure how ‘statistically’ valid this would be: the method could not be replicated, since the coefficients would have been determined by eye.

Thoughts welcome!

Answer

If you say your model is ln(y) = b*ln(x) + a, that is only part of your model. Your actual model includes an error term:

$\ln y_i = b\cdot \ln x_i + a + \varepsilon_i$

and you assume that the error distribution is $\varepsilon_i \sim \mathcal{N}(0,\,\sigma^2)$. Now let’s back-transform it:

$y_i = \exp(a) \cdot x_i^b \cdot \exp(\varepsilon_i)$

As you can see, you have a multiplicative error term, i.e., a relative error with constant variance on the log scale. As a result, you allow more absolute deviation from the fitted line at higher fitted values, i.e., you effectively place less weight on them. This is often justified, but it of course gives you larger raw-scale residuals for the higher values, as you have observed.
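This also explains the underprediction of means and sums directly: exponentiating the fitted log-values estimates the conditional median of $y$, not its mean, because $\mathrm{E}[\exp(\varepsilon_i)] = \exp(\sigma^2/2) > 1$ for normal errors. A quick sanity check in R (the value of $\sigma$ here is illustrative):

```r
## For normal errors, exp(eps) has mean exp(sigma^2 / 2), not 1, so
## naive back-transformed predictions are biased low for the mean.
sigma <- 1
eps   <- rnorm(1e6, mean = 0, sd = sigma)
mean(exp(eps))     # close to exp(0.5), about 1.65
exp(sigma^2 / 2)   # the theoretical mean of exp(eps)
```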

If you are not happy with this, you should not use a transformation followed by OLS. One alternative would be a Generalized Linear Model, which models the error differently; another would be non-linear regression.
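To sketch the GLM route with one common choice for positive, right-skewed data (the Gamma family with a log link; this particular family is an illustrative assumption, not the only option), reusing the simulated x and y from the sketch in the question:

```r
## Gamma GLM with log link: mu = exp(a + b * log(x)) is fitted on the
## original scale, so predict(type = "response") gives conditional means.
fit_glm <- glm(y ~ log(x), family = Gamma(link = "log"))
y_hat   <- predict(fit_glm, type = "response")
sum(y_hat) / sum(y)   # typically much closer to 1 than the naive back-transform
```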

Attribution
Source: Link, Question Author: sym246, Answer Author: Nick Cox
