# Pitfalls to avoid when transforming data?

I achieved a strong linear relationship between my $X$ and $Y$ variables after doubly transforming the response. The model was
$Y\sim X$
but I transformed it to
$\sqrt{\frac{Y}{X}}\sim \sqrt{X}$
improving $R^2$ from 0.19 to 0.76.

Clearly I did some decent surgery on this relationship. Can anyone discuss the pitfalls of doing this, such as dangers of excessive transformations or possible violations of statistical principles?

You can’t really compare $R^2$ before and after, because the response itself has changed: the total variability being explained is now that of $\sqrt{Y/X}$ rather than that of $Y$. So you can take no comfort whatever from the change in $R^2$; it tells you nothing of value in comparing the two models.
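To see how uninformative the change can be, here is a small simulation sketch in Python with NumPy. Everything in it (the data, the seed, the helper `r_squared`) is made up for illustration and is not taken from the question. $Y$ is generated independently of $X$, yet regressing $\sqrt{Y/X}$ on $\sqrt{X}$ still gives a sizeable $R^2$, simply because $X$ appears on both sides.

```python
import numpy as np

# Hypothetical data: x and y are strictly positive and independent of each other.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y = rng.uniform(1.0, 10.0, size=n)

def r_squared(u, v):
    """R^2 of a simple linear regression of v on u (squared Pearson correlation)."""
    return float(np.corrcoef(u, v)[0, 1] ** 2)

r2_raw = r_squared(x, y)                                # Y ~ X
r2_transformed = r_squared(np.sqrt(x), np.sqrt(y / x))  # sqrt(Y/X) ~ sqrt(X)

print(f"R^2 for Y ~ X:               {r2_raw:.3f}")
print(f"R^2 for sqrt(Y/X) ~ sqrt(X): {r2_transformed:.3f}")
```

Under these assumptions the first value is close to zero and the second is far from it, even though there is no relationship between $Y$ and $X$ to find. This does not say that is what happened with your data, only that a jump in $R^2$ across different responses cannot distinguish the two situations.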

The two models differ in several ways, so they mean different things: they assume very different things about the shape of the relationship and the variability of the error term (when both are expressed in terms of the relationship between $Y$ and $X$). So if you’re interested in modelling $Y$ (if $Y$ itself is meaningful), produce a good model for that. If you’re interested in modelling $\sqrt Y$ (if $\sqrt Y$ is meaningful), produce a good model for that. If $\sqrt{Y/X}$ carries meaning, then make a good model for that. But compare any competing models on comparable scales. $R^2$ values computed on different responses simply aren’t comparable.
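One way to put two such models on a comparable scale is to turn both into predictions of $Y$ and measure the error there. The sketch below does that with simulated data; the data-generating process, the seed and the helpers `fit_line` and `rmse` are assumptions for illustration only. Note that squaring the fitted value is a naive back-transformation: it ignores retransformation bias, so treat it as a rough comparison rather than a finished method.

```python
import numpy as np

# Hypothetical positive data, invented for this illustration.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = x * (1.0 + 0.5 * np.sqrt(x) + rng.normal(0.0, 0.3, size=n)) ** 2

def fit_line(u, v):
    """Least-squares (intercept, slope) for v ~ u."""
    slope, intercept = np.polyfit(u, v, deg=1)
    return intercept, slope

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Model A: Y ~ X, predictions already on the Y scale.
a0, a1 = fit_line(x, y)
pred_a = a0 + a1 * x

# Model B: sqrt(Y/X) ~ sqrt(X), naively back-transformed to the Y scale.
b0, b1 = fit_line(np.sqrt(x), np.sqrt(y / x))
pred_b = x * (b0 + b1 * np.sqrt(x)) ** 2

rmse_a = rmse(y, pred_a)
rmse_b = rmse(y, pred_b)
print(f"RMSE on the Y scale, model A: {rmse_a:.3f}")
print(f"RMSE on the Y scale, model B: {rmse_b:.3f}")
```

Both numbers are now in the units of $Y$, so they can be compared directly. They are still in-sample figures, so the caveats about searching for a good fit apply to them as well.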

If you’re just trying different relationships in the hope of finding a transformation with a high $R^2$ — or any other measure of ‘good fit’ — the properties of any inference you might like to conduct will be impacted by the existence of that search process.

Estimates will tend to be biased away from zero, standard errors will be too small, p-values will be too small, confidence intervals too narrow. Your models will on average appear to be ‘too good’ (in the sense that their out-of-sample behavior will be disappointing compared to in-sample behavior).
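A rough simulation sketch of that effect, assuming pure-noise data and a small, made-up menu of predictor transformations (the `candidates` dictionary, sample sizes and seed are all hypothetical). For each simulated data set the transformation with the highest in-sample $R^2$ is picked, and that same fitted line is then scored on fresh data from the same process.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical menu of predictor transformations to search over.
candidates = {
    "x": lambda v: v,
    "sqrt(x)": np.sqrt,
    "log(x)": np.log,
    "x^2": np.square,
    "1/x": lambda v: 1.0 / v,
}

def r2_from_fit(u_train, y_train, u_test, y_test):
    """Fit y ~ u on the training data, return 1 - SSE/SST on the test data."""
    slope, intercept = np.polyfit(u_train, y_train, deg=1)
    resid = y_test - (intercept + slope * u_test)
    return float(1.0 - np.sum(resid ** 2) / np.sum((y_test - np.mean(y_test)) ** 2))

n, n_sims = 30, 2000
in_fixed, in_best, out_best = [], [], []
for _ in range(n_sims):
    x_tr, x_te = rng.uniform(1.0, 10.0, n), rng.uniform(1.0, 10.0, n)
    y_tr, y_te = rng.normal(size=n), rng.normal(size=n)  # y is unrelated to x
    # In-sample R^2 for every candidate transformation.
    in_r2 = {name: r2_from_fit(f(x_tr), y_tr, f(x_tr), y_tr)
             for name, f in candidates.items()}
    best = max(in_r2, key=in_r2.get)       # the "search" step
    f = candidates[best]
    in_fixed.append(in_r2["x"])            # a transformation fixed in advance
    in_best.append(in_r2[best])            # the winner of the search
    out_best.append(r2_from_fit(f(x_tr), y_tr, f(x_te), y_te))

mean_in_fixed = float(np.mean(in_fixed))
mean_in_best = float(np.mean(in_best))
mean_out_best = float(np.mean(out_best))
print(f"mean in-sample R^2, pre-specified model: {mean_in_fixed:.3f}")
print(f"mean in-sample R^2, selected model:      {mean_in_best:.3f}")
print(f"mean out-of-sample R^2, selected model:  {mean_out_best:.3f}")
```

The selected model can never look worse in-sample than the pre-specified one, since the pre-specified one is among the candidates, and its out-of-sample score falls short of its in-sample score. With only a handful of closely related candidates the inflation is modest; a wider search, or one that also transforms the response, would be expected to make it larger.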

To avoid this kind of overfitting, you need, if possible, to do the model-identification and estimation on different subsets of the data (and model evaluation on a third). If you repeat this kind of procedure on many “splits” of the data taken at random, you get a better sense of how reproducible your results are.
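A minimal sketch of that procedure, again on simulated data: the data-generating process, the `candidates` menu, the equal three-way split and the 200 repetitions are all assumptions chosen for illustration. Each random split uses one third of the data to choose a transformation, a second third to estimate the coefficients, and the last third to evaluate the result.

```python
import numpy as np
from collections import Counter

# Hypothetical data, invented for this illustration.
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 1.5 * np.sqrt(x) + rng.normal(0.0, 1.0, size=n)

# Hypothetical candidate transformations of the predictor (response stays Y).
candidates = {"x": lambda v: v, "sqrt(x)": np.sqrt, "log(x)": np.log}

def fit(u, v):
    """Least-squares (intercept, slope) for v ~ u."""
    slope, intercept = np.polyfit(u, v, deg=1)
    return intercept, slope

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def one_split(x, y, rng):
    """Select on one third, estimate on another, evaluate on the last."""
    sel, est, ev = np.array_split(rng.permutation(len(x)), 3)

    def selection_rmse(f):
        b0, b1 = fit(f(x[sel]), y[sel])
        return rmse(y[sel], b0 + b1 * f(x[sel]))

    # 1. Model identification on the selection subset only.
    chosen = min(candidates, key=lambda name: selection_rmse(candidates[name]))
    f = candidates[chosen]
    # 2. Estimation on a separate subset.
    b0, b1 = fit(f(x[est]), y[est])
    # 3. Evaluation on data used for neither of the steps above.
    return chosen, rmse(y[ev], b0 + b1 * f(x[ev]))

n_splits = 200
results = [one_split(x, y, rng) for _ in range(n_splits)]
choice_counts = Counter(name for name, _ in results)
eval_rmse = np.array([score for _, score in results])

print("how often each transformation was chosen:", dict(choice_counts))
print(f"evaluation RMSE: mean {eval_rmse.mean():.3f}, sd {eval_rmse.std():.3f}")
```

If the chosen transformation changes a lot from split to split, or the evaluation error varies widely, that is a sign the apparent winner is not very reproducible.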

There are many posts here with relevant points on these issues: it might be worth trying some searches.

(If you have good a priori reasons for choosing a particular transformation, that’s a different issue. But searching the space of transformations to find something that fits carries all manner of ‘data snooping’ type problems with it.)