I have recently come to know about imputation techniques, which, in short, “guess” realistic values with which to replace missing values in a dataset. My big issue with this is that we are guessing data by assuming that they are similar to the ones we already had, which is going to reinforce any pattern that might be in the data, potentially turning a non-significant pattern into a significant one. How is this practice acceptable? What am I missing?
I am relatively new to the topic but I have done some studying, and I am aware that imputation techniques range from replacing all NAs with a fixed “realistic” value, to replacing them with the mean of the observed values, to guessing the missing values with nearest-neighbor or maximum-likelihood methods. While I understand how these methods work, I cannot shake the idea that they are crafting data. Imputation techniques differ in complexity and in how close to real the crafted data may look, but they are still crafting data. To me, this practice defeats the whole point of statistics as a tool to draw realistic inferences about a population based on a real, untampered sample of it, and not just a realistic sample of it. My question, to paraphrase Ian Malcolm, is not about whether we can do it but whether we should.
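For concreteness, here is a minimal sketch of two of the techniques I mentioned (mean imputation and nearest-neighbor imputation), using scikit-learn; the tiny array is made up purely for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data: one missing entry in the first column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Mean imputation: replace NaN with the column mean, (1 + 7) / 2 = 4.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Nearest-neighbor imputation: replace NaN with the value from the
# closest row, measured on the features both rows have observed.
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)
```

Both calls “craft” a value for the NaN, which is exactly the practice I am uneasy about: the first borrows the marginal mean, the second borrows from the most similar observation.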
The first of Tukey’s principles against statistician’s hubris states:
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
(From “Sunset Salvo”, The American Statistician 40(1), 72-76, February 1986)
Doesn’t imputation collide with it?
I realise that it may just be my ignorance talking, and that it may make any statistician reading this livid. If that’s the case, please enlighten me. I would also appreciate pointers towards relevant literature. So far I have only read the relevant chapter in Robinson’s “Forest Analytics in R”. Cheers!
There is no clear-cut answer here. The good news, though, is that you can verify the effects of imputation with a validation procedure: let the data decide!
Should one throw away a feature if a few of its values are missing? Or throw away the observations? What if those observations carry valuable information in their other features, and your algorithm cannot handle missing values? And so on.
Imputation, like removing observations or features, is just one way of dealing with missing values. The decision of which one is best should be supported by sound model-selection procedures such as (cross-)validation.
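A hedged sketch of “let the data decide”, using scikit-learn: compare two imputation strategies by cross-validating the full pipeline on a toy regression problem (the dataset and scores are illustrative only, not a benchmark):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data with artificial missingness:
# punch random holes in ~10% of the entries.
rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X[rng.random(X.shape) < 0.10] = np.nan

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    # The imputer sits inside the pipeline, so on each fold it is fitted
    # on the training portion only -- no information leaks from the
    # held-out data into the imputed values.
    pipe = make_pipeline(imputer, Ridge())
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

Putting the imputer inside the cross-validated pipeline is the key design choice: it directly addresses the worry above about reinforcing patterns, because any strategy that merely amplifies spurious structure in the training folds will pay for it on the held-out folds.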