I have a prediction model tested with four methods, as you can see in the boxplot figure below. The attribute that the model predicts is in the range 0 to 8.
You may notice that there is one upper-bound outlier and three lower-bound outliers indicated by all methods. I wonder whether it is appropriate to remove these instances from the data, or is this a sort of cheating to improve the prediction model?
It is almost always cheating to remove observations just to improve a regression model. You should drop observations only when you truly believe that they are not valid observations at all.
For instance, suppose you have a time series from the heart rate monitor connected to your smart watch. If you take a look at the series, it's easy to spot erroneous observations with readings like 300 bpm. These should be removed, but not because you want to "improve the model" (whatever that means). They're read errors which have nothing to do with your actual heart rate.
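Here's a minimal sketch of that kind of cleaning. The data, the 220 bpm cutoff, and the glitch values are all made up for illustration; the point is that the rule comes from domain knowledge (what a human heart can do), not from what helps the model.

```python
import random

random.seed(0)

# Simulated heart-rate series (bpm) with a couple of sensor glitches.
# Readings around 300 bpm are physiologically impossible, so they can be
# treated as read errors rather than genuine observations.
heart_rate = [random.gauss(70, 8) for _ in range(100)]
heart_rate[10] = 300.0   # glitch
heart_rate[55] = 280.0   # glitch

MAX_PLAUSIBLE_BPM = 220  # assumed domain cutoff, not a universal constant

cleaned = [x for x in heart_rate if 0 < x < MAX_PLAUSIBLE_BPM]

print(len(heart_rate), len(cleaned))  # the two glitches are dropped
```

Note that the cutoff is justified before looking at the model's performance at all.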
One thing to be careful about, though, is the correlation of errors with the data. In my example, it could be argued that the errors occur when the heart rate monitor is displaced during exercise such as running or jumping, which would make these errors correlated with the heart rate itself. In this case, care must be taken when removing these outliers and errors, because they are not occurring at random.
I'll give you a made-up example of when not to remove outliers. Let's say you're measuring the movement of a weight on a spring. If the displacement is small relative to the strength of the spring, you'll notice that Hooke's law works very well: F = −kΔx, where F is the force, k is the spring constant, and Δx is the displacement of the weight.
Now, if you attach a very heavy weight or displace the weight too far, you'll start seeing deviations: at large enough displacements Δx the motion deviates from the linear model. So you might be tempted to remove those "outliers" to improve the linear fit. That would not be a good idea, because the model genuinely isn't working well there: Hooke's law is only approximately right.
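The spring example can be sketched numerically. The cubic term and the constants below are invented for illustration; the point is that the worst residuals of the straight-line fit sit exactly at the extreme displacements, so dropping them would hide the misspecification rather than fix it.

```python
# Noise-free data from a slightly nonlinear spring: F = -k*dx - c*dx**3
# (k and c are made-up values for the sketch).
k, c = 2.0, 0.5
dx = [i / 10 for i in range(-20, 21)]          # displacements
force = [-k * x - c * x**3 for x in dx]        # true response

# Least-squares slope of a line through the origin: F ≈ -k_hat * dx
k_hat = -sum(x * f for x, f in zip(dx, force)) / sum(x * x for x in dx)

residuals = [f - (-k_hat * x) for x, f in zip(dx, force)]

# The largest residuals occur at the extreme displacements, i.e. the
# "outliers" are real physics, not measurement errors.
worst = max(range(len(dx)), key=lambda i: abs(residuals[i]))
print(round(k_hat, 3), dx[worst])
```

Removing the large-displacement points would make the linear fit look better while estimating the wrong k and hiding the fact that the model breaks down out there.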
In your case, I would suggest pulling those data points and looking at them more closely. Could it be lab instrument failure? External interference? A sample defect?
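To pull the points the boxplot is flagging, you can apply the usual 1.5×IQR whisker rule, which is what most boxplot routines use by default. The `values` list below is made up to mimic your situation (one high outlier, three low ones); the quartile computation is deliberately crude, just enough for a quick look.

```python
# Flag boxplot-style outliers for inspection using the 1.5*IQR rule.
def iqr_outliers(values):
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]           # crude quartiles; fine for a quick look
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [(i, v) for i, v in enumerate(values) if v < lo or v > hi]

# Made-up target values in the 0-8 range, with 1 high and 3 low outliers.
values = [4.1, 3.9, 4.4, 4.0, 3.8, 4.2, 7.9, 4.3, 0.2, 4.0, 0.3, 0.1, 4.2]
for idx, v in iqr_outliers(values):
    print(idx, v)   # go inspect these rows in the raw data, don't just drop them
```

The indices give you the rows to trace back to the raw records before deciding anything.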
Next, try to identify whether the presence of these outliers could be correlated with what you measure, as in the example I gave. If there is such a correlation, then there's no simple way to go about it. If there isn't, then you can remove the outliers.
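One crude way to run that check: compare some covariate between flagged and unflagged points. Everything here is invented (`exercise_intensity` plays the role of the displaced-monitor scenario from the heart-rate example); a large gap between the group means is a warning sign that the outliers are not occurring at random.

```python
# Made-up data: 1 = point was flagged as an outlier, 0 = not flagged.
outlier_flag       = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
exercise_intensity = [1.0, 1.2, 4.5, 0.9, 5.1, 4.8, 1.1, 1.3, 5.0, 1.0]

def mean(xs):
    return sum(xs) / len(xs)

flagged   = [x for f, x in zip(outlier_flag, exercise_intensity) if f]
unflagged = [x for f, x in zip(outlier_flag, exercise_intensity) if not f]

print(mean(flagged), mean(unflagged))
# In this made-up data the flagged points cluster at high intensity,
# i.e. the errors are correlated with the state being measured, like
# the heart-rate monitor being displaced during exercise. Dropping them
# here would bias whatever you estimate afterwards.
```

In practice you would do this against every covariate you have, and with a proper test rather than an eyeballed mean difference, but the logic is the same.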