I have a classification dataset where roughly 20% (maybe more) of the labels are incorrect. There is no way to know which labels are incorrect and no way to eliminate them in the future when further data is collected.
One method I have seen for dealing with this is to train an ensemble of classifiers and then keep only the training examples whose labels match the ensemble's majority vote.
Are there any other algorithms or methods that are more resilient to data that is not labelled 100% correctly? Can we even treat this as a supervised learning problem? Is there any way to trust the trained model, or performance metrics such as accuracy and F1 score?
Thank you for the help.
This problem is known as “label noise”, and there are a number of methods for dealing with it. Essentially, you need to include the possibility of incorrect labelling in the model and infer whether each pattern has been mislabelled or genuinely belongs on the wrong side of the decision boundary. There is a nice paper by Bootkrajang and Kaban on this topic, which would be a good place to start. This paper by Lawrence and Scholkopf is also well worth investigating. However, research on this problem has quite a long history; IIRC there is a discussion of it in McLachlan’s book “Discriminant Analysis and Statistical Pattern Recognition”.
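As a quick, concrete illustration of the filtering idea you describe (not the method from the papers above): you can score each training point with models that never saw its label, using out-of-fold predictions, and flag points whose observed label looks improbable. The dataset, classifier, and 0.3 threshold below are all arbitrary choices for the sketch.

```python
# Sketch: flag suspected mislabelled points via out-of-fold agreement.
# All specifics (random forest, 5 folds, 0.3 threshold) are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic binary data with ~20% of labels flipped, mimicking the setting.
X, y_true = make_classification(n_samples=1000, n_features=10, random_state=0)
y = y_true.copy()
flipped = rng.random(len(y)) < 0.2
y[flipped] = 1 - y[flipped]

# Out-of-fold predicted probabilities: each point is scored only by
# models trained on the other folds, so its (possibly wrong) label
# never influences its own score.
proba = cross_val_predict(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=5, method="predict_proba",
)

# Flag points whose observed label receives low out-of-fold probability.
p_observed = proba[np.arange(len(y)), y]
suspect = p_observed < 0.3

print(f"flagged {suspect.sum()} of {len(y)} points as suspect")
print(f"fraction of flagged points actually mislabelled: "
      f"{flipped[suspect].mean():.2f}")
```

On synthetic data like this the flagged set is heavily enriched for truly flipped labels, but note that any such filter will also discard some genuinely hard (correctly labelled) boundary points, which is exactly the ambiguity the model-based approaches above try to resolve probabilistically.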