I’ve been looking into semi-supervised learning methods, and have come across the concept of “pseudo-labeling”.
As I understand it, with pseudo-labeling you have a set of labeled data as well as a set of unlabeled data. You first train a model on only the labeled data. You then use that initial model to classify (attach provisional labels to) the unlabeled data. Finally, you feed both the labeled and unlabeled data back into your model training, (re-)fitting to both the known labels and the predicted labels. (You can iterate this process, re-labeling with the updated model.)
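For concreteness, here is a minimal sketch of that loop using scikit-learn. The data, the choice of classifier, and names like `n_rounds` are purely illustrative; they don't come from any of the papers discussed here.

```python
# Minimal pseudo-labeling sketch: train on labeled points, label the rest,
# then refit on everything. Data and model choice are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian blobs, one per class.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Pretend only 10 points are labeled; the rest are "unlabeled".
labeled = rng.choice(200, size=10, replace=False)
unlabeled = np.setdiff1d(np.arange(200), labeled)

# Step 1: train on the labeled data only.
model = LogisticRegression().fit(X[labeled], y[labeled])

# Steps 2-3, iterated: pseudo-label the unlabeled data with the current
# model, then refit on labeled + pseudo-labeled data combined.
n_rounds = 5
for _ in range(n_rounds):
    pseudo = model.predict(X[unlabeled])
    X_all = np.vstack([X[labeled], X[unlabeled]])
    y_all = np.concatenate([y[labeled], pseudo])
    model = LogisticRegression().fit(X_all, y_all)

print(model.score(X, y))
```

Note that, exactly as the question observes, the refit step sees pseudo-labels that the current boundary already predicts perfectly; whether the boundary moves depends on how the extra points reshape the loss, not on any label disagreement.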
The claimed benefits are that you can use the information about the structure of the unlabeled data to improve the model. A variation of the following figure is often shown, “demonstrating” that the process can make a more complex decision boundary based on where the (unlabeled) data lies.
However, I’m not quite buying that simplistic explanation. Naively, if the original labeled-only training result was the upper decision boundary, the pseudo-labels would be assigned based on that decision boundary. Which is to say that the left hand of the upper curve would be pseudo-labeled white and the right hand of the lower curve would be pseudo-labeled black. You wouldn’t get the nice curving decision boundary after retraining, as the new pseudo-labels would simply reinforce the current decision boundary.
Or to put it another way, the current labeled-only decision boundary would have perfect prediction accuracy for the unlabeled data (as that’s what we used to make them). There’s no driving force (no gradient) which would cause us to change the location of that decision boundary simply by adding in the pseudo-labeled data.
Am I correct in thinking that the explanation embodied by the diagram is lacking? Or is there something I’m missing? If the diagram’s explanation really is lacking, what is the actual benefit of pseudo-labels, given that the pre-retraining decision boundary already has perfect accuracy over the pseudo-labels?
Pseudo-labeling doesn’t work on the given toy problem
Oliver et al. (2018) evaluated different semi-supervised learning algorithms. Their first figure shows how pseudo-labeling (and other methods) perform on the same toy problem as in your question (called the ‘two-moons’ dataset):
The plot shows the labeled and unlabeled datapoints, and the decision boundaries obtained after training a neural net using different semi-supervised learning methods. As you suspected, pseudo-labeling doesn’t work well in this situation. They say that pseudo-labeling “is a simple heuristic which is widely used in practice, likely because of its simplicity and generality”. But: “While intuitive, it can nevertheless produce incorrect results when the prediction function produces unhelpful targets for [the unlabeled data], as shown in fig. 1.”
Why and when does pseudo-labeling work?
Pseudo-labeling was introduced by Lee (2013), so you can find more details there.
The cluster assumption
The theoretical justification Lee gave for pseudo-labeling is that it’s similar to entropy regularization. Entropy regularization (Grandvalet and Bengio 2005) is another semi-supervised learning technique, which encourages the classifier to make confident predictions on unlabeled data. For example, we’d prefer an unlabeled point to be assigned a high probability of being in a particular class, rather than diffuse probabilities spread over multiple classes. The purpose is to take advantage of the assumption that the data are clustered according to class (called the “cluster assumption” in semi-supervised learning). So, nearby points tend to have the same class, points in different classes are more widely separated, and the true decision boundaries run through low-density regions of input space.
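To make the connection concrete, here is the entropy term that entropy regularization adds to the loss for unlabeled points, as a small self-contained sketch (the function name and the example probability arrays are my own, for illustration):

```python
# Sketch of the entropy penalty used in entropy regularization:
# the mean Shannon entropy of the model's predicted class distributions.
import numpy as np

def entropy_penalty(probs, eps=1e-12):
    """Mean entropy of rows of `probs` (each row sums to 1).

    Confident predictions (e.g. [0.99, 0.01]) have low entropy; diffuse
    ones (e.g. [0.5, 0.5]) have high entropy. Minimizing this on
    unlabeled data pushes the boundary away from dense regions,
    in line with the cluster assumption.
    """
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

confident = np.array([[0.99, 0.01], [0.98, 0.02]])
diffuse = np.array([[0.5, 0.5], [0.6, 0.4]])
print(entropy_penalty(confident) < entropy_penalty(diffuse))  # True
```

Pseudo-labeling achieves a similar effect indirectly: training on hard pseudo-labels also rewards confident predictions on the unlabeled points.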
Why pseudo-labeling might fail
Given the above, it would seem reasonable to guess that the cluster assumption is a necessary condition for pseudo-labeling to work. But, clearly it’s not sufficient, as the two-moons problem above does satisfy the cluster assumption, but pseudo-labeling doesn’t work. In this case, I suspect the problem is that there are very few labeled points, and the proper cluster structure can’t be identified from these points. So, as Oliver et al. describe (and as you point out in your question), the resulting pseudo-labels guide the classifier toward the wrong decision boundary. Perhaps it would work given more labeled data. For example, contrast this to the MNIST case described below, where pseudo-labeling does work.
Where it works
Lee (2013) showed that pseudo-labeling can help on the MNIST dataset (with 100-3000 labeled examples). In fig. 1 of that paper, you can see that a neural net trained on 600 labeled examples (without any semi-supervised learning) can already recover cluster structure among classes. It seems that pseudo-labeling then helps refine the structure. Note that this is unlike the two-moons example, where several labeled points were not enough to learn the proper clusters.
The paper also mentions that results were unstable with only 100 labeled examples. This again supports the idea that pseudo-labeling is sensitive to the initial predictions, and that good initial predictions require a sufficient number of labeled points.
Lee also showed that unsupervised pre-training using denoising autoencoders helps further, but this appears to be a separate way of exploiting structure in the unlabeled data; unfortunately, there was no comparison to unsupervised pre-training alone (without pseudo-labeling).
Oliver et al. (2018) reported that pseudo-labeling beats the purely supervised baseline on the CIFAR-10 and SVHN datasets (with 4000 and 1000 labeled examples, respectively). As above, this is much more labeled data than the 6 labeled points in the two-moons problem.