Consider the following problem. You have a large dataset, a small subset of which has labels from the classes A, B and C. I would like to classify the unlabelled items, each of which can belong to A, B or C or (crucially) to other classes for which I have not yet seen any labels.
The ideal result would be a full labeling of the unlabelled subset with the classes A, B, C, D, E, …
Is this an example of semi-supervised classification and what are good approaches one can take to this kind of problem?
This is a very interesting framework.
Building one-vs-all classifiers will help you to identify A, B, C and “others”.
However, they won’t be able to distinguish between D, E and the rest within “others”.
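As a rough sketch of the one-vs-all idea, assuming scikit-learn, 2-D toy data and a 0.5 probability threshold (all of these are my assumptions, not something from your setup): train one binary classifier per known class and call a sample “other” when no classifier claims it. The function name `predict_with_others` is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy labelled data: three well-separated known classes (assumed for illustration).
centers = {"A": (6.0, 0.0), "B": (-3.0, 5.2), "C": (-3.0, -5.2)}
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers.values()])
y = np.repeat(list(centers), 30)

# One binary "class c vs. everything else" classifier per known class.
classifiers = {c: LogisticRegression().fit(X, y == c) for c in centers}

def predict_with_others(points, threshold=0.5):
    """Return the best-scoring known class, or "other" if no classifier claims the point."""
    names = list(classifiers)
    scores = np.column_stack(
        [classifiers[c].predict_proba(points)[:, 1] for c in names]
    )
    best = scores.argmax(axis=1)
    return [
        names[b] if scores[i, b] >= threshold else "other"
        for i, b in enumerate(best)
    ]

# First point sits on the class A centre, second sits between all three classes.
print(predict_with_others(np.array([[6.0, 0.0], [0.0, 0.0]])))
```

Keep in mind that linear one-vs-all classifiers can still claim samples that lie far from the training data, so “other” here only catches samples that no classifier claims.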
I think that you should cluster your data in order to identify the clusters that correspond to the unknown classes.
If you have a distance function at hand, you can evaluate how well it separates the known classes. However, you can actually learn a proper distance function.
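For the first option, checking a distance function you already have, one possible check is the silhouette score computed with the known labels as the "clusters". This is only a sketch on toy data, assuming scikit-learn; the candidate metrics listed are arbitrary examples.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Toy labelled data with three known classes (assumed for illustration).
centers = {"A": (6.0, 0.0), "B": (-3.0, 5.2), "C": (-3.0, -5.2)}
X_lab = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers.values()])
y_lab = np.repeat(list(centers), 30)

# Silhouette is in [-1, 1]; values near 1 mean the distance keeps samples of
# the same class close together and samples of different classes far apart.
scores = {
    metric: silhouette_score(X_lab, y_lab, metric=metric)
    for metric in ("euclidean", "manhattan", "cosine")
}
for metric, score in scores.items():
    print(f"{metric}: {score:.3f}")
```

A distance that scores poorly on the known classes is unlikely to separate the unknown ones well either.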
Let L be your labeled dataset.
Build a pair dataset for all pairs x,y in L.
Let the target (the concept to learn) of the pair dataset be the desired distance.
If class(x)=class(y), the distance should be zero.
If the classes differ, the required distance is a domain question (e.g., the distance between A and B might be smaller than the distance between B and C).
Now train a regressor on the pairs dataset.
Use the regressor as the distance function for your clustering algorithm.
Hierarchical clustering algorithms seem to fit your needs well.
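The pair-dataset recipe above could look roughly like this. It is a sketch under several assumptions of mine: each pair is represented by the absolute feature difference |x - y| (which makes the distance symmetric), the target is 0 for the same class and 1 for different classes (replace 1 with your domain distances), and the regressor is a scikit-learn `RandomForestRegressor`. The helper names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy labelled set L with three known classes (assumed for illustration).
centers = {"A": (6.0, 0.0), "B": (-3.0, 5.2), "C": (-3.0, -5.2)}
X_lab = np.vstack([rng.normal(c, 0.5, size=(15, 2)) for c in centers.values()])
y_lab = np.repeat(list(centers), 15)

def pair_features(points):
    """Indices of all unordered pairs (i < j) and their |x - y| features."""
    i, j = np.triu_indices(len(points), k=1)
    return i, j, np.abs(points[i] - points[j])

# Pair dataset: target 0 for same-class pairs, 1 otherwise (assumed targets).
i, j, features = pair_features(X_lab)
targets = (y_lab[i] != y_lab[j]).astype(float)

regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(features, targets)

def learned_distance_matrix(points):
    """Symmetric matrix of predicted distances with a zero diagonal."""
    i, j, feats = pair_features(points)
    dist = np.zeros((len(points), len(points)))
    dist[i, j] = dist[j, i] = regressor.predict(feats)
    return dist

# Two points near class A and one near class B.
D = learned_distance_matrix(np.array([[6.0, 0.0], [6.2, 0.1], [-3.0, 5.2]]))
print(D.round(2))
```

The learned function is not guaranteed to satisfy the triangle inequality, so it is safer to pass it to the clustering algorithm as a precomputed dissimilarity matrix than to rely on metric properties.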
Run the clustering algorithm on the unlabelled data to get clusters of samples.
If you also have one-vs-all classifiers for the known classes, run them on the samples.
Clusters where the samples tend not to belong to the known classes are the candidates for the new classes.
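Putting the last steps together, here is a minimal sketch assuming SciPy and scikit-learn. The toy data, the Euclidean distance (a stand-in for whatever distance you trust or have learned), the cut height `t=3.0` and the 0.5 "mostly unclaimed" cutoff are all my assumptions and would need tuning on real data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
known = {"A": (6.0, 0.0), "B": (-3.0, 5.2), "C": (-3.0, -5.2)}

# Labelled toy data from the known classes only.
X_lab = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in known.values()])
y_lab = np.repeat(list(known), 30)

# Unlabelled toy data: 20 samples per known class (rows 0-59) plus
# 20 samples from an unseen class near the origin (rows 60-79).
unl_centers = [*known.values(), (0.0, 0.0)]
X_unl = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in unl_centers])

# One-vs-all classifiers; a sample is "claimed" if any of them accepts it.
classifiers = [LogisticRegression().fit(X_lab, y_lab == c) for c in known]
claimed = np.column_stack([clf.predict(X_unl) for clf in classifiers]).any(axis=1)

# Average-linkage hierarchical clustering on a condensed distance matrix,
# cutting the tree at height 3.0.
Z = linkage(pdist(X_unl, metric="euclidean"), method="average")
cluster_ids = fcluster(Z, t=3.0, criterion="distance")

# A cluster whose samples are mostly unclaimed is a candidate new class.
candidates = [
    int(k)
    for k in np.unique(cluster_ids)
    if claimed[cluster_ids == k].mean() < 0.5
]
print("candidate new-class clusters:", candidates)
```

Each candidate cluster can then be inspected, or a few of its samples labelled by hand, to decide whether it really is a new class such as D or E.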