First, let me say that I am a bit out of my depth here, so if this question needs to be re-phrased or closed as a duplicate, please let me know. It may simply be that I don’t have the proper vocabulary to express my question.
I am working on an image processing task in which I identify features in an image, and then classify them based on their properties, including shape, size, darkness, etc. I’m quite experienced with the image processing portion of this, but think I could improve the methods I use for classification of the features.
Right now, I set thresholds for each of the parameters measured, and then classify features according to some simple logic based on which thresholds the feature has crossed. For example (the actual properties and groupings are more complex, but I’m trying to simplify irrelevant portions of my project for this question), let’s say I’m grouping features into the groups “Big and Dark,” “Big and Light” and “Small”. Then a feature A will be in “Big and Dark” iff (size(A)>sizeThreshold) & (darkness(A)>darknessThreshold).
The goal is for the classification to agree with the classification done by an expert-level human, so I can set the thresholds to produce the best match between the groupings made by human and computer on some test set, and then hope that the classification works well with new data.
This is already working pretty well, but I see one particular failure mode which I think may be fixable. Let’s say feature A is known to belong to “Big and Dark.” The human classified it this way because, while it was just barely big enough, it was very, very dark, which made up somewhat for the lack of “bigness.” My algorithm would fail to classify this feature properly, because classification is based on rigid binary logic, and requires all thresholds to be crossed.
I would like to address this failure mode by making my algorithm better mimic the human-guided process, in which a deficiency in one parameter can be compensated for by an abundance of another. To do this, I would like to take each of the base properties of my features and convert them into some sort of score which would be a predictor of the group to which the feature belongs. I have thought of many ways of doing this, but they are mostly ad hoc ideas based on my background in vector calculus and physics. For example, I’ve considered treating each feature as a vector in the N-dimensional space of feature properties, and calculating the projection of each feature along certain vectors, each of which would measure the degree to which a feature belongs in the group.
I am sure there is a more rigorous and better established technique for doing this sort of thing, but my background is relatively weak in statistical analysis, so I’m looking for a shove in the right direction. Even the name of a technique, or a link to a textbook would be helpful.
What techniques are useful in classifying objects based on a large number of descriptive parameters?
It sounds like any linear classifier will do what you need. Suppose you have $N$ features and the value of feature $i$ is $f_i$. Then a linear classifier will compute a score
$$s = \sum_i w_i f_i + o$$
(where $o$ is an offset). Then, if $s > t$ (where $t$ is some threshold), the feature belongs to a class (a group), and if $s < t$, it doesn't. Note that there is a single threshold applied to the entire score (rather than to individual feature values), so indeed a deficiency in one parameter can be compensated for by an abundance in another. The weights are intuitively interpretable: the higher the weight $w_i$, the more important (or more decisive) that feature is.
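To make the compensation effect concrete, here is a minimal sketch of that score; the weights, offset, and threshold are invented illustrative values, not taken from your data:

```python
import numpy as np

# Illustrative values only: weights for (size, darkness), an offset, and
# a single threshold applied to the whole score.
w = np.array([0.8, 1.2])
o = -1.0
t = 0.5

def linear_score(f):
    """Compute s = sum_i w_i * f_i + o for a feature vector f."""
    return np.dot(w, f) + o

# A feature that is only barely "big" (0.6) but very dark (1.5) can
# still clear the single threshold, because darkness compensates:
f = np.array([0.6, 1.5])
s = linear_score(f)        # 0.8*0.6 + 1.2*1.5 - 1.0 = 1.28
in_class = s > t           # True
```

With per-feature thresholds this example would have been rejected for size; with one threshold on the summed score it is accepted.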
There are a lot of off-the-shelf linear classifiers that can do that, including SVM, LDA (linear discriminant analysis), linear neural networks, and many others. I'd start by running linear SVM because it works well in a lot of cases and can tolerate limited training data. There are also a lot of packages in many environments (like Matlab and R), so you can easily try it. The downside of SVM is that it can be computationally heavy, so if you need to learn a lot of classes, it might be less appropriate.
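For instance, with scikit-learn in Python, a linear SVM takes only a few lines. The toy data below (rows of (size, darkness) with human-assigned labels) is invented purely for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training data: each row is (size, darkness); label 1
# means a human put the feature in "Big and Dark", 0 means anything else.
X = np.array([
    [0.9, 0.9], [1.1, 1.2], [0.6, 1.5],   # big-and-dark examples
    [0.2, 0.3], [0.4, 0.1], [0.3, 0.5],   # everything else
])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LinearSVC()      # learns the weights and offset from the data
clf.fit(X, y)

# clf.coef_ holds the learned weights, clf.intercept_ the offset.
pred = clf.predict(np.array([[0.65, 1.4]]))   # barely big, very dark
```

The learned weights and intercept play exactly the roles of the $w_i$ and $o$ above, so you can inspect them to see which properties the classifier found decisive.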
If you want to preserve some of the threshold behavior you currently have, you can pass the feature values through a sigmoid with the threshold in the right location. E.g. for a feature $i$ for which you currently use a threshold of $t_i$, first compute
$$g_i = \frac{1}{1 + e^{-a_i (f_i - t_i)}}$$
(where $a_i$ controls the steepness of the transition) and then learn a linear classifier using the $g$'s rather than the $f$'s. This way, the compensating behavior will only happen near the threshold, and things that are too far away from the threshold cannot be compensated for (which is sometimes desirable).
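A short sketch of that squashing step, assuming per-feature thresholds and a made-up slope parameter that controls how sharp the transition is:

```python
import numpy as np

# Illustrative per-feature thresholds for (size, darkness) and a slope.
# A larger slope makes the sigmoid behave more like a hard threshold.
t = np.array([1.0, 0.5])
a = 5.0

def squash(f):
    """g_i = 1 / (1 + exp(-a * (f_i - t_i))), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a * (f - t)))

g_near = squash(np.array([1.05, 0.55]))  # just above both thresholds: ~0.56
g_far  = squash(np.array([3.00, 0.55]))  # far above the size threshold: ~1.0
```

Values near a threshold land in the steep middle of the sigmoid, where trading off against other features is possible; values far past it saturate near 0 or 1 and can no longer be compensated for.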
Another thing that you could try is to use probabilistic classifiers like Naive Bayes or TAN (tree-augmented Naive Bayes). Naive Bayes is almost like a linear classifier, except it computes
$$s = \sum_i \log P(f_i \mid \text{class}) + \log P(\text{class}),$$
so there is still a sum of weights. These weights depend on the feature values $f_i$, but not by multiplication like in a usual linear classifier. The score in this case is the log-probability, and the weights are the contributions of the individual features to that log-probability. The disadvantage of using this in your case is that you will need many bins for your feature values, and then learning may become difficult. There are ways around that (for example, using priors), but since you have no experience with this, it might be more difficult.
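One way to sidestep the binning issue (a variation on the idea above, not the only option) is a Gaussian Naive Bayes, which models each feature as a normal distribution per class instead of histogram bins. The toy data here is the same kind of invented example as before:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented toy data: rows are (size, darkness), label 1 = "Big and Dark".
X = np.array([
    [0.9, 0.9], [1.1, 1.2], [0.6, 1.5],
    [0.2, 0.3], [0.4, 0.1], [0.3, 0.5],
])
y = np.array([1, 1, 1, 0, 0, 0])

nb = GaussianNB()
nb.fit(X, y)

# predict_log_proba exposes the log-probability score: a sum of
# per-feature log-likelihood terms plus the class log-prior.
query = np.array([[0.65, 1.4]])
log_probs = nb.predict_log_proba(query)
pred = nb.predict(query)
```

Because each feature contributes an additive log-likelihood term, a very dark feature can again make up for being only barely big.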
Regarding terminology: what you called 'test set' is usually called a 'training set' in this context, and what you called 'new data' is called the 'test set'.
For a book, I'd read "Pattern Classification" by Duda, Hart, and Stork. The first chapter is a very good introduction for beginners.