I understand that one-class SVMs (OSVMs) were proposed with the absence of negative data in mind and that they seek to find decision boundaries that separate a positive set and some negative anchor point, say the origin.

A work in 2011 proposes Exemplar SVMs (ESVMs) which trains a “single per-category classifier” which claims to be different from OSVMs, in that ESVMs do not “require mapping the exemplars into a common feature space over which a similarity kernel can be computed”. I do not quite understand what this means and how ESVMs differ from OSVMs. And so, how do they differ? And how is this similarity kernel computation avoided in ESVMs?

**Answer**

(You may want to look at the “table” below first)

Let’s start with the “classic” support vector machines. These learn to discriminate between two categories. You collect some examples of category A, some of category B and pass them both to the SVM training algorithm, which finds the line/plane/hyperplane that best separates A from B. This works–and it often works quite well–when you want to distinguish between well-defined and mutually exclusive classes: men vs. women, the letters of the alphabet, and so on.

However, suppose you want to identify “A”s instead. You could treat this as a classification problem: How do I distinguish “A”s from “not-A”s. It is fairly easy to gather up a training set consisting of pictures of dogs, but what should go into your training set of not-dogs? Since there are an infinite number of things that are not dogs, you might have a difficult time constructing a comprehensive and yet representative training set of all non-canine things. Instead, you might consider using a one-class classifier. The traditional, two-class classifier finds a (hyper)plane that separates A from B. The one-class SVM instead finds the line/plane/hyperplane that separates all of the in-class points (the “A”s) from origin; it is essentially a two-class SVM where the origin is the only member of the second class (finding the maximum margin from the origin is pretty similar to finding the smallest sphere that contains all As, which might make more conceptual sense).

The Ensemble SVM “system” is actually a collection of many two-class SVM “subunits”. Each subunit is trained using a **single** positive example for one class and an enormous collection of negative examples for the other. Thus, instead of discriminating dogs vs. not-dog examples (standard two-class SVM), or dogs vs. origin (one-class SVM), each subunit discriminates between specific dog (e.g., “Rex”) and many not-dog examples. Individual subunit SVMs are trained for each example of the positive class, so you would have one SVM for Rex, another for Fido, yet another for your neighbour’s dog that barks at 6am, and so on. The outputs of these subunit SVMs are calibrated and combined to determine whether a dog, not just one of the specific exemplars, appears in the test data. I guess you could also think of the individual subnits as somewhat like one-class SVMs, where the coordinate space is shifted so that the single positive example lies at the origin.

In summary, the key differences are:

### Training Data

- Two class SVM: Positive and negative examples
- One class SVM: Positive examples
*only* - Ensemble SVM “system”: Positive and negative examples. Each subunit is trained on a single positive example and many negative examples.

### Number of Machines

- Two class SVM: one
- One class SVM: one
- Ensemble SVM “system”: many (one subunit machine per positive example)

### Examples per class (per machine)

- Two class SVM: many/many
- One class SVM: many/one (fixed at the origin)
- Ensemble SVM “system”: many/many
- Ensemble SVM “subunit”: one/many

### Post-processing

- Two class SVM: Not necessary
- One class SVM: Not necessary
- Ensemble SVM: Needed to fuse each SVM’s output into a class-level prediction.

Postscript: You had asked what they mean by “[other approaches] require mapping the exemplars into a common feature space over which a similarity kernel can be computed.” I think they mean that a traditional two-class SVM operates under the assumption that all members of class are somehow similar, and so you want to find a kernel that places great danes and dachsunds near each other, but far away from everything else. By contrast, the ensemble SVM system sidesteps this by calling something a dog if it’s sufficiently great dane-like OR dachsund-like OR poodle-like, without worrying about the relationship between those exemplars.

**Attribution***Source : Link , Question Author : bjou , Answer Author : Matt Krause*