# Hinge loss with one-vs-all classifier

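I’m currently looking at the unconstrained primal form of the one-vs-all classifier

$$\min_{W} \sum_{i=1}^{N_I} \sum_{k \neq y_i} L(1+\mathbf{w_k}\cdot\mathbf{x_i}-\mathbf{w_{y_i}}\cdot\mathbf{x_i})$$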

where

$N_I$ is the number of instances,
$N_K$ is the number of classes,
$N_F$ is the number of features,
$X$ is an $N_I \times N_F$ data matrix,
$y$ is a vector of class labels,
$W$ is an $N_K \times N_F$ matrix where each row corresponds to the weights for the hyperplane splitting one class from the rest,
$L$ is some arbitrary loss function.

My understanding is that the functional above tries to find a hyperplane for each class that maximizes the distance between the samples within the associated class and all other samples. If the hyperplanes are correctly positioned then $\mathbf{w_k}\cdot\mathbf{x_i}$ (for $k \neq y_i$) should always be negative, $\mathbf{w_{y_i}}\cdot\mathbf{x_i}$ should always be positive and our loss function should come back fairly low.

I’m trying to implement this using the hinge loss which I believe in this case will end up being

$\max(0,1+\mathbf{w_k}\cdot\mathbf{x_i}-\mathbf{w_{y_i}}\cdot\mathbf{x_i})$.

However, in the above, couldn’t we end up with a situation where the hyperplanes classify all samples as belonging to every class? For example, if we are looking at the hyperplane separating class 1 from all other classes, then provided that $1+\mathbf{w_k}\cdot\mathbf{x_i}<\mathbf{w_{y_i}}\cdot\mathbf{x_i}$ the incurred loss will be 0 despite $\mathbf{x_i}$ being classified as the wrong class.

Where have I gone wrong? Or does it not matter whether $\mathbf{w_k}\cdot\mathbf{x_i}$ is negative or positive provided that $\mathbf{w_{y_i}}\cdot\mathbf{x_i}$ ends up with a higher score? I have a feeling that my use of the hinge function as I've described here is incorrect but my use of Google today has only led to more confusion.

On a related note, why is there a 1 in the functional above? I would think that it would have little impact.

Your post seems to be mostly correct.

The way that multiclass linear classifiers are set up is that an example, $x$, is assigned to the class whose hyperplane gives the highest score: $\underset{k}{\mathrm{argmax}}\, w_k \cdot x$.
It doesn't matter if these scores are positive or negative.
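As a minimal sketch of this decision rule, the snippet below uses NumPy with made-up weights and data (the `predict` helper, `W`, and `X` are illustrative assumptions, not part of the question). The first example gets a negative score from every class and is still assigned to the least negative one.

```python
import numpy as np

def predict(W, X):
    """Return, for each row of X, the index of the class with the highest score."""
    scores = X @ W.T  # shape (N_I, N_K): one score per example per class
    return np.argmax(scores, axis=1)

# Hypothetical weights: 3 classes (rows), 2 features (columns).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Two hypothetical examples: the first scores negatively for every class,
# the second scores positively for every class.
X = np.array([[-1.0, -2.0],
              [ 2.0,  1.0]])

scores = X @ W.T
labels = predict(W, X)
```

Only the ordering of the scores within each row affects `labels`; their signs play no role.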

If the hinge loss for a particular example is zero, then this means that the example is correctly classified.
To see this, the hinge loss will be zero when $1+w_{k}\cdot x_i \le w_{y_i}\cdot x_i \;\forall k \neq y_i$. This is a stronger condition than $w_{k}\cdot x_i < w_{y_i}\cdot x_i \;\forall k \neq y_i$, which would indicate that example $i$ was correctly classified as $y_i$.
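The relationship between zero loss and correct classification can be checked numerically. The following is a rough sketch, assuming the per-example loss is the sum of the hinge terms over all $k \neq y_i$; the function name `multiclass_hinge` and the toy data are hypothetical.

```python
import numpy as np

def multiclass_hinge(W, X, y, margin=1.0):
    """Per-example loss: sum over k != y_i of max(0, margin + w_k.x_i - w_{y_i}.x_i)."""
    idx = np.arange(len(y))
    scores = X @ W.T                       # (N_I, N_K)
    correct = scores[idx, y]               # score of the true class, shape (N_I,)
    terms = np.maximum(0.0, margin + scores - correct[:, None])
    terms[idx, y] = 0.0                    # drop the k == y_i term
    return terms.sum(axis=1)

# Hypothetical 2-class, 2-feature problem.
W = np.array([[2.0, 0.0],
              [0.0, 2.0]])
X = np.array([[1.0, 0.0],    # far on the class-0 side
              [0.3, 0.0],    # on the class-0 side, but inside the margin
              [0.0, 1.0]])   # far on the class-1 side
y = np.array([0, 0, 1])

loss = multiclass_hinge(W, X, y)
pred = np.argmax(X @ W.T, axis=1)
```

In this toy data every example with zero loss is predicted correctly, while the second example is also predicted correctly but still incurs a positive loss because its score gap is smaller than the margin. Zero loss is therefore sufficient for a correct prediction, not necessary.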

The 1 in the hinge loss is related to the "margin" of the classifier.

The hinge loss encourages scores from the correct class, $w_{y_i}\cdot x_i$, to not only be higher than scores from all the other classes, $w_k\cdot x_i$, but to be higher than these scores by an additive margin.

We can use the value 1 for the margin because the distance of a point from a hyperplane is scaled by the magnitude of the linear weights: $\frac{w}{\|w\|}\cdot x$ is the signed distance of $x$ from the hyperplane (through the origin) with normal vector $w$.
Since the weights are the same for all points in the dataset, the particular constant is arbitrary; it only matters that the same constant, 1, is used for every data point.
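One way to make this concrete, as a sketch assuming a generic margin $\Delta > 0$ (a symbol introduced here for illustration) and no separate penalty on the size of the weights:

$$\max\left(0,\ \Delta + w_k\cdot x_i - w_{y_i}\cdot x_i\right) = \Delta\,\max\left(0,\ 1 + \frac{w_k}{\Delta}\cdot x_i - \frac{w_{y_i}}{\Delta}\cdot x_i\right)$$

So a margin of $\Delta$ with weights $W$ gives the same loss, up to the constant factor $\Delta$, as a margin of 1 with weights $W/\Delta$. Both sets of weights produce the same $\mathrm{argmax}$, so the two problems pick out the same classifier.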

Also, it may make things easier to understand if you parameterize the loss function as $L(x,y;w)$. You currently have the loss function written as a function of the linear margin only, and this is not necessarily the case.