My understanding of SVM is that it's very similar to logistic regression (LR): a weighted sum of features is passed to the sigmoid function to get a probability of belonging to a class, but instead of the cross-entropy (logistic) loss, training is performed using the hinge loss. The benefit of the hinge loss is that various numerical tricks make kernelisation more efficient. A drawback, however, is that the resulting model carries less information than a corresponding LR model could. For example, without kernelisation (i.e. with a linear kernel) the SVM decision boundary would still lie where LR would output a probability of 0.5, but one cannot tell how quickly the probability of belonging to a class decays away from that boundary.
My two questions are:
- Is my interpretation above correct?
- How does using the hinge loss make it invalid to interpret SVM results as probabilities?
An SVM does not feed anything into a sigmoid function. It fits a separating hyperplane to the data that tries to put all training points of one class on one side and all points of the other class on the other side, and then assigns a class based on which side your feature vector falls. More formally, if we denote the feature vector as x, the hyperplane coefficients as β, and the intercept as β0, then the class assignment is y = sign(β⋅x + β0). Solving an SVM amounts to finding the β, β0 that minimize the hinge loss while achieving the greatest possible margin. Because an SVM only cares about which side of the hyperplane you are on, its class assignments cannot be transformed into probabilities.
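As a minimal sketch of that class-assignment rule (the coefficient values here are made up for illustration, not from any fitted model):

```python
# Hypothetical 2-D hyperplane: coefficients beta and intercept beta0.
beta = [2.0, -1.0]
beta0 = 0.5

def svm_class(x):
    """Assign a class purely by which side of the hyperplane x falls on:
    y = sign(beta . x + beta0). The score's magnitude is discarded."""
    score = sum(b * xi for b, xi in zip(beta, x)) + beta0
    return 1 if score >= 0 else -1

print(svm_class([1.0, 1.0]))   # score = 2 - 1 + 0.5 = 1.5  -> class 1
print(svm_class([-1.0, 1.0]))  # score = -2 - 1 + 0.5 = -2.5 -> class -1
```

Note that a point with score 1.5 and a point with score 150 get exactly the same output; that thrown-away magnitude is what a probability would have needed.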
In the case of a linear SVM (no kernel), the decision boundary will be similar to that of a logistic regression model, but it may vary depending on the regularization strength used to fit the SVM. Because the SVM and LR solve different optimization problems, you are not guaranteed identical decision boundaries.
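The difference between the two objectives can be made concrete by comparing the losses as a function of the margin m = y(β⋅x + β0); this is a small illustrative sketch, not code from either model:

```python
import math

def hinge(m):
    """Hinge loss: max(0, 1 - m)."""
    return max(0.0, 1.0 - m)

def logistic_loss(m):
    """Logistic (cross-entropy) loss: log(1 + exp(-m))."""
    return math.log(1.0 + math.exp(-m))

# The hinge loss is exactly zero for any margin >= 1, so the SVM
# objective cannot distinguish a point just past the margin from one
# far beyond it. The logistic loss keeps decreasing with the margin,
# which is what lets LR map distances to probabilities.
for m in [0.5, 1.0, 2.0, 5.0]:
    print(f"m={m}: hinge={hinge(m):.4f}, logistic={logistic_loss(m):.4f}")
```

The flat region of the hinge loss is one way to see why the fitted SVM retains no calibrated notion of "how far" a point is from the boundary.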