Can anybody please clarify what a surrogate loss function is? I'm familiar with what a loss function is, and that we want a loss that is convex and differentiable, but I don't understand the theory behind how you can use a surrogate loss function and actually trust its results.

**Answer**

In the context of learning, say you have a classification problem with data set $\{(X_1,Y_1),\dots,(X_n,Y_n)\}$, where the $X_i$ are your features and the $Y_i$ are your true labels.

Given a hypothesis function $h(x)$, the loss function $l:(h(X_i),Y_i)\to\mathbb{R}$ takes the hypothesis function's prediction $h(X_i)$ together with the true label $Y_i$ for that input and returns a penalty. A general goal is to find a hypothesis that minimizes the empirical risk, that is, the average penalty over the data set (for a loss that counts mistakes, the fraction of examples it gets wrong):

$$R_l(h)=\mathbb{E}_{\text{empirical}}\left[l(h(X),Y)\right]=\frac{1}{n}\sum_{i=1}^{n} l(h(X_i),Y_i)$$

In the case of binary classification, a common loss function that is used is the 0–1 loss function:

$$l(h(X),Y)=\begin{cases}0 & \text{if } Y=h(X)\\ 1 & \text{otherwise}\end{cases}$$
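To make the empirical risk under the 0–1 loss concrete, here is a minimal sketch in Python with NumPy. The labels and predictions are made-up toy values, and the function names `zero_one_loss` and `empirical_risk` are my own, not from any library.

```python
import numpy as np

def zero_one_loss(pred, y):
    """Per-example 0-1 loss: 0 where the prediction equals the label, 1 otherwise."""
    return (np.asarray(pred) != np.asarray(y)).astype(float)

def empirical_risk(loss, pred, y):
    """Average the per-example loss over the whole data set."""
    return float(np.mean(loss(pred, y)))

# Hypothetical toy labels and predictions, encoded as -1 / +1
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

# Two of the five predictions are wrong, so the risk is the fraction 2/5
risk = empirical_risk(zero_one_loss, y_pred, y_true)
print(risk)
```

With the 0–1 loss, the empirical risk is simply the misclassification rate on the data set.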

In general, the loss function that we actually care about cannot be optimized efficiently. The 0–1 loss, for example, is discontinuous and piecewise constant, so its gradient is zero wherever it exists and gives gradient-based methods no direction to move in. So we optimize another loss function that makes our life easier, which we call the **surrogate loss function**.
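The difficulty with the 0–1 loss can be seen numerically. The sketch below assumes a hypothetical one-dimensional data set and a classifier of the form `sign(w*x + b)` (both are my own illustration, not part of the question). It shows that the empirical 0–1 risk only takes a handful of values as `b` varies, and that a finite-difference estimate of its slope is zero.

```python
import numpy as np

# Hypothetical 1-D toy data with labels in {-1, +1}
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1, -1, 1, -1, 1])

def zero_one_risk(w, b):
    """Empirical 0-1 risk of the classifier that predicts +1 when w*x + b > 0."""
    pred = np.where(w * x + b > 0, 1, -1)
    return float(np.mean(pred != y))

w, b, eps = 1.0, 0.0, 1e-4
base = zero_one_risk(w, b)

# Central finite difference in b: a tiny shift flips no prediction,
# so the estimated slope is zero and carries no descent direction.
grad_b = (zero_one_risk(w, b + eps) - zero_one_risk(w, b - eps)) / (2 * eps)

# Sweeping b over a grid shows the risk is a step function with few levels.
levels = sorted({round(zero_one_risk(w, bb), 3) for bb in np.linspace(-4, 4, 81)})
print(base, grad_b, levels)
```

The risk only changes when the decision boundary crosses a data point, which is why a smooth or at least convex stand-in is attractive.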

An example of a surrogate loss function is the hinge loss used in SVMs, $\psi(h(x),y)=\max(0,\,1-y\,h(x))$, where $h(x)$ is now a real-valued score and the label is $y\in\{-1,+1\}$. It is convex and easy to optimize using conventional methods. This function acts as a proxy for the actual loss we wanted to minimize in the first place. Because the hinge loss is never smaller than the 0–1 loss of the prediction $\operatorname{sign}(h(x))$, a small hinge risk guarantees a small 0–1 risk, which is one reason its results can be trusted. A surrogate has its disadvantages, but in some cases it actually lets the classifier learn more. By this I mean that once your classifier reaches the best 0–1 risk it can achieve on the training data (i.e. its highest training accuracy), you can still see the surrogate loss decreasing, which means it is pushing the classes even further apart to improve robustness.
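The last point can be illustrated with a small experiment. This is only a sketch under stated assumptions: it uses the standard hinge loss `max(0, 1 - y*h(x))` with labels in {-1, +1}, a linear score `h(x) = w·x` without bias, a hypothetical linearly separable toy data set, and plain subgradient descent with a fixed step size. None of these choices come from the original answer.

```python
import numpy as np

# Hypothetical linearly separable toy data, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 2.0], [2.0, 2.0],
              [-1.0, -2.0], [-2.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

def hinge_risk(w):
    """Empirical risk under the hinge loss max(0, 1 - y * (w . x))."""
    margins = y * (X @ w)
    return float(np.mean(np.maximum(0.0, 1.0 - margins)))

def zero_one_risk(w):
    """Empirical 0-1 risk of the classifier that predicts +1 when w . x > 0."""
    pred = np.where(X @ w > 0, 1.0, -1.0)
    return float(np.mean(pred != y))

def hinge_subgradient(w):
    """A subgradient of the hinge risk: only examples with margin < 1 contribute."""
    margins = y * (X @ w)
    active = margins < 1.0
    return -(y[active, None] * X[active]).sum(axis=0) / len(y)

w = np.zeros(2)
lr = 0.05  # fixed step size, chosen arbitrarily for the illustration
history = []
for step in range(8):
    history.append((zero_one_risk(w), hinge_risk(w)))
    w = w - lr * hinge_subgradient(w)
history.append((zero_one_risk(w), hinge_risk(w)))

for step, (r01, rh) in enumerate(history):
    print(f"step {step}: 0-1 risk = {r01:.3f}, hinge risk = {rh:.3f}")
```

On this toy data the 0–1 risk drops to zero after the first update, while the hinge risk keeps shrinking for several more steps as the margins grow past 1.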

**Attribution** *Source: Link, Question Author: AZhao, Answer Author: The Pointer*