I’m learning about a binary classifier.

It uses the cross-entropy function as its loss function: $y_i \log p_i + (1-y_i) \log(1-p_i)$

But why does it use the log function?

Why not just use a linear form, as follows? $y_i p_i + (1-y_i)(1-p_i)$

Is there any advantage to using the log function?

And another question:

The log function maps $(0,1)$ to $(-\infty, 0)$.

So I think it can crash the algorithm if we get 0 for $p_i$ or $1-p_i$, because the log value would be $-\infty$ and back-prop would blow up.

**Answer**

For binary classification, one way to encode the probability of an output is $p^y(1-p)^{1-y}$, where $y$ is encoded as 0 or 1. This is the likelihood function, and its meaning is that the output is 1 with probability $p$ and 0 with probability $1-p$.

Now you have a sample and you want to find the $p$ which best fits your data. One way is to find the maximum likelihood estimator (MLE). If your observations are independent, the MLE is found by maximizing the likelihood over the whole sample, which is the product of the individual likelihoods: $\prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}$. A product like this is hard to work with, so one transforms the likelihood with logs. The transformation is monotonic, so the maximizer is unchanged, and the product becomes a sum, which is more tractable. Apply logs and you get your expression.
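To make the product-versus-sum point concrete, here is a minimal sketch in plain Python. The sample `ys`, the grid of candidate values, and the function names are all made up for illustration. It checks that the product and the log sum pick the same $p$ on a small sample, and that the raw product underflows on a larger one while the log sum stays finite.

```python
import math

def likelihood(p, ys):
    """Product of the per-observation terms p^y * (1-p)^(1-y)."""
    result = 1.0
    for y in ys:
        result *= p if y == 1 else 1.0 - p
    return result

def log_likelihood(p, ys):
    """Sum of y*log(p) + (1-y)*log(1-p); requires 0 < p < 1."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in ys)

# Hypothetical sample: 5 ones out of 8 observations.
ys = [1, 0, 1, 1, 0, 1, 1, 0]

# Candidate values of p strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]

# The log is monotonic, so both objectives pick the same p (the sample mean 5/8).
best_by_product = max(grid, key=lambda p: likelihood(p, ys))
best_by_log_sum = max(grid, key=lambda p: log_likelihood(p, ys))
print(best_by_product, best_by_log_sum)

# With 4000 observations the raw product underflows to 0.0 in double precision,
# while the sum of logs is still an ordinary finite number.
big_ys = ys * 500
underflowed = likelihood(0.625, big_ys)
finite_log = log_likelihood(0.625, big_ys)
print(underflowed, finite_log)
```

The grid search is only there to keep the sketch self-contained; in practice one would maximize the log-likelihood with a gradient-based optimizer.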

Why not use your encoding instead? I see no reason why you could not. The question is: what are the properties of your estimator? The first formulation uses the likelihood and the MLE, which has some theory behind it, including the fact that the estimator is efficient. The second formulation is not used often; I don't know of any example of encoding the probability like that, though that does not rule out your approach.
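One property that can be checked directly, at least in the simple setting used here with a single shared $p$: the linear objective $\sum_i y_i p + (1-y_i)(1-p)$ is linear in $p$, so it is maximized at an endpoint of the interval, while the log form is maximized at the sample mean. The sketch below illustrates this on a made-up sample; the names `linear_objective` and `log_objective` are just labels for this example, and it says nothing definitive about models where each $p_i$ depends on features.

```python
import math

def linear_objective(p, ys):
    """The proposed linear form: sum of y*p + (1-y)*(1-p)."""
    return sum(y * p + (1 - y) * (1 - p) for y in ys)

def log_objective(p, ys):
    """The log form: sum of y*log(p) + (1-y)*log(1-p); requires 0 < p < 1."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

# Hypothetical sample: 5 ones out of 8 observations.
ys = [1, 0, 1, 1, 0, 1, 1, 0]
sample_mean = sum(ys) / len(ys)

# Candidate values of p strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]

# The linear form equals 3 + 2p on this sample, so it prefers the largest p available.
p_linear = max(grid, key=lambda p: linear_objective(p, ys))

# The log form peaks at the observed fraction of ones.
p_log = max(grid, key=lambda p: log_objective(p, ys))

print(p_linear, p_log, sample_mean)
```

On this sample the linear form pushes the estimate to the edge of the grid, because ones are the majority class, whereas the log form recovers the observed frequency.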

**Attribution**

*Source: Link, Question Author: Viridisjun, Answer Author: rapaio*