Is it okay to use cross entropy loss function with soft labels?

I have a classification problem where pixels will be labeled with soft labels (which denote probabilities) rather than hard 0/1 labels. Earlier, with hard 0/1 pixel labels, the cross entropy loss function (SigmoidCrossEntropyLossLayer from Caffe) gave decent results. Is it okay to use the sigmoid cross entropy loss layer (from Caffe) for this soft classification problem?

The answer is yes, but you have to define it the right way.

Cross entropy is defined on probability distributions, not on single values. For discrete distributions $p$ and $q$, it’s:

$$H(p, q) = -\sum_y p(y) \log q(y)$$

When the cross entropy loss is used with ‘hard’ class labels, what this really amounts to is treating $p$ as the conditional empirical distribution over class labels. This is a distribution where the probability is 1 for the observed class label and 0 for all others. $q$ is the conditional distribution (probability of class label, given input) learned by the classifier. For a single observed data point with input $x_0$ and class $y_0$, we can see that the expression above reduces to the standard log loss (which would be averaged over all data points):

$$H(p, q) = -\sum_y I\{y = y_0\} \log q(y \mid x_0) = -\log q(y_0 \mid x_0)$$

Here, $I\{\cdot\}$ is the indicator function, which is 1 when its argument is true or 0 otherwise (this is what the empirical distribution is doing). The sum is taken over the set of possible class labels.
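As a quick numerical check of this reduction, here is a minimal NumPy sketch. The function name `cross_entropy` and the example probabilities are made up for illustration; it computes $-\sum_y p(y) \log q(y)$ and compares the result for a one-hot $p$ against $-\log q(y_0 \mid x_0)$.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy -sum_y p(y) * log q(y) for discrete distributions p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Skip terms where p(y) == 0, using the convention 0 * log(q) = 0.
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

# Hypothetical classifier output q(y | x_0) over three classes.
q = np.array([0.2, 0.7, 0.1])

# Observed hard label y_0 = 1, written as an empirical (one-hot) distribution.
y0 = 1
p_hard = np.zeros(3)
p_hard[y0] = 1.0

# Standard log loss for the same data point.
log_loss = -np.log(q[y0])

print(cross_entropy(p_hard, q), log_loss)  # the two values agree
```

The same function accepts any valid distribution for `p`, not only a one-hot vector.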

In the case of ‘soft’ labels like you mention, the labels are no longer class identities themselves, but probabilities over two possible classes. Because of this, you can’t use the standard expression for the log loss. But, the concept of cross entropy still applies. In fact, it seems even more natural in this case.

Let’s call the class $y$, which can be 0 or 1. And, let’s say that the soft label $s(x)$ gives the probability that the class is 1 (given the corresponding input $x$). So, the soft label defines a probability distribution:

$$p(y \mid x) = \begin{cases} s(x) & \text{if } y = 1 \\ 1 - s(x) & \text{if } y = 0 \end{cases}$$

The classifier also gives a distribution over classes, given the input:

$$q(y \mid x) = \begin{cases} c(x) & \text{if } y = 1 \\ 1 - c(x) & \text{if } y = 0 \end{cases}$$

Here, $c(x)$ is the classifier’s estimated probability that the class is 1, given input $x$.

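The task is now to determine how different these two distributions are, using the cross entropy. Plug these expressions for $p$ and $q$ into the definition of cross entropy, above. The sum is taken over the set of possible classes $\{0, 1\}$:

$$\begin{aligned} H(p, q) &= -p(y=0 \mid x) \log q(y=0 \mid x) - p(y=1 \mid x) \log q(y=1 \mid x) \\ &= -(1 - s(x)) \log(1 - c(x)) - s(x) \log c(x) \end{aligned}$$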

That’s the expression for a single, observed data point. The loss function would be the mean over all data points. Of course, this can be generalized to multiclass classification as well.
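For concreteness, here is a minimal NumPy sketch of that loss, assuming `s` holds the soft labels and `c` holds the classifier's predicted probabilities of class 1 (so values already passed through a sigmoid, not raw scores). The function name and the `eps` clipping constant are illustrative choices, not part of Caffe or any other library.

```python
import numpy as np

def soft_label_cross_entropy(s, c, eps=1e-12):
    """Mean binary cross entropy between soft labels s and predictions c.

    s: soft labels, each the probability that the class is 1.
    c: classifier's estimated probability that the class is 1.
    eps: assumed small constant that keeps log() away from 0.
    """
    s = np.asarray(s, dtype=float)
    c = np.clip(np.asarray(c, dtype=float), eps, 1.0 - eps)
    # Per-point cross entropy over the two classes {0, 1}.
    per_point = -(s * np.log(c) + (1.0 - s) * np.log(1.0 - c))
    # The loss is the mean over all data points.
    return per_point.mean()

# Made-up soft labels and predictions for four pixels.
s = np.array([0.9, 0.2, 0.5, 1.0])
c = np.array([0.8, 0.3, 0.5, 0.95])
print(soft_label_cross_entropy(s, c))
```

When every entry of `s` is exactly 0 or 1, this is the usual hard-label log loss. For a fixed soft label, the per-point loss is smallest when the prediction equals the label, and that minimum is the entropy of the label rather than zero.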