What is the auxiliary loss mentioned in the PSPNet (Pyramid Scene Parsing Network) paper?
I’m quoting the relevant part of the paper below:
An example of our deeply supervised ResNet101 
model is illustrated in Fig. 4. Apart from the main branch
using softmax loss to train the final classifier, another classifier
is applied after the fourth stage, i.e., the res4b22
residue block. Different from relay backpropagation 
that blocks the backward auxiliary loss to several shallow
layers, we let the two loss functions pass through all previous
layers. The auxiliary loss helps optimize the learning
process, while the master branch loss takes the most responsibility.
We add weight to balance the auxiliary loss.
My question is: how does this auxiliary loss work, and how does it help the training process? What role does it play in the network?
The idea of auxiliary loss (aka auxiliary towers) comes from the GoogLeNet paper. The core intuition can be explained this way:
Let’s say you are building a network by stacking up lots of identical modules. As the network becomes deeper, training slows down because of the vanishing gradient problem (this was before the BatchNorm days). To promote learning in each module, you can attach a small network to the output of that module. This network typically has a couple of conv layers followed by FC layers and then a final classification prediction. The auxiliary network’s task is to predict the same label as the final network would, but using only that module’s output. The loss of this aux network is multiplied by some weight < 1 and added to the final loss of the entire network. For example, in the GoogLeNet architecture diagram, the two tower-like aux networks appear on the right, ending in orange nodes.
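To make the weighting concrete, here is a minimal sketch in plain Python of how the two losses are combined. The function names, the example logits, and the `aux_weight=0.4` default are assumptions made for illustration; the only requirement stated above is that the weight is less than 1. It is not taken from either paper’s code.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a single example, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def total_loss(main_logits, aux_logits, label, aux_weight=0.4):
    """Main loss plus the down-weighted auxiliary loss (used during training only)."""
    main = softmax_cross_entropy(main_logits, label)
    aux = softmax_cross_entropy(aux_logits, label)
    return main + aux_weight * aux

# Hypothetical logits for a 3-class problem where the true class is 0.
main_logits = [2.0, 0.5, -1.0]  # output of the final classifier
aux_logits = [0.2, 0.1, 0.0]    # output of the aux head on an intermediate module

train_loss = total_loss(main_logits, aux_logits, label=0)
# At inference the aux head is discarded, so only the main branch is evaluated.
eval_loss = softmax_cross_entropy(main_logits, 0)
```

Both heads are scored against the same label. The aux term only changes what is minimized during training and does not affect the final prediction.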
Now, if a module is learning slowly, its aux network produces a large loss, and that loss sends a gradient directly into the module, which also helps the layers before it. This technique has apparently been found to help when training very deep networks. Even with batch norm, it can help accelerate training during the early epochs, when the weights are still close to their random initialization. Many NAS (neural architecture search) methods use this technique for initial evaluation during the search: the budget for epochs is very limited when evaluating 1000s of architectures, so early acceleration improves performance. As the aux networks are removed from the final model, this is not considered “cheating”.
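The gradient argument can be illustrated with a toy scalar network. This is only a sketch under made-up assumptions: the network, the squared-error losses, and all numbers are hypothetical, and a tiny `w2` stands in for a deep stack through which the gradient has nearly vanished.

```python
def grads_wrt_w1(x, target, w1, w2, a, aux_weight):
    """Analytic d(loss)/d(w1) contributions for a toy scalar network.

    h     = w1 * x   (early module)
    y     = w2 * h   (rest of the network; a tiny w2 mimics a vanishing gradient)
    y_aux = a * h    (aux head reading the early module's output)
    loss  = (y - target)**2 + aux_weight * (y_aux - target)**2
    """
    h = w1 * x
    y = w2 * h
    y_aux = a * h
    grad_main = 2 * (y - target) * w2 * x                  # path through the deep part
    grad_aux = aux_weight * 2 * (y_aux - target) * a * x   # short path through the aux head
    return grad_main, grad_aux

g_main, g_aux = grads_wrt_w1(x=1.0, target=1.0, w1=0.1, w2=1e-3, a=1.0, aux_weight=0.4)
# With these numbers the aux contribution to the early weight's gradient is
# hundreds of times larger in magnitude than the main-branch contribution.
```

The main-branch gradient is scaled by the small `w2`, while the aux head gives `w1` a short path to a loss, so the early module keeps receiving a usable update signal.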