The topology of the Google Inception model can be found here: Google Inception Network
I noticed that there are 3 softmax layers in this model (#154, #152, #145), and 2 of them look like some sort of early exit from the model.
As far as I know, a softmax layer is used for the final output, so why are there so many? What is the purpose of the other 2 layers?
Short answer: Deep architectures, and specifically GoogLeNet (22 layers), are at risk of the vanishing gradients problem during training (back-propagation algorithm). The engineers of GoogLeNet addressed this issue by adding classifiers at intermediate layers as well, so that the total loss is a combination of the intermediate losses and the final loss. This is why you see a total of three loss layers, instead of the usual single loss layer at the end of the network.
Longer answer: In classical machine learning, there is usually a distinction between feature engineering and classification. Neural networks are most famous for their ability to solve problems “end to end”, i.e., they combine the stages of learning a representation for the data and training a classifier. Therefore, you can think of a neural network with a standard architecture (for example, AlexNet) as being composed of a “representation learning” phase (all layers up to the second-to-last) and a “classification” phase, which, as expected, includes a loss function.
When creating deeper networks, a problem arises that is known as the “vanishing gradients” problem. It is actually not specific to neural networks; it applies to any gradient-based learning method. It is not that trivial and therefore deserves a proper explanation of its own; see here for a good reference. Intuitively, you can think of the gradients as carrying less and less information the deeper we go inside the network, which is of course a major concern, since we tune the network’s parameters (weights) based solely on the gradients, using the “back-prop” algorithm.
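To make the intuition concrete, here is a minimal sketch of how a gradient shrinks when it is multiplied through many layers. It is a toy model, not GoogLeNet: it assumes a chain of scalar sigmoid units with a shared weight, and the function names (`sigmoid`, `gradient_through_chain`) are made up for this illustration. Since the sigmoid derivative is at most 0.25, with a weight of 1 the gradient after `depth` units is bounded by 0.25 to the power `depth`.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gradient_through_chain(x, depth, weight=1.0):
    """Derivative of the chain's output with respect to its input x.

    The chain is `depth` scalar sigmoid units applied one after another,
    each computing sigmoid(weight * previous_activation). This is a toy
    setup chosen only to show the multiplicative shrinkage.
    """
    grad = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: local derivative of sigmoid(w * a) is w * s * (1 - s).
        grad *= weight * activation * (1.0 - activation)
    return grad


if __name__ == "__main__":
    for depth in (1, 2, 5, 10, 22):
        print(depth, gradient_through_chain(0.5, depth))
```

Running it shows the gradient magnitude dropping by roughly a factor of four or more with every extra unit, which is the effect the auxiliary classifiers are meant to counteract.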
How did the developers of GoogLeNet handle this problem? They recognized that it is not only the features of the final layers that carry discriminatory information: intermediate features are also capable of discriminating between different labels; and, most importantly, their values are more “reliable” since they are extracted from earlier layers in which the gradients carry more information. Building on this intuition, they added “auxiliary classifiers” at two intermediate layers. This is the reason for the “early escape” loss layers in the middle of the network that you referred to in your question.
The total loss is then a combination of these three loss layers. I quote from the original article:
These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.
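The combination described in the quote can be sketched roughly as follows. This is only an illustration in plain Python, not the authors' implementation: the function names (`softmax`, `cross_entropy`, `total_loss`) and the `training` flag are assumptions made for this example, and only the 0.3 discount weight comes from the paper.

```python
import math

AUX_WEIGHT = 0.3  # discount weight for the auxiliary losses, as quoted above


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability that the softmax assigns to the true label."""
    return -math.log(softmax(logits)[label])


def total_loss(main_logits, aux_logits_list, label, training=True):
    """Final-classifier loss plus discounted auxiliary losses.

    During training each auxiliary classifier contributes its loss scaled
    by AUX_WEIGHT; at inference time the auxiliary heads are ignored.
    """
    loss = cross_entropy(main_logits, label)
    if training:
        for aux_logits in aux_logits_list:
            loss += AUX_WEIGHT * cross_entropy(aux_logits, label)
    return loss


if __name__ == "__main__":
    main = [2.0, 0.5, -1.0]                      # hypothetical final-layer scores
    aux = [[1.0, 0.8, 0.1], [1.5, 0.2, -0.3]]    # hypothetical auxiliary scores
    print(total_loss(main, aux, label=0))
    print(total_loss(main, aux, label=0, training=False))
```

With two auxiliary heads, the training loss is the main loss plus 0.3 times each auxiliary loss, so the three softmax layers in the graph feed three loss terms, and only the final one matters at prediction time.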