Can someone give an intuition behind the dropout method used in convolutional neural networks?
What exactly is dropout doing?
As described in the paper introducing it, dropout proceeds like so:
- During training, randomly remove units from the network. Update the parameters as usual; weights attached to dropped-out units receive zero gradient for that training case and so are left unchanged.
The only difference is that for each training case in a mini-batch, we sample a thinned network by dropping out units. Forward and backpropagation for that training case are done only on this thinned network. […] Any training case which does not use a parameter contributes a gradient of zero for that parameter.
- At test time, account for this by rescaling:
If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time.
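The two steps above can be sketched in a few lines of NumPy. This is only an illustration under some assumptions, not the paper's code: the function names `dropout_train`, `dropout_backward`, and `dropout_test` are made up here, and the test-time rescaling is applied to a unit's output rather than to its outgoing weights, which gives the same input to the next layer because that layer is linear in the unit's output.

```python
import numpy as np


def dropout_train(activations, p_retain, rng):
    """Training: keep each unit with probability p_retain, zero the others.

    A fresh mask is drawn per element, so every training case in a
    mini-batch gets its own thinned network.
    """
    mask = rng.random(activations.shape) < p_retain
    return activations * mask, mask


def dropout_backward(grad_output, mask):
    """Backprop through dropout: dropped units pass back a zero gradient."""
    return grad_output * mask


def dropout_test(activations, p_retain):
    """Test: keep every unit, but scale its output by p_retain."""
    return activations * p_retain


rng = np.random.default_rng(0)      # fixed seed, only for reproducibility
h = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical hidden-layer outputs
p = 0.5                             # retention probability

thinned, mask = dropout_train(h, p, rng)
grad = dropout_backward(np.ones_like(h), mask)

# Average over many sampled thinned networks ...
samples = np.stack([dropout_train(h, p, rng)[0] for _ in range(20000)])
mc_mean = samples.mean(axis=0)

# ... and compare with the single rescaled pass used at test time.
test_out = dropout_test(h, p)       # [0.5, 1.0, 1.5, 2.0]
print(mc_mean, test_out)
```

The Monte Carlo mean over sampled masks should come out close to `test_out`, which is the "expected output equals actual output at test time" property from the quote. Many libraries instead divide by `p_retain` during training ("inverted dropout") so that nothing needs rescaling at test time; that is a different but equivalent convention from the one quoted here.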
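The intuition is that we’d ideally use the Bayes optimal classifier, which averages the predictions of every possible parameter setting weighted by its posterior probability, but doing that for a large model is computationally prohibitive. Dropout instead trains a very large number of thinned networks that share weights, and, per the paper, using the single full network with rescaled weights at test time is a simple approximation to averaging them that proves useful in practice. (See the paper for results on a variety of applications, including one with a convolutional architecture.)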