This is with reference to the paper Efficient Object Localization Using Convolutional Networks, and from what I understand the dropout there is implemented in 2D.
After reading the Keras code for Spatial 2D Dropout, I see that it essentially draws a random binary mask of shape [batch_size, 1, 1, num_channels]. But what exactly does this spatial 2D dropout do to an input convolution block of shape [batch_size, height, width, num_channels]?
My current guess is that, for each pixel, if any of that pixel's layers/channels has a negative value, then all channels of that one pixel are set to zero. Is this correct?
However, if my guess is correct, then how does a binary mask of shape [batch_size, height, width, num_channels], exactly matching the dimensions of the original input block, give the usual element-wise dropout (TensorFlow's original dropout implementation sets the shape of the binary mask to the shape of the input)? It would then mean that if any pixel in the conv block is negative, the entire conv block would be set to 0. This is the part I don't quite understand.
This response is a bit late, but I needed to address this myself and thought it might help.
Looking at the paper, it seems that in spatial dropout we randomly set entire feature maps (also known as channels) to 0, rather than individual 'pixels.'
Their reasoning makes sense: regular dropout does not work as well on images because adjacent pixels are highly correlated, so if you hide pixels at random, the network can still get a good idea of what they were just by looking at the neighboring pixels. Dropping entire feature maps is better aligned with the original intention of dropout.
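To see mechanically what the [batch_size, 1, 1, num_channels] mask from the question does, here is a small NumPy sketch (an illustration of the broadcasting, not the Keras code; the values are fixed by hand):

```python
import numpy as np

# Toy activation block: batch=1, height=2, width=2, channels=3
x = np.arange(12, dtype=np.float32).reshape(1, 2, 2, 3)

# Binary mask of shape [batch_size, 1, 1, num_channels], fixed by
# hand for illustration: drop channel 1, keep channels 0 and 2
mask = np.array([1.0, 0.0, 1.0]).reshape(1, 1, 1, 3)

# Broadcasting applies the same per-channel mask at every spatial
# position, so channel 1 is zeroed across the whole feature map
out = x * mask
```

So nothing depends on the sign of any pixel: the mask is drawn at random per (sample, channel) pair, and broadcasting just repeats it over height and width.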
Here's a function that implements it in TensorFlow, based on tf.nn.dropout. The only real change from tf.nn.dropout is that the shape of the dropout mask is BatchSize * 1 * 1 * NumFeatureMaps, as opposed to BatchSize * Width * Height * NumFeatureMaps:
def spatial_dropout(x, keep_prob, seed=1234):
    # x is a convnet activation with shape BxWxHxF, where F is the
    # number of feature maps for that layer
    # keep_prob is the proportion of feature maps we want to keep

    # get the batch size and number of feature maps
    num_feature_maps = [tf.shape(x)[0], tf.shape(x)[3]]

    # get some uniform noise between keep_prob and 1 + keep_prob
    random_tensor = keep_prob
    random_tensor += tf.random_uniform(num_feature_maps,
                                       seed=seed,
                                       dtype=x.dtype)

    # if we take the floor of this, we get a binary tensor where
    # a fraction (1 - keep_prob) of the values are 0 and the rest are 1
    binary_tensor = tf.floor(random_tensor)

    # reshape to multiply our feature maps by this tensor correctly
    binary_tensor = tf.reshape(binary_tensor,
                               [-1, 1, 1, tf.shape(x)[3]])

    # zero out feature maps where appropriate; scale up to compensate
    ret = tf.div(x, keep_prob) * binary_tensor
    return ret
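The same mask-and-scale logic can be checked in plain NumPy, independent of TensorFlow. This is just an illustration of the technique (the name spatial_dropout_np is mine, not an API):

```python
import numpy as np

def spatial_dropout_np(x, keep_prob, rng):
    # x: activations with shape [batch, height, width, channels]
    batch, _, _, channels = x.shape
    # uniform noise in [keep_prob, 1 + keep_prob); flooring it gives a
    # binary mask that keeps each feature map with probability keep_prob
    mask = np.floor(keep_prob + rng.uniform(size=(batch, 1, 1, channels)))
    # inverted dropout: scale kept maps by 1/keep_prob so the expected
    # value of each activation is unchanged
    return x / keep_prob * mask

rng = np.random.default_rng(1234)
x = np.ones((2, 4, 4, 8))
y = spatial_dropout_np(x, keep_prob=0.5, rng=rng)
# each channel ends up either all zeros or uniformly scaled by 1/keep_prob
```

Note that each feature map is dropped or kept as a whole; there is no per-pixel randomness inside a channel.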
Hope that helps!