Is it possible to have negative weights (after enough epochs) for deep convolutional neural networks when we use ReLU for all the activation layers?
Rectified Linear Units (ReLUs) only make the outputs of the neurons non-negative. The parameters of the network, however, can, and often will, become positive or negative during training, depending on the training data.
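You can see this with a minimal sketch (pure Python, no framework; the toy data, targets, and learning rate are all made up for illustration): a single ReLU unit whose inputs are non-negative, trained by SGD, still ends up with a negative weight whenever the target requires one.

```python
import random

def relu(z):
    return max(0.0, z)

random.seed(0)
# Toy data: inputs are non-negative (as if produced by a previous ReLU layer);
# the target relu(x1 - x2) is best fit with a NEGATIVE weight on x2.
data = [(random.random(), random.random()) for _ in range(200)]

w = [0.5, 0.5]   # both weights start positive
b = 0.0
lr = 0.1

for epoch in range(200):
    for x1, x2 in data:
        z = w[0] * x1 + w[1] * x2 + b
        out = relu(z)
        t = relu(x1 - x2)
        if z > 0:            # ReLU gate: gradient flows only when z > 0
            g = out - t      # d(0.5 * (out - t)^2) / d(out)
            w[0] -= lr * g * x1
            w[1] -= lr * g * x2
            b -= lr * g

print(w)  # the weight on x2 is driven below zero, despite all inputs being >= 0
```

Nothing in the ReLU constrains `w`; it only clamps the unit's output, so SGD is free to push any weight through zero.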
Here are two reasons I can think of right now that justify (intuitively) why some parameters would become negative:
the regularization of the parameters (a.k.a. weight decay): variation in the parameter values is what makes prediction possible, and for a given amount of variation, the $\ell_2$ norm (a standard regularizer) is smallest when the parameters are centered around zero (i.e., their mean is close to zero). So regularization pushes the weight distribution toward zero mean, which means roughly as many negative weights as positive ones.
although the gradient of a layer's output with respect to the layer's parameters depends on the input to the layer (which is always non-negative, assuming the previous layer passes its outputs through a ReLU), the gradient of the error (which is back-propagated from the layers closer to the final output) may be positive or negative, making it possible for SGD to drive some parameter values negative on the next gradient step. More specifically, let $I$, $O$, and $w$ denote the input, output, and parameters of a layer in a neural network, and let $E$ be the final error of the network induced by some training sample. The gradient of the error with respect to $w$ is

$$\frac{\partial E}{\partial w} = \left(\sum_{k=1}^{K} \frac{\partial E}{\partial O_k}\right) \cdot \frac{\partial O}{\partial w},$$

where $O_k = O$ for all $k$, because the layer's output is replicated to each of the $K$ units it feeds into (see picture below):
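The factored chain rule above can be checked numerically. The sketch below (scalar layer, made-up downstream targets purely for illustration) computes the gradient via the formula and confirms it against a finite-difference estimate; note that the sign of $\partial E/\partial w$ comes entirely from the upstream sum, since $\partial O/\partial w \geq 0$ when the input is non-negative.

```python
# Hypothetical scalar layer: input I >= 0 (post-ReLU), output O = relu(w * I),
# replicated to K = 3 downstream units with error E = sum_k 0.5 * (O - c_k)^2.
# The targets c_k are assumptions chosen only to make the numbers concrete.
I = 0.8
w = 0.5
targets = [0.1, 0.2, 0.0]

O = max(0.0, w * I)
dE_dOk = [O - c for c in targets]        # dE/dO_k for each replicated copy O_k = O
upstream = sum(dE_dOk)                   # sum_k dE/dO_k: may be positive or negative
dO_dw = I if w * I > 0 else 0.0          # dO/dw: always >= 0 because I >= 0
dE_dw = upstream * dO_dw                 # the factored chain rule from the text

# Finite-difference check of the factored formula
eps = 1e-6
def E(wv):
    Ov = max(0.0, wv * I)
    return sum(0.5 * (Ov - c) ** 2 for c in targets)

numeric = (E(w + eps) - E(w - eps)) / (2 * eps)
print(dE_dw, numeric)  # the two values agree
```

Here the upstream sum is positive, so the SGD update `w -= lr * dE_dw` decreases `w`; with a large enough upstream gradient (or enough such steps), `w` crosses zero, exactly as argued above.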