# How do CNNs avoid the vanishing gradient problem?

I have been reading a lot about convolutional neural networks and was wondering how they avoid the vanishing gradient problem. I know deep belief networks stack single-level auto-encoders or other pre-trained shallow networks and can thus avoid this problem, but I don’t know how it is avoided in CNNs.

According to Wikipedia:

> despite the above-mentioned “vanishing gradient problem,” the superior processing power of GPUs makes plain back-propagation feasible for deep feedforward neural networks with many layers.

I don’t understand why GPU processing would remove this problem. Why would faster hardware make any difference to the gradients themselves?

The vanishing gradient problem requires us to use small learning rates with gradient descent, which then needs many small steps to converge. This is a problem if you have a slow computer that takes a long time for each step. If you have a fast GPU that can perform many more steps in a day, it is less of a problem.
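To see why per-step speed matters, here is a minimal sketch (plain Python, a hypothetical one-dimensional quadratic loss of my choosing) counting how many gradient steps a small learning rate needs compared with a larger one:

```python
# Gradient descent on the toy loss L(w) = w^2, whose gradient is 2w.
# A smaller learning rate needs proportionally more steps to converge,
# so hardware that makes each step faster (e.g. a GPU) directly
# determines whether training with tiny steps is feasible.

def steps_to_converge(lr, w=1.0, tol=1e-3, max_steps=10**6):
    """Count gradient-descent steps until |w| < tol."""
    steps = 0
    while abs(w) >= tol and steps < max_steps:
        w -= lr * 2 * w  # one gradient step on L(w) = w^2
        steps += 1
    return steps

fast = steps_to_converge(lr=0.1)    # a few dozen steps
slow = steps_to_converge(lr=0.001)  # a few thousand steps
print(fast, slow)
```

The 100x smaller learning rate needs roughly 100x more steps on this toy loss, which is exactly the cost that faster hardware offsets.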

There are several ways to tackle the vanishing gradient problem. I would guess that the largest effect for CNNs came from switching from sigmoid nonlinear units to rectified linear units. Consider a simple neural network whose error $E$ depends on the weight $w_{ij}$ only through $y_j$, where

$$y_j = f\left(\sum_i w_{ij} x_i\right).$$

Its gradient is

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \cdot f'\left(\sum_i w_{ij} x_i\right) x_i.$$

If $f$ is the logistic sigmoid function, $f'$ will be close to zero for large inputs as well as small inputs. If $f$ is a rectified linear unit, the derivative is $1$ for all positive inputs and $0$ for negative ones, so the gradient does not shrink no matter how large the pre-activation gets, and it can be propagated back through many layers without vanishing.
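This difference is easy to check numerically. A short sketch in plain Python (the function names are mine):

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of the rectified linear unit: 1 for x > 0, else 0."""
    return 1.0 if x > 0 else 0.0

# For a large input the sigmoid saturates and its gradient collapses
# toward zero, while the ReLU gradient stays exactly 1.
print(sigmoid_grad(10.0))  # ~4.5e-05
print(relu_grad(10.0))     # 1.0
```

Multiplying many factors like `4.5e-05` across layers is what makes the gradient vanish; multiplying factors of `1.0` does not.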