When was the ReLU function first used in a neural network?

By ReLU, I mean the function $f(x) = \max\{0, x\}$.
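As a concrete illustration (my own sketch, not part of the original question), the function can be written in a couple of lines of Python:

```python
def relu(x):
    """Rectified linear unit: f(x) = max{0, x}.

    Negative inputs are thresholded to zero; non-negative inputs
    pass through unchanged.
    """
    return max(0.0, x)

# A few example values:
# relu(-3.0) -> 0.0
# relu(0.0)  -> 0.0
# relu(2.5)  -> 2.5
```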

By neural network, I mean a function-approximation machine composed of one or more “hidden layers.”

(That is, I wish to exclude models that are merely “special cases” of neural networks; if we admitted such special cases, the question would reduce to something like “when did anyone, in any context, first propose thresholding values below 0?”, which is not really interesting to me.)


The earliest usage of the ReLU activation that I’ve found is Fukushima (1980, page 196, equation 2). Unless I missed something, the function is not given any particular name in that paper. I am not aware of an older reference, but because terminology is inconsistent and has changed rapidly, it’s entirely possible that I’ve overlooked an even older publication.

It is common to cite Nair & Hinton (2010) as the first usage of $f$. For example, Schmidhuber (2014) cites Nair & Hinton when discussing ReLU units in his review article. Certainly, Nair & Hinton’s paper is important because it spurred the recent interest in using $f$ in neural networks, and it is the source of the modern nomenclature “rectified linear units.” Nonetheless, the idea of using $f$ as an activation is decades older than the 2010 paper.

Incidentally, Hinton also coauthored a chapter in Parallel Distributed Processing in which $f$ was used in a neural network. In that chapter, $f$ is called the “threshold function.” However, the volume was published in 1986, six years after Fukushima’s paper.


Source: Link, Question Author: Sycorax, Answer Author: Sycorax