# How does the Rectified Linear Unit (ReLU) activation function produce non-linear interaction of its inputs? [duplicate]

When used as an activation function in deep neural networks, the ReLU often outperforms other non-linear functions such as $\tanh$ or the sigmoid. In my understanding, the whole purpose of an activation function is to let the weighted inputs to a neuron interact non-linearly. For example, when using $\sin(z)$ as the activation, the output of a two-input neuron would be:

$$\sin(w_0+w_1x_1+w_2x_2)$$

which, by the Taylor expansion of $\sin$, is well approximated by
$$(w_0+w_1x_1+w_2x_2) - \frac{(w_0+w_1x_1+w_2x_2)^3}{6} + \frac{(w_0+w_1x_1+w_2x_2)^5}{120}$$

and, once expanded, contains all kinds of cross-terms mixing different powers of the features $x_1$ and $x_2$.

Although the ReLU is also technically a non-linear function, I don’t see how it can produce non-linear terms the way $\sin$, $\tanh$, and other smooth activations do.
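(As a quick sanity check, a single ReLU neuron does couple its inputs. For any purely additive function $f(x_1,x_2)=f_1(x_1)+f_2(x_2)$, the mixed second difference $f(1,-1)-f(1,0)-f(0,-1)+f(0,0)$ is exactly zero, so a nonzero value signals an interaction term. The toy neuron below, with weights $1,1$ and zero bias, is my own illustration, sketched in Python:)

```python
def relu(z):
    return max(z, 0.0)

def f(x1, x2):
    # a single ReLU neuron with weights 1, 1 and no bias (toy example)
    return relu(x1 + x2)

# For any additive f(x1, x2) = f1(x1) + f2(x2) this mixed second
# difference is exactly zero; a nonzero value signals interaction.
delta = f(1, -1) - f(1, 0) - f(0, -1) + f(0, 0)
print(delta)  # -1.0
```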

Edit: Although my question is similar to this question, I’d like to know how even a cascade of ReLUs is able to approximate such non-linear terms.

Suppose you want to approximate $f(x)=x^2$ using ReLUs of the form $g(ax+b)$, where $g(z)=\max(z,0)$. A first approximation might be $h_1(x)=g(x)+g(-x)=|x|$.

This isn’t a very good approximation, but you can add more terms with different choices of $a$ and $b$ to improve it. One such improvement, in the sense that the error stays small across a larger interval, is $h_2(x)=g(x)+g(-x)+g(2x-2)+g(-2x-2)$, and it gets better.

You can continue this procedure, adding terms until the approximation is as accurate as you like.

Notice that, in the first case, the approximation is best for $x\in[-1,1]$, while in the second case, the approximation is best for $x\in[-2,2]$.

```r
x <- seq(-3, 3, length.out = 1000)
y_true <- x^2

# ReLU of an affine transform: max(a*x + b, 0), vectorised with pmax
relu <- function(x, a = 1, b = 0) pmax(a * x + b, 0)

# First approximation: h1(x) = g(x) + g(-x) = |x|
h1 <- function(x) relu(x) + relu(-x)
png("fig1.png")
plot(x, h1(x), type = "l")
lines(x, y_true, col = "red")
dev.off()

# Improved approximation: h2(x) = h1(x) + g(2x - 2) + g(-2x - 2)
h2 <- function(x) h1(x) + relu(2 * (x - 1)) + relu(-2 * (x + 1))
png("fig2.png")
plot(x, h2(x), type = "l")
lines(x, y_true, col = "red")
dev.off()

# Pointwise squared error of each approximation
l2 <- function(y_true, y_hat) 0.5 * (y_true - y_hat)^2

png("fig3.png")
plot(x, l2(y_true, h1(x)), type = "l")
lines(x, l2(y_true, h2(x)), col = "red")
dev.off()
```
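To make the “keep adding terms” step concrete, here is a sketch of the general construction (in Python for illustration; the knot placement at the integers is my own choice): $h_n$ uses $2n$ ReLUs, matches $x^2$ exactly at the integer points of $[-n, n]$, and its worst-case error on that interval is $1/4$, attained at the midpoints between knots.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def h(x, n):
    """Approximate x^2 with 2n ReLUs: exact at the integer knots of [-n, n].

    Start from |x| = g(x) + g(-x), then raise the slope by 2 at each
    knot k = 1, ..., n-1 (and its mirror image), as in the text.
    """
    out = relu(x) + relu(-x)
    for k in range(1, n):
        out += relu(2 * (x - k)) + relu(-2 * (x + k))
    return out

x = np.linspace(-3, 3, 1201)
for n in (1, 2, 3):
    err = np.max(np.abs(h(x, n) - x**2)[np.abs(x) <= n])
    print(n, round(err, 3))  # the worst-case error stays 0.25 on [-n, n]
```

The error does not shrink as $n$ grows because new terms only widen the interval; to reduce the error itself, you would place the knots more densely instead.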