How does the Rectified Linear Unit (ReLU) activation function produce non-linear interaction of its inputs? [duplicate]

When used as an activation function in deep neural networks, the ReLU function outperforms other non-linear functions like tanh or sigmoid. In my understanding, the whole purpose of an activation function is to let the weighted inputs to a neuron interact non-linearly. For example, when using $\sin(z)$ as the activation, the output of a two-input neuron would be:

$$ \sin(w_0 + w_1 x_1 + w_2 x_2) $$

which would approximate the function
$$ (w_0 + w_1 x_1 + w_2 x_2) - \frac{(w_0 + w_1 x_1 + w_2 x_2)^3}{6} + \frac{(w_0 + w_1 x_1 + w_2 x_2)^5}{120} $$

and contain all kinds of combinations of different powers of the features $x_1$ and $x_2$.
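As a quick numerical sanity check (my own sketch, not part of the original question, using arbitrary illustrative weights), the truncated series above tracks $\sin(z)$ closely for moderate pre-activations, which is what makes the expansion argument work:

w0 <- 0.1; w1 <- 0.5; w2 <- -0.3                  # arbitrary illustrative weights
grid <- expand.grid(x1 = seq(-1, 1, length.out = 5),
                    x2 = seq(-1, 1, length.out = 5))
z <- w0 + w1 * grid$x1 + w2 * grid$x2             # weighted input to the neuron
taylor5 <- z - z^3 / 6 + z^5 / 120                # truncated series from above
max(abs(sin(z) - taylor5))                        # small: well under 1e-3 here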

Although the ReLU is also technically a non-linear function, I don’t see how it can produce non-linear terms the way $\sin$, $\tanh$, and other activations do.

Edit: Although my question is similar to this question, I’d like to know how even a cascade of ReLUs is able to approximate such non-linear terms.

Answer

Suppose you want to approximate $f(x)=x^2$ using ReLUs $g(ax+b)$. One approximation might look like $h_1(x)=g(x)+g(-x)=|x|$.

[Figure: $h_1(x)$ plotted against $f(x)=x^2$]

This isn’t a very good approximation on its own, but you can add more terms with different choices of $a$ and $b$ to improve it. One such improvement, in the sense that the error is “small” across a larger interval, is $h_2(x)=g(x)+g(-x)+g(2x-2)+g(-2x-2)$, and the fit gets better.

[Figure: $h_2(x)$ plotted against $f(x)=x^2$]

You can continue adding terms in this way to make the approximation as accurate as you like over as wide an interval as you like.

Notice that, in the first case, the approximation is best for $x\in[-1,1]$, while in the second case, the approximation is best for $x\in[-2,2]$.

[Figure: squared error of $h_1$ and $h_2$ against $f(x)=x^2$, produced by the code below]

x <- seq(-3, 3, length.out = 1000)
y_true <- x^2

# ReLU applied to an affine transformation: g(a*x + b) = max(a*x + b, 0)
relu <- function(x, a = 1, b = 0) pmax(a * x + b, 0)

# h1(x) = g(x) + g(-x) = |x|
h1 <- function(x) relu(x) + relu(-x)
png("fig1.png")
    plot(x, h1(x), type = "l")
    lines(x, y_true, col = "red")
dev.off()

# h2(x) = h1(x) + g(2x - 2) + g(-2x - 2): adds kinks at x = 1 and x = -1
h2 <- function(x) h1(x) + relu(2 * (x - 1)) + relu(-2 * (x + 1))
png("fig2.png")
    plot(x, h2(x), type = "l")
    lines(x, y_true, col = "red")
dev.off()

# Squared error of each approximation against y_true = x^2
l2 <- function(y_true, y_hat) 0.5 * (y_true - y_hat)^2

png("fig3.png")
    plot(x, l2(y_true, h1(x)), type = "l")
    lines(x, l2(y_true, h2(x)), col = "red")
dev.off()
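
To push the construction further, here is a small sketch of a generalization (mine, not part of the original answer): put a pair of ReLU terms at each integer knot $k = 1, \dots, K-1$. Each pair raises the slope by 2 beyond $|x| = k$, which is exactly the slope increment of the piecewise-linear interpolant of $x^2$ at its integer knots, so $h_K$ matches $x^2$ at $x = 0, \pm 1, \dots, \pm K$:

# Generalization: h1 plus a pair of ReLUs at every integer knot k = 1, ..., K-1
hK <- function(x, K) {
    out <- relu(x) + relu(-x)                     # start from h1(x) = |x|
    if (K > 1) for (k in 1:(K - 1))
        out <- out + relu(2 * (x - k)) + relu(-2 * (x + k))
    out
}

png("fig4.png")
    plot(x, hK(x, 3), type = "l")                 # matches x^2 at 0, ±1, ±2, ±3
    lines(x, y_true, col = "red")
dev.off()

With $K = 1$ and $K = 2$ this reduces to $h_1$ and $h_2$ above, and a weighted sum of ReLUs of affine functions like this is precisely the form a single hidden layer of ReLU units with a linear output computes, except that a trained network learns the kink locations and slopes instead of having them hand-picked.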

Attribution
Source: Link, Question Author: farhanhubble, Answer Author: Sycorax
