# How does the Rectified Linear Unit (ReLU) activation function produce non-linear interaction of its inputs? [duplicate]

When used as an activation function in deep neural networks, the ReLU function often outperforms other non-linear functions like tanh or sigmoid. In my understanding, the whole purpose of an activation function is to let the weighted inputs to a neuron interact non-linearly. For example, when using $\sin(z)$ as the activation, the output of a two-input neuron would be:

$$\sin(w_0+w_1x_1+w_2x_2)$$

which is approximated by the truncated Taylor series
$$(w_0+w_1x_1+w_2x_2) - \frac{(w_0+w_1x_1+w_2x_2)^3}{6} + \frac{(w_0+w_1x_1+w_2x_2)^5}{120}$$

This polynomial contains all kinds of combinations of different powers of the features $x_1$ and $x_2$.
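
As a concrete instance of those combinations, here is the cubic term expanded under the simplifying assumption $w_0=0$ (made only to keep the expansion short):

$$(w_1x_1+w_2x_2)^3 = w_1^3x_1^3 + 3w_1^2w_2\,x_1^2x_2 + 3w_1w_2^2\,x_1x_2^2 + w_2^3x_2^3$$

The two middle terms are products of both features, which is the kind of interaction meant here.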

Although the ReLU is also technically a non-linear function, I don’t see how it can produce non-linear terms like the $\sin$, $\tanh$, and other activations do.

Edit: Although my question is similar to this question, I’d like to know how even a cascade of ReLUs is able to approximate such non-linear terms.

Suppose you want to approximate $f(x)=x^2$ using ReLUs $g(ax+b)$, where $g(z)=\max(z,0)$. One approximation is $h_1(x)=g(x)+g(-x)=|x|$, but it isn’t very good. You can add more terms with different choices of $a$ and $b$ to improve it. One such improvement, in the sense that the error is “small” across a larger interval, is $h_2(x)=g(x)+g(-x)+g(2x-2)+g(-2x-2)$. You can continue adding terms in this way, with as much complexity as you like.
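
To see why the second sum does better, it can help to write it out piecewise. This is a short worked sketch, assuming $g(z)=\max(z,0)$ and the mirrored pair $g(2x-2)$ and $g(-2x-2)$, which are the terms the `h2` code uses:

$$h_2(x)=\begin{cases}|x|, & |x|\le 1\\ 3|x|-2, & |x|>1\end{cases}$$

So $h_2$ agrees with $x^2$ at $x=0,\pm1,\pm2$, whereas $h_1$ agrees only at $x=0,\pm1$. On $[-2,2]$ each linear piece of $h_2$ is a chord of the parabola, so the error there is at most $1/4$.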

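Notice that, in the first case, the approximation is best for $x\in[-1,1]$, while in the second case, the approximation is best for $x\in[-2,2]$.

```r
x <- seq(-3, 3, length.out = 1000)
y_true <- x^2

# ReLU applied to the affine input a*x + b, vectorized over x
relu <- function(x, a = 1, b = 0) pmax(a * x + b, 0)

# First approximation: h1(x) = g(x) + g(-x) = |x|
h1 <- function(x) relu(x) + relu(-x)
png("fig1.png")
plot(x, h1(x), type = "l")
lines(x, y_true, col = "red")  # target x^2 in red
dev.off()

# Second approximation: add a pair of ReLUs with kinks at x = 1 and x = -1
h2 <- function(x) h1(x) + relu(2 * (x - 1)) + relu(-2 * (x + 1))
png("fig2.png")
plot(x, h2(x), type = "l")
lines(x, y_true, col = "red")  # target x^2 in red
dev.off()

# Pointwise squared-error loss
l2 <- function(y_true, y_hat) 0.5 * (y_true - y_hat)^2

# Loss of h1 (black) versus loss of h2 (red)
png("fig3.png")
plot(x, l2(y_true, h1(x)), type = "l")
lines(x, l2(y_true, h2(x)), col = "red")
dev.off()
```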
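
The procedure of adding more terms can be sketched as a loop. This is a minimal sketch rather than part of the code above: `h_k` is a hypothetical helper that places kinks at $0,\pm1,\dots,\pm(K-1)$, chosen so that the sum passes through $x^2$ at the integers up to $\pm K$. Other choices of $a$ and $b$ would work just as well.

```r
# Plain ReLU, vectorized over its input
relu <- function(z) pmax(z, 0)

# Hypothetical helper (not from the original answer): sum of K mirrored
# ReLU pairs. K = 1 gives |x|; K = 2 gives h1(x) + g(2x - 2) + g(-2x - 2).
h_k <- function(x, K) {
  out <- relu(x) + relu(-x)
  for (k in seq_len(K - 1)) {
    # each extra pair raises the slope by 2 beyond |x| = k
    out <- out + relu(2 * x - 2 * k) + relu(-2 * x - 2 * k)
  }
  out
}

x <- seq(-3, 3, length.out = 1001)

# Maximum absolute error against x^2 on [-3, 3] for K = 1, 2, 3
errs <- sapply(1:3, function(K) max(abs(h_k(x, K) - x^2)))
print(errs)
```

On this grid the maximum error shrinks as `K` grows, because each added pair extends the interval over which the piecewise-linear sum tracks the parabola.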