In mathematics, a function $f: A \rightarrow B$ is linear (additive) if for every $x$ and $y$ in the domain $A$ it satisfies $f(x) + f(y) = f(x+y)$. By definition, ReLU is $\max(0, x)$. If we restrict the domain to either $(-\infty, 0]$ or $[0, \infty)$, the function is linear on that piece. However, it's easy to see that $f(-1) + f(1) \neq f(-1 + 1) = f(0)$, since the left side is $0 + 1 = 1$ and the right side is $0$. Hence, by definition, ReLU is not linear.

Nevertheless, ReLU is so close to linear that this often confuses people, who wonder how it can be used as a universal approximator. In my experience, the best way to think about it is like Riemann sums: you can approximate any continuous function with lots of little rectangles, and ReLU activations can produce lots of little rectangles. In fact, in practice, ReLU networks can make rather complicated shapes and approximate many complicated domains.
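To make the Riemann-sum intuition concrete, here is a minimal NumPy sketch: three shifted ReLUs combine into a "tent" bump, and a weighted sum of such bumps gives a piecewise-linear interpolation of $\sin(x)$. The knot count, interval, and target function are arbitrary choices for illustration, not anything from a particular network.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def hat(x, c, h):
    # Tent function built from three ReLUs:
    # zero outside [c - h, c + h], peak value 1 at x = c.
    return (relu(x - (c - h)) - 2 * relu(x - c) + relu(x - (c + h))) / h

# Approximate sin on [0, pi] as a weighted sum of ReLU "bumps"
# (this reproduces linear interpolation through n knots).
n = 20
centers = np.linspace(0.0, np.pi, n)
h = centers[1] - centers[0]

xs = np.linspace(0.0, np.pi, 500)
approx = sum(np.sin(c) * hat(xs, c, h) for c in centers)

max_err = np.max(np.abs(approx - np.sin(xs)))
```

With 20 knots the maximum error is already below 0.01; adding more bumps shrinks it further, which is exactly the universal-approximation picture in miniature.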

I also feel like clarifying another point. As pointed out by a previous answer, neurons do not "die" with sigmoid; rather, their gradients vanish. The reason is that the derivative of the sigmoid function is at most 0.25. After many layers, you end up multiplying these gradients together, and a product of many small numbers less than 1 goes to zero very quickly.

Hence, if you're building a deep network with many layers, your sigmoid units will essentially stagnate rather quickly and become more or less useless.

The key takeaway is that the vanishing comes from multiplying the gradients, not from the gradients themselves.
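The multiplication effect is easy to check numerically. A quick sketch (the depths chosen are arbitrary): even in the most optimistic case where every layer contributes the sigmoid's maximum derivative of 0.25, the backpropagated product shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # Derivative of sigmoid: s * (1 - s), maximized at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

peak = dsigmoid(0.0)  # equals 0.25, the largest possible value

# Backprop multiplies roughly one such factor per layer.
# Even at the maximum, the product collapses with depth:
products = {d: peak ** d for d in (5, 10, 20)}
```

At 20 layers the best-case product is about $0.25^{20} \approx 10^{-12}$, so in a real network (where most derivatives are well below 0.25) the early layers receive essentially no gradient signal.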

This post is 2 years old, but I'd like to mention that ReLU does not avoid dead neurons. "Dead" neurons occur due to the 0 gradient of the ReLU function when the input is less than 0. – Sathvik Swaminathan – 2020-05-25T18:05:22.760

Right, there's a modified version of ReLU which helps avoid "dead" neurons. It's called Leaky ReLU, and it has a slightly sloped activation for negative inputs. – mark mark – 2020-12-18T18:08:58.883
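The contrast the comments describe can be sketched in a few lines of NumPy; the slope value 0.01 is a common default but is an arbitrary choice here. For negative inputs, ReLU's gradient is exactly zero (so a neuron stuck there stops updating), while Leaky ReLU keeps a small nonzero gradient.

```python
import numpy as np

def relu_grad(x):
    # Gradient of max(0, x): 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Leaky ReLU keeps a small slope alpha on the negative side,
    # so the gradient never vanishes entirely.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
g_relu = relu_grad(x)         # [0, 0, 1, 1]
g_leaky = leaky_relu_grad(x)  # [0.01, 0.01, 1, 1]
```

A neuron whose pre-activations are always negative gets zero gradient under plain ReLU and can never recover; with the leaky variant the small slope lets weight updates continue.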