I want to improve my understanding of neural networks and their benefits compared to other machine learning algorithms. My understanding is as below and my question is:

Can you correct and supplement my understanding please? 🙂

My understanding:

(1) Artificial neural network = a function that predicts output values from input values. According to the Universal Approximation Theorem (https://en.wikipedia.org/wiki/Universal_approximation_theorem), a network with enough neurons can approximate essentially any prediction function, as long as that function is well behaved.

(2) The same is true for linear regression if you add polynomials of the input values as additional inputs, since polynomials can approximate any well-behaved function (compare the Taylor expansion).

(3) This means that, in a sense (with respect to the best possible outcomes), those 2 methods are equivalent.

(4) Hence, their main difference lies in which method lends itself to better computational implementation. In other words: with which method can you find good values for the parameters that define the prediction function faster, based on training examples?

I welcome any thoughts, comments and pointers to other links or books to improve my thinking.

**Answer**

Here’s the deal:

Technically, you did write true sentences (both models can approximate any ‘not too crazy’ function given enough parameters), but those sentences do not get you anywhere at all!

Why is that?

Well, take a closer look at the universal approximation theorem, or any other formal proof that a neural network can compute any f(x) if there are ENOUGH neurons.

All the proofs of that kind that I have seen use only one hidden layer.

Take a quick look here http://neuralnetworksanddeeplearning.com/chap5.html for some intuition.

There are works showing that, in a sense, the number of neurons needed grows exponentially if you are using just one layer.
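To get a feel for what "enough neurons" means with one hidden layer, here is a minimal sketch that hand-builds (does not train) a one-hidden-layer sigmoid net as a staircase over [0, 1]. The function names, the `steepness` value and the sine target are all my own choices for illustration. In one dimension this construction needs roughly 1/ε cells to reach error ε; the same grid-style construction in d dimensions needs roughly (1/ε)^d cells, which is where the blow-up comes from. That is a property of this particular construction, not a lower bound taken from the literature.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_staircase_net(f, n_hidden, steepness=1e4):
    """Hand-build a one-hidden-layer net that approximates f on [0, 1].

    The interval is split into n_hidden + 1 cells. Hidden neuron k is a
    very steep sigmoid that switches on at the left edge of cell k, and
    its output weight is the jump between the target values at the
    midpoints of neighbouring cells. No training is involved.
    """
    n_cells = n_hidden + 1
    mids = [(k + 0.5) / n_cells for k in range(n_cells)]
    thresholds = [k / n_cells for k in range(1, n_cells)]
    out_weights = [f(mids[k]) - f(mids[k - 1]) for k in range(1, n_cells)]
    out_bias = f(mids[0])

    def net(x):
        hidden = [sigmoid(steepness * (x - t)) for t in thresholds]
        return out_bias + sum(w * h for w, h in zip(out_weights, hidden))

    return net


def max_error(f, net, n_points=1000):
    """Largest absolute error on an evenly spaced grid over [0, 1]."""
    return max(abs(f(i / n_points) - net(i / n_points))
               for i in range(n_points + 1))


def target(x):
    """Illustrative target function (an assumption, any smooth f works)."""
    return math.sin(2 * math.pi * x)


if __name__ == "__main__":
    for n_hidden in (9, 99):
        net = make_staircase_net(target, n_hidden)
        print(n_hidden, "hidden neurons -> max error",
              round(max_error(target, net), 4))
```

Ten times more neurons buys roughly ten times less error here, and that is with a single input. With hundreds of inputs the same trick is hopeless.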

So, while in theory you are right, in practice you do not have an infinite amount of memory, so you don’t really want to train a net with 2^1000 neurons, do you? Even if you did have an infinite amount of memory, that net would overfit for sure.
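The same memory problem hits the polynomial-regression side of the comparison. As a rough sketch: the number of monomials of total degree at most d in n inputs is C(n + d, d), so the feature count explodes with the degree. The helper names below are hypothetical, and the float32 storage and the MNIST-sized input (28 × 28 = 784 pixels) are assumptions picked just to put numbers on it.

```python
import math


def num_polynomial_features(n_inputs, degree):
    """Count monomials of total degree <= degree in n_inputs variables.

    Includes the constant term. This is the standard stars-and-bars
    count C(n_inputs + degree, degree).
    """
    return math.comb(n_inputs + degree, degree)


def float32_bytes(n_values):
    """Bytes needed to store n_values numbers as 32-bit floats (assumed)."""
    return 4 * n_values


if __name__ == "__main__":
    n_inputs = 28 * 28  # one MNIST-sized grayscale image
    for degree in (1, 2, 3, 4, 5):
        n = num_polynomial_features(n_inputs, degree)
        gb = float32_bytes(n) / 1e9
        print(f"degree {degree}: {n:,} features, "
              f"{gb:.3g} GB per expanded sample")
```

Linear regression then needs one coefficient per feature, so the parameter count grows at the same rate as the feature count.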

To my mind, the most important point of ML is the practical point!

Let’s expand a little on that.

The real big issue here isn’t just that polynomials increase or decrease very quickly outside the range of the training set. Not at all. As a quick example, every pixel of a picture is within a very specific range ([0, 255] for each RGB channel), so you can rest assured that any new sample will be within the range of values covered by your training set. No. The big deal is: this comparison is not useful to begin with(!).

I suggest that you experiment a bit with MNIST and see the actual results you can get by using just one single layer.

Practical nets use way more than one hidden layer, sometimes dozens of layers (well, ResNet uses even more…). For a reason. That reason has not been proven, and in general, choosing an architecture for a neural net is a hot area of research. In other words, while we still need to know more, both models you have compared (linear regression and a NN with just one hidden layer) are, for many datasets, not useful whatsoever!

By the way, in case you get into ML, there is another useless theorem which is actually a current ‘area of research’: PAC (probably approximately correct) learning / VC dimension. I will expand on that as a bonus:

If universal approximation basically states that given an infinite number of neurons we can approximate any function (thank you very much?), what PAC says in practical terms is that, given a (practically!) infinite number of labelled examples, we can get as close as we want to the best hypothesis within our model.

It was absolutely hilarious when I calculated the actual number of examples needed for a practical net to be within some practical desired error rate with some okay-ish probability 🙂

It was more than the number of electrons in the universe.

P.S. To boot, it also assumes that the samples are IID (which is never ever true!).

**Attribution** *Source: Link, Question Author: tyrex, Answer Author: Yoni Keren*