What is the essential difference between a neural network and nonlinear regression?

Artificial neural networks are often (demeaningly) called “glorified regressions”. The main difference between ANNs and multiple / multivariate linear regression is, of course, that the ANN models nonlinear relationships.

So what is the difference between an ANN and a multiple / multivariate nonlinear regression model?

The only thing I can think of is the graph-like structure of the neural network, which allows for an efficient parameter learning procedure (backpropagation), along with other advantages (flexible stacking of layers in deep networks, allowing for feature learning, etc.).

Can they be effectively called ‘glorified nonlinear regressions’? Or is there more to it?

EDIT: Found a good discussion on this here https://www.quora.com/Is-Machine-Learning-just-glorified-curve-fitting
where it is essentially agreed that the differences are mostly nuances, while the approach is similar.

I understand that, in that case, the answer is of a more subjective nature, and this question is not appropriate for Stack Exchange.

In theory, yes. In practice, things are more subtle.

First of all, let’s clear up a doubt raised in the comments: neural networks can handle multiple outputs in a seamless fashion, so it doesn’t really matter whether we consider multiple regression or not (see The Elements of Statistical Learning, paragraph 11.4).

Having said that, a neural network of fixed architecture and loss function would indeed just be a parametric nonlinear regression model. So it would be even less flexible than nonparametric models such as Gaussian Processes. To be precise, a single hidden layer neural network with a sigmoid or tanh activation function would be less flexible than a Gaussian Process: see http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf. For deep networks this is not true, but it becomes true again when you consider Deep Gaussian Processes.
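To make the "fixed architecture = parametric nonlinear regression" point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: one input, one output, `H = 5` tanh units, a squared-error loss, toy data from a noisy sine, and plain gradient descent with a hand-picked step size. Once those are fixed, the network is just a function $f(x\vert\boldsymbol{\theta})$ with 16 parameters, fitted like any other nonlinear least squares model.

```python
import math
import random

H = 5  # number of hidden tanh units (fixed architecture, chosen for illustration)

def unpack(theta):
    """Split the flat parameter vector into (w, b, v, c)."""
    return theta[:H], theta[H:2 * H], theta[2 * H:3 * H], theta[3 * H]

def f(x, theta):
    """Network output f(x | theta) = c + sum_j v_j * tanh(w_j * x + b_j)."""
    w, b, v, c = unpack(theta)
    return c + sum(v[j] * math.tanh(w[j] * x + b[j]) for j in range(H))

def loss_and_grad(theta, xs, ys):
    """Mean squared error and its gradient w.r.t. theta (chain rule by hand)."""
    w, b, v, c = unpack(theta)
    n = len(xs)
    loss = 0.0
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        h = [math.tanh(w[j] * x + b[j]) for j in range(H)]
        r = c + sum(v[j] * h[j] for j in range(H)) - y  # residual
        loss += r * r / n
        for j in range(H):
            dpre = 2 * r * v[j] * (1 - h[j] ** 2) / n  # d loss / d pre-activation j
            grad[j] += dpre * x                  # gradient for w_j
            grad[H + j] += dpre                  # gradient for b_j
            grad[2 * H + j] += 2 * r * h[j] / n  # gradient for v_j
        grad[3 * H] += 2 * r / n                 # gradient for c
    return loss, grad

# Toy data (assumption): noisy samples of sin(x) on [-3, 3]
rng = random.Random(0)
xs = [i / 10 for i in range(-30, 31)]
ys = [math.sin(x) + rng.gauss(0, 0.05) for x in xs]

# Random start, then plain full-batch gradient descent
theta = [rng.uniform(-1, 1) for _ in range(3 * H + 1)]
initial_loss, _ = loss_and_grad(theta, xs, ys)
for _ in range(2000):
    _, grad = loss_and_grad(theta, xs, ys)
    theta = [t - 0.01 * g for t, g in zip(theta, grad)]
final_loss, _ = loss_and_grad(theta, xs, ys)

print(f"p = {len(theta)} parameters, MSE {initial_loss:.3f} -> {final_loss:.3f}")
```

The same `theta` could equally be handed to a Levenberg-Marquardt routine; nothing about the model itself requires backpropagation or stochastic gradient descent.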

So, why are Deep Neural Networks such a big deal? For very good reasons:

1. They allow fitting models of a complexity that you wouldn’t even begin to dream of when fitting Nonlinear Least Squares models with the Levenberg-Marquardt algorithm. See for example https://arxiv.org/pdf/1611.05431.pdf, https://arxiv.org/pdf/1706.02677.pdf and https://arxiv.org/pdf/1805.00932.pdf, where the number of parameters $p$ ranges from 25 to 829 million. Of course DNNs are overparametrized, non-identifiable, etc., so the number of parameters is very different from the “degrees of freedom” of the model (see https://arxiv.org/abs/1804.08838 for some intuition). Still, it’s undeniably amazing that models with $N \ll p$ ($N=$ sample size) are able to generalize so well.

2. They scale to huge data sets. A vanilla Gaussian Process is a very flexible model, but inference has an $O(N^3)$ cost, which is completely unacceptable for data sets as big as ImageNet, or bigger ones such as Open Images V4. There are approximations to inference with GPs which scale as well as NNs, but I don't know why they don't enjoy the same fame (well, I have my ideas about that, but let's not digress).

3. For some tasks, they're impressively accurate, much better than many other statistical learning models. You can try to match ResNeXt accuracy on ImageNet with a 65536-input kernel SVM, or with a random forest for classification. Good luck with that.
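To put a number on the $O(N^3)$ cost from point 2, here is a rough back-of-the-envelope sketch. It assumes exact GP inference via a Cholesky factorization of the dense $N \times N$ kernel matrix (about $N^3/3$ floating-point operations) stored in 8-byte floats; the helper name `exact_gp_cost` is made up for this example, and the largest $N$ is the ImageNet-1k training set size.

```python
def exact_gp_cost(n, bytes_per_float=8):
    """Rough cost of exact GP inference with n training points.

    Storing the dense n x n kernel matrix takes n^2 floats, and its
    Cholesky factorization takes about n^3 / 3 floating-point operations.
    """
    memory_bytes = n * n * bytes_per_float
    flops = n ** 3 / 3
    return memory_bytes, flops

# Last value: number of training images in ImageNet-1k (ILSVRC-2012)
for n in (1_000, 100_000, 1_281_167):
    mem, flops = exact_gp_cost(n)
    print(f"N={n:>9,}: kernel matrix ~{mem / 1e9:,.3f} GB, Cholesky ~{flops:.2e} flops")
```

Going from a thousand points to ImageNet scale multiplies the memory by roughly a million and the operation count by roughly a billion, which is why the exact method is not an option there.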

However, the real difference between theory:

all neural networks are parametric nonlinear regression or classification models

and practice, in my opinion, is that nothing about a deep neural network is really fixed in advance, so you end up fitting a model from a much bigger class than you would expect. In real-world applications, none of these aspects are really fixed:

• architecture (suppose I do sequence modeling: shall I use an RNN? A dilated CNN? Attention-based model?)
• details of the architecture (how many layers? how many units in layer 1, how many in layer 2, which activation function(s), etc.)
• how do I preprocess the data? Standardization? Minmax normalization? RobustScaler?
• kind of regularization ($l_1$? $l_2$? batch-norm? Before or after ReLU? Dropout? Between which layers?)
• optimizer (SGD? Path-SGD? Entropy-SGD? Adam? etc.)
• other hyperparameters such as the learning rate, early stopping, etc.
• even the loss function is often not fixed in advance! We use NNs mostly for two applications (regression and classification), but people use a wide range of different loss functions.
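A small sketch of how quickly these choices multiply. The option lists below are hypothetical and deliberately tiny (two to four values per aspect, loosely based on the items above); real searches are usually far less tidy than a full grid.

```python
from itertools import product

# Hypothetical, deliberately small menu of options for each aspect
choices = {
    "architecture": ["RNN", "dilated CNN", "attention"],
    "n_layers": [2, 4, 8],
    "activation": ["relu", "tanh"],
    "preprocessing": ["standardize", "minmax", "robust"],
    "regularization": ["l1", "l2", "dropout", "batch-norm"],
    "optimizer": ["SGD", "Adam"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "loss": ["mse", "mae", "huber"],
}

# Every combination is a different "model + fitting procedure"
configs = [dict(zip(choices, combo)) for combo in product(*choices.values())]

print(len(configs))  # 3 * 3 * 2 * 3 * 4 * 2 * 3 * 3 = 3888
print(configs[0])
```

Even with this toy menu there are thousands of distinct candidates, and whichever one ends up being reported has in effect been selected from that whole family.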

Look how many choices are made even in a relatively simple case, where there is a strong seasonal signal and the number of features is small, as far as DNNs go:

https://stackoverflow.com/questions/48929272/non-linear-multivariate-time-series-response-prediction-using-rnn

Thus, even though ideally fitting a DNN would just mean fitting a model of the type

$y=f(\mathbf{x}\vert\boldsymbol{\theta})+\epsilon$

where $f$ has a certain hierarchical structure, in practice very little (if anything at all) about the function and the fitting method is defined in advance, and thus the model is much more flexible than a "classic" parametric nonlinear model.
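For concreteness, one common way to write that hierarchical structure, assuming a plain feedforward network with $L$ layers (the symbols $W_\ell$, $\mathbf{b}_\ell$, $\sigma_\ell$ and $\mathbf{h}_\ell$ are introduced here only for illustration), is

$\mathbf{h}_0=\mathbf{x},\qquad \mathbf{h}_\ell=\sigma_\ell(W_\ell\mathbf{h}_{\ell-1}+\mathbf{b}_\ell),\quad \ell=1,\dots,L$

$f(\mathbf{x}\vert\boldsymbol{\theta})=\mathbf{h}_L,\qquad \boldsymbol{\theta}=\{W_1,\mathbf{b}_1,\dots,W_L,\mathbf{b}_L\}$

In the "theory" view, $L$, the layer widths and the activations $\sigma_\ell$ are all given, and only $\boldsymbol{\theta}$ is estimated. In the "practice" view, all of them are effectively tuned alongside $\boldsymbol{\theta}$.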