## In a nutshell

Has consistency been studied for any of the ‘typical’ deep learning models used in practice, i.e. neural networks with multiple hidden layers, trained with stochastic gradient descent? If so, what have been the findings?

There seems to be some confusion as to why I mention optimization procedures in connection with consistency. Consistency is a property of estimators, and while we are free to choose any estimator, I chose to restrict the question to estimators that we can actually calculate. Hence the necessity to specify an algorithm.

I would also like to point out that universal approximation theorems (UATs) typically only show that neural networks as a function class are rich enough to model certain kinds on nonlinear relationships, and those I know of do not deal with calculable estimators. Hence some motivation for my question: UATs, even those with specific error bounds, give no guarantees as to the performance of real-world neural nets.

## The details

A consistent estimator converges in probability to its true value if the sample size grows to infinity. We can apply this concept to nonparametric models like (regression) neural networks:

Let the DGP be

E[y]=f(x)+ε

and we assume that f∈F for some function class F.

We train the neural network with some algorithm An:(X,Y)n→F′n such that

fn=An((x1,y1),…,(xn,yn))

is our trained network for dataset size n.

Then a neural network would be consistent for F if

lim

for all \delta>0.

Does there exist a neural network that is consistent in this sense?

## [Edit 3] What kind of network? Which \mathcal{F}? Which norm?

To say that ‘Neural networks’ is a broad class is an understatement, pretty much every model can be represented as a neural network. There are a number of specifications however which do not have other representations (yet), namely the ‘typical’ neural network:

- trained with a variation of stochastic gradient descent
- more than 1 hidden layer
- ridge activation function: x\mapsto u(a’x+b) for some univariate function u (this includes ReLU and its variations, sigmoid and its variations) or pooling activation function
- dense/convolutional/recurrent or a combination thereof

Anyof these ‘typical’ networks is fine. Another way to view the question would be:“If I want to choose my network architecture such that my network is a consistent estimator, what would be a valid choice?”.With regards to \mathcal{F} I have even less restrictions. It should not be a finite-dimensional space (e.g. polynomials with some finite order), that would be cheating. Beyond that, whatever is necessary to get consistency. Beppo Levi, C^k, Hilbert, Sobolev, pick your favorite. Compact support is fine of course, although not necessary.

Finally, while I realize that not all norms are equivalent on infinite-dimensional spaces such as \mathcal{F}, I am happy with convergence in probability under any norm. After all, it depends on \mathcal{F} which norms even make sense.

## Edit

When I say that neural networks are nonparametric models, I mean that we should let the size of the network depend on the size of the dataset. If we do not do this, it is quite obvious that neural networks are inconsistent because networks of fixed size have irreducible approximation errors – no network with ReLU or sigmoid activation can represent e^x exactly, for example.

## [Edit 2] Why is this relevant?

Neural networks have a reputation for being “very good at very complex tasks”. It seems to me this reputation stems from some high-profile achievements in image classification, language processing and (video)game playing. Without question, neural networks have been able to distill meaningful signals from the very high-dimensional input spaces these problems have. At the same time, neural networks have been found to be incredibly nonrobust, which has been studied under the nomer of ‘adversarial attacks’.

The kind of nonrobustness that neural networks show (namely sensitive to

smallperturbations) is a strong hint to me that the signals found by neural networks are, at least in part, spurious. That is hardly a stretch; with input dimensions in the hundreds of thousands, even uncorrelated random noise will likely contain a number of strong spurious correlations (linear dependence), and the amount of reasonably smooth nonlinear dependencies will be much larger.This is where consistency comes in. If training neural networks is consistent, at least there is hope that under the right conditions (given by the consistency proof) the spurious signals disappear. If on the other hand training is inconsistent, we should look for ways to adapt the learning process to make it consistent.

Here is my reasoning:

We know thatg_n = \arg\min_{f\in\mathcal{F}’_n}\sum_{i=1}^n{(y_i-f(x_i))}^2

is consistent if \mathcal{F}’_n is chosen very precisely, for example according to the universal approximation proof in this paper.

The problem is that neural networks are known to have multiple local minima in their loss function and have no special structure we can exploit, therefore there does not exist an algorithm that can find with certainty the elusive g_n.

In particular, the average discrepancy between local and global minima need not converge to zero. That would be enough to not have consistency.

This does not conflict with the universal approximation theorem because that theorem does not concern algorithms, it only answers the question if \lim_{n\rightarrow\infty}\mathcal{F’_n} is dense in \mathcal{F}.

Am I wrong? How well known/studied is this in the ML literature? For example, the deep learning book has a whole section on the theory of consistency, but never relays it back to deep learning.

**Answer**

Some details must be defined before the question can be answered.

There are different sorts of neuron and network connections, and these need specifying. There are also different measures of how well a learnt function approximates the function to be learned.

The universal approximation theorem assumes compact support, so probably this should be specified in the question. Neural nets do not extrapolate. For example, RELUs are asymptotically linear so a RELU network, however well it approximates the objective function within the support of the training set, can only learn a function that is asymptotically linear.

There are also different training algorithms – the answer must be dependent on which is chosen.

**Attribution***Source : Link , Question Author : Sebastiaan , Answer Author : chrishmorris*