I haven’t found a satisfactory answer to this from

Of course if the data I have is of the order of millions then deep learning is the way.

I have also read that when I do not have big data, it may be better to use other machine learning methods, and the reason given is over-fitting. By machine learning I mean looking at the data, extracting features, crafting new features from what was collected, removing heavily correlated variables, and so on: the whole nine yards of machine learning.

And I have been wondering: why is it that neural networks with one hidden layer are not a panacea for machine learning problems? They are universal estimators, and over-fitting can be managed with dropout, L2 regularization, L1 regularization, and batch normalization. Training speed is generally not an issue if we have just 50,000 training examples. They are faster at test time than, let us say, random forests.

So why not: clean the data, impute missing values as you would generally do, center the data, standardize the data, throw it at an ensemble of neural networks with one hidden layer and apply regularization till you see no over-fitting and then train them to the end? There are no issues with exploding or vanishing gradients since it is just a 2-layer network. If deep layers were needed, that would mean hierarchical features have to be learned, and then other machine learning algorithms are no good either. For example, an SVM is just a neural network with hinge loss.

An example where some other machine learning algorithm would outperform a carefully regularized 2-layer (maybe 3-layer?) neural network would be appreciated. You can give me the link to the problem and I will train the best neural network that I can, and we can see whether a 2- or 3-layer neural network falls short of any other benchmark machine learning algorithm.

**Answer**

Each machine learning algorithm has a different inductive bias, so it’s not always appropriate to use neural networks. A linear trend will always be learned best by simple linear regression rather than an ensemble of nonlinear networks.
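To make the inductive-bias point concrete, here is a minimal sketch in plain Python. It is not a neural network experiment: a 1-nearest-neighbour regressor stands in as a hypothetical example of a flexible nonlinear learner, and the data-generating line `3x + 2`, the noise level, and the train/test ranges are all assumptions chosen for illustration. Both models are fit on noisy points from a linear trend and then evaluated outside the training range.

```python
import random

random.seed(0)

def true_fn(x):
    # Assumed ground truth for this illustration: a purely linear trend.
    return 3.0 * x + 2.0

# Noisy training data with inputs drawn from [0, 1).
train_x = [random.random() for _ in range(50)]
train_y = [true_fn(x) + random.gauss(0.0, 0.3) for x in train_x]

# Ordinary least squares for a single feature (closed form).
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
var_x = sum((x - mean_x) ** 2 for x in train_x)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

def predict_linear(x):
    return slope * x + intercept

def predict_1nn(x):
    # Flexible nonlinear stand-in: return the target of the closest training point.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Evaluate on [1, 2], outside the training range, against the noiseless trend.
test_x = [1.0 + i / 100 for i in range(101)]

def mse(predict):
    return sum((predict(x) - true_fn(x)) ** 2 for x in test_x) / len(test_x)

linear_mse = mse(predict_linear)
nn1_mse = mse(predict_1nn)
print(f"linear regression MSE: {linear_mse:.3f}")
print(f"1-nearest-neighbour MSE: {nn1_mse:.3f}")
```

The linear model's assumption matches the data, so it extrapolates; the nearest-neighbour model can only repeat values it has already seen. A small network trained by gradient descent is a different learner again, so this only illustrates the general idea of a mismatched inductive bias.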

If you take a look at the winners of past Kaggle competitions, excepting any challenges with image/video data, you will quickly find that neural networks are not the solution to everything. Some past solutions here.

> apply regularization till you see no over-fitting and then train them to the end

There is no guarantee that you can apply enough regularization to prevent overfitting without completely destroying the capacity of the network to learn anything. In real life, it is rarely feasible to eliminate the train-test gap, and that’s why papers still report train and test performance.

> they are universal estimators

This is only true in the limit of having an unbounded number of units, which isn’t realistic.

> you can give me the link to the problem and i would train the best neural network that i can and we can see if 2 layered or 3 layered neural networks falls short of any other benchmark machine learning algorithm

An example problem which I expect a neural network would never be able to solve: Given an integer, classify as prime or not-prime.

I believe this could be solved perfectly with a simple algorithm that iterates over all valid programs in ascending length and finds the shortest program which correctly identifies the prime numbers. Indeed, this 13 character regex string can match prime numbers, which wouldn’t be computationally intractable to search.
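For readers who have not seen a regex test primality, here is a minimal sketch of the widely known unary trick. This is not necessarily the 13-character regex referred to above; the pattern below and the choice to encode the integer `n` as a string of `n` ones are assumptions made for illustration.

```python
import re

# Matches unary strings whose length is NOT prime:
#   "1?"          covers lengths 0 and 1
#   "(11+?)\1+"   covers lengths that split into two or more equal blocks of size >= 2
NON_PRIME = re.compile(r"1?|(11+?)\1+")

def is_prime(n):
    # n is prime exactly when its unary encoding is not matched by NON_PRIME.
    return NON_PRIME.fullmatch("1" * n) is None

primes = [n for n in range(30) if is_prime(n)]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The backreference `\1` is what lets the pattern check divisibility, which is why a very short string can encode a rule that generalizes to every integer.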

> Can regularization take a model from one that overfits to the one that has its representational power severely hamstrung by regularization? Won’t there always be that sweet spot in between?

Yes, there is a sweet spot, but it is usually way before you stop overfitting. See this figure:

If you flip the horizontal axis and relabel it as “amount of regularization”, it’s pretty accurate — if you regularize until there is no overfitting at all, your error will be huge. The “sweet spot” occurs when there is a bit of overfitting, but not too much.

> How is a ‘simple algorithm that iterates over all valid programs in ascending length and finds the shortest program which correctly identifies the prime numbers.’ an algorithm that learns?

It finds the parameters θ such that we have a hypothesis H(θ) which explains the data, just like backpropagation finds the parameters θ which minimize the loss (and by proxy, explains the data). Only in this case, the parameter is a string instead of many floating point values.
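A toy version of this kind of learner can be sketched as follows. To keep the search tiny, the sketch learns "even length" rather than primality, and the hypothesis space is regex strings over a six-character alphabet; the alphabet, the training examples, and the names `explains_data` and `shortest_hypothesis` are all hypothetical choices for illustration.

```python
import itertools
import re

# Hypothesis space: regex strings over this small alphabet (an assumption).
ALPHABET = "1()*+?"

# Training data: unary strings of length 0..10, labelled True when the length is even.
examples = [("1" * n, n % 2 == 0) for n in range(11)]

def explains_data(pattern):
    # A candidate explains the data if it fully matches exactly the positive examples.
    try:
        compiled = re.compile(pattern)
    except re.error:
        return False  # not a valid program, skip it
    return all((compiled.fullmatch(s) is not None) == label for s, label in examples)

def shortest_hypothesis(max_length=6):
    # Enumerate candidate strings in ascending length and return the first that fits.
    for length in range(1, max_length + 1):
        for chars in itertools.product(ALPHABET, repeat=length):
            candidate = "".join(chars)
            if explains_data(candidate):
                return candidate
    return None

theta = shortest_hypothesis()
print(theta)
```

Here `theta` plays the role of the learned parameter: it is selected because it fits the training examples, exactly as weights are selected because they give a low loss. The search cost grows exponentially with the string length, which is the practical limitation of this approach.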

> so if i get you correctly you are making the argument that if the data is not substantial the deep network will never hit the validation accuracy of the best shallow network given the best hyperparameters for both?

Yes. Here is an ugly but hopefully effective figure to illustrate my point.

> but that doesnt make sense. a deep network can just learn a 1-1 mapping above the shallow

The question is not “can it”, but “will it”, and if you are training with backpropagation, the answer is probably not.

> We discussed the fact that larger networks will always work better than smaller networks

Without further qualification, that claim is just wrong.

**Attribution** *Source: Link, Question Author: MiloMinderbinder, Answer Author: shimao*