Can increasing the amount of training data make overfitting worse?

Suppose I train a neural network on dataset A and evaluate on dataset B (which has a different feature distribution from dataset A). If I increase the amount of data in dataset A by a factor of 10, is it likely to decrease accuracy on dataset B?


On the contrary, more data almost always improves generalization to unseen data. The more examples the model sees of the data-generating process, the closer its predictions get to those of the population; after all, the model has then seen a larger part of that population.
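A quick numerical sketch of this point (illustrative, not from the answer itself): estimating a population quantity, here just a mean, from a larger i.i.d. sample gives an estimate that is on average closer to the true population value. The distribution parameters and sample sizes below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean = 5.0  # hypothetical true population value


def mean_abs_error(n, repeats=200):
    # Average estimation error of the sample mean over many repeated draws
    # of n samples each from the same data-generating process.
    errs = [
        abs(rng.normal(population_mean, 2.0, size=n).mean() - population_mean)
        for _ in range(repeats)
    ]
    return float(np.mean(errs))


# The larger sample estimates the population quantity more accurately.
print(mean_abs_error(100) > mean_abs_error(10_000))  # True
```

The same logic is what makes a model trained on more data track the population better, provided the extra data comes from the same distribution.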

Hypothetically, if all hyperparameters (including the number of epochs and the batch size) were held constant, then more data means more gradient steps at the same learning rate, which could indeed make overfitting easier. However, if you regularize appropriately, choose a suitable learning rate, and so on, this is not a problem.
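To make the "more steps along the gradient" point concrete, here is a small sketch with hypothetical numbers: with batch size and epoch count held fixed, growing the dataset 10x means roughly 10x as many optimizer steps, so the same learning-rate schedule runs far longer than it was tuned for.

```python
def num_gradient_steps(n_samples, batch_size, epochs):
    # Steps per epoch is the number of mini-batches (ceiling division),
    # and the total is that times the number of epochs.
    steps_per_epoch = -(-n_samples // batch_size)  # ceil without math.ceil
    return steps_per_epoch * epochs


# Hypothetical setup: batch size 32, 20 epochs.
base = num_gradient_steps(10_000, 32, 20)     # original dataset A
scaled = num_gradient_steps(100_000, 32, 20)  # dataset A grown 10x
print(base, scaled)  # 6260 62500
```

In practice this is why the epoch count (or schedule length) is usually re-tuned when the dataset grows, rather than held constant.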

That said, if datasets A and B do not come from the same distribution, then simply adding more data from A will not remedy the mismatch. You should probably look into over-/undersampling, or other methods, depending on what exactly you mean by a different feature distribution.
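As a minimal sketch of the oversampling idea (my illustration, with a hypothetical mask standing in for "looks like dataset B"): duplicate training examples from the region of feature space that resembles the target distribution, so the effective training distribution shifts toward it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training features from dataset A.
X = rng.normal(size=(1000, 2))

# Hypothetical criterion for examples that resemble dataset B's
# feature distribution (here: large first feature, purely illustrative).
like_b = X[:, 0] > 1.0
idx_b = np.flatnonzero(like_b)
idx_a = np.flatnonzero(~like_b)

# Oversample the B-like minority (with replacement) to match the
# majority count, balancing the two groups in the training set.
boosted = rng.choice(idx_b, size=idx_a.size, replace=True)
X_resampled = X[np.concatenate([idx_a, boosted])]

print(X_resampled.shape[0] == 2 * idx_a.size)  # True: balanced groups
```

Whether resampling, reweighting, or something else is appropriate depends on how exactly the feature distributions of A and B differ.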

Source: Link, Question Author: asdfaefi, Answer Author: Frans Rodenburg