How to do data augmentation and train-validate split?

I am doing image classification using machine learning.

Suppose I have some training data (images) that I will split into training and validation sets. I also want to augment the data (produce new images from the original ones) by random rotations and noise injection. The augmentation is done offline.

Which is the correct way to do data augmentation?

  1. First split the data into training and validation sets, then do data augmentation on both training and validation sets.

  2. First split the data into training and validation sets, then do data augmentation only on the training set.

  3. First do data augmentation on the data, then split the data into training and validation sets.

Answer

First split the data into training and validation sets, then do data augmentation on the training set.

You use your validation set to estimate how your method performs on real-world data, so it should contain only real-world data. Adding augmented data will not make the validation estimate more accurate. At best it tells you how well your method responds to the augmentation itself, and at worst it distorts the validation results and makes them harder to interpret.
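As a rough illustration of the order of operations (split first, then augment only the training part), here is a minimal NumPy sketch. The function name `split_then_augment` and its parameters (`val_fraction`, `copies`, `noise_std`) are made up for this example, and it assumes square images with pixel values in [0, 1]. Rotation is simplified to random multiples of 90 degrees via `np.rot90`; an image library would be needed for arbitrary angles.

```python
import numpy as np

def split_then_augment(images, labels, val_fraction=0.2, copies=2,
                       noise_std=0.05, seed=0):
    """Split first, then augment ONLY the training part (hypothetical helper).

    Assumes `images` has shape (N, H, W) or (N, H, W, C) with H == W and
    values in [0, 1]; `labels` has shape (N,).
    """
    rng = np.random.default_rng(seed)

    # 1. Split the ORIGINAL images into training and validation indices.
    order = rng.permutation(len(images))
    n_val = int(round(len(images) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]

    x_train, y_train = images[train_idx], labels[train_idx]
    x_val, y_val = images[val_idx], labels[val_idx]  # left untouched

    # 2. Augment the training images only.
    aug_x, aug_y = [x_train], [y_train]
    for _ in range(copies):
        # Random rotation by 0, 90, 180 or 270 degrees (a simplification).
        turns = rng.integers(0, 4, size=len(x_train))
        rotated = np.stack([np.rot90(img, int(k))
                            for img, k in zip(x_train, turns)])
        # Additive Gaussian noise, clipped back to the valid pixel range.
        noisy = rotated + rng.normal(0.0, noise_std, size=rotated.shape)
        aug_x.append(np.clip(noisy, 0.0, 1.0))
        aug_y.append(y_train)

    return (np.concatenate(aug_x), np.concatenate(aug_y),
            x_val, y_val, train_idx, val_idx)
```

Because the split happens before any augmentation, no rotated or noisy copy of a validation image can end up in the training set. Augmenting first and splitting afterwards (option 3) would not give that guarantee, since copies of the same original image could land on both sides of the split.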

Attribution
Source: Link, Question Author: yangjie, Answer Author: burk
