Difference between training and test data distribution

A basic assumption in machine learning is that the training and test data follow the same distribution. In reality, though, this is highly unlikely. Covariate shift describes the situation in which the training and test distributions are different. Can someone clear up the following doubts?

  1. How can I check whether two distributions are statistically different?
  2. Can a kernel density estimate (KDE) be used to estimate the probability distributions and tell them apart?
  3. Let's say I have 100 images of a specific category. The number of test images is 50. I'm varying the number of training images from 5 to 50 in steps of 5. Can I say the probability distributions are different when using 5 training images and 50 test images, after estimating them by KDE?

Answer

Ordinarily, you would obtain your training data as a simple random sample of your total dataset. This allows you to take advantage of all the known properties of random samples, including the fact that the training and test data then have the same underlying distribution. Indeed, the main purpose of this split is to use one set of data to “train” your model (i.e., fit the model) and the other set of data to test hypotheses of interest in that model. If you do not randomly sample your training data, you get all sorts of problems arising from the fact that there may be systematic differences between the two parts of your data.
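
As a rough illustration of the point about random sampling, here is a minimal Python sketch. Everything in it is an assumption made for the example: the data are a synthetic one-dimensional feature (standing in for, say, one summary value per image), and the two-sample Kolmogorov-Smirnov test from SciPy (`scipy.stats.ks_2samp`) is just one possible way to compare two samples, not something the question or this answer prescribes. It draws the training set as a simple random sample, compares it with the test set, and then does the same for a deliberately systematic split.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical data: one scalar feature for each of 100 items (assumption).
feature = rng.normal(loc=0.0, scale=1.0, size=100)

# Simple random sample: shuffle the indices, then split 50/50.
perm = rng.permutation(feature.size)
train_idx, test_idx = perm[:50], perm[50:]
train, test = feature[train_idx], feature[test_idx]

# Two-sample KS test; the null hypothesis is that both samples
# come from the same underlying distribution.
random_split = ks_2samp(train, test)
print("random split:    ", random_split.statistic, random_split.pvalue)

# For contrast, a systematic split: the 50 smallest values vs the 50 largest.
order = np.argsort(feature)
biased_split = ks_2samp(feature[order[:50]], feature[order[50:]])
print("systematic split:", biased_split.statistic, biased_split.pvalue)
```

In the systematic split the two halves do not overlap at all, so the KS statistic reaches its maximum and the p-value is tiny. For the random split, any difference the test reports is due to sampling variation only, since both halves come from the same population by construction. With very small samples (for example 5 training images), a test like this has little power, so failing to reject does not show that the distributions are equal.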

Attribution
Source: Link, Question Author: Daniel Wonglee, Answer Author: Ben
