Is it better to do exploratory data analysis on the training dataset only?

I’m doing exploratory data analysis (EDA) on a dataset, and I will then use it to select some features to predict a dependent variable.

The question is:
Should I do the EDA on my training dataset only? Or should I combine the training and test datasets, do the EDA on both, and select the features based on that analysis?

Answer

I’d recommend having a look at Section 7.10.2, “The Wrong and Right Way to Do Cross-validation”, in The Elements of Statistical Learning (Hastie, Tibshirani and Friedman): http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf.

The authors give an example in which someone does the following:

  1. Screen the predictors: find a subset of “good” predictors that show
    fairly strong (univariate) correlation with the class labels
  2. Using just this subset of predictors, build a multivariate classifier.
  3. Use cross-validation to estimate the unknown tuning parameters and
    to estimate the prediction error of the final model

This is very similar to doing EDA on all of your data (i.e. training plus test) and using the EDA to select “good” predictors.

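The authors explain why this is problematic: the predictors were chosen after looking at the labels of all the samples, including the ones that are later held out, so the held-out samples are no longer independent of the model. As a result, the cross-validated error rate will be artificially low, which might mislead you into thinking you’ve found a good model. The right way, according to the authors, is to do the screening inside each cross-validation fold, using only that fold’s training samples.

The same reasoning applies to a test set: if EDA on the test data influences which features you select, the test error will be optimistic too. So any EDA that feeds into feature selection should be done on the training data only.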
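To make the effect concrete, here is a minimal sketch in plain Python, loosely modelled on the simulation in that section of the book. The sizes (50 samples, 2000 predictors, 40 selected, 5 folds), the seed, and the helper names `abs_corr`, `screen`, `one_nn_errors` and `cv_error` are my own choices for illustration, not the authors’ code. The predictors are pure noise, so any honest error estimate should be around 50%.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def screen(X, y, rows, k):
    """Indices of the k features most correlated with y, using only `rows`."""
    labels = [y[i] for i in rows]
    scores = [(abs_corr([X[i][j] for i in rows], labels), j)
              for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def one_nn_errors(X, y, train, test, feats):
    """Number of test rows misclassified by 1-nearest-neighbour on `feats`."""
    errors = 0
    for t in test:
        nearest = min(train,
                      key=lambda i: sum((X[i][j] - X[t][j]) ** 2 for j in feats))
        errors += int(y[nearest] != y[t])
    return errors

def cv_error(X, y, k, n_folds, screen_inside_folds):
    """Cross-validated error rate of screening followed by 1-NN."""
    n = len(X)
    all_rows = list(range(n))
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    # Wrong way: pick the features once, using every row (held-out rows too).
    global_feats = None if screen_inside_folds else screen(X, y, all_rows, k)
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        # Right way: pick the features again from this fold's training rows only.
        feats = screen(X, y, train, k) if screen_inside_folds else global_feats
        errors += one_nn_errors(X, y, train, test, feats)
    return errors / n

random.seed(0)
n, p, k = 50, 2000, 40
# Noise predictors and labels that do not depend on them: true error is 50%.
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

wrong_way = cv_error(X, y, k, n_folds=5, screen_inside_folds=False)
right_way = cv_error(X, y, k, n_folds=5, screen_inside_folds=True)
print(f"screen on all data, then cross-validate: {wrong_way:.2f}")
print(f"screen inside each fold:                 {right_way:.2f}")
```

The first number should come out well below 0.5 even though there is nothing to learn, while the second should stay close to 0.5. Selecting features from EDA on training plus test data and then scoring on the test set leaks information in the same way.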

Attribution
Source: Link, Question Author: Aboelnour, Answer Author: Adrian
