Difference between preprocessing train and test set before and after splitting

Is there a difference between doing preprocessing for a dataset in sklearn before and after splitting data into train_test_split?

In other words, are both of these approaches equivalent?

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#standardizing after splitting
X_train, X_test, y_train, y_test = train_test_split(data, target)
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

#standardizing before splitting
data_std = StandardScaler().fit_transform(data)
X_train, X_test, y_train, y_test = train_test_split(data_std, target)


No, the two approaches are not equivalent.

StandardScaler() standardizes features by removing the mean and scaling to unit variance.
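As a quick sketch of what that means in practice (toy values chosen here for illustration), StandardScaler's transform is equivalent to subtracting the column mean and dividing by the column standard deviation (computed with `ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small toy column (hypothetical values)
X = np.array([[1.0], [3.0], [5.0]])

sc = StandardScaler().fit(X)

# StandardScaler computes z = (x - mean) / std, using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

assert np.allclose(sc.transform(X), manual)
```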

If you fit the scaler after splitting, it computes the mean and variance from the training set alone; any values in the test set, including outliers, do not influence those statistics.

If you fit the scaler on the whole dataset and then split, the scaler computes the mean and variance from every value, including the test set. This leaks information from the test set into the preprocessing step.

Since the mean and variance differ between the two cases, the fit and transform steps produce different results.
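This difference is easy to demonstrate. The following sketch (using synthetic data invented for this example) fits one scaler on the training split only and another on the full dataset, then compares the learned statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 1 feature (values are illustrative)
rng = np.random.RandomState(0)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
target = rng.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    data, target, random_state=0)

# Fit on the training set only (no leakage from the test set)
sc_train = StandardScaler().fit(X_train)

# Fit on the whole dataset (test-set values influence the statistics)
sc_all = StandardScaler().fit(data)

# The learned means differ, so the two transforms differ as well
print("train-only mean:", sc_train.mean_)
print("whole-data mean:", sc_all.mean_)
assert not np.allclose(sc_train.mean_, sc_all.mean_)
```

In a real pipeline, only the first variant is correct: the test set should be transformed with statistics learned from the training data alone, mirroring how the model will see genuinely unseen data.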

Source: Link, Question Author: W.R., Answer Author: phanny
