Is there a difference between doing preprocessing for a dataset in sklearn before versus after splitting the data with train_test_split? In other words, are these two approaches equivalent?
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# standardizing after splitting
X_train, X_test, y_train, y_test = train_test_split(data, target)
sc = StandardScaler().fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# standardizing before splitting
data_std = StandardScaler().fit_transform(data)
X_train, X_test, y_train, y_test = train_test_split(data_std, target)
Answer
No, the two approaches are not equivalent.
StandardScaler() standardizes features by removing the mean and scaling to unit variance.
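As a minimal sketch of what that means in practice (using a small made-up array): after fit_transform, each column has zero mean and unit variance, i.e. each value x is replaced by (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative column of values (hypothetical data)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_std = StandardScaler().fit_transform(X)

# The transformed column has zero mean and unit (population) variance
print(X_std.mean())  # ~0.0
print(X_std.std())   # ~1.0
```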
If you fit the scaler after splitting, it computes the mean and variance from the training set only; any values in the test set (including outliers) play no part in that computation.
If you fit the scaler on the whole dataset and then split, it computes the mean and variance from all values, test set included.
Since the mean and variance differ between the two cases, fit and transform produce different results. Note also that fitting on the whole dataset leaks information about the test set into training (data leakage), so fitting on the training set only is the recommended practice.
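A minimal sketch demonstrating the difference, using synthetic data (the data and split parameters below are illustrative assumptions, not from the question):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic data with one extreme outlier to make the effect visible
rng = np.random.RandomState(0)
data = rng.normal(size=(100, 1))
data[0] = 100.0  # outlier

X_train, X_test = train_test_split(data, test_size=0.25, random_state=0)

# Fit after splitting: statistics come from the training set only
sc_after = StandardScaler().fit(X_train)

# Fit before splitting: statistics come from the whole dataset
sc_before = StandardScaler().fit(data)

print(sc_after.mean_, sc_before.mean_)  # the learned means differ
print(sc_after.var_, sc_before.var_)    # the learned variances differ
```

Because the learned mean_ and var_ differ, transforming the same rows with the two scalers yields different standardized values.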
Attribution
Source: Link, Question Author: W.R., Answer Author: phanny