I’m having trouble understanding how to compare two sets of data by their distribution. For example, how can I tell whether column X100 has the same distribution as column Y1? Also, is there a way to express the distribution comparison of all columns against all columns? I’m a machine learning developer using Python, and this is part of a classification problem I’m working on. I would appreciate any help. Thanks 🙂

**Answer**

You can compare the distributions of the two columns using the two-sample Kolmogorov-Smirnov test, which is included in `scipy.stats` as `ks_2samp`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

From the Stack Overflow topic:

```
>>> from scipy.stats import ks_2samp
>>> import numpy as np
>>> np.random.seed(123456)
>>> # x and y are drawn from the same distribution; z is drawn from a different one
>>> x = np.random.normal(0, 1, 1000)
>>> y = np.random.normal(0, 1, 1000)
>>> z = np.random.normal(1.1, 0.9, 1000)
>>> ks_2samp(x, y)
Ks_2sampResult(statistic=0.022999999999999909, pvalue=0.95189016804849647)
>>> ks_2samp(x, z)
Ks_2sampResult(statistic=0.41800000000000004, pvalue=3.7081494119242173e-77)
```

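The null hypothesis is that the two samples are drawn from the same distribution. If the K-S statistic is small and, correspondingly, the p-value is high (greater than the chosen significance level, say 5%), we cannot reject that hypothesis. This does not prove the distributions are identical, only that the data show no significant evidence of a difference. Conversely, if the p-value is low, we reject the null hypothesis and conclude that the distributions differ. In the example above, x and y cannot be distinguished (p ≈ 0.95), while x and z clearly can (p ≈ 3.7e-77).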
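For the all-columns-to-all-columns part of the question, one possible approach is to run `ks_2samp` on every pair of columns and collect the p-values in a square table. The following is only a minimal sketch, assuming the data sit in a pandas DataFrame with numeric columns; the helper name `pairwise_ks_pvalues`, the demo column `Z`, and the generated data are hypothetical and not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def pairwise_ks_pvalues(df):
    """Return a symmetric DataFrame of two-sample K-S p-values for all column pairs.

    Hypothetical helper for illustration; assumes every column is numeric.
    """
    cols = list(df.columns)
    # Diagonal is 1.0: a column compared with itself gives statistic 0 and p-value 1.
    pvals = pd.DataFrame(np.ones((len(cols), len(cols))), index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        # Drop missing values per column; ks_2samp accepts samples of different sizes.
        result = ks_2samp(df[a].dropna(), df[b].dropna())
        pvals.loc[a, b] = result.pvalue
        pvals.loc[b, a] = result.pvalue
    return pvals


# Hypothetical demo data: X100 and Y1 share a distribution, Z does not.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "X100": rng.normal(0, 1, 1000),
    "Y1": rng.normal(0, 1, 1000),
    "Z": rng.normal(1.1, 0.9, 1000),
})

pvals = pairwise_ks_pvalues(demo)
alpha = 0.05  # assumed significance level
# True where the test does NOT reject "same distribution" at level alpha.
not_rejected = pvals > alpha
print(pvals.round(4))
print(not_rejected)
```

Each cell of `pvals` is read the same way as a single test result. With many columns the number of pairwise tests grows quickly, so some low p-values will appear by chance alone; a multiple-comparison correction may be worth considering before acting on individual cells.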

**Attribution**

*Source: Link, Question Author: Sahar Millis, Answer Author: hellpanderrr*