# Realistically, does the i.i.d. assumption hold for the vast majority of supervised learning tasks?

The i.i.d. assumption states:

We are given a data set, $$\{(x_i, y_i)\}_{i = 1, \ldots, n}$$, where each pair $$(x_i, y_i)$$ is generated in an independent and identically distributed fashion.

To me, this physically means that the generation of $$(x_i, y_i)$$ has no effect on $$(x_j, y_j)$$ for $$j \neq i$$, and vice versa.

But does this hold true in practice?

For example, the most basic machine learning task is prediction on the MNIST dataset. Is there a way to know whether MNIST was generated in an i.i.d. fashion? Similarly for thousands of other data sets: how does any practitioner know how a data set was generated?

Sometimes I also see people mention shuffling the data to make it "more independent" or random. Does shuffling create a tangible benefit compared to training on a non-shuffled data set?

For example, suppose we create a "sequential" MNIST dataset containing digits arranged in an increasing sequence 1, 2, 3, 4, 5, 6, ... Obviously, this data set was not generated in an independent fashion: after a 1, the next digit must be a 2. But does training a classifier on this data set make any difference compared to training on a shuffled data set?
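One way to see what ordering can do is a minimal numerical sketch (the two-class 0/1-valued data and the constant step size below are my own illustrative choices, nothing to do with the real MNIST): with a constant learning rate, SGD's final iterate is an exponentially weighted average that favours the most recently seen samples, so label-sorted data drags it toward the last class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a label-sorted dataset: 500 samples from "class 0"
# (value 0.0) followed by 500 samples from "class 1" (value 1.0).
data_sorted = np.concatenate([np.zeros(500), np.ones(500)])
data_shuffled = rng.permutation(data_sorted)

def sgd_mean(data, lr=0.1):
    """Constant-step SGD on the squared loss (w - x)^2 / 2, i.e. mean
    estimation. Each step is w <- (1 - lr) * w + lr * x, an exponential
    moving average, so the final iterate is dominated by the most
    recently seen samples."""
    w = 0.0
    for x in data:
        w -= lr * (w - x)
    return w

w_sorted = sgd_mean(data_sorted)      # ends near 1.0: the last-seen class wins
w_shuffled = sgd_mean(data_shuffled)  # ends near the true mean, 0.5
print(w_sorted, w_shuffled)
```

On the sorted data the estimate ends essentially at 1.0 (the last block of samples), while on the shuffled data it ends near the true mean 0.5. Real classifiers are more complicated, but this is the basic mechanism behind the advice to shuffle: stochastic optimizers treat the recent part of the stream as representative of the whole.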

Just some basic questions.

The operational meaning of the IID condition is given by the celebrated "representation theorem" of Bruno de Finetti (which, in my humble opinion, is one of the greatest innovations of probability theory ever discovered). According to this brilliant theorem, if we have a sequence $$\mathbf{X} = (X_1, X_2, X_3, \ldots)$$ with empirical distribution $$F_\mathbf{x}$$, and the values in the sequence are exchangeable, then we have:
$$X_1, X_2, X_3, \ldots \mid F_\mathbf{x} \sim \text{IID } F_\mathbf{x}.$$
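A standard concrete example of an exchangeable but non-independent sequence is the Pólya urn (my choice of illustration, not part of the theorem's statement). A small simulation sketch, checking numerically that the draws are exchangeable yet dependent:

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_draws(n_draws, rng):
    """Draw a sequence from a Polya urn starting with 1 red and 1 black
    ball. Each drawn ball is replaced together with one extra ball of the
    same colour, so the draws are exchangeable but not independent."""
    red, black = 1, 1
    out = []
    for _ in range(n_draws):
        x = rng.random() < red / (red + black)
        out.append(int(x))
        red += x
        black += 1 - x
    return out

trials = np.array([polya_draws(2, rng) for _ in range(100_000)])
x1, x2 = trials[:, 0], trials[:, 1]

# Exchangeability: the outcomes (1, 0) and (0, 1) are equally likely.
p_10 = np.mean((x1 == 1) & (x2 == 0))
p_01 = np.mean((x1 == 0) & (x2 == 1))

# Dependence: the second draw leans toward whatever the first draw was.
p2_given_1 = x2[x1 == 1].mean()   # theoretically 2/3
p2_given_0 = x2[x1 == 0].mean()   # theoretically 1/3
print(p_10, p_01, p2_given_1, p2_given_0)
```

The empirical frequencies of (1, 0) and (0, 1) agree, while the conditional frequencies differ sharply, so the sequence is exchangeable but far from independent. De Finetti's theorem explains this: conditional on the urn's limiting red fraction (which is itself random, with a Beta distribution), the draws are i.i.d. Bernoulli, exactly as the representation above asserts.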