‘WHY’ does the LLN actually work? Why, after many trials, do results converge to actually ‘BE’ closer to the mean the larger the sample gets?

Can someone please explain to me, in simple, intuitive layman’s terms, ‘what’ causes the Law of Large Numbers to work and, more to the point, ‘why’ it works? As in, what’s actually going on within the maths that allows convergence to occur rather than not? (I’m not a mathematician by any means and don’t really understand the symbols – I’m just really interested in the ‘why’ of this concept.)

I’m aware of all that gambler’s fallacy stuff (chance having no memory, etc.) – that isn’t what I’m referring to. I’m looking to understand, in simple terms, ‘why’ the LLN causes convergence to happen as the sample of results grows.

For instance, suppose I had a strategy, system, edge or trick coin, all weighted to land 80% heads… What I’m trying to understand is ‘why’, after many trials, whatever I’m doing will converge to actually ‘be’ closer to 80% over time.

I’m completely aware that whatever it is has ‘no memory’ and that after only a few trials anything could happen… so in the case of the coin example, I could get 9 tails out of 10 flips (even though the coin is weighted 80% heads!)… However, after 10,000 flips, there’ll be roughly 8,000 heads and 2,000 tails (maybe 8,050 and 1,950 – whatever)… In other words, why does it ‘strengthen’ and converge the more trials are taken?

So to conclude, I’m wondering if anyone could tell me ‘how/why’ this happens and what’s going on in the maths for it to be the case rather than not. I know there’s nothing actually “pulling the results back to the 80% expected mean” – it just seems that way. Can someone please explain, in the simplest possible layman’s terms, why this artefact of maths works? Perhaps a simple example would help.
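One way to see the effect directly is to simulate the 80% trick coin from the question. The sketch below (function and variable names are my own, not from any reference) flips a simulated biased coin and records the running proportion of heads after each flip; early on the proportion can be wild, but it settles near 0.8 as the flip count grows:

```python
import random

def running_proportion(p, n, seed=0):
    """Simulate n flips of a coin that lands heads with probability p
    and return the running proportion of heads after each flip."""
    rng = random.Random(seed)
    heads = 0
    proportions = []
    for i in range(1, n + 1):
        heads += rng.random() < p  # True counts as 1
        proportions.append(heads / i)
    return proportions

props = running_proportion(0.8, 10_000)
# Early proportions fluctuate a lot; the late ones hug 0.8.
print(props[9], props[99], props[-1])
```

The key observation is that each new flip changes the running proportion by at most $1/n$, so a single lucky or unlucky streak matters less and less as $n$ grows – nothing "pulls" the results back; the early deviations simply get diluted.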

For a terrific, if mathematical, explanation of the Law of Large Numbers, see the blog entry of Terry Tao.

It contains in particular the following analogy:

“Imagine a table in which the rows are all the possible points in the sample space (this is a continuum of rows, but never mind this), and the columns are the number $n$ of trials, and there is a check mark whenever the empirical mean $\overline{X}_n$ deviates significantly from the actual mean ${\Bbb E}[X]$. The weak law asserts that the density of check marks in each column goes to zero as one moves off to the right. The strong law asserts that almost all of the rows have only finitely many checkmarks.”
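Tao's table can be simulated numerically. In this sketch (a toy setup of my own, using the question's 80% coin), each row is an independent run of coin flips, a "check mark" appears at column $n$ whenever the running mean deviates from $0.8$ by more than $\varepsilon = 0.05$, and we measure the density of check marks in a few columns:

```python
import random

def checkmark_density(p=0.8, eps=0.05, rows=2000, columns=(10, 100, 1000), seed=1):
    """For each sample path (row), track the running mean of p-coin flips.
    A 'check mark' appears at column n when |mean_n - p| > eps.
    Return the fraction of rows carrying a check mark at each column."""
    rng = random.Random(seed)
    counts = {n: 0 for n in columns}
    for _ in range(rows):
        heads = 0
        for n in range(1, max(columns) + 1):
            heads += rng.random() < p
            if n in counts and abs(heads / n - p) > eps:
                counts[n] += 1
    return {n: counts[n] / rows for n in columns}

densities = checkmark_density()
print(densities)  # the density of check marks shrinks as n grows
```

This is exactly the weak law in Tao's picture: the fraction of rows checked at column $n$ goes to zero as you move right.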

In my opinion, the fundamental reason for the (strong) law of large numbers to hold is the Borel-Cantelli lemma:

Given a sequence $E_1, E_2, E_3, \ldots$ of events satisfying
$$\sum_{n=1}^\infty {\Bbb P}(E_n) < \infty,$$
then with probability one, only finitely many of the events $E_n$ occur.

Note also that the law of large numbers holds outside the iid case: for instance, ergodic Markov chains satisfy the LLN even though their steps are neither independent [being Markov] nor identically distributed [their distribution depends on the starting value].
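A quick numerical check of this, using a hypothetical two-state chain of my own choosing: from state 0 jump to state 1 with probability $a$, and from state 1 back to 0 with probability $b$. The stationary probability of state 0 is $b/(a+b)$, and the long-run fraction of time spent in state 0 converges to it even though successive steps are dependent:

```python
import random

def markov_time_average(a=0.3, b=0.2, steps=100_000, seed=3):
    """Two-state Markov chain: from state 0 jump to 1 w.p. a; from 1 jump
    to 0 w.p. b.  Returns the fraction of time spent in state 0, which the
    ergodic LLN says converges to the stationary probability b / (a + b)."""
    rng = random.Random(seed)
    state, time_in_0 = 0, 0
    for _ in range(steps):
        time_in_0 += state == 0
        if state == 0:
            state = 1 if rng.random() < a else 0
        else:
            state = 0 if rng.random() < b else 1
    return time_in_0 / steps

avg = markov_time_average()
print(avg)  # theory: b / (a + b) = 0.2 / 0.5 = 0.4
```

The steps here clearly remember the previous state, yet the time average still settles down – independence is sufficient for the LLN, but not necessary.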