# Relationships between correlation and causation

For any two correlated events, A and B, the different possible relationships include:

1. A causes B (direct causation);
2. B causes A (reverse causation);
3. A and B are consequences of a common cause, but do not cause each
other;
4. A and B both cause C, which is (explicitly or implicitly)
conditioned on;
5. A causes B and B causes A (bidirectional or cyclic causation);
6. A causes C which causes B (indirect causation);
7. There is no connection between A and B; the correlation is a
coincidence.

What does the fourth point mean? “A and B both cause C, which is (explicitly or implicitly) conditioned on.” If A and B cause C, why do A and B have to be correlated?

“Conditioning” is a term from probability theory: https://en.wikipedia.org/wiki/Conditional_probability

Conditioning on C means that we only look at cases where C is true. “Implicitly” means that we may not state this restriction explicitly, and may not even be aware that we are making it.

The point means that, when A and B both cause C, observing a correlation between A and B among the cases where C is true does not mean there is a real relationship between A and B. It is only the conditioning on C (perhaps unwitting) that creates an artificial correlation.

Let’s take an example.

In a country there exist exactly two diseases, perfectly independent of each other. Call A: “person has the first disease” and B: “person has the second disease”. Assume $P(A)=0.1$ and $P(B)=0.1$.

Now any person who has at least one of these diseases goes to see the doctor, and nobody else does. Call C: “person goes to see the doctor”. We have $C=A \text{ or } B$.

Now let’s calculate a few probabilities:

• $P(C)=P(A)+P(B)-P(A)P(B)=0.19$
• $P(A|C)=P(B|C)=\frac{0.1}{0.19}\approx 0.53$
• $P(A \text{ and } B|C)=\frac{0.01}{0.19}\approx 0.053$
• $P(A|C)P(B|C)\approx 0.28$
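The figures above can be checked with exact arithmetic. The following is a minimal sketch, assuming only what the example states (independent A and B with probability 0.1 each, and C meaning “A or B”); the variable names are illustrative and not part of the original question.

```python
from fractions import Fraction

# Assumptions taken from the example: P(A) = P(B) = 0.1, A and B independent.
p_a = Fraction(1, 10)
p_b = Fraction(1, 10)

# Independence gives P(A and B) = P(A) * P(B).
p_a_and_b = p_a * p_b

# C = "A or B", so P(C) follows from inclusion-exclusion.
p_c = p_a + p_b - p_a_and_b

# A implies C, hence P(A and C) = P(A); the same holds for B and for "A and B".
p_a_given_c = p_a / p_c
p_b_given_c = p_b / p_c
p_ab_given_c = p_a_and_b / p_c

# If A and B were independent given C, this product would equal p_ab_given_c.
product_given_c = p_a_given_c * p_b_given_c

print("P(C)            =", p_c, "=", float(p_c))
print("P(A|C) = P(B|C) =", p_a_given_c, "~", round(float(p_a_given_c), 3))
print("P(A and B|C)    =", p_ab_given_c, "~", round(float(p_ab_given_c), 3))
print("P(A|C) * P(B|C) =", product_given_c, "~", round(float(product_given_c), 3))
```

Using `Fraction` rather than floats keeps the comparison exact: the joint conditional probability is 1/19, while the product of the two conditional marginals is 100/361.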

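Clearly, when conditioned on C, $A$ and $B$ are very far from being independent: $P(A \text{ and } B|C)\approx 0.053$ is much smaller than $P(A|C)P(B|C)\approx 0.28$. In fact, conditioned on C, $\text{not } A$ seems to “cause” $B$, because anyone who sees the doctor without having the first disease must have the second.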

If you use the list of persons who were recorded by their doctor(s) as a data source for an analysis, then there seems to be a strong (negative) correlation between diseases $A$ and $B$. You may not be aware that your choice of data source is itself a conditioning. This is also called “selection bias”.
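The selection effect can also be seen in a small simulation. This is only a sketch under the assumptions of the example (two independent diseases with probability 0.1 each, and a doctor's record for exactly those people who have at least one of them); the population size, the seed and the helper `pearson` are illustrative choices, not part of the original question.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000             # hypothetical population size

# Each person independently has disease A and disease B with probability 0.1.
population = [(int(rng.random() < 0.1), int(rng.random() < 0.1)) for _ in range(n)]

# The doctors' records contain only people with at least one disease (event C).
patients = [(a, b) for a, b in population if a or b]

corr_population = pearson([a for a, _ in population], [b for _, b in population])
corr_patients = pearson([a for a, _ in patients], [b for _, b in patients])

print("correlation in the whole population:", round(corr_population, 3))
print("correlation in the doctors' records:", round(corr_patients, 3))
```

In the whole population the correlation should come out close to 0, while in the doctors' records it should be strongly negative. With the numbers of the example, the exact correlation conditioned on C works out to $-0.9$.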