Intro / Background / Example
A recent article connecting pollen with covid-19 has gone viral this week.
Higher airborne pollen concentrations correlated with increased SARS-CoV-2 infection rates, as evidenced from 31 countries across the globe. PNAS, March 23, 2021, 118 (12) e2019034118.
The third figure in that article sketches a correlation, which is used in a remarkable way.
Fig. 3 Bag plot depicting the date of onset of SARS-CoV-2 exponential infection phase. Date of onset of the exponential infection phase (x axis) across all sites versus the average pollen concentration of the previous 4 d (y axis).
It shows a (weak) correlation between pollen and time: later in March there were more, and higher, pollen concentrations than earlier in March.
The remarkable thing about this correlation is that the time points have been chosen by some measure for the onset date of the covid-19 epidemic in various places (which happened around 13 March for this sample).
Due to this, the authors argue that there is some relation between the onset date of the covid-19 epidemic and pollen concentrations (which is subtly different from a relation between time and pollen concentrations).
On a cross-sectional design for all 80 regions under study, it was found that the onset date of the exponential phase per region positively and significantly correlated with the cumulative amount of pollen up to 4 d before (P < 0.001, r = 0.25)
However, the onset date has nothing to do with the correlation that was found. We can see this when we plot the complete time series with the onset-day points from Fig. 3 overlaid.
The onset dates have little to do with the pollen concentrations. Any other random selection/filter of time points around 13 March would likely have produced a positive correlation as well, because there are more and higher pollen peaks later in March than at the beginning of March.
This link between the time points (the onset dates) and the pollen concentration is a non sequitur.
Is there a specific name for this particular fallacy, the correlation of time points? Or is there a textbook reference that demonstrates it?
For instance, if I wanted to shorten the above story/explanation to a single sentence like “In figure 3 they make the error/fallacy of ….”, what name or textbook reference could we put in place of the dots?
Observation of “spurious correlation” for time-series over the same time period is something that has been recognised in the statistical community for about a century. Yule (1926) observed that comparison of time-series vectors breaches the usual independent sampling assumptions in statistical problems, and that some simple deterministic series lead to correlation values with non-zero magnitude, in some cases giving perfect positive or negative correlation. Wald argues that when time-series have systematic serial correlation (i.e., auto-correlation) then they will tend to be correlated with one another when taken over the same or similar time periods, even if there is no causal connection between the series.
Below I give some simple examples that illustrate the phenomenon of interest here. For an affine time-series with non-zero slope, any time vector is perfectly correlated with its corresponding time-series vector (positively if the slope is positive, negatively if it is negative). For out-of-phase sinusoidal time-series, the time-series vectors are strongly negatively correlated, and can be perfectly negatively correlated for particular time vectors. Of particular interest here is the first case, which shows the statistical relationship between a time vector and its corresponding time-series vector under a simple trend. The case in your question is similar, insofar as it looks at the correlation between time values and pollen concentrations at those times. The low positive correlation simply means that there is a slight increasing trend in pollen concentration (relative to its variance) over the period in which the time values of interest occur. As you correctly point out, this does not really mean much; it says only that pollen concentration was trending upward (very weakly) over a particular time period that coincided with the onset of Covid phases.
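To see how little the choice of dates matters, here is a minimal simulation sketch in Python. Everything in it is synthetic and assumed for illustration: the 80 regions mirror the count quoted from the paper, but the slope, the noise level, and the normal spread of "onset" days around day 13 are arbitrary choices, not estimates from the pollen data. The onset days are drawn independently of the pollen values, yet the correlation between onset day and pollen on that day should come out positive in most runs, purely because the series trends upward.

```python
import numpy as np

rng = np.random.default_rng(0)

n_regions = 80           # number of regions, matching the count quoted from the paper
days = np.arange(31)     # day index within a hypothetical month


def onset_correlation(rng):
    """Correlation between random 'onset' days and synthetic pollen on those days."""
    # Synthetic pollen per region: weak upward trend plus large noise.
    # Slope (1.0) and noise sd (20.0) are illustrative assumptions only.
    pollen = 1.0 * days + rng.normal(0.0, 20.0, size=(n_regions, days.size))
    # Onset days scattered around day 13, drawn independently of the pollen values.
    onset = np.clip(np.rint(rng.normal(13.0, 5.0, size=n_regions)), 0, 30).astype(int)
    # Pollen value of each region on its own onset day.
    values = pollen[np.arange(n_regions), onset]
    return np.corrcoef(onset, values)[0, 1]


correlations = np.array([onset_correlation(rng) for _ in range(2000)])
mean_r = correlations.mean()
share_positive = (correlations > 0).mean()

print(f"mean correlation: {mean_r:.2f}")
print(f"share of runs with positive correlation: {share_positive:.2f}")
```

The onset days here carry no information about pollen at all; replacing them with any other set of dates spread over the same rising stretch gives the same kind of weak positive correlation.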
All of this really just reflects the fact that contemporaneous trends in time-series vectors lead to correlation between those vectors. If two time-series trend in the same direction over the same time period then they will tend to be positively correlated over that period. Likewise, if two time-series trend in opposite directions over the same time period then they will tend to be negatively correlated over that period. Several examples can be seen in the book Spurious Correlations, where contemporaneous temporal trends lead to high correlation.
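A minimal sketch of this effect, under assumed settings: two series are generated independently of each other, each with its own upward trend (the slopes and noise levels are arbitrary). Their raw correlation is high, while the correlation of the residuals after removing a fitted linear trend from each should be close to zero. The `detrend` helper is a hypothetical name introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(100)

# Two independent series that both happen to trend upward (assumed slopes and noise).
x = 0.5 * t + rng.normal(0.0, 3.0, size=t.size)
y = 0.3 * t + rng.normal(0.0, 3.0, size=t.size)

r_raw = np.corrcoef(x, y)[0, 1]


def detrend(series, t):
    """Remove a least-squares linear trend in t from the series."""
    slope, intercept = np.polyfit(t, series, 1)
    return series - (slope * t + intercept)


r_detrended = np.corrcoef(detrend(x, t), detrend(y, t))[0, 1]

print(f"raw correlation:       {r_raw:.2f}")
print(f"detrended correlation: {r_detrended:.2f}")
```

The high raw correlation comes entirely from the shared direction of the trends; once those are removed, what is left is just the correlation of two independent noise sequences.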
The fallacy that encapsulates your concern here is cum hoc ergo propter hoc (“with this, therefore because of this”). Inferring a causal connection from the mere fact that two things have contemporaneous trends can lead to error, and usually we require more than this for a good causal inference. (And certainly we would at least want to know if the authors here were testing a pre-registered hypothesis, or just making a post hoc observation of correlation. It is almost certainly the latter.) The take-home here is that when you observe that two time-series are correlated (even highly correlated) that does not really mean much, especially as evidence for an underlying causal connection. As you observe in your question, the correlation observed in the paper occurs because there was increasing pollen count during March, and that coincided temporally with more frequent onset of Covid “phases”. That is really not saying much, and if you just said that plainly then it would be an unremarkable statement that would not suggest any causal link between the two things.
Perfect positive correlation: As a simple illustration of high positive correlation, consider an affine time-series of the form:
$$x_t = \alpha + \beta t, \qquad \beta > 0.$$
Suppose we take some time vector $t = (t_1, \ldots, t_n)$ (with time points that are not all equal) and form the corresponding vector $x = (x_1, \ldots, x_n)$ composed of values of the series at those time points. Since $x_i = \alpha + \beta t_i$ for all $i = 1, \ldots, n$, it is easy to show that these vectors are perfectly correlated, i.e., they have Pearson correlation equal to one.
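The reason is short: the sample covariance of the time vector and the series vector is the slope times the sample variance of the time points, and the standard deviation of the series vector is the absolute slope times that of the time points, so the Pearson correlation reduces to the sign of the slope. A quick numerical check in Python, with arbitrary assumed values for the intercept, slope, and time points:

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta = 3.0, 0.7                         # assumed intercept and positive slope
t = np.sort(rng.uniform(0.0, 30.0, size=80))   # arbitrary, irregularly spaced time points

x = alpha + beta * t                           # affine series evaluated at those times

r_positive_slope = np.corrcoef(t, x)[0, 1]
r_negative_slope = np.corrcoef(t, alpha - beta * t)[0, 1]

print(f"correlation with positive slope: {r_positive_slope:.6f}")
print(f"correlation with negative slope: {r_negative_slope:.6f}")
```

Whatever time points are chosen, the result is plus one for a positive slope and minus one for a negative slope, up to floating-point error.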
Strong/perfect negative correlation: As a simple example of high negative correlation, consider the two time-series of the form:
Suppose we take some time vector $t = (t_1, \ldots, t_n)$ and form the corresponding vectors $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ composed of values of the series at those time points. Through the use of discrete Fourier transformation, it is easy to show that these vectors will tend to have high negative correlation, and in some cases they can have perfect negative correlation.
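As a rough numerical illustration, the sketch below uses an assumed pair of series, $\sin(t)$ and $\sin(t + \phi)$, which is a guess at a typical out-of-phase pair and not necessarily the form intended above. The function name `phase_shift_correlation`, the phase values, and the random time points are all hypothetical choices. For a phase shift close to half a period the correlation is strongly negative, and for a shift of exactly half a period the second series is the negative of the first, so the correlation is minus one for any time vector.

```python
import numpy as np

rng = np.random.default_rng(3)

# Arbitrary time points spread over several periods (an assumption for illustration).
t = rng.uniform(0.0, 20.0 * np.pi, size=500)


def phase_shift_correlation(t, phase):
    """Pearson correlation between sin(t) and sin(t + phase) at the given times."""
    x = np.sin(t)
    y = np.sin(t + phase)
    return np.corrcoef(x, y)[0, 1]


r_nearly_opposite = phase_shift_correlation(t, 0.9 * np.pi)   # almost half a period apart
r_exactly_opposite = phase_shift_correlation(t, np.pi)        # exactly half a period apart

print(f"phase 0.9*pi: {r_nearly_opposite:.3f}")
print(f"phase pi:     {r_exactly_opposite:.3f}")
```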