Standard error of a count

I have a dataset of incident cases by season of a rare disease. For example, say there were 180 cases in the spring, 90 in the summer, 45 in the fall, and 210 in the winter. I’m struggling with whether it is appropriate to attach standard errors to these numbers. The research goals are inferential in the sense that we are looking for a seasonal pattern in disease incidence that might recur in the future. Thus, it feels intuitively like it should be possible to attach a measure of uncertainty to the totals. However, I’m not sure how one would compute a standard error in this case since we are dealing with simple counts rather than, e.g., means or proportions.

Finally, would the answer depend on whether the data represent the population of cases (every case that has ever occurred) or a random sample? If I am not mistaken, it generally does not make sense to present standard errors with population statistics, since there is no inference.

Answer

The population is the (hypothetical) set of all people who are at risk of getting the disease; usually, that consists of all people (or some clearly identifiable subgroup of people) residing in the study area. It is important to define this population clearly, because it is the target of the study and of all inferences made from the data.

When cases of the disease are independent (which might be a reasonable hypothesis when the disease is not readily communicated between people and is not caused by local environmental conditions) and they are rare, then the counts should closely follow a Poisson distribution. For this distribution, a good estimate of its standard deviation is the square root of the count.
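The square-root rule can be checked with a small simulation. The sketch below is illustrative only: the rate of 45, the number of draws, the seed, and the helper `draw_poisson` (Knuth's multiplication method, chosen here so that only the standard library is needed) are all assumptions made for the example, not part of the original analysis.

```python
import math
import random
import statistics

def draw_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(12345)   # fixed seed so the run is repeatable
true_rate = 45               # hypothetical expected count for one season
simulated = [draw_poisson(true_rate, rng) for _ in range(5000)]

mean = statistics.mean(simulated)
sd = statistics.stdev(simulated)

print("mean of simulated counts:", round(mean, 2))
print("SD of simulated counts:  ", round(sd, 2))
print("sqrt of true rate:       ", round(math.sqrt(true_rate), 2))
```

The simulated SD should land close to the square root of the rate, which is why the square root of a single observed count serves as a rough stand-in.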

Using these heuristics, the data (180, 90, 45, 210) would have associated standard deviations of (13.4, 9.5, 6.7, 14.5), which we can take provisionally as rough assessments of error. Conceptually, in each season there is a hypothetical true disease incidence rate (everybody in the population during that season has the same low risk of contracting the disease), but because getting this disease is thought of as a random event, the actual number of cases observed during a season will vary from that true rate. The square root of the true (but unknown!) rate quantifies the amount of variation likely to occur. Because the observed counts ought to be close to the true rates, their square roots should be reasonable proxies for the square roots of the true rates. These proxies are exactly what is meant by a “standard error.”
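As a minimal sketch of that calculation, assuming the four seasonal counts from the question and a Poisson model (the name `poisson_se` is just a label chosen for this example):

```python
import math

counts = {"spring": 180, "summer": 90, "fall": 45, "winter": 210}

def poisson_se(count):
    """Rough standard error of a Poisson count: the square root of the count."""
    return math.sqrt(count)

standard_errors = {season: round(poisson_se(n), 1) for season, n in counts.items()}
print(standard_errors)
# {'spring': 13.4, 'summer': 9.5, 'fall': 6.7, 'winter': 14.5}
```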

The first thing to notice about this calculation is that the variation among the counts (which have a range of 165 and a standard deviation of 77) is much greater than the individual SDs, which do not exceed 14.5. This confirms that the underlying rate is changing significantly by season: that’s to be expected. Accordingly, reporting the SD of 77 for this batch of data could be useful for indicating the magnitude of seasonal variation, but it is not relevant for indicating standard errors of the values.
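The comparison between the two kinds of spread can be reproduced as follows. This is a sketch using only the four counts from the question; the variable names are chosen for the example.

```python
import math
import statistics

counts = [180, 90, 45, 210]

# Spread among the four seasonal counts (sample SD, n - 1 denominator).
spread = statistics.stdev(counts)
count_range = max(counts) - min(counts)

# Largest per-season Poisson standard error (square root of the count).
largest_se = max(math.sqrt(n) for n in counts)

# How many times larger the between-season spread is than the largest SE.
ratio = spread / largest_se

print("range:", count_range)
print("SD among counts:", round(spread))
print("largest individual SE:", round(largest_se, 1))
print("ratio:", round(ratio, 1))
```

The between-season SD is several times larger than any single season's standard error, which is the basis for saying the two numbers measure different things.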

But what if the data are not independent? Disease outbreaks often occur in clusters. If, for instance, a typical cluster size were 9, then these data (approximately) reflect (20, 10, 5, 23) clusters, respectively. If we take these to be realizations of four Poisson variables and use their square roots to estimate SDs, we get (4.5, 3.2, 2.2, 4.8). Multiplying by 9 to convert from clusters to people gives (40, 28.5, 20, 43). Notice how much larger these values are than before, by a factor of about 3 (the square root of the cluster size): clustering increases relative error.
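The cluster adjustment can be sketched in the same way. The cluster size of 9 is the hypothetical value used in the text, and treating every cluster as having exactly that size is a simplifying assumption of this example.

```python
import math

counts = [180, 90, 45, 210]
cluster_size = 9  # hypothetical typical cluster size from the text

# Approximate number of clusters behind each seasonal count.
clusters = [round(n / cluster_size) for n in counts]

# Poisson SD of the cluster counts, then converted back to people.
cluster_sds = [math.sqrt(k) for k in clusters]
people_sds = [cluster_size * s for s in cluster_sds]

# Inflation relative to the independent-case standard errors.
inflation = [p / math.sqrt(n) for p, n in zip(people_sds, counts)]

print("clusters:", clusters)
print("SDs in people:", [round(p, 1) for p in people_sds])
print("inflation factors:", [round(f, 2) for f in inflation])
```

The inflation factors come out close to 3, the square root of the assumed cluster size, so a different assumed cluster size would scale the standard errors accordingly.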

That’s about as far as one can go with these limited data. These simple calculations have revealed that:

  • Characterizing the population is critical,

  • The square root of a count is a rough starting point for assessing its standard error,

  • The square root has to be multiplied (roughly) by some factor to reflect lack of independence in the disease cases (and this factor can approximately be related to sizes of disease clusters),

  • Variation among these counts primarily reflects variation in the disease rate over time rather than uncertainty (about the underlying Poisson intensity).

Attribution
Source: Link, Question Author: half-pass, Answer Author: whuber
