Sample size determination using historical data and simulation

I am interested in the legitimacy and classification of the following approach to sample size determination. This question is inspired by “There is only one test!” by Allen Downey and “Statistics without the agonizing pain” by John Rauser.

Suppose there is a customer base, and each customer is described by some characteristic x, such as the total amount spent. Let f be an arbitrary metric that we care about, such as the average value. Let also T be a treatment that is hypothesized to improve the business as measured by f. We are then to run the following NHST:

H0: T does not affect the population as measured by f.

H1: T positively affects the population as measured by f.

Given an α and a minimum detectable effect size δ, the goal is to determine the minimum sample size n required to attain a prescribed β. At our disposal, we have a dataset X={xi} describing the customer base during a time interval in the recent past. We resort to simulation as follows.

First, we assume some n, draw n samples with replacement from X, and apply f. We repeat this many times and thereby obtain an empirical estimate of the sampling distribution of f under the null hypothesis. We then assume that, under the alternative hypothesis, the distribution has the same shape but is shifted by δ. Repeating the sampling with δ as an offset yields an estimate of the sampling distribution under the alternative hypothesis, from which we can compute β.

Placing the above procedure in an optimization loop, we obtain the minimum n required to attain the desired β. The following R snippet demonstrates the approach:


library(tidyverse)

data <- tibble(value = rlnorm(20000))

metric <- mean
alpha <- 0.05
beta <- 0.2
delta <- 0.05 * metric(data$value)

simulate <- function(n, N) {
  control <- replicate(N, { metric(sample(data$value, n, replace = TRUE)) })
  treatment <- replicate(N, { metric(sample(data$value, n, replace = TRUE)) }) + delta
  critical <- quantile(control, 1 - alpha)
  beta <- sum(treatment < critical) / N
  list(control = control,
       treatment = treatment,
       critical = critical,
       beta = beta)
}

N <- 1000
target <- function(n) beta - simulate(as.integer(n), N)$beta
n <- as.integer(uniroot(target, interval = c(1, 10000))$root)

result <- simulate(n, N)
tibble(control = result$control, treatment = result$treatment) %>%
  gather(group, value) %>%
  ggplot(aes(value, fill = group)) +
  geom_density(alpha = 0.5, color = NA) +
  geom_vline(xintercept = result$critical, linetype = 'dashed')


Is it a statistically sound procedure? How would you classify it? Does it have a name? Is it a variant of bootstrap? How does it compare with the classical t-test?

The last question is particularly concerning, as the t-test gives a totally different estimate:

power.t.test(delta = delta,
             sd = sd(data$value),
             sig.level = alpha,
             alternative = 'one.sided',
             power = 1 - beta)

The former gives 4822 per group, while the latter gives 8620 per group. I have also tried with rnorm(20000, mean = 100, sd = 50) instead of rlnorm(20000) and got 634 and 1261 per group, respectively. What accounts for this difference?


For those interested, I have written a blog post describing the technique in more detail.


You’re doing what’s called a “bootstrap” power calculation. It’s a perfectly reasonable method. Though not as common as methods that make assumptions about the distribution of the test statistic, it is valid, and likely more appropriate in cases where the necessary assumptions do not hold well enough. (The assumptions are almost never exactly true, but usually they’re plenty good enough.)
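To make the idea concrete outside of R, here is a minimal, self-contained sketch of the same one-sample bootstrap power calculation in Python; the function name `bootstrap_power` and its parameter defaults are my own illustration, not part of the original code.

```python
import random
import statistics

def bootstrap_power(data, delta, n, alpha=0.05, n_sim=2000, seed=0):
    """Estimate the power of the one-sample bootstrap test for a shift delta.

    Resamples n values with replacement to approximate the sampling
    distribution of the mean under H0; under H1 the same distribution
    is assumed, merely shifted by delta.
    """
    rng = random.Random(seed)
    control = [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_sim)]
    treatment = [statistics.fmean(rng.choices(data, k=n)) + delta
                 for _ in range(n_sim)]
    # One-sided critical value: the (1 - alpha) quantile under H0.
    critical = sorted(control)[int((1 - alpha) * n_sim) - 1]
    beta = sum(t < critical for t in treatment) / n_sim  # type II error rate
    return 1 - beta  # power

# Example: lognormal data, minimum detectable effect of 5% of the mean.
rng = random.Random(1)
data = [rng.lognormvariate(0, 1) for _ in range(5000)]
delta = 0.05 * statistics.fmean(data)
print(bootstrap_power(data, delta, n=2000))
```

An outer search over n, as in the uniroot call above, then yields the minimum sample size for a prescribed power.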

I didn’t find any references about this straightforward situation in a quick search of the literature, but here’s a blog post from statistics you can probably trust that refers to Common Errors in Statistics (and How to Avoid Them) by Phillip Good and James Hardin, which I’ve not personally read but have heard good things about, and both authors are respected statisticians.

Anyway, the problem with your code is that it’s doing a one-sample t-test, not a two-sample t-test. power.t.test for the one-sample case gives about 631, close to your 634. Try comparing with this:

power.t.test(delta=delta, sig.level=alpha, power=1-beta, sd=sd(data$value), 
             alternative='one.sided', type="one.sample")

That is, the test statistic you’re using is the mean of the test group, and comparing it with a fixed known control. This is a different situation than if you’re using the difference between two random groups as the test statistic.

Here’s a stab at what equivalent code for a two-sample test would look like; I get 1306, close to the expected 1261.

simulate <- function(n, N) {
  control_under_null <- replicate(N, { metric(sample(data$value, n, replace = TRUE)) })
  treatment_under_null <- replicate(N, { metric(sample(data$value, n, replace = TRUE)) })
  difference_under_null <- treatment_under_null - control_under_null
  control <- replicate(N, { metric(sample(data$value, n, replace = TRUE)) })
  treatment <- replicate(N, { metric(sample(data$value, n, replace = TRUE)) }) + delta
  difference <- treatment - control
  critical <- quantile(difference_under_null, 1 - alpha)
  beta <- sum(difference < critical) / N
  list(null = difference_under_null,
       alternative = difference,
       critical = critical,
       beta = beta)
}

PS. The Downey blog is misleading about the current state of statistical education; teaching via simulation and bootstrap is now very common, and many consider it best practice. For an example, see one of the OpenIntro statistics textbooks.

Source: Link, Question Author: Ivan, Answer Author: Aaron
