I am a software developer working on A/B testing systems. I don’t have a solid stats background but have been picking up knowledge over the past few months.
A typical test scenario involves comparing two URLs on a website. A visitor visits LANDING_URL and is then randomly forwarded to either URL_CONTROL or URL_EXPERIMENTAL. A visitor constitutes a sample, and a victory condition is achieved when the visitor performs some desirable action on that site. This constitutes a conversion, and the rate of conversions is the conversion rate (typically expressed as a percentage). A typical conversion rate for a given URL is something in the realm of 0.01% to 0.08%. We run tests to determine how new URLs compare against old URLs. If URL_EXPERIMENTAL is shown to outperform URL_CONTROL, we replace URL_CONTROL with URL_EXPERIMENTAL.
We have developed a system using simple hypothesis testing techniques. I used the answers to another CrossValidated question here to develop this system.
A test is set up as follows:
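- The conversion rate estimate CRE_CONTROL for URL_CONTROL is calculated using historical data.
- The desired target conversion rate CRE_EXPERIMENTAL for URL_EXPERIMENTAL is set.
- A confidence level of 0.95 (that is, a significance level of 0.05) is typically used.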
- A power of 0.8 is typically used.
Together, all of these values are used to compute the desired sample size. I’m using the R function power.prop.test to obtain this sample size.
A test will run until all samples are collected. At this point, the confidence intervals for CR_CONTROL and CR_EXPERIMENTAL are computed. If they do not overlap, then a winner can be declared with a significance level of 0.05 and power of 0.8.
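For concreteness, the end-of-test check described here could be sketched in Python roughly as follows. The Wald (normal-approximation) interval is an assumption on my part, since the question does not say which interval is computed, and the visitor and conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation (Wald) interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(rate * (1 - rate) / visitors)
    return rate - half_width, rate + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts, not taken from a real test.
control = wald_interval(conversions=177, visitors=1774)
experimental = wald_interval(conversions=240, visitors=1774)
print(control, experimental, intervals_overlap(control, experimental))
```

Requiring non-overlapping intervals is stricter than testing the difference between the two rates directly, so this rule tends to be on the conservative side.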
The users of our tests have two major concerns, though:
1. If, at some point during the test, enough samples are collected to show a clear winner, can’t the test be stopped?
2. If no winner is declared at the end of the test, can we run the test longer to see if we can collect enough samples to find a winner?
It should be noted that many commercial tools out there exist that allow their users to do exactly what our own users desire. I’ve read that there are many fallacies with the above, but I’ve also come across the idea of a stopping rule and would like to explore the possibility of using such a rule in our own systems.
Here are two approaches we would like to consider. The first is to use power.prop.test: compare the current measured conversion rates to the current number of samples and see if enough samples have been collected to declare a winner.
Example: A test has been set up to see if the following behavior exists in our system:
- CRE_CONTROL: 0.1
- CRE_EXPERIMENTAL: 0.1 * 1.3
- With these parameters, the sample size N is 1774.
However, as the test advances and reaches 325 samples, CRM_CONTROL (measured conversion rate for control) is 0.08. power.prop.test is run on the measured conversion rates and N is found to be 325, exactly the number of samples needed to declare CRM_EXPERIMENTAL to be the winner! At this point it is our hope that the test could be ended. Similarly, suppose the test reaches 1774 samples with no winner found, but then reaches 2122 samples, which is enough to show that a CRM_CONTROL of 0.1 and a CRM_EXPERIMENTAL of 0.128 is a result where a winner can be declared.
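To make the sample-size step concrete, here is a minimal Python sketch of the normal-approximation formula for a two-sided, two-sample test of proportions. It should agree closely with what power.prop.test reports by default, but it is a re-implementation for illustration and not the R code itself. The function name is hypothetical, and the experimental rate of 0.15 in the second call is my own illustrative value (the text above does not give it); I chose it because, paired with a control rate of 0.08, it yields the quoted N of 325.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p1, p2, sig_level=0.05, power=0.8):
    """Approximate n per group needed to detect p1 vs p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - sig_level / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (numerator / (p1 - p2)) ** 2


# Planned test: control 0.1, target 0.1 * 1.3.
print(ceil(sample_size_per_group(0.1, 0.1 * 1.3)))  # 1774
# Interim re-calculation with an assumed experimental rate of 0.15.
print(ceil(sample_size_per_group(0.08, 0.15)))      # 325
```

Note that this N is the number of visitors in each group, not the total across both URLs, which is also how power.prop.test reports it.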
In a related question, users advised that such a test is less credible because it encourages early stops with fewer samples and is vulnerable to estimation bias and to an increased number of Type I and Type II errors. Is there some way to make this stopping rule work? This is our preferred approach since it means less programming time for us. Perhaps this stopping rule could work by offering some kind of numerical score or scores that measure the credibility of the test should it be stopped early?
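The Type I error inflation mentioned in that advice is easy to see in a small simulation. The sketch below runs an A/A test (both arms share the same true rate, so every declared winner is a false positive) and compares stopping at the first significant interim look against testing once at the end. The z-test, the look schedule, and all parameter values are assumptions chosen for illustration.

```python
import random
from math import sqrt


def z_statistic(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation observed yet
    std_err = sqrt(2 * pooled * (1 - pooled) / n)
    return (conv_b / n - conv_a / n) / std_err


def simulate(rate=0.1, n_per_arm=1000, look_every=100, runs=400,
             z_crit=1.96, seed=1):
    """Return (false-positive rate with peeking, rate with one final test)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # Interim look: would a peeking user have declared a winner here?
            if i % look_every == 0 and abs(z_statistic(conv_a, conv_b, i)) > z_crit:
                crossed = True
        peeking_hits += crossed
        fixed_hits += abs(z_statistic(conv_a, conv_b, n_per_arm)) > z_crit
    return peeking_hits / runs, fixed_hits / runs


print(simulate())
```

With ten looks, the peeking rate typically comes out well above the nominal 5%, while the single final test stays close to it.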
The second approach is to use a sequential testing method such as SPRT. These methods of testing are designed exactly for the situation we find ourselves in: how can our users start a test and end it in such a way that they don’t waste excess time in testing, either by running a test too long or by having to start a test over with different parameters?
Of the two above methods, I favor SPRT because the mathematics is a bit easier for me to grasp and because it looks like it may be easier to program. However, I don’t understand how to use the likelihood function in this context. If someone could construct an example of how to compute the likelihood-ratio, the cumulative sum of the likelihood-ratio, and continue through an example illustrating a situation when one would continue monitoring, when one would accept the null hypothesis and the alternative hypothesis, that would help us determine if SPRT is the right way to go.
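As a rough illustration of the mechanics asked about here, the sketch below applies Wald's SPRT to a single stream of conversions, testing H0: p = p0 against H1: p = p1. This is a simplification: it monitors one arm against a fixed baseline rate, whereas a real A/B test has two arms and needs more care. The function name and the mapping of a significance level of 0.05 and power of 0.8 to alpha = 0.05 and beta = 0.2 are my assumptions.

```python
from math import log


def sprt(observations, p0, p1, alpha=0.05, beta=0.2):
    """Run Wald's SPRT over 0/1 observations.

    Returns (decision, number of observations used, cumulative log-LR).
    """
    upper = log((1 - beta) / alpha)   # crossing this accepts H1
    lower = log(beta / (1 - alpha))   # crossing this accepts H0
    llr = 0.0
    for n, converted in enumerate(observations, start=1):
        # Log-likelihood ratio contributed by this single visitor.
        llr += log(p1 / p0) if converted else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n, llr
        if llr <= lower:
            return "accept H0", n, llr
    return "continue", len(observations), llr


# Made-up visitor stream: 1 = converted, 0 = did not convert.
stream = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(sprt(stream, p0=0.1, p1=0.13))
```

Each conversion pushes the cumulative sum up by log(p1/p0) and each non-conversion pulls it down by a smaller amount; monitoring continues for as long as the sum stays between the two boundaries.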
This is an interesting problem and the associated techniques have lots of applications. They are often called “interim monitoring” strategies or “sequential experimental design” (the Wikipedia article, which you linked to, is unfortunately a little sparse), but there are several ways to go about this. I think @user27564 is mistaken in saying that these analyses must necessarily be Bayesian; there are certainly frequentist approaches for interim monitoring too.
Your first approach resembles one of the original approaches to interim monitoring, called ‘curtailment.’ The idea is very simple: you should stop collecting data once the experiment’s outcome is inevitable. Suppose you’ve got a collection of 100 As and/or Bs and you want to know whether it was generated by a process that selects an A or B at random each time (i.e., P(A)=P(B)=0.5). In this case, you should stop as soon as you count at least 58 items of the same kind; counting the remaining items won’t change the significance after that point. The number 58 comes from finding x such that 1−F(x;100;0.5)<α, where F is the cumulative binomial distribution and α is the chosen significance level.
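The threshold can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and α = 0.05 (one-sided) is my assumption, since the value of α is not stated.

```python
from math import comb


def binomial_sf(x, n, p):
    """P(X > x) for X ~ Binomial(n, p), i.e. 1 - F(x; n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(x + 1, n + 1))


def smallest_x(n=100, p=0.5, alpha=0.05):
    """Smallest x such that 1 - F(x; n, p) < alpha."""
    return next(x for x in range(n + 1) if binomial_sf(x, n, p) < alpha)


print(smallest_x())  # 58
```

Note that 1−F(58) is the probability of seeing strictly more than 58 items of one kind, so whether the stopping count is 58 or 59 depends on how the boundary is read; it is worth checking that convention before hard-coding the number.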
Similar logic lets you find the “inevitability points” for other tests where:
- The total sample size* is fixed, and
- Each observation contributes a bounded amount to the sample.
This would probably be easy for you to implement: calculate the stopping criteria offline and then just plug them into your site’s code. However, you can often do even better if you’re willing to terminate the experiment not only when the outcome is inevitable, but also when it is very unlikely to change.
This is called stochastic curtailment. For example, suppose, in the example above, that we’ve seen 57 As and 2 Bs. We might feel reasonably confident, if not absolutely certain, that there is at least one more A in the box of 100, and so we could stop. This review by Christopher Jennison and Bruce Turnbull works through Stochastic Curtailment in Section 4. They also have a longer book; you can peek at Chapter 10 via Google Books. In addition to the derivation, the book has some formulae where you can more or less plug in the results of your interim tests.
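For the 57 As and 2 Bs example, the "reasonably confident" part can be quantified with a short Python sketch. It computes the probability that the final count of As reaches the threshold, given what has been seen so far, a quantity sometimes called conditional power. The assumption that the remaining items are still drawn with P(A)=0.5 is mine, as is the function name; the threshold of 58 is the one from the curtailment example.

```python
from math import comb


def prob_reaching_threshold(successes, seen, total=100, threshold=58, p=0.5):
    """P(final count >= threshold | successes so far), remaining draws ~ Bernoulli(p)."""
    remaining = total - seen
    needed = max(threshold - successes, 0)
    # Sum the binomial probabilities of getting at least `needed` more successes.
    return sum(comb(remaining, k) * p**k * (1 - p)**(remaining - k)
               for k in range(needed, remaining + 1))


# 57 As and 2 Bs seen so far: only one more A is needed among the last 41 items.
print(prob_reaching_threshold(successes=57, seen=59))
```

A stochastic curtailment rule would stop once this probability exceeds some pre-chosen cutoff close to 1 (or falls below one close to 0); the choice of cutoff is a design decision that the review and book discuss.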
There are a number of other approaches too. Group sequential methods are designed for situations where you may not be able to obtain a set number of subjects and the subjects trickle in at variable rates. Depending on your site’s traffic, you might or might not want to look into this.
There are a fair number of R packages floating around CRAN, if that’s what you’re using for your analysis. A good place to start might actually be the Clinical Trials Task View, since a lot of this work came out of that field.
[*] Just some friendly advice: be careful when looking at significance values calculated from very large numbers of data points. As you collect more and more data, you will eventually find a significant result, but the effect might be trivially small. For instance, if you asked the whole planet whether they prefer A or B, it’s very unlikely that you would see an exact 50:50 split, but it’s probably not worth retooling your product if the split is 50.001:49.999. Keep checking the effect size (i.e., difference in conversion rates) too!