Stopping rules affect the relationship between P-values and the error rates associated with decisions. A recent paper by Simmons et al. (2011) coins the term "researcher degrees of freedom" to describe a collection of behaviours that they consider to be responsible for many of the reports in the psychology literature that have been found not to be reproducible.
Of those behaviours, optional stopping rules or undeclared interim analyses are what I am currently interested in. I describe their effect on error rates to my students, but they do not seem to be described in the textbooks that my students use (or don’t use!). In the main bookshop at my university there are fourteen statistics textbooks aimed at introductory-level students in various disciplines such as biosciences, business, engineering etc. Only one of those texts contained an index item ‘sequential testing’ and none had an index item ‘stopping rule’.
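The effect on error rates can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions: the data are standard normal with a known standard deviation of 1, the null hypothesis is true, and a two-sided z-test is applied at each look. The function name and the choice of 100 observations with a look every 10 are made up for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_sims, n_max, look_every, alpha=0.05, seed=0):
    """Fraction of null experiments declared 'significant' at any look.

    Assumes N(0, 1) data (known sigma = 1) and a two-sided z-test.
    look_every == n_max means a single analysis at the planned N.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            if n % look_every == 0:
                z = total / n ** 0.5  # sample mean divided by its standard error
                if abs(z) > z_crit:
                    hits += 1  # stop at the first 'significant' look
                    break
    return hits / n_sims


# One analysis at N = 100 versus an undeclared look after every 10 observations.
fixed_rate = false_positive_rate(n_sims=4000, n_max=100, look_every=100)
peeking_rate = false_positive_rate(n_sims=4000, n_max=100, look_every=10)
print(f"fixed N: {fixed_rate:.3f}   optional stopping: {peeking_rate:.3f}")
```

With a single analysis the rate should sit close to the nominal 0.05, while stopping at the first significant look should push it well above that, even though every individual test uses the same threshold.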
Is there an introductory level statistics textbook that explains the issue of optional stopping rules?
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. doi:10.1177/0956797611417632
You can’t have a stopping rule without some idea of your distribution and your effect size – which you don’t know a priori.
Also yes, we need to focus on effect size – and it has never been regarded as correct to consider only p-values, and we should certainly not be showing tables or graphs that show p-values or F-values rather than effect size.
There are problems with traditional Statistical Hypothesis Inference Testing (which Cohen says is worthy of its acronym, and Fisher and Pearson would both turn over in their graves if they saw all that is being done in their violently opposed names today).
To determine N, you need to have already determined a target significance and power threshold, as well as making lots of assumptions about distribution, and in particular you also need to have determined the effect size that you want to establish. Indolering is exactly right that this should be the starting point – what minimum effect size would be cost effective!
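As a rough sketch of how those inputs determine N, the following uses the textbook normal approximation for comparing two group means with equal group sizes. The function name is hypothetical, the effect size is a standardized difference (Cohen's d), and an exact calculation based on the t distribution would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    where d is the standardized minimum effect size of interest.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Smaller minimum effects of interest demand much larger samples.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

The point of the sketch is the dependence on the effect size: N cannot be fixed until the minimum effect worth detecting has been chosen.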
The “New Statistics” is advocating showing the effect sizes (as paired difference where appropriate), along with the associated standard deviations or variance (because we need to understand the distribution), and the standard errors or confidence intervals (but the latter is already locking in a p-value and a decision about whether you are predicting a direction or an each way bet). But setting a minimum effect of specified sign with a scientific prediction makes this clear – although the pre-scientific default is to do trial and error and just look for differences. But again you have made assumptions about normality if you go this way.
Another approach is to use box-plots as a non-parametric approach, but the conventions about whiskers and outliers vary widely and even these themselves originate in distributional assumptions.
The stopping problem is indeed not a problem of an individual researcher setting or not setting N, but that we have a whole community of thousands of researchers, where 1000 is much more than 1/alpha for the traditional 0.05 level. The answer is currently proposed to be to provide the summary statistics (mean, stddev, stderr – or corresponding “non-parametric” versions – median etc. as with boxplot) to facilitate meta-analysis, and present combined results from all experiments whether they happen to have reached a particular alpha level or not.
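To put rough numbers on this, here is a minimal sketch. It assumes the tests are independent and that every null hypothesis is true, and it uses standard fixed-effect (inverse-variance) pooling as one simple way of combining reported effects and standard errors. The function names and the three example studies are made up for illustration.

```python
from math import sqrt


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance weighted estimate and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


print(expected_false_positives(1000))  # tests expected to cross 0.05 by chance
print(prob_at_least_one(1000))         # essentially certain

# Hypothetical studies, pooled whether or not each one was 'significant'.
effects = [0.10, 0.35, 0.22]
std_errors = [0.15, 0.20, 0.12]
print(pool_fixed_effect(effects, std_errors))
```

Pooling only works if the non-significant studies are reported too, which is why the summary statistics matter more than whether any single experiment crossed the alpha threshold.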
Closely related is the multiple testing problem, which is just as fraught with difficulty, and where experiments are kept oversimplistic in the name of preserving power, whilst overcomplex methodologies are proposed to analyze the results.
I don’t think there can be a textbook chapter dealing with this definitively yet, as we still have little idea what we are doing…
For the moment, the best approach is probably to continue to use the traditional statistics most appropriate to the problem, combined with displaying the summary statistics – the effect and standard error and N being the most important. The use of confidence intervals is basically equivalent to the corresponding T-test, but allows comparing new results to published ones more meaningfully, as well as allowing an ethos encouraging reproducibility, and publication of reproduced experiments and meta-analyses.
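The equivalence between an interval and the corresponding test can be sketched as follows. This is an illustration only: it uses the large-sample normal approximation (a z interval) rather than the t distribution, the function name is hypothetical, and the effect and standard error values are made up.

```python
from statistics import NormalDist


def ci_and_p(effect, std_error, alpha=0.05):
    """Two-sided (1 - alpha) interval and p-value for H0: effect = 0.

    Normal approximation; a t-based version would use t quantiles with the
    appropriate degrees of freedom instead of z quantiles.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ci = (effect - z_crit * std_error, effect + z_crit * std_error)
    p_value = 2 * (1 - z.cdf(abs(effect) / std_error))
    return ci, p_value


for effect, se in [(0.5, 0.2), (0.3, 0.2)]:
    (low, high), p = ci_and_p(effect, se)
    excludes_zero = low > 0 or high < 0
    # The interval excludes zero exactly when p falls below alpha.
    print(f"effect={effect}, CI=({low:.3f}, {high:.3f}), p={p:.3f}, "
          f"excludes zero: {excludes_zero}")
```

The interval carries the same decision as the test, but it also reports the effect and its precision, which is what a later comparison or meta-analysis needs.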
In terms of Information Theoretic or Bayesian approaches, they use different tools and make different assumptions, but still don’t have all the answers either, and in the end face the same problems, or worse ones because Bayesian inference steps back from making a definitive answer and just adduces evidence relative to assumed or absent priors.
Machine Learning in the end also has results which it needs to consider for significance – often with CIs or T-Test, often with graphs, hopefully pairing rather than just comparing, and using appropriately compensated versions when the distributions don’t match. It also has its controversies about bootstrapping and cross-validation, and bias and variance. Worst of all, it has the propensity to generate and test myriads of alternative models just by parameterizing thoroughly all the algorithms in one of the many toolboxes, applied to the datasets thoughtfully archived to allow unbridled multiple testing. Worse still, it is still in the dark ages using accuracy, or worse still F-measure, for evaluation – rather than chance-corrected methods.
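A small sketch of why chance correction matters: a classifier that labels everything positive on a dataset with 90% positives. The confusion-matrix counts are made up, the function name is hypothetical, and Cohen's kappa and informedness (Youden's J) are used here as two common examples of chance-corrected measures, not as the only options.

```python
def evaluation_scores(tp, fn, fp, tn):
    """Accuracy, F1, informedness and Cohen's kappa from a 2x2 confusion matrix.

    Assumes both classes occur in the data and expected agreement is below 1,
    so that none of the denominators is zero.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Informedness (Youden's J): recall on positives + recall on negatives - 1.
    informedness = tp / (tp + fn) + tn / (tn + fp) - 1
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "f1": f1,
            "informedness": informedness, "kappa": kappa}


# 'Always predict positive' on 90 positives and 10 negatives.
scores = evaluation_scores(tp=90, fn=0, fp=10, tn=0)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and F-measure both look respectable here, while the chance-corrected measures come out at zero, reflecting that the classifier has learned nothing beyond the class prevalence.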
I have read dozens of papers on these issues, but have failed to find anything totally convincing – except the negative survey or meta-analysis papers that seem to indicate that most researchers don’t handle and interpret the statistics properly with respect to any “standard”, old or new. Power, multiple testing, sizing and early stopping, interpretation of standard errors and confidence intervals, … these are just some of the issues.
Please shoot me down – I’d like to be proven wrong! In my view there’s lots of bathwater, but we haven’t found the baby yet! At this stage none of the extreme views or name-brand approaches looks promising as being the answer, and those that want to throw out everything else have probably lost the baby.