I always thought larger sample sizes were better.

Then I read something somewhere about how when sample sizes are larger, it’s easier to find significant p-values when they’re not really there (i.e., false positives), because significance gets exaggerated.

Is there a name for this phenomenon?

I’m currently working with a large sample size (around 5,000 cases) where I did a t-test and the p-value turned out to be less than 0.001. What test(s) can I use to determine whether this is a valid p-value or whether this happened because the sample size was large?

I’m not a statistics expert, so please pardon any “newb-ness” evident in my post.

**Answer**

> I always thought larger sample sizes were good.

Almost always, though there are situations where they don’t help much. However, as sample sizes become quite large, the particular aspects of the problem that are of most concern change.

> Then I read something somewhere about how when sample sizes are larger, it’s easier to find significant p-values when they’re not really there (i.e., false positives), because significance gets exaggerated.

As stated, this is untrue, though there are some things that may be of concern.

Let’s start with the basic assertion: Large samples don’t prevent hypothesis tests from working *exactly* as they are designed to. [If you’re able to, ask the source of the statement for some kind of reason to accept this claim, such as evidence that it’s true (whether by algebraic argument, simulation, logical reasoning or whatever – or even a reference). This will likely lead to a slight change in the statement of the claim.]

The problem isn’t generally false positives, but *true* positives — in situations where people don’t want them.

People often make the mistaken assumption that statistical significance always implies something practically *meaningful*. In large samples, it may not.

As sample sizes get very large, even very tiny differences from the situation specified in the null may become detectable. This is not a failure of the test; it’s how the test is supposed to work!
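To see this concretely, here is a small simulation sketch (the true shift of 0.02 standard deviations and the sample sizes are illustrative assumptions, not taken from the question): the same tiny, arguably negligible, true effect that is invisible at $n = 100$ is decisively detected at $n = 1{,}000{,}000$.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sample_p(sample, mu0=0.0):
    """Two-sided p-value for a one-sample z-test (normal approximation,
    essentially indistinguishable from the t-test at these sample sizes)."""
    n = len(sample)
    m = sum(sample) / n
    s2 = sum((x - m) ** 2 for x in sample) / (n - 1)
    z = (m - mu0) / math.sqrt(s2 / n)
    return 2.0 * (1.0 - norm_cdf(abs(z)))

random.seed(1)
true_mean = 0.02  # an assumed tiny shift from the null value of 0
p_values = {}
for n in (100, 10_000, 1_000_000):
    sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
    p_values[n] = one_sample_p(sample)
    print(f"n = {n:>9}: p = {p_values[n]:.4g}")
```

The p-value shrinking as $n$ grows is not the test misbehaving: the null really is (slightly) false, and with enough data the test can tell.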

[It sometimes seems to me to border on the perverse that while almost everyone will insist on consistency for their tests, so many will complain that something is wrong with hypothesis testing *when they actually get it*.]

When this bothers people it’s an indication that hypothesis testing (or at least the form of it they were using) didn’t address the actual research question they had. In some situations this is addressed better by confidence intervals. In others, it’s better addressed by calculation of effect sizes. In other situations equivalence tests might better address what they want. In other cases they might need other things.

[A caveat: If some of the assumptions don’t hold, you might in some situations get an increase in false positives as sample size increases, but that’s a failure of the assumptions, rather than a problem with large-sample hypothesis testing itself.]

In large samples, issues like sampling bias can completely dominate effects from sampling variability, to the extent that they’re the only thing that you see. Greater effort is required to address such issues, because biases whose effects are very small compared to sampling variation in small samples may dominate in large ones. Again, the impact of that kind of thing is not a problem with hypothesis testing itself, but with the way the sample was obtained, or with treating it as a random sample when it actually wasn’t.

> I’m currently working with a large sample size (around 5,000 cases) where I did a t-test and the p-value turned out to be less than 0.001. What test(s) can I use to determine whether this is a valid p-value or whether this happened because the sample size was large?

Some issues to consider:

Significance level: in very large samples, if you use the same significance levels that you would in small samples, you’re not balancing the costs of the two error types; you can reduce Type I error substantially with little loss of power at the effect sizes you care about. It would be odd to tolerate a relatively high Type I error rate when there’s little to gain from it. Hypothesis tests in large samples would sensibly be conducted at substantially smaller significance levels, while still retaining very good power (why have power of 99.99999% if you can get power of, say, 99.9% and cut your Type I error rate by a factor of 10?).
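As a numerical sketch of that tradeoff (the true difference of 0.15, common SD of 1, and 5,000 observations per group are assumed numbers for illustration, not the question’s data), a normal-approximation power calculation shows that dropping $\alpha$ from 0.05 to 0.0001 costs almost no power:

```python
import math
from statistics import NormalDist

nd = NormalDist()  # standard normal

def two_sample_power(delta, sd, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z/t-test,
    using the normal approximation (fine at large n)."""
    se = sd * math.sqrt(2.0 / n_per_group)   # SE of the difference in means
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value
    ncp = delta / se                         # standardized true difference
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

# Assumed-for-illustration numbers:
delta, sd, n = 0.15, 1.0, 5000
for alpha in (0.05, 0.001, 0.0001):
    pw = two_sample_power(delta, sd, n, alpha)
    print(f"alpha = {alpha:<7} power = {pw:.6f}")
```

At these numbers, power stays above 99.9% even after cutting $\alpha$ by a factor of 500, which is the point: in large samples you can buy a much smaller Type I error rate almost for free.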

Validity of p-value: You may like to address the robustness of your procedure to potential failure of assumptions; this is not addressed by hypothesis testing of assumptions on the data. You may also like to consider possible issues related to things like sampling biases (e.g. do you really have a random sample of the target population?)

Practical significance: compute CIs for the actual difference from the situation under the null. In the case of, say, a two-sample t-test, look at a CI for the difference in means* – it should exclude 0, but is the difference so small that you don’t care about it?

* (Or, if it’s more relevant to your situation, perhaps a calculation of effect size.)
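A minimal sketch of that check on simulated data (the group sizes and the assumed true difference of 0.05 SDs are illustrative, not the question’s data): the CI comfortably excludes 0, yet the standardized effect size (Cohen’s d) is tiny.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def diff_ci_and_d(a, b, conf=0.95):
    """Normal-approximation CI for mean(a) - mean(b), using per-group
    variances (Welch-style SE), plus a pooled-SD Cohen's d effect size."""
    ma, mb = fmean(a), fmean(b)
    sa, sb = stdev(a), stdev(b)
    na, nb = len(a), len(b)
    se = math.sqrt(sa**2 / na + sb**2 / nb)
    z = NormalDist().inv_cdf(0.5 + conf / 2.0)
    diff = ma - mb
    pooled_sd = math.sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2)
                          / (na + nb - 2))
    return (diff - z * se, diff + z * se), diff / pooled_sd

random.seed(7)
# Illustrative data: an assumed true difference of 0.05 SDs, n = 5000/group
a = [random.gauss(0.05, 1.0) for _ in range(5000)]
b = [random.gauss(0.00, 1.0) for _ in range(5000)]
(lo, hi), d = diff_ci_and_d(a, b)
print(f"95% CI for difference: ({lo:.4f}, {hi:.4f}); Cohen's d = {d:.3f}")
```

The interval is narrow (large samples do that for you); the question of whether a difference of that size matters is a subject-matter judgment, not a statistical one.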

One way to reassure yourself about your own test would be to carry out (before the test, and indeed ideally before you have data) a study of the power at some small-but-relevant-to-your-application effect size. If you have very good power there, and a reasonably low Type I error rate, then you would nearly always make the right decision when the effect size is at least that large, and nearly always make the right decision when the effect size is 0. The only region in which you would not nearly always make the correct choice is the narrow window of very small effect sizes (ones you don’t have a strong interest in rejecting over), where the power curve rises from $\alpha$ to whatever it was at the small effect size at which you did your power calculation.
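That argument can be sketched numerically, again with assumed numbers ($\alpha = 0.001$, 5,000 observations per group, SD of 1, normal-approximation power): power sits at $\alpha$ when the true effect is 0, is essentially 1 by an effect of 0.15 SDs, and the window of indeterminate behaviour in between is narrow.

```python
import math
from statistics import NormalDist

nd = NormalDist()

def power(delta, sd, n_per_group, alpha):
    """Normal-approximation power of a two-sided two-sample test."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    ncp = delta / se
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

# Assumed-for-illustration numbers:
alpha, n, sd = 0.001, 5000, 1.0
powers = {}
for delta in (0.0, 0.03, 0.06, 0.09, 0.12, 0.15):
    powers[delta] = power(delta, sd, n, alpha)
    print(f"effect size {delta:.2f} SD: power = {powers[delta]:.4f}")
```

Outside the roughly 0.03–0.12 SD window, the test is nearly always giving the right answer; whether effects inside that window matter to you is exactly the "small-but-relevant" judgment the power study is meant to pin down in advance.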

> I’m not a statistics expert, so please pardon any “newb-ness” evident in my post.

The entire point of this site is to generate good questions and good answers, and this question is quite good. You shouldn’t apologize for using the site for exactly what it’s here for. [However, aspects of it are addressed in other questions and answers on the site. If you look down the ‘Related’ column at the right-hand side of this page, you’ll see a list of links to somewhat similar questions (as judged by an automatic algorithm). At least a couple of the questions in that list are highly relevant, in a way that might have altered the form or emphasis of your question. But the basic question of the truth of the statement itself, relating to the possible occurrence of false positives, would presumably remain, so even if you had pursued those questions, you’d presumably still need to ask the main one.]

e.g. see this question; it has $n$ of about a hundred thousand.

One of the data sets in one of the other questions in the sidebar has sample size in the *trillions*. *That* is a big sample. In that kind of situation sampling variation (and so hypothesis testing) generally becomes completely irrelevant.

**Attribution**
*Source: Link, Question Author: thanks_in_advance, Answer Author: Glen_b*