Unequal sample sizes: When to call it quits

I’m peer reviewing an academic journal article and the authors wrote the following as justification for not reporting any inferential statistics (I deidentified the nature of the two groups):

In total, 25 of the 2,349 (1.1%) respondents reported X. We appropriately refrain from presenting analyses that statistically compare group X to group Y (the other 2,324 participants) since those results could be heavily driven by chance with an outcome this rare.

My question is: are the authors of this study justified in throwing in the towel with respect to comparing groups? If not, what might I recommend to them?

Statistical tests do not, in general, assume equal sample sizes. Different tests do carry different assumptions (e.g., normality), but equality of sample sizes is not one of them. Unless the test used is inappropriate in some other way (I can’t think of an issue right now), the type I error rate will not be affected by drastically unequal group sizes. Their phrasing implies (to my mind) that they believe it will be, so they appear to be confused about this issue.
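The claim about type I error can be checked with a quick simulation. This is only a sketch under stated assumptions: both groups are drawn from the same standard normal population (so the null is true and the variances are equal), the test is the pooled-variance $t$-test, and the function name, replicate count, and seed are my own illustrative choices. The 25 vs. 2,324 split mirrors the group sizes in the question.

```python
import numpy as np
from scipy import stats

def simulated_type1_rate(n1, n2, reps=2000, alpha=0.05, seed=0):
    """Fraction of null-true replicates in which the pooled t-test rejects."""
    rng = np.random.default_rng(seed)
    # Both groups come from the same N(0, 1) population (an assumption of this
    # sketch), so every rejection is a type I error.
    x = rng.standard_normal((reps, n1))
    y = rng.standard_normal((reps, n2))
    # One t-test per row; equal_var=True is the classic pooled-variance test.
    _, p = stats.ttest_ind(x, y, axis=1, equal_var=True)
    return float(np.mean(p < alpha))

balanced = simulated_type1_rate(100, 100)
unbalanced = simulated_type1_rate(25, 2324)
print(f"balanced   (100 vs 100): {balanced:.3f}")
print(f"unbalanced (25 vs 2324): {unbalanced:.3f}")
```

Both rejection rates should land near the nominal $\alpha=.05$, up to simulation noise. Note that this holds when the test's other assumptions are met; the sketch deliberately does not explore what happens when they are violated.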

On the other hand, type II error rates very much will be affected by highly unequal $n$s. This will be true no matter what the test (e.g., the $t$-test, Mann-Whitney $U$-test, or $z$-test for equality of proportions will all be affected in this way). For an example of this, see my answer here: How should one interpret the comparison of means from different sample sizes? Thus, they may well be “justified in throwing in the towel” with respect to this issue. (Specifically, if you expect to get a non-significant result whether the effect is real or not, what is the point of the test?)
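To make the loss of power concrete, here is a rough sketch that computes the power of a two-sided pooled $t$-test from the noncentral $t$ distribution while holding the total $N=2{,}349$ fixed and making the split more and more lopsided. The effect size $d=0.3$ is a purely hypothetical value I picked for illustration (the paper in question gives none), and the $t$-test itself is just a convenient stand-in for whatever test the authors might use.

```python
from scipy import stats

def t_test_power(d, n1, n2, alpha=0.05):
    """Two-sided power of the pooled two-sample t-test for standardized effect d."""
    df = n1 + n2 - 2
    # Noncentrality parameter of the t statistic under the alternative.
    ncp = d * (n1 * n2 / (n1 + n2)) ** 0.5
    t_crit = stats.t.isf(alpha / 2, df)
    upper = stats.nct.sf(t_crit, df, ncp)    # reject in the upper tail
    lower = stats.nct.cdf(-t_crit, df, ncp)  # reject in the lower tail
    return float(upper + lower)

N = 2349   # total sample size from the question
d = 0.3    # hypothetical effect size, chosen only for illustration
splits = [1174, 500, 100, 25, 5, 2]  # size of the smaller group
powers = [t_test_power(d, n1, N - n1) for n1 in splits]

for n1, pw in zip(splits, powers):
    print(f"n1 = {n1:4d}, n2 = {N - n1:4d}: power = {pw:.3f}")
```

Under these assumptions, power is essentially 1 for the balanced split, drops well below one half at 25 vs. 2,324, and creeps toward $\alpha$ as the smaller group shrinks further.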

As the sample sizes diverge, statistical power will converge to $\alpha$. This fact actually leads to a different suggestion, which I suspect few people have ever heard of and which would probably have trouble getting past reviewers (no offense intended): a compromise power analysis. The idea is relatively straightforward: in any power analysis, $\alpha$, $\beta$, $n_1$, $n_2$, and the effect size $d$ exist in relationship to each other. Having specified all but one, you can solve for the last. Typically, people do what is called an a priori power analysis, in which you solve for $N$ (generally assuming $n_1=n_2$). On the other hand, you can fix $n_1$, $n_2$, and $d$, and solve for $\alpha$ (or equivalently $\beta$), if you specify the ratio of type I to type II error rates that you are willing to live with. Conventionally, $\alpha=.05$ and $\beta=.20$, so you are saying that type I errors are four times worse than type II errors. Of course, a given researcher might disagree with that, but having specified a given ratio, you can solve for the $\alpha$ you should use in order to maintain some adequate power. This approach is a logically valid option for the researchers in this situation, although I acknowledge that its exoticness may make it a tough sell in a larger research community that has probably never heard of such a thing.
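For what it's worth, a compromise power analysis is easy to sketch numerically. The code below fixes $n_1$, $n_2$, and $d$, and searches for the $\alpha$ at which $\beta/\alpha$ equals a chosen ratio $q$. Everything specific here is an assumption on my part: the pooled two-sided $t$-test as the model, the hypothetical effect size $d=0.3$, the conventional ratio $q=.20/.05=4$, and the function names `t_test_beta` and `compromise_alpha`.

```python
from scipy import stats
from scipy.optimize import brentq

def t_test_beta(alpha, d, n1, n2):
    """Type II error rate of the two-sided pooled t-test at level alpha."""
    df = n1 + n2 - 2
    ncp = d * (n1 * n2 / (n1 + n2)) ** 0.5  # noncentrality parameter
    t_crit = stats.t.isf(alpha / 2, df)
    power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
    return float(1.0 - power)

def compromise_alpha(d, n1, n2, q=4.0):
    """Find the alpha at which beta / alpha equals q, with n1, n2, d fixed."""
    def gap(alpha):
        # Positive when beta is still too large relative to q * alpha.
        return t_test_beta(alpha, d, n1, n2) - q * alpha
    # beta falls and q * alpha rises as alpha grows, so gap changes sign once.
    return brentq(gap, 1e-10, 1 - 1e-10)

d = 0.3              # hypothetical effect size, for illustration only
n1, n2 = 25, 2324    # group sizes from the question
alpha = compromise_alpha(d, n1, n2, q=4.0)
beta = t_test_beta(alpha, d, n1, n2)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, power = {1 - beta:.3f}")
```

The resulting $\alpha$ will be noticeably larger than .05, which is the whole point: with the sample sizes fixed, you trade some type I error protection for power, in the proportion you declared acceptable up front.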