Grants often require a power analysis to support a proposed sample size. In proteomics (and most -omics), there are hundreds to thousands of features/variables measured across tens of samples (maybe hundreds, but unlikely). It is also known that some of these measurements (e.g., spectral counts of proteins) are not normally distributed, so we will use non-parametric tests for analysis. I have seen the power of a sample size determined assuming a single measurement and a t-test, but I don't think this is completely correct. Another problem with spectral counts specifically is that the hundreds of features are on very different scales with vastly different errors (larger values have less error). [This problem is nicely described in the limit fold change model, Mutch et al., 2002]
What would be the appropriate way to determine the power of a proposed sample size, given some assumptions about FDR and an acceptable fold-change? Using the tool here, I determined that the following assumptions:
- 300 genes
- 3 false positives
- 1.4 fold-differences
- 0.8 desired power
- 0.7 stdev
require a sample size of 49 per group.
This was handy since I am proposing a 50 vs. 50 design, know that a 1.4 fold-change is pretty widely accepted, that a 1% FDR is fine, and that I will probably measure about 300 proteins in this experiment. This problem of power or sample-size calculation will keep coming up, so it would be nice to have a referenced approach in place.
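For what it's worth, the numbers above can be roughly reproduced with a normal-approximation two-sample calculation, if you treat 3 expected false positives out of 300 genes as a per-test α of 0.01 and assume the 0.7 SD and 1.4 fold-change are both on the log2 scale (these are my assumptions, not necessarily what the tool does internally):

```python
from math import ceil, log2
from scipy.stats import norm

def n_per_group(genes=300, false_pos=3, fold=1.4, power=0.8, sd=0.7):
    """Normal-approximation sample size per group for a two-sided
    two-sample test. The per-test alpha is the allowed number of
    false positives divided by the number of genes tested (which
    treats essentially all genes as null)."""
    alpha = false_pos / genes                 # 3/300 = 0.01 per test
    delta = log2(fold)                        # effect size on the log2 scale
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z * sd / delta) ** 2)

print(n_per_group())  # 49 with the inputs above
```

This agrees with the 49 per group the tool reported, which suggests it is doing something close to a per-comparison-error-rate t-test calculation rather than anything FDR-specific.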
I read that a colleague proposed to model spectral counts with negative binomial distributions using the likelihood function, followed by a Wald test. Basically, it uses preliminary data to get protein variance estimates and then calculates detectable fold changes between groups for each quantile. There is also an FDR (alpha) input. So, given >80% power and a set sample size, they can determine detectable fold-changes for the 25% of proteins with the lowest variance, the middle 50%, and the 25% with the highest variance. The problem is that I don't know how they did this. Not sure if sharing this approach will help anyone with a possible answer.
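I don't know exactly how they did it either, but the general idea can be sketched by simulation: draw spectral counts from a negative binomial with a dispersion estimated from pilot data, test the group difference with a Wald test on the log ratio of means, and count rejections. Everything below (the means, the dispersion value, the delta-method standard error) is my own assumption, not their actual method:

```python
import numpy as np
from scipy.stats import norm

def nb_draw(mean, disp, size, rng):
    # NumPy parameterises the negative binomial by (n, p);
    # this gives variance = mean + disp * mean**2
    n = 1.0 / disp
    return rng.negative_binomial(n, n / (n + mean), size)

def wald_power(mean1, fold, disp, n, alpha=0.05, sims=2000, seed=0):
    """Monte Carlo power of a Wald test for a fold-change between
    two groups of negative-binomial counts, n samples per group."""
    rng = np.random.default_rng(seed)
    mean2 = mean1 * fold
    zcrit = norm.ppf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        x1 = nb_draw(mean1, disp, n, rng)
        x2 = nb_draw(mean2, disp, n, rng)
        m1, m2 = x1.mean(), x2.mean()
        if m1 == 0 or m2 == 0:
            continue  # log ratio undefined; count as a non-rejection
        # var(log mean) ~= var(X) / (n * mean^2) by the delta method,
        # with the NB variance mean + disp * mean^2 plugged in
        v1 = (m1 + disp * m1**2) / (n * m1**2)
        v2 = (m2 + disp * m2**2) / (n * m2**2)
        z = np.log(m2 / m1) / np.sqrt(v1 + v2)
        hits += abs(z) > zcrit
    return hits / sims

# e.g. power to detect a 1.5-fold change at n = 50/group
# for a hypothetical low-variance protein
print(wald_power(mean1=100, fold=1.5, disp=0.05, n=50))
```

To get "detectable fold change" instead of power, you would invert this: scan fold values until the simulated power crosses 80%, separately for the dispersions of the low-, middle-, and high-variance protein quantiles from the pilot data.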
In applications (especially ethics applications, where you have to do a power study) I like using this reference [Wang and Chen 2004], because it nicely explains the concept behind a power calculation for high-throughput data (whatever the data actually are).
In essence, in addition to the usual parameters (α, β, N, effect size) you use two additional parameters, λ and η. The latter, η, is the assumed number of truly altered genes, and λ is the fraction of the truly altered genes that you want to be able to detect. It is quite straightforward to extend any known power calculation to high-throughput data using this approach.
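To make that concrete, here is one way (my reading of the approach, so check it against the paper) to plug λ and η into a standard two-sample calculation: convert the allowed number of false positives r0 into a per-test α over the m − η true nulls, and use λ as the per-gene power:

```python
from math import ceil
from scipy.stats import norm

def hi_throughput_n(m, eta, lam, r0, effect, sd):
    """Sample size per group when testing m genes, of which eta are
    truly altered and we want to detect a fraction lam of them,
    allowing r0 expected false positives among the m - eta nulls."""
    alpha = r0 / (m - eta)       # per-test (comparison-wise) alpha
    z = norm.ppf(1 - alpha / 2) + norm.ppf(lam)  # lam = per-gene power
    return ceil(2 * (z * sd / effect) ** 2)

# e.g. 1000 genes, 50 truly altered, detect 90% of them, tolerate
# 1 expected false positive, effect 1.0 on the log2 scale, sd 1.0
print(hi_throughput_n(1000, 50, 0.9, 1, 1.0, 1.0))
```

The same substitution works with any exact power routine (a noncentral-t calculation, or a simulation like the negative binomial one discussed in the question) in place of the normal approximation used here.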
Wang, Sue-Jane, and James J. Chen. “Sample size for identifying differentially expressed genes in microarray experiments.” Journal of Computational Biology 11.4 (2004): 714-726.