I need some guidance on the appropriate level of pooling to use for difference-of-means tests on time series data. I am concerned about temporal and sacrificial pseudoreplication, which seem to be in tension in this application. This concerns a mensurative study rather than a manipulative experiment.
Consider a monitoring exercise: A system of sensors measures dissolved oxygen (DO) content at many locations across the width and depth of a pond. Measurements for each sensor are recorded twice daily, as DO is known to vary diurnally. The two values are averaged to record a daily value. Once a week, the daily results are aggregated spatially to arrive at a single weekly DO concentration for the whole pond.
Those weekly results are reported periodically, and further aggregated – weekly results are averaged to give a monthly DO concentration for the pond. The monthly results are averaged to give an annual value. The annual averages are themselves averaged to report decadal DO concentrations for the pond.
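To make the aggregation chain concrete, here is a minimal sketch of averaging in stages. The DO values and the function names are hypothetical, chosen only for illustration. It also shows one caveat: a mean of group means equals the mean of all the underlying values only when every group contains the same number of values.

```python
from statistics import mean

def mean_of_group_means(groups):
    """Average each group first, then average the group means."""
    return mean(mean(g) for g in groups)

def grand_mean(groups):
    """Average all values directly, ignoring group boundaries."""
    return mean(v for g in groups for v in g)

# Hypothetical weekly DO values (mg/L), grouped by month.
balanced = [[8.0, 8.4, 7.9, 8.1], [7.2, 7.0, 7.5, 7.1]]
# Same data, but the second month has a fifth week.
unbalanced = [[8.0, 8.4, 7.9, 8.1], [7.2, 7.0, 7.5, 7.1, 6.2]]

# Equal group sizes: staged averaging and direct averaging agree.
balanced_gap = abs(mean_of_group_means(balanced) - grand_mean(balanced))
# Unequal group sizes: the two approaches give different answers.
unbalanced_gap = abs(mean_of_group_means(unbalanced) - grand_mean(unbalanced))

print(balanced_gap, unbalanced_gap)
```

Because every year has twelve months, the monthly-to-annual-to-decadal steps are balanced, which is why the decadal mean is the same whether it is computed from monthly or annual values.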
The goal is to answer questions such as: Was the pond’s DO concentration in year X higher, lower, or the same as the concentration in year Y? Is the average DO concentration of the last ten years different from that of the prior decade? The DO concentrations in a pond respond to many inputs of large magnitude, and thus vary considerably. A significance test is needed. The method is to use a t-test comparison of means. Given that the decadal values are the mean of the annual values, and the annual values are the mean of the monthly values, this seems appropriate.
Here’s the question: you can calculate the decadal means and their t-values from the monthly DO values, or from the annual DO values. The mean doesn’t change, of course, but the width of the confidence interval and the t-value do. Because using monthly values gives an N that is an order of magnitude higher, the CI often tightens considerably if you go that route. The same test on the same data can therefore give the opposite conclusion about the statistical significance of an observed difference in the means, depending on whether monthly or annual values are used. What is the proper interpretation of this discrepancy?
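The discrepancy can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not an analysis of real pond data: it simulates two decades of monthly values in which each year shares a common random offset (an assumed form of within-year correlation), then computes a Welch t statistic from the monthly values and again from the annual means. All names, the baseline values, and the standard deviations are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def mean_and_se(values):
    """Sample mean and the naive standard error that assumes independence."""
    return mean(values), stdev(values) / math.sqrt(len(values))

def welch_t(a, b):
    """Welch t statistic for a difference of means, treating values as independent."""
    mean_a, se_a = mean_and_se(a)
    mean_b, se_b = mean_and_se(b)
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)

def simulate_decade(rng, base, years=10, months=12):
    """Synthetic monthly DO: a shared year-level offset plus monthly noise."""
    decade = []
    for _ in range(years):
        year_effect = rng.gauss(0.0, 1.0)  # assumed between-year variability
        decade.append([base + year_effect + rng.gauss(0.0, 0.5)
                       for _ in range(months)])
    return decade

def flatten(decade):
    """All monthly values of a decade as one list (N = 120)."""
    return [v for year in decade for v in year]

def annual(decade):
    """Annual means of a decade (N = 10)."""
    return [mean(year) for year in decade]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
decade_1 = simulate_decade(rng, base=8.0)
decade_2 = simulate_decade(rng, base=7.5)

t_monthly = welch_t(flatten(decade_1), flatten(decade_2))
t_annual = welch_t(annual(decade_1), annual(decade_2))

print("t from monthly values:", t_monthly)
print("t from annual means:  ", t_annual)
```

In this setup the decadal means are identical at both pooling levels, but the naive standard error from the 120 monthly values is much smaller than the one from the 10 annual means, because the monthly values within a year are not independent of one another.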
If you use the monthly results to compute the test statistics for a difference in decadal means, are you running afoul of temporal pseudoreplication? If you use the annual results to calculate the decadal tests, are you sacrificing information and thus pseudoreplicating?
I believe that you are trying to use statistical methods that are appropriate for independent observations when your data are correlated, both temporally and spatially. If you have observations for, say, 5 hours and decide to re-state them as 241 observations taken every minute, you really don’t have 240 degrees of freedom with respect to the mean of those 241 values. Autocorrelation can lead to an overstatement of the effective size of “N” and thus to misleadingly narrow uncertainty statements. What you need to do is find someone, a textbook, or a web site to teach you about time series data and its analysis. One way to start is to Google “help me understand time series” and begin reading. There is a lot of material available on the web. One available trove of time series information is something I helped create at http://www.autobox.com/AFSUniversity/afsuFrameset.htm . I mention this because I am still associated with this firm and its products, so my comments are “biased and opinionated” but not solely self-serving.
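As a rough illustration of how autocorrelation shrinks the effective “N”, here is a minimal sketch using a common large-sample approximation. It assumes the series behaves like a first-order autoregressive (AR(1)) process, in which case the effective number of independent observations for estimating the mean is roughly n(1 - r1)/(1 + r1), where r1 is the lag-1 autocorrelation. The AR(1) assumption, the simulated series, and the function names are all mine, chosen for illustration; this is not a substitute for a proper time series analysis.

```python
import random
from statistics import mean

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a series."""
    m = mean(series)
    dev = [x - m for x in series]
    num = sum(d0 * d1 for d0, d1 in zip(dev, dev[1:]))
    den = sum(d * d for d in dev)
    return num / den

def effective_sample_size(series):
    """Approximate effective N for the mean, assuming an AR(1) process."""
    r1 = lag1_autocorrelation(series)
    return len(series) * (1.0 - r1) / (1.0 + r1)

def simulate_ar1(rng, n, phi, sigma=1.0):
    """Synthetic AR(1) series: x[t] = phi * x[t-1] + Gaussian noise."""
    x = rng.gauss(0.0, sigma)
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sigma)
        out.append(x)
    return out

rng = random.Random(1)  # fixed seed so the sketch is repeatable
series = simulate_ar1(rng, n=241, phi=0.8)  # phi = 0.8 is an assumed value

r1 = lag1_autocorrelation(series)
n_eff = effective_sample_size(series)
print("nominal N:", len(series))
print("lag-1 autocorrelation:", r1)
print("approximate effective N:", n_eff)
```

With strong positive autocorrelation the approximate effective N comes out far below the nominal count of 241, which is why a confidence interval computed from the nominal N is too narrow.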