I am analyzing data from two surveys that I merged together:

- School staff survey, for years 2005-06 and 2007-08
- School students survey, for years 2005-06 through 2008-09

For both of these data sets, I have observations (at the student or staff level) from 3 different school districts, each with a representative sample per year within that district.

For analysis, I combined the student data into two 2-year periods (2005-07 and 2007-09). Then I ‘ddply’-ed each data set to obtain the percentages of staff or students that responded to questions according to cutoffs (e.g., whether they answered in the affirmative, “Agreed”, or whether the student marked that they used alcohol, etc.). So when I merged the staff- and student-level data sets together, the school became the unit of analysis, and I only have 1 observation per school per 2-year time period (provided the school wasn’t missing data for that time period).
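For illustration only, the aggregation step described above could be sketched in Python roughly as follows (the original work used R's `ddply`). The record layout, the school labels, and the `percent_affirmative` helper are hypothetical, not taken from the actual survey files.

```python
from collections import defaultdict

# Hypothetical respondent-level records: (school, period, answered_affirmative).
records = [
    ("A", "2005-07", True),
    ("A", "2005-07", False),
    ("A", "2005-07", True),
    ("A", "2007-09", True),
    ("B", "2005-07", False),
    ("B", "2005-07", False),
]


def percent_affirmative(rows):
    """Return {(school, period): (percent answering yes, number of respondents)}."""
    counts = defaultdict(lambda: [0, 0])  # [yes count, total count]
    for school, period, yes in rows:
        counts[(school, period)][0] += int(yes)
        counts[(school, period)][1] += 1
    return {key: (100.0 * yes / n, n) for key, (yes, n) in counts.items()}


school_level = percent_affirmative(records)
```

Keeping the respondent count next to each percentage costs nothing at this stage and is useful later if the school-level values are to be weighted.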

My goal is to estimate associations between staff and student responses. So far, my plan was to obtain Pearson correlation coefficients between all the variables (as they’re all continuous responses representing percentages) for each school district separately (as this avoids having to assume that results generalize across the districts in this data set). To do this, I would average each school’s data over the two time periods to get just one observation per school.
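As a rough sketch of this plan, assuming a school that is missing one period is simply averaged over the periods it does have (an assumption, not something stated above), the per-district correlations could be computed like this. All rows, district labels, and function names are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical school-level rows: (district, school, period, staff_pct, student_pct).
rows = [
    ("D1", "A", "2005-07", 60.0, 20.0),
    ("D1", "A", "2007-09", 70.0, 30.0),
    ("D1", "B", "2005-07", 40.0, 10.0),
    ("D1", "C", "2005-07", 80.0, 35.0),
    ("D1", "C", "2007-09", 90.0, 45.0),
    ("D2", "X", "2005-07", 50.0, 30.0),
    ("D2", "Y", "2005-07", 60.0, 20.0),
    ("D2", "Z", "2007-09", 70.0, 10.0),
]


def pearson(x, y):
    """Plain (unweighted) Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def district_correlations(data):
    """Average each school over its available periods, then correlate per district."""
    per_school = defaultdict(list)
    for district, school, _period, staff, student in data:
        per_school[(district, school)].append((staff, student))

    per_district = defaultdict(list)
    for (district, _school), obs in per_school.items():
        staff_mean = sum(s for s, _ in obs) / len(obs)
        student_mean = sum(t for _, t in obs) / len(obs)
        per_district[district].append((staff_mean, student_mean))

    return {
        d: pearson([p[0] for p in pts], [p[1] for p in pts])
        for d, pts in per_district.items()
    }


result = district_correlations(rows)
```

With only three schools per district the toy numbers mean nothing; the point is just the order of operations (average within school first, then correlate within district).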

Questions:

- Is this an appropriate analysis plan? Is there some other method I may use that could provide me better inference or power?
- If my plan is appropriate, should I obtain weighted correlations based on each school’s enrollment (as there are more small schools than large ones, and they would contribute disproportionately to the correlation coefficients)?

I have asked the data administrator about this, and he mentioned that the main factors that determine whether I need to weight my data are whether or not I think school size affects the degree of correlation and whether my interpretation will be at the student or school level. I think my interpretation will be at the school level (e.g., “a school with this percentage of staff answering this way is correlated to this percentage of students responding this way…”).

**Answer**

I imagine this is history by now, but just in case…

1) Yes, this seems appropriate. Your research question must be “are teacher attitudes/behaviours at a school related to student attitudes/behaviours at that school?” If this is your question, a school is the appropriate unit of analysis (and there would be no way to match up individual teachers to students anyway).

I would just add caveats on the use of Pearson’s correlation coefficient, unrelated to the question of the unit of analysis or sampling strategy. The correlation coefficient cannot pick up non-linear relationships, can be misleading to interpret, is easily distorted by a few outliers, and classical inference based on it depends on Normality (which won’t hold exactly with your proportion data, although it may be a reasonable approximation). At a minimum I would carefully use graphical methods to check that this is a sensible approach and there is not a better way of inferring the relationship between the two variables.
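To make the outlier point concrete, here is a small Python sketch on made-up percentages. It compares Pearson's coefficient with a rank-based (Spearman) coefficient; the Spearman comparison is brought in purely for illustration and is not a substitute for actually plotting the data.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical school-level percentages with a perfectly negative relationship.
staff = [10.0, 20.0, 30.0, 40.0, 50.0]
student = [50.0, 40.0, 30.0, 20.0, 10.0]

# The same schools plus one extreme school that is high on both measures.
staff_out = staff + [95.0]
student_out = student + [95.0]

r_clean = pearson(staff, student)
r_outlier = pearson(staff_out, student_out)
rho_outlier = spearman(staff_out, student_out)
```

In this toy example a single extreme school is enough to turn a perfectly negative Pearson correlation into a clearly positive one, while the rank-based coefficient stays negative. A scatterplot would show the same thing immediately.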

2) I don’t think you *need* to weight the data but I would certainly try it (and hope it doesn’t change the results). But I would weight by your *sample size* in the school, not by the enrollment size. The reason would be about estimation rather than either your unit of analysis or any need to “weight to population”. You only have an estimate of the true teacher and student responses in each school, drawing on your finite sample. You can be more confident in the estimates for schools where you had a larger sample, so it would be good if those schools were given more weight when fitting your correlation or linear regression.
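A minimal sketch of a sample-size-weighted Pearson correlation, in Python, might look like the following. The data are invented, and using a single weight per school (here a hypothetical count of respondents) is an assumption; how best to combine the staff and student sample sizes into one weight is left open.

```python
from math import sqrt


def weighted_pearson(x, y, w):
    """Pearson correlation where each (x, y) pair counts in proportion to w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    return cov / sqrt(var_x * var_y)


# Hypothetical school-level percentages and respondents sampled per school.
staff_pct = [55.0, 60.0, 70.0, 20.0]
student_pct = [20.0, 25.0, 30.0, 60.0]
n_sampled = [200, 180, 220, 5]  # the last school has a very small sample

r_unweighted = weighted_pearson(staff_pct, student_pct, [1, 1, 1, 1])
r_weighted = weighted_pearson(staff_pct, student_pct, n_sampled)
```

Comparing `r_unweighted` with `r_weighted` is the "try it and hope it doesn’t change the results" check: if the two differ a lot, the schools with small samples are driving the unweighted estimate.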

**Attribution**

*Source: Link, Question Author: Iris Tsui, Answer Author: Peter Ellis*