I have a data set of 11,000+ distinct items, each of which was classified on a nominal scale by at least 3 different raters on Amazon’s Mechanical Turk.
88 different raters provided judgments for the task, and no one rater completed more than about 800 judgments. Most provided significantly fewer than that.
My question is this:
I would like to calculate some measure of inter-rater reliability for the ratings, something better than simply looking at percent agreement. I believe, however, that Fleiss' kappa, which is the measure I know best, requires that the same group of raters judge every item, and so I cannot use Fleiss' kappa with my data. Is this correct? Is there another method I could use?
Any advice would be much appreciated!
Check out Krippendorff's alpha. It has several advantages over measures such as Cohen's kappa, Fleiss' kappa, and Cronbach's alpha: it is robust to missing data (which I gather is your main concern); it can handle more than two raters; it accommodates different types of scales (nominal, ordinal, etc.); and it corrects for chance agreement better than some other measures, such as Cohen's kappa.
Calculation of Krippendorff's alpha is supported by several statistical software packages, including R (via the irr package), SPSS, and others.
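If you want to see what those packages are computing, the nominal-data version of alpha is short enough to sketch by hand. Below is a minimal Python illustration (function name and data layout are my own, not from any package): each item contributes only the ratings it actually received, so missing judgments are handled naturally, which is exactly why alpha fits your design of rotating MTurk raters.

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data with missing ratings.

    units: list of lists; each inner list holds the labels assigned to
    one item by however many raters judged it. Items rated fewer than
    twice are skipped, since they contain no pairable judgments.
    """
    o = Counter()  # coincidence matrix: o[(c, k)] = coincidences of values c, k
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # unpaired items carry no reliability information
        for i, c in enumerate(ratings):
            for j, k in enumerate(ratings):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)

    n_c = Counter()  # marginal totals per value
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())

    # Observed disagreement: off-diagonal mass of the coincidence matrix.
    d_o = sum(v for (c, k), v in o.items() if c != k)
    # Expected disagreement under chance pairing of values.
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_o / d_e
```

For example, `krippendorff_alpha_nominal([['a', 'a'], ['b', 'b', 'b']])` gives 1.0 (perfect agreement), and items with only one rating are simply ignored. For 11,000+ items you would, of course, use a vetted implementation such as `irr::kripp.alpha` in R rather than this sketch.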
Below are some relevant papers that discuss Krippendorff's alpha, including its properties and implementation, and compare it with other measures:
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77-89.
Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411-433. doi: 10.1111/j.1468-2958.2004.tb00738.x
Chapter 3 in Krippendorff, K. (2013). Content Analysis: An Introduction to Its Methodology (3rd ed.). Sage.
There are some additional technical papers on Krippendorff's website.