Is there a statistical reason why item analysis/item response theory isn’t more widely applied? For instance, suppose a teacher gives a 25-question multiple-choice test and finds that 10 questions were answered correctly by everyone, 10 were answered correctly by a very small fraction (say 10%), and the remaining 5 were answered correctly by roughly 50% of people. Doesn’t it make sense to reweight the scores so that hard questions are given more weight?
And yet, in the real world, tests almost always have all questions weighted equally. Why?
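To make the reweighting idea concrete, here is a minimal sketch. It assumes one simple scheme that is not taken from any reference: each item gets weight 1 - p, where p is the fraction of examinees who answered it correctly. The tiny response matrix and the function names are hypothetical and only for illustration.

```python
def item_difficulty(responses):
    """Fraction of examinees answering each item correctly (the classical p-value)."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees for j in range(n_items)]


def reweighted_scores(responses):
    """Score each examinee with weight (1 - p) per item (an assumed scheme)."""
    weights = [1.0 - p for p in item_difficulty(responses)]
    return [sum(w * x for w, x in zip(weights, row)) for row in responses]


# Hypothetical data: 4 examinees x 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]

difficulty = item_difficulty(responses)  # [1.0, 0.5, 0.25]
scores = reweighted_scores(responses)    # [1.25, 0.5, 0.0, 0.0]
```

Under this scheme an item that everyone answers correctly gets weight zero, so it adds nothing to any score and has no effect on how examinees are ranked.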
The link below discusses discrimination indices and other measures of difficulty used to choose which questions are best:
It seems, though, that the discrimination index is only used in a forward-looking way (e.g., if a question doesn’t discriminate well, toss it). Why aren’t tests reweighted for the current population?
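For reference, one common form of the discrimination index is the upper-lower difference: the proportion answering an item correctly in the top-scoring group minus the proportion in the bottom-scoring group. The sketch below assumes that form. The group fraction of 0.5 is chosen only because the hypothetical data set is tiny (27% is a commonly cited choice for real tests), and the function name is illustrative.

```python
def discrimination_index(responses, item, fraction=0.5):
    """Upper-lower discrimination index for one item: p_upper - p_lower."""
    # Rank examinees by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    # Size of the upper and lower groups (at least one examinee each).
    k = max(1, int(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Hypothetical data: 4 examinees x 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]

d = [discrimination_index(responses, j) for j in range(3)]  # [0.0, 1.0, 0.5]
```

The first item, which everyone answered correctly, has an index of zero: it does not separate strong examinees from weak ones at all.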
(You asked whether there is a statistical reason: I doubt it, but I’m only guessing at other reasons.) Would there be cries of “moving the goalposts”? Students usually like to know, when taking a test, just how much each item is worth. They might be justified in complaining upon seeing, for example, that some of their hard-won answers didn’t end up counting for much.
Many teachers and professors use unsystematic, subjective criteria for scoring tests. But those who do use systems are probably wary about opening those systems up to specific criticism, something they can largely avoid by hiding behind more subjective approaches. That might explain why item analysis and IRT are not used more widely than they are.