Are there any straightforward methods of transforming ordinal level data into interval level (just as there are for doing it the other way round)? And performable in Excel or SPSS?
Having the data, say:
10 questions on the ordinal level (say 0-5 scale, where 0=“not at all”, 5=“all the time”),
I want to transform them so that they can be treated as proper interval-level data for parametric testing purposes (normal distribution assumed; non-parametric tests are out of the question).
Would be extremely grateful for answers!
This response will discuss possible models from a measurement perspective, where we are given a set of observed (manifest) interrelated variables, or measures, whose shared variance is assumed to measure a well-identified but not directly observable construct (generally, in a reflective manner), which will be considered as a latent variable.
If you are unfamiliar with latent trait measurement models, I would recommend the following two articles: The attack of the psychometricians, by Denny Borsboom, and Latent Variable Modelling: A Survey, by Anders Skrondal and Sophia Rabe-Hesketh. I will first make a slight digression with binary indicators before dealing with items with multiple response categories.
One way to transform ordinal-level data into an interval scale is to use some kind of Item Response model. A well-known example is the Rasch model, which extends the idea of the parallel test model from classical test theory to cope with binary-scored items. In some of the ‘modern’ software implementations it is fitted as a generalized mixed-effects linear model with a logit link, where the probability of endorsing a given item is a function of ‘item difficulty’ and ‘person ability’. This assumes that there is no interaction between one’s location on the latent trait being measured and item location on the same logit scale (which could be captured through an additional item discrimination parameter), and no interaction with individual-specific characteristics (which is called differential item functioning).
The underlying construct is assumed to be unidimensional, and the logic of the Rasch model is just that the respondent has a certain ‘amount of the construct’ (let’s talk about the subject’s liability, his/her ‘ability’, and call it θ), as does any item that defines this construct (its ‘difficulty’). What is of interest is the difference between respondent location and item location on the measurement scale, θ. To give a concrete example, consider the following question: “I found it hard to focus on anything other than my anxiety” (yes/no). A person suffering from anxiety disorders is more likely to answer positively to this question compared to a random individual taken from the general population and having no past history of depression or anxiety-related disorder.
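In symbols, the Rasch model described above can be written as follows. The notation is my own choice for illustration: X_ij is the binary response of person i to item j, θ_i is the person location and β_j is the item difficulty, both on the same logit scale.

```latex
\[
P(X_{ij} = 1 \mid \theta_i, \beta_j)
  = \frac{\exp(\theta_i - \beta_j)}{1 + \exp(\theta_i - \beta_j)},
\qquad
\operatorname{logit} P(X_{ij} = 1 \mid \theta_i, \beta_j) = \theta_i - \beta_j .
\]
```

Only the difference θ_i − β_j enters the model, which is why person and item locations can be read on a common scale.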
An illustration of 29 item response curves derived from a large-scale US study that aims to build a calibrated item bank assessing anxiety-related disorders (1,2) is shown below. The sample size is N=766; exploratory factor analysis confirmed the unidimensionality of the scale (the first eigenvalue was about 17 times larger than the second, and the second factor axis was unreliable, with an eigenvalue just above 1, as confirmed by parallel analysis), and this scale shows a reliability index in the acceptable range, as assessed by Cronbach’s alpha (α=0.971, with 95% bootstrap CI [0.967;0.975]). Initially, five response categories were proposed (1 = ‘Never’, 2 = ‘Rarely’, 3 = ‘Sometimes’, 4 = ‘Often’, and 5 = ‘Always’) for each item. We will here only consider binary-scored responses.
(Here, responses to Likert-type items have been recoded as binary responses (1/2=0, 3-5=1), and we consider that each item is equally discriminative across individuals, hence the parallelism between item curve slopes (Rasch model).)
As can be seen, people located toward the right of the x-axis, which reflects the latent trait (anxiety), are thought to express more of this trait and are more likely to answer positively to questions like “I felt terrified” (terrific) or “I had sudden feelings of panic” (panic) than people located to the left (normal population, unlikely to be considered as cases). On the other hand, it is not unlikely that someone from the general population would report having trouble falling asleep (sleeping): for someone located in the intermediate range of the latent trait, say at 0 logit, the probability of scoring 3 or higher is about 0.5 (the location at which this probability equals 0.5 is the item difficulty).
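To make the recoding and the reading of the Rasch curves concrete, here is a minimal Python sketch. The item difficulties are made-up values chosen for illustration only (they are not estimates from the study above), and the function names are mine.

```python
import math

def dichotomize(response, cutoff=3):
    """Recode a 1-5 Likert response as binary: 1/2 -> 0, 3-5 -> 1."""
    return 1 if response >= cutoff else 0

def rasch_prob(theta, difficulty):
    """Probability of a positive (recoded) response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Recoding of the five original response categories.
recoded = [dichotomize(r) for r in [1, 2, 3, 4, 5]]

# Hypothetical item difficulties on the logit scale (assumed, not estimated).
difficulties = {"sleeping": 0.0, "panic": 2.0}

# A person located at 0 logit on the latent trait.
theta = 0.0
p_sleeping = rasch_prob(theta, difficulties["sleeping"])
p_panic = rasch_prob(theta, difficulties["panic"])

print(recoded)
print(round(p_sleeping, 3), round(p_panic, 3))
```

When the person location equals the item difficulty, the probability is exactly 0.5; items located further to the right are endorsed less often by the same person.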
For polytomous items with ordered categories, there are several choices: the partial credit model, the rating scale model, or the graded response model, to name but a few that are mostly used in applied research. The first two belong to the so-called “Rasch family” of IRT models and share the following properties: (a) monotonicity of the response probability function (item/category response curve), (b) sufficiency of total individual score (with latent parameter considered as fixed), (c) local independence meaning that responses to items are independent, conditional on the latent trait, and (d) absence of differential item functioning meaning that, conditional on the latent trait, responses are independent of external individual-specific variables (e.g., gender, age, ethnicity, SES).
Extending the previous example to the case where the five response categories are effectively accounted for, a patient will have a higher probability of choosing response categories 3 to 5 than someone sampled from the general population without any history of anxiety-related disorders. Compared to the modeling of dichotomous items described above, these models consider either cumulative thresholds (e.g., odds of answering 3 or higher vs. 2 or less) or adjacent-category thresholds (odds of answering 3 vs. 2), which is also discussed in Agresti’s Categorical Data Analysis (chapter 12). The main difference between the aforementioned models lies in the way transitions from one response category to the other are handled: the partial credit model does not assume that the difference between any given threshold location and the mean of the threshold locations on the latent trait is equal or uniform across items, contrary to the rating scale model. Another subtle difference between those models is that some of them (like the unconstrained graded response or partial credit model) allow for unequal discrimination parameters between items. See Applying item response theory modeling for evaluating questionnaire item and scale properties, by Reeve and Fayers, or The Basics of Item Response Theory, by Frank B. Baker, for more details.
Because in the preceding case we discussed the interpretation of response probability curves for dichotomously scored items, let’s look at item response curves derived from a graded response model, highlighting the same target items:
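The distinction between cumulative and adjacent-category thresholds can be sketched in a few lines of Python. This is only an illustration under assumed parameter values: the thresholds, the discrimination value and the function names are hypothetical, and real analyses would estimate them with dedicated IRT software.

```python
import math

def grm_probs(theta, a, thresholds):
    """Graded response model (cumulative logits).

    P(X >= k) = logistic(a * (theta - b_k)) for ordered thresholds b_k;
    category probabilities are differences of adjacent cumulative terms.
    """
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def pcm_probs(theta, deltas):
    """Partial credit model (adjacent-category logits).

    log(P(X = k) / P(X = k - 1)) = theta - delta_k.
    """
    exponents = [0.0]
    for d in deltas:
        exponents.append(exponents[-1] + (theta - d))
    denom = sum(math.exp(e) for e in exponents)
    return [math.exp(e) / denom for e in exponents]

# Hypothetical parameters for one five-category item.
theta = 0.5
steps = [-1.5, -0.5, 0.5, 1.5]

grm = grm_probs(theta, a=1.7, thresholds=steps)
pcm = pcm_probs(theta, deltas=steps)

print([round(p, 3) for p in grm])
print([round(p, 3) for p in pcm])
```

In both cases the five category probabilities sum to one; what differs is whether each threshold separates "this category or higher" from everything below it (graded response model) or only two neighbouring categories (partial credit model).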
(Unconstrained graded response model, allowing for unequal discrimination among items.)
Here, the following observations deserve some consideration:
- Response categories for the ‘sleeping’ item are less discriminative than, say, the ones attached to ‘terrific’: in the case of ‘sleeping’, for two persons located at the two extrema of the interval [2;2.5] on the latent trait (in logit units), their probability of choosing the fourth response (“often had difficulty sleeping”) goes from approx. 0.35 to 0.4; with ‘terrific’, that probability goes from less than 0.1 to about 0.25 (dashed blue line). If you want to discriminate between two patients showing signs of anxiety, the latter item is more informative.
- There is an overall shift, from the left to the right, between items assessing sleep quality and those assessing more severe conditions, although sleeping disorders are not uncommon. This is expected: after all, even people in the general population might experience some difficulty falling asleep, independent of their health state, and people severely depressed or anxious are likely to exhibit such problems. However, ‘normal persons’ (if this ever had any meaning) are unlikely to show some signs of panic disorder (the probability of choosing the highest response category is zero for people located up to, or even beyond, the intermediate range of the latent trait, [0;1]).
In both cases discussed above, this θ scale, which reflects individual liability on the assumed latent trait, has the properties of an interval scale.
Besides being thought of as true measurement models, what makes Rasch models attractive is that sum scores, as a sufficient statistic, can be used as surrogates for the latent scores. Moreover, the sufficiency property readily implies the separability of model (persons and items) parameters (in the case of polytomous items, one should not forget that everything applies at the level of item response category), hence conjoint additivity.
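The sufficiency of the sum score can be checked numerically. The sketch below, in Python, uses three binary items with made-up difficulties (an assumption for illustration) and shows that, under the Rasch model, the probability of a response pattern given the total score does not depend on θ.

```python
import math
from itertools import product

def rasch_prob(theta, difficulty):
    """Probability of a positive response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def pattern_prob(pattern, theta, difficulties):
    """Probability of a full response pattern, assuming local independence."""
    p = 1.0
    for x, b in zip(pattern, difficulties):
        q = rasch_prob(theta, b)
        p *= q if x == 1 else 1.0 - q
    return p

def conditional_on_score(pattern, theta, difficulties):
    """P(pattern | total score), summing over all patterns with that score."""
    score = sum(pattern)
    same_score = [pat for pat in product([0, 1], repeat=len(difficulties))
                  if sum(pat) == score]
    total = sum(pattern_prob(pat, theta, difficulties) for pat in same_score)
    return pattern_prob(pattern, theta, difficulties) / total

difficulties = [-1.0, 0.0, 1.5]  # hypothetical item difficulties
pattern = (1, 0, 1)              # total score of 2

cond_low = conditional_on_score(pattern, -2.0, difficulties)
cond_high = conditional_on_score(pattern, 2.0, difficulties)
print(cond_low, cond_high)
```

The two conditional probabilities agree up to floating-point error although the two persons are far apart on the latent trait: once the total score is known, the pattern carries no further information about θ. This does not hold once items have unequal discrimination parameters.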
A good review of the IRT model hierarchy, with R implementation, is available in Mair and Hatzinger’s article published in the Journal of Statistical Software: Extended Rasch Modeling: The eRm Package for the Application of IRT Models in R. Other models include log-linear models, non-parametric models, like the Mokken model, or graphical models.
Apart from R, I am not aware of Excel implementations, but several statistical packages were proposed on this thread: How to get started with applying item response theory and what software to use?
Finally, if you want to study the relationships between a set of items and a response variable without resorting to a measurement model, some form of variable quantization through optimal scaling can be interesting too. Apart from R implementations discussed in those threads, SPSS solutions were also proposed on related threads.
- Pilkonis, P., Choi, S., Reise, S., Stover, A. and Riley, W. et al. (2011). Item banks for measuring emotional distress from the patient-reported outcomes measurement information system (PROMIS): Depression, anxiety, and anger. Assessment, 18(3), 263–283.
- Choi, S., Gibbons, L. and Crane, P. (2011). lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8).