Let’s say I have some self-report items measured on a 5-point Likert scale (Strongly Disagree to Strongly Agree) and other items measured on a 4-point Likert scale (Never, Rarely, Sometimes, Often). Can anyone point me to literature (or practical advice) on combining these items into a composite scale? Let’s assume for the sake of argument that we have some empirical evidence that items should be combined.
1. Sum raw scores
Con: Max responses on the 5-point scale (originally 4’s) drive up the total scale score more than max responses on the 4-point scale (originally 3’s).
2. Rescale and sum
Put all of the items on a 0-1 scale and sum. So the 4-point items (0,1,2,3) would be multiplied by 1/3 and the 5-point items (0,1,2,3,4) would be multiplied by 1/4, resulting in possible values of (0,.33,.67,1) and (0,.25,.50,.75,1), respectively. This way max responses on the 5-point scale (originally 4’s) would not drive up the total scale score more than max responses on the 4-point scale (originally 3’s).
Pro: Items would have equal weighting (which could be a con, depending on your perspective).
Con: Ignores differences in variability between items on different metrics?
3. Standardize and sum
A related approach would be to standardize all of the items (z score) and then sum.
Pro: Addresses differences in variability between items on different metrics
Con: Total scale score becomes less interpretable, and it is sample-specific. The latter makes it hard to benchmark as a measure to be used in other settings/other samples.
4. PCA or other data reduction
4a. EFA to get factor loadings. Multiply scaled items by factor loadings.
4b. PCA to get score of first principal component.
Pro: Items weighted by influence.
Con: Same as #2. EFA-derived scores could vary a lot depending on rotation/extraction choices. Some would advise against these methods for ordinal data.
Overall: I like #2 because it seems easier to compare results across different samples. Thoughts? Alternative ideas or concerns about the ideas presented?
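To make options 2 and 3 concrete, here is a minimal sketch in base R. The two response vectors and all variable names are hypothetical toy data, not taken from any real scale; items are assumed to be coded from 0, as in the description above.

```r
# Hypothetical toy responses, coded from 0 as in the options above.
agree <- c(0, 2, 4, 3, 1)  # 5-point agreement item, possible values 0..4
freq  <- c(0, 1, 3, 2, 3)  # 4-point frequency item, possible values 0..3

# Option 2: divide each item by its maximum possible score so both run 0..1,
# then sum. The result does not depend on the sample at hand.
agree01 <- agree / 4
freq01  <- freq / 3
composite_rescaled <- agree01 + freq01

# Option 3: z-score each item within this sample, then sum.
# The means and SDs used here are sample-specific.
composite_z <- as.numeric(scale(agree)) + as.numeric(scale(freq))

print(composite_rescaled)
print(composite_z)
```

The rescaled composite is bounded by the number of items (here 0 to 2) in any sample, whereas the z-score composite is centered at 0 only within the sample used to compute it.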
This is a great question!
I think that in scale construction, there’s a delicate balance between interpretability and psychometric considerations. Specifically, a scale sum or average is much easier to grasp than a sum or average taken of standardized or otherwise re-scaled items.
However, there can be a somewhat subtle psychometric reason for re-scaling items prior to creating your scale composite (i.e., taking a sum or average). If your items have radically different standard deviations, the reliability of your composite scale will be decreased simply because of these differing standard deviations.
One way to understand this intuitively is to realize that, as you point out, items with widely varying standard deviations are assigned different weights in the composite. So, measurement error in the item with the greater standard deviation will tend to dominate the scale composite. In effect, having widely varying standard deviations reduces the very benefit that you’re trying to accrue by averaging together multiple items (i.e., normally, averaging together multiple items reduces the impact of measurement error from any one of the component items).
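The same intuition can be read off the usual formula for raw Cronbach's alpha. The notation is my own shorthand: $k$ items, item variances $\sigma_i^2$, item covariances $\sigma_{ij}$, and composite (sum score) variance $\sigma_X^2$.

$$
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),
\qquad
\sigma_X^2 = \sum_{i=1}^{k}\sigma_i^2 + 2\sum_{i<j}\sigma_{ij}
$$

Multiplying a single item $m$ by a constant $c > 0$ turns its variance into $c^2\sigma_m^2$ and its covariances into $c\,\sigma_{mj}$. As $c$ grows, the $c^2\sigma_m^2$ term dominates both $\sum_i \sigma_i^2$ and $\sigma_X^2$, so their ratio approaches 1 and raw alpha approaches 0, even though every inter-item correlation is unchanged.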
I have created a demonstration of the effects of a single dominant item in some simulated data below. Here I create five correlated items and find the reliability (measured with Cronbach’s alpha) of the resultant scale.
```r
library(psych)

# Create data
set.seed(13105)
item1 <- round(rnorm(100, sd = 3), digits = 0)
item2 <- round(item1 + rnorm(100, sd = 1), digits = 0)
item3 <- round(item1 + rnorm(100, sd = 1), digits = 0)
item4 <- round(item1 + rnorm(100, sd = 1), digits = 0)
item5 <- round(item1 + rnorm(100, sd = 1), digits = 0)
d <- data.frame(item1, item2, item3, item4, item5)

# Cronbach's alpha
alpha(d)
```

```
Reliability analysis
Call: alpha(x = d)

  raw_alpha std.alpha G6(smc) average_r  mean  sd
       0.97      0.97    0.97      0.87 -0.14 2.5

 Reliability if an item is dropped:
      raw_alpha std.alpha G6(smc) average_r
item1      0.96      0.96    0.94      0.84
item2      0.97      0.97    0.96      0.88
item3      0.97      0.97    0.96      0.89
item4      0.97      0.97    0.96      0.88
item5      0.96      0.97    0.96      0.87

 Item statistics
        n    r r.cor r.drop  mean  sd
item1 100 0.98  0.99   0.97 -0.10 2.5
item2 100 0.94  0.92   0.90 -0.27 2.8
item3 100 0.93  0.91   0.89 -0.09 2.7
item4 100 0.94  0.92   0.91 -0.19 2.6
item5 100 0.94  0.93   0.91 -0.06 2.7
```
And here I change the standard deviation of `item2` by multiplying the item by 5. Note the dramatic drop in Cronbach’s alpha due to this procedure. Also note that multiplying an item by a positive constant does not affect the correlation matrix constructed with these five items in the slightest. The only thing that I have done by multiplying `item2` by 5 is change the scale on which `item2` is measured, and yet changing this scale greatly impacts the reliability of the composite.
```r
# Re-scale item 2 to have a much larger standard deviation than the other items
d$item2 <- d$item2 * 5

# Cronbach's alpha
alpha(d)
```

```
Reliability analysis
Call: alpha(x = d)

  raw_alpha std.alpha G6(smc) average_r  mean  sd
       0.74      0.97    0.97      0.87 -0.36 4.7

 Reliability if an item is dropped:
      raw_alpha std.alpha G6(smc) average_r
item1      0.68      0.96    0.94      0.84
item2      0.97      0.97    0.96      0.88
item3      0.69      0.97    0.96      0.89
item4      0.68      0.97    0.96      0.88
item5      0.68      0.97    0.96      0.87

 Item statistics
        n    r r.cor r.drop  mean   sd
item1 100 0.98  0.99   0.96 -0.10  2.5
item2 100 0.94  0.92   0.90 -1.35 13.9
item3 100 0.93  0.91   0.86 -0.09  2.7
item4 100 0.94  0.92   0.89 -0.19  2.6
item5 100 0.94  0.93   0.90 -0.06  2.7
```
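One way to check this argument without relying on `psych` is to compute raw alpha directly from the item covariance matrix and then standardize the items before compositing. The sketch below is base R only; `cronbach_alpha` is a hypothetical helper written for this illustration (not a function from any package), and the data are regenerated with the same recipe as above so that the block runs on its own.

```r
# Hypothetical helper: raw Cronbach's alpha from the item covariance matrix.
cronbach_alpha <- function(items) {
  k <- ncol(items)
  v <- var(items)  # k x k covariance matrix of the items
  (k / (k - 1)) * (1 - sum(diag(v)) / sum(v))
}

# Same simulation recipe as in the demonstration above.
set.seed(13105)
item1 <- round(rnorm(100, sd = 3))
item2 <- round(item1 + rnorm(100, sd = 1))
item3 <- round(item1 + rnorm(100, sd = 1))
item4 <- round(item1 + rnorm(100, sd = 1))
item5 <- round(item1 + rnorm(100, sd = 1))
d <- data.frame(item1, item2, item3, item4, item5)

# Inflate the scale of item2 only.
d_scaled <- d
d_scaled$item2 <- d_scaled$item2 * 5

# z-score every item, which puts all items back on a common metric.
d_z <- as.data.frame(scale(d_scaled))

alpha_original     <- cronbach_alpha(d)
alpha_rescaled     <- cronbach_alpha(d_scaled)
alpha_standardized <- cronbach_alpha(d_z)

# The correlation matrix is untouched by the rescaling.
same_correlations <- isTRUE(all.equal(cor(d), cor(d_scaled)))

print(c(original = alpha_original,
        rescaled = alpha_rescaled,
        standardized = alpha_standardized))
print(same_correlations)
```

Because standardized items all have variance 1, raw alpha on `d_z` coincides with the standardized alpha (`std.alpha` in the output above), which depends only on the correlations and so is unaffected by the scale of `item2`.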