I am using Principal Component Analysis (PCA) to create an index required for my research. My question is how I should create a single index by using the retained principal components calculated through PCA.

For instance, I decided to retain 3 principal components after using PCA and I computed scores for these 3 principal components. What is the appropriate way to create, for each respondent, a single index out of these 3 scores?

- Is it relevant to add the 3 computed scores to have a composite value?
- Or to average the 3 scores to have such a value?
- Or should I just keep the first principal component (the strongest) only and use its score as the index?

Alternatively, one could use Factor Analysis (FA), but the same question remains: how to create a single index based on several factor scores?

**Answer**

This answer is deliberately non-mathematical and is aimed at a non-statistician (a psychologist, say) who asks whether they may sum or average the scores of different factors to obtain a “composite index” score for each respondent.

**Summing or averaging** some variables’ scores assumes that the variables belong to the same dimension and are fungible measures. (In the question, “variables” are **component** or **factor scores**, which changes nothing, since they are just another kind of variable.)

Indeed (Fig. 1), when $X$ and $Y$ correlate strongly, respondents 1 and 2 may be seen as equally atypical (i.e. equally deviated from 0, the locus of the data centre or the scale origin), both having the same mean score: $(.8+.8)/2=.8$ and $(1.2+.4)/2=.8$. The value $.8$ is as valid a measure of atypicality for the construct $X+Y$ as it was for $X$ and $Y$ separately. Correlated variables, representing one and the same dimension, can be seen as repeated measurements of the same characteristic, and the difference or non-equivalence of their scores as random error. It is therefore warranted to sum or average the scores, since random errors are expected to cancel each other out *in spe*.

That is not so if $X$ and $Y$ do not correlate enough to be seen as the same “dimension”. In that case, the deviation/atypicality of a respondent is conveyed by the Euclidean distance from the origin (Fig. 2).

That distance is different for respondents 1 and 2: $\sqrt{.8^2+.8^2} \approx 1.13$ and $\sqrt{1.2^2+.4^2} \approx 1.26$, respondent 2 being farther away. If the variables are independent dimensions, Euclidean distance still conveys a respondent’s position with respect to the zero benchmark, but the mean score does not. Take an extreme example with $X=.8$ and $Y=-.8$. From the “point of view” of the mean score, this respondent is absolutely typical, just like $X=0$, $Y=0$. Is that true for you?
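To make the contrast concrete, here is a minimal Python sketch that recomputes the mean score and the Euclidean distance for the three score pairs mentioned above. The function and variable names are my own, chosen only for illustration.

```python
import math

# (X, Y) scores taken from the text; the third pair is the "extreme" example.
respondents = {
    "respondent 1": (0.8, 0.8),
    "respondent 2": (1.2, 0.4),
    "extreme case": (0.8, -0.8),
}

def mean_score(scores):
    """Plain average of the scores."""
    return sum(scores) / len(scores)

def euclidean_distance(scores):
    """Distance of the score vector from the origin (0, 0, ...)."""
    return math.sqrt(sum(s * s for s in scores))

# name -> (mean, distance)
results = {name: (mean_score(s), euclidean_distance(s))
           for name, s in respondents.items()}

for name, (m, d) in results.items():
    print(f"{name}: mean = {m:.2f}, distance = {d:.2f}")
```

The mean cannot tell respondents 1 and 2 apart and puts the extreme case at zero, while the distance separates all of them from the origin.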

Another answer here mentions a weighted sum or average, i.e. $w_XX_i+w_YY_i$ with some reasonable weights, for example (if $X$, $Y$ are principal components) proportional to the component standard deviation or variance. But such weighting changes nothing in principle; it only stretches and squeezes the circle in Fig. 2 along the axes into an ellipse. The weights $w_X$, $w_Y$ are constant across all respondents $i$, which is the cause of the flaw. To convey a respondent’s bivariate deviation, in a circle or an ellipse, weights that depend on his scores must be introduced; the Euclidean distance considered earlier *is* actually an example of such a weighted sum with weights dependent on the values. And *if* it is important for you to incorporate the unequal variances of the variables (e.g. of the principal components, as in the question), you may compute the weighted Euclidean distance, the distance that will be found in Fig. 2 after the circle becomes elongated.
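A rough sketch of both points, under assumptions of my own: the weights `(0.7, 0.3)` are hypothetical (think of each component's share of the retained variance), and the weighted distance is taken as $\sqrt{w_XX^2+w_YY^2}$, which is one common definition rather than the only one.

```python
import math

def weighted_euclidean(scores, weights):
    """sqrt(sum_j w_j * x_j**2): distance after each axis is stretched by sqrt(w_j)."""
    return math.sqrt(sum(w * s * s for s, w in zip(scores, weights)))

# Hypothetical weights, e.g. each component's share of the retained variance.
weights = (0.7, 0.3)

d1 = weighted_euclidean((0.8, 0.8), weights)   # respondent 1
d2 = weighted_euclidean((1.2, 0.4), weights)   # respondent 2

# Plain Euclidean distance rewritten as a weighted sum whose weights
# depend on the respondent's own scores: w_j = x_j / d.
x = (1.2, 0.4)
d = math.hypot(*x)
score_dependent_weights = [xj / d for xj in x]
as_weighted_sum = sum(w * xj for w, xj in zip(score_dependent_weights, x))

print(d1, d2)
print(d, as_weighted_sum)  # the two values agree up to rounding
```

The last two lines illustrate why fixed weights cannot reproduce a distance: the weights that do so change from respondent to respondent.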

Euclidean distance (weighted or unweighted) as deviation is the most intuitive way to measure the bivariate or multivariate atypicality of respondents. It presupposes uncorrelated (“independent”) variables forming a smooth, isotropic space. Manhattan distance is one of the other options. It views the feature space as consisting of blocks, so only horizontal/vertical, not diagonal, distances are allowed. $|.8|+|.8|=1.6$ and $|1.2|+|.4|=1.6$ give equal Manhattan atypicalities for our two respondents; it is actually the sum of scores, but only when the scores are all positive. In the case of $X=.8$ and $Y=-.8$ the distance is $1.6$ but the sum is $0$.
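The same three score pairs can be checked in a few lines of Python; this is only an illustrative sketch with names of my own choosing.

```python
def manhattan_distance(scores):
    """Sum of absolute scores: city-block distance from the origin."""
    return sum(abs(s) for s in scores)

# (X, Y) pairs from the text.
cases = [(0.8, 0.8), (1.2, 0.4), (0.8, -0.8)]

# (scores, Manhattan distance, plain sum)
table = [(c, manhattan_distance(c), sum(c)) for c in cases]

for scores, dist, total in table:
    print(scores, f"Manhattan = {dist:.1f}", f"sum = {total:.1f}")
```

Distance and sum coincide for the first two pairs and diverge as soon as one score is negative.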

(You might exclaim “I will make all data scores positive and compute the sum (or average) with a clear conscience, since I’ve chosen Manhattan distance”, but please think: do you have the right to move the origin freely? Principal components or factors, for example, are extracted on the condition that the data have been centered on the mean, which makes good sense. Another origin would have produced other components/factors with other scores. No, most of the time you may not play with the origin, the locus of the “typical respondent” or of the “zero-level trait”, as you please.)

**To sum up**, if the aim of the composite construct is to reflect respondents’ positions relative to some “zero” or typical locus but the variables hardly correlate at all, some sort of spatial distance from that origin, and not the mean (or sum), weighted or unweighted, should be chosen.

Well, the mean (sum) will make sense if you decide to view the (uncorrelated) variables as alternative modes of measuring the *same* thing. This way you are deliberately ignoring the variables’ different nature. In other words, you consciously leave Fig. 2 in favour of Fig. 1: you “forget” that the variables are independent. Then do sum or average. For example, scores on “material welfare” and on “emotional welfare” could be averaged, likewise scores on “spatial IQ” and on “verbal IQ”. Composites of this purely *pragmatic*, not statistically sanctioned type are called battery indices (a collection of tests or questionnaires which measure unrelated things, or correlated things whose correlations we ignore, is called a “battery”). Battery indices make sense only if the scores have the same direction (e.g. both wealth and emotional health are seen as the “better” pole). Their usefulness outside narrow ad hoc settings is limited.

If the variables are in an in-between relation, considerably correlated yet not strongly enough to be seen as duplicates or alternatives of each other, we often sum (or average) their values in a weighted manner. These weights should then be carefully designed, and they should reflect, one way or another, the correlations. This is what we do, for example, by means of PCA or factor analysis (FA), where we specifically compute component/factor scores. If your variables are themselves already component or factor scores (as in the OP’s question here) and they are correlated (because of an oblique rotation), you may subject them (or the loading matrix directly) to a second-order PCA/FA to find the weights and obtain the second-order PC/factor that will serve as the “composite index” for you.
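As a rough illustration of the second-order route, here is a NumPy sketch. Everything in it is an assumption made for the example: the factor scores are simulated (three factors sharing a common source, so they correlate), and the second-order step is done as a plain PCA on their correlation matrix, which is only one of several ways to do it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated, hypothetical factor scores: three factors that share a common
# source and therefore correlate (as they may after an oblique rotation).
n = 500
common = rng.normal(size=n)
factor_scores = np.column_stack(
    [common + rng.normal(scale=1.0, size=n) for _ in range(3)]
)

# Standardize the scores, then run a PCA on their correlation matrix.
z = (factor_scores - factor_scores.mean(axis=0)) / factor_scores.std(axis=0, ddof=1)
corr = np.corrcoef(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigenvalues in ascending order

weights = eigenvectors[:, -1]      # eigenvector of the largest eigenvalue
if weights.sum() < 0:              # eigenvector sign is arbitrary; orient it positively
    weights = -weights

composite_index = z @ weights      # second-order PC score for each respondent
explained = eigenvalues[-1] / eigenvalues.sum()

print("weights:", np.round(weights, 3))
print("share of variance carried by the composite:", round(float(explained), 3))
```

The weights here are derived from the correlations among the scores rather than set by hand, which is the point of the paragraph above; the share of variance gives a rough sense of how much is lost by collapsing to one index.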

But if your component/factor scores are uncorrelated or only weakly correlated, there is no *statistical* reason either to sum them bluntly or to sum them via inferred weights. Use some distance instead. The problem with distance is that it is always non-negative: you can say how atypical a respondent is but cannot say whether he is “above” or “below”. But this is the price you have to pay for demanding a single index out of a multi-trait space. If you want both deviation and sign in such a space, I would say you are asking too much.
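For the situation in the question (three retained principal components), a distance-based index might be sketched as follows. The data are random and purely hypothetical, and the PCA is done by hand via SVD so that the example does not depend on any particular statistics package.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 6))         # hypothetical respondents x variables

centered = data - data.mean(axis=0)      # PCA assumes mean-centered data
u, s, vt = np.linalg.svd(centered, full_matrices=False)

k = 3                                    # retained components, as in the question
scores = centered @ vt[:k].T             # raw PC scores (uncorrelated columns)

# Distance on raw scores: stronger components contribute more,
# because each column keeps its own variance.
distance_index = np.sqrt((scores ** 2).sum(axis=1))

# Distance on standardized scores: every component counts equally.
standardized = scores / scores.std(axis=0, ddof=1)
standardized_index = np.sqrt((standardized ** 2).sum(axis=1))

print(distance_index[:5])
print(standardized_index[:5])
```

Either index is non-negative by construction, so it tells you how far a respondent is from the centre but not in which direction.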

In the last point, the OP asks whether it is right to take only the score of the one strongest variable in terms of variance (the 1st principal component in this instance) as the sole proxy for the “index”. It makes sense if that PC is *much* stronger than the remaining PCs. Though one might then ask: “if it is so much stronger, why didn’t you extract/retain it alone?”.
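One simple way to judge “much stronger” is to compare the components’ variances (eigenvalues). The numbers below are hypothetical, and what counts as “much” remains a judgment call; this is only a sketch of the comparison.

```python
# Hypothetical eigenvalues (component variances) from a PCA, largest first.
eigenvalues = [3.1, 1.2, 0.9, 0.5, 0.3]

total = sum(eigenvalues)
shares = [ev / total for ev in eigenvalues]   # proportion of variance per component

first_vs_second = eigenvalues[0] / eigenvalues[1]

print(f"PC1 share of variance: {shares[0]:.2f}")
print(f"PC1 is {first_vs_second:.1f} times as strong as PC2")
```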

**Attribution** *Source: Link, Question Author: user179313, Answer Author: Community*