Interpreting proportions that sum to one as independent variables in linear regression

I’m familiar with categorical variables and the dummy variable coding that lets us treat one level as the baseline so as to avoid collinearity. I’m also familiar with how to interpret parameter estimates from such models: each estimate is the predicted change in the outcome for the given level of the categorical predictor, relative to the baseline category.

What I’m unsure about is how to interpret a set of independent variables that are proportions that sum to one. We again have collinearity if we fit all proportions in the model, so presumably we would have to leave one category out as baseline. I also assume I would look at the type III SS for the overall test of the significance of this variable. However, how do we interpret the parameter estimates for the proportions included in the model relative to the one treated as baseline?

An example: At the zip code level, the independent variable is the proportion of metamorphic, igneous and sedimentary rocks. As you may know, these are the three major rock types, and all rocks are classified as one of these. As such, the proportions across all three sum to 1. The outcome is the average radon level in a respective zip code.

If I were to fit, say, the metamorphic and igneous proportions as predictors in the model, leaving sedimentary as baseline, an overall type III SS F-test of the two fitted levels would signify whether rock type, as a whole, is an important predictor of the outcome (average radon level). Then, I could look at the individual p-values (based on the t distribution) to determine if one or both rock types was significantly different from baseline.

However, when it comes to the parameter estimates, my brain keeps wanting to interpret them purely as the predicted change in the outcome between groups (rock types), and I don’t understand how to incorporate the fact that they’re fit as proportions.

If the \beta estimate for metamorphic were, say, 0.43, the interpretation is not simply that the predicted average radon level increases by 0.43 units when the rock is metamorphic vs. sedimentary. However, the interpretation is also not simply for some sort of unit increase (say 0.1) in the proportion of metamorphic rock type, because this doesn’t reflect the fact that it’s also relative to baseline (sedimentary), and, additionally, that changing the proportion of metamorphic inherently changes the proportion of the other rock level fit in the model, igneous.

Does anyone have a source that provides the interpretation of such a model, or could you provide a brief example here if not?


As a follow-up, here is what I think is the correct answer (it seems reasonable to me). I posted this question to the ASA Connect listserv and got the following response from Thomas Sexton at Stony Brook:

“Your estimated linear regression model looks like:

ln(Radon) = (a linear expression in other variables) + 0.43M + 0.92I

where M and I represent the percentages of metamorphic and igneous rocks, respectively, in the ZIP code. You are constrained by:

M + I + S = 100

where S represents the percentage of sedimentary rock in the ZIP code.

The interpretation of the 0.43 is that a one percentage point increase in M is associated with an increase of 0.43 in ln(Radon) holding all other variables in the model fixed. Thus, the value of I cannot change, and the only way to have a one percentage point increase in M while satisfying the constraint is to have a one percentage point decrease in S, the omitted category.

Of course, this change cannot occur in ZIP codes in which S = 0, but a decrease in M and a corresponding increase in S would be possible in such ZIP codes.”
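To check this reading numerically, here is a minimal sketch in plain Python. It is my own illustration, not part of the quoted response. It assumes hypothetical per-percentage-point contributions for each rock type (`GAMMA`), a made-up intercept, and a handful of invented ZIP codes with no noise. Substituting S = 100 - M - I into the full model shows that the coefficient on M, with S omitted, should equal the difference between the M and S contributions, and the fit below recovers exactly that.

```python
# Hypothetical illustration: none of these numbers come from real radon data.

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Assumed contribution to ln(Radon) per percentage point of each rock type.
GAMMA = {"S": 0.10, "M": 0.53, "I": 1.02}
INTERCEPT = 2.0

# Invented ZIP codes as (M, I) percentages; S is implied by M + I + S = 100.
zips = [(10, 20), (30, 5), (50, 25), (0, 40), (25, 25), (60, 0), (15, 70), (40, 35)]


def log_radon(m, i):
    """Noise-free ln(Radon) under the assumed full model with all three rock types."""
    s = 100 - m - i
    return INTERCEPT + GAMMA["M"] * m + GAMMA["I"] * i + GAMMA["S"] * s


y = [log_radon(m, i) for m, i in zips]

# Fit with sedimentary omitted: columns are intercept, M, I.
X = [[1.0, float(m), float(i)] for m, i in zips]
b0, b_m, b_i = ols(X, y)

# Moving one percentage point from S to M, holding I fixed.
shift_s_to_m = log_radon(11, 20) - log_radon(10, 20)

print(f"coefficient on M: {b_m:.4f}  (gamma_M - gamma_S = {GAMMA['M'] - GAMMA['S']:.4f})")
print(f"coefficient on I: {b_i:.4f}  (gamma_I - gamma_S = {GAMMA['I'] - GAMMA['S']:.4f})")
print(f"effect of shifting 1 point from S to M: {shift_s_to_m:.4f}")
```

Under these assumptions the fitted coefficients come out at about 0.43 and 0.92, matching the numbers in the quoted model, and the intercept absorbs the sedimentary contribution (INTERCEPT + 100 * GAMMA["S"]). The same coefficients also cover other trade-offs: shifting one percentage point from igneous to metamorphic, holding S fixed, would change ln(Radon) by b_m - b_i.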

Here is the link to the ASA thread:

I’m posting this as the accepted correct answer, but am still open to further discussion if anyone has something to add.

Source : Link , Question Author : Meg , Answer Author : Meg
