Understanding dummy (manual or automated) variable creation in GLM

If a factor variable (e.g. gender with levels M and F) is used in the glm formula, dummy variable(s) are created, and they appear in the glm model summary along with their associated coefficients (e.g. genderM).

If, instead of relying on R to split up the factor in this way, the factor is encoded as a series of numeric 0/1 variables (e.g. genderM (1 for M, 0 for F) and genderF (1 for F, 0 for M)) and these variables are then used as numeric variables in the glm formula, would the resulting coefficients be any different?

Basically the question is: does R use a different coefficient calculation when working with factor variables versus numeric variables?

Follow-up question (possibly answered by the above): apart from the convenience of letting R create the dummy variables, is there any problem with re-coding factors as a series of numeric 0/1 variables and using those in the model instead?

Answer

Categorical variables (called “factors” in R) need to be represented by numerical codes in multiple regression models. There are many possible ways to construct numerical codes appropriately (see this great list at UCLA’s stats help site). By default, R uses reference level coding for unordered factors (which R calls “contr.treatment”), which is also pretty much the default statistics-wide. This can be changed for all contrasts for your entire R session using ?options, or for specific analyses / variables using ?contrasts or ?C (note the capital). If you need more information about reference level coding, I explain it here: Regression based for example on days of the week.
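As a rough illustration of those three entry points, here is a minimal sketch. The small `sex` factor is made up for the example, and sum-to-zero coding (`contr.sum`) is just one arbitrary alternative to the default.

```r
# Hypothetical two-level factor, used only for illustration
sex <- factor(c("Female", "Male", "Female", "Male"))

# Session-wide defaults: one coding for unordered factors, one for ordered
getOption("contrasts")

# Default coding for this factor: a single 0/1 column for the non-reference level
contrasts(sex)

# Change the coding for this one variable to sum-to-zero coding
contrasts(sex) <- contr.sum(2)
contrasts(sex)

# Or change the default for the whole session, then restore the old setting
old <- options(contrasts = c("contr.sum", "contr.poly"))
options(old)
```

Setting contrasts on the variable affects only models that use that variable, whereas options() affects every factor in every model fit afterwards, so restoring the old value is usually a good idea.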

Some people find reference level coding confusing, and you don’t have to use it. If you want, you can have two variables, one for male and one for female; this is called level means coding. However, if you do that, you will need to suppress the intercept. Otherwise the two indicator columns add up to the intercept column and the model matrix will be singular, as @Affine notes above and as I explain here: Qualitative variable coding leads to singularities. (R’s lm and glm do not refuse to fit such a model, but they report NA for the redundant coefficient.) To suppress the intercept, you modify your formula by adding -1 or +0 like so: y~... -1 or y~... +0.
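The singularity can be checked directly. The sketch below uses simulated data (the variable names are chosen only for illustration) and compares the rank of the model matrix with its number of columns, then shows what lm does with and without the intercept.

```r
set.seed(1)
y    <- c(rnorm(30), rnorm(30, mean = 1))
sex  <- rep(c("Female", "Male"), each = 30)
fem  <- ifelse(sex == "Female", 1, 0)
male <- ifelse(sex == "Male", 1, 0)

# Intercept plus both indicators: fem + male equals the intercept column
X <- model.matrix(~ fem + male)
ncol(X)      # 3 columns
qr(X)$rank   # but only rank 2

# lm still runs, but the redundant coefficient comes back as NA
coef(lm(y ~ fem + male))

# With the intercept suppressed, both level means are estimated
coef(lm(y ~ fem + male + 0))
```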

Using level means coding instead of reference level coding will change the coefficients estimated and the meaning of the hypothesis tests that are printed with your output. When you have a two level factor (e.g., male vs. female) and you use reference level coding, you will see the intercept, labelled (Intercept), and only one variable listed in the output (perhaps sexM). The intercept is the mean of the reference group (perhaps females) and sexM is the difference between the mean of males and the mean of females. The p-value associated with the intercept is a one-sample $t$-test of whether the reference level has a mean of $0$, and the p-value associated with sexM tells you if the sexes differ on your response. But if you use level means coding instead, you will have two variables listed and each p-value will correspond to a one-sample $t$-test of whether the mean of that level is $0$. That is, none of the p-values will be a test of whether the sexes differ.

set.seed(1)
y    = c(    rnorm(30), rnorm(30, mean=1)         )
sex  = rep(c("Female",  "Male"          ), each=30)
fem  = ifelse(sex=="Female", 1, 0)
male = ifelse(sex=="Male", 1, 0)

ref.level.coding.model   = lm(y~sex)
level.means.coding.model = lm(y~fem+male+0)

summary(ref.level.coding.model)
# ...
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  0.08246    0.15740   0.524    0.602    
# sexMale      1.05032    0.22260   4.718 1.54e-05 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# ...
summary(level.means.coding.model)
# ...
# Coefficients:
#      Estimate Std. Error t value Pr(>|t|)    
# fem   0.08246    0.15740   0.524    0.602    
# male  1.13277    0.15740   7.197 1.37e-09 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# ...
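To come back to the original question: if you build the 0/1 variable yourself and leave out the reference level, you should get the same coefficients as when R codes the factor for you, because the two model matrices are identical. A minimal sketch with simulated data and a Gaussian glm (the object names are made up for the example):

```r
set.seed(1)
y    <- c(rnorm(30), rnorm(30, mean = 1))
sex  <- factor(rep(c("Female", "Male"), each = 30))
male <- as.numeric(sex == "Male")   # hand-made dummy, Female is the reference

factor.fit <- glm(y ~ sex)    # R builds the dummy column (sexMale)
dummy.fit  <- glm(y ~ male)   # numeric 0/1 column supplied by hand

# The column R creates matches the hand-made one
all(model.matrix(factor.fit)[, "sexMale"] == male)

# Same estimates, only the coefficient names differ
all.equal(unname(coef(factor.fit)), unname(coef(dummy.fit)))

# The level means can be recovered from the reference level fit
b <- coef(factor.fit)
c(fem = b[["(Intercept)"]], male = b[["(Intercept)"]] + b[["sexMale"]])
```

So the estimation itself does not differ between factors and numeric variables; what matters is which columns end up in the model matrix.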

Attribution
Source : Link , Question Author : Bryan , Answer Author : user1205901 – Слава Україні
