# How to choose “family” in Generalized Additive Model (GAM)

When fitting a GAM with `mgcv` in R, we need to specify the `family =` argument. I tried some families (e.g., Gaussian, Gamma), and R seems to fit them all successfully. Are there any guidelines on how to choose the appropriate “family”?

Here is an example of what I mean by “outcome conditioned on the covariate”.

I want to do a linear regression. I have a continuous outcome and I am regressing it on a binary variable. This is equivalent to a t-test, but let’s pretend we don’t know that.

What most people do is look at the marginal distribution of the data, which amounts to plotting a histogram of the outcome variable. Let’s look at that now. Ew, gross, this is bimodal. Linear regression assumes the outcome is normally distributed, right? We can’t use linear regression on this!

…or can we? Here is the output of a linear model fit to this data.

``````
Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-7.3821 -1.7504 -0.0194  1.7190  7.8183

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8994     0.1111   89.13   <2e-16 ***
x            12.0931     0.1588   76.14   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.511 on 998 degrees of freedom
Multiple R-squared:  0.8531,    Adjusted R-squared:  0.853
F-statistic:  5797 on 1 and 998 DF,  p-value: < 2.2e-16
``````

An incredibly good fit. So what gives?

The plot above is the marginal outcome. Regression, be it linear or otherwise, only cares about the conditional outcome: the distribution of the outcome conditioned on the covariates. Let’s see what happens when I color the observations by the binary variable. You can see here that the outcome conditioned on the covariate is normal within each group, and hence fits linear regression’s assumptions.
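The marginal-versus-conditional contrast can be reproduced with a small simulation. This is only a sketch: the original data are not shown, so the sample size, group means, and noise level below are assumptions picked to roughly resemble the `lm` output above, and the names `d` and `fit` are just illustrative.

```r
# Hypothetical data: all parameter values are assumptions, not the original data.
set.seed(1)
n <- 1000
x <- rbinom(n, size = 1, prob = 0.5)              # binary covariate
y <- 10 + 12 * x + rnorm(n, mean = 0, sd = 2.5)   # normal noise around each group mean
d <- data.frame(x = x, y = y)

# Marginal distribution: bimodal, because it mixes the two groups.
hist(d$y, breaks = 40, main = "Marginal distribution of y", xlab = "y")

# Conditional distributions: each group on its own is roughly normal.
op <- par(mfrow = c(1, 2))
hist(d$y[d$x == 0], breaks = 30, main = "y given x = 0", xlab = "y")
hist(d$y[d$x == 1], breaks = 30, main = "y given x = 1", xlab = "y")
par(op)

# The residuals estimate the conditional noise, which is the part
# the linear model assumes to be normal.
fit <- lm(y ~ x, data = d)
qqnorm(resid(fit))
qqline(resid(fit))
```

If the Q-Q plot of the residuals is close to a straight line, the normality assumption is reasonable even though the histogram of `y` alone looks nothing like a bell curve.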

So when I say “think about the outcome conditioned on covariates” what I am really asking you to do is to think about a particular set of covariates and think about the distribution of outcomes from those covariates. That will determine the family.
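Carried over to `mgcv`, the same reasoning might look like the sketch below. The data are simulated and entirely hypothetical (a strictly positive, right-skewed outcome whose mean varies smoothly with an assumed covariate `z`); the point is only that several families fit without error, so the choice has to come from the conditional distribution and from residual checks, not from whether the call succeeds.

```r
library(mgcv)

# Hypothetical data: a positive, right-skewed outcome with a smooth
# conditional mean. All values here are assumptions for illustration.
set.seed(2)
n <- 500
z <- runif(n)
mu <- exp(1 + sin(2 * pi * z))             # conditional mean of the outcome
y <- rgamma(n, shape = 2, rate = 2 / mu)   # Gamma outcome with mean mu
dat <- data.frame(y = y, z = z)

# Both calls run without error, so a successful fit alone says nothing
# about whether the family is appropriate.
fit_gauss <- gam(y ~ s(z), family = gaussian(), data = dat, method = "REML")
fit_gamma <- gam(y ~ s(z), family = Gamma(link = "log"), data = dat, method = "REML")

# Residual diagnostics examine the outcome conditional on the covariates.
gam.check(fit_gamma)

# A rough comparison; both models are fit to the same untransformed y.
AIC(fit_gauss, fit_gamma)
```

As a rule of thumb under the same logic: a continuous outcome that is roughly symmetric around its conditional mean suggests `gaussian()`, a positive right-skewed one suggests `Gamma()`, counts suggest `poisson()` or `nb()`, and binary outcomes or proportions suggest `binomial()`.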