When fitting a GAM using mgcv in R, we need to specify the family = argument. I tried some families (e.g., Gaussian, Gamma), and R seems to fit them all successfully. Are there any guidelines on how to choose the appropriate “family”?
Here is an example of what I mean by “outcome conditioned on the covariate”.
I want to do a linear regression. I have a continuous outcome and I am regressing it on a binary variable. This is equivalent to a t-test, but let’s pretend we don’t know that.
What most people do is look at the marginal distribution of the outcome. This is equivalent to plotting a histogram of the outcome variable. Let’s look at that now.
Ew, gross, this is bimodal. Linear regression assumes the outcome is normally distributed, right? We can’t use linear regression on this!
…or can we? Here is the output of a linear model fit to this data.
Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max
-7.3821 -1.7504 -0.0194  1.7190  7.8183

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.8994     0.1111   89.13   <2e-16 ***
x            12.0931     0.1588   76.14   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.511 on 998 degrees of freedom
Multiple R-squared:  0.8531, Adjusted R-squared:  0.853
F-statistic:  5797 on 1 and 998 DF,  p-value: < 2.2e-16
An incredibly good fit. So what gives?
The plot above shows the marginal distribution of the outcome. Regression, be it linear or otherwise, only cares about the conditional distribution: the distribution of the outcome conditioned on the covariates. Let’s see what happens when I color the observations by the binary variable.
You can see here that the outcome, conditioned on the binary variable, is normal within each group, and hence fits linear regression’s assumptions.
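The data behind these plots are not shown, so here is a rough sketch of how one could reproduce the same picture with simulated data. Everything specific in it is an assumption made for illustration: the seed, the names d and fit, and the intercept (10), slope (12) and noise standard deviation (2.5), which were picked only to land near the estimates printed above.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# Simulated stand-in for the data above (assumed values, not the original data)
n <- 1000
x <- rbinom(n, size = 1, prob = 0.5)             # binary covariate
y <- 10 + 12 * x + rnorm(n, mean = 0, sd = 2.5)  # normal noise around each group mean
d <- data.frame(x = x, y = y)

# Marginal distribution: bimodal, because it mixes the two groups
hist(d$y, breaks = 40, main = "Marginal distribution of y", xlab = "y")

# Conditional distributions: each group on its own looks roughly normal
op <- par(mfrow = c(1, 2))
hist(d$y[d$x == 0], breaks = 20, main = "y given x = 0", xlab = "y")
hist(d$y[d$x == 1], breaks = 20, main = "y given x = 1", xlab = "y")
par(op)

# The residuals estimate the conditional noise, so they are what should look normal
fit <- lm(y ~ x, data = d)
qqnorm(resid(fit))
qqline(resid(fit))
```

In practice the residual Q-Q plot is the more useful check, because with continuous covariates there is no small set of groups to draw separate histograms for.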
So when I say “think about the outcome conditioned on covariates” what I am really asking you to do is to pick a particular set of covariate values and think about the distribution of outcomes you would see at those values. That will determine the family.
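To tie this back to mgcv, here is a minimal sketch under assumed conditions: a simulated outcome that is strictly positive and right-skewed at any given x, with spread growing with the mean, which is the kind of conditional distribution a Gamma family is meant for. The data, the names d2, fit_gamma and fit_gauss, and the choice of a log link are all illustrative assumptions, not a recommendation for any particular dataset.

```r
library(mgcv)  # provides gam()

set.seed(2)  # arbitrary seed, used only for reproducibility

# Simulated data (assumed for illustration): given x, y is Gamma distributed
n  <- 500
x  <- runif(n, 0, 1)
mu <- exp(1 + sin(2 * pi * x))              # conditional mean, always positive
y  <- rgamma(n, shape = 2, rate = 2 / mu)   # Gamma with mean mu; variance grows with mu
d2 <- data.frame(x = x, y = y)

# Both calls run without complaint, so a successful fit says nothing about the family
fit_gamma <- gam(y ~ s(x), family = Gamma(link = "log"), data = d2, method = "REML")
fit_gauss <- gam(y ~ s(x), family = gaussian(), data = d2, method = "REML")

# Compare the two candidate families on the same response
AIC(fit_gauss, fit_gamma)

# Residuals against fitted values: a funnel shape under the Gaussian fit
# suggests the constant-variance assumption does not hold
plot(fitted(fit_gauss), residuals(fit_gauss), xlab = "Fitted", ylab = "Residual")
```

The point is the same as in the linear regression example: the histogram of y alone does not pick the family. What matters is what y looks like at fixed covariate values (its support, skewness, and how its variance changes with the mean), and residual diagnostics are how you check that after fitting.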