I am trying to find a sensible way to calculate the deviance explained by each predictor variable in a GAM model and need some input on my calculations.
Following Simon Wood’s example on the thread https://stat.ethz.ch/pipermail/r-help/2009-July/397343.html, say I have the following data:
set.seed(1745) dat <- gamSim(1,n=400,dist="normal",scale=2)
These data include the response variable y and the predictor variables x0, x1, x2 and x3. Using these data, I fit the following GAM model:
b <- gam(y ~ s(x0, k=4) + s(x1,k=4) + s(x2,k=4) + s(x3,k=4), gamma = 1.4, data=dat)
Using the command
plot(b, page = 1), the model fit looks like this:
The goal is to calculate the percent deviance explained by each predictor in this model.
First, I compute the overall deviance explained by the model like so:
bint <- gam(y ~ 1,data=dat) overall.deviance.explained <- (deviance(bint) - deviance(b))/deviance(bint) percent.overall.deviance.explained <- round(overall.deviance.explained*100) percent.overall.deviance.explained
This gives 74%, which is exactly what summary(b) reports. So far so good.
Note that the smoothing parameters used in the estimation of the smooth, nonlinear effects of x0, x1, x2 and x3 in the model are given by b$sp:
Also note that I used k = 4 for each smooth and set gamma to 1.4 in an attempt to prevent overfitting.
Next, I compute the percent deviance explained by x2 in model b as follows:
# fit an alternative model b2 without x2, but use the same smoothing parameters for the # smooth terms preserved in the model as used in the original model b b2 <- gam(y ~ s(x0, k = 4) + s(x1, k = 4) + s(x3, k = 4), sp = b$sp[-3], # omit smoothing parameter for s(x2) in the original model b, gamma = 1.4, data = dat) partial.deviance.explained.by.x2 <- (deviance(b2) - deviance(b))/deviance(bint) percent.partial.deviance.explained.by.x2 <- round(partial.deviance.explained.by.x2 * 100) percent.partial.deviance.explained.by.x2
The idea for computing the percent partial deviance this way came from the reference Can patterns of urban biodiversity be predicted using simple measures of green infrastructure? by Brunbjerg et al., Urban Forestry & Urban Greening 32 (2018) 143–153:
To use the language in this reference, the null model in my example is the intercept-only model. The best model is the model b, which includes all predictors: x0, x1, x2 and x3. The alternative model without the target variable is obtained by dropping the predictor x2 from the best model.
Here are my questions:
Question 1: The quoted reference does NOT mention that the smoothing parameters used in the “alternative model without the target variable” (i.e., model b2, which was obtained by dropping x2 from model b) should be the same as those used in the “best model” (i.e., model b, which includes x2 as well as all other predictors). Based on Simon Wood’s answer at https://stat.ethz.ch/pipermail/r-help/2009-July/397343.html, it seems to me that a fair comparison between the deviance of the “alternative” and “best” model SHOULD ensure that the smoothing parameters are kept the same across the terms which appear in both models. Am I correct in my thinking that it SHOULD?
Question 2: If I use a calculation like above to calculate the percent partial deviance explained by each predictor variable in the model and then add up these percentages across all predictors included in the original model, I end up with odd situations for my own data (not for the example provided above) where:
(i) sum of percent partial deviances across all predictors < overall deviance (ii) sum of percent partial deviances across all predictors > overall deviance
The second situation is particularly puzzling. (Note that if I don’t use the same smoothing parameters across models with and without the target predictors, I sometimes end up with negative percent partial deviances for some predictors, which also seems odd.)
To address these issues, I feel inclined to define the percent partial deviance explained in relation to the sum of percent partial deviances across all predictors (rather than in relation to the overall deviance). In other words, define:
sum.partial.deviances.across.all.predictors <- (deviance(b0) - deviance(b)) + (deviance(b1) - deviance(b)) + (deviance(b2) - deviance(b)) + (deviance(b3) - deviance(b))
and then use:
partial.deviance.explained.by.x2 <- (deviance(b2) - deviance(b))/sum.partial.deviances.across.all.predictors
partial.deviance.explained.by.x2 <- (deviance(b2) - deviance(b))/deviance(b0)
Would something like this make sense? At least I would get percent partial deviances explained that add up to 100%.
Question 3: Is there a different and hopefully better way to compute the percent partial deviance explained by each predictor in a GAM model, that wouldn’t suffer from the issues I pointed out in this post?
Question 4: Given the odd/puzzling behaviour I’ve come upon, I am no longer sure I understand the concept of deviance in a GAM model. If we drop a predictor with a smooth, nonlinear effect from a model (but keep the smoothing parameters the same for the other predictors with smooth, nonlinear effects in the model the same), should we or should we not expect the deviance of the resulting model to increase? Will concurvity play any role into what our expectation should be?
Question 5: On the surface of it, comparing GAM models with and without a target predictor – such as x2 in my example above – in terms of their deviance should be done in a way that ensures the models are comparable (i.e., both models are the same, except for the fact that one includes a smooth, nonlinear effect of the target predictor and the other does not). Initially, I was hopeful that keeping the smoothing parameters the same for the non-target predictors in the two models (in addition to using the same basis dimension k and the same value of gamma in my example) would ensure the comparability of the models, so that we are comparing like with like in terms of the deviance. Now, I am not so sure. The one reference I was able to find in the literature that seems to worry about the issue of comparability advises replacing the alternate (fitted) model without the target predictor with a (fitted) “model” where the smooth effect of the target predictor – say, s(x2) – is replaced by a constant given by s(m2), where m2 = mean(x2). I do not understand why we would not replace s(2) with 0 so that we eliminate the effect of x2 from the model which initially contained it, while keeping all else the same??