I have a question regarding the use of a grouping variable in a non-linear model. Since the nls() function does not allow for factor variables, I have been struggling to figure out whether one can test the effect of a factor on the model fit. In the example below I want to fit a "seasonalized von Bertalanffy" growth model (most commonly applied to fish growth) to different growth treatments: I would like to test the effect of the lake where the fish grew as well as of the food given (just an artificial example). I am familiar with one workaround to this problem: applying an F-test that compares a model fit to the pooled data against separate fits, as outlined by Chen et al. (1992) (ARSS, "analysis of residual sum of squares"). In other words, for the example below: does fitting two separate models significantly reduce the residual sum of squares? (In this example, yes.)
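Concretely, the F statistic computed in the code below is

$$F = \frac{\left(\mathrm{RSS}_{\text{pooled}} - \mathrm{RSS}_{\text{sep}}\right)/\left(k_{\text{sep}} - k_{\text{pooled}}\right)}{\mathrm{RSS}_{\text{sep}}/\mathrm{df}_{\text{sep}}}$$

where $k$ denotes the number of estimated parameters and $\mathrm{df}_{\text{sep}}$ the residual degrees of freedom of the combined separate fits.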
I imagine there is a simpler way to do this in R using nlme(), but I am running into problems. First, when I use a grouping variable, the degrees of freedom are higher than what I obtain by fitting separate models. Second, I am unable to nest grouping variables, and I can't see where my problem is. Any help using nlme or other methods is greatly appreciated. Below is the code for my artificial example:
###seasonalized von Bertalanffy growth model
soVBGF <- function(S.inf, k, age, age.0, age.s, c){
  #note: the seasonal sine terms are divided by (2*pi)
  S.inf * (1 - exp(-k * ((age - age.0)
                         + (c * sin(2*pi*(age - age.s)) / (2*pi))
                         - (c * sin(2*pi*(age.0 - age.s)) / (2*pi)))))
}

###Make artificial data
food <- c("corn", "corn", "wheat", "wheat")
lake <- c("king", "queen", "king", "queen")

#parameters for corn/king, corn/queen, wheat/king, wheat/queen
S.inf <- c(140, 140, 130, 130)
k     <- c(0.5, 0.6, 0.8, 0.9)
age.0 <- c(-0.1, -0.05, -0.12, -0.052)
age.s <- c(0.5, 0.5, 0.5, 0.5)
cs    <- c(0.05, 0.1, 0.05, 0.1)
PARS <- data.frame(food=food, lake=lake, S.inf=S.inf, k=k,
                   age.0=age.0, age.s=age.s, c=cs,
                   stringsAsFactors=TRUE) #factors needed for levels() below

#make data
set.seed(3)
db <- c()
PCH <- NaN*seq(4)
COL <- NaN*seq(4)
for(i in seq(4)){
  age <- runif(100, min=0.2, max=5)
  age <- age[order(age)]
  size <- soVBGF(PARS$S.inf[i], PARS$k[i], age, PARS$age.0[i],
                 PARS$age.s[i], PARS$c[i]) + rnorm(length(age), sd=3)
  PCH[i] <- c(1,2)[which(levels(PARS$food) == PARS$food[i])]
  COL[i] <- c(2,3)[which(levels(PARS$lake) == PARS$lake[i])]
  db <- rbind(db, data.frame(age=age, size=size, food=PARS$food[i],
                             lake=PARS$lake[i], pch=PCH[i], col=COL[i]))
}

#visualize data
plot(db$size ~ db$age, col=db$col, pch=db$pch)
legend("bottomright", legend=paste(PARS$food, PARS$lake), col=COL, pch=PCH)

###fit growth model
library(nlme)
starting.values <- c(S.inf=140, k=0.5, c=0.1, age.0=0, age.s=0)

#fit to pooled data ("small model")
fit0 <- nls(size ~ soVBGF(S.inf, k, age, age.0, age.s, c),
            data=db, start=starting.values)
summary(fit0)

#fit to each lake separately ("large model")
fit.king <- nls(size ~ soVBGF(S.inf, k, age, age.0, age.s, c),
                data=db, start=starting.values, subset=db$lake=="king")
summary(fit.king)
fit.queen <- nls(size ~ soVBGF(S.inf, k, age, age.0, age.s, c),
                 data=db, start=starting.values, subset=db$lake=="queen")
summary(fit.queen)

#analysis of residual sum of squares (F-test)
resid.small <- resid(fit0)
resid.big <- c(resid(fit.king), resid(fit.queen))
df.small <- summary(fit0)$df
df.big <- summary(fit.king)$df + summary(fit.queen)$df
F.value <- ((sum(resid.small^2) - sum(resid.big^2)) / (df.big[1] - df.small[1])) /
            (sum(resid.big^2) / df.big[2])
P.value <- pf(F.value, df.big[1] - df.small[1], df.big[2], lower.tail=FALSE)
F.value; P.value

###plot models
plot(db$size ~ db$age, col=db$col, pch=db$pch)
legend("bottomright", legend=paste(PARS$food, PARS$lake), col=COL, pch=PCH)
legend("topleft", legend=c("soVBGF pooled", "soVBGF king", "soVBGF queen"),
       col=c(1,2,3), lwd=2)

#plot "small" model (pooled data)
tmp <- data.frame(age=seq(min(db$age), max(db$age), length.out=100))
pred <- predict(fit0, tmp)
lines(tmp$age, pred, col=1, lwd=2)

#plot "large" model (separate fits)
tmp <- data.frame(age=seq(min(db$age), max(db$age), length.out=100), lake="king")
pred <- predict(fit.king, tmp)
lines(tmp$age, pred, col=2, lwd=2)
tmp <- data.frame(age=seq(min(db$age), max(db$age), length.out=100), lake="queen")
pred <- predict(fit.queen, tmp)
lines(tmp$age, pred, col=3, lwd=2)

###Can this be done in one step using a grouping variable?

#with "lake" as grouping variable
starting.values <- c(S.inf=140, k=0.5, c=0.1, age.0=0, age.s=0)
fit1 <- nlme(model = size ~ soVBGF(S.inf, k, age, age.0, age.s, c),
             data=db,
             fixed = S.inf + k + c + age.0 + age.s ~ 1,
             groups = ~ lake,
             start=starting.values)
summary(fit1)

#similar residuals to the separately fitted models
sum(resid(fit.king)^2) + sum(resid(fit.queen)^2)
sum(resid(fit1)^2)

#but different degrees of freedom? (10 vs. 21?)
summary(fit.king)$df + summary(fit.queen)$df
AIC(fit1, fit0)

###I would also like to nest my grouping factors. This doesn't work...
#with "lake" and "food" as grouping variables starting.values <- c(S.inf=140, k=0.5, c=0.1, age.0=0, age.s=0) fit2 <- nlme(model = size ~ soVBGF(S.inf, k, age, age.0, age.s, c), data=db, fixed = S.inf + k + c + age.0 + age.s ~ 1, group = ~ lake/food, start=starting.values )
Reference:
Chen, Y., Jackson, D.A. and Harvey, H.H., 1992. A comparison of von Bertalanffy and polynomial functions in modelling fish growth data. Canadian Journal of Fisheries and Aquatic Sciences, 49(6): 1228-1235.
Answer
You could stratify by the values of the categorical predictor and compare fits. For example, suppose you have continuous predictors $X_1, \dots, X_p$ and dependent variable $Y$. I believe nls() gives the maximum likelihood estimate of $f$ such that

$$Y = f(X_1, \dots, X_p) + \varepsilon$$

where $\varepsilon \sim N(0, \sigma^2)$ and $f$ is parameterized in some non-linear way (see the nls help file). Suppose you have a categorical predictor $B$ with $m$ levels; stratify the data based on the values of $B$ and fit the model within each stratum. Since the strata are disjoint subsets of the data, the log-likelihood of the stratified model, $L_1$, is the sum of the log-likelihoods within the strata, each of which can be extracted from an nls model using the logLik function (the log-likelihood of the unstratified model, $L_0$, is available the same way).
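For instance, with the models already fitted in the question (fit0 pooled; fit.king and fit.queen as the strata), the two log-likelihoods can be extracted like this:

#log-likelihood of the unstratified ("small") model
L0 <- as.numeric(logLik(fit0))
#log-likelihood of the stratified model: sum over the disjoint strata
L1 <- as.numeric(logLik(fit.king)) + as.numeric(logLik(fit.queen))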
The unstratified model is clearly a submodel of the stratified model, so the likelihood ratio test is appropriate for deciding whether the larger model is worth the added complexity. The test statistic is

$$\lambda = 2(L_1 - L_0)$$

If the categorical predictor truly has no effect, $\lambda$ asymptotically has a $\chi^2$ distribution with degrees of freedom equal to $mp - p = p(m-1)$, where $p$ is the number of parameters underlying your non-linear regression function. You can use pchisq() to calculate the $\chi^2$ p-value.
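Continuing the sketch above (here $m = 2$ lakes and $p = 5$ parameters in the soVBGF fit):

lambda <- 2 * (L1 - L0)  #likelihood ratio test statistic
m <- 2                   #levels of the categorical predictor (lake)
p <- 5                   #parameters: S.inf, k, c, age.0, age.s
pchisq(lambda, df = p*(m - 1), lower.tail = FALSE)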
Attribution
Source: Link, Question Author: Marc in the box, Answer Author: Macro