Model building and selection using Hosmer et al. 2013. Applied Logistic Regression in R

This is my first post on StackExchange, although I have been using it as a resource for quite a while. I will do my best to use the appropriate format and make the appropriate edits. This is also a multi-part question; I wasn’t sure whether to split it into several posts or keep it as one. Since the questions all come from one section of the same text, I thought it would be more relevant to post them together.

I am researching habitat use of a large mammal species for a Master’s Thesis. The goal of this project is to provide forest managers (who are most likely not statisticians) with a practical framework to assess the quality of habitat on the lands they manage in regard to this species. This animal is relatively elusive, a habitat specialist, and usually located in remote areas. Relatively few studies have been carried out regarding the distribution of the species, especially seasonally. Several animals were fitted with GPS collars for a period of one year. One hundred locations (50 summer and 50 winter) were randomly selected from each animal’s GPS collar data. In addition, 50 points were randomly generated within each animal’s home range to serve as “available” or “pseudo-absence” locations. The locations from the GPS collars are coded as 1 and the randomly selected available locations are coded as 0.

For each location, several habitat variables were sampled in the field (tree diameters, horizontal cover, coarse woody debris, etc.) and several were sampled remotely through GIS (elevation, distance to road, ruggedness, etc.). The variables are mostly continuous, except for one categorical variable that has 7 levels.

My goal is to use regression modelling to build resource selection functions (RSF) to model the relative probability of use of resource units. I would like to build a seasonal (winter and summer) RSF for the population of animals (design type I) as well as each individual animal (design type III).

I am using R to perform the statistical analysis.

The primary text I have been using is…

  • “Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. 2013. Applied Logistic Regression. Wiley, Chichester”.

The majority of the examples in Hosmer et al. use Stata, so I have also been using the following two texts for reference with R.

  • “Crawley, M. J. 2005. Statistics: An Introduction Using R. Wiley,
    Chichester, West Sussex, England.”
  • “Plant, R. E. 2012. Spatial Data Analysis in Ecology and Agriculture
    Using R. CRC Press, London, GBR.”

I am currently following the steps in Chapter 4 of Hosmer et al. for the “Purposeful Selection of Covariates” and have a few questions about the process. I have outlined the first few steps in the text below to aid in my questions.

  1. Step 1: A univariable analysis of each independent variable (I used a
    univariable logistic regression). Any variable whose univariable test
    has a p-value of less than 0.25 should be included in the first
    multivariable model.
  2. Step 2: Fit a multivariable model containing all covariates
    identified for inclusion at step 1 and assess the importance of
    each covariate using the p-value of its Wald statistic. Variables
    that do not contribute at traditional levels of significance should
    be eliminated and a new model fit. The newer, smaller model should be
    compared to the old, larger model using the partial likelihood ratio
    test.
  3. Step 3: Compare the values of the estimated coefficients in the
    smaller model to their respective values from the large model. Any
    variable whose coefficient has changed markedly in magnitude should
    be added back into the model as it is important in the sense of
    providing a needed adjustment of the effect of the variables that
    remain in the model. Cycle through steps 2 and 3 until it appears
    that all of the important variables are included in the model and
    those excluded are clinically and/or statistically unimportant.
    Hosmer et al. use the “delta-beta-hat-percent” as a measure of the
    change in magnitude of the coefficients. They suggest that a
    delta-beta-hat-percent of >20% indicates a significant change.
    Hosmer et al. define the delta-beta-hat-percent as
    \Delta\hat{\beta}\%=100\frac{\hat{\theta}_{1}-\hat{\beta}_{1}}{\hat{\beta}_{1}},
    where \hat{\theta}_{1} is the coefficient from the smaller model and
    \hat{\beta}_{1} is the coefficient from the larger model.
  4. Step 4: Add each variable not selected in Step 1 to the model
    obtained at the end of step 3, one at a time, and check its
    significance either by the Wald statistic p-value or the partial
    likelihood ratio test if it is a categorical variable with more than
    2 levels. This step is vital for identifying variables that, by
    themselves, are not significantly related to the outcome but make an
    important contribution in the presence of other variables. We refer
    to the model at the end of Step 4 as the preliminary main effects
    model.
  5. Steps 5-7: I have not progressed to this point so I will leave these
    steps out for now, or save them for a different question.
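
For readers following along, Steps 1 and 2 as outlined above might be sketched in R roughly as follows. This is only an illustration on simulated data: the data frame `dat` and the covariate names `elev`, `road_dist`, and `cover` are made up, and picking the covariate with the largest Wald p-value as the deletion candidate is an assumption of this sketch rather than something the text prescribes.

```r
# Simulated stand-in data; all variable names here are hypothetical.
set.seed(1)
n <- 300
dat <- data.frame(
  elev      = rnorm(n),
  road_dist = rnorm(n),
  cover     = rnorm(n)
)
# In this toy example only elev and cover truly affect use.
dat$used <- rbinom(n, 1, plogis(-0.5 + 1.0 * dat$elev + 0.8 * dat$cover))

covariates <- c("elev", "road_dist", "cover")

# Step 1: one univariable logistic regression per covariate, keeping the
# p-value of the likelihood ratio test against the intercept-only model.
null_model <- glm(used ~ 1, family = binomial, data = dat)
uni_p <- sapply(covariates, function(v) {
  fit <- glm(reformulate(v, response = "used"), family = binomial, data = dat)
  anova(null_model, fit, test = "Chisq")[2, "Pr(>Chi)"]
})
step1_keep <- names(uni_p)[uni_p < 0.25]

# Step 2: multivariable model containing the Step 1 covariates.
largemodel <- glm(reformulate(step1_keep, response = "used"),
                  family = binomial, data = dat)

# Deletion candidate: the covariate with the largest Wald p-value
# (an assumption of this sketch).
wald <- summary(largemodel)$coefficients[-1, , drop = FALSE]
drop_var <- rownames(wald)[which.max(wald[, "Pr(>|z|)"])]
smallmodel <- update(largemodel, as.formula(paste(". ~ . -", drop_var)))

# Partial likelihood ratio test of the smaller against the larger model.
# A small p-value means the larger model fits significantly better,
# i.e. the dropped covariate contributes.
lrt <- anova(smallmodel, largemodel, test = "Chisq")
lrt_p <- lrt[2, "Pr(>Chi)"]
print(lrt)
```

Passing the smaller model first to `anova()` puts the test for the dropped covariate in the second row of the resulting table.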

My questions:

  1. In step 2, what would be appropriate as a traditional level of
    significance: a p-value of <0.05, or something larger like <0.25?
  2. In step 2 again, I want to make sure the R code I have been using for the partial likelihood ratio test is correct and that I am interpreting the results correctly. Here is what I have been doing: anova(smallmodel,largemodel,test='Chisq'). If the p-value is significant (<0.05), I add the variable back to the model; if it is not significant, I proceed with deletion. Is that right?
  3. In step 3, I have a question regarding the delta-beta-hat-percent and when it is appropriate to add an excluded variable back to the model. For example, I exclude one variable from the model and the coefficient of a different variable changes by >20% (as measured by \Delta\hat{\beta}\%). However, that second variable seems to be insignificant and looks as if it will itself be excluded from the model in the next few cycles of Steps 2 and 3. How can I determine whether both variables should be included or excluded? Because I am excluding one variable at a time, deleting the least significant variables first, I am hesitant to exclude a variable out of order.
  4. Finally, I want to make sure the code I am using to calculate \Delta\hat{\beta}\% is correct. I have been using the following code. If there is a package that will do this for me, or a simpler way of doing it, I am open to suggestions.

    100*((smallmodel$coef[2]-largemodel$coef[2])/largemodel$coef[2])
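
One thing worth noting about the line above is that the positional index `[2]` refers to the same variable in both models only if the dropped variable comes after it in the formula. A possible name-based variant is sketched below; the helper `delta_beta_pct()` and the toy covariates `x1`, `x2`, `x3` are hypothetical and not taken from Hosmer et al. or from any package.

```r
# Percent change in every coefficient shared by both models, relative to
# the larger model: 100 * (theta_hat - beta_hat) / beta_hat.
# delta_beta_pct() is a hypothetical helper written for this sketch.
delta_beta_pct <- function(smallmodel, largemodel) {
  shared <- intersect(names(coef(smallmodel)), names(coef(largemodel)))
  shared <- setdiff(shared, "(Intercept)")
  100 * (coef(smallmodel)[shared] - coef(largemodel)[shared]) /
    coef(largemodel)[shared]
}

# Toy data in which x2 is correlated with x1, so dropping x2 should
# noticeably shift the coefficient of x1.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1 + 1.0 * x2))
toy <- data.frame(y, x1, x2, x3)

largemodel <- glm(y ~ x1 + x2 + x3, family = binomial, data = toy)
smallmodel <- update(largemodel, . ~ . - x2)

change <- delta_beta_pct(smallmodel, largemodel)
print(round(change, 1))

# Names of the coefficients that moved by more than 20%.
flagged <- names(change)[abs(change) > 20]
print(flagged)
```

Matching by name keeps the comparison valid regardless of where the dropped variable sat in the formula, and it reports all remaining coefficients in one call instead of one index at a time.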

Answer

None of those proposed methods have been shown by simulation studies to work. Spend your efforts formulating a complete model and then fit it. Univariate screening is a terrible approach to model formulation, and the other components of stepwise variable selection you hope to use should likewise be avoided. This has been discussed at length on this site. What gave you the idea in the first place that variables should sometimes be removed from models because they are not “significant”? Don’t use P-values or changes in \beta to guide any of the model specification.
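
The answer gives no code, so the following is only one possible reading of "formulate a complete model and then fit it", sketched with base R's `glm()` on simulated data. The covariate names (`elev`, `road_dist`, `cover`, `stand`) are hypothetical stand-ins for a set chosen in advance from subject-matter knowledge.

```r
# Simulated stand-in data; all covariate names are hypothetical.
set.seed(7)
n <- 400
dat <- data.frame(
  elev      = rnorm(n),
  road_dist = rnorm(n),
  cover     = rnorm(n),
  stand     = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$used <- rbinom(n, 1, plogis(-0.3 + 0.8 * dat$elev - 0.5 * dat$road_dist))

# Fit the pre-specified model once, with no univariable screening and no
# deletion of terms based on p-values or coefficient changes.
full_model <- glm(used ~ elev + road_dist + cover + stand,
                  family = binomial, data = dat)

# Report every pre-specified term with its estimate and Wald 95% interval.
est <- cbind(estimate = coef(full_model), confint.default(full_model))
print(round(est, 2))

# One overall likelihood ratio test of the full model against the
# intercept-only model.
null_model <- update(full_model, . ~ 1)
overall <- anova(null_model, full_model, test = "Chisq")
print(overall)
```

All terms stay in the model whatever their individual p-values, which is the point of specifying the model before looking at the outcome.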

Attribution
Source: Link, Question Author: GNG, Answer Author: Frank Harrell
