I know that zero-inflated models (e.g. zero-inflated Poisson or negative binomial models) can be used for dependent variables. I also know that in general there are no assumptions for the independent variables (i.e. predictors) in regression analyses. However, I have a quantitative (continuous or count) predictor which has many (say, 40-60%) zeros. When I used it as a quantitative predictor in regression (linear or logistic) I got a small P value (i.e. P<0.01), but when I used it as a binary predictor (zero or not) I got a P value>0.05. Why did it happen? How do I interpret this result?
So it’s important to think of the source of zero-inflation. Two sources that come to mind are:
- floor effect: your measurement instrument cannot detect values below a certain threshold and so the instrument simply returns zero. Think of a scale with 40 questions supposedly measuring “high-end math knowledge”, so the test is really difficult. For each question, you could be incorrect (0 score) or correct (1 score). A test-taker could be incorrect on all 40 questions, resulting in a total score of 0. Assuming the scale is indeed unidimensional/valid for high-end math knowledge, this 0 score does not mean the respondent has zero math knowledge. But it suggests their math knowledge is below the level the test can detect. If someone administered this test to the general population, you might get a high proportion of zero scores. This scenario can occur in many different contexts. Sometimes, a measurement instrument can only detect differences beyond a threshold, and several respondents are below that threshold. Instruments for physical quantities might simply return a “< X” score, where X is that threshold.
- true zero: Sometimes, people really have a zero score. For example, this can happen with counts, e.g. how many homes do you own? And sometimes, there is something fundamentally different between someone who does not own their home and someone who owns one or more homes.
So it’s important to think of the source of zeroes and what you’re trying to measure.
In the floor effect situation, I might claim that the math scores would be approximately normally distributed but for the floor effect. So I’d assume we have a normally distributed variable that has been censored at 0. Analyzing the predictor as is amounts to range restriction, which, all other factors held constant, can reduce power. So you could build a model that accommodates a censored predictor; Bayesian modeling should make this easy.
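To make the floor-effect idea concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the latent predictor is standard normal, the floor sits at 0, the true slope is 0.5, and the helper name `censor_at_floor` is made up. It only compares how strongly the outcome correlates with the latent predictor versus the floored version you would actually observe; it is not a censored-predictor model.

```python
import numpy as np

def censor_at_floor(latent, floor=0.0):
    """Return what a floored instrument would report: values below `floor` become `floor`."""
    return np.maximum(latent, floor)

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical latent predictor (e.g. "true" math knowledge), assumed standard normal.
latent = rng.normal(0.0, 1.0, n)

# The instrument cannot see below 0, so about half the sample is recorded as exactly 0.
observed = censor_at_floor(latent, floor=0.0)

# The outcome depends on the latent variable, not on the floored score (assumed slope 0.5).
outcome = 0.5 * latent + rng.normal(0.0, 1.0, n)

corr_latent = np.corrcoef(latent, outcome)[0, 1]
corr_observed = np.corrcoef(observed, outcome)[0, 1]
share_zero = np.mean(observed == 0.0)

print(f"share of zeros in observed predictor: {share_zero:.2f}")
print(f"correlation with latent predictor:    {corr_latent:.3f}")
print(f"correlation with floored predictor:   {corr_observed:.3f}")
```

Under these assumptions the floored predictor correlates less strongly with the outcome than the latent one does, which is one way to see why a heavily floored predictor can cost you power.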
In the true zero situation, I might split the variable into two predictors: one binary (homeowner or not) and one quantitative (how many homes). Using both predictors in the model lets you measure the effect of home ownership separately from the effect of owning more homes.
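A minimal sketch of that two-predictor setup, using plain least squares on simulated data. The function name `fit_two_part_predictor`, the simulated home counts, and the true effects (2.0 for owning at all, 0.5 per home) are all assumptions chosen for illustration, not anything from your data.

```python
import numpy as np

def fit_two_part_predictor(x, y):
    """OLS of y on an intercept, a nonzero indicator for x, and the raw value of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    is_nonzero = (x > 0).astype(float)  # binary part: zero vs not zero
    design = np.column_stack([np.ones_like(x), is_nonzero, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return dict(zip(["intercept", "any", "amount"], coef))

rng = np.random.default_rng(0)
n = 5000

# Simulated predictor: about half own no home, owners have 1 or more homes.
owns = rng.random(n) < 0.5
homes = np.where(owns, 1 + rng.poisson(0.7, n), 0)

# Assumed data-generating model: a jump for owning at all plus a per-home slope.
y = 1.0 + 2.0 * owns + 0.5 * homes + rng.normal(0.0, 1.0, n)

est = fit_two_part_predictor(homes, y)
print(est)
```

In this parameterization the `any` coefficient is the jump associated with being a non-zero case over and above the per-unit slope, so the expected difference between a one-home owner and a non-owner is `any + amount`. The two coefficients are only separately estimable when the non-zero cases vary in their amounts.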
These are just two scenarios that come to mind. Also, it’s never clear cut. Number of homes could itself be a proxy for another variable, one whose levels it cannot detect beneath a threshold. If your interest is in the relation between the outcome and the true variable that number of homes stands in for, you have another example of a censored predictor.
In all, having a predictor with many zeroes invites you to think about why that might be happening, and what you might want to do about it. The easiest viewpoint to take is that regression places no assumptions on predictor distributions, so you can simply proceed with the predictor as is.