# categorizing a variable turns it from insignificant to significant

I have a numeric variable that is not significant in a multivariable logistic regression model.
However, when I categorize it into groups, it suddenly becomes significant.
This is very counter-intuitive to me: when we categorize a variable, we give up some information.

How can this be?

One possible explanation is a nonlinear relationship between your outcome and the predictor.

Here is a little example. We use a predictor that is uniform on $[-1,1]$. The outcome, however, does not depend linearly on the predictor, but on the square of the predictor: TRUE is more likely for both $x\approx-1$ and $x\approx 1$, but less likely for $x\approx 0$. In this case, a linear model will come up insignificant, but cutting the predictor into intervals makes it significant.

> set.seed(1)
> nn <- 1e3
> xx <- runif(nn,-1,1)
> yy <- runif(nn)<1/(1+exp(-xx^2))
>
> library(lmtest)
>
> model_0 <- glm(yy~1,family="binomial")
> model_1 <- glm(yy~xx,family="binomial")
> lrtest(model_1,model_0)
Likelihood ratio test

Model 1: yy ~ xx
Model 2: yy ~ 1
#Df  LogLik Df  Chisq Pr(>Chisq)
1   2 -676.72
2   1 -677.22 -1 0.9914     0.3194
>
> xx_cut <- cut(xx,c(-1,-0.3,0.3,1))
> model_2 <- glm(yy~xx_cut,family="binomial")
> lrtest(model_2,model_0)
Likelihood ratio test

Model 1: yy ~ xx_cut
Model 2: yy ~ 1
#Df  LogLik Df  Chisq Pr(>Chisq)
1   3 -673.65
2   1 -677.22 -2 7.1362    0.02821 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


However, this does not mean that discretizing the predictor is the best approach. (It almost never is.) It is much better to model the nonlinearity using splines or similar.
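
As a sketch, continuing the simulation above: a natural spline basis from the `splines` package (shipped with base R) lets the logistic model capture the U-shaped relationship without throwing information away, and the same likelihood ratio test applies. The choice of `df = 3` here is an illustrative assumption, not a tuned value.

> library(splines)
> model_3 <- glm(yy ~ ns(xx, df = 3), family = "binomial")
> lrtest(model_3, model_0)

Unlike `cut()`, the spline fit does not depend on arbitrary interval boundaries, and its fitted probabilities vary smoothly in `xx`.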