# Logistic regression: what happens to the coefficients when we switch the labels (0/1) of the binary outcome

How should I interpret the coefficients of a logistic regression? To be more specific, I have a set of independent variables and one dependent variable (let it be “rain” or “no rain”, expressed as 1 and 0 respectively).

I build my logistic regression model and I want to get some insight into the relations between my inputs and outputs, and to see which variables are the most influential in the model. To do this I resort to the model coefficients:

| Variable | Coeff    | P-Value      |
|----------|----------|--------------|
| x1_0     | 0.63914  | 1.27e-11 *** |
| X2_0     | 0.59451  | 2e-16 ***    |
| X3_0     | -0.38567 | 1.16e-08 *** |
| X4_0     | -0.58933 | 6.23e-05 *** |
| X5_0     | -0.01629 | 0.775        |


My question now is: do these coefficients refer to the “rain” or to the “no rain” in my output?
In the book “Practical Data Science with R” in chapter 7, it says: “Negative coefficients that are statistically significant correspond to variables that are negatively correlated to the odds (and hence to the probability) of a positive outcome (the baby being at risk). Positive coefficients that are statistically significant are positively correlated to the odds of a positive outcome.”

Does the positive outcome here refer to the “rain” in my output variable?

With a logistic regression model one models the probability of occurrence of a binary event, in your case the probability that it rains. Like any other model, yours has to make assumptions. One of them is that this probability of rain depends on five explanatory variables; for ease of notation I will call them $x_i, i=1,2, \dots 5$. Furthermore, your model assumes that the probability of rain, given values of the $x_i$ (notation: $P(rain=true|_{x_i})$), has a particular $S$-shaped functional form, namely

$P(rain=true|_{x_i})=\frac{1}{1+e^{-(\beta_0+\sum_i \beta_i x_i)}}$.

After some manipulations we can transform this to $\ln \left( \frac{P(rain=true|_{x_i})}{1-P(rain=true|_{x_i})} \right)=\beta_0+\sum_i \beta_i x_i$. ($\ln$ is the natural logarithm)
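The “manipulations” amount to inverting the logistic function. As a minimal numerical sketch in R (the values of `beta0` and `beta1` are hypothetical and chosen only for illustration, they are not taken from your model), one can check that the log-odds recover the linear predictor. `plogis()` and `qlogis()` are base R's logistic function and its inverse.

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -1
beta1 <- 2
x <- c(-2, -0.5, 0, 0.5, 2)

eta <- beta0 + beta1 * x       # linear predictor
p   <- 1 / (1 + exp(-eta))     # P(rain = true | x), the S-shaped curve

# Taking the log of the odds gives back the linear predictor
log_odds <- log(p / (1 - p))
all.equal(log_odds, eta)

# Base R equivalents: plogis() is the logistic function, qlogis() the logit
all.equal(p, plogis(eta))
all.equal(log_odds, qlogis(p))
```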

If $\pi$ is the probability of occurrence of an event, then $\frac{\pi}{1-\pi}$ is the odds of the event. For example, if you bet on a fair coin and you win when heads turns up, then, as $\pi=0.5$, the odds of winning are $\frac{0.5}{1-0.5}=1$: you are as likely to win the bet as to lose it. If you make a bet and you win when a die turns up ‘1’, then the odds are $\frac{\frac{1}{6}}{1-\frac{1}{6}}=\frac{1}{5}$: you are five times as likely to lose the bet as to win it.
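The two betting examples can be reproduced with a small helper. This is only a sketch; `odds()` and `prob_from_odds()` are names introduced here for illustration, not base R functions.

```r
# Odds of an event that occurs with probability p
odds <- function(p) p / (1 - p)

# Inverse: probability of an event with the given odds
prob_from_odds <- function(o) o / (1 + o)

odds(0.5)      # fair coin: odds of 1
odds(1 / 6)    # die showing '1': odds of 1/5

prob_from_odds(odds(1 / 6))   # back to the probability 1/6
```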

From the above it follows that a logistic regression model assumes (a) that the probability of rain is a function of the $x_i$ and (b) that the log of the odds of rain against no rain is linear in the $x_i$.

(Note: further assumptions must be made for estimating the coefficients (e.g. independence of observations)).

Your coefficient of $x_1$ (`x1_0` in your table) is (rounded) 0.64, meaning that if $x_1$ increases by one unit then, all other things equal, the log of the odds of rain against no rain will increase by $0.64$.

If the log of the odds increases by $0.64$ then, all other things equal, the odds are multiplied by $e^{0.64} \approx 1.90$ (for each unit increase in $x_1$).
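The same conversion can be applied to all the coefficients at once. The sketch below simply copies the estimates from the table in the question into a named vector; with a fitted `glm` object one would typically write `exp(coef(model))` instead, where `model` stands for whatever your fitted object is called.

```r
# Coefficients copied from the table in the question
coefs <- c(x1_0 = 0.63914, X2_0 = 0.59451, X3_0 = -0.38567,
           X4_0 = -0.58933, X5_0 = -0.01629)

# Multiplicative change in the odds of rain per unit increase in each variable
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

Positive coefficients give factors above 1 (the odds of rain go up), negative coefficients give factors below 1 (the odds of rain go down).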

It may be worth noting, with respect to your ‘most influential variables’: in the paragraph above I said ‘change in log-odds for one unit change in $x_i$’. This is important if you want to analyse the ‘most influential variables’, because variables can be expressed in different units. Suppose the coefficient of $x_1$ is 1 and $x_1$ is in kilometres, while the coefficient of $x_2$ is 0.1 with $x_2$ in metres. Then (a) a unit change in $x_1$ (1 km) changes the log-odds by 1 and (b) a unit change in $x_2$ (1 m) changes the log-odds by 0.1. But a change of 1 km in $x_2$ is a change of 1000 units and so changes the log-odds by 100: measured over the same distance, the variable with the smaller coefficient has by far the larger effect.

So in order to assess the impact of variables, looking at the magnitude of the coefficients alone is not sufficient; you should take the units of the variables into account (or use standardised variables).
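The effect of units is easy to see on simulated data. The following is a sketch under assumed values: the variable names, the sample size and the ‘true’ coefficients (-1 and 1.5) are all made up for illustration. The same distance is entered once in kilometres and once in metres.

```r
set.seed(1)
n    <- 2000
x_km <- runif(n, min = 0, max = 2)   # hypothetical distance in kilometres
x_m  <- 1000 * x_km                  # the same distance in metres
y    <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * x_km))

fit_km <- glm(y ~ x_km, family = binomial)
fit_m  <- glm(y ~ x_m,  family = binomial)

# The slope per metre is the slope per kilometre divided by 1000
slope_km <- unname(coef(fit_km)[2])
slope_m  <- unname(coef(fit_m)[2])
c(per_km = slope_km, per_m = slope_m, per_m_times_1000 = 1000 * slope_m)

# After standardising (mean 0, sd 1) the units no longer matter
slope_z_km <- unname(coef(glm(y ~ scale(x_km), family = binomial))[2])
slope_z_m  <- unname(coef(glm(y ~ scale(x_m),  family = binomial))[2])
c(standardised_km = slope_z_km, standardised_m = slope_z_m)
```

The two fits describe exactly the same model; only the scale on which the coefficient is reported differs.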

EDIT: I added this after you asked the question in your comment: “what happens if I predict the odds of no rain instead of the odds of rain?”

Obviously it holds that $P(rain=FALSE|_{x_i})=1-P(rain=true|_{x_i})$. So the log-odds of no rain against rain is $\ln \left( \frac{P(rain=FALSE|_{x_i})}{1-P(rain=FALSE|_{x_i})} \right) =\ln \left( \frac{1-P(rain=true|_{x_i})}{P(rain=true|_{x_i})} \right)=-\ln \left( \frac{P(rain=true|_{x_i})}{1-P(rain=true|_{x_i})} \right)$.

(note that $\ln \left( \frac{1}{x} \right)=-\ln(x)$).

So we find that the log-odds of ‘no rain’ against ‘rain’ is $\ln \left( \frac{P(rain=FALSE|_{x_i})}{1-P(rain=FALSE|_{x_i})} \right) = -(\beta_0+\sum_i \beta_i x_i)$. In words: the sign of every coefficient, including the intercept, changes, while the magnitudes stay the same.

As @Scortchi has indicated in the comments, one would have serious doubts about logistic regression if simply switching the class labels yielded a completely different result.

The code below illustrates the ‘sign switch’ :

    # Generate some data: outcome is binary and x is the explanatory variable
    #   step 1: generate success probabilities for Bernoulli variables
    set.seed(1)
    x <- runif(n = 5000, min = -2, max = 2)
    p <- 1 / (1 + exp(-(2 * x - 1)))
    #   step 2: generate binary outcome with these probabilities
    outcome <- (runif(n = 5000, min = 0, max = 1) <= p)

    # Estimate the logit model: binary outcome with x as explanatory variable
    glm.lr1 <- glm(outcome ~ x + 1, family = binomial)
    coef(glm.lr1)

    # Estimate the logit model with 'outcome' SWITCHED ('!' in front of it)
    glm.lr2 <- glm(!outcome ~ x + 1, family = binomial)
    coef(glm.lr2)


Note on the results: given the way we generated the data, the intercept should be close to -1 and the coefficient of x close to 2 in the first case, with both signs reversed in the second case.
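To check the sign switch without comparing printed numbers by eye, the two fits can be compared programmatically. This is a self-contained sketch that regenerates the same simulated data; the object names `no_rain`, `fit_rain` and `fit_no_rain` are introduced here for illustration.

```r
# Same simulated data as above
set.seed(1)
x <- runif(n = 5000, min = -2, max = 2)
p <- 1 / (1 + exp(-(2 * x - 1)))
outcome <- runif(n = 5000, min = 0, max = 1) <= p
no_rain <- !outcome                      # labels switched

fit_rain    <- glm(outcome ~ x, family = binomial)
fit_no_rain <- glm(no_rain ~ x, family = binomial)

# Coefficients are equal in magnitude and opposite in sign
all.equal(coef(fit_rain), -coef(fit_no_rain))

# Fitted probabilities are complements: P(no rain) = 1 - P(rain)
all.equal(fitted(fit_rain), 1 - fitted(fit_no_rain))
```

Both comparisons should hold up to numerical tolerance, so the two models make identical predictions and differ only in which label is treated as the ‘positive’ outcome.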