Logistic regression: class probabilities

I am using logistic regression to solve a classification problem.

g = glm(target ~ ., data=trainData, family = binomial("logit"))

There are two classes (target): 0 and 1

When I run the prediction function, it returns probabilities.

p = predict(g, testData, type = "response")

However, it is not clear to me how to tell which class has been assigned.

Real  p 

1   0.17568578
1   0.41698474
1   0.19151927
1   0.25587242
1   0.25604452
0   0.39976069
0   0.39910282
0   0.16879320

I would appreciate it if someone could explain how this works, based on the above example. Thanks.

Answer

The predicted values only tell you how likely it is that an observation belongs to the class coded as 1, given its explanatory variables. For classification, you need to choose a threshold t that is in some sense optimal for your problem: observations with a predicted probability above t are assigned to class 1, the rest to class 0. The choice of t is affected by, e.g., monetary costs or ethical constraints.
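Applied to the eight rows from the question, thresholding looks like this. A minimal sketch, assuming a cutoff of 0.5, which is only a common default and not an optimal choice (the variable names are illustrative):

```r
# The eight observations shown in the question.
real <- c(1, 1, 1, 1, 1, 0, 0, 0)
p <- c(0.17568578, 0.41698474, 0.19151927, 0.25587242,
       0.25604452, 0.39976069, 0.39910282, 0.16879320)

# Assumed cutoff; 0.5 is a common default, not an optimal choice.
threshold <- 0.5

# Assign class 1 when the predicted probability exceeds the cutoff.
predicted <- ifelse(p > threshold, 1, 0)

# Cross-tabulate the true classes against the assigned classes.
table(Real = real, Predicted = predicted)
```

Every probability here is below 0.5, so all eight observations are assigned class 0, including the five whose true class is 1. This is why the choice of t matters.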

If you don’t have any such costs or constraints, i.e. no cost function, one criterion could be to minimize the sum of the error frequencies (the false negative rate plus the false positive rate). For this, the following two terms are important:

Sensitivity denotes the fraction of positives that were correctly classified for a given t.

Specificity denotes the fraction of negatives that were correctly classified for a given t.

Denoting sensitivity by s_0 and specificity by s_1, minimizing the sum of the error frequencies is equivalent to maximizing s_0(t) + s_1(t) over all thresholds t, since the two error frequencies are 1 - s_0(t) and 1 - s_1(t).
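To see the criterion at work without any package, here is a minimal base R sketch. The data are hypothetical toy values, and sens_spec is an illustrative helper, not a function from any library:

```r
# Hypothetical toy data: true classes and predicted probabilities of class 1.
real <- c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1)
prob <- c(0.10, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90)

# Sensitivity and specificity when observations with prob > thr get class 1.
sens_spec <- function(thr, real, prob) {
  pred <- as.integer(prob > thr)
  c(sensitivity = sum(pred == 1 & real == 1) / sum(real == 1),
    specificity = sum(pred == 0 & real == 0) / sum(real == 0))
}

# Candidate thresholds: midpoints between consecutive sorted probabilities.
s <- sort(unique(prob))
thresholds <- (head(s, -1) + tail(s, -1)) / 2

# One row per threshold, with columns "sensitivity" and "specificity".
res <- t(sapply(thresholds, sens_spec, real = real, prob = prob))

# Pick the threshold that maximizes s_0(t) + s_1(t).
total <- res[, "sensitivity"] + res[, "specificity"]
best_t <- thresholds[which.max(total)]
best_t
```

With these toy values the maximum sum of 1.8 (sensitivity 1, specificity 0.8) is reached at a threshold of 0.425.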

Here, I recommend using the pROC package in R. It provides a very useful function called roc. See the sample code below, where response is your vector of ones and zeros and predictor is your vector of predicted probabilities. The code also produces the corresponding ROC curve and adds a vertical line where the optimal threshold was found.

Please note: you provided very little data, so I simulated some myself to get a wider range of sensitivities and specificities. You can find the code for the simulated data below the picture.

# Run the data simulation at the end of this answer first, so that p exists.

# Install (once) and load the package
# install.packages("pROC")
library(pROC)

#apply roc function
analysis <- roc(response=p$Real, predictor=p$p)

# Find the threshold t that maximizes sensitivity + specificity
e <- cbind(analysis$thresholds, analysis$sensitivities + analysis$specificities)
best <- which(e[, 2] == max(e[, 2]))
opt_t <- e[best, 1]

# Plot the ROC curve
plot(1 - analysis$specificities, analysis$sensitivities, type = "l",
     ylab = "Sensitivity", xlab = "1 - Specificity", col = "black", lwd = 2,
     main = "ROC Curve for Simulated Data")
abline(a = 0, b = 1)
# Mark the optimal t on the curve; the x-axis is 1 - specificity, not t itself
abline(v = 1 - analysis$specificities[best])
opt_t # print t
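pROC can also report this threshold directly through its coords function. A minimal sketch, assuming a recent pROC version and using hypothetical simulated scores (y, score and fit are illustrative names, not the data from this answer):

```r
library(pROC)

# Hypothetical simulated data: true classes and predicted probabilities.
set.seed(1)
y <- rbinom(200, size = 1, prob = 0.5)
score <- plogis(rnorm(200, mean = ifelse(y == 1, 1, -1)))

# levels and direction are set explicitly: 0 = control, 1 = case,
# and higher scores indicate the case class.
fit <- roc(response = y, predictor = score, levels = c(0, 1), direction = "<")

# "best" with the Youden method picks the threshold that maximizes
# sensitivity + specificity.
best <- coords(fit, x = "best", best.method = "youden",
               ret = c("threshold", "sensitivity", "specificity"))
best
```

This should agree with the manual search over analysis$thresholds, since maximizing Youden's index (sensitivity + specificity - 1) is the same as maximizing the sum.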

[Figure: ROC curve for the simulated data, with a vertical line marking the optimal threshold]

## Simulate data
set.seed(123456)
n <- 10000
q <- 0.8

# Simulate the response (true classes)
Real <- c(sample(c(0, 1), n / 2, replace = TRUE, prob = c(1 - q, q)),
          sample(c(0, 1), n / 2, replace = TRUE, prob = c(0.7, 0.3)))

# Simulate the predictions (predicted probabilities)
p <- c(rep(seq(0.4, 0.9, length.out = 100), 50),
       rep(seq(0.2, 0.6, length.out = 100), 50))
p <- data.frame(Real = Real, p = p)

Attribution
Source: Link, Question Author: Klausos Klausos, Answer Author: random_guy
