I am using logistic regression to solve a classification problem.
g = glm(target ~ ., data=trainData, family = binomial("logit"))
There are two classes (target): 0 and 1
When I run the prediction function, it returns probabilities.
p = predict(g, testData, type = "response")
However, it is not clear to me how to tell which class has been assigned.
Real  p
1     0.17568578
1     0.41698474
1     0.19151927
1     0.25587242
1     0.25604452
0     0.39976069
0     0.39910282
0     0.16879320
I would appreciate it if someone could explain how this works based on the above example. Thanks.
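The predicted values only tell you how likely it is that an observation belongs to the class coded as 1, given its explanatory variables. No class has been assigned yet: for classification, you need to choose a threshold t and assign class 1 to every observation whose predicted probability exceeds t (and class 0 otherwise). The threshold should be optimal in some sense for your problem; the choice is affected by, e.g., monetary costs or ethical boundaries.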
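As a minimal sketch of the thresholding step, here are the eight values from the question turned into class labels. The names `real`, `threshold`, and `predicted_class` are only illustrative, and 0.5 is just a commonly used default assumed for this example, not a recommendation.

```r
# Observed classes and predicted probabilities copied from the question
real <- c(1, 1, 1, 1, 1, 0, 0, 0)
p <- c(0.17568578, 0.41698474, 0.19151927, 0.25587242,
       0.25604452, 0.39976069, 0.39910282, 0.16879320)

# Assumed threshold: 0.5 is only a common default
threshold <- 0.5

# Assign class 1 when the predicted probability exceeds the threshold
predicted_class <- ifelse(p > threshold, 1, 0)

# Cross-tabulate observed against predicted classes
table(Real = real, Predicted = predicted_class)
```

All eight probabilities are below 0.5, so with this threshold every observation is assigned class 0. A different threshold gives a different assignment, which is why the choice of t matters.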
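If you don't have any of these costs or boundaries, i.e. there is no explicit cost function, one criterion could be to minimize the sum of the error frequencies (the fraction of misclassified positives plus the fraction of misclassified negatives). For this, the following two terms are important: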
Sensitivity denotes the fraction of positives that were correctly classified for a given t.
Specificity denotes the fraction of negatives that were correctly classified for a given t.
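Denote sensitivity by s_0 and specificity by s_1. The two error frequencies are then 1 - s_0(t) and 1 - s_1(t), so their sum is 2 - (s_0(t) + s_1(t)). Minimizing the sum of the error frequencies is therefore equivalent to finding the threshold t that maximizes s_0(t) + s_1(t).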
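To make the criterion concrete, here is a small base R sketch that computes both quantities for each candidate threshold and picks the one with the largest sum. The helper `sens_spec` and the toy data are invented for illustration only. Using the midpoints between neighbouring predictions as candidate thresholds is an assumption of this sketch.

```r
# Sensitivity and specificity at threshold t, classifying as 1 when p > t
sens_spec <- function(real, p, t) {
  pred <- as.integer(p > t)
  c(sensitivity = sum(pred == 1 & real == 1) / sum(real == 1),
    specificity = sum(pred == 0 & real == 0) / sum(real == 0))
}

# Hypothetical toy data, invented purely for illustration
real <- c(0, 0, 1, 0, 0, 1, 1, 1)
p    <- c(0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9)

# Candidate thresholds: midpoints between neighbouring sorted predictions
sorted_p <- sort(unique(p))
candidates <- (head(sorted_p, -1) + tail(sorted_p, -1)) / 2

# Sum of sensitivity and specificity for every candidate
scores <- sapply(candidates, function(t) sum(sens_spec(real, p, t)))

# Threshold(s) with the largest sensitivity + specificity
best_t <- candidates[scores == max(scores)]
best_t  # 0.65 for this toy data
```

At t = 0.65 the toy data give a sensitivity of 0.75 and a specificity of 1. With real data, ties are possible, in which case `best_t` contains more than one value.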
Here, I recommend using the pROC package in R. It provides a very useful function called roc. See the sample code below. Here, response is your vector of ones and zeros and predictor is your vector of predictions. Moreover, the code also produces the corresponding ROC curve and adds a vertical line where the optimal threshold was found.
Please note: You provided very little data, so I simulated some myself to get a wider range of sensitivities and specificities. You can find the code for the simulated data after the ROC code.
# Install (only needed once) and load the package
install.packages("pROC")
library(pROC)

# Run the simulation code further down first: it creates the data frame p
# Apply the roc function
analysis <- roc(response = p$Real, predictor = p$p)

# Find the threshold t that maximizes sensitivity + specificity
e <- cbind(analysis$thresholds, analysis$sensitivities + analysis$specificities)
best <- which(e[, 2] == max(e[, 2]))
opt_t <- e[best, 1]

# Plot the ROC curve
plot(1 - analysis$specificities, analysis$sensitivities, type = "l",
     ylab = "Sensitivity", xlab = "1-Specificity", col = "black", lwd = 2,
     main = "ROC Curve for Simulated Data")
abline(a = 0, b = 1)
# Mark the optimal t; the x-axis is 1 - specificity, so use that value at opt_t
abline(v = 1 - analysis$specificities[best])
opt_t # print t
## Simulate data
set.seed(123456)
n <- 10000
q <- 0.8

# Simulate the response (true classes)
Real <- c(sample(c(0, 1), n/2, replace = TRUE, prob = c(1 - q, q)),
          sample(c(0, 1), n/2, replace = TRUE, prob = c(0.7, 0.3)))

# Simulate the predictions (predicted probabilities)
p <- c(rep(seq(0.4, 0.9, length.out = 100), 50),
       rep(seq(0.2, 0.6, length.out = 100), 50))
p <- data.frame(cbind(Real, p))
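As a possible shortcut, pROC can also report this threshold directly through its coords function, where best.method = "youden" selects the point that maximizes sensitivity + specificity. The sketch below assumes a reasonably recent pROC version is installed. The data and the names `real` and `score` are invented for illustration.

```r
library(pROC)  # assumes the pROC package is already installed

# Hypothetical simulated data, invented purely for illustration
set.seed(1)
real <- rbinom(200, size = 1, prob = 0.5)
score <- plogis(rnorm(200, mean = ifelse(real == 1, 1, -1)))

# State the class order and direction explicitly so roc() does not guess them
analysis <- roc(response = real, predictor = score,
                levels = c(0, 1), direction = "<", quiet = TRUE)

# "best" with best.method = "youden" maximizes sensitivity + specificity
best <- coords(analysis, x = "best", best.method = "youden",
               ret = c("threshold", "sensitivity", "specificity"),
               transpose = FALSE)
best
```

The result holds the selected threshold together with the sensitivity and specificity it achieves, and it should agree with the manual search over analysis$thresholds.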