(Step 1) Using my predictive model, I predicted 1000 scores for my sample dataset.
(Step 2) I then calculated scores in the same way for a randomized dataset, and fitted the distribution of these random scores.
(Step 3) For each of my 1000 predicted scores (from Step 1), I calculated the p-value of obtaining a score larger than the predicted score from the random distribution. This gives 1000 p-values for my sample dataset.
(Step 4) Since the real classification is known, I looked at the enrichment of true positives and found that filtering the sample dataset by p-value < 0.05 gives the best true-positive enrichment; this keeps about 150 data points from my sample dataset.
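In R, the four steps look roughly like this (a minimal sketch with made-up parameters; my real scores of course come from the predictive model, not from rnorm):

```r
# Minimal sketch of Steps 1-4 (made-up parameters; the real scores come
# from the predictive model)
set.seed(1)
predicted <- rnorm(1000, mean = 1.5, sd = 1)   # Step 1: 1000 predicted scores
random    <- rnorm(1000, mean = 1.0, sd = 1)   # Step 2: scores on a randomized dataset
# Steps 2-3: fit a normal distribution to the random scores and compute, for
# each predicted score, the p-value of drawing a larger score at random
p <- 1 - pnorm(predicted, mean = mean(random), sd = sd(random))
# Step 4: filter by p < 0.05
kept <- p < 0.05
sum(kept)   # on the scale of the "about 150 data points", depending on the parameters
```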
I then want to test the predictive power of my model using the AUC of the ROC curve (sensitivity vs. 1 - specificity plot).
However, I am now facing a problem: should I include all 1000 data points in the ROC plot to get the AUC, or only the 150 points with p < 0.05?
By p < 0.05 I mean the p-value of obtaining a score higher than my predicted score by chance. Generally speaking, does a cutoff of p < 0.5 mean that 50% of my data are obtained by chance?
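Here is how I checked what such a cutoff does under pure chance (a small sketch; the mean = 1, sd = 1 parameters are made up): if the scores come from the random distribution itself, the p-values are uniform on [0, 1], so a cutoff of p < 0.05 keeps about 5% of null points and p < 0.5 about 50%.

```r
# Under the null (scores drawn from the random distribution itself) the
# p-values are uniform, so a cutoff keeps about that fraction of null points
set.seed(7)
null_scores <- rnorm(1e5, mean = 1, sd = 1)
p_null <- 1 - pnorm(null_scores, mean = 1, sd = 1)
mean(p_null < 0.05)   # about 0.05
mean(p_null < 0.5)    # about 0.5
```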
Thanks for the comments from @AlefSin, @steffen and @Frank Harrell.
To make the discussion easier, I have prepared a sample dataset (x) with four columns:
- my model's predicted scores (assumed normally distributed with mean = 1, sd = 1)
- a random set (also with mean = 1, sd = 1)
- the probability for each predicted score
- the class labels, as listed below
x <- data.frame(predict_score = rnorm(50, mean = 1, sd = 1))
x$random <- rnorm(50, mean = 1, sd = 1)
x$probability <- pnorm(x$predict_score, mean = mean(x$random), sd = sd(x$random))
x$class <- c(1,1,1,1,2,1,2,1,2,2,1,1,1,1,2,1,2,1,2,2,1,1,1,1,2,1,2,1,2,2,1,1,2,1,2,1,2,2,1,1,1,1,1,1,2,2,2,2,1,1)
I then computed the AUC for all data points as follows:
library(caTools)
colAUC(x$predict_score, x$class, plotROC = TRUE, alg = c("Wilcoxon", "ROC"))
        [,1]
1 vs. 2  0.6
Suppose the enrichment of true positives is higher when I filter the dataset by p < 0.5 (your runs may differ from mine, as rnorm gives different results every time). I then computed the AUC for the subset of the data as follows:
b <- subset(x, x$probability < 0.5)
colAUC(b$predict_score, b$class, plotROC = TRUE, alg = c("Wilcoxon", "ROC"))
             [,1]
1 vs. 2 0.7401961
My question is: when doing an AUC analysis, must the analysis be done on the whole dataset, or should the dataset first be filtered, based on the enrichment of true positives or any other criterion, before computing the AUC?
ROC analysis answers the following question (in short): is your predictor different in the two groups?
You are confident that the subset obtained at step 4 is highly enriched in true positives. However, this result won’t answer the question “If I submit a new data point, is my predictive method going to get it right?”, which is the typical question in ROC analysis. Instead, it will tell you “If this has a p < 0.05, can I be highly confident it is a positive?”. That is not the point of ROC analysis. While this type of analysis might be relevant in your case (please note I said might; I have no idea whether it actually is here), it is clearly not a standard ROC analysis, and you would have to make this clear.
To convince yourself that your procedure is not correct, repeat the procedure you showed above (generating a random dataset and computing the AUCs) multiple times. Since the data are random, you would expect to obtain an AUC of 0.5 on average, right? Is that what you get? I bet not!
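For instance, a sketch of that check (my own code, not yours; I use a hand-rolled rank-based AUC, and the cutoff and direction that make the subset look best are chosen with the labels in hand, to mimic your step 4):

```r
# Sketch: repeat filter-then-AUC on pure noise (labels independent of scores)
# Hand-rolled Mann-Whitney AUC of class-2 scores vs class-1 scores
auc <- function(score, class) {
  n1 <- sum(class == 1); n2 <- sum(class == 2)
  (sum(rank(score)[class == 2]) - n2 * (n2 + 1) / 2) / (n1 * n2)
}
set.seed(123)
subset_auc <- replicate(500, {
  score <- rnorm(50)                        # random "predictions"
  prob  <- pnorm(score)                     # probability column, as in the example
  class <- sample(1:2, 50, replace = TRUE)  # labels independent of the scores
  best <- 0.5
  for (cut in c(0.25, 0.5, 0.75)) {         # try several cutoffs, keep the
    keep <- prob < cut                      # one whose subset looks best
    if (any(class[keep] == 1) && any(class[keep] == 2)) {
      a <- auc(score[keep], class[keep])
      best <- max(best, a, 1 - a)           # "best-looking" direction, too
    }
  }
  best
})
mean(subset_auc)   # above 0.5, even though the scores carry no information
```

Because the cutoff and direction are chosen after looking at the labels, the subset AUCs sit above 0.5 even though the scores carry no information at all.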
A few more comments:
- If you are especially interested in the positives, and you want to ensure that all your positive observations are predicted in the positive group, you may be interested in the partial area under the ROC curve that focuses on the high specificity region. See this paper by McClish.
- If you are interested in the enrichment of true positives, you want to check the positive predictive value rather than the specificity / ROC.
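To illustrate that last bullet (my own sketch; I take class 1 as the positive class, the filter "probability < 0.5" as the positive call, and deterministic stand-in scores with the class labels from your example):

```r
# Sketch of the PPV idea: the enrichment of true positives among the
# filtered calls is the positive predictive value, PPV = TP / (TP + FP)
score <- seq(-2, 2, length.out = 50)   # deterministic stand-in scores
prob  <- pnorm(score)                  # probability column, as in the example
class <- c(1,1,1,1,2,1,2,1,2,2,1,1,1,1,2,1,2,1,2,2,1,1,1,1,2,
           1,2,1,2,2,1,1,2,1,2,1,2,2,1,1,1,1,1,1,2,2,2,2,1,1)
called <- prob < 0.5                   # the filter = "called positive"
tp <- sum(called & class == 1)         # true positives among the calls
fp <- sum(called & class == 2)         # false positives among the calls
ppv <- tp / (tp + fp)
ppv                                    # 16 / 25 = 0.64
```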