Significance test based on precision/recall/F1

Is it possible to do a significance test based solely on precision/recall/F1 scores?

For example, if a paper reports only P/R/F1 for two systems (evaluated on the same dataset, etc.), can you perform a statistical significance test? If so, how is that done?


Intuitively, getting a high P/R/F1 on a small dataset, or on a very uniform and predictable one, is probably easier than getting a high P/R/F1 on a larger or more chaotic dataset. Therefore, an improvement in P/R/F1 on a larger and more chaotic dataset carries more weight.

Following this intuition, you would probably need access to the per-example outputs of the "black-box" methods in order to measure the difference in the distribution of results, while taking into account the size and variety of the dataset. The P/R/F1 scores alone carry too little information.

Significance testing in this setting is usually done by forming a null hypothesis (the two algorithms always produce the same output) and then calculating the probability of observing a difference at least as large as the one you observed if the algorithms were indeed the same. If that probability is below some threshold, for example .05, you reject the null hypothesis and conclude that the improvement is significant.
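To make this concrete, here is a minimal sketch of one common way to carry out such a test when you do have the per-example outputs: an approximate randomization (permutation) test on the F1 difference. All names are illustrative, and binary labels are assumed for simplicity; the source does not prescribe this particular test.

```python
import random

def f1(gold, pred):
    """F1 score for binary labels (1 = positive class)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def permutation_test(gold, pred_a, pred_b, trials=10000, seed=0):
    """Approximate p-value for H0: the two systems are interchangeable.

    On each trial, swap the two systems' predictions on a random subset
    of examples and check how often the resulting |F1 difference| is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(f1(gold, pred_a) - f1(gold, pred_b))
    extreme = 0
    for _ in range(trials):
        a, b = [], []
        for pa, pb in zip(pred_a, pred_b):
            if rng.random() < 0.5:  # swap this example's predictions
                pa, pb = pb, pa
            a.append(pa)
            b.append(pb)
        if abs(f1(gold, a) - f1(gold, b)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (trials + 1)
```

A small p-value means the observed F1 gap is unlikely under the null hypothesis that the two systems are interchangeable. Note that this requires the systems' outputs, not just their summary scores, which is exactly the limitation discussed above.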

This paper has relevant discussions:

Source: Link, Question Author: Vam, Answer Author: Pablo Mendes
