Tuesday, February 13, 2007

The Relationship Between Precision-Recall and ROC Curves


At first I had only a vague idea of these two curves, though I kept seeing them in all kinds of papers. Now I have the chance to work out what they have in common and where they differ.

There are four terms in two-class problems: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Positive and negative refer to the label the classifier assigns; the modifier true/false says whether that label agrees with the actual class. So a true positive is a sample the classifier correctly calls positive, while a false positive is a negative sample mistakenly called positive. A classifier won't necessarily be a good one just because everything it declares positive really is positive: it may still mistake half of the positives for negatives, and sometimes a single miss is unacceptable. So no single ratio like this is sufficient to tell us how good a classifier is.

There are three ratios that are commonly used. The recall rate, or true positive rate (TPR), is TP / (TP + FN); the closer it is to 1, the fewer positives it misses (i.e., the fewer need to be "recalled"). Precision is TP / (TP + FP), the fraction of samples declared positive that really are positive; the higher it is, the fewer negatives it mistakes for positives. The false positive rate (FPR) is FP / (FP + TN).
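To make the definitions concrete, here is a minimal Python sketch (the counts are made up by me, not taken from the paper) that computes the three ratios from one confusion matrix:

def rates(tp, fp, tn, fn):
    """Return (recall/TPR, precision, FPR) for one confusion matrix."""
    recall = tp / (tp + fn)      # fraction of actual positives found
    precision = tp / (tp + fp)   # fraction of declared positives that are correct
    fpr = fp / (fp + tn)         # fraction of actual negatives called positive
    return recall, precision, fpr

# Hypothetical counts: 80 TP, 20 FN, 30 FP, 870 TN.
print(rates(tp=80, fp=30, tn=870, fn=20))  # (0.8, 0.727..., 0.033...)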

The so-called ROC (Receiver Operating Characteristic) curve plots TPR against FPR, while the PR (Precision-Recall) curve plots precision against recall. Each has its own domain of application: on a heavily skewed sample set, the PR curve is preferred, since the large number of negatives blunts the response of the FPR, as the toy example below shows.
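A toy numeric illustration of the blunting (my own numbers, not from the paper): multiply the negatives by ten while the classifier keeps the same error rates, and the ROC point does not move at all, but precision collapses.

def point(tp, fn, fp, tn):
    return dict(tpr=tp / (tp + fn),
                fpr=fp / (fp + tn),
                precision=tp / (tp + fp))

balanced = point(tp=80, fn=20, fp=100, tn=900)    # 100 pos vs 1,000 neg
skewed = point(tp=80, fn=20, fp=1000, tn=9000)    # 100 pos vs 10,000 neg

print(balanced)  # tpr=0.8, fpr=0.1, precision ~ 0.444
print(skewed)    # tpr=0.8, fpr=0.1, precision ~ 0.074, same ROC point!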

Usually, we construct a ROC curve from several confusion matrices obtained in experiments. Each confusion matrix gives one point in ROC space, and the desired curve is the convex hull of those points. There is a corresponding PR curve, but constructing it by straight-line interpolation in the same way leads to an overly-optimistic view.
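Here is a rough sketch of the hull step, assuming the experiments are already reduced to (FPR, TPR) pairs. In ROC space every point on a segment between two achievable points is itself achievable (by randomizing between the two classifiers), which is what justifies taking the hull:

def cross(a, b, c):
    """Cross product of (a->b) and (a->c); >= 0 means a non-convex turn here."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def roc_convex_hull(points):
    """Upper convex hull of ROC points, anchored at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop the last point while it falls on or below the segment
        # joining its neighbors, i.e. while it is dominated.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# (0.5, 0.6) and (0.8, 0.9) are dominated and drop out of the hull.
print(roc_convex_hull([(0.1, 0.4), (0.3, 0.7), (0.5, 0.6), (0.8, 0.9)]))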

The main theoretical results of the paper are several theorems, all stated for a fixed sample set: a PR point with nonzero recall and a confusion matrix mutually determine each other; a ROC point and a confusion matrix mutually determine each other; and one classifier's curve dominates another's in ROC space if and only if it also dominates it in PR space.
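The first two results are easy to check numerically. A small sketch with toy class sizes of my own choosing: fixing the numbers of positives and negatives, a ROC point pins down the whole confusion matrix, and so does a PR point as long as recall is nonzero.

def matrix_from_roc(tpr, fpr, n_pos, n_neg):
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp, fp, n_neg - fp, n_pos - tp   # TP, FP, TN, FN

def matrix_from_pr(recall, precision, n_pos, n_neg):
    tp = recall * n_pos
    fp = tp * (1 - precision) / precision   # undefined when recall == 0
    return tp, fp, n_neg - fp, n_pos - tp

print(matrix_from_roc(0.8, 0.1, n_pos=100, n_neg=1000))       # (80, 100, 900, 20)
print(matrix_from_pr(0.8, 80 / 180, n_pos=100, n_neg=1000))   # the same matrix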

The authors provide an interpolation trick: instead of drawing straight lines in PR space, interpolate linearly between the underlying confusion matrices, as one would along a ROC segment, and map each intermediate matrix into the PR diagram. In a word, the convex hull in ROC space can be transformed into what they call the achievable PR curve, the proper PR analog of the convex hull.
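A sketch of that interpolation, with hypothetical counts: between two PR points A and B, step through the intermediate TP counts, add FPs at the linear rate, and convert each intermediate matrix back into (recall, precision). Note how the intermediate precisions sag below the straight line between A and B, which is exactly the optimism that naive linear interpolation introduces.

def pr_interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Yield achievable (recall, precision) points between A and B."""
    slope = (fp_b - fp_a) / (tp_b - tp_a)   # FPs gained per extra TP
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + slope * x
        yield tp / total_pos, tp / (tp + fp)

# From A = (5 TP, 5 FP) to B = (10 TP, 30 FP), with 10 positives total:
for recall, precision in pr_interpolate(5, 5, 10, 30, total_pos=10):
    print(f"recall={recall:.2f}  precision={precision:.3f}")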

ROC curves and their analogs are useful for model selection, for they provide a means of comparison. This should be explored later. A reference given is "A Support Vector Method for Multivariate Performance Measures" in ICML 2005.
