1. Introduction
The correct selection of performance metrics is one of the key issues in evaluating classifier performance. A number of performance metrics have been proposed for different application scenarios. For example, accuracy, which measures the percentage of correctly classified test instances, remains the primary metric for assessing classifier performance (Ben et al. 2007; Huang et al. 2005); precision and recall are widely applied in information retrieval (Baeza-Yates 1999); and the medical decision-making community prefers the area under the receiver operating characteristic (ROC) curve (i.e., AUC) (Lasko et al. 2005). It is common for a classifier to perform well on one metric but poorly on others. For example, boosted trees and SVM classifiers achieve good classification accuracy while yielding poor root mean squared error (Caruana et al. 2004).
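To make the distinctions among these metrics concrete, the following sketch (illustrative only, not code from this study) derives accuracy, precision, recall, and F-measure from the counts of a binary confusion matrix; all function names are our own.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    # Fraction of all test instances classified correctly.
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    # Fraction of predicted positives that are truly positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Fraction of true positives that are recovered.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(prec, rec):
    # Harmonic mean of precision and recall.
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
```

For instance, with true labels `[1, 1, 0, 0, 1]` and predictions `[1, 0, 0, 1, 1]`, the counts are (TP, FP, TN, FN) = (2, 1, 1, 1), giving accuracy 0.6 and precision, recall, and F-measure each 2/3; the metrics need not agree in general.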
In general, the widely accepted consensus is to choose performance metrics according to the practical requirements of the specific application. For example, neural networks typically optimize squared error, so root mean squared error reflects the actual performance of such a classifier better than other metrics. However, in some cases specific criteria are unknown in advance, and practitioners tend to select several measures from widely adopted ones, such as classification accuracy, the kappa statistic, F-measure, and AUC, when evaluating a new classifier (Sokolova et al. 2006; Sokolova et al. 2009). Additionally, most metrics are derived from the classifier's confusion matrix, so it is reasonable to expect that some of them are closely related, which may introduce redundancy in measuring classifier performance. On the other hand, it is difficult for practitioners to reach a concrete conclusion when two metrics give conflicting results.
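As a further illustration of how a metric follows from the confusion matrix, the sketch below computes Cohen's kappa statistic for the binary case, correcting observed agreement for agreement expected by chance (again an illustrative implementation of the standard formula, not code from this study):

```python
def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa for a binary confusion matrix:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected under chance."""
    n = tp + fp + tn + fn
    p_o = (tp + tn) / n                        # observed agreement (= accuracy)
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)  # chance agreement on the positive class
    p_no = ((fn + tn) / n) * ((fp + tn) / n)   # chance agreement on the negative class
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

Because kappa, accuracy, and F-measure all arise from the same four counts, high correlation among them on many datasets is plausible, which is precisely the redundancy question this study investigates.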
This study provides a strategy for selecting appropriate performance metrics for classifiers by using Pearson linear correlation and Spearman rank correlation to analyze the relationships among seven widely used performance metrics: accuracy, F-measure, kappa statistic, root mean squared error (RMSE), mean absolute error (MAE), AUC, and the area under the precision-recall (PR) curve (AUPRC). We first briefly describe these metrics. Based on their definitions in terms of the confusion matrix, we sketch their characteristic features and preliminarily divide them into three groups: threshold metrics, rank metrics, and probability metrics. We then use correlation analysis to measure the correlations among these metrics. The experimental results show that metrics from the same group are closely correlated but less correlated with metrics from other groups. Additionally, we compare the correlation changes caused by the size and class distribution of the datasets, which are the main factors affecting the measured values.
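The two correlation measures used in the analysis can be sketched as follows; this is a self-contained illustrative implementation of the standard Pearson and Spearman formulas (with average ranks for ties), not code from this study. The inputs would be two vectors of metric scores obtained from the same set of classifiers or datasets.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    """Assign ranks (1-based), averaging the ranks of tied values."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Pearson captures linear agreement between raw metric values, while Spearman captures agreement in the orderings the metrics induce; using both guards against conclusions that depend on the (often nonlinear) scale of a particular metric.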
The main contributions of this work are summarized as follows. First, we divide the seven performance metrics into three groups by analyzing their definitions. Experimental results confirm that metrics within the same group are highly correlated, while metrics from different groups are weakly correlated. Second, based on the experimental results, we provide practitioners with the following strategies for selecting performance metrics to evaluate a classifier. For balanced training datasets, one should select multiple metrics, with at least one from each group. For imbalanced training datasets, a classifier need not achieve optimal performance on all groups of metrics; as long as it meets the performance requirement of the application as measured by the relevant group(s) of metrics, we recommend adopting it despite less satisfactory performance on other groups.