Abstract

Background: To evaluate binary classifications and their confusion matrices, researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Although this is a crucial issue in machine learning, no widespread consensus has yet been reached on a single measure of choice. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics in binary classification tasks. However, these measures can produce dangerously overoptimistic, inflated results, especially on imbalanced datasets.

Results: The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate that produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the number of positive elements and to the number of negative elements in the dataset.

Conclusions: In this article, we show how MCC produces a more informative and truthful score than accuracy and F1 score in evaluating binary classifications, first by explaining its mathematical properties, and then by showing the advantages of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
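The contrast described in the abstract can be made concrete with a small sketch. The code below computes accuracy, F1 score, and MCC from the four confusion matrix counts using their standard textbook definitions. The function names and the example counts are illustrative assumptions, not taken from the article's code, and returning 0 when the MCC denominator is zero is one common convention rather than the only possible choice.

```python
import math


def accuracy(tp, fn, tn, fp):
    """Fraction of all elements that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def f1_score(tp, fn, tn, fp):
    """Harmonic mean of precision and recall; note that tn is never used."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient, in the range [-1, +1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumption: treat the undefined case as 0 (no correlation).
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced dataset: 95 positives, 5 negatives.
# The classifier labels almost everything as positive.
counts = dict(tp=90, fn=5, tn=1, fp=4)

print(f"accuracy = {accuracy(**counts):.2f}")
print(f"F1 score = {f1_score(**counts):.2f}")
print(f"MCC      = {mcc(**counts):.2f}")
```

With these assumed counts, accuracy is 0.91 and F1 score is about 0.95, while MCC is only about 0.14, because just one of the five negative elements was recognized correctly.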

Keywords

Binary classification; False positive paradox; Binary number; False positives and false negatives; Correlation; Confusion matrix; Artificial intelligence; Statistics; Pearson product-moment correlation coefficient; False positive rate; Confusion; Correlation coefficient; Matthews correlation coefficient; Computer science; Machine learning; Pattern recognition (psychology); Data mining; Mathematics; Support vector machine; Psychology; Arithmetic

MeSH Terms

Algorithms; Computational Biology; Correlation of Data; Data Interpretation, Statistical; Machine Learning

Publication Info

Year: 2020
Type: article
Volume: 21
Issue: 1
Pages: 6-6
Citations: 5067
Access: Closed

Citation Metrics

OpenAlex: 5067
Influential: 261

Cite This

Davide Chicco, Giuseppe Jurman (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 6-6. https://doi.org/10.1186/s12864-019-6413-7

Identifiers

DOI: 10.1186/s12864-019-6413-7
PMID: 31898477
PMCID: PMC6941312
