
Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers

Published online by Cambridge University Press: 27 November 2020

Nadine M. Neumann
Affiliation: Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil. E-mail: nadinemelloni@id.uff.br

Alexandre Plastino
Affiliation: Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil. E-mail: plastino@ic.uff.br

Jony A. Pinto Junior
Affiliation: Departamento de Estatística, Universidade Federal Fluminense, Niterói, RJ, Brazil. E-mail: jarrais@id.uff.br

Alex A. Freitas
Affiliation: School of Computing, University of Kent, Canterbury, Kent, UK. E-mail: a.a.freitas@kent.ac.uk

Abstract

Statistical significance analysis based on hypothesis tests is a common approach for comparing classifiers. However, many studies oversimplify this analysis by merely checking whether p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. The problem is serious enough that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it has been widely misused and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using Student's t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions and that evaluating the effect size and the power of the test contributes to more principled decision-making.
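To make the abstract's point concrete, the sketch below illustrates the kind of analysis the paper advocates: comparing two classifiers' per-dataset accuracies with a paired Student's t-test and a Wilcoxon signed-rank test, then complementing the p-value with an effect size (Cohen's d for paired samples) and the achieved power of the t-test. This is a minimal sketch, not the authors' experimental code; the accuracy values are synthetic, and the use of SciPy and statsmodels here is an illustrative assumption.

import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

rng = np.random.default_rng(42)

# Synthetic per-dataset accuracies of two classifiers on 50 datasets
# (hypothetical values, for illustration only).
acc_a = rng.uniform(0.70, 0.95, size=50)
acc_b = acc_a + rng.normal(0.005, 0.02, size=50)  # small, noisy improvement

diff = acc_b - acc_a

# Paired Student's t-test and Wilcoxon signed-rank test on the same data.
t_stat, t_p = stats.ttest_rel(acc_b, acc_a)
w_stat, w_p = stats.wilcoxon(acc_b, acc_a)

# Effect size: Cohen's d for paired samples
# (mean of the differences divided by their standard deviation).
d = diff.mean() / diff.std(ddof=1)

# Achieved power of the paired t-test at alpha = 0.05 for this effect size.
power = TTestPower().power(effect_size=abs(d), nobs=len(diff), alpha=0.05)

print(f"paired t-test: p = {t_p:.4f}")
print(f"Wilcoxon:      p = {w_p:.4f}")
print(f"Cohen's d = {d:.3f}, achieved power = {power:.3f}")

With enough datasets, a tiny mean difference can yield p < 0.05 while Cohen's d stays small, which is exactly the situation the abstract warns against judging by the p-value alone.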

Type
Research Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press
