MASSA Algorithm: an automated rational sampling of training and test subsets for QSAR modeling

Veríssimo, Gabriel Corrêa; Pantaleão, Simone Queiroz; Fernandes, Philipe de Olveira; Gertrudes, Jadson Castro; Kronenberger, Thales; Honorio, Kathia Maria; Maltarollo, Vinícius Gonçalves

doi:10.1007/s10822-023-00536-y

Published: 07 October 2023

Volume 37, pages 735–754 (2023)

Journal of Computer-Aided Molecular Design

Abstract

QSAR models capable of predicting biological, toxicity, and pharmacokinetic properties are widely used to search for lead bioactive molecules in chemical databases. Dataset preparation strongly influences the quality of the resulting models, and one preparation step, sampling, divides the original dataset into a training set (for model fitting) and a test set (for statistical evaluation). This sampling can be done randomly or rationally, but rational division is superior. In this paper, we present MASSA, a Python tool that automatically samples datasets by exploring the biological, physicochemical, and structural spaces of molecules using principal component analysis (PCA), hierarchical cluster analysis (HCA), and K-modes clustering. The proposed algorithm is particularly useful when the variables used for QSAR are not available, or when multiple QSAR models must be built with the same training and test sets, producing models with lower variability and better values for validation metrics. These results were obtained even when the descriptors used in the QSAR/QSPR modeling differed from those used to separate the training and test sets, indicating that the tool can be used to build models for more than one QSAR/QSPR technique. Finally, the tool also generates graphical representations that can provide insights into the data.
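To make the idea of rational sampling concrete, the following is a minimal sketch and not the MASSA implementation. It assumes a single numeric descriptor table, reduces it with PCA, groups molecules by hierarchical clustering, and then draws the test set proportionally from every cluster. The function name `cluster_stratified_split`, the 90% variance threshold, the Ward linkage, the number of clusters, and the synthetic data are all illustrative assumptions; the combination of biological, physicochemical, and structural domains and the K-modes step described in the abstract are not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def cluster_stratified_split(descriptors, n_clusters=4, test_fraction=0.2, seed=0):
    """Hypothetical helper: split row indices so each cluster feeds both sets."""
    # Standardize so that no single descriptor dominates the PCA.
    scaled = StandardScaler().fit_transform(descriptors)
    # Keep the components explaining 90% of the variance (arbitrary threshold).
    scores = PCA(n_components=0.90, svd_solver="full").fit_transform(scaled)
    # Hierarchical (agglomerative) clustering with Ward linkage on the PCA scores.
    tree = linkage(scores, method="ward")
    # Cut the tree into at most n_clusters flat clusters.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    # Stratify on the cluster labels so the test set samples every cluster.
    indices = np.arange(len(descriptors))
    train_idx, test_idx = train_test_split(
        indices, test_size=test_fraction, stratify=labels, random_state=seed
    )
    return train_idx, test_idx, labels


# Synthetic stand-in for a descriptor table: 4 separated groups of 25 "molecules".
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(4, 6))
X = np.vstack([center + rng.normal(size=(25, 6)) for center in centers])

train_idx, test_idx, labels = cluster_stratified_split(X)
print(len(train_idx), len(test_idx))
```

Because the split is stratified on cluster membership rather than drawn uniformly at random, sparsely populated regions of descriptor space are less likely to end up entirely in the training set or entirely in the test set.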

Data availability

The molecular files are available in the first author’s GitHub repository at https://github.com/gcverissimo/MASSA_datasets.

Code availability

The source code is available in the first author’s GitHub repository at https://github.com/gcverissimo/MASSA_Algorithm.

Acknowledgements

The authors would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG), Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), and Pró-Reitoria de Pesquisa of the Universidade Federal de Minas Gerais for financial support; OpenEye Scientific Software for the OMEGA and QUACPAC academic licenses; and Prof. Dr. Raquel Cardoso de Melo Minardi for her encouragement and for offering the course in which this tool was developed.

Funding

This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG), Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), and Pró-Reitoria de Pesquisa of the Universidade Federal de Minas Gerais through financial support and academic grants. OpenEye Scientific Software provided the OMEGA and QUACPAC academic licenses. T.K. is funded by TüCAD2 and CMIF, which are funded by the Federal Ministry of Education and Research (BMBF) and the Baden-Württemberg Ministry of Science as part of the Excellence Strategy of the German Federal and State Governments.

Author information

Authors and Affiliations

Department of Pharmaceutical Products, Faculty of Pharmacy, Federal University of Minas Gerais, Belo Horizonte, MG, 31270-901, Brazil
Gabriel Corrêa Veríssimo, Philipe de Olveira Fernandes & Vinícius Gonçalves Maltarollo
Federal University of ABC, Santo André, SP, 09210-170, Brazil
Simone Queiroz Pantaleão & Kathia Maria Honorio
Department of Computing, Institute of Exact and Biological Sciences, Federal University of Ouro Preto, Ouro Preto, MG, 35400-000, Brazil
Jadson Castro Gertrudes
Department of Pharmaceutical and Medicinal Chemistry, University of Tübingen, Tübingen, BW, 72076, Germany
Thales Kronenberger
School of Arts, Sciences and Humanities, University of São Paulo, São Paulo, SP, 03828-000, Brazil
Kathia Maria Honorio

Contributions

GCV wrote the MASSA algorithm code, applied it to the training-test splitting, and prepared all the figures and tables. GCV and SQP generated and validated the QSAR models. GCV, SQP, POF, and JCG analyzed and compared the obtained data. JCG, TK, KMH, and VGM designed the experiments and supervised the students. All authors wrote and reviewed the manuscript.

Corresponding author

Correspondence to Vinícius Gonçalves Maltarollo.

Ethics declarations

Competing interests

The authors have no competing interests (financial or non-financial) to declare that are relevant to the content of this article.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (RAR 29893 KB)

Supplementary file2 (ZIP 17421 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Veríssimo, G.C., Pantaleão, S.Q., Fernandes, P. et al. MASSA Algorithm: an automated rational sampling of training and test subsets for QSAR modeling. J Comput Aided Mol Des 37, 735–754 (2023). https://doi.org/10.1007/s10822-023-00536-y

Received: 06 June 2023
Accepted: 14 September 2023
Published: 07 October 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10822-023-00536-y
