Improvement of multi-task learning by data enrichment: application for drug discovery

Sosnina, Ekaterina A.; Sosnin, Sergey; Fedorov, Maxim V.

doi:10.1007/s10822-023-00500-w

Published: 21 March 2023

Volume 37, pages 183–200 (2023)

Journal of Computer-Aided Molecular Design

Ekaterina A. Sosnina¹,
Sergey Sosnin² &
Maxim V. Fedorov¹,³

Abstract

Multi-task learning in deep neural networks has become a topic of growing importance in many research fields, including drug discovery. However, applying multi-task learning poses new challenges in improving prediction performance. This study investigated the potential of training data enrichment to enhance multi-task model prediction quality in drug discovery. The study evaluated four scenarios with varying degrees of information capacity of the training data and applied two types of test data to evaluate prediction performance. We used three datasets: ViralChEMBL, which consists of binary activities of compounds against viral species, was used for the classification task; pQSAR(159) and pQSAR(4267), which consist of bioactivities of compounds and assays from the research on the profile-QSAR method, were used for the regression tasks. We built multi-task models based on feed-forward DNNs using the PyTorch framework. Our findings showed that training data enrichment can be an effective means of enhancing prediction performance in multi-task learning, but the degree of improvement depends on the quality of the training data. The more unique compounds and targets the training data include, the more new compound-target interactions are required to improve prediction. We also found that, even with multi-task learning, one cannot predict the interactions of compounds that are highly dissimilar from those used for model training. The study provides some recommendations for effectively employing multi-task learning in drug discovery to improve prediction accuracy and facilitate the discovery of novel drug candidates.
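The abstract does not say how the models treat compound-target pairs that have no measurement. One common approach in multi-task regression is to average the loss over measured pairs only, so that enriching the training data with new interactions directly increases the number of terms contributing to the loss. The sketch below illustrates that idea and is not the authors' implementation: the function name `masked_mse`, the use of NaN as a missing-label marker, and the toy values are all assumptions, and plain Python is used instead of PyTorch to keep the example self-contained.

```python
import math


def masked_mse(predictions, targets):
    """Mean squared error over measured compound-task pairs only.

    predictions, targets: lists of rows, one row per compound and one
    column per task. A target of math.nan marks an unmeasured pair
    (an assumed convention) and is skipped.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if math.isnan(target):
                continue  # no label for this compound-task pair
            total += (pred - target) ** 2
            count += 1
    if count == 0:
        raise ValueError("no measured compound-task pairs")
    return total / count


# Hypothetical toy data: 3 compounds x 2 tasks, nan = not measured.
targets = [[6.0, math.nan], [math.nan, 5.0], [7.0, 4.0]]
predictions = [[5.0, 9.9], [1.2, 5.0], [7.0, 6.0]]
loss = masked_mse(predictions, targets)
print(loss)
```

Only the four measured pairs enter the average here; the predictions 9.9 and 1.2 fall on unmeasured pairs and have no effect on the loss.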

Data availability

All supplementary materials used in this work are located on Zenodo (https://doi.org/10.5281/zenodo.7651581). File SI1: pdf file with the description of the subsets, the hyperparameters of the models, and the prediction results evaluation (PDF); File SI2: zipped file with the data, split into the subsets “trn”, “i”, and “c”. The algorithm used for the research has been made publicly available on GitHub [34].

Abbreviations

MTL: Multi-task learning
DNN: Deep neural network
STL: Single-task learning
AD: Applicability domain

References

Williams AJ, Pence HE (2017) The future of chemical information is now. Chem Int 39(3):9–14. https://doi.org/10.1515/ci-2017-0304
Tetko IV, Engkvist O, Chen H (2016) Does ‘Big Data’ exist in medicinal chemistry, and if so, how can it be harnessed? Future Med Chem 8(15):1801–1806. https://doi.org/10.4155/fmc-2016-0163
Nikitina AA, Orlov AA, Kozlovskaya LI, Palyulin VA, Osolodkin DI (2019) Enhanced taxonomy annotation of antiviral activity data from ChEMBL. Database 2019:139. https://doi.org/10.1093/database/bay139
Sosnin S, Karlov D, Tetko IV, Fedorov MV (2019) Comparative study of multitask toxicity modeling on a broad chemical space. J Chem Inf Model 59(3):1062–1072. https://doi.org/10.1021/acs.jcim.8b00685
Jain S, Siramshetty VB, Alves VM, Muratov EN, Kleinstreuer N, Tropsha A, Nicklaus MC, Simeonov A, Zakharov AV (2021) Large-scale modeling of multispecies acute toxicity end points using consensus of multitask deep learning methods. J Chem Inf Model 61(2):653–663. https://doi.org/10.1021/acs.jcim.0c01164
Martin EJ, Polyakov VR, Tian L, Perez RC (2017) Profile-QSAR 2.0: kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds. J Chem Inf Model 57(8):2077–2088. https://doi.org/10.1021/acs.jcim.7b00166
Martin EJ, Polyakov VR, Zhu X-W, Tian L, Mukherjee P, Liu X (2019) All-assay-Max2 pQSAR: activity predictions as accurate as four-concentration IC50s for 8558 Novartis assays. J Chem Inf Model 59(10):4450–4459. https://doi.org/10.1021/acs.jcim.9b00375
Sosnin S, Vashurina M, Withnall M, Karpov P, Fedorov M, Tetko I (2018) A survey of multi-task learning methods in chemoinformatics. Mol Inf. https://doi.org/10.1002/minf.201800108
Joshi A, Karimi S, Sparks R, Paris C, MacIntyre CR (2019) Does multi-task learning always help?: an evaluation on health informatics. In: Proceedings of the 17th annual workshop of the Australasian Language Technology Association. Australasian Language Technology Association, Sydney, pp 151–158
Zhang Y, Yang Q (2021) A survey on multi-task learning. http://arxiv.org/abs/1707.08114 [cs]
Xu Y, Pei J, Lai L (2017) Deep learning based regression and multiclass models for acute oral toxicity prediction with automatic chemical feature extraction. J Chem Inf Model 57(11):2672–2685. https://doi.org/10.1021/acs.jcim.7b00244
Montanari F, Kuhnke L, Ter Laak A, Clevert D-A (2020) Modeling physico-chemical ADMET endpoints with multitask graph convolutional networks. Molecules 25(1):44. https://doi.org/10.3390/molecules25010044
Lenselink EB, ten Dijke N, Bongers B, Papadatos G, van Vlijmen HWT, Kowalczyk W, IJzerman AP, van Westen GJP (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform 9(1):45. https://doi.org/10.1186/s13321-017-0232-0
Yuan H, Paskov I, Paskov H, González AJ, Leslie CS (2016) Multitask learning improves prediction of cancer drug sensitivity. Sci Rep 6(1):31619. https://doi.org/10.1038/srep31619
Kalakoti Y, Yadav S, Sundar D (2022) Deep neural network-assisted drug recommendation systems for identifying potential drug-target interactions. ACS Omega 7(14):12138–12146. https://doi.org/10.1021/acsomega.2c00424
Weaver S, Gleeson MP (2008) The importance of the domain of applicability in QSAR modeling. J Mol Graph Model 26(8):1315–1326. https://doi.org/10.1016/j.jmgm.2008.01.002
Sahigara F, Mansouri K, Ballabio D, Mauri A, Consonni V, Todeschini R (2012) Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17(5):4791–4810. https://doi.org/10.3390/molecules17054791
Rakhimbekova A, Madzhidov TI, Nugmanov RI, Gimadiev TR, Baskin II, Varnek A (2020) Comprehensive analysis of applicability domains of QSPR models for chemical reactions. Int J Mol Sci 21(15):5542. https://doi.org/10.3390/ijms21155542
Kar S, Roy K, Leszczynski J (2018) Applicability domain: a step toward confident predictions and decidability for QSAR modeling. In: Nicolotti O (ed) Computational toxicology: methods and protocols. Methods in molecular biology. Springer, New York, pp 141–169. https://doi.org/10.1007/978-1-4939-7899-1_6
OECD (2014) Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. https://doi.org/10.1787/9789264085442-en
Sheridan RP, Feuston BP, Maiorov VN, Kearsley SK (2004) Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J Chem Inf Comput Sci 44(6):1912–1928. https://doi.org/10.1021/ci049782w
Kaneko H, Funatsu K (2014) Applicability domain based on ensemble learning in classification and regression analyses. J Chem Inf Model 54(9):2469–2482. https://doi.org/10.1021/ci500364e
Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22(1):69–77. https://doi.org/10.1002/qsar.200390007
Hemmateenejad B, Yazdani M (2009) QSPR models for half-wave reduction potential of steroids: a comparative study between feature selection and feature extraction from subsets of or entire set of descriptors. Anal Chim Acta 634(1):27–35. https://doi.org/10.1016/j.aca.2008.11.062
Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(Database issue):1100–1107. https://doi.org/10.1093/nar/gkr777. Accessed 8 Jan 2023
Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Krüger FA, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington JP (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42(Database issue):1083–1090. https://doi.org/10.1093/nar/gkt1031
Sosnina EA, Sosnin S, Nikitina AA, Nazarov I, Osolodkin DI, Fedorov MV (2020) Recommender systems in antiviral drug discovery. ACS Omega 5(25):15039–15051. https://doi.org/10.1021/acsomega.0c00857
Landrum G (2016) RDKit: open-source cheminformatics software
Zhang L, Tan J, Han D, Zhu H (2017) From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov Today 22(11):1680–1685. https://doi.org/10.1016/j.drudis.2017.08.010
Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T (2018) The rise of deep learning in drug discovery. Drug Discov Today 23(6):1241–1250. https://doi.org/10.1016/j.drudis.2018.01.039
Nag S, Baidya ATK, Mandal A, Mathew AT, Das B, Devi B, Kumar R (2022) Deep learning tools for advancing drug discovery and development. 3 Biotech 12(5):110. https://doi.org/10.1007/s13205-022-03165-8
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems 32. Curran Associates Inc., Red Hook, pp 8024–8035. https://doi.org/10.48550/arXiv.1912.01703
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(10):281–305
Sosnina EA, Sosnin S, Fedorov MV (2023) ImprovingMTT. GitHub. https://github.com/ekaterina-sea/ImprovingMTT
Sheridan RP (2013) Time-split cross-validation as a method for estimating the goodness of prospective prediction. J Chem Inf Model 53(4):783–790. https://doi.org/10.1021/ci400084k. Accessed 11 Jan 2023
Schuffenhauer A, Ertl P, Roggo S, Wetzel S, Koch MA, Waldmann H (2007) The scaffold tree—visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model 47(1):47–58. https://doi.org/10.1021/ci600338x. Accessed 11 Jan 2023
Karlov DS, Sosnin S, Tetko IV, Fedorov MV (2019) Chemical space exploration guided by deep neural networks. RSC Adv 9(9):5151–5157. https://doi.org/10.1039/C8RA10182E
Wainer J, Cawley G (2021) Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst Appl 182:115222. https://doi.org/10.1016/j.eswa.2021.115222
Lika B, Kolomvatsos K, Hadjiefthymiades S (2014) Facing the cold start problem in recommender systems. Expert Syst Appl 41(4, Part 2):2065–2073. https://doi.org/10.1016/j.eswa.2013.09.005
Sethi R, Mehrotra M (2021) Cold start in recommender systems—a survey from domain perspective. In: Hemanth J, Bestak R, Chen JI-Z (eds) Intelligent data communication technologies and internet of things. Lecture notes on data engineering and communications technologies. Springer, Singapore, pp 223–232. https://doi.org/10.1007/978-981-15-9509-7_19
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
Safari S, Baratloo A, Elfil M, Negida A (2016) Evidence based emergency medicine; Part 5 receiver operating curve and area under the curve. Emergency (Tehran) 4(2):111–113. https://doi.org/10.22037/aaem.v4i2.232
Chicco D, Warrens MJ, Jurman G (2021) The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci 7:623. https://doi.org/10.7717/peerj-cs.623
Onyutha C (2021) A hydrological model skill score and revised R-squared. Hydrol Res 53(1):51–64. https://doi.org/10.2166/nh.2021.071
Li Z, Kamnitsas K, Glocker B (2021) Analyzing overfitting under class imbalance in neural networks for image segmentation. IEEE Trans Med Imaging 40(3):1065–1077. https://doi.org/10.1109/TMI.2020.3046692, http://arxiv.org/abs/2102.10365 [cs]
Venil P, Vinodhini G, Suban R (2020) A state of the art survey on cold start problem in a collaborative filtering system. Int J Sci Technol Res 9:2606–2612

Acknowledgements

The authors acknowledge the use of computational resources of the Skoltech CDISE supercomputer Zhores for obtaining the results presented in this paper. The authors are thankful to Dr. Dmitry Osolodkin for his constructive suggestions and comments.

Funding

The reported study was funded by RFBR according to the research Project No. 19-33-90290.

Author information

Authors and Affiliations

Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, Moscow, Russia, 143026
Ekaterina A. Sosnina & Maxim V. Fedorov
Department of Pharmaceutical Sciences, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1190, Vienna, Austria
Sergey Sosnin
Sirius University of Science and Technology, Olympiisky Prospect 1, Sochi, Russia, 354340
Maxim V. Fedorov

Contributions

The manuscript was written through contributions of all authors. EAS and SS prepared the methodology and software, EAS prepared and visualized the results, EAS and SS wrote the original draft, MVF supervised the research. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ekaterina A. Sosnina.

Ethics declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF 264 KB)

Supplementary file 2 (ZIP 25591 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sosnina, E.A., Sosnin, S. & Fedorov, M.V. Improvement of multi-task learning by data enrichment: application for drug discovery. J Comput Aided Mol Des 37, 183–200 (2023). https://doi.org/10.1007/s10822-023-00500-w

Received: 22 November 2022
Accepted: 21 February 2023
Published: 21 March 2023
Issue Date: April 2023
DOI: https://doi.org/10.1007/s10822-023-00500-w
