Abstract
Multi-task learning in deep neural networks has become a topic of growing importance in many research fields, including drug discovery. However, applying multi-task learning poses new challenges in improving prediction performance. This study investigated the potential of training data enrichment to enhance multi-task model prediction quality in drug discovery. We evaluated four scenarios with varying degrees of information capacity of the training data and applied two types of test data to evaluate prediction performance. We used three datasets: ViralChEMBL, which consists of binary activities of compounds against viral species, was used for the classification task; pQSAR(159) and pQSAR(4267), which consist of bioactivities of compounds in assays taken from research on the profile-QSAR method, were used for the regression tasks. We built multi-task models based on feed-forward DNNs using the PyTorch framework. Our findings showed that training data enrichment can be an effective means of enhancing prediction performance in multi-task learning, but the degree of improvement depends on the quality of the training data. The more unique compounds and targets the training data included, the more new compound-target interactions were required to improve prediction. We also found that, even with multi-task learning, one could not predict the interactions of compounds that are highly dissimilar from those used for model training. The study provides some recommendations for effectively employing multi-task learning in drug discovery to improve prediction accuracy and facilitate the discovery of novel drug candidates.
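The abstract describes training sets in terms of unique compounds, unique targets, and the compound-target interactions measured between them. A minimal sketch of how those three quantities could be counted for a set of measured pairs is given below. The function name `coverage`, the `Coverage` tuple, the density ratio, and the toy identifiers are illustrative assumptions; none of them is taken from the authors' code.

```python
from typing import Iterable, NamedTuple, Tuple


class Coverage(NamedTuple):
    n_compounds: int      # unique compounds in the set
    n_targets: int        # unique targets (e.g. viral species or assays)
    n_interactions: int   # unique measured compound-target pairs
    density: float        # measured pairs / all possible pairs


def coverage(pairs: Iterable[Tuple[str, str]]) -> Coverage:
    """Summarize a collection of measured (compound, target) pairs."""
    unique_pairs = set(pairs)  # repeated measurements count once
    compounds = {c for c, _ in unique_pairs}
    targets = {t for _, t in unique_pairs}
    possible = len(compounds) * len(targets)
    density = len(unique_pairs) / possible if possible else 0.0
    return Coverage(len(compounds), len(targets), len(unique_pairs), density)


# Hypothetical toy data: 3 compounds, 2 targets, 4 measured interactions.
toy = [
    ("cpd1", "virusA"),
    ("cpd1", "virusB"),
    ("cpd2", "virusA"),
    ("cpd3", "virusB"),
]
summary = coverage(toy)
```

Comparing such a summary before and after enrichment shows whether the added data contributes new compounds, new targets, or only new interactions among entities that were already present.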
Data availability
All supplementary materials used in this work are located on Zenodo (https://doi.org/10.5281/zenodo.7651581). File SI1: pdf file with the description of the subsets, the hyperparameters of the models, and the prediction results evaluation (PDF); File SI2: zipped file with the data, split into the subsets “trn”, “i”, and “c”. The algorithm used for the research has been made publicly available on GitHub [34].
Abbreviations
- MTL: Multi-task learning
- DNN: Deep neural network
- STL: Single-task learning
- AD: Applicability domain
References
Williams AJ, Pence HE (2017) The future of chemical information is now. Chem Int 39(3):9–14. https://doi.org/10.1515/ci-2017-0304
Tetko IV, Engkvist O, Chen H (2016) Does ‘Big Data’ exist in medicinal chemistry, and if so, how can it be harnessed? Future Med Chem 8(15):1801–1806. https://doi.org/10.4155/fmc-2016-0163
Nikitina AA, Orlov AA, Kozlovskaya LI, Palyulin VA, Osolodkin DI (2019) Enhanced taxonomy annotation of antiviral activity data from ChEMBL. Database 2019:139. https://doi.org/10.1093/database/bay139
Sosnin S, Karlov D, Tetko IV, Fedorov MV (2019) Comparative study of multitask toxicity modeling on a broad chemical space. J Chem Inf Model 59(3):1062–1072. https://doi.org/10.1021/acs.jcim.8b00685
Jain S, Siramshetty VB, Alves VM, Muratov EN, Kleinstreuer N, Tropsha A, Nicklaus MC, Simeonov A, Zakharov AV (2021) Large-scale modeling of multispecies acute toxicity end points using consensus of multitask deep learning methods. J Chem Inf Model 61(2):653–663. https://doi.org/10.1021/acs.jcim.0c01164
Martin EJ, Polyakov VR, Tian L, Perez RC (2017) Profile-QSAR 2.0: kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds. J Chem Inf Model 57(8):2077–2088. https://doi.org/10.1021/acs.jcim.7b00166
Martin EJ, Polyakov VR, Zhu X-W, Tian L, Mukherjee P, Liu X (2019) All-assay-Max2 pQSAR: activity predictions as accurate as four-concentration IC50s for 8558 Novartis assays. J Chem Inf Model 59(10):4450–4459. https://doi.org/10.1021/acs.jcim.9b00375
Sosnin S, Vashurina M, Withnall M, Karpov P, Fedorov M, Tetko I (2018) A survey of multi-task learning methods in chemoinformatics. Mol Inf. https://doi.org/10.1002/minf.201800108
Joshi A, Karimi S, Sparks R, Paris C, MacIntyre CR (2019) Does multi-task learning always help?: an evaluation on health informatics. In: Proceedings of the 17th annual workshop of the Australasian Language Technology Association. Australasian Language Technology Association, Sydney, pp 151–158
Zhang Y, Yang Q (2021) A survey on multi-task learning. http://arxiv.org/abs/1707.08114 [cs]
Xu Y, Pei J, Lai L (2017) Deep learning based regression and multiclass models for acute oral toxicity prediction with automatic chemical feature extraction. J Chem Inf Model 57(11):2672–2685. https://doi.org/10.1021/acs.jcim.7b00244
Montanari F, Kuhnke L, Ter Laak A, Clevert D-A (2020) Modeling physico-chemical ADMET endpoints with multitask graph convolutional networks. Molecules 25(1):44. https://doi.org/10.3390/molecules25010044
Lenselink EB, ten Dijke N, Bongers B, Papadatos G, van Vlijmen HWT, Kowalczyk W, IJzerman AP, van Westen GJP (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform 9(1):45. https://doi.org/10.1186/s13321-017-0232-0
Yuan H, Paskov I, Paskov H, González AJ, Leslie CS (2016) Multitask learning improves prediction of cancer drug sensitivity. Sci Rep 6(1):31619. https://doi.org/10.1038/srep31619
Kalakoti Y, Yadav S, Sundar D (2022) Deep neural network-assisted drug recommendation systems for identifying potential drug-target interactions. ACS Omega 7(14):12138–12146. https://doi.org/10.1021/acsomega.2c00424
Weaver S, Gleeson MP (2008) The importance of the domain of applicability in QSAR modeling. J Mol Graph Model 26(8):1315–1326. https://doi.org/10.1016/j.jmgm.2008.01.002
Sahigara F, Mansouri K, Ballabio D, Mauri A, Consonni V, Todeschini R (2012) Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17(5):4791–4810. https://doi.org/10.3390/molecules17054791
Rakhimbekova A, Madzhidov TI, Nugmanov RI, Gimadiev TR, Baskin II, Varnek A (2020) Comprehensive analysis of applicability domains of QSPR models for chemical reactions. Int J Mol Sci 21(15):5542. https://doi.org/10.3390/ijms21155542
Kar S, Roy K, Leszczynski J (2018) Applicability domain: a step toward confident predictions and decidability for QSAR modeling. In: Nicolotti O (ed) Computational toxicology: methods and protocols. Methods in molecular biology. Springer, New York, pp 141–169. https://doi.org/10.1007/978-1-4939-7899-1_6
OECD (2014) Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. https://doi.org/10.1787/9789264085442-en
Sheridan RP, Feuston BP, Maiorov VN, Kearsley SK (2004) Similarity to molecules in the training set is a good discriminator for prediction accuracy in QSAR. J Chem Inf Comput Sci 44(6):1912–1928. https://doi.org/10.1021/ci049782w
Kaneko H, Funatsu K (2014) Applicability domain based on ensemble learning in classification and regression analyses. J Chem Inf Model 54(9):2469–2482. https://doi.org/10.1021/ci500364e
Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22(1):69–77. https://doi.org/10.1002/qsar.200390007
Hemmateenejad B, Yazdani M (2009) QSPR models for half-wave reduction potential of steroids: a comparative study between feature selection and feature extraction from subsets of or entire set of descriptors. Anal Chim Acta 634(1):27–35. https://doi.org/10.1016/j.aca.2008.11.062
Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(Database issue):1100–1107. https://doi.org/10.1093/nar/gkr777. Accessed 8 Jan 2023
Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Krüger FA, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington JP (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42(Database issue):1083–1090. https://doi.org/10.1093/nar/gkt1031
Sosnina EA, Sosnin S, Nikitina AA, Nazarov I, Osolodkin DI, Fedorov MV (2020) Recommender systems in antiviral drug discovery. ACS Omega 5(25):15039–15051. https://doi.org/10.1021/acsomega.0c00857
Landrum G (2016) RDKit: open-source cheminformatics software
Zhang L, Tan J, Han D, Zhu H (2017) From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov Today 22(11):1680–1685. https://doi.org/10.1016/j.drudis.2017.08.010
Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T (2018) The rise of deep learning in drug discovery. Drug Discov Today 23(6):1241–1250. https://doi.org/10.1016/j.drudis.2018.01.039
Nag S, Baidya ATK, Mandal A, Mathew AT, Das B, Devi B, Kumar R (2022) Deep learning tools for advancing drug discovery and development. 3 Biotech 12(5):110. https://doi.org/10.1007/s13205-022-03165-8
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems 32. Curran Associates Inc., Red Hook, pp 8024–8035. https://doi.org/10.48550/arXiv.1912.01703
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(10):281–305
Sosnina EA, Sosnin S, Fedorov MV (2023) ImprovingMTT. GitHub. https://github.com/ekaterina-sea/ImprovingMTT
Sheridan RP (2013) Time-split cross-validation as a method for estimating the goodness of prospective prediction. J Chem Inf Model 53(4):783–790. https://doi.org/10.1021/ci400084k. Accessed 11 Jan 2023
Schuffenhauer A, Ertl P, Roggo S, Wetzel S, Koch MA, Waldmann H (2007) The scaffold tree—visualization of the scaffold universe by hierarchical scaffold classification. J Chem Inf Model 47(1):47–58. https://doi.org/10.1021/ci600338x. Accessed 11 Jan 2023
Karlov DS, Sosnin S, Tetko IV, Fedorov MV (2019) Chemical space exploration guided by deep neural networks. RSC Adv 9(9):5151–5157. https://doi.org/10.1039/C8RA10182E
Wainer J, Cawley G (2021) Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst Appl 182:115222. https://doi.org/10.1016/j.eswa.2021.115222
Lika B, Kolomvatsos K, Hadjiefthymiades S (2014) Facing the cold start problem in recommender systems. Expert Syst Appl 41(4, Part 2):2065–2073. https://doi.org/10.1016/j.eswa.2013.09.005
Sethi R, Mehrotra M (2021) Cold start in recommender systems—a survey from domain perspective. In: Hemanth J, Bestak R, Chen JI-Z (eds) Intelligent data communication technologies and internet of things. Lecture notes on data engineering and communications technologies. Springer, Singapore, pp 223–232. https://doi.org/10.1007/978-981-15-9509-7_19
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2
Safari S, Baratloo A, Elfil M, Negida A (2016) Evidence based emergency medicine; Part 5 receiver operating curve and area under the curve. Emergency (Tehran) 4(2):111–113. https://doi.org/10.22037/aaem.v4i2.232
Chicco D, Warrens MJ, Jurman G (2021) The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci 7:623. https://doi.org/10.7717/peerj-cs.623
Onyutha C (2021) A hydrological model skill score and revised R-squared. Hydrol Res 53(1):51–64. https://doi.org/10.2166/nh.2021.071
Li Z, Kamnitsas K, Glocker B (2021) Analyzing overfitting under class imbalance in neural networks for image segmentation. IEEE Trans Med Imaging 40(3):1065–1077. https://doi.org/10.1109/TMI.2020.3046692, http://arxiv.org/abs/2102.10365 [cs]
Venil P, Vinodhini G, Suban R (2020) A state of the art survey on cold start problem in a collaborative filtering system. Int J Sci Technol Res 9:2606–2612
Acknowledgements
The authors acknowledge the use of computational resources of the Skoltech CDISE supercomputer Zhores for obtaining the results presented in this paper. The authors are thankful to Dr. Dmitry Osolodkin for his constructive suggestions and comments.
Funding
The reported study was funded by RFBR according to the research Project No. 19-33-90290.
Author information
Contributions
The manuscript was written through contributions of all authors. EAS and SS prepared the methodology and software, EAS prepared and visualized the results, EAS and SS wrote the original draft, MVF supervised the research. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sosnina, E.A., Sosnin, S. & Fedorov, M.V. Improvement of multi-task learning by data enrichment: application for drug discovery. J Comput Aided Mol Des 37, 183–200 (2023). https://doi.org/10.1007/s10822-023-00500-w
DOI: https://doi.org/10.1007/s10822-023-00500-w