Abstract
Aggregating knowledge about drugs, diseases, and drug reactions across a broad range of domains and languages is critical for information extraction applications. In this work, we present a fine-grained evaluation of how effective multilingual BERT-based models are for biomedical named entity recognition (NER) and multi-label sentence classification. We investigate transfer learning strategies between two English corpora and a novel annotated corpus of Russian reviews about drug therapy. In these corpora, sentence-level labels indicate health-related issues or their absence. Sentences that belong to a certain class are additionally annotated at the entity level to identify fine-grained subtypes such as drug names, drug indications, and drug reactions. The evaluation results demonstrate that training BERT on raw Russian and English reviews (5M in total) provides the best transfer capabilities for the adverse drug reaction detection task on the Russian data. Our RuDR-BERT model achieved a macro F1 score of 74.85% in the NER task. For the classification task, our EnRuDR-BERT model achieved a macro F1 score of 70%, gaining 8.64% over the score of a general-domain BERT model.
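As a rough illustration of the macro F1 metric reported above for multi-label sentence classification, the following minimal sketch computes a per-label F1 score and averages the scores with equal weight. It is not the authors' evaluation code: the function name `macro_f1`, the label names, and the toy data are all hypothetical, and the exact averaging protocol used in the paper is an assumption.

```python
from typing import Dict, List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Macro-averaged F1: compute F1 per label, then average with equal weight."""
    per_label: Dict[str, float] = {}
    for label in labels:
        # Count true positives, false positives, and false negatives for this label.
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        per_label[label] = 2 * tp / denom if denom else 0.0
    return sum(per_label.values()) / len(labels)


# Hypothetical label set and toy data: each sentence carries a set of labels,
# and an empty set means no health-related issue was annotated.
LABELS = ["ADR", "Indication", "Other"]
gold = [{"ADR"}, {"Indication"}, {"ADR", "Indication"}, set()]
pred = [{"ADR"}, {"ADR"}, {"Indication"}, {"Other"}]

score = macro_f1(gold, pred, LABELS)
print(f"macro F1 = {score:.4f}")
```

Because every label contributes equally to the average, a rare class with poor predictions lowers the macro score as much as a frequent one. This is why macro averaging is often preferred when class frequencies are imbalanced.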
Funding
This work was supported by the Russian Science Foundation, project no. 23-11-00358.
Ethics declarations
The authors declare that they have no conflicts of interest.
Additional information
Translated by Yu. Kornienko
About this article
Cite this article
Sakhovskiy, A.S., Tutubalina, E.V. Cross-Lingual Transfer Learning in Drug-Related Information Extraction from User-Generated Texts. Program Comput Soft 49, 590–595 (2023). https://doi.org/10.1134/S036176882307006X