
Cross-Lingual Transfer Learning in Drug-Related Information Extraction from User-Generated Texts

Programming and Computer Software

Abstract

Aggregating knowledge about drug, disease, and drug reaction entities across a broader range of domains and languages is critical for information extraction applications. In this work, we present a fine-grained evaluation intended to understand the efficiency of multilingual BERT-based models for biomedical named entity recognition (NER) and multi-label sentence classification. We investigate the role of transfer learning strategies between two English corpora and a novel annotated corpus of Russian reviews about drug therapy. In these corpora, sentence labels indicate health-related issues or their absence. Sentences that belong to a certain class are additionally labeled at the entity level to identify fine-grained subtypes such as drug names, drug indications, and drug reactions. The evaluation results demonstrate that training BERT on raw Russian and English reviews (5M in total) provides the best transfer capabilities for the adverse drug reaction detection task on the Russian data. Our RuDR-BERT model achieved a macro F1 score of 74.85% in the NER task. For the classification task, our EnRuDR-BERT model achieved a macro F1 score of 70%, gaining 8.64% over the score of a general-domain BERT model.
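Both reported results are macro F1 scores, that is, the unweighted mean of per-class F1 values. The sketch below illustrates this metric for multi-label sentence classification on toy data. It is not the evaluation code used in this work: the two label columns, the example vectors, and the convention of scoring a class with no true or predicted positives as 0 are assumptions made for illustration only.

```python
from typing import List, Sequence


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one binary label: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumed convention: a label with no positives at all scores 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[Sequence[int]], y_pred: List[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over all label columns."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        col_true = [row[j] for row in y_true]
        col_pred = [row[j] for row in y_pred]
        scores.append(f1_for_label(col_true, col_pred))
    return sum(scores) / n_labels


# Toy data: four sentences, two hypothetical labels per sentence.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [1, 1], [1, 0]]

score = macro_f1(y_true, y_pred)
print(f"{score:.4f}")  # 0.7333 (per-label F1 values are 0.8 and 2/3)
```

Because every class contributes equally to the mean, a rare class that is detected poorly lowers macro F1 as much as a frequent one.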


Notes

  1. https://github.com/google-research/bert.

  2. https://huggingface.co/cimm-kzn/rudr-bert.

  3. https://huggingface.co/cimm-kzn/enrudr-bert.


Funding

This work was supported by the Russian Science Foundation, project no. 23-11-00358.

Author information


Corresponding authors

Correspondence to A. S. Sakhovskiy or E. V. Tutubalina.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by Yu. Kornienko


About this article


Cite this article

Sakhovskiy, A.S., Tutubalina, E.V. Cross-Lingual Transfer Learning in Drug-Related Information Extraction from User-Generated Texts. Program Comput Soft 49, 590–595 (2023). https://doi.org/10.1134/S036176882307006X


  • DOI: https://doi.org/10.1134/S036176882307006X
