Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus

Delon, François; Bédubourg, Gabriel; Bouscarrat, Léo; Meynard, Jean-Baptiste; Valois, Aude; Queyriaux, Benjamin; Ramisch, Carlos; Tanti, Marc

doi:10.1007/s10579-024-09728-w


Original Paper
Published: 05 March 2024

Language Resources and Evaluation (2024)

François Delon^1,2,
Gabriel Bédubourg^1,2,
Léo Bouscarrat^3,4,
Jean-Baptiste Meynard^2,
Aude Valois^2,5,
Benjamin Queyriaux^2,6,
Carlos Ramisch^4 &
…
Marc Tanti^1,2


Abstract

Event-based surveillance (EBS) requires the analysis of an ever-increasing volume of documents, which calls for automated processing to support human analysts. Few annotated corpora are available for evaluating information extraction tools in the EBS domain. The main objective of this work was to build a corpus of documents representative of those collected in current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and their novelty suited to the background work of analysts in the EBS domain, and we compiled a corpus of 305 documents describing 283 infectious events. Thirty-six of the included documents were in French, representing a total of 11 events; the remainder were in English. We annotated novelty for the 110 most recent documents in the corpus, resulting in 101 events. The inter-annotator agreement was 0.74 for event identification (F1-score) and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. The overall agreement for entity annotation was lower, with a significant variation according to the type of entity considered (range 0.30–0.68). This corpus is a useful tool for creating and evaluating the algorithms and methods that EBS research teams propose for event detection and novelty annotation, with the aim of improving the operational processing of document flows. The small size of the corpus makes it less suitable for training natural language processing models, although this limitation tends to fade with few-shot learning methods.
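To make the two agreement measures named in the abstract concrete, the following is a minimal Python sketch of how a pairwise F1-score and a kappa coefficient can be computed for two annotators. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say which kappa variant was used (Cohen's kappa is assumed here), events are assumed to be matched exactly by a hypothetical (disease, location) key, and the function names and toy labels are invented for the example.

```python
from collections import Counter


def pairwise_f1(events_a, events_b):
    """F1 agreement between two annotators' event sets.

    Annotator A is treated as the reference and B as the prediction;
    F1 is symmetric, so swapping the roles gives the same value.
    """
    a, b = set(events_a), set(events_b)
    if not a and not b:
        return 1.0  # both annotated nothing: trivial full agreement
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical, not taken from the corpus).
annotator_a_events = {("cholera", "Haiti"), ("measles", "Lyon"), ("ebola", "Goma")}
annotator_b_events = {("measles", "Lyon"), ("ebola", "Goma"), ("dengue", "Cayenne")}
print(pairwise_f1(annotator_a_events, annotator_b_events))

novelty_a = ["new", "new", "old", "old"]
novelty_b = ["new", "old", "old", "old"]
print(cohen_kappa(novelty_a, novelty_b))
```

Exact matching on a key is the strictest choice; a relaxed matching rule, for example one that tolerates differences in location granularity, would generally give a higher F1 on the same annotations.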


Data availability

The corpus generated during the current study, including the source URLs of the documents and the annotations, is available from the corresponding author on reasonable request (Zenodo repository: https://doi.org/10.5281/zenodo.8414785). The source texts may be published under license and so are not publicly available.

Notes

https://archive.org/web/

Abbreviations

EBS: Event-based surveillance
EBS-IS: Event-based surveillance information system
IAA: Inter-annotator agreement
STE: Stack Exchange
TREC: Text REtrieval Conference
UMLS: Unified Medical Language System


Acknowledgements

The Eura Nova Company supported the technical deployment of the annotation platform.

Funding

We carried out the annotation of the documents on an online platform, whose hosting was financed by the Eura Nova Company.

Author information

Authors and Affiliations

Aix Marseille Univ, Inserm, IRD, SESSTIM, Sciences Economiques & Sociales de la Santé & Traitement de l’Information Médicale, ISSPAM, Marseille, France
François Delon, Gabriel Bédubourg & Marc Tanti
French Defense Health Service, Paris, France
François Delon, Gabriel Bédubourg, Jean-Baptiste Meynard, Aude Valois, Benjamin Queyriaux & Marc Tanti
EURA NOVA, Marseille, France
Léo Bouscarrat
CNRS, LIS, Université de Toulon, Aix-Marseille Université, Marseille, France
Léo Bouscarrat & Carlos Ramisch
Centre Hospitalier de Cayenne, Cayenne, French Guiana
Aude Valois
HIPS Agency GmbH, Munich, Germany
Benjamin Queyriaux

Contributions

FD wrote the annotation guide and the paper. FD, GB, LB, CR and MT prepared the annotation campaign (organizations and first annotation runs). FD and BQ annotated the documents. GB did the adjudication. FD carried out the alignment of events and entities, with verification by AV. JBM and MT supervised the work. All authors proofread and edited the original manuscript.

Corresponding author

Correspondence to François Delon.

Ethics declarations

Conflict of interest

This work is part of the research of LB, which is co-financed by the Eura Nova Company.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 132 KB)

Supplementary file2 (DOCX 100 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Delon, F., Bédubourg, G., Bouscarrat, L. et al. Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus. Lang Resources & Evaluation (2024). https://doi.org/10.1007/s10579-024-09728-w


Accepted: 07 February 2024
Published: 05 March 2024
DOI: https://doi.org/10.1007/s10579-024-09728-w
