Abstract
Event-based surveillance (EBS) involves analysing an ever-increasing volume of documents, which calls for automated processing to support human analysts. Few annotated corpora are available for evaluating information extraction tools in the EBS domain. The main objective of this work was to build a corpus of documents representative of those collected in current EBS information systems, and to annotate them with events and their novelty. We proposed new definitions of infectious events and of their novelty, suited to the background work of analysts in the EBS domain, and we compiled a corpus of 305 documents describing 283 infectious events. Of these documents, 36 were in French, representing a total of 11 events, and the remainder were in English. We annotated novelty for the 110 most recent documents in the corpus, resulting in 101 events. The inter-annotator agreement was 0.74 for event identification (F1-score) and 0.69 [95% CI: 0.51; 0.88] (Kappa) for novelty annotation. The overall agreement for entity annotation was lower, with marked variation according to the type of entity considered (range 0.30–0.68). This corpus is a useful tool for creating and evaluating algorithms and methods developed by EBS research teams for event detection and novelty annotation, with the aim of improving the operational processing of document flows. The small size of this corpus makes it less suitable for training natural language processing models, although this limitation tends to fade when few-shot learning methods are used.
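As an illustration of the two agreement measures reported above, the following minimal sketch computes a pairwise F1-score between two annotators' sets of identified events and Cohen's Kappa for their novelty labels. The function names, the event identifiers, and the toy labels are assumptions made for this example only; the abstract does not specify how events were matched between annotators, and the figures produced here are not the corpus results.

```python
from collections import Counter


def pairwise_f1(events_a, events_b):
    """F1 agreement between two annotators' sets of identified events.

    Assumption: an event counts as matched when both annotators produced
    the same identifier. Swapping the annotators swaps precision and
    recall, so the F1 value is symmetric.
    """
    a, b = set(events_a), set(events_b)
    if not a and not b:
        return 1.0  # nothing annotated by either annotator
    overlap = len(a & b)
    return 2 * overlap / (len(a) + len(b))


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical toy data, not taken from the corpus.
events_annotator_1 = {"e1", "e2", "e3", "e4"}
events_annotator_2 = {"e2", "e3", "e4", "e5"}
novelty_annotator_1 = ["new", "new", "old", "old", "new",
                       "old", "old", "new", "old", "old"]
novelty_annotator_2 = ["new", "old", "old", "old", "new",
                       "old", "new", "new", "old", "old"]

f1 = pairwise_f1(events_annotator_1, events_annotator_2)
kappa = cohen_kappa(novelty_annotator_1, novelty_annotator_2)
print(f"F1 = {f1:.2f}, Kappa = {kappa:.2f}")
```

Kappa corrects the raw agreement for the agreement expected by chance, which is why it is the more conservative of the two figures when one novelty label dominates.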
Data availability
The corpus generated during the current study, including the source URLs of the documents and the annotations, is available from the corresponding author on reasonable request (Zenodo repository: https://doi.org/10.5281/zenodo.8414785). The source texts may be published under license and so are not publicly available.
Abbreviations
- EBS: Event-based surveillance
- EBS-IS: Event-based surveillance information system
- IAA: Inter-annotator agreement
- STE: Stack Exchange
- TREC: Text REtrieval Conference
- UMLS: Unified Medical Language System
Acknowledgements
The Eura Nova Company supported the technical deployment of the annotation platform.
Funding
We carried out the annotation of the documents on an online platform, whose hosting was financed by the Eura Nova Company.
Author information
Contributions
FD wrote the annotation guide and the paper. FD, GB, LB, CR and MT prepared the annotation campaign (organization and first annotation runs). FD and BQ annotated the documents. GB performed the adjudication. FD carried out the alignment of events and entities, with verification by AV. JBM and MT supervised the work. All authors proofread and edited the original manuscript.
Ethics declarations
Conflict of interest
This work is part of the research of LB, which is co-financed by the Eura Nova Company.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
About this article
Cite this article
Delon, F., Bédubourg, G., Bouscarrat, L. et al. Infectious risk events and their novelty in event-based surveillance: new definitions and annotated corpus. Lang Resources & Evaluation (2024). https://doi.org/10.1007/s10579-024-09728-w
DOI: https://doi.org/10.1007/s10579-024-09728-w