Semi-supervised geological disasters named entity recognition using few labeled data

Lei, Xinya; Song, Weijing; Fan, Runyu; Feng, Ruyi; Wang, Lizhe

doi:10.1007/s10707-022-00474-1

Published: 18 October 2022

GeoInformatica, volume 27, pages 263–288 (2023)

Xinya Lei^1,2,
Weijing Song^1,2,
Runyu Fan^1,2,
Ruyi Feng^1,2 &
…
Lizhe Wang^1,2

Abstract

Geological disaster Named Entity Recognition (NER) aims to recognize entities that reflect disaster event information in unstructured texts, so that a geohazard knowledge graph can be constructed to provide a reference for disaster emergency response. Without training on large-scale labeled data, current deep-learning-based NER methods cannot identify specific geological disaster entities in geological disaster situation reports, yet manually labeling such reports is tedious and time-consuming. We therefore present Semi-GDNER, a semi-supervised geological disaster NER approach that can effectively extract six kinds of geological disaster entities when only a small amount of manually labeled data and some unlabeled in-domain data are available. The approach has two stages: (1) the parameters of the pre-trained BERT-base model are transferred to the BERT layer of the backbone model, BERT-BiLSTM-CRF, and the backbone model is trained with the few labeled data; (2) training of the backbone model continues while the training set is expanded with unlabeled data through a self-training (ST) strategy. To reduce noise in the second stage, only pseudo-labeled samples with high confidence are selected to join the training set in each ST iteration. Experiments on our constructed Geological Disaster NER dataset show that our approach achieves a higher F1 (0.88) than other NER approaches (five supervised NER approaches and a semi-supervised NER approach whose ST strategy expands the training set with all pseudo-labeled data), demonstrating its effectiveness. Furthermore, experiments on four general Chinese NER datasets show that the framework of our approach is transferable.
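The self-training stage with confidence-based selection can be illustrated with a small sketch. This is not the authors' implementation: the `ToyTagger` class is a hypothetical frequency-table stand-in for the BERT-BiLSTM-CRF backbone, the entity labels and English tokens are invented for illustration, and scoring a sentence by its lowest token probability with a threshold of 0.9 is an assumption, since the abstract does not say how confidence is computed.

```python
from collections import Counter, defaultdict


class ToyTagger:
    """Hypothetical stand-in for the backbone model: a per-token label table."""

    def fit(self, sentences):
        # sentences: list of (tokens, labels) pairs
        self.counts = defaultdict(Counter)
        for tokens, labels in sentences:
            for tok, lab in zip(tokens, labels):
                self.counts[tok][lab] += 1
        return self

    def predict(self, tokens):
        """Return (labels, confidence) for one sentence.

        Confidence is the lowest per-token probability (an assumption).
        """
        labels, probs = [], []
        for tok in tokens:
            seen = self.counts.get(tok)
            if not seen:
                # Unseen token: guess "O" with low confidence.
                labels.append("O")
                probs.append(0.5)
                continue
            lab, n = seen.most_common(1)[0]
            labels.append(lab)
            probs.append(n / sum(seen.values()))
        return labels, (min(probs) if probs else 0.0)


def self_train(labeled, unlabeled, threshold=0.9, max_iters=5):
    """Grow the training set with high-confidence pseudo-labeled sentences."""
    train = list(labeled)
    pool = list(unlabeled)
    model = ToyTagger().fit(train)            # stage 1: train on the few labels
    for _ in range(max_iters):                # stage 2: self-training iterations
        selected, remaining = [], []
        for tokens in pool:
            labels, conf = model.predict(tokens)
            if conf >= threshold:
                selected.append((tokens, labels))   # keep confident pseudo-labels
            else:
                remaining.append(tokens)            # leave the rest unlabeled
        if not selected:
            break                             # nothing confident enough: stop
        train.extend(selected)
        pool = remaining
        model = ToyTagger().fit(train)        # retrain on the expanded set
    return model, train, pool


# Invented toy data; the label names are illustrative only.
labeled = [
    (["landslide", "hit", "Wuhan"], ["B-TYPE", "O", "B-LOC"]),
    (["debris", "flow", "hit", "Enshi"], ["B-TYPE", "I-TYPE", "O", "B-LOC"]),
]
unlabeled = [
    ["landslide", "hit", "Enshi"],      # every token already seen
    ["collapse", "near", "Yichang"],    # unseen tokens, low confidence
]

model, train, pool = self_train(labeled, unlabeled)
print(len(train), pool)
```

In this toy run the first unlabeled sentence is pseudo-labeled and added to the training set, while the second stays in the pool because its confidence falls below the threshold. Setting `threshold=0.0` would instead admit every pseudo-labeled sentence, which corresponds to the ST baseline the abstract compares against.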

Data Availability

The Geo-Disaster-NER dataset and the code for the Semi-GDNER approach are available on GitHub at https://github.com/xiaoleicug/GeoDisaster-NER. The other NER datasets are derived from public sources; links to them are provided in the article.

Notes

https://nlp.stanford.edu/software/CRF-NER.shtml
https://github.com/Lynten/stanford-corenlp
https://github.com/google-research/bert
https://huggingface.co/bert-base-chinese/tree/main
http://www.mnr.gov.cn/gk/dzzhzqxqbg/
https://bosonnlp.com/dev/resource
https://github.com/zjy-ucas/ChineseNER

Acknowledgements

The authors thank the researchers for sharing their data. The authors are equally grateful to the editors and reviewers for their valuable comments on the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (Nos. 41925007 and U21A2013) and the Hubei Natural Science Foundation of China (No. 2019CFA023).

Author information

Authors and Affiliations

School of Computer Science, China University of Geosciences, Wuhan, 430074, China
Xinya Lei, Weijing Song, Runyu Fan, Ruyi Feng & Lizhe Wang
Hubei Key Laboratory of Intelligent Geo-Information Processing, Wuhan, 430074, China
Xinya Lei, Weijing Song, Runyu Fan, Ruyi Feng & Lizhe Wang

Corresponding author

Correspondence to Lizhe Wang.

Ethics declarations

Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lei, X., Song, W., Fan, R. et al. Semi-supervised geological disasters named entity recognition using few labeled data. Geoinformatica 27, 263–288 (2023). https://doi.org/10.1007/s10707-022-00474-1

Received: 09 November 2021
Revised: 26 July 2022
Accepted: 17 August 2022
Published: 18 October 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s10707-022-00474-1
