A spatially-aware algorithm for location extraction from structured documents

Sharma, Praval; Samal, Ashok; Soh, Leen-Kiat; Joshi, Deepti

doi:10.1007/s10707-022-00482-1


Published: 04 November 2022

Volume 27, pages 645–679 (2023)

GeoInformatica

Praval Sharma (ORCID: orcid.org/0000-0002-9485-1644)¹, Ashok Samal¹, Leen-Kiat Soh¹ & Deepti Joshi²


Abstract

Place names facilitate locating and distinguishing the geographic spaces where human activities and natural phenomena occur. Extracting place names at multiple spatial resolutions from text benefits several tasks, such as identifying the location of events, enriching gazetteers, and discovering connections between events and places. Most modern place name extraction approaches generalize linguistic rules and lexical features into universal rules and ignore the patterns inherent in place names within their geographic contexts. As a result, they lack the spatial awareness needed to identify place names effectively across different geographic contexts, especially lesser-known place names. In this research, we develop a novel Spatially-Aware Location Extraction (SALE) algorithm for place name extraction from structured documents that uses a hybrid approach comprising knowledge-driven and data-driven methods. We build a custom named entity recognition (NER) system based on the conditional random field (CRF) and train/fine-tune it using spatial features extracted from a dataset based on a given geographic region. SALE uses multiple pathways, including the spatially tuned NER, to enhance the efficacy of place name extraction. Experimental results on a large geographic region show that our algorithm outperforms well-known state-of-the-art place name recognizers.
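A CRF-based NER of the kind the abstract describes is typically fed one feature dictionary per token. The sketch below is only an illustration of that idea, not the feature set used in the paper: the function names, the preposition list, the example sentence, and the two-entry regional gazetteer are all assumptions made up for this example.

```python
# Hypothetical illustration of per-token features for a CRF place name tagger.
# Nothing here is taken from the paper; every name and value is an assumption.

SPATIAL_PREPOSITIONS = {"in", "at", "near", "from", "to"}  # assumed cue words


def token_features(tokens, i, regional_gazetteer):
    """Build a feature dict for tokens[i] from the word and its neighbours."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "suffix3": word[-3:].lower(),
        "prev_word": prev_word,
        "next_word": next_word,
        # Spatial cue: is the token preceded by a locative preposition?
        "prev_is_spatial_prep": prev_word in SPATIAL_PREPOSITIONS,
        # Spatial cue: does the token appear in a region-specific name list?
        "in_regional_gazetteer": word.lower() in regional_gazetteer,
    }


def sentence_features(tokens, regional_gazetteer):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i, regional_gazetteer)
            for i in range(len(tokens))]


# Made-up example sentence and gazetteer for a single region.
tokens = "The accident occurred near Kothrud in Pune".split()
gazetteer = {"pune", "kothrud"}
features = sentence_features(tokens, gazetteer)
```

A list of such dictionaries per sentence, paired with per-token labels, is a common input format for CRF toolkits. Restricting the gazetteer to one region is one simple way to make the features region-specific, under the assumptions stated above.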


Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

The authors thank Anup Adhikari and Nisha Kiran Poudel for annotating the news reports for the ground truth development to evaluate our algorithm.

Author information

Authors and Affiliations

School of Computing, University of Nebraska-Lincoln, Lincoln, NE, USA
Praval Sharma, Ashok Samal & Leen-Kiat Soh
Department of Cyber and Computer Sciences, The Citadel, Charleston, SC, USA
Deepti Joshi


Corresponding author

Correspondence to Praval Sharma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Sharma, P., Samal, A., Soh, LK. et al. A spatially-aware algorithm for location extraction from structured documents. Geoinformatica 27, 645–679 (2023). https://doi.org/10.1007/s10707-022-00482-1


Received: 21 February 2022
Revised: 01 August 2022
Accepted: 29 September 2022
Published: 04 November 2022
Issue Date: October 2023
DOI: https://doi.org/10.1007/s10707-022-00482-1

