A spatially-aware algorithm for location extraction from structured documents

GeoInformatica

Abstract

Place names help locate and distinguish the geographic spaces where human activities and natural phenomena occur. Extracting place names at multiple spatial resolutions from text benefits several tasks, such as identifying the location of events, enriching gazetteers, and discovering connections between events and places. Most modern place name extraction approaches generalize linguistic rules and lexical features into universal rules and ignore the patterns inherent in place names within specific geographic contexts. As a result, they lack the spatial awareness needed to identify place names effectively across different geographic contexts, especially lesser-known place names. In this research, we develop a novel Spatially-Aware Location Extraction (SALE) algorithm for place name extraction from structured documents that uses a hybrid approach combining knowledge-driven and data-driven methods. We build a custom named entity recognition (NER) system based on the conditional random field (CRF) and train and fine-tune it using spatial features extracted from a dataset for a given geographic region. SALE uses multiple pathways, including the spatially tuned NER, to improve the efficacy of place name extraction. Experimental results on a large geographic region show that our algorithm outperforms well-known state-of-the-art place name recognizers.
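As a minimal sketch of what region-specific features for a CRF-based tagger might look like, the following assembles per-token feature dictionaries. The gazetteer entries, the preposition list, the feature names, and the helper functions are all hypothetical choices made for illustration; they are not taken from the article and should not be read as the SALE feature set.

```python
from typing import Dict, List, Set

# Hypothetical regional gazetteer (assumption): in practice this would be
# built from a dataset covering the geographic region of interest.
REGION_GAZETTEER: Set[str] = {"pune", "kothrud", "shivajinagar"}

# Small illustrative set of prepositions that often precede place names
# (assumption, not the article's list).
SPATIAL_PREPOSITIONS: Set[str] = {"in", "at", "near", "from", "to"}


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for tokens[i] using lexical and spatial cues."""
    word = tokens[i]
    # Sentence-boundary markers stand in for missing neighbours.
    prev_word = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "suffix3": word[-3:].lower(),
        # Spatial cue 1: the token appears in the regional gazetteer.
        "in_region_gazetteer": word.lower() in REGION_GAZETTEER,
        # Spatial cue 2: the previous token is a spatial preposition.
        "prev_is_spatial_prep": prev_word in SPATIAL_PREPOSITIONS,
        "prev_word": prev_word,
        "next_word": next_word,
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token in one tokenized sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in sentence_features(["Accident", "near", "Kothrud", "yesterday"]):
        print(feats)
```

A list of dictionaries per sentence is the kind of input that CRF toolkits commonly accept, so swapping the gazetteer for one built from a different region would retune the same tagger to a new geographic context without changing the rest of the pipeline.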

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

The authors thank Anup Adhikari and Nisha Kiran Poudel for annotating the news reports for the ground truth development to evaluate our algorithm.

Author information

Corresponding author

Correspondence to Praval Sharma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sharma, P., Samal, A., Soh, LK. et al. A spatially-aware algorithm for location extraction from structured documents. Geoinformatica 27, 645–679 (2023). https://doi.org/10.1007/s10707-022-00482-1

  • DOI: https://doi.org/10.1007/s10707-022-00482-1
