
Text-based paper-level classification procedure for non-traditional sciences using a machine learning approach

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Science as a whole is organized into broad fields, and as a consequence, research, resources, students, etc., are also classified, assigned, or invited following a similar structure. Some fields have been established for centuries, while others are just flourishing. Funding, staff, and other support for a field are offered if there is some activity in it, commonly measured in terms of the number of published scientific papers. How can such papers be found? There exist well-respected listings in which scientific journals are ascribed to one or more knowledge fields. Such lists are human-made, but the complexity begins when a field covers more than one area of knowledge. How can we discern whether a particular paper is devoted to a field not considered in such lists? In this work, we propose a methodology able to classify the universe of papers into two classes: those belonging to the field of interest and those that do not. The proposed procedure learns from the titles and abstracts of papers published in monothematic or “pure” journals. Provided that such journals exist, the procedure can be applied to any field of knowledge. We tested the process with Geographic Information Science, a field that overlaps with Computer Science, Mathematics, Cartography, and others, which makes the task particularly difficult. We also evaluated our procedure and analyzed its results with three different criteria, illustrating its power and capabilities. Our proposed solution reached results similar to those of human taggers and comparable to state-of-the-art related work.
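The abstract's core idea is to represent each paper by its title and abstract as text and to train a binary classifier that separates papers from the field of interest (learned from monothematic journals) from all others. The sketch below is a toy illustration under stated assumptions, not the authors' pipeline: it uses gensim's Doc2Vec (note 13 below) with a logistic-regression classifier, and the paper texts and labels are invented.

    # Toy sketch (assumptions: Doc2Vec embeddings + logistic regression; data invented).
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess
    from sklearn.linear_model import LogisticRegression

    # Illustrative (title + abstract snippet, label): 1 = field of interest, 0 = other.
    papers = [
        ("Spatial data infrastructures for web mapping services", 1),
        ("A generalization approach for digital cartographic gazetteers", 1),
        ("Deep residual networks for large-scale image classification", 0),
        ("Numerical solutions of nonlinear partial differential equations", 0),
    ]

    # Tokenize each title+abstract and wrap it as a TaggedDocument for Doc2Vec.
    corpus = [TaggedDocument(simple_preprocess(text), [i])
              for i, (text, _) in enumerate(papers)]

    doc2vec = Doc2Vec(vector_size=50, min_count=1, epochs=40)
    doc2vec.build_vocab(corpus)
    doc2vec.train(corpus, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

    # Embed every paper and fit a binary classifier on top of the vectors.
    X = [doc2vec.infer_vector(simple_preprocess(text)) for text, _ in papers]
    y = [label for _, label in papers]
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Classify an unseen title/abstract: 1 = belongs to the field, 0 = does not.
    new_paper = "Uncertainty propagation in geographic information systems"
    print(clf.predict([doc2vec.infer_vector(simple_preprocess(new_paper))]))

A transformer encoder such as bert-base-cased (note 14) could replace the Doc2Vec representation; the binary framing, trained on positives drawn from "pure" journals, stays the same.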


Notes

  1. https://www.webofscience.com/wos/woscc/basic-search.

  2. https://www.scopus.com/home.uri.

  3. https://incites.help.clarivate.com/Content/Research-Areas/citation-topics.htm.

  4. https://github.com/LucasLopesSI/SIG-Classifier.

  5. https://onlinelibrary.wiley.com/journal/14679671.

  6. https://www.journals.elsevier.com/computers-and-geosciences.

  7. https://www.tandfonline.com/toc/tgis20/current.

  8. https://www.utpjournals.press/loi/cart.

  9. https://www.hindawi.com/journals/jtm/.

  10. https://cdnsciencepub.com/toc/geomat/73/2.

  11. https://dl.acm.org/journal/klu-gein.

  12. https://github.com/tensorflow/tensor2tensor.

  13. https://radimrehurek.com/gensim/models/doc2vec.html.

  14. https://huggingface.co/bert-base-cased.

  15. https://pypi.org/project/krippendorff/.
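Note 15 points to the krippendorff package, which computes Krippendorff's alpha, an agreement coefficient that fits the abstract's comparison between the classifier and human taggers. A minimal usage sketch with an invented label matrix:

    # Minimal sketch (label matrix invented): agreement between the classifier
    # and two hypothetical human taggers, measured with Krippendorff's alpha.
    import numpy as np
    import krippendorff

    # Rows = raters (classifier, tagger A, tagger B); columns = papers.
    # 1 = "belongs to the field", 0 = "does not"; np.nan marks a missing rating.
    ratings = np.array([
        [1, 0, 1, 1, 0, 1],
        [1, 0, 1, 0, 0, 1],
        [1, np.nan, 1, 1, 0, 1],
    ])

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="nominal")
    print(f"Krippendorff's alpha: {alpha:.3f}")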


Acknowledgements

The authors acknowledge the grant provided by the IDEAIS Project (CYTED: Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo), Number 519RT0579.

Author information


Contributions

DM: Conceptualization, Methodology, Data Curation, Software, Investigation, Writing—original draft, Visualization, Funding acquisition. CL-V: Conceptualization, Methodology, Investigation, Writing—original draft, Visualization, Funding acquisition. LLR: Methodology, Data Curation, Software, Investigation, Writing—original draft, Visualization. NTR: Methodology, Investigation, Writing. JdJPA: Methodology, Investigation, Writing.

Corresponding author

Correspondence to Daniela Moctezuma.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Moctezuma, D., López-Vázquez, C., Lopes, L. et al. Text-based paper-level classification procedure for non-traditional sciences using a machine learning approach. Knowl Inf Syst 66, 1503–1520 (2024). https://doi.org/10.1007/s10115-023-02023-0

