
Text-based paper-level classification procedure for non-traditional sciences using a machine learning approach

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Science as a whole is organized into broad fields, and as a consequence, research, resources, students, etc., are also classified, assigned, or invited following a similar structure. Some fields have been established for centuries, while others are just flourishing. Funding, staff, and other support for a field are offered if there is some activity in it, commonly measured in terms of the number of published scientific papers. How can such papers be found? There exist well-respected listings in which scientific journals are ascribed to one or more knowledge fields. Such lists are human-made, but the complexity begins when a field covers more than one area of knowledge. How can we discern whether a particular paper is devoted to a field not considered in such lists? In this work, we propose a methodology able to classify the universe of papers into two classes: those belonging to the field of interest and those that do not. The proposed procedure learns from the titles and abstracts of papers published in monothematic or “pure” journals. Provided that such journals exist, the procedure can be applied to any field of knowledge. We tested the process with Geographic Information Science, a field that overlaps with Computer Science, Mathematics, Cartography, and others, which makes the task particularly difficult. We also evaluated our procedure and analyzed its results with three different criteria, illustrating its power and capabilities. Our proposed solution reached results similar to those of human taggers and comparable to state-of-the-art related work.
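The abstract's core idea is to represent each paper by its title and abstract as text and to train a binary classifier that separates papers from the field of interest (learned from monothematic journals) from all others. The sketch below is a toy illustration under stated assumptions, not the authors' pipeline: it uses gensim's Doc2Vec (note 13 below) with a logistic-regression classifier, and the paper texts and labels are invented.

    # Toy sketch (assumptions: Doc2Vec embeddings + logistic regression; data invented).
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess
    from sklearn.linear_model import LogisticRegression

    # Illustrative (title + abstract snippet, label): 1 = field of interest, 0 = other.
    papers = [
        ("Spatial data infrastructures for web mapping services", 1),
        ("A generalization approach for digital cartographic gazetteers", 1),
        ("Deep residual networks for large-scale image classification", 0),
        ("Numerical solutions of nonlinear partial differential equations", 0),
    ]

    # Tokenize each title+abstract and wrap it as a TaggedDocument for Doc2Vec.
    corpus = [TaggedDocument(simple_preprocess(text), [i])
              for i, (text, _) in enumerate(papers)]

    doc2vec = Doc2Vec(vector_size=50, min_count=1, epochs=40)
    doc2vec.build_vocab(corpus)
    doc2vec.train(corpus, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

    # Embed every paper and fit a binary classifier on top of the vectors.
    X = [doc2vec.infer_vector(simple_preprocess(text)) for text, _ in papers]
    y = [label for _, label in papers]
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Classify an unseen title/abstract: 1 = belongs to the field, 0 = does not.
    new_paper = "Uncertainty propagation in geographic information systems"
    print(clf.predict([doc2vec.infer_vector(simple_preprocess(new_paper))]))

A transformer encoder such as bert-base-cased (note 14) could replace the Doc2Vec representation; the binary framing, trained on positives drawn from "pure" journals, stays the same.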


Notes

  1. https://www.webofscience.com/wos/woscc/basic-search.

  2. https://www.scopus.com/home.uri.

  3. https://incites.help.clarivate.com/Content/Research-Areas/citation-topics.htm.

  4. https://github.com/LucasLopesSI/SIG-Classifier.

  5. https://onlinelibrary.wiley.com/journal/14679671.

  6. https://www.journals.elsevier.com/computers-and-geosciences.

  7. https://www.tandfonline.com/toc/tgis20/current.

  8. https://www.utpjournals.press/loi/cart.

  9. https://www.hindawi.com/journals/jtm/.

  10. https://cdnsciencepub.com/toc/geomat/73/2.

  11. https://dl.acm.org/journal/klu-gein.

  12. https://github.com/tensorflow/tensor2tensor.

  13. https://radimrehurek.com/gensim/models/doc2vec.html.

  14. https://huggingface.co/bert-base-cased.

  15. https://pypi.org/project/krippendorff/.
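Note 15 points to the krippendorff package, which computes Krippendorff's alpha, an agreement coefficient that fits the abstract's comparison between the classifier and human taggers. A minimal usage sketch with an invented label matrix:

    # Minimal sketch (label matrix invented): agreement between the classifier
    # and two hypothetical human taggers, measured with Krippendorff's alpha.
    import numpy as np
    import krippendorff

    # Rows = raters (classifier, tagger A, tagger B); columns = papers.
    # 1 = "belongs to the field", 0 = "does not"; np.nan marks a missing rating.
    ratings = np.array([
        [1, 0, 1, 1, 0, 1],
        [1, 0, 1, 0, 0, 1],
        [1, np.nan, 1, 1, 0, 1],
    ])

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="nominal")
    print(f"Krippendorff's alpha: {alpha:.3f}")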


Acknowledgements

The authors acknowledge the grant provided by the IDEAIS Project (CYTED: Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo), Number 519RT0579.

Author information


Contributions

DM: Conceptualization, Methodology, Data Curation, Software, Investigation, Writing—original draft, Visualization, Funding acquisition. CL-V: Conceptualization, Methodology, Investigation, Writing—original draft, Visualization, Funding acquisition. LLR: Methodology, Data Curation, Software, Investigation, Writing—original draft, Visualization. NTR: Methodology, Investigation, Writing. JdJPA: Methodology, Investigation, Writing.

Corresponding author

Correspondence to Daniela Moctezuma.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Moctezuma, D., López-Vázquez, C., Lopes, L. et al. Text-based paper-level classification procedure for non-traditional sciences using a machine learning approach. Knowl Inf Syst 66, 1503–1520 (2024). https://doi.org/10.1007/s10115-023-02023-0

