FullStop: punctuation and segmentation prediction for Dutch with transformers

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

When automatic speech recognition (ASR) is applied to Belgian Dutch, the output consists of an unsegmented stream of words without any punctuation. A necessary next step is to segment this stream and insert punctuation, making the ASR output more readable and easier to correct manually. We present what is, to our knowledge, the first publicly available punctuation insertion system for Dutch that works at a usable level. The model extends the approach of Guhr et al. (In: Swiss Text Analytics Conference. Shared task on Sentence End and Punctuation Prediction in NLG Text, 2021) to Dutch: we finetuned the Dutch language model RobBERT on a punctuation prediction sequence classification task, using two datasets, the Dutch side of Europarl and the SoNaR corpus. For every word in the input sequence, the model predicts the punctuation marker that follows it. For cases where the language is unknown or where code switching occurs, we extended an existing multilingual model with Dutch; previous work showed that such a multilingual model, based on "xlm-roberta-base", performs on par with, and sometimes even better than, its monolingual counterparts. The system was evaluated on in-domain data as a classifier and on out-of-domain data as a sentence segmentation system through full stop prediction. The sentence segmentation evaluations on out-of-domain data show that the models finetuned on SoNaR perform best, which can be attributed to SoNaR being a reference corpus covering different language registers. The multilingual models achieve even higher precision (at the cost of lower recall) than the monolingual models.
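
To make the setup concrete, the sketch below (our illustration, not the authors' released code) applies a FullStop-style token classifier to unpunctuated input with the Hugging Face transformers pipeline API: each word receives a label naming the punctuation mark that should follow it, with "0" meaning no punctuation. The checkpoint named here is the original multilingual FullStop model of Guhr et al. (English, German, French, Italian); the Dutch and extended multilingual checkpoints described in this paper are published under the Hugging Face account linked in Notes 7 and 9, and their exact identifiers should be taken from those pages.

```python
# Minimal sketch (assumed usage, not the authors' released code) of word-level
# punctuation prediction with a FullStop-style token classifier.
from transformers import pipeline

# Original multilingual FullStop checkpoint (EN/DE/FR/IT); for Dutch, substitute
# one of the checkpoints published under https://huggingface.co/oliverguhr.
MODEL = "oliverguhr/fullstop-punctuation-multilang-large"

punctuate = pipeline("token-classification", model=MODEL, aggregation_strategy="simple")

asr_output = "my name is clara and i live in berkeley california is this a question"

restored = []
for group in punctuate(asr_output):
    text = group["word"].strip()      # span of consecutive words sharing one predicted label
    label = group["entity_group"]     # predicted punctuation class, e.g. "0", ".", ",", "?"
    restored.append(text if label == "0" else text + label)

print(" ".join(restored))
```

Restoring capitalization (truecasing) is a separate task and is not handled by this sketch.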

Notes

  1. https://fortunelords.com/youtube-statistics/

  2. https://www.arts.kuleuven.be/ling/language-education-society/projects/sabed.

  3. https://vlo.clarin.eu.

  4. https://www.signon-project.eu/.

  5. http://hdl.handle.net/10032/tm-a2-h5.

  6. http://www.opensubtitles.org/.

  7. https://huggingface.co/oliverguhr/

  8. https://github.com/oliverguhr/deepmultilingualpunctuation

  9. https://huggingface.co/oliverguhr

  10. https://github.com/VincentCCL/Segment_FullStop

References

  • Aronoff, M. (2007). Language (linguistics). Scholarpedia, 2(5), 3175. Revision #121088.

  • Attia, M., Al-Badrashiny, M., & Diab, M. (2014). GWU-HASP: Hybrid Arabic spelling and punctuation corrector. Proceedings of the EMNLP 2014 workshop on Arabic natural language processing (ANLP) (pp. 148–154). Doha, Qatar: Association for Computational Linguistics. https://aclanthology.org/W14-3620

  • Che, X., Wang, C., Yang, H., & Meinel, C. (2016). Punctuation prediction for unsegmented transcript based on word vector. Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 654–658). Portorož, Slovenia: European Language Resources Association (ELRA). https://aclanthology.org/L16-1103

  • Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8440–8451).

  • de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., & Nissim, M. (2019). BERTje: A Dutch BERT Model. arXiv. https://arxiv.org/abs/1912.09582

  • Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: A Dutch RoBERTa-based language model. Findings of the association for computational linguistics: EMNLP 2020 (pp. 3255–3265). https://aclanthology.org/2020.findings-emnlp.292

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota. https://aclanthology.org/N19-1423

  • Guerreiro, N. M., Rei, R., & Batista, F. (2021). Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts. Expert Systems with Applications, 186, 115740.

  • Guhr, O., Schumann, A. K., Bahrmann, F., & Böhme, H. J. (2021). FullStop: Multilingual deep models for punctuation prediction. Proceedings of the swiss text analytics conference (pp. 14–16). Shared task on Sentence End and Punctuation Prediction in NLG Text. Winterthur, Switzerland.

  • Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. (2017). OpenNMT: Open-source toolkit for neural machine translation. Proceedings of ACL 2017, system demonstrations (pp. 67–72). Vancouver, Canada: Association for Computational Linguistics. https://aclanthology.org/P17-4012

  • Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings of machine translation summit X: Papers (pp. 79–86). Phuket, Thailand. https://aclanthology.org/2005.mtsummit-papers.11

  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. Proceedings of the 45th annual meeting of the association for computational linguistics companion volume. Proceedings of the demo and poster sessions (pp. 177–180). Prague, Czech Republic. https://aclanthology.org/P07-2045

  • Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.

  • Li, X., & Lin, E. (2020). A 43 language multilingual punctuation prediction neural network model. Interspeech (pp. 1067–1071).

  • Lison, P., & Tiedemann, J. (2016). OpenSubtitles 2016: Extracting large parallel corpora from movie and TV subtitles. Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 923–929). Portorož, Slovenia: European Language Resources Association (ELRA). https://aclanthology.org/L16-1147

  • Lu, W., & Ng, H. T. (2010). Better punctuation prediction with dynamic conditional random fields. Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP) (pp. 177–186). Association for Computational Linguistics: Cambridge, MA.

  • Masiello-Ruiz, J. M., Cuadrado, J. L. L., & Martínez, P. (2021). Participation of HULAT-UC3M in SEPP-NLG 2021 shared task (short paper). Proceedings of the swiss text analytics conference (pp. 14–16). Shared task on Sentence End and Punctuation Prediction in NLG Text. Winterthur, Switzerland.

  • Oostdijk, N., Goedertier, W., van Eynde, F., Boves, L., Martens, J. P., Moortgat, M., & Baayen, H. (2002). Experiences from the spoken Dutch corpus project. Proceedings of the third international conference on language resources and evaluation (LREC’02). Las Palmas, Canary Islands, Spain: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2002/pdf/98.pdf

  • Oostdijk, N., Reynaert, M., Hoste, V., & Schuurman, I. (2013). The construction of a 500 million word reference corpus of contemporary written Dutch. Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme. Springer Verlag.

  • Păiş, V., & Tufiş, D. (2022). Capitalization and punctuation restoration: A survey. Artificial Intelligence Review, 55(3), 1681–1722.

  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics. https://aclanthology.org/D14-1162 https://doi.org/10.3115/v1/D14-1162

  • Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., & Spyropoulos, C. D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 426–433). Toulouse, France: Association for Computational Linguistics. https://aclanthology.org/P01-1055 https://doi.org/10.3115/1073012.1073067

  • Shazeer, N., & Stern, M. (2018). Adafactor: Adaptive learning rates with sublinear memory cost. In J. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (Vol. 80, pp. 4596–4604). PMLR.

  • Stolcke, A., & Shriberg, E. (1996). Automatic linguistic segmentation of conversational speech. Proceedings of the fourth international conference on spoken language processing (ICSLP ’96) (Vol. 2, pp. 1005–1008).

  • Sunkara, M., Ronanki, S., Dixit, K., Bodapati, S., & Kirchhoff, K. (2020). Robust prediction of punctuation and truecasing for medical ASR. Proceedings of the first workshop on natural language processing for medical conversations (pp. 53–62). Online: Association for Computational Linguistics. https://aclanthology.org/2020.nlpmc-1.8

  • Susanto, R. H., Chieu, H. L., & Lu, W. (2016). Learning to capitalize with character-level recurrent neural networks: An empirical study. Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 2090–2095). Austin, Texas: Association for Computational Linguistics. https://aclanthology.org/D16-1225

  • Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. Proceedings of the eighth international conference on language resources and evaluation (LREC’12) (pp. 2214–2218). Istanbul, Turkey: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf

  • Tilk, O., & Alumäe, T. (2016). Bidirectional recurrent neural network with attention mechanism for punctuation restoration. Interspeech (pp. 3047–3051). San Francisco, USA.

  • Tuggener, D., & Aghaebrahimian, A. (2021). The sentence end and punctuation prediction in NLG Text (SEPP-NLG) Shared task 2021. Proceedings of the Swiss text analytics conference 2021.

  • Van Dyck, B., BabaAli, B., & Van Compernolle, D. (2021). A hybrid ASR system for Southern Dutch. Computational Linguistics in the Netherlands Journal, 11, 27–34. https://clinjournal.org/clinj/article/view/119

  • Vandeghinste, V., & Bulté, B. (2019). Linguistic proxies of readability: Comparing easy-to-read and regular newspaper Dutch. Computational Linguistics in the Netherlands Journal, 9, 81–100. https://www.clinjournal.org/clinj/article/view/97

  • Vandeghinste, V., Van Dyck, B., De Coster, M., & Goddefroy, M. (2022). BeCoS corpus: Belgian Covid-19 sign language corpus. A corpus for training sign language recognition and translation. Computational Linguistics in the Netherlands Journal, 12, 7–17.

  • Vandeghinste, V., Verwimp, L., Pelemans, J., & Wambacq, P. (2018). A comparison of different punctuation prediction approaches in a translation context. Proceedings of the 21st annual conference of the European association for machine translation (pp. 269–278). Universitat d’Alacant, Alacant, Spain.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon et al. (Eds.), Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.

  • Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: A professional framework for multimodality research. Proceedings of the fifth international conference on language resources and evaluation (LREC’06). Genoa, Italy: European Language Resources Association (ELRA).

Funding

The work in this paper is partly financed by the SignON project (https://signon-project.eu), which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 101017255. The SABeD project is funded by KU Leuven Internal Funding, Research Project 3H200610. Oliver Guhr has been funded by the European Social Fund (ESF), SAB grant number 100339497, and the European Regional Development Fund (ERDF) (ERDF-100346119).

Author information

Contributions

VV took the initiative, prepared the data, and performed the experiments on out-of-domain data and on the baseline. OG adapted the existing approach to Dutch, finetuned the models, and performed the classification experiments. Both authors reviewed the manuscript.

Corresponding author

Correspondence to Vincent Vandeghinste.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this appendix we present more detailed classification evaluation results, including precision and recall; see Tables 10, 11, 12 and 13. A short sketch illustrating how such per-class scores can be computed follows the table list.

Table 10 Monolingual Europarl model tested on NL Europarl data
Table 11 Monolingual SoNaR model tested on NL SoNaR data
Table 12 Multilingual Europarl model tested on NL Europarl data
Table 13 Multilingual Europarl+SoNaR model tested on NL Europarl + NL SoNaR data
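
As an illustration of how such per-class precision and recall figures can be computed for a model's word-level predictions (an assumed setup, not the authors' evaluation script; the label inventory and toy sequences below are examples), the following sketch uses scikit-learn's classification_report on aligned gold and predicted label sequences:

```python
# Illustrative sketch: per-class precision/recall/F1 for word-level punctuation labels.
# The label set and the toy sequences are assumptions made for the example.
from sklearn.metrics import classification_report

LABELS = ["0", ".", ",", "?", ":", "-"]  # "0" = no punctuation after the word

gold = ["0", "0", ",", "0", "0", ".", "0", "0", "?"]   # reference labels, one per word
pred = ["0", "0", ",", "0", ".", ".", "0", "0", "."]   # model predictions, one per word

print(classification_report(gold, pred, labels=LABELS, zero_division=0))
```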

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Vandeghinste, V., Guhr, O. FullStop: punctuation and segmentation prediction for Dutch with transformers. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09676-x

Download citation

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10579-023-09676-x

Keywords

Navigation