FullStop: punctuation and segmentation prediction for Dutch with transformers

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

When automatic speech recognition (ASR) is applied to Belgian Dutch, the output consists of an unsegmented stream of words without any punctuation. A necessary next step is to segment this stream and insert punctuation, making the ASR output more readable and easier to correct manually. We present what is, to our knowledge, the first publicly available punctuation insertion system for Dutch that works at a usable level. The model extends the approach of Guhr et al. (In: Swiss Text Analytics Conference. Shared task on Sentence End and Punctuation Prediction in NLG Text, 2021) to Dutch: we finetuned the Dutch language model RobBERT on a punctuation prediction sequence classification task, using two datasets, the Dutch side of Europarl and the SoNaR corpus. For every word in the input sequence, the model predicts the punctuation marker that follows it. For cases where the language is unknown or where code switching occurs, we extended an existing multilingual model with Dutch; previous work showed that such a multilingual model, based on "xlm-roberta-base", performs on par with, and sometimes even better than, its monolingual counterparts. The system was evaluated on in-domain data as a classifier and on out-of-domain data as a sentence segmentation system through full stop prediction. The sentence segmentation evaluations on out-of-domain data show that the models finetuned on SoNaR perform best, which can be attributed to SoNaR being a reference corpus covering different language registers. The multilingual models achieve even higher precision (at the cost of lower recall) than the monolingual models.
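
To make the setup concrete, the sketch below (our illustration, not the authors' released code) applies a FullStop-style token classifier to unpunctuated input with the Hugging Face transformers pipeline API: each word receives a label naming the punctuation mark that should follow it, with "0" meaning no punctuation. The checkpoint named here is the original multilingual FullStop model of Guhr et al. (English, German, French, Italian); the Dutch and extended multilingual checkpoints described in this paper are published under the Hugging Face account linked in Notes 7 and 9, and their exact identifiers should be taken from those pages.

```python
# Minimal sketch (assumed usage, not the authors' released code) of word-level
# punctuation prediction with a FullStop-style token classifier.
from transformers import pipeline

# Original multilingual FullStop checkpoint (EN/DE/FR/IT); for Dutch, substitute
# one of the checkpoints published under https://huggingface.co/oliverguhr.
MODEL = "oliverguhr/fullstop-punctuation-multilang-large"

punctuate = pipeline("token-classification", model=MODEL, aggregation_strategy="simple")

asr_output = "my name is clara and i live in berkeley california is this a question"

restored = []
for group in punctuate(asr_output):
    text = group["word"].strip()      # span of consecutive words sharing one predicted label
    label = group["entity_group"]     # predicted punctuation class, e.g. "0", ".", ",", "?"
    restored.append(text if label == "0" else text + label)

print(" ".join(restored))
```

Restoring capitalization (truecasing) is a separate task and is not handled by this sketch.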

Notes

  1. https://fortunelords.com/youtube-statistics/

  2. https://www.arts.kuleuven.be/ling/language-education-society/projects/sabed.

  3. https://vlo.clarin.eu.

  4. https://www.signon-project.eu/.

  5. http://hdl.handle.net/10032/tm-a2-h5.

  6. http://www.opensubtitles.org/.

  7. https://huggingface.co/oliverguhr/

  8. https://github.com/oliverguhr/deepmultilingualpunctuation

  9. https://huggingface.co/oliverguhr

  10. https://github.com/VincentCCL/Segment_FullStop

References

  • Aronoff, M. (2007). Language (linguistics). Scholarpedia, 2(5), 3175. Revision #121088.

  • Attia, M., Al-Badrashiny, M., & Diab, M. (2014). GWU-HASP: Hybrid Arabic spelling and punctuation corrector. Proceedings of the EMNLP 2014 workshop on Arabic natural language processing (ANLP) (pp. 148–154). Doha, Qatar: Association for Computational Linguistics. https://aclanthology.org/W14-3620

  • Che, X., Wang, C., Yang, H., & Meinel, C. (2016). Punctuation prediction for unsegmented transcript based on word vector. Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 654–658). Portorož, Slovenia: European Language Resources Association (ELRA). https://aclanthology.org/L16-1103

  • Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8440–8451).

  • de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., & Nissim, M. (2019). BERTje: A Dutch BERT Model. arXiv. https://arxiv.org/abs/1912.09582

  • Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: A Dutch RoBERTa-based language model. Findings of the association for computational linguistics: EMNLP 2020 (pp. 3255–3265). https://aclanthology.org/2020.findings-emnlp.292

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota. https://aclanthology.org/N19-1423

  • Guerreiro, N. M., Rei, R., & Batista, F. (2021). Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts. Expert Systems with Applications, 186, 115740.

  • Guhr, O., Schumann, A. K., Bahrmann, F., & Böhme, H. J. (2021). FullStop: Multilingual deep models for punctuation prediction. Proceedings of the swiss text analytics conference (pp. 14–16). Shared task on Sentence End and Punctuation Prediction in NLG Text. Winterthur, Switzerland.

  • Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. (2017). OpenNMT: Open-source toolkit for neural machine translation. Proceedings of ACL 2017, system demonstrations (pp. 67–72). Vancouver, Canada: Association for Computational Linguistics. https://aclanthology.org/P17-4012

  • Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings of machine translation summit X: Papers (pp. 79–86). Phuket, Thailand. https://aclanthology.org/2005.mtsummit-papers.11

  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. Proceedings of the 45th annual meeting of the association for computational linguistics companion volume. Proceedings of the demo and poster sessions (pp. 177–180). Prague, Czech Republic. https://aclanthology.org/P07-2045

  • Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.

  • Li, X., & Lin, E. (2020). A 43 language multilingual punctuation prediction neural network model. Interspeech (pp. 1067–1071).

  • Lison, P., & Tiedemann, J. (2016). OpenSubtitles 2016: Extracting large parallel corpora from movie and TV subtitles. Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 923–929). Portorož, Slovenia: European Language Resources Association (ELRA). https://aclanthology.org/L16-1147

  • Lu, W., & Ng, H. T. (2010). Better punctuation prediction with dynamic conditional random fields. Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP) (pp. 177–186). Association for Computational Linguistics: Cambridge, MA.

  • Masiello-Ruiz, J. M., Cuadrado, J. L. L., & Martínez, P. (2021). Participation of HULAT-UC3M in SEPP-NLG 2021 shared task (short paper). Proceedings of the swiss text analytics conference (pp. 14–16). Shared task on Sentence End and Punctuation Prediction in NLG Text. Winterthur, Switzerland.

  • Oostdijk, N., Goedertier, W., van Eynde, F., Boves, L., Martens, J. P., Moortgat, M., & Baayen, H. (2002). Experiences from the spoken Dutch corpus project. Proceedings of the third international conference on language resources and evaluation (LREC’02). Las Palmas, Canary Islands, Spain: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2002/pdf/98.pdf

  • Oostdijk, N., Reynaert, M., Hoste, V., & Schuurman, I. (2013). The construction of a 500 million word reference corpus of contemporary written Dutch. Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme. Springer Verlag.

  • Păiş, V., & Tufiş, D. (2022). Capitalization and punctuation restoration: A survey. Artificial Intelligence Review, 55(3), 1681–1722.

  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics. https://aclanthology.org/D14-1162 https://doi.org/10.3115/v1/D14-1162

  • Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., & Spyropoulos, C. D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 426–433). Toulouse, France: Association for Computational Linguistics. https://aclanthology.org/P01-1055 https://doi.org/10.3115/1073012.1073067

  • Shazeer, N., & Stern, M. (2018). Adafactor: Adaptive learning rates with sublinear memory cost. In J. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (Vol. 80, pp. 4596–4604). PMLR.

  • Stolcke, A., & Shriberg, E. (1996). Automatic linguistic segmentation of conversational speech. Proceedings of the fourth international conference on spoken language processing (ICSLP ’96) (Vol. 2, pp. 1005–1008).

  • Sunkara, M., Ronanki, S., Dixit, K., Bodapati, S., & Kirchhoff, K. (2020). Robust prediction of punctuation and truecasing for medical ASR. Proceedings of the first workshop on natural language processing for medical conversations (pp. 53–62). Online: Association for Computational Linguistics. https://aclanthology.org/2020.nlpmc-1.8

  • Susanto, R. H., Chieu, H. L., & Lu, W. (2016). Learning to capitalize with character-level recurrent neural networks: An empirical study. Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 2090–2095). Austin, Texas: Association for Computational Linguistics. https://aclanthology.org/D16-1225

  • Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. Proceedings of the eighth international conference on language resources and evaluation (LREC’12) (pp. 2214–2218). Istanbul, Turkey: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf

  • Tilk, O., & Alumäe, T. (2016). Bidirectional recurrent neural network with attention mechanism for punctuation restoration. Interspeech (pp. 3047–3051). San Francisco, USA.

  • Tuggener, D., & Aghaebrahimian, A. (2021). The sentence end and punctuation prediction in NLG Text (SEPP-NLG) Shared task 2021. Proceedings of the Swiss text analytics conference 2021.

  • Van Dyck, B., BabaAli, B., & Van Compernolle, D. (2021). A hybrid ASR system for Southern Dutch. Computational Linguistics in the Netherlands Journal, 11, 27–34. https://clinjournal.org/clinj/article/view/119

  • Vandeghinste, V., & Bulté, B. (2019). Linguistic proxies of readability: Comparing easy-to-read and regular newspaper Dutch. Computational Linguistics in the Netherlands Journal, 9, 81–100. https://www.clinjournal.org/clinj/article/view/97

  • Vandeghinste, V., Van Dyck, B., De Coster, M., & Goddefroy, M. (2022). BeCoS corpus: Belgian Covid-19 sign language corpus. A corpus for training sign language recognition and translation. Computational Linguistics in the Netherlands Journal, 12, 7–17.

  • Vandeghinste, V., Verwimp, L., Pelemans, J., & Wambacq, P. (2018). A comparison of different punctuation prediction approaches in a translation context. Proceedings of the 21st annual conference of the European association for machine translation (pp. 269–278). Universitat d’Alacant, Alacant, Spain.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon et al. (Eds.), Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.

  • Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: A professional framework for multimodality research. Proceedings of the fifth international conference on language resources and evaluation (LREC’06). Genoa, Italy: European Language Resources Association (ELRA).

Funding

The work in this paper is partly financed by the SignON project (https://signon-project.eu), which has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 101017255. The SABeD project is funded by KU Leuven Internal Funding, Research Project 3H200610. Oliver Guhr has been funded by the European Social Fund (ESF), SAB grant number 100339497, and the European Regional Development Fund (ERDF) (ERDF-100346119).

Author information

Contributions

VV took the initiative, prepared the data, and performed the experiments on out-of-domain data and on the baseline. OG adapted the existing approach to Dutch, finetuned the models, and performed the classification experiments. Both authors reviewed the manuscript.

Corresponding author

Correspondence to Vincent Vandeghinste.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this appendix we present more detailed classification evaluation results, including precision and recall; see Tables 10, 11, 12 and 13. A short sketch illustrating how such per-class scores can be computed follows the table list.

Table 10 Monolingual Europarl model tested on NL Europarl data
Table 11 Monolingual SoNaR model tested on NL SoNaR data
Table 12 Multilingual Europarl model tested on NL Europarl data
Table 13 Multilingual Europarl+SoNaR model tested on NL Europarl + NL SoNaR data
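
As an illustration of how such per-class precision and recall figures can be computed for a model's word-level predictions (an assumed setup, not the authors' evaluation script; the label inventory and toy sequences below are examples), the following sketch uses scikit-learn's classification_report on aligned gold and predicted label sequences:

```python
# Illustrative sketch: per-class precision/recall/F1 for word-level punctuation labels.
# The label set and the toy sequences are assumptions made for the example.
from sklearn.metrics import classification_report

LABELS = ["0", ".", ",", "?", ":", "-"]  # "0" = no punctuation after the word

gold = ["0", "0", ",", "0", "0", ".", "0", "0", "?"]   # reference labels, one per word
pred = ["0", "0", ",", "0", ".", ".", "0", "0", "."]   # model predictions, one per word

print(classification_report(gold, pred, labels=LABELS, zero_division=0))
```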

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Vandeghinste, V., Guhr, O. FullStop: punctuation and segmentation prediction for Dutch with transformers. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09676-x

Download citation

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10579-023-09676-x

Keywords

Navigation