Abstract
When automated speech recognition (ASR) is applied to Belgian Dutch, the output is an unsegmented stream of words without any punctuation. A next step is to segment this stream and insert punctuation, which makes the ASR output more readable and easier to correct manually. We present what is, as far as we know, the first publicly available punctuation insertion system for Dutch that functions at a usable level. The model extends the approach of Guhr et al. (In: Swiss Text Analytics Conference. Shared task on Sentence End and Punctuation Prediction in NLG Text, 2021) to Dutch: we finetuned the Dutch language model RobBERT on a punctuation prediction sequence classification task. The model was finetuned on two datasets: the Dutch side of Europarl and the SoNaR corpus. For every word in the input sequence, the model predicts the punctuation marker that follows the word. For cases where the language is unknown or where code switching applies, we have extended an existing multilingual model with Dutch. Previous work showed that such a multilingual model, based on “xlm-roberta-base”, performs on par with, or sometimes even better than, the monolingual models. The system was evaluated on in-domain data as a classifier and on out-of-domain data as a sentence segmentation system through full stop prediction. The sentence segmentation evaluations on out-of-domain data show that models finetuned on SoNaR give the best results, which can be attributed to SoNaR being a reference corpus containing different language registers. The multilingual models show an even better precision (at the cost of a lower recall) compared to the monolingual models.
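As an illustration of the per-word labelling scheme described in the abstract, the following minimal Python sketch shows how predicted markers could be reattached to the words, how full stop predictions could be used to segment the stream into sentences, and how precision and recall of full stop prediction could be computed. It does not call the finetuned models; the labels are supplied by hand. The label "0" for "no punctuation", the set of sentence-final markers, and all function names are assumptions made for this example and are not taken from the article.

```python
from typing import List, Tuple

NO_PUNCT = "0"              # assumed label: no punctuation follows the word
SENTENCE_END = {".", "?"}   # assumed set of sentence-final markers


def restore_punctuation(words: List[str], labels: List[str]) -> str:
    """Attach each predicted marker to the word it follows."""
    if len(words) != len(labels):
        raise ValueError("need exactly one label per word")
    return " ".join(w if l == NO_PUNCT else w + l for w, l in zip(words, labels))


def segment(words: List[str], labels: List[str]) -> List[str]:
    """Split the word stream into sentences at predicted sentence-final markers."""
    sentences, current = [], []
    for w, l in zip(words, labels):
        current.append(w if l == NO_PUNCT else w + l)
        if l in SENTENCE_END:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing words that have no closing marker
        sentences.append(" ".join(current))
    return sentences


def full_stop_precision_recall(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Precision and recall of the '.' label, compared position by position."""
    tp = sum(1 for g, p in zip(gold, pred) if g == "." and p == ".")
    n_pred = sum(1 for p in pred if p == ".")
    n_gold = sum(1 for g in gold if g == ".")
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall


# Hand-made example: unpunctuated ASR-like output with one label per word.
words = "ik ga naar huis het regent vandaag".split()
gold = ["0", "0", "0", ".", "0", "0", "."]
print(restore_punctuation(words, gold))
print(segment(words, gold))

# A cautious prediction that misses the second full stop.
pred = ["0", "0", "0", ".", "0", "0", "0"]
print(full_stop_precision_recall(gold, pred))
```

In the hand-made prediction, every predicted full stop is correct but one gold full stop is missed, which is the kind of trade-off between higher precision and lower recall that the abstract reports for the multilingual models.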
References
Aronoff, M. (2007). Language (linguistics). Scholarpedia, 2(5), 3175. revision #121088
Attia, M., Al-Badrashiny, M., & Diab, M. (2014). GWU-HASP: Hybrid Arabic spelling and punctuation corrector. Proceedings of the EMNLP 2014 workshop on Arabic natural language processing (ANLP) (pp. 148–154). Doha, Qatar: Association for Computational Linguistics. https://aclanthology.org/W14-3620
Che, X., Wang, C., Yang, H., & Meinel, C. (2016). Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 654–658). Portorož, Slovenia: European Language Resources Association (ELRA). https://aclanthology.org/L16-1103
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8440–8451).
de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., & Nissim, M. (2019). BERTje: A Dutch BERT Model. arXiv. https://arxiv.org/abs/1912.09582
Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: a Dutch RoBERTa-based Language Model. Findings of the association for computational linguistics: EMNLP 2020 (pp. 3255–3265). https://aclanthology.org/2020.findings-emnlp.292
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota. https://aclanthology.org/N19-1423
Guerreiro, N. M., Rei, R., & Batista, F. (2021). Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts. Expert Systems with Applications, 186, 115740.
Guhr, O., Schumann, A. K., Bahrmann, F., & Böhme, H. J. (2021). FullStop: Multilingual deep models for punctuation prediction. Proceedings of the swiss text analytics conference (pp. 14–16). Shared task on Sentence End and Punctuation Prediction in NLG Text. Switzerland: Winterthur.
Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. (2017). OpenNMT: Open-source toolkit for neural machine translation. Proceedings of ACL 2017, system demonstrations (pp. 67–72). Vancouver, Canada: Association for Computational Linguistics. https://aclanthology.org/P17-4012
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings of machine translation summit X: Papers (pp. 79–86). Phuket, Thailand. https://aclanthology.org/2005.mtsummit-papers.11
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. Proceedings of the 45th annual meeting of the association for computational linguistics companion volume. Proceedings of the demo and poster sessions (pp. 177–180). Prague, Czech Republic. https://aclanthology.org/P07-2045
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.
Li, X., & Lin, E. (2020). A 43 language multilingual punctuation prediction neural network model. Interspeech (pp. 1067–1071).
Lison, P., & Tiedemann, J. (2016). OpenSubtitles 2016: Extracting large parallel corpora from movie and TV subtitles. Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 923–929). Portorož, Slovenia: European Language Resources Association (ELRA). https://aclanthology.org/L16-1147
Lu, W., & Ng, H. T. (2010). Better punctuation prediction with dynamic conditional random fields. Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP) (pp. 177–186). Cambridge, MA: Association for Computational Linguistics.
Masiello-Ruiz, J. M., Cuadrado, J. L. L., & Martínez, P. (2021). Participation of HULAT-UC3M in SEPP-NLG 2021 shared task (short paper). Proceedings of the swiss text analytics conference (pp. 14–16). Shared task on Sentence End and Punctuation Prediction in NLG Text. Switzerland: Winterthur.
Oostdijk, N., Goedertier, W., van Eynde, F., Boves, L., Martens, J. P., Moortgat, M., & Baayen, H. (2002). Experiences from the spoken Dutch corpus project. Proceedings of the third international conference on language resources and evaluation (LREC’02). Las Palmas, Canary Islands, Spain: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2002/pdf/98.pdf
Oostdijk, N., Reynaert, M., Hoste, V., & Schuurman, I. (2013). The construction of a 500 million word reference corpus of contemporary written Dutch. Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme. Springer Verlag.
Păiş, V., & Tufiş, D. (2022). Capitalization and punctuation restoration: A survey. Artificial Intelligence Review, 55(3), 1681–1722.
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). Doha, Qatar: Association for Computational Linguistics. https://aclanthology.org/D14-1162 https://doi.org/10.3115/v1/D14-1162
Petasis, G., Vichot, F., Wolinski, F., Paliouras, G., Karkaletsis, V., & Spyropoulos, C. D. (2001). Using machine learning to maintain rule-based named-entity recognition and classification systems. Proceedings of the 39th annual meeting of the association for computational linguistics (pp. 426–433). Toulouse, France: Association for Computational Linguistics. https://aclanthology.org/P01-1055 https://doi.org/10.3115/1073012.1073067
Shazeer, N., & Stern, M. (2018). Adafactor: Adaptive learning rates with sublinear memory cost. J. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (Vol. 80, pp. 4596–4604). PMLR.
Stolcke, A., & Shriberg, E. (1996). Automatic linguistic segmentation of conversational speech. Proceeding of fourth international conference on spoken language processing. ICSLP ’96 (Vol. 2, pp. 1005–1008).
Sunkara, M., Ronanki, S., Dixit, K., Bodapati, S., & Kirchhoff, K. (2020). Robust prediction of punctuation and truecasing for medical ASR. Proceedings of the first workshop on natural language processing for medical conversations (pp. 53–62). Online: Association for Computational Linguistics. https://aclanthology.org/2020.nlpmc-1.8
Susanto, R. H., Chieu, H. L., & Lu, W. (2016). Learning to capitalize with character-level recurrent neural networks: An empirical study. Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 2090–2095). Austin, Texas: Association for Computational Linguistics. https://aclanthology.org/D16-1225
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. Proceedings of the eighth international conference on language resources and evaluation (LREC’12) (pp. 2214–2218). Istanbul, Turkey: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf
Tilk, O., & Alumäe, T. (2016). Bidirectional recurrent neural network with attention mechanism for punctuation restoration. Interspeech (pp. 3047–3051). San Francisco, USA.
Tuggener, D., & Aghaebrahimian, A. (2021). The sentence end and punctuation prediction in NLG Text (SEPP-NLG) Shared task 2021. Proceedings of the Swiss text analytics conference 2021.
Van Dyck, B., BabaAli, B., & Van Compernolle, D. (2021). A hybrid ASR system for Southern Dutch. Computational Linguistics in the Netherlands Journal, 11, 27–34. https://clinjournal.org/clinj/article/view/119
Vandeghinste, V., & Bulté, B. (2019). Linguistic proxies of readability: Comparing easy-to-read and regular newspaper Dutch. Computational Linguistics in the Netherlands Journal, 9, 81–100. https://www.clinjournal.org/clinj/article/view/97
Vandeghinste, V., Van Dyck, B., De Coster, M., & Goddefroy, M. (2022). BeCoS corpus: Belgian Covid-19 sign language corpus. A corpus for training sign language recognition and translation. Computational Linguistics in the Netherlands Journal, 12, 7–17.
Vandeghinste, V., Verwimp, L., Pelemans, J., & Wambacq, P. (2018). A comparison of different punctuation prediction approaches in a translation context. Proceedings of the 21st annual conference of the European association for machine translation (pp. 269–278). Universitat d’Alacant, Alacant, Spain.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. I. Guyon et al. (Eds.), Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.
Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: A professional framework for multimodality research. Proceedings of the fifth international conference on language resources and evaluation (LREC’06). Genoa, Italy: European Language Resources Association (ELRA).
Funding
Work in this paper is partly financed by the SignON project (https://signon-project.eu). This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 101017255. The SABeD project is funded by KU Leuven Internal Funding, Research Project 3H200610. Oliver Guhr has been funded by the European Social Fund (ESF), SAB grant number 100339497 and the European Regional Development Funds (ERDF) (ERDF-100346119).
Author information
Contributions
VV took the initiative, prepared the data, and performed the experiments on out-of-domain data and on the baseline. OG adapted the existing approach to Dutch, finetuned the models and performed the classification experiments. Both authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Vandeghinste, V., Guhr, O. FullStop: punctuation and segmentation prediction for Dutch with transformers. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09676-x
DOI: https://doi.org/10.1007/s10579-023-09676-x