Help Them Understand: Testing and Improving Voice User Interfaces

Authors:
Emanuela Guglielmi, Bioscience and Territory, University of Molise, Pesche, Italy
Giovanni Rosa, Bioscience and Territory, University of Molise, Pesche, Italy
Simone Scalabrino, Bioscience and Territory, University of Molise, Pesche, Italy
Gabriele Bavota, Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
Rocco Oliveto, Bioscience and Territory, University of Molise, Pesche, Italy

ACM Transactions on Software Engineering and Methodology, accepted February 2024. https://doi.org/10.1145/3654438

Online AM: 5 April 2024

Abstract

Voice-based virtual assistants are becoming increasingly popular. Such systems provide frameworks that developers can use to build custom apps. End-users interact with such apps through a Voice User Interface (VUI), which lets them perform actions using natural language commands. Testing such apps is not trivial: the same command can be expressed in different semantically equivalent ways. In this paper, we introduce VUI-UPSET, an approach that adapts chatbot-testing approaches to VUI testing. We conducted an empirical study to understand how VUI-UPSET compares to two state-of-the-art approaches (i.e., a chatbot testing technique and ChatGPT) in terms of (i) correctness of the generated paraphrases and (ii) capability of revealing bugs. To this aim, we analyzed 14,898 generated paraphrases for 40 Alexa Skills. Our results show that VUI-UPSET generates more bug-revealing paraphrases than the two baselines, although ChatGPT generates the highest percentage of correct paraphrases. We also used the generated paraphrases to improve the skills, adding to their voice interaction models either (i) only the bug-revealing paraphrases or (ii) all the valid paraphrases. We observed that including only the bug-revealing paraphrases is sometimes not sufficient to make all the tests pass.
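The testing idea summarized in the abstract can be illustrated with a minimal sketch: a paraphrase of a seed utterance should resolve to the same intent as the seed, and a paraphrase that does not is bug-revealing. Everything in the example is an assumption made for illustration. The word-overlap resolver, the names `resolve_intent`, `bug_revealing`, and `add_samples`, the intent names, and the utterances are all hypothetical; the resolver is a toy stand-in and does not reproduce Alexa's language understanding or the way VUI-UPSET generates paraphrases.

```python
from typing import Dict, List, Optional

# Hypothetical toy interaction model: intent name -> sample utterances.
InteractionModel = Dict[str, List[str]]


def resolve_intent(model: InteractionModel, utterance: str,
                   threshold: float = 0.5) -> Optional[str]:
    """Toy resolver: pick the intent whose sample has the highest word
    overlap (Jaccard similarity) with the utterance, or None if the best
    score is below the threshold. A stand-in for a real NLU service."""
    words = set(utterance.lower().split())
    best_intent, best_score = None, 0.0
    for intent, samples in model.items():
        for sample in samples:
            sample_words = set(sample.lower().split())
            score = len(words & sample_words) / len(words | sample_words)
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


def bug_revealing(model: InteractionModel, expected_intent: str,
                  paraphrases: List[str]) -> List[str]:
    """Return the paraphrases that do not resolve to the expected intent."""
    return [p for p in paraphrases
            if resolve_intent(model, p) != expected_intent]


def add_samples(model: InteractionModel, intent: str,
                utterances: List[str]) -> InteractionModel:
    """Return a copy of the model with extra sample utterances for one intent."""
    patched = {name: list(samples) for name, samples in model.items()}
    patched[intent].extend(utterances)
    return patched


# Illustrative data only (not taken from the study).
model = {
    "PlayMusicIntent": ["play some music"],
    "StopIntent": ["stop the music"],
}
paraphrases = ["play music", "put on a song", "start the music"]

# Paraphrases of the seed "play some music" that the toy model mishandles.
failing = bug_revealing(model, "PlayMusicIntent", paraphrases)

# Improvement step: add only the bug-revealing paraphrases, then re-test.
patched = add_samples(model, "PlayMusicIntent", failing)
still_failing = bug_revealing(patched, "PlayMusicIntent", paraphrases)
```

In this toy setting the re-test passes after adding only the failing paraphrases. The abstract reports that on real skills this is sometimes not sufficient, which is why the study also evaluates adding all the valid paraphrases.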

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software
      2. Software evolution
    2. Software verification and validation
      1. Software defect analysis

Published in

ACM Transactions on Software Engineering and Methodology Just Accepted
ISSN:1049-331X
EISSN:1557-7392

Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Online AM: 5 April 2024
- Accepted: 25 February 2024
- Revised: 5 February 2024
- Received: 5 June 2023
Author Tags
voice user interfaces
software testing
NLP
Qualifiers
- research-article