Abstract
Voice-based virtual assistants are becoming increasingly popular. Such systems provide frameworks that developers can use to build custom apps. End-users interact with these apps through a Voice User Interface (VUI), which lets them perform actions using natural language commands. Testing such apps is not trivial, because the same command can be expressed in many semantically equivalent ways. In this paper, we introduce VUI-UPSET, an approach that adapts chatbot-testing techniques to VUI testing. We conducted an empirical study to understand how VUI-UPSET compares to two state-of-the-art approaches (i.e., a chatbot testing technique and ChatGPT) in terms of (i) correctness of the generated paraphrases and (ii) capability of revealing bugs. To this aim, we analyzed 14,898 generated paraphrases for 40 Alexa Skills. Our results show that VUI-UPSET generates more bug-revealing paraphrases than the two baselines, although ChatGPT generates the highest percentage of correct paraphrases. We also used the generated paraphrases to improve the skills, by adding to their voice interaction models either (i) only the bug-revealing paraphrases or (ii) all the valid paraphrases. We observed that including only the bug-revealing paraphrases is sometimes not sufficient to make all the tests pass.
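To make the notion of a bug-revealing paraphrase concrete, the following minimal sketch checks whether paraphrases of a seed command still resolve to the intended intent. It is not the VUI-UPSET implementation: the exact-match resolver `toy_resolve_intent`, the intent names, and the sample utterances are all hypothetical stand-ins for a real voice interaction model and its NLU.

```python
from typing import Dict, List

# Hypothetical interaction model: intent name -> sample utterances.
InteractionModel = Dict[str, List[str]]


def toy_resolve_intent(utterance: str, model: InteractionModel) -> str:
    """Toy stand-in for an NLU: exact match against the sample utterances.

    A real assistant generalizes statistically; this resolver does not.
    """
    normalized = utterance.lower().strip()
    for intent, samples in model.items():
        if normalized in (s.lower().strip() for s in samples):
            return intent
    return "FallbackIntent"


def bug_revealing(paraphrases: List[str], expected_intent: str,
                  model: InteractionModel) -> List[str]:
    """Return the paraphrases that do NOT resolve to the expected intent."""
    return [p for p in paraphrases
            if toy_resolve_intent(p, model) != expected_intent]


# Assumed example data, for illustration only.
model: InteractionModel = {
    "PlayMusicIntent": ["play some music", "start the music"],
    "StopIntent": ["stop"],
}
paraphrases = ["start the music", "put on some tunes", "could you play music"]

failing = bug_revealing(paraphrases, "PlayMusicIntent", model)

# "Improve" the skill by adding only the bug-revealing paraphrases.
improved: InteractionModel = {k: list(v) for k, v in model.items()}
improved["PlayMusicIntent"].extend(failing)
still_failing = bug_revealing(paraphrases, "PlayMusicIntent", improved)
```

With an exact-match resolver, adding the failing paraphrases trivially makes every test pass. A real NLU is retrained on the updated samples and may resolve other utterances differently afterwards, so the toy example cannot reproduce the case in which adding only the bug-revealing paraphrases is not sufficient.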