Help Them Understand: Testing and Improving Voice User Interfaces

Online AM: 5 April 2024

Abstract

Voice-based virtual assistants are becoming increasingly popular. Such systems provide frameworks that developers can use to build custom apps. End-users interact with these apps through a Voice User Interface (VUI), which lets them perform actions using natural language commands. Testing such apps is not trivial: the same command can be expressed in many semantically equivalent ways. In this paper, we introduce VUI-UPSET, an approach that adapts chatbot-testing approaches to VUI testing. We conducted an empirical study to understand how VUI-UPSET compares to two state-of-the-art approaches (i.e., a chatbot testing technique and ChatGPT) in terms of (i) correctness of the generated paraphrases and (ii) capability of revealing bugs. To this aim, we analyzed 14,898 generated paraphrases for 40 Alexa Skills. Our results show that VUI-UPSET generates more bug-revealing paraphrases than the two baselines, although ChatGPT generates the highest percentage of correct paraphrases. We also used the generated paraphrases to improve the skills, by including in their voice interaction models either (i) only the bug-revealing paraphrases or (ii) all the valid paraphrases. We observed that including only bug-revealing paraphrases is sometimes not sufficient to make all the tests pass.
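
The general idea of paraphrase-based VUI testing described in the abstract can be illustrated with a minimal sketch. This is not VUI-UPSET itself and does not reflect how Alexa resolves utterances: the interaction model is reduced to a dictionary of sample utterances, and the resolver uses exact matching purely as a simplifying assumption. All names (resolve_intent, bug_revealing, add_samples, the intent names, and the utterances) are hypothetical. A paraphrase counts as bug-revealing when it is not resolved to the intent of the original command.

```python
from typing import Dict, List, Optional

# Hypothetical toy interaction model: intent name -> sample utterances.
InteractionModel = Dict[str, List[str]]


def normalize(utterance: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(utterance.lower().split())


def resolve_intent(model: InteractionModel, utterance: str) -> Optional[str]:
    """Toy resolver: return the intent that lists the utterance as a sample.

    A real assistant uses a trained NLU model; exact matching is an
    assumption made only to keep this sketch self-contained.
    """
    target = normalize(utterance)
    for intent, samples in model.items():
        if any(normalize(sample) == target for sample in samples):
            return intent
    return None


def bug_revealing(model: InteractionModel, expected_intent: str,
                  paraphrases: List[str]) -> List[str]:
    """Return the paraphrases that are not resolved to the expected intent."""
    return [p for p in paraphrases
            if resolve_intent(model, p) != expected_intent]


def add_samples(model: InteractionModel, intent: str,
                utterances: List[str]) -> InteractionModel:
    """Return a copy of the model with extra sample utterances for one intent."""
    updated = {name: list(samples) for name, samples in model.items()}
    updated.setdefault(intent, []).extend(utterances)
    return updated


model = {"PlayMusicIntent": ["play some music"], "StopIntent": ["stop"]}
paraphrases = ["Play some music", "put on some tunes", "start the music"]

# Step 1: find paraphrases the current model fails to understand.
failing = bug_revealing(model, "PlayMusicIntent", paraphrases)

# Step 2: add them to the interaction model and re-run the same tests.
repaired = add_samples(model, "PlayMusicIntent", failing)
remaining = bug_revealing(repaired, "PlayMusicIntent", paraphrases)
```

With an exact-match resolver, adding the failing paraphrases always makes the tests pass. With a trained NLU model, new sample utterances can shift how other utterances are resolved, which is consistent with the abstract's observation that adding only the bug-revealing paraphrases is sometimes not sufficient.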



        • Published in

          ACM Transactions on Software Engineering and Methodology, Just Accepted
          ISSN: 1049-331X
          EISSN: 1557-7392

          Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Online AM: 5 April 2024
          • Accepted: 25 February 2024
          • Revised: 5 February 2024
          • Received: 5 June 2023
          Qualifiers

          • research-article