Speech-to-SQL: toward speech-driven SQL query generation from natural language question

  • Special Issue Paper
The VLDB Journal

Abstract

Speech-based input has gained significant momentum with the spread of smartphones and tablets in daily life, since voice is one of the most natural and efficient channels for human–computer interaction. This paper works toward more effective speech-based interfaces for querying structured data in relational databases. We first identify a new task, Speech-to-SQL, which aims to understand the information conveyed by human speech and translate it directly into structured query language (SQL) statements. A naive solution works in a cascaded manner: an automatic speech recognition (ASR) component followed by a text-to-SQL component. However, this approach requires a high-quality ASR system and suffers from error compounding between the two components, which limits its performance. To address these challenges, we propose a novel end-to-end neural architecture, SpeechSQLNet, that translates human speech directly into SQL queries without an external ASR step. SpeechSQLNet can make full use of the rich linguistic information present in speech. To the best of our knowledge, this is the first attempt to synthesize SQL directly from common natural language questions in spoken form, rather than from a natural language-based version of SQL. To validate the proposed task and model, we further construct a dataset named SpeechQL by building on widely used text-to-SQL datasets. Extensive experiments on this dataset show that SpeechSQLNet can synthesize high-quality SQL queries directly from human speech, outperforming various competitive counterparts as well as the cascaded methods in terms of exact-match accuracy. We expect that Speech-to-SQL will inspire further research on more effective and efficient human–machine interfaces that lower the barrier to using relational databases.
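
The cascaded baseline and the exact-match metric mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the clip IDs, transcripts, the rule-based `toy_text_to_sql` function, and the string-normalizing `exact_match_accuracy` are all hypothetical stand-ins chosen for illustration, and benchmark evaluations of exact match may compare parsed query components rather than normalized strings. The sketch only shows how a single recognition error in the first stage carries through to the generated SQL.

```python
import re
from typing import Callable, Dict

# Hypothetical toy data: clip IDs and transcripts are invented for illustration.
# clip_2 simulates an ASR error ("singers" misheard as "fingers").
TOY_ASR_OUTPUT: Dict[str, str] = {
    "clip_1": "how many concerts are there",
    "clip_2": "how many fingers are there",
}
GOLD_SQL: Dict[str, str] = {
    "clip_1": "SELECT COUNT(*) FROM concert",
    "clip_2": "SELECT COUNT(*) FROM singer",
}


def toy_asr(clip_id: str) -> str:
    """Stand-in for an ASR component: look up a canned transcript."""
    return TOY_ASR_OUTPUT[clip_id]


def toy_text_to_sql(question: str) -> str:
    """Stand-in for a text-to-SQL component: handles one question pattern."""
    match = re.fullmatch(r"how many (\w+) are there", question.strip().lower())
    if match is None:
        return ""
    noun = match.group(1)
    table = noun[:-1] if noun.endswith("s") else noun  # naive singularization
    return f"SELECT COUNT(*) FROM {table}"


def cascade(asr: Callable[[str], str],
            text_to_sql: Callable[[str], str]) -> Callable[[str], str]:
    """Compose the two stages; the second stage only sees the transcript."""
    def pipeline(clip_id: str) -> str:
        return text_to_sql(asr(clip_id))
    return pipeline


def normalize_sql(query: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    query = query.strip().rstrip(";").lower()
    return re.sub(r"\s+", " ", query).strip()


def exact_match_accuracy(predictions: Dict[str, str],
                         gold: Dict[str, str]) -> float:
    """Fraction of gold queries reproduced exactly after normalization."""
    if not gold:
        return 0.0
    hits = sum(
        normalize_sql(predictions.get(key, "")) == normalize_sql(value)
        for key, value in gold.items()
    )
    return hits / len(gold)


pipeline = cascade(toy_asr, toy_text_to_sql)
predictions = {clip_id: pipeline(clip_id) for clip_id in GOLD_SQL}
accuracy = exact_match_accuracy(predictions, GOLD_SQL)
print(f"exact-match accuracy of the toy cascade: {accuracy:.2f}")
```

In this toy run the misrecognized word in clip_2 produces a query over the wrong table, and the text-to-SQL stage has no way to recover because it never sees the audio. An end-to-end model avoids this hand-off by consuming the speech signal directly.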

Notes

  1. https://cloud.google.com/speech-to-text

  2. https://azure.microsoft.com/cognitive-services

  3. https://ai.baidu.com/tech/speech/asr

  4. https://www.nuance.com/dragon

  5. https://serenade.ai/

  6. https://talonvoice.com/

References

  1. Serenade ai. (Last accessed 16 Oct. 2022). https://serenade.ai/

  2. Talon voice. (Last accessed 16 Oct. 2022). https://talonvoice.com/

  3. Affolter, K., Stockinger, K., Bernstein, A.: A comparative survey of recent natural language interfaces for databases. VLDB J. 28(5), 793–819 (2019)

  4. Alateeq, A., Roantree, M., Gurrin, C.: Voxento: A prototype voice-controlled interactive search engine for lifelogs. In: Proceedings of the Third Annual Workshop on Lifelog Search Challenge, pp. 77–81 (2020)

  5. Audhkhasi, K., Rosenberg, A., Sethy, A., Ramabhadran, B., Kingsbury, B.: End-to-end asr-free keyword search from speech. IEEE J. Selected Top. Signal Process. 11(8), 1351–1359 (2017)

  6. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. In: NIPS 2016 Deep Learning Symposium (2016)

  7. Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., Bengio, Y.: End-to-end attention-based large vocabulary speech recognition. In: 2016 ICASSP, pp. 4945–4949. IEEE (2016)

  8. Bansal, S., Kamper, H., Lopez, A., Goldwater, S.: Towards speech-to-text translation without speech recognition. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 474–479 (2017)

  9. Black, D., Rapos, E.J., Stephan, M.: Voice-driven modeling: Software modeling using automated speech recognition. In: 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), pp. 252–258. IEEE (2019)

  10. Blunschi, L., Jossen, C., Kossmann, D., Mori, M., Stockinger, K.: Data-thirsty business analysts need soda: search over data warehouse. In: Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 2525–2528 (2011)

  11. Bogin, B., Berant, J., Gardner, M.: Representing schema structure with graph neural networks for text-to-sql parsing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4560–4565 (2019)

  12. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)

  13. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. IEEE (2016)

  14. Chazan, D., Hoory, R., Cohen, G., Zibulski, M.: Speech reconstruction from mel frequency cepstral coefficients and pitch frequency. In: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), vol. 3, pp. 1299–1302. IEEE (2000)

  15. Chen, F., Hwang, S.w., Choo, J., Ha, J.W., Kim, S.: Nl2psql: Generating pseudo-sql queries from under-specified natural language questions. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2603–2613 (2019)

  16. Chen, T., Wong, R.C.W.: Handling information loss of graph neural networks for session-based recommendation. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1172–1180 (2020)

  17. Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2014) (2014)

  18. Currey, A., Heafield, K.: Incorporating source syntax into transformer-based neural machine translation. In: Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 24–33 (2019)

  19. Désilets, A., Fox, D.C., Norton, S.: Voicecode: An innovative speech interface for programming-by-voice. In: CHI’06 Extended Abstracts on Human Factors in Computing Systems, pp. 239–242 (2006)

  20. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019)

  21. Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., Tang, J.: Glm: General language model pretraining with autoregressive blank infilling. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335 (2022)

  22. Fan, Y., Qian, Y., Xie, F.L., Soong, F.K.: Tts synthesis with bidirectional lstm based recurrent neural networks. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)

  23. Foote, J.T.: Content-based retrieval of music and audio. In: Multimedia Storage and Archiving Systems II, vol. 3229, pp. 138–147. International Society for Optics and Photonics (1997)

  24. Gan, Y., Chen, X., Huang, Q., Purver, M., Woodward, J.R., Xie, J., Huang, P.: Towards robustness of text-to-sql models against synonym substitution. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2505–2515 (2021)

  25. Gkini, O., Belmpas, T., Koutrika, G., Ioannidis, Y.: An in-depth benchmarking of text-to-sql systems. In: Proceedings of the 2021 International Conference on Management of Data, pp. 632–644 (2021)

  26. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. JMLR Workshop and Conference Proceedings (2011)

  27. Graves, A.: Long short-term memory. In: Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37–45. Springer (2012)

  28. Guo, J., Zhan, Z., Gao, Y., Xiao, Y., Lou, J.G., Liu, T., Zhang, D.: Towards complex text-to-sql in cross-domain database with intermediate representation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4524–4535 (2019)

  29. Hernandez, F., Nguyen, V., Ghannay, S., Tomashenko, N., Estève, Y.: Ted-lium 3: twice as much data and corpus repartition for experiments on speaker adaptation. In: International Conference on Speech and Computer, pp. 198–208. Springer (2018)

  30. Herzig, J., Nowak, P.K., Mueller, T., Piccinno, F., Eisenschlos, J.: Tapas: Weakly supervised table parsing via pre-training. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4320–4333 (2020)

  31. Iacob, R.C.A., Brad, F., Apostol, E.S., Truică, C.O., Hosu, I.A., Rebedea, T.: Neural approaches for natural language interfaces to databases: a survey. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 381–395 (2020)

  32. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)

  33. Kedar, S.: Database Management System. Technical Publications (2009)

  34. Kim, H., So, B.H., Han, W.S., Lee, H.: Natural language to sql: Where are we today? Proceedings of the VLDB Endowment 13(10), 1737–1750 (2020)

  35. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. Tech. rep. (2014)

  36. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings (2017)

  37. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)

  38. Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W.Z., Sotelo, J., de Brébisson, A., Bengio, Y., Courville, A.C.: Melgan: Generative adversarial networks for conditional waveform synthesis. Adv. Neural Inf. Process. Syst. 32 (2019)

  39. Lakew, S.M., Cettolo, M., Federico, M.: A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 641–652 (2018)

  40. Le, H., Sahoo, D., Chen, N., Hoi, S.: Multimodal transformer networks for end-to-end video-grounded dialogue systems. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5612–5623 (2019)

  41. Lee, H., Fenwick Jr, J.B., Klima, R.E., McRae, A.A., Vahlbusch, J.: Disability assistive programming: using voice input to write code. Ph.D. thesis, Appalachian State University (2019)

  42. Lei, W., Wang, W., Ma, Z., Gan, T., Lu, W., Kan, M.Y., Chua, T.S.: Re-examining the role of schema linking in text-to-sql. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6943–6954 (2020)

  43. Li, F., Jagadish, H.: Constructing an interactive natural language interface for relational databases. Proc. VLDB Endow. 8(1), 73–84 (2014)

  44. Li, F., Jagadish, H.V.: Nalir: an interactive natural language interface for querying relational databases. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 709–712 (2014)

  45. Li, G., Muller, M., Thabet, A., Ghanem, B.: Deepgcns: Can gcns go as deep as cnns? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276 (2019)

  46. Li, G., Zhou, X., Cao, L.: Ai meets database: Ai4db and db4ai. In: Proceedings of the 2021 International Conference on Management of Data, pp. 2859–2866 (2021)

  47. Li, J., Zhang, X., Jia, C., Xu, J., Zhang, L., Wang, Y., Ma, S., Gao, W.: Direct speech-to-image translation. IEEE J. Selected Top. Signal Process. 14(3), 517–529 (2020)

  48. Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421 (2015)

  49. Lyons, G., Tran, V., Binnig, C., Cetintemel, U., Kraska, T.: Making the case for query-by-voice with echoquery. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2129–2132 (2016)

  50. Medsker, L.R., Jain, L.: Recurrent neural networks. Des. Appl. 5 (2001)

  51. Nguyen, D.Q., et al.: Investigating the impact of asr errors on spoken implicit discourse relation recognition. In: Proceedings of the First Workshop On Transcript Understanding, pp. 34–39 (2022)

  52. Nguyen, T.Q.: Near-perfect-reconstruction pseudo-qmf banks. IEEE Trans. Signal Process. 42(1), 65–76 (1994)

  53. Nihalani, N., Silakari, S., Motwani, M.: Natural language interface for database: a brief review. Int. J. Comput. Sci. Issues (IJCSI) 8(2), 600 (2011)

  54. Obaido, G., Ade-Ibijola, A., Vadapalli, H.: Talksql: A tool for the synthesis of sql queries from verbal specifications. In: 2020 2nd International Multidisciplinary Information Technology and Engineering Conference (IMITEC), pp. 1–10. IEEE (2020)

  55. OpenAI: Chatgpt (2023). https://openai.com/blog/chatgpt

  56. OpenAI: Gpt-4 technical report (2023)

  57. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE (2015)

  58. Peng, Z., Mo, K., Zhu, X., Chen, J., Chen, Z., Xu, Q., Ma, X.: Understanding user perceptions of robot’s delay, voice quality-speed trade-off and gui during conversation. In: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–8 (2020)

  59. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  60. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al.: The kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society (2011)

  61. Rao, K., Sak, H., Prabhavalkar, R.: Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193–199. IEEE (2017)

  62. Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., Liu, T.Y.: Fastspeech 2: Fast and high-quality end-to-end text to speech. In: International Conference on Learning Representations (2020)

  63. Rousseau, A., Deléglise, P., Esteve, Y.: Ted-lium: an automatic speech recognition dedicated corpus. In: LREC, pp. 125–129 (2012)

  64. Sen, J., Lei, C., Quamar, A., Özcan, F., Efthymiou, V., Dalmia, A., Stager, G., Mittal, A., Saha, D., Sankaranarayanan, K.: Athena++ natural language querying for complex nested sql queries. Proc. VLDB Endow. 13(12), 2747–2759 (2020)

  65. Shah, V., Li, S., Kumar, A., Saul, L.: Speakql: Towards speech-driven multimodal querying of structured data. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 2363–2374 (2020)

  66. Shah, V., Li, S., Yang, K., Kumar, A., Saul, L.: Demonstration of speakql: speech-driven multimodal querying of structured data. In: Proceedings of the 2019 International Conference on Management of Data, pp. 2001–2004 (2019)

  67. Shekarpour, S., Marx, E., Ngomo, A.C.N., Auer, S.: Sina: Semantic interpretation of user queries for question answering on interlinked data. J. Web Semant. 30, 39–51 (2015)

  68. Song, Y., Jiang, D., Huang, X., Li, Y., Xu, Q., Wong, R.C.W., Yang, Q.: Goldenretriever: A speech recognition system powered by modern information retrieval. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4500–4502 (2020)

  69. Song, Y., Jiang, D., Zhao, X., Xu, Q., Wong, R.C.W., Fan, L., Yang, Q.: L2rs: A learning-to-rescore mechanism for hybrid speech recognition. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1157–1166 (2021)

  70. Song, Y., Wong, R.C.W., Xuefang, Z., Jiang, D.: Voicequerysystem: a voice-driven database querying system using natural language questions. In: Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data (2022)

  71. Stolcke, A.: Srilm-an extensible language modeling toolkit. In: Seventh International Conference on Spoken Language Processing (2002)

  72. Sun, N., Yang, X., Liu, Y.: Tableqa: a large-scale chinese text-to-sql dataset for table-aware sql generation. arXiv pp. arXiv–2006 (2020)

  73. Sun, Y., Tang, D., Duan, N., Ji, J., Cao, G., Feng, X., Qin, B., Liu, T., Zhou, M.: Semantic parsing with syntax-and table-aware sql generation. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 361–372 (2018)

  74. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 27 (2014)

  75. Tian, Z., Yi, J., Tao, J., Bai, Y., Wen, Z.: Self-attention transducers for end-to-end speech recognition. Proc. Interspeech 2019, 4395–4399 (2019)

  76. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  77. Trummer, I.: Demonstrating the voice-based exploration of large data sets with cicerodb-zero. Proc. VLDB Endow. 13(12), 2869–2872 (2020)

  78. Utama, P., Weir, N., Binnig, C., Cetintemel, U.: Voice-based data exploration: Chatting with your database. In: Proceedings of the Workshop on Search-Oriented Conversational AI (SCAI) (2017)

  79. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010 (2017)

  80. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. Adv. Neural Inf. Process. Syst. 28, 2692–2700 (2015)

  81. Wahlster, W.: Verbmobil: Foundations of Speech-to-Speech Translation. Springer Science & Business Media, Berlin (2013)

  82. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37(3), 328–339 (1989)

  83. Wang, X., Qiao, T., Zhu, J., Hanjalic, A., Scharenborg, O.: S2igan: Speech-to-image generation via adversarial learning. In: INTERSPEECH 2020, pp. 2292–2296. ISCA (2020)

  84. Wang, X., Qiao, T., Zhu, J., Hanjalic, A., Scharenborg, O.: Generating images from spoken descriptions. IEEE/ACM Trans. Audio Speech Language Process. 29, 850–865 (2021)

  85. Watanabe, S., Hori, T., Kim, S., Hershey, J.R., Hayashi, T.: Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE J. Selected Top. Signal Process. 11(8), 1240–1253 (2017)

  86. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022)

  87. Weller, O., Sperber, M., Pires, T., Setiawan, H., Gollan, C., Telaar, D., Paulik, M.: End-to-end speech translation for code switched speech. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 1435–1448 (2022)

  88. Xu, J., Tan, X., Ren, Y., Qin, T., Li, J., Zhao, S., Liu, T.Y.: Lrspeech: Extremely low-resource speech synthesis and recognition. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2802–2812 (2020)

  89. Xu, X., Liu, C., Song, D.: Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436 (2017)

  90. Yin, P., Neubig, G., Yih, W.t., Riedel, S.: Tabert: Pretraining for joint understanding of textual and tabular data. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8413–8426 (2020)

  91. Yu, D., Deng, L.: Automatic Speech Recognition. Springer (2016)

  92. Yu, T., Li, Z., Zhang, Z., Zhang, R., Radev, D.: Typesql: Knowledge-based type-aware neural text-to-sql generation. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 588–594 (2018)

  93. Yu, T., Wu, C.S., Lin, X.V., Tan, Y.C., Yang, X., Radev, D., Xiong, C., et al.: Grappa: Grammar-augmented pre-training for table semantic parsing. In: International Conference on Learning Representations (2020)

  94. Yu, T., Yasunaga, M., Yang, K., Zhang, R., Wang, D., Li, Z., Radev, D.: Syntaxsqlnet: Syntax tree networks for complex and cross-domain text-to-sql task. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1653–1663 (2018)

  95. Yu, T., Zhang, R., Polozov, A., Meek, C., Awadallah, A.H.: Score: Pre-training for context representation in conversational semantic parsing. In: International Conference on Learning Representations (2021)

  96. Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., et al.: Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3911–3921 (2018)

  97. Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al.: Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022)

  98. Zenz, G., Zhou, X., Minack, E., Siberski, W., Nejdl, W.: From keywords to semantic queries-incremental query construction on the semantic web. J. Web Semant. 7(3), 166–176 (2009)

  99. Zeyer, A., Bahar, P., Irie, K., Schlüter, R., Ney, H.: A comparison of transformer and lstm encoder decoder models for asr. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8–15. IEEE (2019)

  100. Zhang, R., Yu, T., Er, H., Shim, S., Xue, E., Lin, X.V., Shi, T., Xiong, C., Socher, R., Radev, D.: Editing-based sql query generation for cross-domain context-dependent questions. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5338–5349 (2019)

  101. Zhao, X., Wang, L., He, R., Yang, T., Chang, J., Wang, R.: Multiple knowledge syncretic transformer for natural dialogue generation. In: Proceedings of The Web Conference 2020, pp. 752–762 (2020)

  102. Zheng, W., Cheng, H., Zou, L., Yu, J.X., Zhao, K.: Natural language question/answering: Let users talk with the knowledge graph. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 217–226 (2017)

  103. Zhong, V., Xiong, C., Socher, R.: Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 (2017)

  104. Zhou, S., Dong, L., Xu, S., Xu, B.: A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese. In: International Conference on Neural Information Processing, pp. 210–220. Springer (2018)

  105. Zhou, X., Chai, C., Li, G., Sun, J.: Database meets artificial intelligence: A survey. IEEE Trans. Knowl. Data Eng. (2020)

Acknowledgements

We thank the editor and the reviewers for their valuable comments, which helped improve this draft. Xuefang Zhao is the corresponding author.

Author information

Corresponding author

Correspondence to Xuefang Zhao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A VoiceQuerySystem using the techniques described in this draft was published in SIGMOD’22 [70].

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Song, Y., Wong, R.C.W. & Zhao, X. Speech-to-SQL: toward speech-driven SQL query generation from natural language question. The VLDB Journal (2024). https://doi.org/10.1007/s00778-024-00837-0

  • DOI: https://doi.org/10.1007/s00778-024-00837-0
