
Style-conditioned music generation with Transformer-GANs



Abstract

Recently, various algorithms have been developed for generating appealing music, but style control during generation has been largely overlooked. Music style refers to the representative and distinctive character a musical work presents, and it is one of the most salient qualities of music. In this paper, we propose a music generation algorithm capable of creating a complete musical composition from scratch in a specified target style. The model introduces a style-conditioned linear Transformer and a style-conditioned patch discriminator. The style-conditioned linear Transformer models musical instrument digital interface (MIDI) event sequences while emphasizing the role of style information; the style-conditioned patch discriminator applies an adversarial learning mechanism with two novel loss functions to strengthen the modeling of music sequences. Moreover, we establish, for the first time, a discriminative metric for evaluating how consistently the generated music matches the target style. Both objective and subjective evaluations indicate that our method outperforms state-of-the-art methods on publicly available datasets.
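To make the two components concrete, the following is a minimal PyTorch sketch of style conditioning in the generator and of a patch discriminator. It is an illustration under stated assumptions, not the authors' implementation (their code is in the SCTG repository linked below): standard softmax attention stands in for the paper's linear attention, the two novel loss functions are omitted, and all class names and hyperparameters (StyleConditionedGenerator, n_styles, patch_len, etc.) are hypothetical.

```python
# Minimal sketch, NOT the paper's implementation: a learned style
# embedding is added at every token position of an autoregressive
# Transformer, and a patch discriminator scores fixed-length chunks
# of the sequence conditioned on the same style. Standard softmax
# attention is used here in place of the paper's linear attention.
import torch
import torch.nn as nn

class StyleConditionedGenerator(nn.Module):
    """Autoregressive Transformer over MIDI event tokens, conditioned
    on a target style via an embedding added at every position."""
    def __init__(self, vocab_size, n_styles, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.style_emb = nn.Embedding(n_styles, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, style_id):
        # tokens: (batch, seq), style_id: (batch,)
        x = self.token_emb(tokens) + self.style_emb(style_id).unsqueeze(1)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)          # causal mask
        return self.head(self.backbone(x, mask=mask))  # next-token logits

class StyleConditionedPatchDiscriminator(nn.Module):
    """Scores non-overlapping patches of a token sequence as
    real/fake, given the target style."""
    def __init__(self, vocab_size, n_styles, d_model=256, patch_len=32):
        super().__init__()
        self.patch_len = patch_len
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.style_emb = nn.Embedding(n_styles, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, tokens, style_id):
        x = self.token_emb(tokens) + self.style_emb(style_id).unsqueeze(1)
        b, t, d = x.shape
        t = (t // self.patch_len) * self.patch_len    # drop ragged tail
        patches = x[:, :t].reshape(b, -1, self.patch_len, d).mean(dim=2)
        return self.score(patches).squeeze(-1)        # (batch, n_patches)

# Usage: one score per patch gives the discriminator a finer-grained
# training signal than a single whole-sequence score would.
g = StyleConditionedGenerator(vocab_size=400, n_styles=3)
d = StyleConditionedPatchDiscriminator(vocab_size=400, n_styles=3)
tokens = torch.randint(0, 400, (2, 64))
style = torch.tensor([0, 2])
logits = g(tokens, style)        # (2, 64, 400)
scores = d(tokens, style)        # (2, 2): two 32-token patches
```

Note that in actual adversarial training, passing gradients from the discriminator back through discrete token samples requires a relaxation such as Gumbel-Softmax; the authors' repository is the authoritative reference for the real training procedure.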



Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request. The code and some generated music examples are available at https://github.com/li-car-fei/SCTG.


Author information


Contributions

Weining WANG and Jiahui LI designed the research and processed the data. Jiahui LI and Yifan LI drafted the paper. Weining WANG and Xiaofen XING helped organize the paper. Weining WANG and Xiaofen XING revised and finalized the paper.

Corresponding author

Correspondence to Xiaofen Xing  (邢晓芬).

Ethics declarations

All the authors declare that they have no conflict of interest.

Additional information

Project supported by the Natural Science Foundation of Guangdong Province, China (No. 2021A1515011888)


Cite this article

Wang, W., Li, J., Li, Y. et al. Style-conditioned music generation with Transformer-GANs. Front Inform Technol Electron Eng 25, 106–120 (2024). https://doi.org/10.1631/FITEE.2300359

