Enhancing Music Generation With a Semantic-Based Sequence-to-Music Transformer Framework

Yang Xu
Copyright: © 2024 | Pages: 19
DOI: 10.4018/IJSWIS.343491

Abstract

Music generation has become a platform for creative expression, promoting artistic innovation, personalized experiences, and cultural integration, with implications for education and the development of the creative industries. However, generating music that resonates emotionally remains a challenge. We therefore introduce a new framework, the Sequence-to-Music Transformer Framework, for music generation. This framework employs a simple encoder-decoder Transformer to model music by transforming its fundamental notes into a sequence of discrete tokens, which the model learns to generate token by token. The encoder extracts melodic features of the music, while the decoder uses these extracted features to generate the music sequence. Generation proceeds in an auto-regressive manner, meaning the model predicts each token conditioned on the previously generated tokens. Melodic features are injected into the decoder through cross-attention layers, and the generation process concludes when an "end" token is produced. Experimental results show state-of-the-art performance across a wide range of datasets.
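To make the described generation loop concrete, the following is a minimal sketch, assuming a PyTorch-style encoder-decoder Transformer: token-by-token autoregressive decoding under a causal mask, cross-attention to encoder features, and an "end" token as the stopping condition. The vocabulary layout, model sizes, and class names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the generation loop described in the abstract: an
# encoder-decoder Transformer emits note tokens autoregressively until an
# "end" token. All names and sizes below are illustrative assumptions.
import torch
import torch.nn as nn

PAD, BOS, EOS = 0, 1, 2          # assumed special tokens; EOS plays the "end" role
VOCAB = 512                      # assumed size of the discrete note-token vocabulary

class Seq2MusicTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.core = nn.Transformer(d_model, nhead, layers, layers,
                                   batch_first=True)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, src, tgt):
        # Causal mask: the decoder attends only to previously generated tokens;
        # cross-attention to the encoder output injects the melodic features.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.core(self.embed(src), self.embed(tgt), tgt_mask=mask)
        return self.head(out)

@torch.no_grad()
def generate(model, src, max_len=128):
    tgt = torch.full((src.size(0), 1), BOS, dtype=torch.long)
    for _ in range(max_len):
        logits = model(src, tgt)[:, -1]          # distribution over the next token
        nxt = logits.argmax(-1, keepdim=True)    # greedy here; sampling also works
        tgt = torch.cat([tgt, nxt], dim=1)
        if (nxt == EOS).all():                   # stop when "end" is generated
            break
    return tgt

model = Seq2MusicTransformer()
# Stand-in for the melodic feature sequence the paper's encoder would extract.
melody_condition = torch.randint(3, VOCAB, (1, 32))
print(generate(model, melody_condition).shape)
```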

Materials And Methods

In general, studies concerning music generation can be classified into three categories: sequence learning, multitask learning, and music generation.

Sequence Learning

Sequential data is prevalent in real-world settings, such as speech, text, and stock prediction. Over the past decades, the modeling of sequence problems has advanced substantially. Traditional methods such as Hidden Markov Models (HMMs) (Rezatofighi et al., 2019) have been widely used in fields such as text-to-speech conversion, language modeling, and protein sequence analysis.

For example, Pierre Baldi and colleagues used HMMs to model proteins, adapting model parameters through algorithms with smooth convergence, while Keiichi Tokuda employed an algorithm that generates speech parameters from HMMs using unobservable vectors. However, traditional methods require manual feature design and extraction, which consumes significant time and effort. Deep learning, by contrast, has demonstrated outstanding performance in sequence modeling (Babalola et al., 2021; Liu et al., April 2023): it models sequences in an end-to-end manner, avoiding extensive manual feature engineering (Li et al., April 2022), as exemplified by recurrent neural networks (RNNs) and convolutional neural networks (CNNs).
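To make the contrast with hand-designed features concrete, here is a minimal, hypothetical sketch of end-to-end sequence modeling: a small recurrent model trained for next-token prediction directly on raw token IDs. All sizes and data below are toy assumptions, not taken from the cited works.

```python
# End-to-end sequence modeling sketch: the RNN learns next-token prediction
# from raw token IDs, with no manually engineered features.
import torch
import torch.nn as nn

class NextTokenRNN(nn.Module):
    def __init__(self, vocab=64, d=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.gru = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, x):
        h, _ = self.gru(self.emb(x))
        return self.out(h)               # logits for the next token at each position

model = NextTokenRNN()
seq = torch.randint(0, 64, (8, 20))      # toy batch of token sequences
logits = model(seq[:, :-1])              # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 64), seq[:, 1:].reshape(-1))
loss.backward()                          # trained end to end from raw IDs
print(float(loss))
```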

Although CNNs can address the problem to some extent, deep neural networks require large amounts of data to train their many parameters, and sufficient training data is not always available. These issues have prompted consideration of alternative methods. For example, Ma et al. (2023) cite Lample, who introduced an unsupervised machine-translation method relying solely on monolingual corpora, and Liu, who employed SeqGAN to generate text from scarce, unmatched image-text data.

Multitask Learning

Multitask learning is frequently used to exploit features shared across related tasks, since features learned for one task can benefit others. Previous research has demonstrated successful applications of multitask learning across machine learning domains, from natural language processing to computer vision (Liu et al., April 2023; Krauss, 2023). Zhang proposed enhancing generalization performance by using information from related tasks. Hashimoto established a hierarchical framework encompassing various natural language processing (NLP) tasks and formulated a simple regularization term to improve performance across the board. Kendall adapted the relative weight of each task by formulating a multitask loss function that maximizes the Gaussian likelihood. A substantial amount of work in multitask learning is still ongoing (Zhou et al., 2023; Wu et al., 2023).
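As a concrete illustration of the Kendall-style weighting mentioned above, the following is a minimal sketch of uncertainty-based multitask loss weighting, in which each task loss is scaled by a learned homoscedastic-uncertainty term. The class name and task losses are placeholders, not code from the cited work.

```python
# Sketch of uncertainty-weighted multitask loss: minimizing
# loss_i / (2 * sigma_i^2) + log(sigma_i) per task corresponds to
# maximizing the Gaussian likelihood described in the text.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, num_tasks=2):
        super().__init__()
        # log(sigma^2) per task, learned jointly with the model parameters
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for loss, log_var in zip(task_losses, self.log_vars):
            total = total + 0.5 * torch.exp(-log_var) * loss + 0.5 * log_var
        return total

weigher = UncertaintyWeightedLoss(num_tasks=2)
combined = weigher([torch.tensor(1.3), torch.tensor(0.7)])  # placeholder losses
combined.backward()                 # gradients flow into the learned weights too
print(weigher.log_vars.grad)
```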

Music Generation

Over the past few decades, music generation has been a challenging task, and various approaches have been proposed (Pei et al., 2023; Shen, 2023). Typical data-driven statistical methods often employ Markov models, and other work has suggested related ideas, such as using chords to select melodies. However, traditional methods require substantial human effort and domain knowledge. Lately, deep neural networks have been employed for end-to-end music generation, effectively tackling these challenges. Johnson, for instance, integrated an RNN with a non-recurrent neural network to capture the potential coexistence of multiple notes. Moysis (February 2023) introduced an RNN-based generative model capable of generating four-part choral music using a Gibbs-like sampling process. In contrast to RNN-based models, Sabathe used a variational autoencoder (VAE) to learn the distribution of music pieces. Furthermore, Zhang employed a Transformer network (Moysis, July 2023) to generate melodies from scratch, using random noise as input.

Despite extensive research in music generation, existing studies have not fully considered the specificity of music, such as chords, rhythm, and instruments. For pop music generation, prior works did not take chord progressions and rhythmic patterns into account, even though chord progressions typically guide the melody's development and rhythmic patterns determine whether a song is suitable for singing. Pop music should also retain the characteristics of its instruments. Finally, harmonies play a crucial role in multitrack music but have not been well addressed in previous research. Music style is another essential feature of music, and researchers have recently shown increasing interest in it. An unsupervised music style transfer method has been proposed that does not require parallel data; it is suitable for waveform and image data but cannot handle sequential data such as Musical Instrument Digital Interface (MIDI) files. To address this issue, a variational autoencoder model has been designed to achieve style transfer between classical and jazz music. Although this model can handle sequential data, it requires a significant amount of parallel music data for training. How to leverage unpaired music data to learn music styles therefore remains a valuable open question.

Our sequence learning framework embodies a similar ethos to that of Pix2Seq (Zhang, 2023). Both methods view their domain tasks as sequence generation challenges and discretize the sequences’ continuous values into integers. However, our approach diverges from Pix2Seq in three key aspects:

  • Sequence structure: Pix2Seq sets up sequences using object coordinates and object categories, whereas our method uses basic musical notes (see the tokenization sketch after this list).

  • Architecture: Pix2Seq employs ResNet (He et al., 2016) as its backbone network, followed by an encoder-decoder transformer. Our approach is simpler and more direct, using a single encoder-decoder Transformer: BERT (Devlin et al., 2019) serves as the encoder to extract features, and causal Transformer blocks serve as the decoder for sequence generation.

  • Task: Pix2Seq is tailored for computer vision, whereas our approach is tailored for music generation.
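To illustrate the note-based sequence structure from the first bullet, here is a minimal, hypothetical tokenization sketch in which each note's pitch, duration, and onset are quantized to integers in disjoint ranges and packed into one flat sequence. The field layout and bin counts are illustrative assumptions, not the paper's exact scheme.

```python
# Illustrative note-to-token discretization: continuous note attributes are
# quantized to integers, mirroring how Pix2Seq discretizes coordinates.
DUR_BINS = 32      # assumed number of duration buckets (sixteenth-note steps)
STEP_BINS = 64     # assumed number of onset-step buckets

def note_to_tokens(pitch, duration_beats, onset_step):
    """Map one note to three integer tokens in disjoint ID ranges."""
    pitch_tok = pitch                                        # MIDI pitch 0..127
    dur_tok = 128 + min(int(duration_beats * 4), DUR_BINS - 1)
    step_tok = 128 + DUR_BINS + min(onset_step, STEP_BINS - 1)
    return [pitch_tok, dur_tok, step_tok]

# C major triad, quarter notes on successive steps
notes = [(60, 1.0, 0), (64, 1.0, 1), (67, 1.0, 2)]
sequence = [tok for note in notes for tok in note_to_tokens(*note)]
print(sequence)  # one flat token sequence the decoder can model left to right
```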
