A Transformer-Based Model for Multi-Track Music Generation

Cong Jin (Communication University of China, China), Tao Wang (Zhengzhou University, China), Shouxun Liu (Communication University of China, China), Yun Tie (Zhengzhou University, China), Jianguang Li (Communication University of China, China), Xiaobing Li (Central Conservatory of Music, China) and Simon Lui (Singapore University of Technology and Design, Singapore)
DOI: 10.4018/IJMDEM.2020070103

Most current work is still limited to melody generation, which covers the pitch, rhythm, and duration of each note and the pauses between notes. This paper proposes a Transformer-based model, abbreviated as the MTMG model, that generates multi-track music with piano, guitar, and drum tracks. The MTMG model builds on and extends the Transformer. First, the model obtains three target sequences through pairwise learning in a learning network. Then, based on these three target sequences, GPT is applied to predict and generate three closely related instrument-track sequences. Finally, the three generated instrument tracks are fused to obtain multi-track music pieces containing piano, guitar, and drum. To verify the effectiveness of the proposed model, comparative experiments are conducted using both subjective and objective evaluation. The encouraging performance of the proposed model relative to other state-of-the-art models demonstrates its strength in musical representation.
Article Preview


Like most sequence-to-sequence (Seq2Seq) models (Sutskever et al., 2014), the Transformer uses an encoder-decoder structure. Earlier models, however, usually use a recurrent neural network (such as an LSTM) in the encoder and decoder. This structure has two disadvantages: it struggles with long-term dependencies, and it cannot be computed in parallel. To improve parallel efficiency and capture long-term dependencies, the Transformer abandons the recurrent generation of RNNs and uses self-attention to build a fully connected network structure, yielding a feed-forward architecture based entirely on the attention mechanism. In this context, OpenAI's GPT pre-training model emerged (Radford et al., 2018). GPT trains language models generatively, and it uses only the decoder structure of the Transformer rather than the complete Transformer. As GPT became widely used, OpenAI proposed GPT-2, which is trained on a larger data set and can perform diverse tasks without supervision. The structure of GPT-2 is the same as that of GPT; its core idea is that an unsupervised pre-trained model can be used for supervised tasks (Radford et al., 2019). In a similar spirit to pre-training, a weakly-supervised deep hashing method has been proposed that exploits weakly-supervised information (Li et al., 2020). BERT (Bidirectional Encoder Representations from Transformers) likewise modifies the pre-training objective relative to GPT, and pre-trains a larger model on more data to obtain the best results at the time of writing.
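The self-attention idea described above can be illustrated with a minimal sketch. It is not the MTMG implementation: the function names `softmax` and `causal_self_attention` are hypothetical, the learned query/key/value projections and multi-head structure of a real Transformer are omitted, and the inputs are toy vectors. It only shows how a decoder-style (GPT-like) layer lets each position attend to itself and earlier positions, without any recurrent state.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask.

    Position i only sees positions 0..i, as in a decoder-only model.
    Assumption: queries, keys, and values are given directly; a real
    Transformer would first apply learned linear projections.
    """
    d_k = len(keys[0])
    n = len(queries)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Similarity of query i with every visible key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row shows the masked (future) positions.
        all_weights.append(weights + [0.0] * (n - i - 1))
    return outputs, all_weights


# Toy sequence of three 2-dimensional token embeddings (illustrative values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens, tokens, tokens)
```

Each output row depends only on the inputs, not on other output rows, so the per-position loop could run in parallel. This is the property that recurrent encoders and decoders lack.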
