Week 4 — Generating Music Using Deep Learning

b21626972
BBM406 Spring 2021 Projects
May 16, 2021


Hi, this week we will talk about a recent method used in music generation: the Transformer model.

Previously, we discussed CNN-based and RNN-based models for processing music data. However, after further investigation, we found that both have limitations: RNN-based models do not perform well because they consume only one token at each timestep during training, and CNNs apply the convolution operation only over local regions of the input. Transformers offer a solution to both problems at once, since self-attention processes the whole sequence in parallel and lets every position attend to every other position, as sketched below.
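To make this contrast concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. The sequence length and embedding size are arbitrary toy values, not the ones used in our project; the point is only that every position attends to every other position in one matrix operation, with no sequential loop and no fixed receptive field.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 16, 64                    # toy sizes, chosen only for illustration
x = torch.randn(seq_len, d_model)            # a sequence of music-token embeddings

# Project the same sequence into queries, keys, and values (single head for clarity).
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Every token attends to every token: no per-timestep recurrence, no local receptive field.
scores = q @ k.T / d_model ** 0.5            # (seq_len, seq_len) attention scores
weights = F.softmax(scores, dim=-1)
out = weights @ v                            # contextualized representation of each position
print(out.shape)                             # torch.Size([16, 64])
```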

Transformer-Based Models

The Transformer architecture was first presented in the paper "Attention Is All You Need" by the Google Brain team. A Transformer consists of two main parts: an encoder block and a decoder block. An overview of the architecture is shown below.
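Before looking at the blocks one by one, here is a minimal sketch of the full encoder-decoder stack using PyTorch's built-in nn.Transformer. The vocabulary size, model dimensions, and the idea of feeding integer music-event tokens are illustrative assumptions, not the exact setup of our project.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 256, 128               # e.g. a small music-event vocabulary (assumed)
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)
to_logits = nn.Linear(d_model, vocab_size)

# nn.Transformer expects (seq_len, batch, d_model) tensors by default.
src = torch.randint(0, vocab_size, (32, 1))  # source token sequence
tgt = torch.randint(0, vocab_size, (32, 1))  # target token sequence
out = model(embed(src), embed(tgt))          # (32, 1, d_model)
logits = to_logits(out)                      # per-position predictions over the vocabulary
print(logits.shape)                          # torch.Size([32, 1, 256])
```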

Encoder

The encoder is composed of a stack of identical layers, each containing two sub-layers. The first sub-layer is a multi-head self-attention layer, and the second is a position-wise fully connected feed-forward network. A residual connection followed by layer normalization is applied around each of the two sub-layers.
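A minimal sketch of one such encoder layer follows, assuming PyTorch and the post-norm arrangement of the original paper; the dimensions are toy values chosen for illustration.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=128, nhead=4, d_ff=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (seq_len, batch, d_model)
        # Sub-layer 1: multi-head self-attention + residual connection + layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward network + residual connection + layer norm
        return self.norm2(x + self.ff(x))

x = torch.randn(16, 1, 128)                  # toy sequence of embeddings
print(EncoderLayer()(x).shape)               # torch.Size([16, 1, 128])
```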

Decoder

The decoder layer is similar to the encoder layer, but in addition to the encoder's two sub-layers it inserts a third multi-head attention sub-layer that attends over the encoder's output. Its self-attention sub-layer is also masked so that a position cannot attend to later positions. As in the encoder, residual connections and layer normalization are applied around each of the three sub-layers.
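Here is a matching sketch of one decoder layer with its three sub-layers; the causal mask is a standard upper-triangular matrix of negative infinities added to the attention scores, and all sizes are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=128, nhead=4, d_ff=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, tgt, memory):          # both: (seq_len, batch, d_model)
        # Sub-layer 1: masked self-attention so a position cannot see future positions
        L = tgt.size(0)
        causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
        a, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norms[0](tgt + a)
        # Sub-layer 2: multi-head attention over the encoder output (memory)
        a, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norms[1](tgt + a)
        # Sub-layer 3: position-wise feed-forward network
        return self.norms[2](tgt + self.ff(tgt))

tgt, memory = torch.randn(16, 1, 128), torch.randn(20, 1, 128)
print(DecoderLayer()(tgt, memory).shape)     # torch.Size([16, 1, 128])
```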

References

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. "Music Transformer: Generating Music with Long-Term Structure."

See you in the next blog post.
