Week 4: Generating Music Using Deep Learning
Hi! This week we will talk about a recent method used in music generation: the Transformer model.
Previously, we discussed CNN-based and RNN-based models for processing music data. However, after further investigation, we found that both approaches have shortcomings. RNN-based models train slowly because they consume one token per time step, which prevents parallelization across the sequence, and CNNs apply the convolution operation only within a limited receptive field, so distant parts of a piece cannot interact directly. Transformers offer a solution to both problems at once: self-attention relates every position to every other position, and all positions are processed in parallel.
Transformer Based Models
The Transformer architecture was first presented in the paper "Attention Is All You Need" published by the Google Brain team. Transformers consist of two main parts: an encoder block and a decoder block. The architecture overview is shown below.
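The operation at the heart of both blocks is scaled dot-product attention. As a minimal NumPy sketch (function and variable names are my own, not from the paper's reference code), self-attention can be written as:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the value rows,
    with weights given by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarities
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

# toy sequence: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)         # self-attention: Q = K = V
print(out.shape)                                    # (4, 8)
```

In multi-head attention, this computation is simply repeated over several learned projections of Q, K, and V, and the results are concatenated.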
Encoder
The encoder is composed of a stack of identical layers, each containing two sub-layers. The first is a multi-head self-attention layer; the second is a position-wise fully connected feed-forward network. A residual connection followed by layer normalization is applied around each of the two sub-layers.
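A single encoder layer can be sketched in NumPy as follows. This is a simplified, single-head illustration under my own naming (the real architecture uses multi-head attention and learned layer-norm parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

def encoder_layer(x, W1, b1, W2, b2):
    # sub-layer 1: self-attention, with residual connection + layer norm
    x = layer_norm(x + self_attention(x))
    # sub-layer 2: position-wise feed-forward network (ReLU MLP),
    # again with residual connection + layer norm
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(4, d_model))                   # 4 tokens
out = encoder_layer(x,
                    rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                    rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)                                    # (4, 8)
```

The full encoder stacks several such layers, feeding each layer's output into the next.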
Decoder
The decoder layer is similar to the encoder layer, but it inserts a third multi-head attention sub-layer that attends over the encoder's output. In addition, its self-attention sub-layer is masked so that each position can attend only to earlier positions, which preserves the autoregressive property needed for generation. As in the encoder, residual connections and layer normalization are applied around all three sub-layers.
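The masking in the decoder's self-attention can be sketched with a lower-triangular causal mask; here is a minimal NumPy illustration (single-head, names are my own):

```python
import numpy as np

def masked_self_attention(x):
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))   # position i sees only j <= i
    scores = np.where(causal, scores, -1e9)         # hide future tokens
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))                         # 5 tokens
out = masked_self_attention(x)
print(out.shape)                                    # (5, 8)
```

Because the first position can attend only to itself, its output equals its input value row, which is exactly what lets the decoder generate music one token at a time without peeking ahead.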
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.
See you in the next blog post!