Mike Wang, John Inacay, and Wiley Wang (All authors contributed equally)
If you’ve been using online translation services, you may have noticed that the translation quality has significantly improved in recent years. Since it was introduced in 2017, the Transformer deep learning model has rapidly replaced the recurrent neural network (RNN) as the architecture of choice for Natural Language Processing (NLP) tasks. Well-known examples include OpenAI’s Generative Pre-trained Transformer (GPT) and Google’s Bidirectional Encoder Representations from Transformers (BERT) models. Thanks to the Transformer’s parallelization ability and modern computing power, these models are large and fast evolving, and generative language models frequently draw media attention for their capabilities. If you’re like us, relatively new to NLP but with a general understanding of machine learning fundamentals, this tutorial may help you kick-start your understanding of Transformers with a real-life example: building an end-to-end German to English translator.
In creating this tutorial, we based our work on two resources: the PyTorch RNN-based language translator tutorial and a translator implementation by Andrew Peng. Using an openly available dataset, we’ll demonstrate our Colab implementation of German to English translation with PyTorch and the Transformer model.
To start with, let’s talk about how data flows through the translation process. The data flow follows the diagram shown above. An input sequence is converted to a tensor and passed through the Transformer. Each of the Transformer’s outputs then goes through an unpictured “de-embedding” step that converts it from an embedding into a word of the final output sequence. Note that during inference we obtain words one by one, one per forward pass, rather than receiving a translation of the full text from a single forward pass.
At the start, we have our input sequence. For example, we start with the German sentence “Zwei junge personen fahren mit dem schlitten einen hügel hinunter.” The ground truth English translation is “Two young people are going down a hill on a slide.” Below, we show how the Transformer is used, with some insight into its inner workings. The model itself expects the source German sentence and whatever portion of the translation has been inferred so far. The translation process is therefore a feedback loop: each predicted word is appended to the partial translation, which is fed back in to predict the following word.
For the task of translation, we use the German-English `Multi30k` dataset from `torchtext`. This dataset is small enough to train on in a short period of time, but big enough to show reasonable language relations. It consists of 30k paired German and English sentences. To improve calculation efficiency, the dataset of translation pairs is sorted by length. As the lengths of the German and English sentences in a pair can differ significantly, the sorting uses both the sentences’ combined length and their individual lengths. Finally, the sorted pairs are loaded as batches. For Transformers, both the German and the English sequences in a pair are padded to a fixed length, and position-based masks record where the padding is. For our model, we train on an input of German sentences to output English sentences.
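The sorting, batching, and padding steps can be sketched in plain Python. This is a minimal illustration rather than the tutorial’s actual loader: the helper names (`sort_pairs`, `pad`, `make_batches`) are hypothetical, the sentences are assumed to be already encoded as lists of token indices, and each batch is padded to its own longest sentence rather than to one global length.

```python
# Hypothetical padding index; 3 matches the <PAD> token index used in this tutorial.
PAD_IDX = 3

def sort_pairs(pairs):
    """Sort (src, tgt) pairs by combined length, then by individual lengths."""
    return sorted(pairs, key=lambda p: (len(p[0]) + len(p[1]), len(p[0]), len(p[1])))

def pad(seq, length, pad_idx=PAD_IDX):
    """Right-pad a list of token indices to the requested length."""
    return seq + [pad_idx] * (length - len(seq))

def make_batches(pairs, batch_size):
    """Group sorted pairs into batches and pad each side to the batch maximum."""
    pairs = sort_pairs(pairs)
    batches = []
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        src_len = max(len(s) for s, _ in chunk)
        tgt_len = max(len(t) for _, t in chunk)
        batches.append({
            "src": [pad(s, src_len) for s, _ in chunk],
            "tgt": [pad(t, tgt_len) for _, t in chunk],
            # True marks a padded position, False marks an actual word
            "src_pad_mask": [[i >= len(s) for i in range(src_len)] for s, _ in chunk],
            "tgt_pad_mask": [[i >= len(t) for i in range(tgt_len)] for _, t in chunk],
        })
    return batches

# Toy data: 0 = <SOS>, 1 = <EOS>, other numbers are made-up word indices
pairs = [
    ([0, 10, 11, 12, 1], [0, 20, 21, 1]),
    ([0, 10, 1], [0, 20, 1]),
    ([0, 10, 11, 1], [0, 20, 21, 22, 23, 1]),
]
batches = make_batches(pairs, batch_size=2)
```

Sorting first keeps sentences of similar length together, so each batch needs less padding.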
We use the spaCy Python package for vocabulary encoding. The vocabulary indexing is based on the frequency of words, though indices 0 to 3 are reserved for special tokens:
- 0: <SOS> as “start of sentence”
- 1: <EOS> as “end of sentence”
- 2: <UNK> as “unknown” words
- 3: <PAD> as “padding”
Uncommon words that appear fewer than 2 times in the dataset are denoted with the <UNK> token. Note that inside the Transformer structure, the input encoding, which consists of frequency-based indices, passes through the nn.Embedding layer to be converted into vectors of the nn.Transformer model dimension. Note that this embedding mapping is applied per word. From our input sentence of 10 German words, we get a tensor of length 10 where each position holds the embedding of the corresponding word.
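As a rough illustration of the indexing scheme, here is a minimal pure-Python sketch. It is not the tutorial’s spaCy-based code: whitespace splitting and lowercasing stand in for real tokenization, and `build_vocab` and `encode` are hypothetical helper names.

```python
from collections import Counter

# Reserved tokens take indices 0 to 3, as in the list above
SPECIALS = ["<SOS>", "<EOS>", "<UNK>", "<PAD>"]

def build_vocab(sentences, min_freq=2):
    """Map tokens to indices ordered by frequency, dropping rare words."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # most_common() yields tokens from most to least frequent
    words = [w for w, c in counts.most_common() if c >= min_freq]
    return {tok: i for i, tok in enumerate(SPECIALS + words)}

def encode(sentence, vocab):
    """Convert a sentence to indices, wrapping it in <SOS> and <EOS>."""
    unk = vocab["<UNK>"]
    ids = [vocab.get(tok, unk) for tok in sentence.lower().split()]
    return [vocab["<SOS>"]] + ids + [vocab["<EOS>"]]

# Toy corpus: "zwei" appears 3 times, "personen" twice, everything else once
corpus = ["zwei junge personen", "zwei personen fahren", "zwei hunde"]
vocab = build_vocab(corpus)
encoded = encode("zwei junge personen", vocab)
print(encoded)  # [0, 4, 2, 5, 1]: "junge" appears only once, so it maps to <UNK>
```

The resulting index lists are what an nn.Embedding layer would take as input.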
Compared to RNNs, Transformers differ in requiring positional encoding. An RNN, with its sequential nature, encodes position information naturally. Transformers process all words in parallel and therefore require position information to be encoded explicitly in the inputs.
We calculate positional encoding as a function of position in the sequence (the time step). This function is expected to contain cyclic (sine and cosine functions) and non-cyclic components. The intuition here is that this combination lets attention take into account words far away from the word being processed, while the cyclic component keeps the encoding invariant to sentence length. We then add this information to the word embedding. In our case, we add it to the embedding of each token in the sentence, but another possible method is to concatenate it to each word’s embedding.
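For illustration, the sketch below uses the sinusoidal formulation from the original Transformer paper, which covers only the cyclic (sine and cosine) part described above. That this matches the tutorial’s own implementation is an assumption, and the function name and tensor sizes are made up for the example.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sine/cosine position codes.

    Assumes d_model is even: even columns hold sines, odd columns cosines.
    """
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # One wavelength per pair of columns, growing geometrically up to 10000 * 2 * pi
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

seq_len, batch_size, d_model = 10, 1, 16
# Stand-in for the nn.Embedding output, laid out as (sequence, batch, d_model)
embedded = torch.randn(seq_len, batch_size, d_model)
pe = sinusoidal_positional_encoding(seq_len, d_model)
# Add (rather than concatenate) the position code to every token embedding
encoded = embedded + pe.unsqueeze(1)
```

Because the encoding is added, the model dimension stays the same. Concatenation would instead widen every token vector.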
Here we emphasize Transformer layers and how cost functions are constructed.
PyTorch’s Transformer module is at the core of our application. The arguments we pass to torch.nn.Transformer in its forward call include: src, tgt, src_key_padding_mask, tgt_key_padding_mask, memory_key_padding_mask, and tgt_mask. These arguments are defined as:
- src: the source sequence
- tgt: the target sequence. Note that the target input is always shifted by 1 time step relative to the translation output
- src_key_padding_mask: a boolean tensor for the source language where 1 indicates padding and 0 indicates an actual word
- tgt_key_padding_mask: a boolean tensor for the target language where 1 indicates padding and 0 indicates an actual word
- memory_key_padding_mask: a boolean tensor where 1 indicates padding and 0 indicates an actual word. In our example, this is the same as the src_key_padding_mask
- tgt_mask: a lower triangular matrix used to process target generation recursively, where 0 indicates an actual predicted word and negative infinity indicates a prediction to ignore
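To make the arguments concrete, here is a minimal sketch of a single forward call. The sizes and the random tensors standing in for embedded sentences are made up for illustration; the layout assumes the module’s default (sequence, batch, feature) ordering, with key padding masks shaped (batch, sequence).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, src_len, tgt_len, batch_size = 32, 6, 5, 2

# Deliberately tiny model; dropout is off so the sketch is deterministic
model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=64, dropout=0.0)

# Stand-ins for embedded source and target sentences
src = torch.randn(src_len, batch_size, d_model)
tgt = torch.randn(tgt_len, batch_size, d_model)

# True marks padding. Here the second sentence of the batch is shorter.
src_key_padding_mask = torch.zeros(batch_size, src_len, dtype=torch.bool)
src_key_padding_mask[1, 4:] = True
tgt_key_padding_mask = torch.zeros(batch_size, tgt_len, dtype=torch.bool)
tgt_key_padding_mask[1, 3:] = True

# Float mask with 0 on and below the diagonal and -inf above it
tgt_mask = model.generate_square_subsequent_mask(tgt_len)

out = model(src, tgt,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            # The decoder attends to encoder outputs, so it reuses the source padding
            memory_key_padding_mask=src_key_padding_mask)
print(out.shape)  # torch.Size([5, 2, 32])
```

The output has one d_model-sized vector per target position. A final linear layer, not shown here, would map each vector to vocabulary scores.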
The Transformer is designed to take in a full sentence, so an input shorter than the Transformer’s input capacity is padded. The key padding masks let the Transformer exclude the elements after a sentence ends from its calculations. When the Transformer is used in sequence-to-sequence applications, it’s crucial to understand that even though the input sequence is processed all at once, the output sequence is processed progressively. This sequential progression is configured by tgt_mask. During training or inference, the target output is always one step ahead of the target input, as each recursion generates one additional word. During training, this is set up with “tgt_inp, tgt_out = tgt[:-1, :], tgt[1:, :]”. The tgt_mask is composed as a lower triangular matrix, so that each position can see only itself and the positions before it.
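A small sketch of such a mask and of the one-step shift is shown below. The helper name `square_subsequent_mask` and the toy token indices are illustrative assumptions.

```python
import torch

def square_subsequent_mask(size):
    """0 on and below the diagonal (visible), -inf above it (hidden)."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

mask = square_subsequent_mask(4)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

# One target sentence, shaped (sequence, batch): <SOS> 7 8 9 <EOS>
tgt = torch.tensor([[0], [7], [8], [9], [1]])
tgt_inp, tgt_out = tgt[:-1, :], tgt[1:, :]
print(tgt_inp.squeeze(1).tolist())  # [0, 7, 8, 9]
print(tgt_out.squeeze(1).tolist())  # [7, 8, 9, 1]
```

Each entry of tgt_out is the word the model should predict after seeing tgt_inp up to the same position.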
Row by row, a new position is unlocked for the target output, i.e. a new target word. The newly extended sentence is then fed back as the target input in this recursion.
While we do build the translation word-by-word for inference, we can train our model using a full input and output sequence at once. Each word in the predicted sentence can be compared with each word in the ground truth sentence. Since we have a finite vocabulary with our word embeddings, we can treat translation as a classification task for each word. As a result, we train our network with the Cross Entropy loss on an individual word level for the translation output in both the RNN and Transformer formulations of the task.
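The per-word classification loss can be sketched as follows. The random logits stand in for the network’s output after its final vocabulary projection, and skipping <PAD> positions through `ignore_index` is an assumption about the training setup, not something stated above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_IDX = 3  # index of <PAD> in the token list used by this tutorial
vocab_size, tgt_len, batch_size = 12, 4, 2

# Stand-in for model output: one score per vocabulary word at every position
logits = torch.randn(tgt_len, batch_size, vocab_size, requires_grad=True)

# Ground truth shifted by one step, shaped (sequence, batch), with trailing padding
tgt_out = torch.tensor([[5, 6],
                        [7, 1],
                        [1, PAD_IDX],
                        [PAD_IDX, PAD_IDX]])

# Assumption: padded positions are excluded from the loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Flatten so every (position, sentence) entry is one classification example
loss = criterion(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()
```

Flattening treats every word position as an independent classification over the vocabulary, which is why the whole target sentence can be scored in one pass during training.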
When we perform the actual German to English translation, the entire German sentence is used as the source input, but the target output, i.e. the English sentence, is translated word by word, starting with <SOS> and ending with <EOS>. At each step, we apply the argmax function over the vocabulary at the target output to obtain the next target word. Note that progressively choosing the highest probability word from our network is a form of greedy decoding.
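The decoding loop can be sketched as below. `TinyTranslator` is a hypothetical, untrained stand-in model that omits positional encoding for brevity, so the tokens it produces are meaningless; the point is only the feedback loop in `greedy_translate`.

```python
import torch
import torch.nn as nn

SOS_IDX, EOS_IDX = 0, 1

class TinyTranslator(nn.Module):
    """Hypothetical minimal model: embeddings, nn.Transformer, output projection."""

    def __init__(self, src_vocab, tgt_vocab, d_model=32):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=1, num_decoder_layers=1,
                                          dim_feedforward=64, dropout=0.0)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt):
        size = tgt.size(0)
        # Causal mask: each target position sees only itself and earlier positions
        tgt_mask = torch.triu(torch.full((size, size), float("-inf")), diagonal=1)
        hidden = self.transformer(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=tgt_mask)
        return self.out(hidden)  # (tgt_len, batch, tgt_vocab)

@torch.no_grad()
def greedy_translate(model, src, max_len=20):
    """Generate one token per forward pass, always taking the argmax."""
    model.eval()
    tgt = torch.tensor([[SOS_IDX]])  # (1, 1): start with <SOS>
    for _ in range(max_len):
        logits = model(src, tgt)
        next_token = logits[-1, 0].argmax().item()  # scores at the last position
        tgt = torch.cat([tgt, torch.tensor([[next_token]])], dim=0)
        if next_token == EOS_IDX:
            break
    return tgt.squeeze(1).tolist()

torch.manual_seed(0)
model = TinyTranslator(src_vocab=50, tgt_vocab=40)
src = torch.tensor([[0], [10], [11], [12], [1]])  # one encoded source sentence
translation = greedy_translate(model, src, max_len=10)
```

This sketch re-encodes the source sentence on every step for simplicity. An implementation could compute the encoder output once and reuse it.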
The Transformer model is very effective in solving sequence-to-sequence problems. Funnily enough, its effectiveness comes from processing a sentence as a graph instead of an explicit sequence. Each word at a particular position considers all other words. The Transformer powers this approach with the attention mechanism, which captures word relations and applies attention weights to the words of focus. Unlike Recurrent Neural Networks, the Transformer module can be calculated in parallel. Note that the Transformer model takes fixed-length sequences for inputs and outputs, so sentences are padded with <PAD> tokens to the fixed length.
A full Transformer network consists of a stack of encoding layers and a stack of decoding layers. These encoding and decoding layers are composed of self-attention and feed forward layers. One of the basic building blocks of the Transformer is the self-attention module, which contains Key, Value, and Query vectors. At a high level, the Query and Key vectors together calculate an attention score between 0 and 1, which scales how much the current item is weighted. Note that if the attention score only scaled items to be bigger or smaller, we couldn’t really call it a transformer yet. In order to start transforming the input, the Value vector is applied to the input vector. The output of the Value vector applied to the input vector is then scaled by the attention score we calculated earlier.
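The Query, Key, and Value interaction can be sketched with standard scaled dot-product attention for a single head. The random projection matrices `w_q`, `w_k`, and `w_v` are stand-ins for learned weights, and the sizes are arbitrary.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Weight each Value row by how well its Key matches the Query."""
    d_k = query.size(-1)
    # Similarity of every query with every key, scaled to keep softmax stable
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax turns each row into attention scores between 0 and 1 that sum to 1
    weights = torch.softmax(scores, dim=-1)
    return weights @ value, weights

torch.manual_seed(0)
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)  # one embedded sentence of 4 words

# Stand-ins for the learned Query, Key, and Value projections
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
output, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
```

Row i of `weights` shows how much word i attends to every word in the sentence, and row i of `output` is the correspondingly weighted mix of Value vectors.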