Hi there and welcome to Lucidate’s fifth video on transformer neural networks! In recent years, transformers have become one of the most popular architectures in natural language processing and other areas of artificial intelligence. In this video post, we'll be diving into what transformers are, how they work, and why they're so powerful.
Hit ‘Play’ above to get an overview of the transformer architecture, or open up another window and read along with the script below.
In this video we’ll discuss:
1. The encoder / decoder architecture
2. We’ll look at what the transformer does during training & inference.
a) Training is what it does when it is learning by example and updating its weights.
b) Inference is when it is running and predicting the next sequence of words.
3. This will lead to a discussion of why we have masked attention and what the mask does
Encoder and Decoder
Transformers consist of two main components: the encoder and the decoder. The encoder takes in the input data and transforms it into a compact, low-dimensional representation called the "hidden state". The decoder then takes this hidden state and produces the final output.
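To make the encoder / decoder split concrete, here is a minimal sketch using PyTorch’s built-in nn.Transformer module. The dimensions, layer counts and random inputs are illustrative assumptions rather than the settings of any real model:

```python
# Minimal encoder/decoder sketch: the encoder turns the source sequence into
# a hidden state, and the decoder turns that hidden state into the output.
import torch
import torch.nn as nn

d_model = 64  # width of the hidden-state vectors (illustrative)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 5, d_model)  # 5 already-embedded source tokens
tgt = torch.randn(1, 3, d_model)  # 3 already-embedded target tokens

out = model(src, tgt)             # one output vector per target position
print(out.shape)                  # torch.Size([1, 3, 64])
```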
For NLP models, such as ChatGPT, the inputs to the encoder are words. As discussed in prior videos, machine learning models don’t process words directly; they only understand scalars, vectors, matrices and tensors.
So one way to think of the encoder is as translating human-readable words into the hidden state tensor representation, which is understood and interpretable by the AI model. Likewise, the decoder is responsible for translating the machine-readable hidden state tensor representation back into a human-readable sequence.
As the name suggests the transformer model will transform the sequence somehow. The output sequence might be a translation of the input sequence - let’s say from French to German; it might be a summary of the input sequence, or it might be the generation of an entire article from a simple headline or first sentence.
The first thing that the encoder does is to convert these words into vectors. To understand how this semantic embedding works please see the second video post in this series.
The next thing the encoder does is to add positional embeddings to these vectors to record the position of each word in the sequence. The third video post in this series concerns positional embeddings.
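As a rough sketch of those first two steps, here is how token ids might be looked up as embedding vectors and combined with sinusoidal positional encodings. The vocabulary size, model width and token ids are made-up values for illustration:

```python
# Sketch: semantic embedding lookup plus sinusoidal positional encoding.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 5
embed = nn.Embedding(vocab_size, d_model)  # semantic embedding lookup table

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()    # even dimensions
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)             # sine on even dimensions
    pe[:, 1::2] = torch.cos(angle)             # cosine on odd dimensions
    return pe

token_ids = torch.tensor([[4, 17, 23, 9, 2]])  # stand-ins for real word ids
x = embed(token_ids) + positional_encoding(seq_len, d_model)
print(x.shape)                                 # torch.Size([1, 5, 64])
```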
Then the attention block dynamically weights the importance of different elements in the sequence, determines which elements are most related to one another and identifies the information relevant to the transformation task it has been set. The fourth video in this series covers attention.
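To give a feel for that weighting in code, here is a single-head, unmasked scaled dot-product attention sketch. The projection matrices would normally be learned; here they are random placeholders:

```python
# Sketch of scaled dot-product attention for one head, no masking.
import torch
import torch.nn.functional as F

d_model = 64
x = torch.randn(1, 5, d_model)        # 5 embedded tokens

W_q = torch.randn(d_model, d_model)   # learned projections (random stand-ins here)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.transpose(-2, -1) / d_model ** 0.5  # how strongly each word relates to each other word
weights = F.softmax(scores, dim=-1)                # dynamic importance weights, each row sums to 1
attended = weights @ V                             # weighted mix of the value vectors
print(weights.shape, attended.shape)               # (1, 5, 5) and (1, 5, 64)
```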
Grammar analogous to attention (but only analogous - grammar is not attention)
A useful analogy for attention is the rules of grammar. Let’s be clear here, attention is not grammar. Grammar provides rules and principles that determine the relationship between words in a sentence and their roles. It does this to help determine the meaning of a sentence. Analogously attention provides some learned rules between embeddings to better understand and formulate input and output sequences.
Training and Inference
Let’s turn our attention to training and inference.
Training
Training a transformer for NLP begins with converting a sequence of words into a numerical representation, known as an embedding. As per previous videos, this captures the meaning and position of each word in the input sequence. This embedding is then passed through a series of layers in the transformer to produce the final output.
There are many successful use-cases for transformers; several are shown on your screen. To illustrate the training process we’ll focus on number 7 in this list: sentence completion.
Let’s choose an appropriate sentence for completion. The sentence will be ‘An input sequence generates an output sequence’. For this illustration our input sequence will be the first part of the sentence, ‘An input sequence generates’, and our output sequence will be ‘an output sequence’.
We’ll write our input sequence as an input to our encoder and the output sequence - our target - above the decoder.
Our encoder will take these words and, using the techniques described in prior videos, convert them into a numeric representation. Along the way they will pick up semantic embeddings, positional embeddings and attention scores from the various modules in the encoder. The encoder will output our hidden state tensor, which is then fed into the decoder. The goal here is to convert this into our target sequence, which begins with the word ‘an’. Early on in training a transformer the weights will be randomly initialised, so we would expect a random predicted word to emerge from our decoder. In this case that word is ‘Hedgerow’. We can calculate the difference between our expected word ‘an’ and our incorrectly predicted word ‘Hedgerow’ using cosine similarity in our embedding space. Please see the second video in this series for a refresher on this concept.
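As a small illustration of that comparison, here is how the cosine similarity between the expected and predicted word vectors might be computed. The vectors below are random stand-ins for real learned embeddings:

```python
# Sketch: comparing the target word and the predicted word in embedding space.
import torch
import torch.nn.functional as F

target_vec = torch.randn(64)      # embedding of the expected word, 'an'
predicted_vec = torch.randn(64)   # embedding of the (wrong) predicted word, 'Hedgerow'

similarity = F.cosine_similarity(target_vec, predicted_vec, dim=0)
error = 1.0 - similarity          # smaller error means a closer prediction
print(similarity.item(), error.item())
```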
After this forward pass we can use backpropagation to update the trainable parameters in our transformer. This includes the weights and biases in our feedforward neural networks as well as the Q, K and V matrices in our attention modules. In this schematic I’ve represented single attention heads. In sophisticated models like Bard, BERT, GPT-3 and ChatGPT there will be multiple attention heads run in parallel. You will hear the term ‘multi-headed attention’ used to describe this.
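A sketch of one such update step is below. Note that real language models usually compare a predicted probability distribution against the target word with a cross-entropy loss; a simple mean-squared error on embeddings stands in here to keep the example short, and the model size, optimiser settings and tensors are illustrative assumptions:

```python
# Sketch of a single training step: forward pass, loss, backpropagation, update.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

src = torch.randn(1, 4, 64)            # embedded input sequence
tgt_in = torch.randn(1, 3, 64)         # embedded (teacher-forced) decoder input
tgt_expected = torch.randn(1, 3, 64)   # embeddings of the words we expect

prediction = model(src, tgt_in)                        # forward pass
loss = nn.functional.mse_loss(prediction, tgt_expected)
loss.backward()                                        # backpropagate the error
optimiser.step()                                       # nudge all trainable parameters,
                                                       # including the Q, K and V matrices
optimiser.zero_grad()
```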
So after this prediction let’s take stock. We have an incorrectly predicted word, ‘hedgerow’, rather than ‘an’, and we’ve used this error to update our transformer weights a little. We now have the next word to predict in our sequence: ‘output’. Our input sequence is now one word longer. In order to train the network sensibly, we won’t use the incorrectly produced word ‘hedgerow’; instead we will use the correct word ‘an’. This is called ‘Teacher Forcing’. It means that during training only, the model uses the true output sequence as input for each time step, instead of its own predictions.
It means that at each step of the training process, the model is fed the actual, ground-truth output sequence, instead of relying on the predicted output from the previous step.
We then go again, now the word we are trying to predict is ‘output’, and we now have a longer input sequence. We have the original input sequence of ‘An input sequence generates’ concatenated with the next teacher forced word ‘an’. These are fed into the encoder and decoder as shown.
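The little loop below sketches teacher forcing in miniature: at every step the decoder input is built from the ground-truth target words seen so far, never from the model’s own (possibly wrong) predictions. Plain strings stand in for real token ids:

```python
# Sketch: how the encoder and decoder inputs evolve under teacher forcing.
input_seq = ["An", "input", "sequence", "generates"]
target_seq = ["an", "output", "sequence"]

for step in range(len(target_seq)):
    decoder_input = target_seq[:step]      # correct words so far, not predictions
    word_to_predict = target_seq[step]
    print(f"encoder: {input_seq}  decoder in: {decoder_input}  predict: '{word_to_predict}'")
```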
Once again we take a forward pass through the transformer, generating the word embeddings, multiplying the embeddings by the attention matrices and feeding our neural networks, and we generate the second word in the sequence. Once again, in the early stages of training this is likely to be way off. We use the error to update our weights and we go again.
During training this forward pass to generate an estimate, comparison with the expected result and back propagation of the error is repeated millions or billions of times to hammer the tunable parameters to where they need to be.
Brute Force
If this seems like brute force, it is. This is the unsubtle end of AI. While a great many of the solutions are very elegant, the reality of most AI systems is that you need a vast amount of training data, a huge amount of GPU or TPU compute and months of training to get these models into shape.
Masked Attention
The mask in the attention head has a crucial role to play here. If the sequences input into the transformer are shorter than the maximum length, the mask will pad out the sequence with null values. More importantly, during training it will stop the decoder ‘peeking ahead’ and looking at the next word in the sequence. A triangular ‘look-ahead’ mask ensures that the decoder can only pay attention to the input sequence, via the hidden state, and to the words generated in the output sequence so far, never to words that come later.
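A common way to build that look-ahead mask is a triangular matrix of negative infinities which is added to the attention scores before the softmax. The sketch below shows the idea for a four-word sequence:

```python
# Sketch of the causal (look-ahead) mask for a 4-token sequence.
import torch

seq_len = 4
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
# Added to the attention scores, the -inf entries become zero weight after the
# softmax, so each position can never attend to positions that come after it.
```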
Inference
Having looked at training, inference is much simpler. When we are training we follow each forward pass with a backpropagation step to update the weights in the transformer. With inference we just have forward passes to guess the next word. There is no ‘teacher forcing’ during inference. We simply append the word we have predicted to our input sequence and run another pass to generate another predicted word. This continues until we encounter a stop sequence, that is to say a special token that stops the transformer generating further output.
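To round things off, here is a toy version of that inference loop. The ‘model’ is a canned stand-in so the loop runs end to end; a real transformer forward pass would take its place:

```python
# Sketch of autoregressive inference: predict, append, repeat until a stop token.
STOP = "<eos>"

def dummy_model(tokens):
    # Stand-in for a real forward pass: returns a canned next word.
    canned = {"generates": "an", "an": "output", "output": "sequence", "sequence": STOP}
    return canned.get(tokens[-1], STOP)

def generate(prompt_tokens, max_len=20):
    tokens = list(prompt_tokens)
    while len(tokens) < max_len:
        next_token = dummy_model(tokens)   # forward pass only, no backpropagation
        if next_token == STOP:             # stop sequence ends generation
            break
        tokens.append(next_token)          # the predicted word becomes part of the input
    return tokens

print(generate(["An", "input", "sequence", "generates"]))
# ['An', 'input', 'sequence', 'generates', 'an', 'output', 'sequence']
```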
Conclusions
In summary, we explored the concept of Transformer Neural Networks and their significance in natural language processing and artificial intelligence.
We looked at the encoder-decoder architecture: The encoder converts the input data into a low-dimensional representation called the hidden state while the decoder produces the final human-readable output.
We saw how the transformer converts words into a numeric representation, utilising positional embeddings. And we checked out the attention mechanism that dynamically weights the relationships and relevance of elements in the sequence.
We also looked at the ‘brute-force’ training process, utilising backpropagation to update the parameters in the transformer using huge training datasets over millions or billions of iterations.
Finally, we used an example of sentence completion to illustrate the steps involved, during both training and inference, in transforming a sequence of words into a target sequence.
