How can you build powerful language models that can beat almost any benchmark and solve most NLP tasks? Recent approaches in NLP bring new insights and techniques for building such deep learning models. In this article, we will talk about one of those recent approaches: unsupervised pre-training with a multi-task learning objective, built on multi-head self-attention mechanisms.
So:
- First, we’ll review how unsupervised pre-training works.
- Then, we’ll see how we can improve that with multiple unsupervised learning objectives.
- Next, we’ll see how to fine-tune the unsupervised model with supervised training on a clear learning objective.
- Finally, we’ll dive into the technical details of the deep learning models that are closest to the state of the art for solving most NLP problems.
Unsupervised pre-training of models
What is unsupervised pre-training?
Unsupervised learning means learning from data without a clear training objective or labels.
Unsupervised pre-training is a way to make a Neural Network (or other machine learning algorithms) learn on data in general before learning how to solve a precise task.
It can be compared to students completing a general education phase when young before specialising later. Deep Neural Networks (DNNs) can benefit from the same kind of curriculum learning.
Why is it needed to beat the SOTA across varied NLP tasks?
When specialised data is scarce, rare, or hard and costly to generate, unsupervised pre-training is a good choice: a properly pre-trained neural network needs to see less supervised training data to make good predictions than a neural network trained only on the supervised training data.
It’s also worth noting that, depending on the nature of the task at hand, what counts as “small quantities of data” can still be millions of examples or more. Take natural language: how long does the human brain need to be trained on language before being fully intelligent? It can be counted in decades.
To sum up, most NLP tasks and benchmarks only provide a few training examples for artificial neural networks to train on. To beat those benchmarks, it is good practice to make the neural network learn to model language in an unsupervised way before applying it to the specialised data at hand to solve the practical task.
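To make this concrete, here is a minimal sketch of that workflow, assuming the Hugging Face `transformers` library and an illustrative model name (the article doesn’t prescribe any particular library): we load weights that were already pre-trained on unsupervised text, then fine-tune them on a tiny labeled dataset.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # weights come from unsupervised pre-training
)

# A tiny supervised dataset: the pre-trained weights already "know" the language.
texts = ["great movie, would watch again", "a complete waste of time"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # a fresh classification head sits on top of the encoder
outputs.loss.backward()                  # only a little labeled data is needed from here
optimizer.step()
```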
Multi-Task Learning (MTL)
What is MTL and why does it help?
Multi-Task Learning (MTL) is a way to combine many learning tasks into one and the same model (i.e. a Deep Neural Network). The neural network then learns commonalities and shares learned neural representations across the tasks, improving each of them. This leads to better generalization, which is not only a way to avoid overfitting, but also a way to gain accuracy by reusing some of the learned general-purpose or common-sense reasoning across tasks.
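As an illustration, here is a hedged sketch of the “hard parameter sharing” flavour of MTL in PyTorch: one shared encoder, plus one small output head per task. All layer sizes and task names are made up for the example.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared encoder, one output head per task (hard parameter sharing)."""

    def __init__(self, vocab_size=10000, hidden=256, n_pos_tags=17, n_sentiments=2):
        super().__init__()
        # Shared layers: every task backpropagates into these weights.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        # Task-specific heads: one per learning objective.
        self.pos_head = nn.Linear(hidden, n_pos_tags)          # token-level task
        self.sentiment_head = nn.Linear(hidden, n_sentiments)  # sentence-level task

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))              # (batch, seq, hidden)
        pos_logits = self.pos_head(states)                           # one prediction per token
        sentiment_logits = self.sentiment_head(states.mean(dim=1))   # pooled over the sentence
        return pos_logits, sentiment_logits

# Training typically just sums the per-task losses before backpropagating:
#   loss = pos_loss + sentiment_loss; loss.backward()
```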
How are unsupervised pre-training and Multi-Task Learning done in practice?
Okay, so unsupervised pre-training doesn’t totally have “no” learning objective at all. In practice, it almost always consists of transforming the data so as to simulate a supervised training environment. In the case of BERT, at least two loss functions are used to backpropagate the learning in a supervised way. BERT is a model by Google that reached the State Of The Art (SOTA) on many tasks; GPT-2 by OpenAI is a very similar model. Both models make use of the same kind of neural architecture with attention mechanisms.
One of the ways BERT formulates an unsupervised learning objective (loss function) is to make the neural network read a sentence with missing words and ask it to fill in the blanks. Although this formulation is supervised learning in practice, it can be seen as unsupervised learning in the sense that it only requires unstructured text data in the first place. As a consequence, the neural network learns to guess words and to recognise patterns in sentences, developing a kind of grammatical and contextual intelligence. This vaguely resembles N-gram language models, except that here the encoder (the neural network) is bidirectional thanks to its attention mechanisms, and the N-gram window is somehow abstracted away since predictions are made across whole sentences (and possibly even larger blobs of text if required).
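To give an idea of how such a “fill in the blanks” objective can be manufactured from raw text, here is a simplified sketch (BERT’s actual recipe also sometimes keeps or randomly replaces the selected tokens instead of always masking them; the helper below ignores that refinement):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Hide ~15% of the tokens and remember the original words as targets."""
    corrupted, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:
            corrupted.append(MASK_TOKEN)   # hide the word...
            targets.append(token)          # ...and ask the network to recover it
        else:
            corrupted.append(token)
            targets.append(None)           # no loss is computed on unmasked positions
    return corrupted, targets

corrupted, targets = mask_tokens("the cat sat on the mat".split())
# e.g. corrupted == ['the', 'cat', '[MASK]', 'on', 'the', 'mat'] and targets[2] == 'sat'
```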
However, training a BERT model doesn’t end there. The training procedure in BERT includes other losses / objective functions extracted from the text itself. Another objective is a simple binary classifier that predicts whether or not a sentence follows another sentence. This helps the neural network tell the difference between closely related and unrelated sentences, and gives it some sentence-level interpretation skills.
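A rough sketch of how such sentence pairs could be built from a plain list of sentences, labelling real next sentences with 1 and randomly drawn ones with 0:

```python
import random

def make_nsp_pairs(sentences):
    """Label 1 if sentence B really follows sentence A, 0 if B was drawn at random."""
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))          # real next sentence
        else:
            pairs.append((sentences[i], random.choice(sentences), 0))  # random sentence
    return pairs
```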
Pretty much the same goes for GPT-2 from OpenAI. It can solve many tasks, including classification, textual entailment, similarity, answering multiple-choice questions, and freely generating a text composition given the start of the text.
Model fine-tuning
Obviously, after unsupervised pre-training on multiple objectives, the multi-task learning doesn’t stop there. It’s possible to clone the model and to fine-tune it for many different tasks. This consists of adding more learning objectives and more output heads to the model so that it can solve those tasks. It’s possible to add classification heads, regression heads, and anything else, to solve NLP tasks that range from Part Of Speech (POS) tagging, to Named Entity Recognition (NER), to automatic Question Answering (QA), and so forth.
For example, to perform question answering, you might want to add two linear layers: one that points to the beginning of the answer in the text, and another that points to the end, so as to extract an answer span, as is done with BERT.
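As a hedged sketch (sizes and names are illustrative, not BERT’s actual code), such a question-answering head can be as small as a single linear layer producing two scores per token, one for the start of the answer span and one for the end:

```python
import torch
import torch.nn as nn

hidden_size, seq_len, batch = 768, 128, 4
hidden_states = torch.randn(batch, seq_len, hidden_size)  # stand-in for the encoder output

qa_head = nn.Linear(hidden_size, 2)      # 2 scores per token: "start" and "end"
logits = qa_head(hidden_states)          # (batch, seq_len, 2)
start_logits, end_logits = logits.unbind(dim=-1)

start = start_logits.argmax(dim=-1)      # predicted index of the first answer token
end = end_logits.argmax(dim=-1)          # predicted index of the last answer token
```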
The truth is that there is no one-size-fits-all solution, even though BERT-like models can do many things. For fine-tuning, the choice of the proper output head is important and may vary in size and shape.
Multi-Head Self-Attention Mechanisms
SPOILER ALERT: heavy maths ahead. Read at your own risk.
In its simplest (vanilla) form, as laid out by Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio in 2014, an attention mechanism is a mini neural network that evaluates how much attention to give to each item in a set by generating a weight “α” per item. The most salient items’ values are mostly kept when multiplied by their weight (and the others mostly discarded), then everything is summed so as to keep a weighted average of what was important. Note: each item to be summed in the attention mechanism is a vector, so the vectors, once weighted by “α”, are either left almost intact or pushed to values very close to zero.
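A toy sketch of that vanilla idea in PyTorch, where a mini network scores each item against a query, the scores become weights “α” through a softmax, and the weighted item vectors are summed (all sizes are illustrative):

```python
import torch
import torch.nn as nn

d = 8                                    # dimensionality of each item vector
items = torch.randn(5, d)                # 5 items to attend over
query = torch.randn(d)                   # what we are currently looking for

# The "mini neural network" that scores each (item, query) pair.
scorer = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(), nn.Linear(d, 1))

scores = scorer(torch.cat([items, query.expand(5, d)], dim=-1)).squeeze(-1)  # (5,)
alpha = torch.softmax(scores, dim=0)     # attention weights, they sum to 1
context = (alpha.unsqueeze(-1) * items).sum(dim=0)  # salient items dominate the sum
```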
What are Multi-Head Self-Attention mechanisms?
Multi-Head Self-Attention is one of the most recent variants of the attention mechanism, used in both BERT and GPT-2.
Before even talking about self-attention and multiple heads, we must first restate the attention mechanism so that its definition is more thorough.
The keys “K” and the queries “Q”
As stated, an attention mechanism weighs every item before summing them. In recent attention such as the Multi-Head Self-Attention mechanism, those weights come from the dot product (almost a cosine similarity) between a query vector “Q” and one of the key vectors “K”. So in the case of the previous section, you would have one query to compare with each key “K”, giving one similarity value per key for that query. Those values are the attention weights “α”.
The values “V”
Then, the attention weights “α” are passed through a softmax to be normalized so that they all sum to one. We now have as many attention weights as we have values, because there are as many values as there were keys, and our weights come from matching one query against all the keys.
The attention weights “α” then multiply each value, and the result is summed (as discussed earlier). We now have one result vector for the combination of a query vector and many key-value vector pairs (the keys and values were reduced by one dimension). So we end up with a single vector that encapsulates everything the neural network chose to pay attention to given the query and the keys, summed (somehow averaged) through the values.
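Putting the last few paragraphs together, here is a minimal sketch of that computation for a single query and a set of key-value pairs, in the scaled dot-product form used by the Transformer (dimensions are illustrative):

```python
import torch

d = 64
q = torch.randn(d)            # one query
K = torch.randn(10, d)        # 10 keys
V = torch.randn(10, d)        # 10 values, one per key

scores = K @ q / d ** 0.5              # dot product of the query with every key
alpha = torch.softmax(scores, dim=0)   # normalized attention weights, they sum to 1
output = alpha @ V                     # weighted sum of the values -> one vector of size d
```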
Sets of queries instead of one query: the “self” attention
Above, we described attention mechanisms with one query “Q” and a set of keys “K” with their respective values “V”. In that case, the result is one vector.
In practice, we not only have as many keys as we have values, we also have as many queries as keys and values. If we have many queries instead of one, the whole process is simply repeated for each of them, yielding as many output vectors as we had queries.
The real trick in BERT and GPT-2 is then to make the queries, keys, and values all come from the same thing! Hence the name: “self”-attention mechanisms. That same thing is the set of word representations inside the deep neural network, the operation being repeated between every layer, with linear projections as well. So we take each word of a sentence and turn it into a query; each word queries every other word, which yields the attention weights used to sum over the values, producing one reduced output per query. The result? Words are compared to each other, and each one can profit from some of the context of the sentence to enrich its representation. As such, the word vectors no longer represent only the words, but the words in their context, because remember: BERT is trained to predict missing words, and it needs to fill in those tokens using the attention from every surrounding word.
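A short sketch of that idea: the queries, keys, and values are all linear projections of the same matrix of word representations, so every word attends to every other word (dimensions are illustrative):

```python
import torch
import torch.nn as nn

seq_len, d = 6, 64
X = torch.randn(seq_len, d)             # word representations inside the network

W_q, W_k, W_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
Q, K, V = W_q(X), W_k(X), W_v(X)        # same input, three learned projections

alpha = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)  # (seq_len, seq_len): every word attends to every word
contextualised = alpha @ V              # each word's new vector now mixes in its context
```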
Multi-Head
There is just one missing piece to fully understand the Multi-Head Self-Attention mechanisms as used in BERT and GPT-2: the Multi-Head part. Well, this one is simple: before the whole process, the word representations are projected with a linear layer and split into many lower-dimensional word representations. This way, the whole attention computation can be done many times in parallel, with many small representations of the words, which changes how the attention sums things up. It’s like keeping things in their own small environments before concatenating all the results at the end, letting different attention heads treat the subject from a different angle instead of losing everything in a single summation that reduces the tensors.
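And the multi-head part, sketched on top of the previous snippet: project, split into several smaller heads, attend within each head independently, then concatenate and mix the results with a final linear layer (again, the dimensions are illustrative):

```python
import torch
import torch.nn as nn

seq_len, d, n_heads = 6, 64, 8
d_head = d // n_heads                                   # 8 dimensions per head
X = torch.randn(seq_len, d)

W_q, W_k, W_v, W_o = (nn.Linear(d, d) for _ in range(4))

def split_heads(t):                                     # (seq, d) -> (heads, seq, d_head)
    return t.view(seq_len, n_heads, d_head).transpose(0, 1)

Q, K, V = split_heads(W_q(X)), split_heads(W_k(X)), split_heads(W_v(X))

alpha = torch.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)  # per-head attention weights
heads = alpha @ V                                       # (heads, seq, d_head)

# Concatenate the heads back together and mix them with a final linear layer.
out = W_o(heads.transpose(0, 1).reshape(seq_len, d))    # (seq, d)
```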
How are they used in BERT and in GPT-2?
BERT actually stands for “Bidirectional Encoder Representations from Transformers”. This is quite ironic considering that it DOES NOT make use of Bidirectional Recurrent Neural Networks (RNNs), and that Multi-Head Self-Attention mechanisms can process information in every direction (not exactly 2 directions as in “bidirectional”). Anyway, simply said, BERT is a huge stack of Multi-Head Self-Attention mechanisms with Positional Encoding, an architecture Google calls the Transformer. The same goes for GPT-2: OpenAI reused the Transformer, changed the data on which it was trained, and changed the learning objective to make it a generative model.
Why are attention mechanisms sometimes better than RNNs, and why sometimes not?
Attention mechanisms are sometimes better than RNNs for two reasons:
- They capture neural information in all directions, treating the input as a set of values rather than a list or doubly-linked list. In that sense, attention mechanisms are like the sets (or dictionaries, hashmaps) of neural networks, while RNNs are to be compared to linked lists.
- While the information can be processed in any direction (all at once, even more than bidirectionally!), it can also be processed simultaneously. So if you have 10 items to compare, comparing every item with every other item takes 10*10 == 100 operations: the mechanism has O(n²) space complexity. The thing is, those O(n²) comparisons can all be computed at once, so the algorithm is O(1) in (parallel) time!
You can probably already see where this is going: given an infinite amount of memory, the computation can take place in constant time. However, nobody has an infinite amount of memory in their GPUs or TPUs. Considering this, it’s easy to fall for cringy and borderline-fake articles (that even cite the wrong dates and sources) like this one claiming that the RNN is no good anymore (this article got way too much attention - badum tsss). And even despite that, it’s interesting to note that the winners of a recent toxic comment classification challenge used RNNs instead of attention, and reported that attention models took longer to train and were the only other models that got performance comparable to RNNs (probably because their hardware didn’t have enough cores to do a real O(1) operation all at once, though). Anyway, to nuance all that properly: overall, it’s healthy and useful to understand that RNNs are still useful alongside attention mechanisms, and that they remain a good choice if you want to stick to just O(n) memory at the cost of O(n) time, since everything processes linearly in RNNs. It’s mostly that attention approaches beat every benchmark given enough unsupervised pre-training on big data sources.
To sum up, in the case of NLP, given a big enough dataset to pre-train on, if you limit, clip or split your maximum sentence length to a fixed number of words, you have just bounded your problem to use at most a constant amount of memory, so that’s fine: most sentences aren’t too long, so paying a squared cost in space to capture correlations between every pair of words and their semantic sense isn’t that bad, and it helps achieve good results. That’s good enough to beat most NLP benchmarks. In some cases: Attention Is All You Need (AIAYN).
Conclusion
Recent NLP approaches are very technical, and understanding their inner workings is important: not only to be able to use them and interact with such code, but also to generalize properly on the subject matter, be creative, and come up with new ideas on how to use, modify, and improve them. So the most important thing is to get a feel for how things work, and how they might work in the future. Every little detail is subject to change, but overall, things are on their way.
To sum up, we’ve covered how unsupervised pre-training works, how to use multiple unsupervised learning objectives, how to perform fine-tuning with a supervised training phase and a clear learning objective on top of the unsupervised model, and finally we dove into the technical details of how Transformer neural architectures work with their attention mechanisms and all the fluff.
In the future, we expect new approaches to include new or better ways to formulate unsupervised objectives so as to assist learning. We should also see new ways to connect neurons together within deep neural networks to obtain better scores, as well as to improve the computational efficiency and algorithmic complexity of the whole process.