Language is a natural phenomenon that is, to the best of our knowledge, unique to humans. Although there are incipient findings suggesting that other animals, such as dolphins, might be able to use a form of language, nothing in nature even begins to approach the complexity of human language. Moreover, despite how mind-bogglingly complex it might be, humans are really, really good at language; so good, in fact, that we sometimes forget how difficult language is to process from a non-human (or even non-dolphin) perspective.
If, for example, we were trying to make a computer understand human language, we'd be hard-pressed to start communication right away, be it verbal or written, because, after all, computers are not really good at anything except doing tons and tons of computations very quickly. Say “hi!” to a computer and nothing happens; ask it to compute the square root of 1.43263 and we won't even have time to blink. It pays to remember that all the pretty things a computer is able to show us, like loving emails, cat pictures, funny dog videos, and oppressive all-seeing surveillance, are nothing but the result of being inhumanly good at performing additions and subtractions. And not just any kind of additions and subtractions, but binary additions and subtractions. The computer doesn't actually compute the square root of 1.43263 but rather the square root of 1.01101110110000001101.
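If you're curious, here's a minimal Python sketch of where that binary expansion comes from; it's a toy digit-by-digit expansion of the fractional part, not how a CPU actually stores floating-point numbers:

```python
def binary_fraction(x, bits=20):
    """Expand the fractional part of x into its first `bits` binary digits."""
    digits = []
    for _ in range(bits):
        x *= 2        # shift one binary place to the left
        bit = int(x)  # the integer part is the next binary digit
        digits.append(str(bit))
        x -= bit      # keep only the remaining fraction
    return "".join(digits)

print("1." + binary_fraction(0.43263))  # -> 1.01101110110000001101
```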
So if we wanted to engage with a computer, and have it engage with us, using human language, either we have to learn to speak in ones and zeroes or, more wisely, find a suitable language representation that the computer is able to understand. Whether you want to ask it a question, have it translate a phrase, or have it summarise a text, you need to be able to represent your input in a manner that the computer can convert into a suitable output. Language representations are a huge part of Natural Language Processing (NLP), the field of Computer Science that deals with human language and computers. A language representation is just a mathematical object, like an n-dimensional vector, that approximates and condenses the richness of natural language such that a computer may process it.
This is easier said than done because, as we've mentioned, language is a very complex phenomenon. Just to give you an idea, consider the following example, taken from one of my favourite books of all time, Twenty Thousand Leagues Under the Sea¹.
« Deux hommes entre tous les hommes ont le droit de répondre maintenant. Le capitaine Nemo et moi. »
(“Two men amongst all men have the right to respond now. Captain Nemo and I.”)
If we wanted to examine it from a linguistic point of view, there are different ways we could try to uncover the features of this excerpt:
- Starting from a lexical point of view, we might simply list all the words in the sentences.
- Using a morpho-syntactical approach, we would note that, because this is French, we have to take into account things like noun gender, number, inflection, etc.
- Doing a semantic analysis, we could ask ourselves whether the noun phrase tous les hommes (“all men”) refers strictly to adult males, or more generally to all human beings.
- Assuming a pragmatic approach, we would have to note that the two men having the “right to respond now” suggests that they did not have that right before some event happened.
- Finally, from a discourse perspective, we have to assume, without ambiguity, that the two men mentioned in the first sentence are Captain Nemo and the narrator, despite the fact that this is nowhere explicitly stated.
And this is just if we're talking about written language. When we deal with spoken language, we would have to take into account things like intonation (e.g. is the text read in a sarcastic tone?), prosody (e.g. are the right syllables and words stressed?), dialect (e.g. is it said in an American accent, a British accent, or a superb French accent?) and language competence (e.g. is the person reading the text out loud good at pronouncing each word?).
When it comes to features, as you can see, language is no joke.
So how exactly do you start encoding such complexity into measly ones and zeroes? As it turns out, there is more than one way, and we're going to list just a few, in increasing levels of complexity.
Let's say we want to encode a sentence as a vector, that is, a mathematical object that a computer can work with. The simplest way is to list all the words that appear in the sentence, assign an index to each of them, and then list the indices of the words in the order in which they appear in the sentence. So, for example, in the sentence Le chat est sur le tapis (“The cat is on the mat”), we can see the words (ignoring capitalisation)
- chat
- est
- le
- sur
- tapis
and, thus, a vector representation of the sentence could be [2, 0, 1, 3, 2, 4]. This simple, and sometimes useful, representation is closely related to the one-hot vector representation: each index is shorthand for a vector the size of the entire vocabulary of the language, in which the position corresponding to the word is set to 1 while every other position is set to 0.
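To make this concrete, here's a minimal sketch in plain Python (no external libraries, with the sentence lower-cased for simplicity) of the index encoding above, together with the one-hot expansion of each index:

```python
sentence = "le chat est sur le tapis".split()

# Build the vocabulary: the unique words, sorted, each assigned an index.
vocab = sorted(set(sentence))                      # ['chat', 'est', 'le', 'sur', 'tapis']
index = {word: i for i, word in enumerate(vocab)}

# Encode the sentence as the list of word indices.
encoded = [index[word] for word in sentence]
print(encoded)                                     # [2, 0, 1, 3, 2, 4]

# Expand an index into a one-hot vector over the whole vocabulary.
def one_hot(i, size):
    vector = [0] * size
    vector[i] = 1
    return vector

for word in sentence:
    print(word, one_hot(index[word], len(vocab)))
# le   [0, 0, 1, 0, 0]
# chat [1, 0, 0, 0, 0]
# ...
```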
One-hot vectors are a good starting point, but they are not very informative, because they reflect no feature except the identity of the word they represent. They say nothing, for example, of the syntax, semantics, or morphology of the words in the sentence. A better way of representing words is to encode into vectors not just an arbitrary vocabulary index, but also information about the context in which each word is usually found. One example of this strategy is the celebrated word2vec model², in which words are represented as dense vectors. This means that, instead of representing each word as a single index and the whole sentence as a list of indices, each word is itself a list of real numbers. Going back to the previous example, instead of a simple list of indices for Le chat est sur le tapis, we now have something that looks like
vector(le)=[0.2589,0.1757,0.6877,0.0817,...]
vector(chat)=[0.1257,0.1337,0.7373,0.8356,...]
vector(est)=[0.6768,0.5134,0.3220,0.0469,...]
vector(sur)=[0.9259,0.7515,0.4842,0.0701,...]
vector(tapis)=[0.9820,0.4070,0.0156,0.9608,...]
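The numbers above are made up for illustration, but obtaining real ones is easy. Here is a hedged sketch using the gensim library; the toy corpus and the parameter values are my own choices, not anything from the original word2vec paper:

```python
from gensim.models import Word2Vec

# A toy corpus; real word2vec models are trained on millions of sentences.
corpus = [
    ["le", "chat", "est", "sur", "le", "tapis"],
    ["le", "chien", "est", "sous", "la", "table"],
]

# vector_size sets the dimensionality of the dense vectors;
# sg=1 selects the skip-gram variant of word2vec.
model = Word2Vec(corpus, vector_size=16, window=3, min_count=1, sg=1)

print(model.wv["chat"])  # a dense 16-dimensional vector of real numbers
```

With a corpus this small, the resulting vectors are mostly noise; word2vec needs a lot of text before the geometry of the vectors becomes meaningful.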
Why is this representation a big deal? Because, unlike one-hot representations, these dense vectors allow us to represent meaningful relations between words. In fact, an important piece of evidence that language is being more fully captured is that we can now perform arithmetic with the vectors and still retain much of the semantics of the words. Take, for example, the words king, man, and woman; intuitively, man is to king what woman is to queen, a relation easily understood by us humans. One-hot vectors are completely incapable of encoding these kinds of semantics, but dense vectors turn out to be very capable of doing so. Indeed, we can now make the machine grasp these simple semantics: if we subtract the vector for man from the vector for king and add the vector for woman, we end up with, predictably, a vector whose nearest neighbour is the vector for queen!
vector(“king”) − vector(“man”) + vector(“woman”) ≈ vector(“queen”)
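You can reproduce this yourself: gensim can download the vectors trained on the Google News corpus (beware: the download is around 1.6 GB), and its most_similar method performs precisely this addition and subtraction before returning the nearest vector:

```python
import gensim.downloader as api

# Pretrained word2vec vectors trained on the Google News corpus.
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', ...)]
```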
This development is great, and rather mind-blowing when you first encounter it, but its power should not be overstated, nor should it be considered flawless. A very famous example of word2vec's fallibility came when the very same vector arithmetic also produced the following analogy.
vector(“doctor”) − vector(“man”) + vector(“woman”) ≈ vector(“nurse”)
Obviously, the role of doctor is not confined to human males, so replacing the vector for man with the vector for woman should not have turned a doctor into a nurse. This example is great for illustrating that, in order to perform real-world language processing tasks, you need not only a good language representation model, but also appropriate, high-quality data that faithfully represents the real world; when word2vec was first developed, it was trained on data in which occurrences of the word doctor skewed heavily towards male subjects, which in turn caused it to “learn” that male humans working in the health sciences are invariably doctors, while female humans working in the same field are always nurses.
As useful as word2vec may appear, it is not quite sufficient to encode many language features; for example, word2vec deals rather clumsily with homonyms, and it cannot produce sentence embeddings, that is, dense vectors that encode whole sentences. Even though the principle of word2vec was later successfully applied to encode sentences and even entire documents³ (a sketch of this appears below), the scope of the linguistic context taken into account to create these vectors was still limited. A more desirable approach would be to represent language while taking into account the surrounding words, sentences, and sometimes even the surrounding documents, in order to produce contextual vectors. Is it possible to have a language representation like that? It turns out that the answer is yes: modern language representation techniques are precisely about producing and refining these kinds of vectors, which, in turn, have proven wildly successful in most NLP tasks. In part 2, we will discuss these techniques, and the mathematical model that took the field by storm, the Transformer.
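As promised, here is a sketch of document embeddings in the spirit of footnote 3, using gensim's Doc2Vec implementation; once again, the toy corpus and parameters are my own:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag so that it receives its own dense vector.
docs = [
    TaggedDocument(words=["le", "chat", "est", "sur", "le", "tapis"], tags=["d0"]),
    TaggedDocument(words=["le", "chien", "est", "sous", "la", "table"], tags=["d1"]),
]

model = Doc2Vec(docs, vector_size=16, min_count=1, epochs=50)

# Infer a dense vector for a new, unseen sentence.
print(model.infer_vector(["le", "chat", "dort"]))
```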
Sources
- Verne, Jules. Vingt mille lieues sous les mers (1870). A book that I cannot recommend hard enough, and whose abysmal public-domain English translation should be avoided at all costs. Seriously. Get a revised, non-public-domain version. I am not joking. ↩
- Mikolov, Tomas et al. “Efficient Estimation of Word Representations in Vector Space.” ICLR (2013). ↩
- Le, Quoc, and Tomas Mikolov. “Distributed Representations of Sentences and Documents.” ICML (2014). ↩