Natural language processing (NLP) has undergone revolutionary changes in recent years. Thanks to the development and use of pre-trained language models, remarkable achievements have been made in many applications. Pre-trained language models offer two major advantages. One advantage is that they can significantly boost the accuracy of many NLP tasks. For example, one can exploit the BERT model to achieve performance higher than that of humans on language-understanding benchmarks.8 One can also leverage the GPT-3 model to generate text that resembles human writing.3 A second advantage of pre-trained language models is that they are universal language processing tools. To conduct a machine learning-based task in traditional NLP, one had to label a large amount of data to train a model. In contrast, one currently needs only to label a small amount of data to fine-tune a pre-trained language model because it has already acquired a significant amount of the knowledge necessary for language processing.
This article offers a brief introduction to language modeling, particularly pre-trained language modeling, from the perspectives of historical development and future trends for general readers in computer science. It is not a comprehensive survey but an overview, highlighting the basic concepts, intuitive explanations, technical achievements, and fundamental challenges. While positioned as an introduction, this article also helps knowledgeable readers to deepen their understanding and initiate brainstorming. References on pre-trained language models for beginners are also provided.
NLP is a subfield of computer science (CS), artificial intelligence (AI), and linguistics, with applications including machine translation, reading comprehension, dialogue systems, document summarization, and text generation. In recent years, deep learning has become the fundamental technology of NLP.
In our view, there are two main approaches to modeling human languages using mathematical means: one is based on probability theory and the other on formal language theory. These two approaches can also be combined. Language models fall into the first category from the viewpoint of a fundamental framework.
Formally, a language model is a probability distribution defined on a word sequence (a sentence or a paragraph). Language models amount to important machinery for modeling natural language texts based on probability theory, statistics, information theory, and machine learning. Neural language models empowered by deep learning, especially recently developed pre-trained language models, have become the fundamental technologies of NLP.
In this article, I first introduce the basic concepts of language modeling studied by Markov and Shannon (based on probability theory). Next, I discuss the linguistic models proposed by Chomsky (based on formal language theory), followed by a description of the definitions of neural language models as extensions of traditional language models. Then, I explain the basic ideas of pre-trained language models and follow with a discussion of the advantages and limitations of the neural language modeling approach and a prediction of future trends.
Markov and Language Models
Andrey Markov was perhaps the first scientist who studied language models,10 although the term "language model" did not exist at the time.
Suppose that w1, w2, ···, wN is a sequence of words. Then, the probability of the word sequence can be calculated as follows:
$$p(w_1, w_2, \cdots, w_N) = \prod_{i=1}^{N} p(w_i \mid w_1, w_2, \cdots, w_{i-1}) \tag{1}$$
Let p(w1|w0) = p(w1). Different types of language models use different methods to calculate the conditional probabilities p(wi|w1, w2, ···, wi-1). The process of learning and using a language model is referred to as language modeling. An n-gram model is a basic model that assumes that the word at each position depends only on the words at the n – 1 previous positions. That is, the model is a Markov chain of order n – 1.
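As a rough illustration of the n-gram idea, the sketch below estimates a bigram model (n = 2) by relative-frequency counting and applies the chain rule to score a short word sequence. The toy corpus and the function names `train_bigram` and `sequence_probability` are assumptions made for this example; no smoothing is applied, so unseen word pairs receive probability zero.

```python
from collections import Counter


def train_bigram(tokens):
    """Estimate p(w_i | w_{i-1}) by relative frequency (no smoothing)."""
    # Every token except the last one serves as a context for the next word.
    context_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}


def sequence_probability(tokens, bigram, first_word_prob):
    """Chain rule under the bigram assumption: p(w_1) * prod_i p(w_i | w_{i-1})."""
    prob = first_word_prob.get(tokens[0], 0.0)
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        prob *= bigram.get((prev, cur), 0.0)  # unseen pairs get probability 0
    return prob


# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat the cat ran".split()
bigram = train_bigram(corpus)
unigram = {w: c / len(corpus) for w, c in Counter(corpus).items()}
p = sequence_probability("the cat sat".split(), bigram, unigram)
```

On this toy corpus, the probability of "the cat sat" works out to 1/3 × 2/3 × 1/2 = 1/9.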
Markov studied the Markov chain in 1906. The model that he first considered was quite simple, with only two states and transition probabilities between those states. Markov proved that if one jumps between the two states according to the transition probabilities, then the frequencies of visiting the two states will converge to the expected values, which is the ergodic theorem of the Markov chain. In the following years, he expanded the model and proved that the above conclusion still holds in more general settings.
To provide a concrete example, Markov applied his proposed model to Alexander Pushkin's novel in verse, Eugene Onegin, in 1913. Removing spaces and punctuation marks and classifying the novel's first 20,000 Russian letters into vowels and consonants, he obtained a sequence of vowels and consonants in the novel. Using paper and pen, Markov then counted the transition probabilities between the vowels and consonants. Then, the data was used to verify the characteristics of the simplest Markov chain.
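The counting procedure described above can be sketched in a few lines. This is a minimal sketch, not a reconstruction of Markov's actual tabulation: it assumes English text with the vowels a, e, i, o, u standing in for the Russian ones, and the function name `vowel_consonant_transitions` is hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels stand in for the Russian ones


def vowel_consonant_transitions(text):
    """Estimate transition probabilities between vowel (V) and consonant (C) states."""
    # Drop spaces and punctuation, then map each remaining letter to a state.
    states = ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]
    pair_counts = Counter(zip(states[:-1], states[1:]))
    from_counts = Counter(states[:-1])
    return {
        (a, b): pair_counts[(a, b)] / from_counts[a]
        for a in "VC"
        for b in "VC"
        if from_counts[a]
    }


transitions = vowel_consonant_transitions("language models have a long history")
```

Each row of the resulting table (all transitions out of V, or all transitions out of C) sums to one, which is what makes it a two-state Markov chain.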
It is very interesting that the initial application area of the Markov chain was language. The example Markov studied is the simplest language model.
Shannon and Language Models
In 1948, Claude Shannon published his groundbreaking paper, "A Mathematical Theory of Communication," which pioneered the field of information theory. In the paper, Shannon introduced the notions of entropy and cross-entropy and studied the properties of the n-gram model.30 (Shannon borrowed the term "entropy" from statistical mechanics based on advice from John von Neumann.)
Entropy represents the uncertainty of one probability distribution, while cross-entropy represents the uncertainty of one probability distribution with respect to another probability distribution. Entropy is a lower bound of cross-entropy.
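Suppose that language (word sequence) is data generated by a stochastic process. The entropy of the probability distribution of n-grams is defined as follows:
$$H_n(p) = -\sum_{w_1, w_2, \cdots, w_n} p(w_1, w_2, \cdots, w_n) \log p(w_1, w_2, \cdots, w_n)$$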
where p(w1, w2, ···, wn) represents the probability of n-gram w1, w2, ···, wn. The cross-entropy of the probability distribution of n-grams with respect to the "true" probability distribution of data is defined as follows:
$$H_n(p, q) = -\sum_{w_1, w_2, \cdots, w_n} p(w_1, w_2, \cdots, w_n) \log q(w_1, w_2, \cdots, w_n)$$
where q(w1, w2, ···, wn) represents the probability that the model assigns to n-gram w1, w2, ···, wn and p(w1, w2, ···, wn) represents the true probability of n-gram w1, w2, ···, wn.
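The following relation holds between the entropy H_n(p) and the cross-entropy H_n(p, q), with equality when q equals p:
$$H_n(p) \le H_n(p, q)$$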
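The relation between entropy and cross-entropy can be checked numerically on a small example. The sketch below is illustrative only: the two distributions are made up, and logarithms are taken to base 2 so that the results are in bits (the choice of base is an assumption, not something fixed by the text).

```python
import math


def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), skipping zero-probability outcomes."""
    return -sum(px * math.log2(px) for px in p if px > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), where p is the true distribution."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)


p_true = [0.5, 0.25, 0.25]      # hypothetical "true" distribution
q_model = [1 / 3, 1 / 3, 1 / 3]  # hypothetical model distribution

h_p = entropy(p_true)                 # 1.5 bits
h_pq = cross_entropy(p_true, q_model)  # log2(3) bits, larger than h_p
```

Replacing `q_model` with `p_true` makes the two quantities coincide, which is the equality case.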
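The Shannon-McMillan-Breiman theorem states that when the stochastic process of language satisfies the conditions of stationarity and ergodicity, the following relations hold:
$$H(p) = \lim_{n \to \infty} \frac{1}{n} H_n(p) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1, w_2, \cdots, w_n)$$
$$H(p, q) = \lim_{n \to \infty} \frac{1}{n} H_n(p, q) = \lim_{n \to \infty} -\frac{1}{n} \log q(w_1, w_2, \cdots, w_n)$$
Here H(p) and H(p, q) are the per-word entropy and cross-entropy of the language, H_n denotes the corresponding n-gram quantities, and w1, w2, ···, wn is a word sequence generated by the process.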
In other words, when the word sequence length goes to infinity, the entropy of the language can be defined. The entropy takes a constant value and can be estimated from the data of the language.
If one language model can more accurately predict a word sequence than the other, then it should have lower cross-entropy. Thus, Shannon's work provides an evaluation tool for language modeling.
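To make the evaluation idea concrete, the sketch below estimates per-word cross-entropy from a sample of text as the average negative log-probability that a model assigns to each observed word, and compares two models. Both models and the eight-symbol "text" are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def cross_entropy_per_word(tokens, cond_prob):
    """Average of -log2 p_model(w_i | w_1..w_{i-1}) over the observed words."""
    total = 0.0
    for i, word in enumerate(tokens):
        total -= math.log2(cond_prob(word, tokens[:i]))
    return total / len(tokens)


VOCAB = ["a", "b", "c", "d"]


def uniform_model(word, history):
    # Hypothetical model 1: every word is equally likely, history is ignored.
    return 1.0 / len(VOCAB)


def skewed_model(word, history):
    # Hypothetical model 2: prefers "a" (0.7 + 3 * 0.1 = 1.0), history is ignored.
    return 0.7 if word == "a" else 0.1


text = list("aabacaad")  # sample "language data" in which "a" is frequent
h_uniform = cross_entropy_per_word(text, uniform_model)
h_skewed = cross_entropy_per_word(text, skewed_model)
```

Because the sample is dominated by "a", the skewed model predicts it better and obtains the lower cross-entropy of the two.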
Note that language models can model not only natural languages but also formal and semi-formal languages (see, for example, Peng and Roth21).
Chomsky and Language Models
In parallel, Noam Chomsky proposed the Chomsky hierarchy of grammars in 1956 for representing the syntax of a language. He pointed out that finite-state grammars (and thus n-gram models) have limitations in describing natural languages.4
Chomsky's theory asserts that a language consists of a finite or infinite set of sentences, each sentence is a sequence of words of finite length, words come from a finite vocabulary, and a grammar is a set of production rules that can generate all sentences in the language. Different grammars can produce languages in different complexities, and they form a hierarchical structure.
A grammar that can generate sentences acceptable by a finite-state machine is a finite-state grammar or regular grammar, while a grammar that can produce sentences acceptable by a non-deterministic pushdown automaton is a context-free grammar. Finite-state grammars are properly included in context-free grammars.
The "grammar" underlying a finite Markov chain (or an n-gram model) is a finite-state grammar. A finite-state grammar does have limitations in generating sentences in English. For example, there are grammatical relations between English expressions, such as the following relations in (i) and (ii).
- (i) If S1, then S2.
- (ii) Either S3, or S4.
- (iii) Either if S5, then S6, or if S7, then S8.
In principle, the relations can be combined indefinitely to produce correct English expressions (such as in example iii). However, a finite-state grammar cannot describe all the combinations, and, in theory, there are English sentences that cannot be covered. Therefore, Chomsky contended that there are great limitations in describing languages with finite-state grammars, including n-gram models. Instead, he pointed out that context-free grammars can model languages more effectively. Influenced by him, in the following decades, context-free grammars were more commonly used in NLP. (Chomsky's theory is not very influential in NLP now, but it still has important scientific value.)
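The recursive combination of relations (i) and (ii) can be sketched by treating each relation as a production rule that takes sentences and returns a sentence. This is a toy illustration under that assumption, not a full grammar; the function names are hypothetical.

```python
def if_then(s1, s2):
    """Production (i): S -> if S, then S."""
    return f"if {s1}, then {s2}"


def either_or(s3, s4):
    """Production (ii): S -> either S, or S."""
    return f"either {s3}, or {s4}"


def nest_if_then(depth, clause="S"):
    """Apply production (i) to its own output `depth` times."""
    sentence = clause
    for _ in range(depth):
        sentence = if_then(sentence, clause)
    return sentence


# Example (iii) is production (ii) applied to two instances of production (i).
sentence_iii = either_or(if_then("S5", "S6"), if_then("S7", "S8"))

# Every "if" must eventually be closed by its own "then", at any nesting depth.
deep = nest_if_then(3)
```

The nesting depth is unbounded, and each opening word has to be matched with its closing word. Tracking such matches at arbitrary depth is what a finite number of states, and hence a fixed n-gram window, cannot do in general.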
Neural Language Models
In 2001, Yoshua Bengio and his co-authors proposed one of the first neural language models,1 which opened a new era of language modeling. (Bengio, Geoffrey Hinton, and Yann LeCun received the 2018 ACM A.M. Turing Award for their conceptual and engineering breakthroughs that have made deep neural networks a critical part of computing, as is well known.)
The n-gram model is limited in its learning ability. The traditional approach is to estimate from the corpus the conditional probabilities p(wi|wi-n+1, wi-n+2, ···, wi-1) in the model with a smoothing method. However, the number of parameters in the model is of exponential order O(V^n), where V denotes vocabulary size. When n increases, the parameters of the model cannot be accurately learned, due to the sparsity of training data.
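A back-of-the-envelope calculation shows how quickly the sparsity problem sets in. The vocabulary size and corpus size below are assumptions picked for illustration; the point is only that the number of possible n-grams grows exponentially in n while the number of n-grams a corpus can contain grows at most linearly in its length.

```python
def ngram_table_size(vocab_size, n):
    """Number of possible n-grams over the vocabulary: V ** n."""
    return vocab_size ** n


def max_distinct_ngrams_in_corpus(num_tokens, n):
    """A corpus of num_tokens words contains at most num_tokens - n + 1 n-grams."""
    return max(num_tokens - n + 1, 0)


V = 10_000                     # assumed vocabulary size
CORPUS_TOKENS = 1_000_000_000  # assumed corpus size (one billion words)

# Upper bound on the fraction of possible n-grams that the corpus could ever show.
coverage_upper_bound = {
    n: min(1.0, max_distinct_ngrams_in_corpus(CORPUS_TOKENS, n) / ngram_table_size(V, n))
    for n in (1, 2, 3, 4)
}
```

Under these assumptions, a billion-word corpus can cover at most about one in a thousand of the possible trigrams, and far fewer of the possible 4-grams.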
The neural language model proposed by Bengio et al. improves the n-gram model in two ways. First, a real-valued vector, called word embedding, is used to represent a word or a combination of words. (The embedding of a word has much lower dimensionality than the "one-hot vector" of a word, in which the element corresponding to the word is one and the other elements are zero.)
Word embedding, as a type of "distributed representation," can represent a word with better efficiency, generalization ability, robustness, and extensibility than a one-hot vector. Second, the language model is represented by a neural network, which greatly reduces the number of parameters in the model. The conditional probability is determined by a neural network:
$$p(w_i \mid w_{i-n+1}, w_{i-n+2}, \cdots, w_{i-1}) = f_\vartheta(\mathbf{w}_{i-n+1}, \mathbf{w}_{i-n+2}, \cdots, \mathbf{w}_{i-1})$$
where $\mathbf{w}_{i-n+1}, \mathbf{w}_{i-n+2}, \cdots, \mathbf{w}_{i-1}$ denote the embeddings of words wi-n+1, wi-n+2, ···, wi-1; f(·) denotes the neural network; and ϑ denotes the network parameters. The number of parameters in the model is only of order O(V). Figure 1 shows the relationship between representations in the model. Each position has an intermediate representation that depends on the word embeddings (words) at the previous n – 1 positions, and this holds for all positions. The intermediate representation at the current position is then used to generate a word for the position.
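The flow from word embeddings to an intermediate representation to a distribution over the next word can be sketched as a single forward pass. This is a deliberately simplified stand-in, not the architecture of Bengio et al.: it omits bias terms and training, uses random untrained weights, and all sizes and names are assumptions made for the example.

```python
import math
import random

random.seed(0)
VOCAB = ["<s>", "the", "cat", "sat", "mat"]
V, D, H, N = len(VOCAB), 4, 8, 3  # vocab size, embedding dim, hidden dim, n-gram order


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


E = rand_matrix(V, D)              # one embedding row per word
W_h = rand_matrix(H, (N - 1) * D)  # concatenated context -> intermediate representation
W_o = rand_matrix(V, H)            # intermediate representation -> one score per word


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def next_word_distribution(context_words):
    """p(w_i | w_{i-n+1}, ..., w_{i-1}) for a context of exactly N - 1 words."""
    # Look up and concatenate the embeddings of the N - 1 context words.
    x = [value for w in context_words for value in E[VOCAB.index(w)]]
    hidden = [math.tanh(a) for a in matvec(W_h, x)]
    scores = matvec(W_o, hidden)
    # Softmax turns the scores into a probability distribution over the vocabulary.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


dist = next_word_distribution(["the", "cat"])
num_parameters = V * D + H * (N - 1) * D + V * H
```

In this sketch the parameter count is V·D + H·(n – 1)·D + V·H, which grows linearly with the vocabulary size V rather than exponentially with n.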
After the work of Bengio et al., a large number of word-embedding methods and neural language-modeling methods have been developed, bringing improvements from different perspectives.
Representative methods for word embedding include Word2Vec.18,19 Representative neural language models are recurrent neural network (RNN) language models, including the long short-term memory (LSTM) language models.9,11 In an RNN language model, the conditional probability at each position is determined by an RNN:
$$p(w_i \mid w_1, w_2, \cdots, w_{i-1}) = f_\vartheta(\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_{i-1})$$
where $\mathbf{w}_1, \mathbf{w}_2, \cdots, \mathbf{w}_{i-1}$ denote the embeddings of words w1, w2, ···, wi-1; f(·) denotes the RNN; and ϑ denotes the network parameters. The RNN language model no longer has the Markovian assumption, and the word at each position depends on the words at all previous positions. An important concept in an RNN is its intermediate representations or states. The dependencies between words are characterized by the dependencies between states in the RNN model. The model's parameters are shared in different positions, but the obtained representations are different at different positions. (For ease of understanding, we do not give the formal definitions or present the architectures of neural networks in this article.)
Figure 2 shows the relationship between representations in an RNN language model. There is an intermediate representation of each layer at each position that represents the "state" of the word sequence so far. The intermediate representation of the current layer at the current position is determined by the intermediate representation of the same layer at the previous position and the intermediate representation of the layer below at the current position. The final intermediate representation at the current position is used to calculate the probability of the next word.
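The state recurrence can be sketched for a single layer as follows. This is a minimal sketch with a plain tanh cell (not an LSTM), random untrained weights, and no output layer; the dimensions and names are assumptions made for the example.

```python
import math
import random

random.seed(0)
D, H = 3, 4  # embedding dimension and state dimension (assumed sizes)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W_x = rand_matrix(H, D)  # input embedding -> state; shared across all positions
W_h = rand_matrix(H, H)  # previous state -> state; shared across all positions


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rnn_states(embeddings):
    """Return the state after each position; state_i = tanh(W_x x_i + W_h state_{i-1})."""
    state = [0.0] * H
    states = []
    for x in embeddings:
        pre_activation = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, state))]
        state = [math.tanh(v) for v in pre_activation]
        states.append(state)
    return states


sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy word embeddings
states = rnn_states(sequence)
```

The same two weight matrices are reused at every position, while the state differs from position to position and summarizes the whole prefix read so far.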
Language models can be used to calculate the probability of language (word sequence) or to generate language. In the latter case, natural language sentences or articles are generated, for example, by random sampling from language models. It is known that LSTM language models that learn from a large amount of data can generate quite natural sentences.
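Generation by random sampling can be sketched with a hand-written table of conditional probabilities. The table below is hypothetical and conditions only on the previous word to keep the example short; an LSTM language model would condition on the whole prefix, but the sampling loop would look the same.

```python
import random

# Hypothetical conditional distributions p(next | previous); "</s>" ends a sentence.
MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.6, "runs": 0.4},
    "dog": {"sleeps": 0.3, "runs": 0.7},
    "sleeps": {"</s>": 1.0},
    "runs": {"</s>": 1.0},
}


def sample_sentence(model, rng, max_len=20):
    """Draw one word at a time from p(next | previous) until the end marker."""
    word, sentence = "<s>", []
    for _ in range(max_len):
        choices = model[word]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        if word == "</s>":
            break
        sentence.append(word)
    return sentence


sentence = sample_sentence(MODEL, random.Random(0))
```

Passing a seeded `random.Random` instance makes the sampled sentence reproducible; different seeds yield different sentences with the frequencies given by the table.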
An extension of a language model is a conditional language model, which calculates the conditional probability of a word sequence under a given condition. If the condition is another word sequence, then the problem becomes transformation from one word sequence to another—that is, the so-called sequence-to-sequence problem. Machine translation,5,33 text summarization,20 and generative dialogue31 are such tasks. If the given condition is a picture, then the problem becomes transformation from a picture to a word sequence. Image captioning35 is such a task.
Conditional language models can be employed in a large variety of applications. In machine translation, the system transforms sentences in one language into sentences in another language, with the same semantics. In dialogue generation, the system generates a response to the user's utterance, and the two messages form one round of dialogue. In text summarization, the system transforms a long text into a short text, making the latter represent the gist of the former. The semantics represented by the conditional probability distributions of the models vary from application to application and are learned from the data in the applications.
The study of sequence-to-sequence models has contributed to the development of new technologies. A representative sequence-to-sequence model is the transformer developed by Vaswani et al.34 The transformer is entirely based on the attention mechanism5 and exploits attention to conduct encoding, decoding, and information exchange between encoder and decoder. At present, almost all machine translation systems employ the transformer model, and machine translation has reached a level that almost meets practical needs. The architecture of the transformer is now adopted in almost all pre-trained language models because of its superior power in language representation.
Pre-Trained Language Models
The basic idea of a pre-trained language model is as follows. First, one implements the language model based on, for example, the transformer's encoder or decoder. The model learns in two phases: pre-training, where one trains the parameters of the model using a very large corpus via unsupervised learning (also called self-supervised learning), and fine-tuning, where one applies the pre-trained model to a specific task and further adjusts the model's parameters using a small amount of labeled data via supervised learning.3,7,8,14,16,24,25,26,36 The links in Table 1 offer resources for learning and using pre-trained language models.
There are three types of pre-trained language models: unidirectional, bidirectional, and sequence-to-sequence. Due to space limitations, this article covers only the first two types. All the major pre-trained language models adopt the transformer's architecture. Table 2 offers a summary of existing pre-trained language models.
A transformer has strong language representation ability; a very large corpus contains rich language expressions (such unlabeled data can be easily obtained) and training large-scale deep learning models has become more efficient. Therefore, pre-trained language models can effectively represent a language's lexical, syntactic, and semantic features. Pre-trained language models, such as BERT and GPTs (GPT-1, GPT-2, and GPT-3), have become the core technologies of current NLP.
Pre-trained language model applications have brought great success to NLP. "Fine-tuned" BERT has outperformed humans in terms of accuracy in language-understanding tasks, such as reading comprehension.8,17 "Fine-tuned" GPT-3 has also reached an astonishing level of fluency in text-generation tasks.3 (Note that the results solely indicate machines' higher performance in those tasks; one should not simply interpret that BERT and GPT-3 can understand languages better than humans, because this also depends on how benchmarking is conducted.6 Having the proper understanding and expectation of the capabilities of AI technologies is critical to the healthy growth and development of the area, as is learned from history.)
GPTs developed by Radford et al.25,26 and Brown et al.3 have the following architecture. The input is a sequence of words w1, w2, ···, wN. First, through the input layer, a sequence of input representations is created, denoted as a matrix $H^{(0)}$. After passing L transformer decoder layers, a sequence of intermediate representations is created, denoted as a matrix $H^{(L)}$:
$$H^{(L)} = \mathrm{transformer}_{\mathrm{dec}}(H^{(0)})$$
Finally, a probability distribution of words is calculated at each position based on the final intermediate representation at the position. The pre-training of GPTs is the same as conventional language modeling. The objective is to predict the likelihood of a word sequence. For a given word sequence w = w1, w2, ···, wN, we calculate and minimize the cross-entropy or the negative log-likelihood to estimate the parameters:
$$-\log p(w) = -\sum_{i=1}^{N} \log p_\vartheta(w_i \mid w_1, w_2, \cdots, w_{i-1}) \tag{2}$$
where ϑ denotes the parameters of the GPTs model.
Figure 3 shows the relationship between the representations in the GPTs model. The input representation at each position is composed of the word embedding and the "position embedding." The intermediate representation of each layer at each position is created from the intermediate representations of the layer below at the previous positions. The prediction or generation of a word is performed at each position repeatedly from left to right; cf. (1) and (2). In other words, GPTs are unidirectional language models in which the word sequence is modeled from one direction. (Note that an RNN language model is also a unidirectional language model.) Therefore, GPTs are better suited to solving language-generation problems that automatically produce sentences.
BERT, developed by Devlin et al.,8 has the following architecture. The input is a sequence of words, which can be consecutive sentences from a single document or a concatenation of consecutive sentences from two documents. This makes the model applicable to tasks with one text as input (such as text classification), as well as to tasks with two texts as input (such as question answering). First, through the input layer, a sequence of input representations is created, denoted as a matrix $H^{(0)}$. After passing L transformer encoder layers, a sequence of intermediate representations is created, denoted as $H^{(L)}$:
$$H^{(L)} = \mathrm{transformer}_{\mathrm{enc}}(H^{(0)})$$
Finally, a probability distribution of words can be calculated at each position based on the final intermediate representation at the position. Pre-training of BERT is performed as so-called masked language modeling. Suppose that the word sequence is w = w1, w2, ···, wN. Several words in the sequence are randomly masked (that is, changed to a special symbol [mask]), yielding a new sequence of words $\tilde{w} = \tilde{w}_1, \tilde{w}_2, \cdots, \tilde{w}_N$, where the set of masked words is denoted as $\bar{w}$. The objective of learning is to recover the masked words by calculating and minimizing the following negative log-likelihood to estimate the parameters:
$$-\log p(\bar{w} \mid \tilde{w}) = -\sum_{i=1}^{N} \delta_i \log p_\vartheta(w_i \mid \tilde{w}) \tag{3}$$
where ϑ denotes the parameters of the BERT model and δi takes a value of 1 or 0, indicating whether the word at position i is masked or not masked. Note that masked language modeling is already a technique that differs from traditional language modeling.
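The masking step and the role of the indicators δi can be sketched as follows. This is a simplified illustration: each word is masked independently with a fixed probability, the masking rate of 0.5 is an arbitrary choice for a six-word example, and `predicted_prob` is a hypothetical callable standing in for the probability the model assigns to the original word at a position.

```python
import math
import random


def mask_sequence(tokens, mask_prob, rng):
    """Return the corrupted sequence and the indicators delta_i (1 = masked)."""
    delta = [1 if rng.random() < mask_prob else 0 for _ in tokens]
    corrupted = ["[mask]" if d else t for t, d in zip(tokens, delta)]
    return corrupted, delta


def masked_nll(tokens, delta, predicted_prob):
    """-sum_i delta_i * log p(w_i | corrupted sequence); unmasked positions add 0."""
    return -sum(
        d * math.log(predicted_prob(i, word))
        for i, (word, d) in enumerate(zip(tokens, delta))
    )


tokens = "the cat sat on the mat".split()
corrupted, delta = mask_sequence(tokens, 0.5, random.Random(0))


def uniform_guess(position, word):
    # Hypothetical stand-in for the model: always assigns probability 0.25.
    return 0.25


loss = masked_nll(tokens, delta, uniform_guess)
```

Only the masked positions contribute to the loss, so with the constant stand-in model the loss equals the number of masked words times log 4.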
Figure 4 shows the relationship between the representations in the BERT model. The input representation at each position is composed of word embeddings, "position embeddings," etc. The intermediate representation of each layer at each position is created from the intermediate representations of the layer below at all positions. The prediction or generation of a word is independently performed at each masked position—Cf. (3). That is to say, BERT is a bidirectional language model in which the word sequence is modeled from two directions. Therefore, BERT can be naturally employed in language understanding problems whose input is a whole word sequence and whose output is usually a label or a label sequence.
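The difference between unidirectional and bidirectional modeling comes down to which positions of the layer below each position is allowed to read. The sketch below lists those positions for both cases; it assumes, as in common decoder implementations, that a position in the unidirectional case may read itself as well as the positions before it, and the function name is hypothetical.

```python
def visible_positions(seq_len, bidirectional):
    """For each position i, list the positions in the layer below that it may read."""
    if bidirectional:
        # Encoder-style layer (BERT): every position reads all positions.
        return [list(range(seq_len)) for _ in range(seq_len)]
    # Decoder-style layer (GPTs): position i reads positions 0..i only.
    return [list(range(i + 1)) for i in range(seq_len)]


gpt_like = visible_positions(4, bidirectional=False)
bert_like = visible_positions(4, bidirectional=True)
```

In the unidirectional case the last position is the only one that sees the whole sequence, which suits left-to-right generation; in the bidirectional case every position sees the whole sequence, which suits labeling a complete input.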
An intuitive explanation of pre-training language models is that the machine has performed a lot of word solitaire (GPTs) or word cloze (BERT) exercises based on a large corpus in pre-training, capturing various patterns of composing sentences from words, then composing articles from sentences, and expressing and memorizing the patterns in the model. A text is not randomly created with words and sentences, but constructed based on lexical, syntactic, and semantic rules. GPTs and BERT can use a transformer's decoder and encoder, respectively, to realize the compositionality of language. (Compositionality is the most fundamental feature of language, which is also modeled by grammars in the Chomsky hierarchy.) In other words, GPTs and BERT have acquired a considerable amount of lexical, syntactic, and semantic knowledge in pre-training. Consequently, when adapted to a specific task in fine-tuning, the models can be refined with only a small amount of labeled data to achieve high performance. It is found, for example, that different layers of BERT have different characteristics. The bottom layers mainly represent lexical knowledge, the middle layers mainly represent syntactic knowledge, and the top layers mainly represent semantic knowledge.13,16,29
Pre-trained language models (without fine-tuning), such as BERT and GPT-3, contain a large amount of factual knowledge. For example, they can be used to answer questions such as "Where was Dante born?" and conduct simple reasoning such as "What is 48 plus 76?" as long as they have acquired the knowledge from the training data.3,22 However, the language models themselves do not have a reasoning mechanism. Their "reasoning" ability is based on association instead of genuine logical reasoning. As a result, they fail to show high performance on problems that need complex reasoning, including argument reasoning,38 numerical and temporal reasoning,37 and discourse reasoning.32 Integrating reasoning ability and language ability into an NLP system will be an important topic in the future.
Contemporary sciences (brain science and cognitive science) have a limited understanding of the mechanism of human language processing (language understanding and language generation). It is difficult to see a major breakthrough happening in the foreseeable future, and the possibility of never breaking through exists. On the other hand, we hope to continuously promote the development of AI technologies and develop language-processing machines that are useful for human beings.
It seems that neural language modeling is by far the most successful approach. The essential characteristic of language modeling has not changed: it relies on the probability distribution defined in a discrete space containing all word sequences. The learning process is to find the optimal model so that the accuracy of predicting language data is the highest, that is, the cross-entropy is the lowest (see Figure 5). Neural language modeling constructs models through neural networks. The advantage is that it can very accurately simulate human language behaviors by leveraging complex models, big data, and powerful computing. From the original model proposed by Bengio et al. to RNN language models and pre-trained language models such as GPTs and BERT, the architectures of neural networks have become increasingly complex (cf. Figures 1, 2, 3, and 4), while the ability to predict languages has become higher and higher (cross-entropy gets smaller and smaller). However, this does not necessarily mean that the models have the same language ability as humans, and the limitations of the approach are also self-evident.
Are there other possible development paths? It is not yet clear. It can be predicted that there are still many opportunities for improvement with the approach of neural language modeling. There is still a big gap between the current neural language models and human brains in representation ability and computing efficiency (in terms of power consumption). An adult human brain operates on only 12 W;12 in striking contrast, training the GPT-3 model consumed several thousand petaflop/s-days of compute, according to the authors.3 Whether a better language model can be developed to be closer to human language processing is an important direction for future research. We can still learn from the limited discoveries in brain science.
Human language processing is believed to be carried out mainly at two brain regions in the cerebral cortex: Broca's area and Wernicke's area (Figure 6). The former is responsible for grammar, and the latter is responsible for vocabulary.23 There are two typical cases of aphasia due to brain injuries. Patients who suffer from injuries in Broca's area can only speak in sporadic words instead of sentences, while patients who suffer from injuries in Wernicke's area can construct grammatically correct sentences, but the words often lack meaning. A natural hypothesis is that human language processing is carried out in both brain regions in parallel. Whether it is necessary to adopt a more human-like processing mechanism is a topic worth studying. Language models do not explicitly use grammars and cannot infinitely compose languages, which is an important property of human language, as pointed out by Chomsky. The ability to incorporate grammars more directly into language models will be a problem that needs to be investigated.
Brain scientists believe that human language understanding is a process of activating representations of relevant concepts in the sub-conscious and generating relevant images in the conscious. The representations include visual, auditory, tactile, olfactory, and gustatory representations. They are the visual, auditory, tactile, olfactory, and gustatory contents of concepts remembered in various parts of the brain through one's experiences during growth and development. Therefore, language understanding is closely related to the experiences of people.2 Basic concepts in life, such as cat and dog, are learned from the input of sensors through seeing, hearing, touching, and so forth. Hearing or seeing the words "cat" and "dog" also reactivates the relevant visual, auditory, and tactile representations in people's brains. Can machines learn better models from a large amount of multimodal data (language, vision, speech) so that they can more intelligently process language, vision, and speech? Multimodal language models will be an important topic for future exploration. Most recently, there has been some progress in research on the topic—for example, Ramesh et al.28 or Radford et al.27
Language models have a history that dates back more than 100 years. Markov, Shannon, and others could not have foreseen that the models and theories they studied would have such a great impact later; it might even be unexpected for Bengio. How will the language models develop over the next 100 years? Will they still be an essential part of AI technologies? This is beyond our imagination and prediction. What we can see is that language modeling technologies are continuously evolving. It is highly likely that more powerful models will replace BERT and GPTs in the years to come. For us, we are lucky enough to be the first generation to see the great achievements of the technologies and to participate in the research and development.
The author thanks the two anonymous reviewers for their critical and constructive reviews. He also thanks Yan Zeng and Quan Wang for their useful references and crucial suggestions, and Jun Xu, Xinsong Zhang, and Xiangyu Li for their helpful comments.
8. Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language.
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013), 3111–3119.
31. Shang, L., Lu, Z., and Li, H. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Assoc. for Computational Linguistics and the 7th International Joint Conf. on Natural Language Processing (2015), 1577–1586.
36. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems (2019), 5754–5764.
©2022 ACM 0001-0782/22/7