When it was released by Google just a few years ago, a deep-learning model called BERT demonstrated a major step forward in natural language processing (NLP). BERT's core structure, based on a type of neural network known as a Transformer, has become the underpinning for a range of NLP applications, from completing search queries and user-written sentences to language translation.
The models even score well on benchmarks intended to test understanding at a high school level, such as Large-scale ReAding Comprehension (RACE) developed at Carnegie Mellon University. In doing so, they have become marketing tools in the artificial intelligence (AI) gold rush. At Nvidia's annual technology conference, president and CEO Jen-Hsun Huang used RACE to claim high performance for his company's implementation of BERT.
"The average human scored 73%. Expert humans score 95%. Nvidia's Megatron-BERT scored 91%," Huang said, adding, "Facebook AI Research developed a Transformer-based chatbot with knowledge, personality, and empathy that half of the users tested actually preferred [over humans]."
Performance stepped up another notch with the release in summer 2020 of GPT-3, the latest iteration of a series of language models developed by the company OpenAI. Sporting 175 billion trainable parameters, GPT-3 is roughly 500 times larger than BERT's biggest version.
Size has given GPT-3 seemingly impressive abilities. Whereas most other Transformer-based systems need a training sequence that "fine-tunes" the last few layers of the deep neural-network (DNN) pipeline to fit a specific application, such as language translation, OpenAI promises GPT-3 can dispense with the need for extensive fine-tuning because of the sheer size of its core training set.
Tests have demonstrated the ability of GPT-3 to construct lengthy essays in response to brief prompts. Yet the huge system has flaws that are easy to show. Questions put to GPT-3 can often yield answers of almost nightmarish surrealism, claiming in one case that blades of grass have eyes, or in another that a horse has four eyes. OpenAI's own research team questioned the limits of huge models trained purely for language modeling in a paper published shortly after the release of GPT-3.
The key to the performance of these language models seems to come down to their ability to capture and organize sometimes contradictory information mined from enormous collections of text that include sources such as Wikipedia and the social-media site Reddit. Early approaches used word embedding, in which each discrete word is converted to a numeric vector using a clustering algorithm. Words that most commonly surround it in the corpus used for training determine the vector's values. But these approaches hit problems because they could not disambiguate words with multiple meanings.
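A minimal sketch of that idea follows, assuming a simple count-based stand-in rather than the embedding algorithms the early systems actually used: each word's vector is just a tally of its neighbors. The toy corpus, the function name `cooccurrence_vectors`, and the window size are all invented for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


# Invented toy corpus: "bank" appears in a financial sense and a river sense.
corpus = [
    "she deposited cash at the bank",
    "the bank approved the loan",
    "they fished from the river bank",
    "the river bank was muddy",
]

vectors = cooccurrence_vectors(corpus)
bank = vectors["bank"]  # one vector, regardless of which sense was meant
print(bank.most_common())
```

Because "bank" receives exactly one vector, its river neighbors and its financial neighbors are tallied together, which is the ambiguity problem such static representations run into.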
The networks inside BERT take the flexible meanings of words into account. They use multiple layers of neural-network constructions called Transformers to assign vectors not to separate words, but to words and sub-words in different contexts that the model finds as it scans the training set.
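The sub-word part of this can be sketched on its own. The example below assumes the greedy longest-match-first scheme used by WordPiece-style tokenizers, with "##" marking a piece that continues a word; the vocabulary is a hypothetical toy, and the context-dependent vectors produced by the Transformer layers are not modeled here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into sub-words by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one holds tens of thousands of entries.
toy_vocab = {"un", "play", "##play", "##able", "##ing", "##s"}

tokens_playing = wordpiece_tokenize("playing", toy_vocab)
tokens_unplayable = wordpiece_tokenize("unplayable", toy_vocab)
tokens_unknown = wordpiece_tokenize("zebra", toy_vocab)
print(tokens_playing, tokens_unplayable, tokens_unknown)
```

Splitting words this way lets a fixed vocabulary cover rare or novel words by reusing pieces the model has already seen in many contexts.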
Though Transformers associate words and their stems with different contexts, what remains far from clear is what relationships between words and context they actually learn. This uncertainty has spawned what University of Massachusetts Lowell assistant professor Anna Rumshisky and colleagues termed "BERTology." BERT is a particular focus in research like this because its source code is available, whereas the much larger GPT-3 is only accessible through an API.
Closer inspection of their responses shows what these systems clearly lack is any understanding of how the world works, which is vital for many of the more advanced applications into which they are beginning to be pushed. In practice, they mostly make associations based on the proximity of words in the training material; as a result, Transformer-based models often get basic information wrong.
For example, Ph.D. student Bill Yuchen Lin and coworkers in Xiang Ren's group at the University of Southern California (USC) developed a set of tests to probe language models' ability to give sensible answers to questions about numbers. BERT claims a bird has twice the probability of having four legs rather than two. It also can give contradictory answers. Though BERT will put a high confidence on a car having four wheels, if the statement is qualified to "round wheels," the model claims it is more likely to sport just two.
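A probe of this kind might be sketched as follows, assuming it takes the form of a fill-in-the-blank sentence whose gap is filled with candidate number words. The scoring function is passed in as a parameter; the `toy_score` stand-in and its values are invented to mimic the inconsistency described above and are not measurements from BERT or any other model.

```python
NUMBER_WORDS = ("two", "four", "six", "eight")


def rank_fillers(template, score, candidates=NUMBER_WORDS):
    """Fill the [MASK] slot with each candidate and rank by the scorer's value."""
    scored = [(score(template.replace("[MASK]", word)), word) for word in candidates]
    return [word for _, word in sorted(scored, reverse=True)]


def is_consistent(template, qualified_template, score):
    """True if adding a qualifier leaves the top-ranked number word unchanged."""
    return rank_fillers(template, score)[0] == rank_fillers(qualified_template, score)[0]


# Hypothetical scores, invented for illustration only (NOT real model output).
TOY_SCORES = {
    "a car has four wheels": 0.6,
    "a car has two wheels": 0.2,
    "a car has four round wheels": 0.3,
    "a car has two round wheels": 0.5,
}


def toy_score(sentence):
    """Stand-in for a language model's confidence in a sentence."""
    return TOY_SCORES.get(sentence, 0.0)


plain_top = rank_fillers("a car has [MASK] wheels", toy_score)[0]
qualified_top = rank_fillers("a car has [MASK] round wheels", toy_score)[0]
consistent = is_consistent(
    "a car has [MASK] wheels", "a car has [MASK] round wheels", toy_score
)
print(plain_top, qualified_top, consistent)
```

In a real experiment, `toy_score` would be replaced by a call into the model under test; the consistency check itself stays the same.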
Toxicity and unwanted biases are further issues for language models, particularly when they are integrated into chatbots that might be used for emotional support: they readily regurgitate offensive statements and make associations that tend to reinforce prejudices. Work by Yejin Choi and colleagues at the Allen Institute for AI has indicated a major problem lies in subtle cues in the large text bases used for training that can include sources like Reddit. However, even training just on the more-heavily-policed Wikipedia shows issues.
"Sanitizing the content will be highly desirable, but it might not be entirely possible due to the subtleties of potentially toxic language," Choi says.
One way to improve the quality of results is to give language models a better understanding of how the world works by training them on "commonsense" concepts. This cannot be achieved by simply giving them bigger training sets. Choi points to the issue that training on conventional text suffers from reporting bias: even encyclopedic sources do not describe much of how the world around us works. Even worse, sources such as news, which supports much of the content of Reddit and Wikipedia, express exceptions more often than the norm. Much of the background knowledge is simply assumed by humans; to teach machines, this background calls for other sources.
One possible source of commonsense knowledge is a knowledge base, which needs to be built by hand. One existing source that some teams have used is ConceptNet, but it is far from comprehensive.
"We need knowledge of why and how," Choi notes, whereas the majority of elements in ConceptNet typically describe "is a" or "is a part of" relationships. To obtain the information needed, the group crowdsourced the information they wanted for their own Atomic knowledge base. They opted to build a new knowledge base rather than extend ConceptNet, partly because it focused the fine-tuning on aspects of behavior and motivation without potentially extraneous information, but also because Atomic is expressed in natural-language form, so the knowledge can more easily be processed by BERT. ConceptNet's symbolic representations need to be converted to natural language form using templates.
However, it remains unclear whether the Transformer neural-network design itself provides an appropriate structure for representing the knowledge it attempts to store. Says Antoine Bosselut, postdoctoral researcher at Stanford University, "It's one of the most interesting questions to answer in this space. We don't yet know exactly how the commonsense knowledge gets encoded. And we don't know how linguistic properties get encoded."
To improve the abilities of language models, Tetsuya Nasukawa, a senior member of the technical staff at IBM Research in Japan, says he and his colleagues took inspiration from the way images and language are used together to teach children when creating their visual concept naming (VCN) system. VCN uses images and text from social media to link objects to the words often used to describe them, on the basis that different cultures and nations may use quite different terms for the same thing, and that these differences are not captured by conventional training based on text alone. "We believe it's essential to handle non-textual information such as positions, shapes, and colors by using visual information," he says.
Another approach, which has been used by Ren's group, is to take an existing handbuilt knowledge base and couple it to a Transformer, rather than trying to teach the language model common sense. KagNet fine-tunes a BERT implementation in conjunction with a second neural network that encodes information stored in the ConceptNet knowledge base.
An issue with linking Transformers to other forms of AI model is that it is not yet clear how to make them cooperate in the most efficient manner. In the USC work, KagNet does not add much in terms of accuracy compared to a fine-tuned language model working on its own. As well as the relative sparsity of information in the knowledge base, Lin says the knowledge-fusing method may not go deep enough to make good connections. A further issue common to much work on language models is that it is not easy to determine why a language model provides the answer it does. "Does the model really answer the question for the right reasons? The current evaluation protocol may not be enough to show the power of symbolic reasoning," Lin says.
Nasukawa says work in visual question answering, in which a system has to answer a textual question about the content of an image, has met with similar issues. He says the most productive route that has emerged so far is to tune the second architecture for a specific application, rather than trying to fine-tune something more generic in the way language models currently work. A more sophisticated general-purpose structure has not yet emerged for applications that need an understanding of how the world works. In the meantime, Transformers may yield more surprises as they continue to scale up.
"Each time, the added scale gives us new capabilities to let us test new assumptions," Bosselut says. "As much as many people think we are going too far down this path, the truth is that the next iteration of language modeling could open a new set of capabilities that the current generation doesn't have. This is a great thing about NLP: there does seem to be an openness to diverse perspectives."
Rogers, A., Kovaleva, O., and Rumshisky, A. A Primer in BERTology: What we know about how BERT works, arXiv:2002.12327 (2020). https://arxiv.org/abs/2002.12327
Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., and Choi, Y. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) (2019).
Lin, B.Y., Chen, X., Chen, J., and Ren, X. KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019).
Muraoka, M., Nasukawa, T., Raymond, R., and Bhattacharjee, B. Visual Concept Naming: Discovering Well-Recognized Textual Expressions of Visual Concepts, Proceedings of The Web Conference (2020).
©2021 ACM 0001-0782/21/4
The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.