When my son was still a toddler and his mom had to go on an extended trip out of the country, he would "talk" to her on the phone almost daily. Scare quotes because he was still more babbling than talking. But the impressive (and adorable) thing was that his imitation of the syntactics of us talking on the phone was flawless, replete with meaningful pauses, expansive hand gestures, and walking around the room while on the phone.
Natural language generation systems in Artificial Intelligence are currently going through a rather fertile phase of imitation themselves—only not limited to a couple of hapless parents, but rather the whole world. The so-called large language models (LLMs), such as GPT-3, learn to imitate language generation by training themselves on the massive corpus (some three billion pages) of text crawled from the Web. This post is about the impacts of such massive language models, but first we will start with a little background on how they work.
LLMs learn to complete a piece of text in the training corpus one word at a time. Suppose the training data contains the sentence "The quick brown fox jumped the fence." The LLM may train itself to complete the partial sentence "The quick brown fox …" If the current model comes up with the completion "ran" instead of "jumped," the learning component takes this error and propagates it back to tune the model's parameters. From the system's point of view, "jumped" and "ran" are both vectors (sequences of numbers), and the difference between these vectors is the error. While tuning parameters brings to mind the image of a DJ turning knobs on a large audio mixer, it is worth noting that LLMs have an enormous number of tunable parameters. GPT-3, for example, has 175 billion of them, and it painstakingly tunes these parameters using massive compute facilities (it is estimated that training GPT-3 on a single, normal off-the-shelf GPU would take 355 years, and that even the lowest-cost training run would come to around $5 million).
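The word-at-a-time training loop described above can be sketched in miniature. Everything below is invented for illustration: a toy six-word vocabulary, a single linear layer, and a hand-picked learning rate. Real LLMs use deep transformer networks with billions of parameters and learn from web-scale text, not one sentence; the point here is only the shape of the loop: predict, measure the error as a difference of vectors, and nudge the parameters.

```python
import numpy as np

# Toy vocabulary and parameters (all invented for illustration).
vocab = ["the", "quick", "brown", "fox", "jumped", "ran"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, (V, D))   # word embeddings (held fixed here for simplicity)
W = rng.normal(0, 0.1, (D, V))   # output projection (the parameters we tune)

def predict(context_ids):
    """Average the context embeddings, then project to next-word probabilities."""
    h = E[context_ids].mean(axis=0)
    logits = h @ W
    exp = np.exp(logits - logits.max())
    return h, exp / exp.sum()

def train_step(context_ids, target_id, lr=0.5):
    """One gradient step: nudge the parameters toward the correct next word."""
    global W
    h, probs = predict(context_ids)
    # The error signal is a difference of vectors: predicted
    # probabilities minus the one-hot vector for the true next word.
    err = probs.copy()
    err[target_id] -= 1.0
    W -= lr * np.outer(h, err)   # propagate the error back into the projection

# Train on the example: context "the quick brown fox" -> target "jumped".
ctx = [word_to_id[w] for w in ["the", "quick", "brown", "fox"]]
tgt = word_to_id["jumped"]
for _ in range(50):
    train_step(ctx, tgt)

_, probs = predict(ctx)
print(vocab[int(np.argmax(probs))])  # the model now prefers "jumped" over "ran"
```

After the fifty tiny updates, the model assigns "jumped" a higher probability than "ran" for this context; GPT-3 does the analogous thing across billions of contexts.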
The resulting trained/tuned models have shown pretty impressive abilities to take any text prompt and provide plausible completions and elaborations. For example, this link shows GPT-3's completion based on the first paragraph of this column. Granted, what looks reasonable often turns out, on close inspection, to be bloviation only tangentially connected to the prompt. To be fair, though, even as recently as three years ago, no one really believed we would have AI systems capable of bloviating in perfect grammar, with text that is "plausible" at least at the level we associate with fast-talking fortune tellers and godmen.
Not surprisingly, the popular press has had a field day marveling at, and hyping up, the abilities of LLMs. Some outlets published columns purportedly written by GPT-3 (no doubt with significant filtering help from human editors). Others fretted about the imminent automation of all writing jobs.
While GPT-3 by OpenAI is perhaps the most famous of these LLMs, almost every Big Tech company is developing them, and several are reportedly already using them in client-facing applications. Google announced that its BERT-based LLMs are used in its search engine in multiple ways. It has also released Meena and LaMDA, LLMs trained specifically on massive conversational data to serve as backends for next-generation chatbots. Not surprisingly, there is also a rush to develop LLMs tailored to languages other than English. China recently announced an LLM called Wu Dao that has, at 1.75 trillion, 10 times more tunable parameters than GPT-3! Open-source implementations are slowly catching up with the commercial ones in terms of parameter capacity.
It is quite clear from the "one word at a time completion" design that LLMs focus on finding plausible completions to the prompt (and any previously generated completion words). There is no implied metareasoning about the global semantics of the completion (beyond that the completion has high-enough plausibility given the massive training data). Specifically, there is no guarantee of accuracy or factuality of any kind.
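To make the "plausible completion, nothing more" point concrete, here is a deliberately tiny sketch of autoregressive generation. The next_word probability table below is entirely made up for illustration; an actual LLM computes these probabilities with a trained neural network over a vast vocabulary. But the generation loop itself has the same shape: sample a plausible next word given what has been generated so far, append it, and repeat, with no fact-checking step anywhere.

```python
import random

# Hypothetical next-word plausibility table (invented for illustration).
# Each entry maps a word to candidate continuations with probabilities.
next_word = {
    "the":    [("quick", 0.6), ("brown", 0.4)],
    "quick":  [("brown", 0.9), ("fox", 0.1)],
    "brown":  [("fox", 1.0)],
    "fox":    [("jumped", 0.7), ("ran", 0.3)],
    "jumped": [("<end>", 1.0)],
    "ran":    [("<end>", 1.0)],
}

def complete(prompt_word, seed=0):
    """Generate one word at a time by sampling plausible continuations.
    Note there is no step that checks accuracy or factuality."""
    rng = random.Random(seed)
    words = [prompt_word]
    while words[-1] in next_word:
        candidates, weights = zip(*next_word[words[-1]])
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(w for w in words if w != "<end>")

print(complete("the"))
```

Every sentence this loop emits is locally plausible by construction, which is exactly why nothing in the loop can vouch for whether the whole is true.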
Nevertheless, as a species, we humans are particularly vulnerable to confusing syntax with semantics—be it accent with accomplishment, beauty with talent, or confidence with content. So LLMs that can produce perfectly grammatical and reasonably plausible text (not unlike a smooth-talking soothsayer) are turning out to be a pretty effective Rorschach test for us! Some see in them the optimistic future of singularity and AI reaching general human intelligence, while others are terrified by their potential misuses, whether intended or unintended. The opposing views about the right ways to deploy LLMs played out on a rather public stage last year between Google and its AI and Ethics group.
At first glance, it may seem a little strange that there is so much concern about LLMs, in contrast to other impressive feats of AI, such as Deep Blue or AlphaGo. The latter are examples of deep but narrow intelligence: almost provably good at their specific tasks, but nothing more. We are used to them by now. LLMs, in contrast, fall into the category of broad but shallow intelligence. While they can bloviate with superficial intelligence on almost any topic, they offer no guarantees about the content of what they generate. The broad but shallow linguistic competence exhibited by LLMs is both frightening and exhilarating, because we know how easily many of us are taken in by it.
To be sure, most applications of LLMs that make them available as tools to support our own writing, in a computer-supported cooperative work setting, can be very helpful, especially for people who are not particularly proficient in the language. I had a clever Ph.D. student from China in the early 2000s who would improve his ill-phrased sentences by posting them as search queries to Google and revising them based on the results! Imagine how much more effective he would be with LLM-based tools. Indeed, even some journalists, who could justifiably take an antagonistic stance toward these technologies, have sung the praises of LLM-based writing tools.
LLMs have also been shown to be quite good at quickly learning to translate from one format to another; for example, from text specifications to code snippets, thus giving the same support to code-smithing that they are already known to provide for wordsmithing. This translation ability will likely allow us to interact with our computers in natural language, rather than arcane command line syntax. Indeed, the seeming generality of LLMs has even tempted some researchers to start rebranding them with the controversial term "foundation models."
The worrisome scenarios are those where the systems are fielded in end-user-facing applications, be they machine-generated text, explanations, or search-query elaborations. Here, humans can be put in a vulnerable position by the broad and shallow linguistic intelligence displayed by LLMs. In one recent case, a medical chatbot backed by GPT-3 reportedly advised a test patient to kill themselves. In another study, 72% of people reading an LLM-generated fake news story thought it was credible. Even supposedly computer-savvy folks were hardly more immune: a GPT-3-produced fake blog post climbed to the top of Hacker News last year. To their credit, the OpenAI policy team did do serious due diligence on the potential impacts before releasing their LLM in stages. Nevertheless, given the largely open and democratic nature of AI research, and the lack of effective moats in developing these models, no single company can possibly control the uses and misuses of LLMs now that Pandora's box is open.
One of the big concerns about LLM-generated text is that it can often be rife with societal biases and stereotypes. There was a rather notorious early example of GPT-3 completing even innocuous prompts involving Muslim men with violence. That LLMs give out biased or toxic completions should be no surprise, given that they are, in effect, trained on our raw Jungian collective unconscious as uploaded to the Web, rife as it is with biases and prejudices.
While "bias" got a lot of attention, the reality is that GPT-3 can stand behind the accuracy of neither its biased statements nor its unbiased/polite ones. All meaning and accuracy—beyond plausible completion in the context of the training data—is in the eye of the beholder. The text generated by LLMs is akin to our subconscious (System 1) thoughts, before they are filtered by conscious (System 2) civilizational norms and constraints. Controlling data-driven AI systems with explicit knowledge constraints—such as societal norms and mores—is still quite an open research problem. Some recent work has made GPT-3's completions sound more polite by taking "explicit knowledge" about societal mores and norms and converting it into carefully curated (hand-coded?) additional training data. Such quixotic methods are brittle, time-consuming, and do nothing to improve the accuracy of the content, even if they happen to make the generated text more polite. We need more effective methods for infusing explicit knowledge about societal mores and norms into LLMs.
As long as we use LLMs as tools for writing assistance in computer-supported cooperative work scenarios, they can be quite effective. After all, even much more primitive language models, such as those that treat a document as just a bag of words, have proven useful, and current-generation LLMs capture far more of the structure of human language. Abundant caution is needed, however, when they are placed in end-user-facing applications; given the commercial pressures, such caution cannot be guaranteed. In a world with easy access to LLMs, we humans may either be playing a perpetual CAPTCHA game, trying to tease apart human from machine text, or, worse yet, getting prepared to compete for attention to our (deeper?) ideas and treatments in the din of syntactically pleasing text summaries and explanations churned out by LLMs.
On the research side, the big open question is when and whether advances in LLMs can make them go beyond imitating syntax. In the case of my son's imitation of us speaking on the phone, as time went on, his subconscious seemingly got even better at the syntax, while his conscious self certainly got better at taming the firehose of his babble and bending it to what he wanted to get across. It remains to be seen whether LLMs can evolve this way. Already there is a rush in the academic community to start research centers to investigate this very question.
Subbarao Kambhampati is a professor of computer science at Arizona State University, and a former president of the Association for the Advancement of Artificial Intelligence, who studies fundamental problems in planning and decision making, motivated in particular by the challenges of human-aware AI systems. He can be followed on Twitter @rao2z.