Machines Beat Humans on a Reading Test. But Do They Understand?


In the fall of 2017, Sam Bowman, a computational linguist at New York University, figured that computers still weren’t very good at understanding the written word. Sure, they had become decent at simulating that understanding in certain narrow domains, like automatic translation or sentiment analysis (for example, determining if a sentence sounds “mean or nice,” he said). But Bowman wanted measurable evidence of the genuine article: bona fide, human-style reading comprehension in English. So he came up with a test.

In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, the Google-owned artificial intelligence company, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The test was designed as “a fairly representative sample of what the research community thought were interesting challenges,” said Bowman, but also “pretty straightforward for humans.” For example, one task asks whether a sentence is true based on information offered in a preceding sentence. If you can tell that “President Trump landed in Iraq for the start of a seven-day visit” implies that “President Trump is on an overseas visit,” you’ve just passed.

The machines bombed. Even state-of-the-art neural networks scored no higher than 69 out of 100 across all nine tasks: a D-plus, in letter grade terms. Bowman and his coauthors weren’t surprised. Neural networks — layers of computational connections built in a crude approximation of how neurons communicate within mammalian brains — had shown promise in the field of “natural language processing” (NLP), but the researchers weren’t convinced that these systems were learning anything substantial about language itself. And GLUE seemed to prove it. “These early results indicate that solving GLUE is beyond the capabilities of current models and methods,” Bowman and his coauthors wrote.

Their appraisal would be short-lived. In October of 2018, Google introduced a new method nicknamed BERT (Bidirectional Encoder Representations from Transformers). It produced a GLUE score of 80.5. On this brand-new benchmark designed to measure machines’ real understanding of natural language — or to expose their lack thereof — the machines had jumped from a D-plus to a B-minus in just six months.

“That was definitely the ‘oh, crap’ moment,” Bowman recalled, using a more colorful interjection. “The general reaction in the field was incredulity. BERT was getting numbers on many of the tasks that were close to what we thought would be the limit of how well you could do.” Indeed, GLUE didn’t even bother to include human baseline scores before BERT; by the time Bowman and one of his Ph.D. students added them to GLUE in February 2019, they lasted just a few months before a BERT-based system from Microsoft beat them.

As of this writing, nearly every position on the GLUE leaderboard is occupied by a system that incorporates, extends or optimizes BERT. Five of these systems outrank human performance.

But is AI actually starting to understand our language — or is it just getting better at gaming our systems? As BERT-based neural networks have taken benchmarks like GLUE by storm, new evaluation methods have emerged that seem to paint these powerful NLP systems as computational versions of Clever Hans, the early 20th-century horse who seemed smart enough to do arithmetic, but who was actually just following unconscious cues from his trainer.

“We know we’re somewhere in the gray area between solving language in a very boring, narrow sense, and solving AI,” Bowman said. “The general reaction of the field was: Why did this happen? What does this mean? What do we do now?”

Writing Their Own Rules

In the famous Chinese Room thought experiment, a non-Chinese-speaking person sits in a room furnished with many rulebooks. Taken together, these rulebooks perfectly specify how to take any incoming sequence of Chinese symbols and craft an appropriate response. A person outside slips questions written in Chinese under the door. The person inside consults the rulebooks, then sends back perfectly coherent answers in Chinese.

The thought experiment has been used to argue that, no matter how it might appear from the outside, the person inside the room can’t be said to have any true understanding of Chinese. Still, even a simulacrum of understanding has been a good enough goal for natural language processing.

The only problem is that perfect rulebooks don’t exist, because natural language is far too complex and haphazard to be reduced to a rigid set of specifications. Take syntax, for example, the rules (and rules of thumb) that define how words group into meaningful sentences. The phrase “colorless green ideas sleep furiously” has perfect syntax, but any natural speaker knows it’s nonsense. What prewritten rulebook could capture this “unwritten” fact about natural language — or innumerable others?

NLP researchers have tried to square this circle by having neural networks write their own makeshift rulebooks, in a process called pretraining.

Before 2018, one of NLP’s main pretraining tools was something like a dictionary. Known as word embeddings, this dictionary encoded associations between words as numbers in a way that deep neural networks could accept as input — akin to giving the person inside a Chinese room a crude vocabulary book to work with. But a neural network trained with word embeddings is still blind to the meaning of words at the sentence level. “It would think that ‘a man bit the dog’ and ‘a dog bit the man’ are exactly the same thing,” said Tal Linzen, a computational linguist at Johns Hopkins University.

A better method would use pretraining to equip the network with richer rulebooks — not just for vocabulary, but for syntax and context as well — before training it to perform a specific NLP task. In early 2018, researchers at OpenAI, the University of San Francisco, the Allen Institute for Artificial Intelligence and the University of Washington simultaneously discovered a clever way to approximate this feat. Instead of pretraining just the first layer of a network with word embeddings, the researchers began training entire neural networks on a broader basic task called language modeling.

“The simplest kind of language model is: I’m going to read a bunch of words and then try to predict the next word,” explained Myle Ott, a research scientist at Facebook. “If I say, ‘George Bush was born in,’ the model now has to predict the next word in that sentence.”

These deep pre-trained language models could be produced relatively efficiently. Researchers simply fed their neural networks massive amounts of written text copied from freely available sources like Wikipedia — billions of words, preformatted into grammatically correct sentences — and let the networks derive next-word predictions on their own. In essence, it was like asking the person inside a Chinese room to write all his own rules, using only the incoming Chinese messages for reference.

“The great thing about this approach is it turns out that the model learns a ton of stuff about syntax,” Ott said.

What’s more, these pre-trained neural networks could then apply their richer representations of language to the job of learning an unrelated, more specific NLP task, a process called fine-tuning.

“You can take the model from the pretraining stage and kind of adapt it for whatever actual task you care about,” Ott explained. “And when you do that, you get much better results than if you had just started with your end task in the first place.”

Indeed, in June of 2018, when OpenAI unveiled a neural network called GPT, which included a language model trained on nearly a billion words (sourced from 11,038 digital books) for an entire month, its GLUE score of 72.8 immediately took the top spot on the leaderboard. Still, Sam Bowman assumed that the field had a long way to go before any system could even begin to approach human-level performance.

Then BERT appeared.