Games have long been a fertile testing ground for the artificial intelligence community, and not just because of their accessibility to the popular imagination. Games also enable researchers to simulate different models of human intelligence, and to quantify performance. No surprise, then, that the 2016 victory of DeepMind's AlphaGo algorithm—developed by 2019 ACM Prize in Computing recipient David Silver, who leads the company's Reinforcement Learning Research Group—over world Go champion Lee Sedol generated excitement both within and outside of the computing community. As it turned out, that victory was only the beginning; subsequent iterations of the algorithm have been able to learn without any human data or prior knowledge except the rules of the game and, eventually, without even knowing the rules. Here, Silver talks about how the work evolved and what it means for the future of general-purpose AI.
You grew up playing games like chess and Scrabble. What drew you to Go?
I learned the game of Go when I was a young kid, but I never really pursued it. Then later on, when I first moved to London, I started playing in a club in Hampstead in a crypt at the bottom of a church. It is a fascinating and beautiful game. Every time you think you know something about Go, you discover—like peeling an onion—there is another level of complexity to it.
"Every time you think you know something about Go, you discover—like peeling an onion—there is another level of complexity to it."
When did you start thinking about teaching computers to play?
I think it was always in my mind. One of the things that drew me to Go as a human player was the understanding that it was a challenging game for computers to play. Humans possess an intuition in Go that appears far beyond the scope of brute force computation, and this—along with the rich history of the game—lends the game a certain mystique. That subsequently led to my work understanding it as a computer scientist.
After a few years working in the games industry, you went to the University of Alberta to get your Ph.D. and see if reinforcement learning techniques could help computers crack Go.
I was working in the games industry, and I took a year out to try and figure out what to do next. I knew I wanted to go back and study AI, but I wasn't sure what direction to take, so I started reading around, and I came across Sutton and Barto's Reinforcement Learning: An Introduction. The moment I read that book, something just connected; it seemed to represent the most promising path for understanding how to solve a problem from first principles. Alberta had both the best games research group in the world and also the best group on reinforcement learning. My idea was to put those things together and try to solve the game of Go through the trial-and-error learning that we see in reinforcement learning.
Eventually, you built a system that learned to play Go on a smaller, nine-by-nine board.
We had some successes in the early days on the small-sized boards. Our system did learn, through these very principled trial-and-error reinforcement learning techniques, to associate different patterns with whether they would lead to winning or losing the game. Then I started collaborating with Sylvain Gelly at the University of Paris on a project called MoGo, which became the first nine-by-nine Go championship program.
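The trial-and-error learning Silver describes—associating patterns with eventual wins or losses—can be sketched in a few lines. This is a hypothetical illustration on an invented toy game, not the actual MoGo system: after each self-played game, the win-rate estimate of every pattern visited is nudged toward the observed outcome.

```python
import random
from collections import defaultdict

# Hypothetical sketch: Monte Carlo-style value learning on a toy game.
# Each "position" is reduced to a hashable pattern; after every self-play
# game, the win-rate estimate of each visited pattern is nudged toward
# the observed outcome. The toy game and all names are illustrative only.

random.seed(0)
ALPHA = 0.1  # learning rate for the incremental update

def play_random_game():
    """Toy stand-in for a game: a random walk that ends in a win or loss."""
    positions, score = [], 0
    for _ in range(10):
        score += random.choice([-1, 1])
        positions.append(score)          # the "pattern" is just the score here
    return positions, 1.0 if score > 0 else 0.0

value = defaultdict(float)               # pattern -> estimated win probability
for _ in range(5000):
    positions, outcome = play_random_game()
    for p in positions:
        # Trial and error: move each visited pattern's value toward the result.
        value[p] += ALPHA * (outcome - value[p])

# Patterns reached with a positive running score should look like likely wins.
print(value[3] > value[-3])
```

The same incremental update, scaled up with better pattern representations and search, is the core loop of the reinforcement learning approach described above.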
Later, you reconnected with your former Cambridge University classmate and DeepMind co-founder, Demis Hassabis, to continue that work with AlphaGo—which became the first computer program to beat a professional player on a full-sized 19x19 Go board.
I was very keen to have another look at computer Go when I arrived at DeepMind, because it felt like deep learning represented a very promising new possibility. So we started with a research question, namely whether deep learning could address the position evaluation problem. If you look at a pattern of stones on the board, can you predict who's going to win? Can you identify a good move? As we started working on that research question, it quickly became apparent that the answer was "yes." My feeling was that if we could build a system that could achieve the level of amateur dan through neural networks that simply examined a position and picked a move, with no precepts whatsoever—and none of the expertise that game engines have always had—it was time to hit the accelerate button.
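The position-evaluation question Silver poses—look at a pattern of stones and predict who is going to win—can be illustrated with a deliberately tiny stand-in for a deep network. This is an invented sketch, not AlphaGo's architecture: logistic regression over a 9x9 board encoding, trained on synthetic positions with a made-up winning rule.

```python
import numpy as np

# Hypothetical sketch of position evaluation: a tiny value "network"
# (plain logistic regression standing in for a deep net) that looks at a
# board encoded as a vector of stones and predicts who is going to win.
# The synthetic data and its winning rule are invented for illustration.

rng = np.random.default_rng(0)
N, BOARD = 2000, 81                     # 2000 positions on a 9x9 board

X = rng.choice([-1.0, 0.0, 1.0], size=(N, BOARD))  # -1/0/+1 = white/empty/black
y = (X.sum(axis=1) > 0).astype(float)   # toy label: black "wins" with more stones

w = np.zeros(BOARD)
for _ in range(200):                    # plain gradient descent on log loss
    p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted P(black wins) per position
    w -= 0.1 * X.T @ (p - y) / N

p = 1.0 / (1.0 + np.exp(-X @ w))
accuracy = ((p > 0.5) == (y > 0.5)).mean()
print(accuracy)
```

Replacing the linear model with a deep convolutional network, and the toy rule with outcomes of real games, turns this sketch into the evaluation problem the AlphaGo team set out to answer.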
In 2016, AlphaGo beat world champion Lee Sedol, who has said that the experience has made him a better player. What do you make of that?
There are multiple books now written on how human players should use AlphaGo strategies in their own playing. It has challenged people to think more holistically about the game, rather than in terms of local contributions to the score.
"AlphaZero will often give up a lot of material in a way that can be quite shocking to chess players, to gain a long-term edge over its opponent."
The same thing has happened in chess with AlphaGo's successor, AlphaZero, a program that has achieved superhuman performance despite starting without any human data or prior knowledge except the game's rules.
In contrast to the way that previous computer programs played the game, AlphaZero has encouraged people to be more flexible; to move away from material evaluation and understand that there are positions that can be enormously valuable in the long run. AlphaZero will often give up a lot of material in a way that can be quite shocking to chess players, to gain a long-term edge over its opponent.
AlphaGo suffered from what you called 'delusions', that is, persistent holes in its evaluation of a position that led it to make mistakes. How did you address these delusions in AlphaZero?
We tried many different things, but ultimately, it came down to being more principled. The more you trust your trial-and-error learning to correct its own errors, the fewer delusions the system will suffer from. We started off with a dataset that contained 100 different delusional positions. By the time we trained up AlphaZero, it got every single one of those delusional positions correct in its understanding. The more iterations of training it went through, the more of those delusions it could correct.
So there was no piece of specific additional training that was required?
The fundamental process of reinforcement learning is one of recognizing the holes in your own knowledge and getting the opportunity to correct them. That correction process leads to better results, and we really need to trust it. We would rerun the same algorithm again from new random weights and see it track the same progress, fixing the same delusions in roughly the same order, as if it were peeling its own onion layer by layer.
AlphaZero has mastered a number of different games, from Shogi to Space Invaders. Others have found even broader applications.
The beautiful thing about creating a general-purpose algorithm is that you end up being surprised by the ways in which it is used, and I think that's been true here as well. One group used AlphaZero to do chemical retrosynthesis and found that it outperformed all previous baselines. Another group used it to solve one of the outstanding problems in quantum computation, namely to optimize the quantum dynamics. A startup in North Africa used AlphaZero to solve logistical problems. It is quite nice when other people take your algorithm and use it to achieve good results.
Where is that work taking you next?
I try to ask what seems like the deepest science question. In this case, it felt to me that rather than trying another game, we should address what happens in applications where you don't know the rules—where you're interacting with people or with the real world, or where you're dealing with complicated, messy dynamics that no one tells you about. We built a version of this approach that we call MuZero. MuZero is able to learn a model of the rules or dynamics and uses that to plan and solve problems. It is kind of amazing; we plugged it back into chess, Go, and Shogi, and found that it could reach superhuman performance just as quickly, even without telling it the rules of the game. It was also able to beat baseline results in some of the more traditional reinforcement learning benchmarks, like Atari, where we'd previously been limited to model-free techniques without any lookahead planning.
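The MuZero idea Silver describes—learn a model of the dynamics from interaction, then plan inside that learned model—can be sketched on a toy problem. This is an invented illustration, not MuZero itself: the agent is never shown the environment's rules; it records observed transitions, then searches over its learned model to find the reward.

```python
import random

# Hypothetical sketch of the MuZero idea: the agent is never told the rules.
# It first learns a model of the dynamics from raw interaction, then plans by
# searching in that learned model. The toy chain environment is invented here.

GOAL = 5

def env_step(state, action):            # hidden rules: move left/right on a line
    s = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return s, 1.0 if s == GOAL else 0.0

# Phase 1: learn the dynamics purely from observed transitions.
random.seed(0)
model = {}                              # (state, action) -> (next_state, reward)
for _ in range(500):
    s = random.randint(0, GOAL)
    a = random.choice([0, 1])
    model[(s, a)] = env_step(s, a)

# Phase 2: plan inside the learned model, never calling env_step again.
def plan(state, depth):
    """Best achievable reward from `state` by lookahead in the learned model."""
    if depth == 0:
        return 0.0
    best = 0.0
    for a in (0, 1):
        if (state, a) in model:
            nxt, r = model[(state, a)]
            best = max(best, r + plan(nxt, depth - 1))
    return best

print(plan(0, GOAL))   # lookahead in the learned model finds the goal reward
```

In MuZero proper, the lookup table becomes a learned neural dynamics model and the brute-force search becomes Monte Carlo tree search, but the two-phase structure—model the world, then plan in the model—is the same.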
©2021 ACM 0001-0782/21/9