PART 1: From the NIPS Experiment to the ESA Experiment
In 2014, the organizers of the Conference on Neural Information Processing Systems (NeurIPS, then still called NIPS) conducted an interesting experiment.1 They split their program committee (PC) in two and let each half independently review a bit more than half of the submissions. That way, 10% of all submissions (166 papers) were reviewed by two independent PCs. The target acceptance rate per PC was 23%. The result of the experiment was that among these 166 papers, the sets of accepted papers of the two PCs overlapped by only 43%. That is, more than half of the papers accepted by one PC were rejected by the other. This led to a passionate flare-up of the old debate about how effective or random peer review really is and what we should do about it.
The experiment left open a number of interesting questions:
- How many papers that looked like "clear" accepts in one PC were rejected by the other PC, if any?
- How many papers that looked like "clear" rejects in one PC were accepted by the other PC, if any?
- How well did the rankings of the two PCs correlate, and is there a natural cutoff to determine the set of accepted papers?
- Do the discussions of the papers between PC members help decrease the randomness of the decisions?
- What does this all mean for the future of peer review?
To answer these questions, in 2018 I conducted an experiment similar to the NIPS experiment, but with richer data and a deeper analysis. The target was the 26th edition of the "European Symposium on Algorithms" (ESA), a venerable algorithms conference. ESA receives around 300 submissions every year and has two tracks: the more theoretical Track A and the more practical Track B. For the experiment, I picked Track B, which received 51 submissions that year. Two independent PCs were set up, each with 12 members and tasked with an acceptance rate of 24%. A total of 313 reviews were produced. These numbers are smaller than for the NIPS experiment, but still large enough to yield meaningful results. Importantly, they were small enough to allow for a time-intensive deeper analysis.
Both PCs followed the same standard reviewing process, which was agreed on and laid out in advance as clearly as possible:
- Phase 1: PC members entered their reviews without seeing any of the other reviews.
- Phase 2: PC members discussed with each other, mostly on a per-paper basis, and papers were proposed for acceptance/rejection in rounds.
- Phase 3: The remaining ("gray zone") papers were compared with each other and all papers left without a clear decision in the end were decided by voting.
PC members were explicitly asked, and repeatedly reminded, to update the score of their review whenever they changed anything in it. This allowed a quantitative analysis of the various phases of the reviewing process. For more details on the setup, the results, the data, and a script to evaluate and visualize the data in various ways, see the website of the experiment.2
PART 2: The main results of the ESA experiment
Let us first get a quick overview of the results and then, in Part 3, discuss their implications.
What is the overlap in the set of accepted papers? In the NIPS experiment, the overlap was 43%. In the ESA experiment, it was 58%; the acceptance rates were almost the same. To put these figures into perspective: if the reviewing process were deterministic, the overlap would be 100%. If each PC accepted a random subset of papers, the expected overlap would be 24%. If 10% / 20% / 20% / 50% of the papers were accepted with probabilities 0.8 / 0.6 / 0.1 / 0.0, respectively, the expected overlap would be around 60%. However, the overlap is not the best number to look at, since it depends rather heavily on the number of accepted papers; see below.
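The expected-overlap figures above are easy to verify. A minimal sketch in Python (the four-group model with acceptance probabilities 0.8 / 0.6 / 0.1 / 0.0 is the one quoted above; everything else is straightforward probability):

```python
# Expected overlap |A ∩ B| / |A| when two PCs accept each paper
# independently. Group i contains a fraction f_i of the papers, each
# accepted with probability p_i by either PC.
fractions = [0.10, 0.20, 0.20, 0.50]
probs = [0.80, 0.60, 0.10, 0.00]

# Expected acceptance rate per PC: sum of f_i * p_i.
rate = sum(f * p for f, p in zip(fractions, probs))

# Expected fraction of papers accepted by both PCs: sum of f_i * p_i^2.
both = sum(f * p * p for f, p in zip(fractions, probs))

print(f"acceptance rate: {rate:.0%}, expected overlap: {both / rate:.0%}")
# prints: acceptance rate: 22%, expected overlap: 63%
```

For a uniformly random selection (all p_i equal to the acceptance rate), the same formula gives an expected overlap equal to the acceptance rate itself, which is where the 24% baseline comes from.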
How many clear accepts were there? The possible scores for each review were +2, +1, 0, -1, and -2. The use of 0 was discouraged, and it was communicated beforehand that only papers with a +2 from at least one reviewer would be considered for acceptance. For a paper that received only +2 scores, there was no incentive for discussion, and such papers were accepted right away. There was little agreement between the two PCs concerning such "clear accepts." Out of nine papers that were clear accepts in one PC, four were rejected by the other PC and only two were also clear accepts in the other PC (that is, 4% of all submissions). If papers that are "clear accepts" exist at all, they are very few.
How many clear rejects were there? A paper was counted as a clear reject if one reviewer gave a -2 and no reviewer gave a +1 or +2. There were 20 such clear rejects in PC1 and 17 in PC2. None of these papers were even considered for acceptance in the other PC. At least one-third of the submissions were thus clear rejects in the sense that it is unlikely that any other PC would have accepted any of them. There was only a single paper with a score difference of 3 or more between the two PCs; it was a clear accept in one PC (all reviewers gave it a +2, praising the strong results), while the other PC was very critical of its meaningfulness.
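The triage rules from the last two paragraphs can be written down as a small classifier. This is a sketch; the function name and the encoding of scores as plain integers are my own, not part of the experiment's tooling:

```python
def classify(scores):
    """Coarse triage of a paper from its review scores (integers in
    -2..+2), following the rules described above: a 'clear accept'
    received only +2 scores; a 'clear reject' got a -2 from at least
    one reviewer and no positive score from anyone; everything else
    goes into the discussion phases."""
    if all(s == 2 for s in scores):
        return "clear accept"
    if min(scores) == -2 and max(scores) <= 0:
        return "clear reject"
    return "discuss"

print(classify([2, 2, 2]))    # clear accept
print(classify([-2, 0, -1]))  # clear reject
print(classify([2, -2, 1]))   # discuss: a controversial paper
```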
Is there a natural cutoff to determine the set of accepted papers? If both PCs had accepted only their best 10%, the overlap in the set of accepted papers would have been 40% (corresponding to the 4% "clear accepts"). For acceptance rates between 14% and 40%, the overlap varied rather erratically between 54% and 70%. Increasing the acceptance rate beyond that showed a steady increase in the overlap (due to the "clear rejects" at the bottom). There is no natural cutoff short of the "clear rejects."
How effective were the various reviewing phases? We have seen that the overlap for a fixed acceptance rate is a rather unreliable measure. I therefore also compared the rankings of the two PCs among those papers which were at least considered for acceptance. Ranking similarity was computed via the Kendall tau correlation (1 for identical rankings, 0 for random rankings, -1 if one is the reverse of the other). Again, see the website for details.2 This similarity was 46% after Phase 1, 63% after Phase 2, and 58% after Phase 3, where the increase after Phase 1 is statistically significant (p = 0.02). This suggests that the per-paper discussions play an important role in making paper scores more objective, while any further discussions add little or nothing in that respect. This matches the common experience that PC members are willing to adapt their initial scores once, after reading the reviews of the other PC members. After that, their opinion is more or less fixed.
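For reference, the Kendall tau correlation used above can be computed in a few lines of plain Python (a simple O(n²) version without tie correction; the example rankings are hypothetical):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two rankings of the same items:
    +1 for identical order, -1 for reversed order, around 0 for
    unrelated rankings. Simple O(n^2) version, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) // 2)

# Two hypothetical rankings of the same six papers (1 = best):
pc1 = [1, 2, 3, 4, 5, 6]
pc2 = [2, 1, 3, 5, 4, 6]
print(kendall_tau(pc1, pc2))  # two swapped pairs: 11/15 ≈ 0.73
```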
In summary, the PCs did a good job in separating the wheat from the chaff. There appeared to be at least a partial order in the wheat, but there is no natural cutoff. The fewer papers are accepted, the more random is the selection. The initial per-paper discussions helped to make the review scores more objective. Any further discussions had no measurable effect.
The above results are probably an upper bound for the objectivity of the reviewing process at a computer science conference, for the following reasons:
- ESA is a medium-sized conference with a relatively tightly knit community and a one-tier PC.
- In online discussions, threads frequently stall because PC members forget to reply or do not get around to it because of other obligations. For this experiment, great care was taken to remind PC members to give feedback, so that no discussion thread stalled.
- The reviewing process was laid out in detail beforehand and it was identical for the two PCs.
- The PCs were selected so that their diversity (with respect to seniority, gender, topic, continent) was as similar as possible.
Larger conferences, two-tier PCs, unresponsive PC members, underspecified guidelines, and variance in diversity most likely all further increase the randomness in the reviewing process.
PART 3: What now?
I see four main conclusions from this experiment:
First, we need more experiments of this kind. We have the NIPS experiment and now the ESA experiment.3 They give a first impression, but important questions are still open. For example, it would be very valuable to redo the experiment above for a larger and more heterogeneous conference. One argument I often hear is that this is too much effort, in particular with respect to the additional number of reviewers needed. I don't buy this argument. There are so many conferences in computer science, many of them very large. If we pick one of these conferences from time to time to conduct an experiment, the additional load is negligible in the big picture. Another argument I often hear is that improving peer review is an unsolvable problem. This always leaves me baffled. In their respective fields, researchers love hard problems and sometimes spend their whole lives trying to make some progress. But when it comes to the reviewing process, the status quo is as good as it gets?
Second, we need to fully accept the results of these experiments. The experiments so far provide strong hints that there is a significant signal in reviews, but also a significant amount of noise and randomness. Yet, to this day, the myth of a natural cutoff for determining the set of accepted papers prevails. It is usually acknowledged that there is a gray zone, but not that this "gray zone" might encompass almost all of the papers which are not clear rejects. PCs can spend a lot of time debating papers, blissfully unaware that another PC in a parallel universe did not give these papers much attention because they were accepted or more likely rejected early on in the process. From my own PC experience, I conjecture that there are at least two biases at work here. One is that humans tend to be unaware of their biases and feel that they are much more objective than they actually are. Another is the feeling that if you make a strong effort as a group, then the result is meaningful and fair. The other extreme is fatalism: the feeling that the whole process is random anyway, so why bother to provide a proper review. Both of these extremes are wrong, and this is still not widely understood or acted upon.
Third, how do we incorporate these results to improve the reviewing process? Let us assume that the results from the NIPS and the ESA experiments are not anomalies; then there are some pretty straightforward ways to incorporate them into the current reviewing process. For example, discussion of papers in the alleged "gray zone" could be dropped. Instead, this energy could be used to communicate and implement the semantics of the available scores as clearly as possible in advance. Average scores could then be converted to a probability distribution for at least a subset of the papers, namely those for which at least one, but not all, reviewers spoke up. Papers from this "extended gray zone" could then be accepted with a probability proportional to their score. This would not make the process any more random, but it would definitely make it less biased. To reduce not only bias but also randomness, a simple and effective measure would be to accept more papers. Digital publication no longer imposes a limit on the number of accepted papers, and many conferences have already moved away from the "one full talk per paper" principle.
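A randomized acceptance step for the "extended gray zone", as suggested above, could be sketched as follows. Everything here is an assumption of mine (the shifted score values, the function name, and the exact score-to-probability mapping), since the post only outlines the idea:

```python
import random

def accept_gray_zone(papers, slots, rng=random):
    """Draw `slots` papers from the extended gray zone, each with
    probability proportional to its (positive) average score.
    `papers` maps paper id -> average score, shifted to be > 0."""
    ids = list(papers)
    weights = [papers[p] for p in ids]
    chosen = set()
    while len(chosen) < min(slots, len(ids)):
        # Redraw on duplicates; the set ignores repeated picks.
        chosen.add(rng.choices(ids, weights=weights, k=1)[0])
    return chosen

# Hypothetical gray-zone papers; scores shifted so that all are positive:
scores = {"paper-A": 1.8, "paper-B": 1.2, "paper-C": 0.9, "paper-D": 0.4}
print(accept_gray_zone(scores, slots=2, rng=random.Random(42)))
```

This preserves the signal in the scores (higher-scored papers are accepted more often) while making the residual randomness explicit instead of hiding it behind seemingly deterministic decisions.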
Fourth, all of this knowledge has to be preserved from one PC to the next. Already now, we have a treasure of knowledge on the peer review process. But only a fraction of it is considered or implemented at any particular conference. The main reason I see is the typical way in which administrative jobs are implemented in the academic world. Jobs rotate (often rather quickly), there is little incentive to excel, there is almost no quality control (who reviews the reviewers), and participation in the peer review process is another obligation on top of an already more-than-full-time job. You do get status points for some administrative jobs, but not for doing them particularly well or for investing an outstanding amount of time or energy. Most of us are inherently self-motivated and incredibly perseverant when it comes to our science. Indeed, that is why most of us became scientists in the first place. Administrative tasks are not what we signed up for, not what we were trained for, and not what we were selected for. We understand intellectually how important they are, but we do not really treat them that way.
My bottom line: The reputation of the peer review process is tarnished. Let us work on this with the same love and attention we give to our favorite research problems. Let us do more experiments to gain insights that help us make the process more fair and regain some trust. And let us create powerful incentives, so that whatever we already know is good is actually implemented and carried over from one PC to the next.
1 https://cacm.acm.org/blogs/blog-cacm/181996-the-nips-experiment provides a short description of the NIPS experiment and various links to further analyses and discussions.
3 There are other experiments, like the single-blind vs. double-blind experiment at WSDM'17, which investigated a particular aspect of the reviewing process: https://arxiv.org/abs/1702.00502
Hannah Bast is a professor of computer science at the University of Freiburg, Germany. Before that, she was working at Google, developing the public transit routing algorithm for Google Maps. Right after the ESA experiment, she became Dean of the Faculty of Engineering in Freiburg and a member of the Enquete Commission for Artificial Intelligence of the German parliament (Bundestag). That's why it took her two years to write this blog post.