One of the key drivers for the decision to lock down the U.K. in late March 2020 was a computational epidemiological model developed at Imperial College, London. When the code for the model was released, it immediately attracted criticism with respect to software quality – and these criticisms then provided ammunition for those who argued the lockdown was an overreaction. So, what was the software and what are the criticisms? Are they significant? And, crucially, what can we learn from the experience?
Throughout the first half of March 2020, European nations raced to enforce lockdowns in an attempt to stem the spread of the coronavirus. By mid-March, much of Europe had closed the shutters on ordinary life. But the U.K., it seemed, was intent on business as usual: no lockdown seemed in prospect. And then, more or less overnight, the U.K. government pivoted: just 10 days later, the U.K. was also in lockdown. It became apparent that one driver for the change of heart was an epidemiological model developed by Neil Ferguson and his team at Imperial College, London. The report produced on the basis of the model made international headlines. It is surely one of the most influential and terrifying scientific papers written this century. It predicted that, unless containment measures were taken, the virus would cause more than half a million excess deaths in the U.K. – more than the total number of U.K. fatalities in World War II. The U.K.'s health care system would be utterly overwhelmed. Given these horrifying predictions, the decision to lock down perhaps seemed inevitable.
The Ferguson model is an example of an agent-based epidemiological model: it models the spread of disease down at the level of individuals and their contacts. At the heart of such a model is a graph representing a network of social contacts, with vertices corresponding to individuals, and edges indicating a social contact, and hence a possible route for infection. This is then overlaid with a model of how the disease spreads through the network, which at its crudest might be a probability that the disease will spread from one infected individual to another individual with whom they have contact. Given such a model and an initial state of the network, it is computationally a relatively straightforward matter to simulate how infection spreads, although of course since the models are stochastic, different simulations will yield different results. The Ferguson model is much richer than this: it makes a raft of assumptions about questions such as how the disease spreads, whether asymptomatic individuals can infect others, how infection progresses through an individual to recovery or death, the infection fatality rate, the case fatality rate, and so on. At the time of writing, seven months into the pandemic, great uncertainty remains around many of these basic questions. Given such uncertainty, the model must be evaluated with a range of possibilities for each assumption. This highlights some fundamental difficulties with this type of modelling. First, the sheer weight of assumptions means it is hard to be confident about the results of simulation: even small changes to initial assumptions can make big differences to the results. Second, the sheer number of parameters means the space of different combinations is huge. Systematically exploring the entire parameter space is impossible.
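To make the crudest version of such a model concrete, here is a minimal sketch in Python: a contact graph plus a single per-contact transmission probability. It is an illustration only and is not taken from the Imperial College code. The function names, the ring-shaped contact network, and the parameters `p_transmit` and `p_recover` are all assumptions made for the example.

```python
import random


def ring_contacts(n, k=1):
    """Hypothetical contact network: person i is in contact with the k
    people on either side of them, arranged in a ring of n people."""
    return {i: [(i + d) % n for d in range(-k, k + 1) if d != 0]
            for i in range(n)}


def simulate(contacts, initially_infected, p_transmit, p_recover, steps, seed):
    """Simulate infection spreading over a contact graph.

    contacts maps each person to the list of people they are in contact with.
    On every step, each infected person infects each susceptible contact with
    probability p_transmit, and then recovers with probability p_recover.
    Returns the number of infected people after each step, and the final
    number of recovered people.
    """
    rng = random.Random(seed)  # private generator, so runs are repeatable
    infected = set(initially_infected)
    recovered = set()
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        # Iterate in sorted order so random draws happen in a fixed order.
        for person in sorted(infected):
            for neighbour in contacts[person]:
                susceptible = (neighbour not in infected
                               and neighbour not in recovered)
                if susceptible and rng.random() < p_transmit:
                    newly_infected.add(neighbour)
        newly_recovered = {person for person in sorted(infected)
                           if rng.random() < p_recover}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
        history.append(len(infected))
    return history, len(recovered)


if __name__ == "__main__":
    network = ring_contacts(200, k=2)
    for seed in (1, 2, 3):
        history, total_recovered = simulate(
            network, {0}, p_transmit=0.3, p_recover=0.2, steps=30, seed=seed)
        print(seed, history[-1], total_recovered)
```

Running the script with several seeds shows the point made above about stochastic models: each seed generally gives a different trajectory, so conclusions have to be drawn from many runs rather than one. Every additional assumption (for instance, asymptomatic transmission) would add further parameters to this sketch, which is how the parameter space grows so quickly.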
The software base used to produce the report was about 15,000 lines of code, adapted from several decades of previous models developed within Ferguson's lab. A cleaned-up version of the code (not the original) was made available on GitHub in April 2020. It quickly attracted a great deal of professional interest from the international computing community, and very shortly thereafter it began to attract criticism. At the heart of these criticisms were suggestions that the software was simply of poor quality. It was, according to one U.K. newspaper report, "totally unreliable… a buggy mess… impossible to read." The main issues raised seem to be as follows:
- Lack of comments. The first thing that many pundits latched onto was that much of the code was uncommented. Of course, this doesn't mean the code was wrong or of poor quality, but it was taken as an indicator that the software was developed in an environment with relaxed software engineering standards, which in turn undermined confidence. Moreover, comments are there to help others understand the code, and to give confidence that the code is correct. The lack of comments works against such understanding.
- Choice of programming language. For the most part, the code appears to be conventional C. Conventional wisdom for choosing a programming language is to select one that allows you to best express your problem. While C might be the natural language to write a device driver, it isn't an obvious choice for agent-based modelling – it doesn't lend itself to naturally expressing such a model.
- Software structure. The main software engineering techniques used in the code seem to be procedural and functional abstraction, and it was claimed that the original codebase was highly monolithic (supposedly a single 15,000-line file). This was taken as another indicator of being developed in an environment with relaxed software engineering standards.
- Opaque assumptions. While it is inevitable that a model will make assumptions, it is surely essential that these assumptions are explicit and transparent. In particular, they should not be embedded within the code itself: we should not have to "reverse-engineer" assumptions from the code. Another criticism of the model was that it did exactly that, and that this hindered attempts to understand (and hence question) the underlying assumptions.
- Randomness, non-determinism, and reproducibility. All of the criticisms above can be taken as being in some sense superficial, relating to the appearance of the software rather than what it actually does. A more fundamental issue arose when it was claimed that the software was inherently non-deterministic. A stochastic simulation is still expected to be repeatable: running the code with the same starting parameters, and in particular the same random seeds, should produce the same outputs. However, it was claimed, the Ferguson model failed this test: the same seeds could lead to different results. This would imply that the Ferguson model failed a key requirement for good science: reproducibility.
- Predictions. There is a division of opinion within the agent-based modelling community about the predictive ability of agent-based models. While it is generally accepted that such models can provide useful qualitative insights, it is less clear that such models can reliably be used to provide quantitative predictions: we don't yet understand how to validate and calibrate such models at scale. This observation was raised in the context of the Ferguson model, when it was claimed that this model predicted a far higher death toll in Sweden than was actually observed given that country's lightweight lockdown.
- Openness. As noted above, the code released onto GitHub is not the actual code that was used to generate the report. This raised eyebrows because it fails the standard scientific tests of openness and transparency: for example, an increasing number of scientific journals require all associated data sets and software to be made available for peer scrutiny.
- Lack of peer review. Finally, many commentators pointed out that the report and its conclusions had not received peer review. Peer review is a cornerstone of good scientific practice. Peer review subjects research to the scrutiny of independent experts, and it is the main mechanism via which poor science is weeded out before dissemination. The apparent lack of independent external scrutiny for a report that had played such a major role in national policy was therefore seen as controversial.
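Two of the points above, explicit assumptions and reproducibility, can be illustrated with a small sketch. It is a toy stochastic model written for this purpose, not a reconstruction of the Imperial College code. The `Assumptions` class, its field names and default values, and the `run` function are all hypothetical. The idea is that every assumption lives in one declared, immutable place, and that all randomness comes from a single generator created from the seed, so that the same assumptions and the same seed always give the same output.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumptions:
    """All modelling assumptions in one explicit, immutable place.
    The default values are placeholders, not epidemiological estimates."""
    population: int = 1000
    p_infect_per_day: float = 0.3
    p_recover_per_day: float = 0.1
    days: int = 60


def run(assumptions, seed):
    """Toy stochastic epidemic in a well-mixed population.

    Returns the number of currently infected people at the end of each day.
    All randomness comes from one generator seeded here, so the result is a
    deterministic function of (assumptions, seed).
    """
    rng = random.Random(seed)
    susceptible = assumptions.population - 1
    infected = 1
    curve = []
    for _ in range(assumptions.days):
        # Chance that a given susceptible person is infected today.
        p_today = assumptions.p_infect_per_day * infected / assumptions.population
        new_cases = sum(rng.random() < p_today for _ in range(susceptible))
        new_recoveries = sum(rng.random() < assumptions.p_recover_per_day
                             for _ in range(infected))
        susceptible -= new_cases
        infected += new_cases - new_recoveries
        curve.append(infected)
    return curve


if __name__ == "__main__":
    baseline = Assumptions()
    # The reproducibility test: same assumptions and seed, same output.
    assert run(baseline, seed=42) == run(baseline, seed=42)
    print(max(run(baseline, seed=42)))
```

A check like the assertion at the bottom is cheap to run automatically on every change to the code. In larger simulations, reproducibility can be harder to preserve, for example when random draws are shared between parallel threads, so the sketch shows only the basic principle.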
Of course, most research software is indeed developed without the ceremony and rigor that accompanies professional software development. Research software is not usually intended to be understood by third parties, or to be reused. It is often developed with the sole purpose of generating a set of results and is then abandoned. Ferguson's position is simply that there just wasn't time to do anything more: high-ceremony software engineering would have delayed the report. He subsequently speculated that if the U.K. had locked down just one week earlier than it did, then 20,000 lives might have been saved. If correct, then even a day's further delay would have led to significantly increased loss of life.
While many members of the scientific community support the view that a lockdown was the only realistic way to control the disease, this view is by no means universally held. A highly vocal community believes the lockdown was an overreaction, and the devastating economic and social consequences that ensued were avoidable. This group seized upon criticisms of the Ferguson model, taking them as evidence that the lockdown was based on what they claimed was poor quality science. David Davis, a U.K. Member of Parliament, queried what he called "secret and potentially flawed calculations." In a bizarre twist of fate, software engineering quality had become a political weapon.
Subsequent analysis seems to have indicated that, while the extensive criticism about relaxed software engineering practices is perhaps justified, the model itself was not fundamentally flawed. The code is not pretty, but this does not mean it was wrong.
While we will have to wait for formal inquiries to truly understand the role the Ferguson model played in the U.K. government's decision making, and the true extent to which the various criticisms of it are substantive, we can, I believe, already identify some lessons to learn going forward, which will make us better equipped to handle an inevitable future pandemic:
- Better modeling environments. We need software environments that are better able to directly capture agent-based models of the kind used in the Ferguson model. A suitable programming environment will make assumptions explicit, and allow developers to focus on the model itself, rather than how the model is expressed in a low-level programming language. A suitable software environment will need to scale to national and ideally global models with millions, and possibly billions, of agents.
- National pandemic models. Many criticisms of the Ferguson work could have been avoided had a suitable well-engineered national pandemic model been developed in advance. Developing and openly testing national pandemic models would cost nothing compared to the daily national cost of the pandemic and the shutdown.
- Validation and calibration. We need to understand how to validate and calibrate models like that of the Ferguson team, so we can have confidence both in what they are modeling and how to interpret the simulations they produce.
- Data. The past two decades have seen an increasing trend towards open national data sources, for example relating to travel and health (the Open Data Institute in the U.K. is a good example). The pandemic demonstrates just how valuable this data is for formulating a national response.
- Openness. The apparent reluctance to release the original code of the model fuelled a raft of conspiracy theories, and went against established scientific best practice. Models need to be open, and subject to peer review—and criticism—by other experts.
Guest blogger Michael Wooldridge is a professor of computer science and head of the Department of Computer Science at the University of Oxford, U.K.