Again: The One Sure Way to Advance Software Engineering

Once again, bad software has struck. From 7:30 to late afternoon on November 10, Internet access and email were unavailable to most customers of Swisscom, the main mobile services provider in Switzerland. (If you read German, you can read details here. If you read French, see here, requiring free registration. In Italian, here.) Given how wired our lives have become, such outages can have devastating consequences. As an example, customers of some of the largest banks in Switzerland cannot access their accounts online unless they type in an access code, one-time-pad style, sent to their cell phone when they log in.

That is all the news we will see: Something really bad happened, and it was due to a software bug. A headline for a day or two, then nothing. What we will miss, in this case as with almost all software disasters (most recently, the Great Pre-Christmas Skype Outage of 2010), is the analysis: what went wrong, why it went wrong, and what is being done to ensure it does not happen again. Systematically applying such analysis is the most realistic technique available today for breakthrough improvements in software quality. The IT industry is stubbornly ignoring it. It is our responsibility as software engineering professionals to change that self-defeating and unjustifiable attitude.

I have harped on this theme before [1, 2, 3] and will continue to do so until the attitude changes. Quoting from [1]:

Airplanes today are incomparably safer than 20, 30, 50 years ago: 0.05 deaths per billion kilometers. That’s not by accident.

Rather, it’s by accidents.

What has turned air travel from a game of chance into one of the safest modes of traveling is the relentless study of crashes and other mishaps. In the U.S. the National Transportation Safety Board has investigated more than 110,000 accidents since it began its operations in 1967. Any accident must, by law, be analyzed thoroughly; airplanes themselves carry the famous “black boxes” whose only purpose is to provide evidence in the case of a catastrophe. It is through this systematic and obligatory process of dissecting unsafe flights that the industry has made almost all flights safe.

Now consider software. No week passes without the announcement of some debacle due to “computers”, meaning, in most cases, bad software. The indispensable Risks forum [4] and many pages around the Web collect software errors; several books have been devoted to the topic. A few accidents have been investigated thoroughly; two examples are Nancy Leveson’s milestone study of the Therac-25 patient-killing medical device [5], and Gilles Kahn’s analysis of the Ariane 5 crash, which Jean-Marc Jézéquel and I used as a basis for our 1997 article [6]. Both studies improved our understanding of software engineering. But these are exceptions. Most of what we have elsewhere is made of hearsay, partial information, and plain urban legends, like the endlessly repeated story about the Venus probe that supposedly failed because a period was typed instead of a comma, most likely a canard.

Part of the solution is to use the legal system. For any large-scale software failure in which public money is involved, a law should require the convocation of an expert committee and the publication of a detailed technical analysis. The software engineering community should lobby for the passage of such a law and should not rest until it is enacted.

For private businesses the legal approach may be harder to pursue as some might view it as undue government interference, but it may still be pushed given the obvious public interest in software that works. The scenario would be for the industry to adopt, as a voluntary standard, the principle that every large-scale mishap must automatically lead to an exhaustive and public post-mortem analysis; in Rahm Emanuel’s immortal words, “You never want a serious crisis to go to waste.”

Until that happens, software will remain brittle. Think of the last time you stepped into a plane, and how different you would have felt if aircraft manufacturers had been allowed, disaster after disaster in the past 70 years, to keep the embarrassing details to themselves and continue business as usual.

References

[1] The one sure way to advance software engineering, 21 August 2009, see here (in my personal blog).

[2] Dwelling on the point, 29 November 2009, see here.

[3] Analyzing a software failure, 24 May 2010, see here.

[4] Peter G. Neumann, moderator: The Risks Digest Forum on Risks to the Public in Computers and Related Systems, available online (going back to 1985!).

[5] Nancy Leveson: Medical Devices: The Therac-25, extract from her book Safeware: System Safety and Computers, Addison-Wesley, 1995, available here.

[6] Jean-Marc Jézéquel and Bertrand Meyer: Design by Contract: The Lessons of Ariane, in Computer (IEEE), vol. 30, no. 1, January 1997, pages 129-130, also available here.

Image source: National Transportation Safety Board, reconstruction of crashed TWA 800 aircraft, public domain (see here).