Debugging Incidents in Google's Distributed Systems

By Charisma Chan, Beth Cooper

Communications of the ACM, Vol. 63 No. 10, Pages 40-46

Google has published two books about Site Reliability Engineering (SRE) principles, best practices, and practical applications.1,2 In the heat of the moment when handling a production incident, however, a team's actual response and debugging approaches often differ from ideal best practices.

This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers use in varying combinations to debug effectively. It describes the research approach used to capture data, summarizes the common engineering journeys for production investigations, and shares examples of how experts debug complex distributed systems. Finally, the article generalizes these Google-specific findings into practical strategies that you can apply in your organization.


