Nearly every computer system today runs hot ... too hot. For over a decade, thermal constraints have limited the computational capability of computing systems of all sizes, from mobile phones to datacenters. And, for nearly that long, system designers have cheated those thermal limits, allowing systems to burn more power, and produce more heat, for short periods to deliver bursts of peak performance beyond what can be sustained. This idea, running a computer too hot for a short period of time to get a burst of performance, is called computational sprinting.
We have likely all experienced computational sprinting on our smartphones; it turns out that, if all the cores, accelerators, and peripherals on a modern smartphone are turned on at once, the phone will generate several times more heat than can be dissipated through its case. If you play a demanding 3D video game for more than a few minutes, you might notice the phone get uncomfortably warm. As the phone heats up, eventually, processing speeds have to slow to keep temperature rise in check. When the phone cools, its processor can run full-tilt again.
What might be less widely known is that modern datacenters can play similar tricks; they oversubscribe both power delivery and cooling capability to eke out greater efficiency. Individual servers may sprint by consuming more than their fair share of power to maximize performance when their workload is high. In a datacenter running diverse workloads, different systems will likely sprint at different times, and the average demands of the facility will (probably) remain sustainable. But, a local spike in one server rack might draw too much power from a particular circuit, risking that a circuit breaker trips. Or, all the cores in a particular server might run a sustained compute job at full bore and risk local over-heating. To maximize efficiency, a datacenter should sprint as close to its power and thermal limits as it can ... without going over them.
Current datacenters must either run complex, centralized control systems to allocate power and thermal budgets at fine granularity, or reserve large guard-bands to avoid power or thermal emergencies. But centralized systems are prone to failure and notoriously difficult to scale: the frequent communication they require rapidly becomes a bottleneck. Moreover, workloads benefit to different degrees at different times from computational sprinting; judicious use of scarce power and cooling budgets can lead to better overall performance. The challenges of allocating budgets grow even more daunting in cloud computing environments, where each cloud tenant seeks to maximize its own performance and may have no incentive to cooperate.
Economics has long studied the challenges of allocating scarce resources. Game theory, in particular, studies resource allocation among strategic agents that seek to maximize their individual utility and might even lie about their preferences to do so.
The authors of the following paper, "Distributed Strategies for Computational Sprints," bring this rich theory to the challenge of managing computational sprinting in datacenters. They formulate the problem of managing computational sprinting as a repeated game: agents managing individual workloads are free to choose when to sprint, but must wait for a cool-off period before sprinting again. Moreover, if too many nodes sprint at once, supplemental battery power must be used to avoid tripping circuit breakers; servers connected to that power circuit are not allowed to sprint again until the battery recharges. To "win" in this game, an agent must sprint when the performance benefit is greatest, while weighing the risk that too many concurrent sprinters cause a circuit to trip.
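The rules of this game can be illustrated with a toy simulation. The sketch below is only an illustration under assumed parameters, not the paper's model: the fixed cool-off and recovery lengths, the deterministic trip limit, the uniformly random sprint benefit, and the single shared `threshold` strategy are all hypothetical choices made to keep the example short.

```python
import random


def simulate(n_agents, threshold, epochs, cooloff=3, recovery=10,
             trip_limit=None, seed=0):
    """Toy sprinting game; returns (total sprint utility, breaker trips).

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    if trip_limit is None:
        trip_limit = n_agents // 4  # assumed: a quarter of the rack may sprint
    cooling = [0] * n_agents  # epochs each agent must still cool off
    recovering = 0            # epochs until the shared battery is recharged
    total_utility = 0.0
    trips = 0

    for _ in range(epochs):
        if recovering > 0:
            # Battery is recharging: nobody on this circuit may sprint,
            # but chips keep cooling in the meantime.
            recovering -= 1
            cooling = [max(0, c - 1) for c in cooling]
            continue

        sprinters = 0
        for i in range(n_agents):
            if cooling[i] > 0:
                cooling[i] -= 1
                continue
            gain = rng.random()  # assumed benefit of sprinting this epoch
            if gain > threshold:  # sprint only when the benefit is high enough
                total_utility += gain
                cooling[i] = cooloff
                sprinters += 1

        if sprinters > trip_limit:
            # Too many concurrent sprinters: fall back to battery power and
            # bar the whole circuit from sprinting until it recharges.
            trips += 1
            recovering = recovery

    return total_utility, trips


if __name__ == "__main__":
    for t in (0.0, 0.5, 0.8, 0.95):
        utility, trips = simulate(n_agents=100, threshold=t, epochs=1000)
        print(f"threshold={t:.2f} utility={utility:.1f} trips={trips}")
```

Sweeping `threshold` in this toy setting exposes the tension the agents face: a low threshold means sprinting often and risking frequent recovery periods, while a high threshold avoids trips but forgoes many profitable sprints.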
To optimize the datacenter as a whole, each agent provides a broker with its best estimate of its utility curve: how much benefit it gains from sprinting for various fractions of its execution while taking into account the risks of a circuit breaker trip. The broker then solves for a global equilibrium that maximizes utility, and provides each agent the strategy it should follow to reach that equilibrium. The strength of the underlying economic theory is that agents provably cannot gain an advantage from lying about their utility curve or deviating from their assigned strategy ... so, they are incentivized to cooperate.
The beauty of this approach is that it provides nearly the effectiveness of perfect centralized control while requiring only simple, infrequent interactions with the broker. Because agents cannot gain an advantage by cheating, this kind of coordination mechanism can be used even among mutually distrusting agents, as in the cloud. More generally, the paper teaches us that, when we consider the myriad resource management challenges that arise in computer systems, we ought to look beyond the confines of our own discipline; economics provides a rich toolset from which all of us can learn.
To view the accompanying paper, visit doi.acm.org/10.1145/3299885
The Digital Library is published by the Association for Computing Machinery. Copyright © 2019 ACM, Inc.