Amdahl’s Law for Tail Latency

Queueing theoretic models can guide design trade-offs in systems targeting tail latency, not just average performance.

By Christina Delimitrou and Christos Kozyrakis

Posted Aug 1 2018

Translating the impact of Amdahl’s Law on tail latency provides new insights on what future generations of data-center hardware and software architectures should look like. The emphasis on latency, instead of just throughput, puts increased pressure on system designs that improve both parallelism and single-thread performance.

Key Insights

Optimizing for tail latency makes Amdahl’s Law more consequential than when optimizing for average performance.
Queueing theory can provide accurate first-order insights into how hardware for future interactive services should be designed.
As service responsiveness and predictability become more critical, finding a balance between compute and memory resources likewise becomes more critical.

Computer architecture is at an inflection point. The emergence of warehouse-scale computers has brought large online services to the forefront in the form of Web search, social networks, software-as-a-service, and more. These applications service millions of user queries daily, run distributed over thousands of machines, and are concerned with tail latency (such as the 99th percentile) of user requests in addition to high throughput.⁶ These characteristics represent a significant departure from previous systems, where the performance metric of interest was only throughput, or, at most, average latency. Optimizing for tail latency is already changing the way we build operating systems, cluster managers, and data services.^7,8 This article investigates how the focus on tail latency affects hardware designs, including what types of processor cores to build, how much chip area to invest in caching structures, how much resource interference between services matters, how to schedule different user requests in multicore chips, and how these decisions interact with the desire to minimize energy consumption at the chip or data-center level.²

While the precise answers will come from detailed experiments with both simulated and real systems, there is great value in having an analytical framework that identifies the major trade-offs and challenges in latency-sensitive cloud systems. We aim here to complement the previous analyses on Amdahl’s Law for parallel and multicore systems^1,11 by designing a model that draws from basic queueing theory (see Figure 1 in the sidebar “Analytical Framework”) and can provide first-order insights on how design decisions interact with tail latency. As was the case with the previous analyses based on Amdahl’s Law, our model has significant implications for processor designs for cloud servers.

While analytical models help draw first-order insights, they run the risk of not accurately reflecting the complex operation of a real system. In Figure 2, we show a brief validation study of the queueing model, as discussed in the sidebar, with {1, 4, 8, 16} compute cores against a real instantiation of memcached, a popular in-memory, key-value store, with the same number of cores. We set the mean interarrival time and service time of the queueing model based on the measured times with memcached. In all cases, when providing memcached with exponentially distributed input load, the memcached request latency is close to the one estimated by the queueing model across load levels.

Figure 2. Validation of the queueing model against a real instantiation of an in-memory key-value store (memcached) for {1,4,8,16} cores.

Cost Model

Since hardware resources are not infinite, this analysis requires a cost model that relates resource usage to performance. We use a model similar to the one used by Hill and Marty¹¹ to extend Amdahl’s Law to multicore chips. That is, we assume a given multicore chip is limited to R base core equivalent (BCE) units. This limitation represents area or power-consumption constraints in the chip design. The BCE is an abstract cost unit that captures processor resources and caches but not shared resources (such as interconnection networks and memory controllers). As in Hill and Marty,¹¹ we assume these resources are fairly constant in the system variations we examine. A baseline core that consumes 1BCE unit achieves performance of perf(1) = 1. Chip architects can build more powerful cores by dedicating r ∈ [1, R] resource units to each core to achieve performance perf(r), where perf(r) is the rate parameter μ in our performance model. Larger cores have higher service rate μ, which is inversely related to tail latency, as discussed in the sidebar. If performance increases superlinearly with resources, then larger cores are always better. In practice perf(r) < r, creating trade-offs between opting for few brawny or many wimpy cores. By default, we follow Shekhar Borkar³ and use perf(r) = sqrt(r) but have also investigated how higher roots affect the corresponding insights.
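
The cost model above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `perf` and `homogeneous_config` are hypothetical, and the "capacity" it reports is simply the sum of the per-core service rates (the saturation throughput with no latency constraint), not the QoS-constrained throughput the article plots.

```python
import math

def perf(r):
    """Service rate of a core built from r BCEs, using the default perf(r) = sqrt(r)."""
    return math.sqrt(r)

def homogeneous_config(core_size, budget=100):
    """Describe a homogeneous chip built from cores of `core_size` BCEs.

    Returns (number of cores, unused BCEs, aggregate service rate).
    Only whole cores can be built, so part of the budget may stay unused.
    """
    n_cores = budget // core_size
    unused = budget - n_cores * core_size
    capacity = n_cores * perf(core_size)
    return n_cores, unused, capacity

if __name__ == "__main__":
    for size in (1, 4, 51, 100):
        print(size, homogeneous_config(size))
```

With no latency constraint, aggregate capacity is highest for the smallest cores, because perf(r) grows more slowly than r. The trade-off only appears once a tail-latency target is imposed.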

Brawny Versus Wimpy Cores

We first examine a system where all cores are homogeneous and have identical cost. An important question the designer must answer is: Given a constrained aggregate power or area budget, should architects build a few large cores or many small cores? The answer has been heavily debated in recent years in both academia and industry,^{4,12,14,17,19,22} as it relates to the introduction of new designs (such as the ARM server chips and throughput processors like Xeon Phi).

Assuming the total budget is R = 100BCEs, an architect can build 100 basic cores of 1BCE each, 25 cores of 4BCEs each, one large core of 100BCEs, or in general R/U cores of U units each, as shown in Figure 3. We consider an online service workload with tail latency quality-of-service (QoS) constraints. QoS is defined as a function of the mean service time T_s of the 100BCE machine. For example, a very strict QoS target would require the 99th percentile of request latency to be T_s. This means the time between arrival and completion of 99% of requests must be less than or equal to the machine’s mean service time, allowing no tolerance for queueing or service-time variability. More relaxed QoS targets are defined as multiples of T_s: QoS = αT_s, α ∈ {5, 10, 50, 100}. Figure 4a shows how throughput in queries per second (QPS) changes for different latency QoS targets, under the M/M/k queueing model described in the sidebar. Throughput of 100QPS for QoS = 10T_s means the system achieved 100QPS for which the 99th latency percentile is 10T_s. The x-axis captures the size of selected cores, moving from many small cores on the left side to a single core of 100BCEs on the right side. We examine all core sizes from 1BCE up to 100BCEs in increments of a single resource unit. In configurations with multiple cores, throughput is aggregated across all cores. The discontinuities in the graph are an artifact of the limited resource budget and homogeneous design; for example, for U = 51, an architect can build a single 51BCE core, while 49 resource units remain unused. Throughput for 10T_s for cores greater than 7BCEs overlaps with 100T_s, as does throughput for 5T_s for cores of more than 12BCEs.
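
The kind of sweep described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the QoS target applies to the 99th-percentile response time (waiting plus service) of an M/M/k first-in-first-out queue, uses the standard closed-form tail for that quantity, and finds the highest sustainable arrival rate by bisection. The function names are hypothetical, and the numbers it produces are not expected to match Figure 4a, whose exact latency definition and parameters are not given in the text.

```python
import math

def perf(r):
    """Service rate mu of a core built from r BCEs (perf(r) = sqrt(r))."""
    return math.sqrt(r)

def erlang_c(k, rho):
    """Erlang C: probability that an arriving request has to queue."""
    a = k * rho
    term, total = 1.0, 0.0            # term tracks a^i / i!
    for i in range(k):
        total += term
        term *= a / (i + 1)
    tail = term / (1.0 - rho)         # (a^k / k!) / (1 - rho)
    return tail / (total + tail)

def response_tail(t, k, mu, lam):
    """P(response time > t) for an M/M/k FCFS queue (waiting + service)."""
    c = erlang_c(k, lam / (k * mu))
    theta = k * mu - lam              # rate of the waiting time, given a wait occurs
    if abs(theta - mu) < 1e-9 * mu:   # limit case theta == mu
        wait_and_service = (1.0 + mu * t) * math.exp(-mu * t)
    else:
        wait_and_service = (theta * math.exp(-mu * t)
                            - mu * math.exp(-theta * t)) / (theta - mu)
    return (1.0 - c) * math.exp(-mu * t) + c * wait_and_service

def max_qps(core_size, alpha, budget=100, miss=0.01):
    """Largest arrival rate whose 99th-percentile response time is <= alpha * T_s."""
    k = budget // core_size
    mu = perf(core_size)
    qos = alpha / perf(budget)        # T_s: mean service time of the 100BCE core
    if math.exp(-mu * qos) > miss:    # even an idle core misses the target
        return 0.0
    lo, hi = 0.0, k * mu * (1.0 - 1e-9)
    for _ in range(60):               # bisection on the arrival rate
        mid = 0.5 * (lo + hi)
        if response_tail(qos, k, mu, mid) <= miss:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    for alpha in (10, 100):
        print(alpha, [round(max_qps(u, alpha), 2) for u in (1, 4, 25, 50, 100)])
```

Under these assumptions the qualitative trend is the same as in the article: with a strict target, small cores sustain zero load because service-time variability alone exceeds the target, while with a very relaxed target the many-small-cores design sustains the most load.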

Figure 3. Homogeneous server configurations for a budget of R = 100 resource units: (a) 100 1BCE cores; (b) 25 4BCE cores; and (c) one 100BCE core.

Figure 4. Studies on big versus small cores, core heterogeneity, and caching using the queueing model.

Finding 1. Very strict QoS targets put a lot of pressure on single-thread performance. When QoS = T_s or 5 T_s, cores smaller than 22BCEs or 12BCEs, respectively, achieve zero QPS for which the tail latency satisfies the QoS target. This happens because the cores are too weak to handle variability in service time even in the absence of queueing, and the queueing naturally occurs when cores operate close to saturation. This result means that, for services with extremely low-latency requirements (such as in-memory caching and in-memory distributed storage),²¹ architects must focus on improving single-thread performance even at high cost. At the same time, some core parallelism is needed. A single 100BCE core performs significantly worse than four 25BCE cores. This finding is in agreement with industry concerns about the performance of small cores with warehouse-scale services.¹² The need for high single-thread performance also motivates application- or domain-specific accelerators as a more economical way of improving performance than incremental out-of-order core optimizations.

Finding 2. With more relaxed latency constraints, architects should look for ways to balance optimizations for single-thread performance and request-level parallelism. With more relaxed QoS targets, a larger set of medium-size cores achieves the best performance. For example, 7BCE cores are optimal for QoS = 10T_s. For applications with moderate latency requirements (such as Web search and Web servers), architects should seek to balance improvements in single-thread performance (instruction-level parallelism) and multicore performance (request-level parallelism). Increasing single-thread performance at high cost yields diminishing returns in this case. Nevertheless, a large pool of wimpy 1BCE cores is optimal only when applications have no latency constraints, as with long data mining queries or log-processing requests. With QoS = 100T_s, applications are essentially throughput-limited and perform best with many wimpy cores.

These findings highlight a disparity between optimal system design when optimizing for throughput versus when optimizing for tail latency. For example, in a homogeneous system where throughput is the only performance metric of interest and parallelism is plentiful, the smallest cores achieve the best performance; see the 1BCE cores in Figure 4a. In comparison, when optimizing for throughput under a tail latency constraint, the optimal design point shifts toward larger cores, unless the latency constraint relaxes significantly.

Finding 3. Limited parallelism also calls for more powerful cores. So far we have assumed all user requests are independent and perfectly parallelizable, though this is rarely the case in practice. Requests often depend on each other and on system issues, like connection ordering and locks for writes, that cause serialization. The growing trend of breaking complex services down to smaller components (microservices) will only make the problem of request dependencies more common. This brings up the caveat of Amdahl’s Law: To what extent are the previous findings accurate when parallelism is limited? Figure 4b shows the case of a reasonable QoS (10T_s) with f ∈ {50%, 90%, 99%, 100%}. When, for example, the parallel fraction of the computation f is 90%, 10% of requests are serialized. As a result, while optimal performance was previously achieved with 7BCE cores, the optimal core size now shifts to 25BCEs. Limited parallelism also affects throughput-centric systems,¹¹ with more powerful cores outperforming wimpy cores in applications with serial regions. Using Hill and Marty’s model¹¹ with a 100BCE budget and 10% serialization, an architect would determine that 10BCE cores are optimal for throughput, a less aggressive increase in core size than when optimizing for latency. As parallelism decreases further, more performant cores are needed to drive down tail latency. When 50% of execution is serial, a single 100BCE core is optimal, a dramatic shift from the unlimited-parallelism case; overall throughput is also an order of magnitude lower. Quantifying the degree of parallelism in latency-critical services is essential when deciding how to build the underlying hardware. At the same time, computer scientists should strive to remove serialization across the system stack: at the application level by developing tracing and monitoring systems that detect and minimize cross-service dependencies, at the operating system by minimizing the need for lock serialization, and at the architecture level by investing in methods that increase single-thread performance and intra-request parallelism.⁹
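
The throughput-only comparison point can be checked with a short sketch of Hill and Marty's symmetric multicore speedup, 1 / ((1 − f)/perf(r) + f·r/(perf(r)·n)), using perf(r) = sqrt(r). The function names are hypothetical, and restricting the search to core sizes that evenly divide the budget is an assumption made here so that no resource units are left unused.

```python
import math

def perf(r):
    """Performance of a core built from r BCEs (perf(r) = sqrt(r))."""
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    """Hill and Marty's symmetric speedup for parallel fraction f,
    a budget of n BCEs, and n/r cores of r BCEs each."""
    serial = (1.0 - f) / perf(r)        # serial part runs on one r-BCE core
    parallel = f * r / (perf(r) * n)    # parallel part runs on all n/r cores
    return 1.0 / (serial + parallel)

def best_core_size(f, n=100):
    """Core size (among divisors of n) that maximizes throughput-only speedup."""
    sizes = [r for r in range(1, n + 1) if n % r == 0]
    return max(sizes, key=lambda r: symmetric_speedup(f, n, r))

if __name__ == "__main__":
    for f in (1.0, 0.99, 0.9, 0.5):
        r = best_core_size(f)
        print(f, r, round(symmetric_speedup(f, 100, r), 2))
```

For f = 0.9 this picks 10BCE cores, consistent with the throughput-only figure quoted above and noticeably smaller than the 25BCE cores that the latency-constrained analysis favors.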

These findings remain consistent for perf(r) scaling with the square, cube, and fourth root of r. Beyond that point, optimal design favors smaller cores.

Core Heterogeneity

The previous section explored the trade-offs between powerful, brawny cores and power-efficient, wimpy cores. Neither type of core provides high efficiency across a wide range of QoS targets, raising several obvious questions, including: Should an architect combine multiple core types in the same system, as is already the norm in multicore chips for mobile systems? How should architects determine the size of these cores? And at what ratio should they use them? Determining the right mix of big-versus-little cores, as well as devising schedulers that take advantage of heterogeneous cores, especially in the presence of heterogeneous load, has been a notably active topic of research in computer architecture in recent years.^5,9,15 Figure 4c shows the QPS under various QoS targets for a set of heterogeneous designs. In all cases, the system has two core configurations: small cores with U = 1, benefiting applications with relaxed QoS, and big cores with U = 25, benefiting applications with strict QoS. The system also receives two exponentially distributed input request streams, one with short and the other with long mean-service-time requests, and we design a simple heterogeneity-aware scheduler that routes long requests to big cores and short requests to small cores. Requests are admitted to a single queue, as in Figure 5, and the ratio of long-to-short requests is, for now, 1:1. Figure 4c starts with all big cores at the leftmost point of the x-axis, explores the heterogeneous space, and ends with all small cores at the rightmost point.

Figure 5. Heterogeneous server configuration with 25BCE large cores and 1BCE small cores.

Finding 4. Figure 4c captures a surprising trend. For strict QoS targets, such as T_s, homogeneous systems with all big cores achieve optimal performance. In contrast, for very relaxed QoS targets, such as 100T_s, using all small cores achieves the best performance. However, for QoS targets in the middle (such as 10T_s), heterogeneous systems, coupled with heterogeneity-aware schedulers, outperform their homogeneous counterparts. This result is especially true when the ratio of big to small cores matches the ratio of long-to-short requests. Varying the request ratio affects these findings significantly. The further the ratio of long-to-short requests is from the ratio of big-to-small cores, the more homogeneous systems outperform their heterogeneous counterparts. This result means that, for heterogeneous architectures to make sense, the system must closely track the input load and adjust to its changes, which are a common phenomenon in large-scale online services.¹⁸

Finding 5. We have again assumed unlimited request parallelism. Once serialization between requests is introduced, the optimal operation point shifts. Figure 4d shows QPS under various tail-latency QoS targets for increasing values of f ∈ {50%, 90%, 99%, 100%}. Where previously homogeneity outperformed heterogeneous designs for extreme QoS requirements (very strict and very relaxed), heterogeneity now takes the lead. For example, for a moderate QoS target of 10T_s and f = 0.9, a single big core achieves optimal performance, compared to the 50:50 mix in Figure 4c. In general, the more parallelism is limited, the more the optimal operation point shifts left, with more big and fewer small cores. This is in agreement with Hill and Marty’s observations,¹¹ with the added implication that latency considerations cause a more rapid shift toward larger cores than when throughput is the only performance metric of interest. For example, when f = 0.9 and the system optimizes only for throughput, two 50BCE cores achieve the best performance under Hill and Marty’s model. As before, this result highlights the importance of quantifying the degree of parallelism in interactive applications. It also establishes that, even with limited parallelism, scheduling that takes into account the different capabilities of available hardware is essential for harnessing the potential of hardware heterogeneity.

Caching

Architects constantly deal with the trade-off of using the limited resources for compute or caching. Larger caches help avoid the long latencies of main memory but draw significant static power and reduce the amount of resources available for compute cores; see Figure 6 for two characteristic configurations. Using the same total budget as before (R = 100), we explore how QPS under a tail-latency constraint changes as a fraction C ∈ [0, 90] of resources goes toward building caches, as opposed to cores. We use 10BCE cores, benefitting applications with moderately strict QoS targets; Figure 4e shows this trade-off. On the leftmost point of the x-axis all resources are dedicated to building cores. On the rightmost point, 90% of resources go toward building caches and the remaining 10% toward building cores, one 10BCE core in this case. Increasing caching by 10BCEs results in one fewer core in the system. We assume caches improve service time under a sqrt(C) function, meaning T_s' = T_s/sqrt(C).²³ We validate the selection of the scaling factor against a real installation of memcached where the allocated last-level cache partition is adjusted using Intel’s Cache Allocation Technology. As the number of used cores increases, the allocated cache capacity decreases. Figure 7 shows that the difference between the analytical model and the real system is, in general, marginal. The findings reported in Figure 4e remain consistent for scaling functions up to the seventh root of C, which correspond to progressively lower benefits from caching, causing the optimal point to shift increasingly to the left.
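
The compute-versus-cache split can be sketched as below, under loudly stated assumptions: C is taken in resource units out of R = 100, cores are 10BCEs with service rate sqrt(10), service time shrinks by a factor of sqrt(C), and C = 0 is treated as "no speedup" (the text does not say how the C = 0 point is handled). The name `cache_config` is hypothetical, and the capacity it reports is the saturation throughput, not the tail-latency-constrained QPS shown in Figure 4e.

```python
import math

def perf(r):
    """Service rate of a core built from r BCEs (perf(r) = sqrt(r))."""
    return math.sqrt(r)

def cache_config(cache_units, budget=100, core_size=10):
    """Return (number of cores, per-core service rate, aggregate capacity)
    when `cache_units` of the budget are spent on caches."""
    n_cores = (budget - cache_units) // core_size
    # Assumption: service time T_s / sqrt(C) means the rate scales by sqrt(C);
    # with no cache units at all, the core keeps its baseline rate.
    speedup = math.sqrt(cache_units) if cache_units > 0 else 1.0
    mu = perf(core_size) * speedup
    return n_cores, mu, n_cores * mu

if __name__ == "__main__":
    for c in range(0, 100, 10):
        n, mu, cap = cache_config(c)
        print(c, n, round(mu, 2), round(cap, 1))
```

Even this crude capacity estimate peaks at a moderate cache share and then falls as cores are given up, which is the shape of the trade-off the findings below discuss.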

Figure 6. Server configurations with 10BCE cores when dedicating (a) 10 resource units and (b) 70 resource units toward caching.

Figure 7. Validation of the queueing model against a real instantiation of an in-memory key-value store (memcached) with increasing caching and reduced compute resources.

Finding 6. For services with strict tail-latency requirements that exhibit locality, the benefit from caching is critical to achieving QoS. For strict QoS constraints (such as QoS = T_s), at least C = 20 units are needed to lower the core’s service time in a way that achieves QPS under the tail-latency constraint.^16,20 Moderately increasing caching resources beyond C = 20 units further improves performance, as larger fractions of the working set fit in the last-level cache;¹⁶ that is, more requests enjoy the shorter processing time of caches for the purpose of the queueing model. However, the benefits diminish beyond C = 40, and performance degrades rapidly as compute resources become insufficient.¹⁶ Existing server chips dedicate one-third to one-half of their area budget to caches. Our analysis indicates this trend will continue.

Finding 7. For relaxed QoS targets, caching is less critical. Smaller cores are sufficient for achieving the QoS constraints in this case, and although caching is still beneficial, moderate cache provisioning (such as C = 10 units to 30 units) yields most of its potential performance benefits. Increasing caching units to C = 40 has no effect on performance, and further increase degrades performance. Architects should focus instead on exploiting request parallelism in a way that keeps the large number of smaller cores busy.^12,16

Finding 8. Limited parallelism highlights the importance of increased caching. Figure 4f reports the performance for a moderate QoS target of 10T_s and increasing values of f ∈ {50%, 90%, 99%, 100%}. When 10% of the requests need to be serialized, the optimal point for caching is C = 40 units compared to C = 30 units with unlimited parallelism. Serialized execution requires higher single-thread performance, and larger on-chip caches are one way to achieve such performance.

Discussion

The models we offer here aim to provide first-order insight into how system design decisions affect tail latency and throughput in QoS-constrained services. These models do not capture every aspect of a data-center machine or application.¹³ For example, while we can arbitrarily scale service times using the presented queueing model, system call and RPC overheads in real systems have hard lower limits. Likewise, software, especially in cloud applications, is not static. These frequent changes in cloud environments affect the degree of dependencies across requests, in terms of both the request fanout and the dependencies across components of a service (such as in microservices-based cloud applications). A more sophisticated model that captures such dependencies, potentially through a queueing network, can provide more accurate performance estimations at the cost of greater complexity. Finally, in hardware, architects cannot build cores with arbitrarily higher performance by simply adding more resources. They must also account for such factors as locality, coherence, and memory scheduling absent from our current model.

We see these queueing-theoretic models as a starting point for drawing insights into system design. We hope this analysis motivates researchers to develop more sophisticated models that address the limitations we have identified and, more important, the hardware and software that can achieve the performance requirements we highlighted.

Conclusion

Amdahl’s Law is as pervasive when it comes to tail latency as it has been for traditional systems. Our goal here has been to offer a simple, intuitive, practical model that can lend first-order insights into which optimizations make sense when an application cares about tail performance. Using it, we have shown the overarching trade-offs in large-versus-small-core systems, heterogeneity, and caching. We encourage computer systems researchers to expand this model to express more sophisticated systems and studies.

Acknowledgments

We thank Mark Hill, Partha Ranganathan, Daniel Sanchez, and the anonymous reviewers for their helpful feedback on earlier drafts of this article.

Sidebar: Analytical Framework

Amdahl’s Law describes the speedup of a program when a fraction f of the computation is accelerated by a factor S. Speedup is then defined as

$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{S}}$$

In a multi-core machine, Amdahl’s Law captures the benefit from multiple cores in average performance. While this interpretation is still relevant, it is, by itself, insufficient for describing tail latency requirements. To bridge the gap we build upon ideas from queueing theory, which provides a framework to reason about task-arrival rates, service times, and end-to-end response times. Simple models (such as M/M/1 and M/M/k) are particularly attractive for first-order performance calculations because they can concisely describe performance in closed-form expressions.
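
As a quick numeric illustration of the average-performance view, the standard form of Amdahl's Law, 1 / ((1 − f) + f/S), can be evaluated directly. The function name `amdahl_speedup` is a hypothetical helper used only for this sketch.

```python
def amdahl_speedup(f, s):
    """Speedup when a fraction f of the work is accelerated by a factor s."""
    if not 0.0 <= f <= 1.0 or s <= 0.0:
        raise ValueError("need 0 <= f <= 1 and s > 0")
    return 1.0 / ((1.0 - f) + f / s)

if __name__ == "__main__":
    # Even a very large acceleration of 90% of the work is capped near 10x,
    # because the remaining 10% still runs at the original speed.
    for s in (2, 10, 100, 1e6):
        print(s, round(amdahl_speedup(0.9, s), 3))
```

The formula bounds the mean completion time only. It says nothing about how often a request lands behind a long one, which is why the queueing models that follow are needed for tail latency.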

M/M/1 model. We start with one of the simplest queueing models: the M/M/1 queue, modeling a system in which a single server processes incoming tasks. Tasks arrive under a Poisson process with rate λ. The service times also follow an exponential distribution, with rate parameter μ and mean service time T_s = 1/μ (μ = perf(r) in the main text of the article). A larger μ means a more powerful server and results in lower latency. Tasks are processed in a simple first-in-first-out order. This simple queueing system is stable when μ > λ. In contrast, when μ < λ, queued tasks keep increasing, leading to instability. The load of the system is defined as ρ = λ/μ. Given these definitions, the mean number of tasks in the system is

$$E[N] = \frac{\rho}{1 - \rho}$$

where N is a random variable for the number of tasks. Likewise, the mean of task response time (using random variable R) is

$$E[R] = \frac{1/\mu}{1 - \rho} = \frac{1}{\mu - \lambda}$$

and the p-th percentile of response time is

$$R_p = E[R] \cdot \ln\left(\frac{100}{100 - p}\right)$$

Figure 1a shows the 99th percentile of request latency as a function of the service rate μ. As μ increases, tail latency drops both at low and high load.
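
The M/M/1 quantities can be computed with the standard closed forms: mean tasks ρ/(1 − ρ), mean response time 1/(μ − λ), and, because the response time is exponentially distributed, p-th percentile E[R]·ln(100/(100 − p)). The sketch below is illustrative; the function name `mm1_metrics` and the sample rates are assumptions.

```python
import math

def mm1_metrics(lam, mu, p=99.0):
    """Return (load, mean tasks in system, mean response time,
    p-th percentile response time) for a stable M/M/1 queue."""
    if lam < 0 or lam >= mu:
        raise ValueError("M/M/1 is stable only when 0 <= lambda < mu")
    rho = lam / mu
    mean_tasks = rho / (1.0 - rho)
    mean_response = 1.0 / (mu - lam)
    # Response time is exponential with rate (mu - lambda).
    percentile = mean_response * math.log(100.0 / (100.0 - p))
    return rho, mean_tasks, mean_response, percentile

if __name__ == "__main__":
    # Doubling the service rate at a fixed arrival rate cuts the tail sharply.
    for mu in (10.0, 20.0):
        print(mu, mm1_metrics(lam=5.0, mu=mu))
```

Note that the 99th percentile is about 4.6 times the mean response time at any load, so a target equal to the mean service time leaves no room for queueing at all.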

Figure 1. Building system insights from queueing theory: (a) 99th percentile response time in an M/M/1 model; and (b) 99th percentile queueing time in an M/M/4 model as a function of μ.

M/M/k model. We now extend the M/M/1 model to a more realistic system with k equivalent servers in order to model a multicore machine. Tasks are now added to a single, shared queue, from which servers draw them for processing. As with the M/M/1 model, tasks arrive under a Poisson process with arrival rate λ and each server processes tasks with service rate μ. Closed-form solutions for the mean response time and response-time percentiles exist but are more complicated than in the M/M/1 model. Specifically, system load is ρ = λ/(kμ). The probability that a new task must be enqueued is given by Erlang’s C formula

$$P_Q = \frac{\frac{(k\rho)^k}{k!} \cdot \frac{1}{1 - \rho}}{\sum_{i=0}^{k-1} \frac{(k\rho)^i}{i!} + \frac{(k\rho)^k}{k!} \cdot \frac{1}{1 - \rho}}$$

and the mean number of tasks in the system

$$E[N] = k\rho + \frac{\rho}{1 - \rho} P_Q$$

The average response time is

$$E[R] = \frac{1}{\mu} + \frac{P_Q}{k\mu - \lambda}$$

Finally, the p-th percentile of queueing time is

$$Q_p = \max\left(0, \frac{1}{k\mu - \lambda} \ln\left(\frac{100 \, P_Q}{100 - p}\right)\right)$$

Figure 1b shows how the 99th percentile of queueing time varies with the service rate μ for one and four servers. Higher service rates correspond to less time spent by requests in the queue. We use the M/M/k model for analysis of system trade-offs unless otherwise specified. In the article’s section on validation, we verify that this model closely reflects real system behavior. For applications with non-Poisson arrivals or non-exponential service times, more general queueing models may be needed (such as the G/G/k model).^10,24 For more complex applications (such as multi-tier services), system architects would need a more sophisticated analytical model (such as a queueing network).
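
The M/M/k quantities can be sketched with the standard textbook results: Erlang's C formula for the probability of queueing, Little's law for the mean number of tasks, and an exponential waiting time with rate kμ − λ for tasks that do have to wait. The function names and the symbol `p_q` are hypothetical labels for this illustration.

```python
import math

def erlang_c(k, rho):
    """Probability that an arriving task must queue in an M/M/k system."""
    a = k * rho                       # offered load, lambda / mu
    term, total = 1.0, 0.0            # term tracks a^i / i!
    for i in range(k):
        total += term
        term *= a / (i + 1)
    tail = term / (1.0 - rho)         # (a^k / k!) / (1 - rho)
    return tail / (total + tail)

def mmk_metrics(lam, mu, k, p=99.0):
    """Return (load, P(queue), mean tasks, mean response time,
    p-th percentile of queueing time) for a stable M/M/k queue."""
    if lam < 0 or lam >= k * mu:
        raise ValueError("M/M/k is stable only when 0 <= lambda < k * mu")
    rho = lam / (k * mu)
    p_q = erlang_c(k, rho)
    mean_response = 1.0 / mu + p_q / (k * mu - lam)
    mean_tasks = lam * mean_response  # Little's law
    miss = 1.0 - p / 100.0
    # If fewer than (100 - p)% of tasks ever wait, the percentile is zero.
    queue_pct = 0.0 if p_q <= miss else math.log(p_q / miss) / (k * mu - lam)
    return rho, p_q, mean_tasks, mean_response, queue_pct

if __name__ == "__main__":
    # Same per-server load (rho = 0.5) with one and with four servers.
    print(mmk_metrics(lam=5.0, mu=10.0, k=1))
    print(mmk_metrics(lam=20.0, mu=10.0, k=4))
```

At the same per-server load, four servers sharing one queue see a shorter queueing tail than a single server, which is the pooling effect behind the "some core parallelism is needed" observation in the main text.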


Copyright held by the authors. Publication rights licensed to ACM.
Request permission to publish from permissions@acm.org

DOI

10.1145/3232559

August 2018 Issue

Published: August 1, 2018

Vol. 61 No. 8

Pages: 65-72
