Are You Load Balancing Wrong?

A reader contacted me recently to ask if it is better to use a load balancer to add capacity or to make a service more resilient to failure. The answer is: both are appropriate uses of a load balancer. The problem, however, is that most people who use load balancers are doing it wrong.

In today’s Web-centric, service-centric environments the use of load balancers is widespread. I assert, however, that most of the time they are used incorrectly. To understand the problem, we first need to discuss a little about load balancers in general. Then we can look at the problem and solutions.

A load balancer receives requests and distributes them to two or more machines. These machines are called replicas, as they provide the same service. For the sake of simplicity, assume these are HTTP requests from Web browsers, but load balancers can also be used with HTTPS requests, DNS queries, SMTP (email) connections, and many other protocols. Most modern applications are engineered to work behind a load balancer.

There are two primary ways to use load balancers: to increase capacity and to improve resiliency.

Using a load balancer to increase capacity is very simple. If one replica is not powerful enough to handle the entire incoming workload, a load balancer can be used to distribute the workload among multiple replicas.

Suppose a single replica can handle 100QPS (queries per second). As long as fewer than 100QPS are arriving, it should run fine. If more than 100QPS arrive, then the replica becomes overloaded, rejects requests, or crashes. None of these is a happy situation.

If there are two machines behind a load balancer configured as replicas, then capacity is 200QPS; three replicas would provide 300QPS of capacity, and so on. As more capacity is needed, more replicas can be added. This is horizontal scaling.

Load balancers can also be used to improve resiliency. Resilience means the ability to survive a failure. Individual machines fail, but the system should continue to provide service. All machines eventually fail—that’s physics. Even if a replica had near-perfect uptime, you would still need resiliency mechanisms because of other externalities such as software upgrades or the need to physically move a machine.

A load balancer can be used to achieve resiliency by leaving enough spare capacity that a single replica can fail and the remaining replicas can handle the incoming requests.

Continuing the example, suppose four replicas have been deployed to achieve 400QPS of capacity. If you are currently receiving 300QPS, each replica will receive approximately 75QPS (one-quarter of the workload). What will happen if a single replica fails? The load balancer will quickly see the outage and shift traffic such that each replica receives about 100QPS. That means each replica is running at maximum capacity. That’s cutting it close, but it is acceptable.

What if the system had been receiving 400QPS? Under normal operation, each of the four replicas would receive approximately 100QPS. If a single replica died, however, the remaining replicas would receive approximately 133QPS each. Since each replica can process about 100QPS, this means each one of them is overloaded by a third. The system might slow to a crawl and become unusable. It might crash.

The determining factor in how the load balancer was used is whether the arriving workload was above 300QPS. If 300 or fewer QPS were arriving, this would be a load balancer used for resiliency. If 301 or more QPS were arriving, this would be a load balancer used for increased capacity.
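The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: it assumes the load balancer splits traffic evenly across the surviving replicas, and the function names are made up for this sketch.

```python
def load_per_replica(total_qps: float, replicas: int, failed: int = 0) -> float:
    """QPS each surviving replica receives, assuming an even split."""
    survivors = replicas - failed
    if survivors <= 0:
        raise ValueError("no replicas left to serve traffic")
    return total_qps / survivors


def usage_mode(total_qps: float, replicas: int, replica_capacity_qps: float) -> str:
    """Classify a deployment by whether it can absorb the loss of one replica."""
    # Capacity that remains once a single replica has died.
    threshold_qps = (replicas - 1) * replica_capacity_qps
    return "resiliency" if total_qps <= threshold_qps else "capacity"


# The four-replica example from the text, at 100QPS per replica.
print(load_per_replica(300, 4))            # normal operation
print(load_per_replica(300, 4, failed=1))  # one replica down
print(usage_mode(300, 4, 100), usage_mode(301, 4, 100))
```

With four 100QPS replicas, the threshold works out to 300QPS, which is where the classification flips from resiliency to capacity.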

The difference between using a load balancer to increase capacity and using it to improve resiliency is an operational difference, not a configuration difference. Both use cases configure the hardware and network (or virtual hardware and virtual network) the same way, and configure the load balancer with the same settings.

The term N+1 redundancy refers to a system that is configured such that if a single replica dies, enough capacity is left over in the remaining N replicas for the system to work properly. A system is N+0 if there is no spare capacity. A system can also be designed to be N+2 redundant, which would permit the system to survive two dead replicas, and so on.
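As a rough sketch, the redundancy level can be computed from three numbers: the arriving load, the replica count, and the per-replica capacity. The helper names below are hypothetical, and the sketch assumes all replicas have the same capacity.

```python
import math


def spare_replicas(total_qps: float, replicas: int, replica_capacity_qps: float) -> int:
    """Return M in N+M: how many replicas can die with the load still served.

    A negative result means the system is already short of capacity.
    """
    needed = math.ceil(total_qps / replica_capacity_qps)  # this is N
    return replicas - needed


def redundancy_label(total_qps: float, replicas: int, replica_capacity_qps: float) -> str:
    """Human-readable form of the redundancy level."""
    m = spare_replicas(total_qps, replicas, replica_capacity_qps)
    return f"N+{m}" if m >= 0 else "overloaded"


for qps in (200, 300, 400, 450):
    print(qps, redundancy_label(qps, replicas=4, replica_capacity_qps=100))
```

The same four replicas can be N+2, N+1, N+0, or overloaded; only the arriving workload changes.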

Three Ways to Do It Wrong

Now that we understand two different ways a load balancer can be used, let’s examine how most teams fail.

Level 1: The Team Disagrees

Ask members of the team whether the load balancer is being used to add capacity or improve resiliency. If different people on the team give different answers, you’re load balancing wrong.

If the team disagrees, then different members of the team will be making different engineering decisions. At best, this leads to confusion. At worst, it leads to suffering.

You would be surprised at how many teams are at this level.

Level 2: Capacity Undefined

Another likely mistake is not agreeing on how to measure the capacity of the system. Without this definition, you do not know if this system is N+0 or N+1. In other words, you might have agreement that the load balancing is for capacity or resilience, but you do not know whether or not you are using it that way.

To know for sure, you have to know the actual capacity of each replica. In an ideal world, you would know how many QPS each replica can handle. The math to calculate the N+1 threshold (or high-water mark) would be simple arithmetic. Sadly, the world is not so simple.

You can’t simply look at the source code and know how much time and resources each request will require and determine the capacity of a replica. Even if you did know the theoretical capacity of a replica, you would need to verify it experimentally. We are scientists, not barbarians!

Capacity is best determined by benchmarks. Queries are generated and sent to the system at different rates, with the response times measured. Suppose you consider a 200ms response time to be sufficient. You can start by generating queries at 50 per second and slowly increase the rate until the system is overloaded and responds slower than 200ms. The last QPS rate that resulted in sufficiently fast response times determines the capacity of the replica.
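The ramp-up procedure might look something like the following sketch. The measure_latency_ms callable is a placeholder for whatever load generator you use; it is assumed to drive the replica at the given rate and return the observed response time in milliseconds. The toy_replica function is an invented stand-in so the example runs on its own.

```python
from typing import Callable


def find_capacity(measure_latency_ms: Callable[[int], float],
                  start_qps: int = 50,
                  step_qps: int = 10,
                  limit_ms: float = 200.0,
                  max_qps: int = 10_000) -> int:
    """Ramp the query rate; return the last rate that met the latency limit.

    Returns 0 if even the starting rate is too slow.
    """
    last_good = 0
    qps = start_qps
    while qps <= max_qps:
        if measure_latency_ms(qps) > limit_ms:
            break  # overloaded: the previous rate is the capacity
        last_good = qps
        qps += step_qps
    return last_good


def toy_replica(qps: int) -> float:
    """Invented latency model: flat up to 100QPS, then climbing steeply."""
    return 50.0 if qps <= 100 else 50.0 + 40.0 * (qps - 100)


print(find_capacity(toy_replica))
```

A smaller step gives a more precise answer at the cost of a longer benchmark run.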

How do you quantify response time when measuring thousands or millions of queries? Not all queries run in the same amount of time. You can’t take the average, as a single long-running request could result in a misleading statistic. Averages also obscure bimodal distributions. (For more on this, see chapter 17, Monitoring Architecture and Practice, of The Practice of Cloud System Administration, Volume 2, by T. Limoncelli, S.R. Chalup, and C.J. Hogan; Addison-Wesley, 2015.)

Since a simple average is insufficient, most sites use a percentile. For example, the requirement might be that the 90th percentile response time must be 200ms or better. This is a very easy way to toss out the most extreme outliers. Many sites are starting to use MAD (median absolute deviation), which is explained in a 2015 paper by David Goldberg and Yinan Shan, “The Importance of Features for Statistical Anomaly Detection” (https://www.usenix.org/system/files/conference/hotcloud15/hotcloud15-goldberg.pdf).
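To make the difference concrete, here is a small sketch comparing the mean, a 90th percentile, and the median absolute deviation on a made-up set of ten response times containing one extreme outlier. The percentile uses the nearest-rank method, which is one of several common definitions; the sample data is invented for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def median_absolute_deviation(samples):
    """Median of the absolute distances from the median."""
    med = statistics.median(samples)
    return statistics.median(abs(x - med) for x in samples)


# Invented response times in milliseconds; one request took five seconds.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 126, 5000]

print("mean:", statistics.mean(latencies_ms))
print("p90: ", percentile(latencies_ms, 90))
print("MAD: ", median_absolute_deviation(latencies_ms))
```

The single slow request drags the mean above 600ms, while the 90th percentile stays at 140ms and the MAD at 5.5ms.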

Generating synthetic queries to use in such benchmarks is another challenge. Not all queries take the same amount of time. There are short and long requests. A replica rated at 100QPS might actually handle only 80QPS of long queries but 120QPS of short queries. The benchmark must use a mix that reflects the real world.
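To illustrate why the mix matters, here is a rough model. It rests on two assumptions that are not measurements: that a replica sustains 80QPS when sent only long queries and 120QPS when sent only short ones, and that per-query costs simply add up. Real systems rarely behave this neatly, which is why the text recommends benchmarking with a realistic mix.

```python
def mixed_capacity_qps(mix: dict, solo_capacity_qps: dict) -> float:
    """Estimate capacity for a query mix under an additive-cost assumption.

    mix maps query kind -> fraction of traffic (fractions must sum to 1).
    solo_capacity_qps maps query kind -> QPS sustained on that kind alone.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    # Each query of a given kind consumes 1/capacity of a replica-second.
    cost_per_query = sum(frac / solo_capacity_qps[kind] for kind, frac in mix.items())
    return 1.0 / cost_per_query


solo = {"long": 80, "short": 120}  # assumed figures, for illustration only
print(mixed_capacity_qps({"long": 0.5, "short": 0.5}, solo))
print(mixed_capacity_qps({"long": 0.75, "short": 0.25}, solo))
```

Under this model an even mix yields about 96QPS, and shifting the mix toward long queries lowers the estimate further.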

If all queries are read-only or do not mutate the system, you can simply record an hour’s worth of actual queries and replay them during the benchmark. At a previous employer, we had a dataset of 11 billion search queries used for benchmarking our service. We would send the first 1 billion queries to the system to warm up the cache. We recorded measurements during the remaining queries to gauge performance.

Not all workloads are read-only. If a mixture of read and write queries is required, the benchmark dataset and process are much more complex. It is important that the mixture of read and write queries reflects real-world scenarios.

Sadly, the mix of query types can change over time as a result of the introduction of new features or unanticipated changes in user-access patterns. A system that was capable of 200QPS today may be rated at 50QPS tomorrow when an old feature gains new popularity.

Software performance can change with every release. Each release should be benchmarked to verify that capacity assumptions have not changed.

If this benchmarking is done manually, there’s a good chance it will be done rarely, or only on major releases. If the benchmarking is automated, it can be integrated into your continuous integration (CI) system, where it should fail any release that is significantly slower than the release running in production. Such automation improves engineering productivity twice over: it eliminates the manual task, and you immediately know the exact change that caused a regression. If the benchmarks are run only occasionally, finding a performance regression involves hours or days of searching for the change that caused the problem.
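A CI gate of this kind can be very small. In the sketch below, the 5 percent tolerance is an arbitrary assumption standing in for "significantly slower"; each team has to choose its own figure. The function names and the idea of passing benchmarked capacity figures as plain numbers are also illustrative.

```python
def release_passes(candidate_qps: float, production_qps: float,
                   tolerance: float = 0.05) -> bool:
    """True if the candidate's benchmarked capacity is within
    `tolerance` (as a fraction) of the production release's capacity."""
    return candidate_qps >= production_qps * (1.0 - tolerance)


def gate_exit_code(candidate_qps: float, production_qps: float) -> int:
    """Exit code for a CI step: 0 lets the release proceed, 1 fails the build."""
    return 0 if release_passes(candidate_qps, production_qps) else 1


print(gate_exit_code(candidate_qps=98, production_qps=100))
print(gate_exit_code(candidate_qps=90, production_qps=100))
```

A CI job would pass the result to its exit status so that a regression blocks the release at the change that introduced it.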

Ideally, the benchmarks are validated by also measuring live performance in production. The two statistics should match up. If they don’t, you must true-up the benchmarks.

Another reason why benchmarks are so complicated is caches. Caches have unexpected side effects. For example, intuitively you would expect that a system should get faster as replicas are added. Many hands make light work. Some applications get slower with more replicas, however, because cache utilization goes down. If a replica has a local cache, it is more likely to have a cache hit if the replica is highly utilized.

Level 3: Definition But No Monitoring

Another mistake a team is likely to make is to have all these definitions agreed upon, but no monitoring to detect whether or not you are in compliance.

Suppose the team has determined that the load balancer is for improving both capacity and resilience, they have defined an algorithm for measuring the capacity of a replica, and they have done the benchmarks to ascertain the capacity of each replica.

The next step is to monitor the system to determine whether the system is N+1 or whatever the desired state is.

The system should not only monitor the utilization and alert the operations team when the system is out of compliance, but also alert the team when the system is nearing that state. Ideally, if it takes T minutes to add capacity, the system must send the alert at least T minutes before that capacity is needed.
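One way to sketch the early-warning rule is to project the current load forward and compare the time remaining against the provisioning time T. The example assumes load grows linearly at a known rate, which is a simplification; real traffic is rarely that predictable. All names and figures are illustrative.

```python
import math


def minutes_until_threshold(current_qps: float, growth_qps_per_min: float,
                            threshold_qps: float) -> float:
    """Minutes until load reaches the N+1 threshold, assuming linear growth."""
    if current_qps >= threshold_qps:
        return 0.0  # already out of compliance
    if growth_qps_per_min <= 0:
        return math.inf  # flat or shrinking load never reaches it
    return (threshold_qps - current_qps) / growth_qps_per_min


def should_alert(current_qps: float, growth_qps_per_min: float,
                 threshold_qps: float, provisioning_minutes: float) -> bool:
    """Alert once the time remaining is no longer than the time to add capacity."""
    remaining = minutes_until_threshold(current_qps, growth_qps_per_min, threshold_qps)
    return remaining <= provisioning_minutes


# 270QPS now, growing 2QPS per minute, N+1 threshold at 300QPS.
print(minutes_until_threshold(270, 2, 300))
print(should_alert(270, 2, 300, provisioning_minutes=20))
print(should_alert(270, 2, 300, provisioning_minutes=10))
```

In this example the threshold is 15 minutes away, so a team that needs 20 minutes to add capacity should already have been alerted, while a team that needs only 10 minutes still has slack.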

Cloud-computing systems such as Amazon Web Services (AWS) have systems that can add more capacity on demand. If you run your own hardware, provisioning new capacity may take weeks or months. If adding capacity always requires a visit to the CFO to sign a purchase order, you are not living in the dynamic, fast-paced, high-tech world you think you are.

Summary

Anyone can use a load balancer. Using it properly is much more difficult. Some questions to ask:

Is this load balancer used to increase capacity (N+0) or to improve resiliency (N+1)?
How do you measure the capacity of each replica? How do you create benchmark input? How do you process the benchmark results to arrive at the threshold between good and bad?
Are you monitoring whether you are compliant with your N+M configuration? Are you alerting in a way that provides enough time to add capacity so that you stay compliant?

If the answer to any of these questions is “I don’t know” or “No,” then you’re doing it wrong.