A/B Testing Netflix.com
Netflix is steeped in a culture of A/B testing. All elements of the service, from movie personalization algorithms to video encoding, all the way down to the UI, are potential targets for an A/B test. It is not unusual to find the typical Netflix subscriber allocated into 30 to 50 different A/B tests simultaneously. Running the tests at this scale provides the flexibility to try radically new approaches and multiple evolutionary approaches at the same time. Nowhere is this more apparent than in the UI.
While many of the A/B tests are launched in synchrony across multiple platforms and devices, they can also target specific devices (phones or tablets). The tests allow experimentation with radically different UI experiences from subscriber to subscriber, and the active lifetime of these tests can range from one day to six months or more. The goal is to understand how the fundamental differences in the core philosophy behind each of these designs can enable Netflix to deliver a better user experience.
Facets to Features to Modules
It is useful first to draw a clear line connecting the personalization facets and their impact on the UI. A simple example can help illustrate this relationship. Let's imagine that today we want to A/B test a search box. For this test, we may have a control cell, which is the traditional experience that sends users to a search-results page. To accommodate regional differences in user experiences, we also have a slight variation of that control cell depending on whether the subscriber is located within the U.S. The first test cell provides autocomplete capability and is available to all subscribers allocated to cell 1. Allocation in this scenario means the subscriber was randomly selected to participate in this test. A second test cell provides search results right on the current page by displaying results as the user types. Let's call this instant search; it is available to all subscribers allocated to cell 2. These are three distinct experiences, or "features," with each one being gated by a set of very specific personalization facets. Thus, users are presented with only one of these search experiences when they are allocated to the test and when their facets fulfill the test's requirements (see Table 1). Other parts of the page, such as the header or footer, can be tested in a similar manner without affecting the search-box test.
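The gating of features by facets can be sketched as a small lookup. This is only an illustration under stated assumptions, not Netflix's implementation: the facet names (`testCell`, `country`), the convention that cell 0 is the control cell, and the feature names are all hypothetical.

```javascript
// Hypothetical feature table for the search-box test. Each feature is gated
// by a predicate over the subscriber's personalization facets.
// Assumption: testCell 0 is the control cell, 1 and 2 are the test cells.
const searchBoxFeatures = [
  { name: "instantSearch", isEligible: (f) => f.testCell === 2 },
  { name: "autocomplete", isEligible: (f) => f.testCell === 1 },
  { name: "controlUS", isEligible: (f) => f.testCell === 0 && f.country === "US" },
  { name: "controlNonUS", isEligible: (f) => f.testCell === 0 && f.country !== "US" },
];

// Return the single search experience whose requirements the facets satisfy,
// or null when the subscriber is not allocated to the test at all.
function resolveSearchFeature(facets) {
  const match = searchBoxFeatures.find((feature) => feature.isEligible(facets));
  return match ? match.name : null;
}
```

Because the predicates are mutually exclusive, a subscriber resolves to at most one search experience, and a test on another part of the page would use its own table without touching this one.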
In this case, there is yet another driving force behind modules. They allow seamless feature portability from one page to the next. Division of a Web page into smaller and smaller pieces should be done until it is possible to compose new payloads using existing modules. If functionality must be broken out from a previous module to achieve that, it is a likely indicator that the module in question had too many responsibilities. The smaller the units, the easier they are to maintain, test, and deploy.
Through the years, the Web community devised several methods to handle this complexity, with varying degrees of success. Early solutions simply included all dependencies on the page, regardless of whether or not the module would be used. While simple and consistent, this penalized users across the board, with bandwidth constraints often exacerbating already long load times. Later solutions relied on the browser making multiple asynchronous requests back to the server as it determined missing dependencies. This, too, had its drawbacks, as it penalized deep dependency trees. In this implementation, a payload with a dependency tree N nodes deep could potentially take up to N – 1 serial requests before all dependencies were loaded.
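The N – 1 figure follows directly from the depth of the tree. As a rough sketch, assuming an acyclic dependency map whose module names are made up for illustration, and assuming the root module arrives with the page itself:

```javascript
// Depth of a dependency tree, counting the root module as depth 1.
// `deps` maps a module name to the names of its direct dependencies.
// Assumption: the map is acyclic.
function treeDepth(deps, root) {
  let deepest = 0;
  for (const child of deps[root] || []) {
    deepest = Math.max(deepest, treeDepth(deps, child));
  }
  return 1 + deepest;
}

// Worst case for browser-driven resolution: one serial round trip
// for every level below the root.
function serialRequests(deps, root) {
  return treeDepth(deps, root) - 1;
}

// Hypothetical chain: page -> header -> searchBox -> utils (depth 4).
const deps = { page: ["header"], header: ["searchBox"], searchBox: ["utils"], utils: [] };
const requests = serialRequests(deps, "page"); // 3
```

Siblings at the same level can be fetched in parallel, so only the depth, not the total number of modules, determines the number of serial round trips in this model.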
More recently, asynchronous module definition (AMD) libraries such as RequireJS have allowed users to create modules, then preemptively generate payloads on a per-page basis by statically analyzing the dependency tree. This approach combines the best of both previous solutions by generating specific payloads containing only the things needed by the page and by avoiding unnecessary penalization based on the depth of the dependency tree. More interestingly, users can also opt out entirely from the static-analysis step and fall back on asynchronous retrieval of dependencies, or they can employ a combination of both. In Figure 1, a module called foo has three dependencies. Because depC is fetched asynchronously, N – 1 additional request(s) are made before the page is ready (where N = 2, and N is the depth of the tree). An application's dependency tree can be built using static-analysis tools.
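To make the static-analysis step concrete, the sketch below flattens a dependency map into a single ordered payload. It is a simplified stand-in, not the RequireJS optimizer: the map is written by hand here, and the names `depA` and `depB` are assumed for the two dependencies of foo that the text does not name.

```javascript
// Hypothetical dependency map, in the shape a static-analysis tool might emit.
const dependencyMap = {
  foo: ["depA", "depB", "depC"],
  depA: [],
  depB: [],
  depC: [],
};

// Depth-first walk that lists every module exactly once, after its dependencies,
// so the resulting order is safe to concatenate into one payload.
function flattenDependencies(map, root, seen = new Set(), order = []) {
  if (seen.has(root)) return order;
  seen.add(root);
  for (const dep of map[root] || []) {
    flattenDependencies(map, dep, seen, order);
  }
  order.push(root);
  return order;
}

const payloadOrder = flattenDependencies(dependencyMap, "foo");
// ["depA", "depB", "depC", "foo"]
```

With the whole list known ahead of time, the page can be served one payload and no follow-up requests are needed, regardless of how deep the tree is.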
The problem with AMD and similar solutions is their assumption of a static dependency tree. In situations where the runtime environment is colocated with the source code, it is common to import all possible dependencies but exercise only one code path, depending on the context. Unfortunately, the penalty for doing so in the browser is much more severe, especially at scale.
The problem can be better visualized by recalling the previous search-box A/B test, which has three distinct search experiences. If the page header depends on a search box, how do you load only the correct search box experience for that given user? It is possible to add all of them to the payload, then have the parent module add logic that allows it to determine the correct course of action (see Figure 2). This is unscalable, however, as it bleeds knowledge of A/B test features into the consuming parent module. Loading all possible dependencies also increases the payload size, thereby increasing the time it takes for a page to load.
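The first option might look something like the following. The module objects and the `testCell` facet are hypothetical; the point is only to show where the test knowledge ends up.

```javascript
// All three experiences are bundled into the payload, although only one will run.
const searchBoxControl = { init: () => "control" };
const searchBoxAutocomplete = { init: () => "autocomplete" };
const searchBoxInstant = { init: () => "instant" };

// The header has to know the cells of the search-box test to pick an
// implementation. Every new cell or test means another branch here.
function initHeader(subscriber) {
  if (subscriber.testCell === 2) return searchBoxInstant.init();
  if (subscriber.testCell === 1) return searchBoxAutocomplete.init();
  return searchBoxControl.init();
}
```

The header now changes whenever the search-box test changes, and two of the three modules are downloaded for nothing.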
A second option of fetching dependencies just-in-time is possible but may introduce arbitrary delays in the responsiveness of the UI (see Figure 3). In this option, only the modules that are needed are loaded, at the expense of an additional asynchronous request. If any of the search modules has additional dependencies, there will be yet another request, and so on, before search can be initialized.
Big Numbers Change Everything
This number is eye-catching, though not entirely honest. Of the 600 different modules, most are not independently selectable. Many of those modules depend on other common platform modules that then depend on third-party modules. Furthermore, even the largest of A/B tests usually affects fewer than three million users. This seems like a large population to test on, but in reality it is still a small percentage of the total subscriber base of more than 50 million. This information leads to some early conclusions: first, the allocation of the tests is not large enough to spread evenly over the entirety of the Netflix subscriber base; and second, the number of independently selectable files is extremely low. Both of these will contribute to a significantly reduced number of unique combinations.
Given this huge number, it is tempting to go the route of letting the browser fetch dependencies as the tree is resolved. This solution works for small code repositories, as the additional serial requests may be relatively insignificant. As previously mentioned, however, a typical payload on the website contains 30 to 50 different modules because of the scale of A/B testing. Even if the browser's parallel resource fetching could be leveraged for maximum efficiency, the latency accumulated across a potential 30-plus requests is significant enough to create a suboptimal experience. In Figure 4, even with a significantly simplified example with a depth of only five nodes, the page will make four asynchronous requests before the page is ready. A real production page may easily have a depth of 15 or more.
Just-in-Time Dependency Resolution
Let's add another column to the search-box test definition (see Table 2). This table now represents a complete abstraction of all data needed to build the payload. In practice, the final column mapping exists only in the UI layer, not in the core service that provides the A/B test definition. Often, it is up to the consumers of the test definitions to build this mapping since it is most likely unique for each device or platform. For the purposes of this article, however, it is easier to visualize the data in a single place.
init() method. Modules with complex public APIs tend to be shared common libraries, which are less likely to be A/B tested in this manner.
It is also worth noting the number of differences between each of these A/B experiences can often drive whether or not doing a drop-in replacement is even possible. In some cases where the new experiences are designed to be intentionally and maybe even radically different, it can make sense to have differences in the public API. This almost certainly increases complexity in the consuming parent modules, but that is the accepted cost of running radically different experiences concurrently. Other strategies can help mitigate the complexity, such as returning module stubs (see Figure 7), rather than attempting a true drop-in replacement. In this scenario, the module loader can be configured to return an empty object with a stub flag, indicating it is not a true implementation. This strategy can be useful if the A/B experiences in question share almost nothing in common, and would benefit very little, if at all, from a common public API.
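A stub-returning loader could be sketched as follows. The loader shown here is a hypothetical registry wrapper, not the API of RequireJS or any other real loader, and the `isStub` flag name is an assumption.

```javascript
// Minimal registry-backed loader (hypothetical).
function createLoader(registry) {
  return {
    require(name) {
      if (Object.prototype.hasOwnProperty.call(registry, name)) {
        return registry[name];
      }
      // The module was left out of this subscriber's payload:
      // hand back an empty object carrying only a stub flag.
      return { isStub: true };
    },
  };
}

// This subscriber's payload contains only the instant-search module.
const loader = createLoader({
  instantSearch: { init: () => "instant search ready" },
});

// The parent checks the flag instead of assuming a shared public API.
function initSearch(moduleLoader) {
  const instant = moduleLoader.require("instantSearch");
  const autocomplete = moduleLoader.require("autocomplete");
  if (!instant.isStub) return instant.init();
  if (!autocomplete.isStub) return autocomplete.init();
  return "no search experience loaded";
}
```

The parent still needs to know which experiences may exist, but it no longer needs to know the facets that decide between them; that decision was already made when the payload was built.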
Continuing with the example of the homepage payload, when a request comes in asking for the homepage payload (see Figure 8), we already know all the possible files the subscriber may receive, as a result of static analysis.
As we begin appending files to the payload, we can look up in the search-box test table (Table 2) whether or not this file is backed by an eligibility requirement (that is, whether the subscriber is eligible for that feature). This resolution will return a Boolean value, which is used to determine if the file gets appended (Figure 9).
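The append step can be sketched as a filter over the statically known file list. The table below is a hypothetical stand-in for Table 2: the file paths, the `testCell` facet, and the rule that files absent from the table are always included are assumptions made for illustration.

```javascript
// Hypothetical mapping from file to eligibility check.
// Files that are not in the table are not under test and are always included.
const eligibilityByFile = new Map([
  ["search/autocomplete.js", (s) => s.testCell === 1],
  ["search/instant.js", (s) => s.testCell === 2],
  ["search/control.js", (s) => s.testCell !== 1 && s.testCell !== 2],
]);

// Walk every file the page could possibly need and append only those
// whose eligibility check resolves to true for this subscriber.
function resolvePayload(candidateFiles, subscriber) {
  const payload = [];
  for (const file of candidateFiles) {
    const check = eligibilityByFile.get(file);
    const eligible = check ? check(subscriber) : true;
    if (eligible) payload.push(file);
  }
  return payload;
}

const homepageFiles = [
  "header.js",
  "search/control.js",
  "search/autocomplete.js",
  "search/instant.js",
  "footer.js",
];
const payload = resolvePayload(homepageFiles, { testCell: 2 });
// ["header.js", "search/instant.js", "footer.js"]
```

The header never learns which search box it received; the decision is made once, on the server, while the payload is assembled.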
For performance reasons, it is never desirable to deliver the entire payload via an inline script. Inline scripts cannot be cached independently from the HTML content, so the benefits of browser-side caching are lost immediately. It is much more desirable to deliver it via a script tag that points to a URL representing this payload, which a browser can easily cache. In most cases, this is a CDN (content delivery network)-hosted URL whose origin server points back to the original server that generated this payload. Thus, everything discussed up to this point is merely responsible for generating the uniqueness of the payload.
It is not sufficient, however, simply to cache the unique payload with a randomly generated identifier. If the server has multiple instances running for load balancing, any one of those instances could receive the incoming request for this payload. If the request goes to an instance that has not yet generated (or cached) that unique payload, it cannot resolve the request. To solve this issue, it is critically important that the payload's URL be reverse resolvable; any instance of the server must be able to resolve the files in a unique payload by simply looking at the URL. This can be solved in a few ways, most often by representing a file by referencing the file name directly in the URL or by using a combination of unique hashes, where each chunk of the hash can be resolved to a specific file.
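One way to picture the hash-chunk variant is shown below. It is a sketch under assumptions: the manifest of short chunk IDs, the `/payload/` path, and the `-` separator are all invented for this example, and a real system would generate the IDs at build time and share the manifest across every server instance.

```javascript
// Hypothetical build-time manifest: short, stable chunk id -> file.
const manifest = new Map([
  ["a1f3", "header.js"],
  ["9c2e", "search/instant.js"],
  ["47b0", "footer.js"],
]);
const idByFile = new Map([...manifest].map(([id, file]) => [file, id]));

// Encode the payload's file list into the URL itself.
function payloadUrl(files) {
  const ids = files.map((file) => {
    const id = idByFile.get(file);
    if (id === undefined) throw new Error(`No chunk id for ${file}`);
    return id;
  });
  return `/payload/${ids.join("-")}.js`;
}

// Any instance holding the manifest can recover the file list from the URL alone.
function resolvePayloadUrl(url) {
  const match = /^\/payload\/([0-9a-f-]+)\.js$/.exec(url);
  if (!match) throw new Error(`Unrecognized payload URL: ${url}`);
  return match[1].split("-").map((id) => {
    const file = manifest.get(id);
    if (file === undefined) throw new Error(`Unknown chunk id: ${id}`);
    return file;
  });
}

const url = payloadUrl(["header.js", "search/instant.js", "footer.js"]);
// "/payload/a1f3-9c2e-47b0.js"
```

Because the URL is a pure function of the file list, two subscribers with the same resolved payload share one cacheable URL, and no instance depends on state held by another.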
Though we have optimized for a single payload, there is potential to use parallel browser requests for additional performance gains. We want to avoid unbundling the entire payload, which forces us to take the route of making 30-plus requests, but we could split our single payload into two, with the first containing all common third-party libraries or shared modules, and the second bundle containing page-specific modules. This would allow the browser to cache common modules from page to page, further decreasing the upper limit of time to page ready as the user moves through the site. This strikes a nice balance between the bandwidth and latency constraints that Web browsers must typically deal with.
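The two-bundle split can be sketched as a simple partition of the resolved file list. The set of common modules and the file names are hypothetical.

```javascript
// Hypothetical set of third-party libraries and shared modules used on every page.
const commonModules = new Set(["lib/vendor.js", "lib/platform.js"]);

// Partition one resolved payload into a cacheable common bundle
// and a page-specific bundle, preserving the original order within each.
function splitPayload(files) {
  const common = [];
  const pageSpecific = [];
  for (const file of files) {
    (commonModules.has(file) ? common : pageSpecific).push(file);
  }
  return { common, pageSpecific };
}

const bundles = splitPayload([
  "lib/vendor.js",
  "lib/platform.js",
  "header.js",
  "search/instant.js",
]);
// bundles.common       -> ["lib/vendor.js", "lib/platform.js"]
// bundles.pageSpecific -> ["header.js", "search/instant.js"]
```

The common bundle's URL stays the same from page to page, so the browser fetches it once; only the smaller page-specific bundle changes as the user navigates.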
Reveling in Constraints
The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.