Online human-behavior experimentation is pervasive, manifold, and unavoidable. Leading digital companies routinely conduct over 1,000 A/B tests every month with millions of users. Online labor markets boast hundreds of thousands of workers to hire for crowdsourcing tasks, including experimentation and beta-testing. Outside industry, academic researchers utilize online labor markets to run behavioral experiments that span from cooperation games to protein folding tasks.
Hidden behind a deceptive façade of simplicity, implementing a human-behavior experiment for unbiased statistical inference is a task not to be taken lightly. It requires knowledge of computer programming, statistical inference, experimental design, and even behavioral insights. This unique mix of skills is generally honed through practice, heart-breaking mistakes, and encounters with "code smells." The term, popularized by Martin Fowler's book, Refactoring: Improving the Design of Existing Code, refers to code structures that violate fundamental design principles and increase the risk of unintended software behavior. Code smells that lead to failures of randomization, the process of assigning observation units (users, devices, and so on) to treatments, are a threat to the validity of experiments. For instance, an incorrectly set probability may bar users from entering a particular treatment, or a degraded user experience in one treatment might lead to a higher attrition rate (that is, more dropouts).
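One routine way to spot either kind of failure after the fact is to compare the observed split between treatments with the intended one. The following is a minimal sketch in Python and is not part of PlanOut or PlanAlyzer; the function name `srm_p_value`, the two-treatment setup, and the counts in the usage lines are illustrative assumptions.

```python
import math


def srm_p_value(count_a, count_b, expected_share_a=0.5):
    """P-value of a chi-square goodness-of-fit test (1 degree of freedom).

    Compares the observed number of units in two treatments with the
    split the experimenter intended. A very small p-value suggests that
    randomization or retention did not behave as designed.
    """
    total = count_a + count_b
    if total == 0 or not 0.0 < expected_share_a < 1.0:
        raise ValueError("need at least one unit and a share strictly between 0 and 1")
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Made-up counts for illustration only.
print(srm_p_value(505, 495))    # close to the intended 50/50 split
print(srm_p_value(5000, 4500))  # one arm lost many more units than the other
```

A check like this only signals that something went wrong; it cannot say whether the cause is a mis-set probability, differential attrition, or a logging problem.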
Detecting the source of a smell is not always trivial because experiments interact with multiple components, including external systems. The presence of several points of failure and the lack of a mathematical formalism to validate experiments in the context of their programming language have made human expert review the gold standard for assessing their correctness. But even experts are fallible. Therefore, two complementary practices are common in "smell-hunting": simulations and pilots. Both are useful and both have drawbacks. Simulations involve an array of bots randomly clicking their way through the experiment. They can catch internal bugs, but they cannot detect faulty interactions with external systems or failures in randomization due to idiosyncratic population characteristics or differential attrition rates. These issues are addressed by pilots, scaled-down versions of the experiment with real users. However, pilots require additional time and money and may frustrate participants if the user experience is poor. Moreover, a failed pilot may tarnish an experimenter's reputation in crowdsourcing markets. Finally, in some cases, pilots are not possible at all, for example, in one-shot field experiments.
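To make the simulation idea concrete, here is a small Python sketch of bots being pushed through an assignment routine to reveal a treatment that nobody can enter. Everything in it is hypothetical: the routine `assign_treatment`, the treatment names, and the weights stand in for whatever a real experiment would use, and real bot simulations exercise far more than the assignment step.

```python
import random
from collections import Counter


def assign_treatment(rng, weights):
    """Hypothetical assignment routine: draw one treatment by weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate_bots(weights, n_bots=10_000, seed=0):
    """Send n_bots simulated users through the assignment and count arrivals."""
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    return Counter(assign_treatment(rng, weights) for _ in range(n_bots))


def unreachable_treatments(weights, counts):
    """Treatments that no bot ever entered."""
    return [name for name in weights if counts[name] == 0]


# Illustrative bug: the third treatment was given probability 0 by mistake.
weights = {"control": 0.5, "treatment_a": 0.5, "treatment_b": 0.0}
counts = simulate_bots(weights)
print(unreachable_treatments(weights, counts))
```

As the paragraph above notes, a simulation of this kind exercises only the experiment's internal logic; it says nothing about how real users or external systems will behave.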
For all these reasons, I welcome the PlanAlyzer software as detailed in the following paper by Tosch et al. PlanAlyzer is a linter for PlanOut, a framework for online experiments popular in corporate settings, in particular at Facebook, where it was originally developed. In addition to flagging code smells, PlanAlyzer also reports in a human-readable fashion which hypotheses a PlanOut script can and cannot test statistically. How does PlanAlyzer achieve this goal? It takes a PlanOut script as input and translates it into an internal representation that assigns special labels to variables in the code. It then builds a data dependence graph, from which it establishes reliable causal paths between those specially labeled variables (that is, the contrasts). To account for missing information and interactions with external systems, manually annotated labels may be integrated.
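The role of the data dependence graph can be illustrated with a toy example. The Python sketch below is a deliberate simplification and does not reproduce PlanAlyzer's internal representation or algorithm: the dictionary `deps`, the label set `random_vars`, the variable names, and the function `random_ancestors` are all assumptions made for illustration. Each variable maps to the variables it reads, and a simple graph traversal finds which randomly assigned variables can influence a given output.

```python
def random_ancestors(deps, random_vars, target):
    """Randomly assigned variables that `target` depends on, directly or
    transitively (including `target` itself if it is randomly assigned)."""
    seen = set()
    stack = [target]
    while stack:
        var = stack.pop()
        if var in seen:
            continue
        seen.add(var)
        stack.extend(deps.get(var, ()))
    return seen & random_vars


# Toy dependence graph: variable -> variables read when it is assigned.
deps = {
    "userid": set(),
    "country": set(),                           # supplied by an external system
    "button_color": {"userid"},                 # randomly assigned per user
    "button_text": {"button_color", "country"},
    "page_title": {"country"},
}
random_vars = {"button_color"}  # hand-written label, as an analyst might annotate

print(random_ancestors(deps, random_vars, "button_text"))  # influenced by a random variable
print(random_ancestors(deps, random_vars, "page_title"))   # no random variable upstream
```

In this toy graph, only outputs with a randomly assigned variable upstream can support a comparison between treatments; `page_title` varies with `country` alone and therefore carries no experimental contrast.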
PlanAlyzer was validated against a corpus of actual PlanOut scripts created and deployed at Facebook. The results are very encouraging: PlanAlyzer replicated 82% of all contrasts manually annotated by domain experts and achieved a precision and recall of 92% each in detecting code smells in a synthetically mutated dataset. Moreover, the authors unveiled a collection of common bad coding practices, including ambiguous type comparisons, modulus operators applied to fractions, and the use of PlanOut scripts for application configuration only. Future work in this area might focus on automatically correcting errors in the code, generating statistical code to analyze the output of an experiment (another potential source of smell), or introducing reasoning about hypotheses (for example, whether non-proportional sampling of observation units is valid).
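For readers less familiar with the two detection metrics quoted above, the following Python sketch shows how precision and recall are computed from a set of flagged items and a set of items known to be faulty. The script identifiers are invented for illustration and have nothing to do with the corpus used in the paper.

```python
def precision_recall(flagged, actual):
    """Precision and recall of a detector.

    flagged: items the tool reported as smelly
    actual:  items that truly contain a smell
    """
    if not flagged or not actual:
        raise ValueError("both sets must be non-empty")
    true_positives = len(flagged & actual)
    precision = true_positives / len(flagged)  # share of reports that are correct
    recall = true_positives / len(actual)      # share of real smells that are found
    return precision, recall


# Invented example: four scripts flagged, four truly smelly, three in common.
flagged = {"script_1", "script_2", "script_3", "script_4"}
actual = {"script_2", "script_3", "script_4", "script_5"}
print(precision_recall(flagged, actual))
```

High precision means few false alarms, while high recall means few missed smells; a linter needs both to be trusted in day-to-day use.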
PlanAlyzer is the first tool to statically check the validity of online experiments. It is cheaper, faster, and possibly safer than deploying bots or running a pilot. In sum, it is a major milestone. Together with recent advances in AI-driven methods for choosing optimal values of experimental parameters, adaptively ordering survey questions, and imputing missing responses, it shows how computer-assisted methods for the design, validation, and analysis of experiments are gaining a foothold. As this trend continues, we should expect two things: consolidation in the extremely fragmented landscape of tools for online experimentation, and the establishment of a validated set of coding standards. Both outcomes will boost the replicability of experimental results, paving the way for faster progress in the study of online human behavior in industry and academia.
To view the accompanying paper, visit doi.acm.org/10.1145/3474385
The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.