Machine learning (ML) systems, especially deep neural networks, can find subtle patterns in large datasets that give them powerful capabilities in image classification, speech recognition, natural-language processing, and other tasks. Despite this power, or rather because of it, these systems can be led astray by hidden regularities in the datasets used to train them.
Issues occur when the training data contains systematic flaws due to the origin of the data or the biases of those preparing it. Another hazard is "over-fitting," in which a model predicts the limited training data well, but errs when presented with new data, either similar test data or the less-controlled examples encountered in the real world. This discrepancy resembles the well-known statistical issue in which clinical trial data has high "internal validity" on carefully selected subjects, but may have lower "external validity" for real patients.
Because any good ML system will find the same regularities, redesigning it may not solve the problem. Therefore, researchers and companies are looking for ways to analyze and improve the underlying data, including supplying additional "synthetic" data for training.
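The over-fitting hazard described above can be illustrated with a deliberately contrived sketch. The data, the two "noisy" labels, and both toy models are assumptions invented for illustration. A model that memorizes the training set scores perfectly on it but stumbles on fresh examples, while a simpler rule that ignores the noise does the opposite.

```python
# Contrived 1-D data: the true rule is "label 1 when x > 5".
# Two training labels (x=3 and x=8) are deliberately wrong to mimic noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
# Fresh examples with clean labels, chosen near the noisy points.
test = [(1.5, 0), (2.6, 0), (3.4, 0), (7.6, 1), (8.4, 1), (9.5, 1)]


def nearest_neighbor(x):
    """Memorizing model: copy the label of the closest training point."""
    _, label = min(train, key=lambda point: abs(point[0] - x))
    return label


def threshold(x):
    """Simple model: a single cut at x = 5 (assumed known here)."""
    return int(x > 5)


def accuracy(model, data):
    """Fraction of examples in `data` that `model` labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memorizer", nearest_neighbor), ("threshold", threshold)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizer reproduces the mislabeled points exactly, so its training score overstates how well it will do on new data. Changing the model does not remove the two bad labels, which is why attention shifts to the data itself.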
Errors and Biases
Distortions embedded in artificial intelligence systems have profound consequences for people applying for a loan or seeking medical treatment. To promote greater accuracy, as well as confidence, a growing community is demanding greater fairness, accountability, and transparency (FAT) in artificial intelligence, and holds a regular conference, the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT).
Many experts, however, "tend to focus innovation on models, and they forget that in some ways models are just a mirror of the data," said Aleksander Madry, a professor of computer science at the Massachusetts Institute of Technology (MIT) and director of the MIT Center for Deployable Machine Learning. "You really need to intervene on data to make sure that your model has a chance of learning the right concepts."
"Seemingly innocent things can influence how bad in bias the models are," Madry said. ImageNet, for example, a huge set of labeled images that is widely used for training, was drawn from the photo-sharing site Flickr, and its examples strongly suggest the natural habitat for a crab is on a dinner plate. More seriously, medical images showing tuberculosis frequently come from less-developed countries whose older imagers have digital signatures that systems learn to associate with the disease.
"We are just scratching the surface" in assessing how prevalent these unintentional errors are, Madry said, but there are "definitely more than we [have a right to] expect." His group also has explored how arcane details of labeling protocols can lead to surprising classifications, which machine learning tools then must reverse-engineer, along with their other goals.
These labeling problems can also reflect—and seemingly validate—the social biases of the human annotators. For example, a woman in a lab coat may be more frequently labeled as a "nurse" than as a "doctor" or "chemist." Many training sets rely on online workers from Amazon's Mechanical Turk program to label the data, which poses its own reliability and bias problems. "How do you annotate this data in a way that you are not leaking some biases just by the choice of the labels?" Madry asked. "All of this is very much an open question at this point and something that needs to be urgently tackled."
AI systems can learn sexism and racism simply by "soaking up data from the world," said Margaret Mitchell, who was a leader of Google's Ethical AI research group until her acrimonious departure in February 2021. In addition to being unfair, systems embodying these biases can fail at their primary goal, instead wasting resources by inaccurately ranking candidates for a loan or a job.
In addition, the most available and widely used datasets may include systematic or random errors. "Dataset creation has been very chaotic and not really well formed," Mitchell said, so it "incorporates all kinds of all things that we don't want, garbage and bias, and there isn't a way to trace back problematic sources."
Mitchell and her former Google colleagues advocate more systematic documentation at each stage of dataset assembly. This effort parallels growing mandates in other scientific fields that authors deposit their code and data in a public repository. This "open science" model can improve accuracy by letting others check for reproducibility, but it is a hard sell for companies that view their data as a competitive advantage.
Computer science assistant professor Olga Russakovsky and her group at Princeton University have built a tool to help reveal biases in existing large-scale image datasets. For example, the tool can analyze the distribution of training pictures with various attributes, including "protected" attributes like gender that users may want to avoid using in models.
Designers can use this information to curate the data or otherwise compensate for biases. Although these issues are particularly important for assessing fairness, human choices have always affected performance, Russakovsky stressed. "There's very much a person component in any part of building an AI system."
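A minimal sketch of this kind of distribution check follows. It is not the Princeton tool or its API. The annotation records, the function names, and the 0.7 cutoff are all hypothetical, chosen only to show how a skew between a label and a protected attribute might be surfaced.

```python
from collections import Counter

# Hypothetical annotations: (label, perceived gender) pairs for eight images.
annotations = [
    ("doctor", "man"), ("doctor", "man"), ("doctor", "man"), ("doctor", "woman"),
    ("nurse", "woman"), ("nurse", "woman"), ("nurse", "woman"), ("nurse", "man"),
]


def attribute_share(records):
    """For each (label, attribute) pair, its fraction of that label's examples."""
    pair_counts = Counter(records)
    label_counts = Counter(label for label, _ in records)
    return {
        (label, attr): count / label_counts[label]
        for (label, attr), count in pair_counts.items()
    }


def skewed_labels(records, cutoff=0.7):
    """Labels where one attribute value accounts for more than `cutoff` of examples."""
    shares = attribute_share(records)
    return sorted({label for (label, _), share in shares.items() if share > cutoff})


print(attribute_share(annotations))
print(skewed_labels(annotations))
```

A report like this does not say what to do about a skew. It only tells the designer where curation or compensation may be needed.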
One approach to biased data is to include repeated copies of underrepresented examples, but Russakovsky said this type of "oversampling" is not very effective. A better approach, she said, is "not just sampling from the same distribution as your training data comes from, but manipulating that distribution." One way to do this is by augmenting the training data with synthetic data to compensate for underrepresented attributes.
As an uncontroversial example, Russakovsky describes training systems to recognize people with sunglasses or hats in images in which the two features usually appear together. To help the system distinguish the features, designers can add synthetic images of faces with only sunglasses or only hats. Similarly, researchers can use a three-dimensional model to generate training images viewed from various angles.
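The sunglasses-and-hats case can be sketched as a simple counting exercise. The image metadata below is invented, and balancing every combination up to the most common one is an assumption made for illustration, not a rule taken from Russakovsky's work.

```python
from collections import Counter
from itertools import product

# Hypothetical metadata, one (has_sunglasses, has_hat) pair per image.
# The two features mostly appear together or not at all.
images = (
    [(True, True)] * 40
    + [(False, False)] * 50
    + [(True, False)] * 6
    + [(False, True)] * 4
)


def synthetic_needed(records, target=None):
    """Synthetic images to add per feature combination to reach `target`.

    By default the target is the size of the most common combination.
    """
    counts = Counter(records)
    cells = list(product([False, True], repeat=2))
    if target is None:
        target = max(counts[cell] for cell in cells)
    return {cell: max(0, target - counts[cell]) for cell in cells}


for (sunglasses, hat), extra in synthetic_needed(images).items():
    print(f"sunglasses={sunglasses}, hat={hat}: add {extra}")
```

Unlike oversampling, which would repeat the same six or four real images many times, the added examples here are meant to be new synthetic images of faces with only one of the two features.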
Mitchell agreed that synthetic data can be "somewhat useful" to augment data, for example with "long-tailed" datasets that have few examples of extreme attributes. The technique is easily implemented in text processing by swapping in synonyms, she said, but "On the image side, synthetic data are not quite there yet."
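Synonym swapping for text might look like the sketch below. The synonym table, the function name, and the swap probability are hypothetical. A real system would need a vetted lexical resource and some check that each swap preserves meaning in context.

```python
import random

# Hypothetical, hand-written synonym table used only for this illustration.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "doctor": ["physician"],
    "big": ["large"],
}


def augment(sentence, rng, swap_prob=0.5):
    """Return a copy of `sentence` with some words replaced by a listed synonym.

    Words are matched case-insensitively; replacements are lowercase,
    so original capitalization is not preserved.
    """
    words = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < swap_prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


rng = random.Random(0)  # seeded so runs are repeatable
original = "the quick doctor saw a big crowd"
for _ in range(3):
    print(augment(original, rng))
```

Each call yields a sentence with the same length and label as the original, which is what makes the technique cheap for text. There is no equally simple substitution step for images.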
For assembling large datasets, however, Mitchell noted that "It doesn't make sense to have that be synthetic data because it's going to be too biased, too templatic, or not have sort of the real-world variation that you want to have." Similarly, Madry worries that "Using synthetic data as a cure for biases is a chicken-and-egg problem. The whole premise of machine learning is to infer a model of the world from data," he notes. "If you know your model of the world, why do you do machine learning to begin with?"
Synthetic data also plays a central role in one of the hottest areas of machine learning: generative adversarial networks (GANs). These systems pit neural networks against one another, one generating data and the other responding to it. "GANs escape, a little bit, the duality" associated with synthetic data, since the generating network eventually uses principles that were not built into it, Madry said. Indeed, Madry's MIT colleague Antonio Torralba has explored using GANs to improve both the fairness and interpretability of AI systems.
In spite of such efforts to curate it, "At the end of the day, your data is going to have biases," Russakovsky said, for which algorithms may need to compensate. "The tension there is that machine learning models are really good at learning from data. As soon as you start adding additional constraints, you're overriding what the model wants to do."
"This is an inherent problem with neural networks," Madry agreed. "If you give them a specific task: maximize my accuracy on this particular data set, they figure out the features that do that. The problem is we don't understand what features they use." Moreover, what works in one setting may fail in another, Madry said. "The kind of signals and features that ImageNet makes you develop"—for example, to distinguish friends on social networks—"will be of limited usefulness in the context of medical AI."
In the long run, developers need to consider how to present data to help systems organize information in the best way, said Patrick Shafto, a professor of mathematics and computer science at Rutgers University-Newark. "You don't sample information randomly. You pick it to try to help them understand."
In his work, Shafto draws on notions of cooperation that are well-known in the study of language and education. For example, the "teacher" might first select data that establishes a general principle, with more subtle examples later on. Other established pedagogical techniques, such as posing questions, might also encourage better generalization by AI systems as they do for human students. "We don't want our learning to be capped at what the person teaching us has learned," he said. "In the ideal world, it would go beyond that."
Current machine learning, tuned to minimize errors on training data, is reminiscent of the much-maligned "teaching to the test," which is "not a good objective," Shafto said. "We need new objectives to conceptualize what machine learning can and should be, going forward."
ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), https://facctconference.org/index.html
Gradient Science, a blog from Aleksander Madry's lab, https://gradientscience.org/
Hutchinson, B., Smart, A., et al., Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure, FAccT '21, 560 (2021), https://dl.acm.org/doi/abs/10.1145/3442188.3445918
Wang, A., Narayanan, A., and Russakovsky, O., REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets (2020), https://arxiv.org/abs/2004.07999
©2021 ACM 0001-0782/21/12
The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.