The recent emergence of artificial intelligence (AI)-powered media manipulations has widespread societal implications for journalism and democracy,7 national security,1 and art.8,14 AI models have the potential to scale misinformation to unprecedented levels by creating various forms of synthetic media.21 For example, AI systems can synthesize realistic video portraits of an individual with full control of facial expressions, including eye and lip movement;11,18,34,35,36 clone a speaker's voice with a few training samples and generate new natural-sounding audio of something the speaker never said;2 synthesize visually indicated sound effects;28 generate high-quality, relevant text based on an initial prompt;31 produce photorealistic images of a variety of objects from text inputs;5,17,27 and generate photorealistic videos of people expressing emotions from only a single image.3,40 The technologies for producing machine-generated, fake media online may outpace the ability to manually detect and respond to such media.
We developed a neural network architecture that combines instance segmentation with image inpainting to automatically remove people and other objects from images.13,39 Figure 1 presents four examples of participant-submitted images and their transformations. The AI, which we call a "target object removal architecture," detects an object, removes it, and replaces its pixels with pixels that approximate what the background should look like without the object. This architecture operationalizes one of the oldest forms of media manipulation, known in Latin as damnatio memoriae, which means erasing someone from official accounts.
The earliest known instances of damnatio memoriae were discovered in ancient Egyptian artifacts, and similar patterns of removal have appeared since.10,37 Historically, visual and audio manipulations required both skilled experts and a significant investment of time and resources. Our architecture can produce photorealistic manipulations nearly instantaneously, which magnifies the potential scale of misinformation. This new capacity for scalable manipulation raises the question of how prepared people are to detect manipulated media.
To publicly expose the realism of AI-media manipulations, we hosted a website called Deep Angel, where anyone in the world could examine our neural-network architecture and its resulting manipulations. Between August 2018 and May 2019, 110,000 people visited the website. We integrated a randomized experiment based on a two-alternative, forced-choice design within the Deep Angel website to examine how repeated exposure to machine-manipulated images affects an individual's ability to accurately identify manipulated imagery.
Two-Alternative, Forced-Choice Randomized Experiment
On the Deep Angel website's "Detect Fakes" page, participants are presented with two images consistent with standard two-alternative, forced-choice methodology and are asked a single question: "Which image has something removed by Deep Angel?" The pair of images contains one image manipulated by AI and one unaltered image. After the participant selects an image, the website reveals the manipulation and asks the participant to try again. The MIT Committee on the Use of Humans as Experimental Subjects (COUHES) approved IRB 1807431100 for this study on July 26, 2018.
The manipulated images are drawn from a population of 440 images submitted by participants to be shared publicly. The population of unaltered images contains 5,008 images from the MS-COCO dataset.23 Images are randomly selected with replacement from each population of images. By randomizing the order of images that participants see, this experiment can causally evaluate the effect of image order on participants' ability to recognize fake media. We test the causal effects with the following linear probability models:
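Written out in a form consistent with the variable definitions that follow, the two models are roughly as below. The coefficient names $\alpha$, $\beta$, and $\beta_n$ and the upper index $N$ on image position are assumed here for illustration, and in (2) $T_{i,n}$ is read as an indicator that image $i$ appeared in position $n$:

```latex
\begin{align}
y_{i,j} &= \alpha X_{i,j} + \beta \log(T_{i,n}) + \mu_i + \nu_j + \epsilon_{i,j} \tag{1} \\
y_{i,j} &= \alpha X_{i,j} + \sum_{n=1}^{N} \beta_n T_{i,n} + \mu_i + \nu_j + \epsilon_{i,j} \tag{2}
\end{align}
```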
where $y_{i,j}$ is the binary accuracy (correct or incorrect guess) of participant $j$ on manipulated image $i$. $X$ is a matrix of covariates indexed by $i$ and $j$, $T_{i,n}$ represents the order $n$ in which manipulated image $i$ appears to participant $j$, $\mu_i$ represents the manipulated image-fixed effects, $\nu_j$ represents the participant-fixed effects, and $\epsilon_{i,j}$ represents the error term. The first model fits a logarithmic transformation of $T_{i,n}$ to $y_{i,j}$. The second model estimates treatment effects separately for each image position. Both models use Huber-White (robust) standard errors, and errors are clustered at the image level.
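To make the first specification concrete, here is a minimal sketch on synthetic data, assuming only NumPy. The function name, the simulated accuracy values, the omission of the covariates X, and the absence of finite-sample corrections are all choices made for illustration; this is not the authors' replication code.

```python
import numpy as np

def fit_lpm_log_position(y, position, image_id, participant_id):
    """OLS of binary accuracy on log(position) with image and participant dummies.

    Returns (beta, se): the coefficient on log(position) and a cluster-robust
    standard error clustered on image (no finite-sample correction applied).
    """
    y = np.asarray(y, dtype=float)
    log_t = np.log(np.asarray(position, dtype=float))
    img_codes = np.unique(image_id, return_inverse=True)[1]
    par_codes = np.unique(participant_id, return_inverse=True)[1]

    # One-hot fixed effects. The full set of image dummies acts as the intercept,
    # so the first participant dummy is dropped to avoid perfect collinearity.
    img_d = np.eye(img_codes.max() + 1)[img_codes]
    par_d = np.eye(par_codes.max() + 1)[par_codes][:, 1:]
    X = np.column_stack([log_t, img_d, par_d])

    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef

    # Sandwich estimator: (X'X)^-1 [sum over clusters of s_g s_g'] (X'X)^-1
    bread = np.linalg.pinv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in range(img_d.shape[1]):
        rows = img_codes == g
        score = X[rows].T @ resid[rows]
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return float(coef[0]), float(np.sqrt(cov[0, 0]))

# Synthetic data: every simulated participant sees 10 randomly drawn images.
rng = np.random.default_rng(0)
n_participants, n_images, n_trials = 300, 40, 10
participant_id = np.repeat(np.arange(n_participants), n_trials)
position = np.tile(np.arange(1, n_trials + 1), n_participants)
image_id = rng.integers(0, n_images, size=participant_id.size)

true_beta = 0.03  # assumed effect size used only to generate the toy data
image_effect = rng.normal(0.0, 0.02, n_images)[image_id]
p_correct = 0.78 + true_beta * np.log(position) + image_effect
y = rng.random(p_correct.size) < p_correct

beta_hat, se = fit_lpm_log_position(y, position, image_id, participant_id)
print(f"beta_hat={beta_hat:.4f}, clustered se={se:.4f}")
```

Explicit dummy columns keep the sketch short but scale poorly; on data the size of the study, a within-transformation or a dedicated fixed-effects routine would be the more practical choice.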
Participation and average accuracy.
From August 2018 to May 2019, 242,216 guesses were submitted from 16,542 unique IP addresses with a mean identification accuracy of 86%. The website did not require participant sign-in, so we study participant behavior under the assumption that each IP address represents a unique individual. The majority of participants participated in the two-alternative, forced-choice experiment multiple times, and 7,576 participants submitted at least 10 guesses.
Each image appears as the first image an average of 35 times and the tenth image an average of 15 times. The majority of manipulated images were identified correctly more than 90% of the time. In the sample of participants who saw at least 10 images, the mean rate of correct classification is 78% on the first image seen and 88% on the tenth image seen. Figure 2a shows the distribution of identification accuracy across images, while Figure 2b shows the distribution of how many images each participant saw. The interquartile range of the number of guesses per participant is from three to 18 with a median of eight.
Figure 3a plots participant accuracy on the y-axis and image order on the x-axis, revealing a logarithmic relationship between accuracy and exposure to manipulated images. In this plot showing scores for all participants, accuracy increases rapidly over the first 10 images and plateaus around 88%.
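The accuracy-by-position curve described here can be computed from a raw guess log along the following lines. This is a sketch under stated assumptions: the log format (a time-ordered sequence of participant key and correctness), the function name, and the toy IP addresses are hypothetical, and each key is treated as one participant, as the study does with IP addresses.

```python
from collections import defaultdict

def accuracy_by_position(guesses, min_guesses=10, max_position=10):
    """Mean accuracy at each image position among sufficiently active participants.

    guesses: iterable of (participant_key, correct) pairs in chronological order.
    Returns a list whose entry k is the mean accuracy on the (k+1)-th image seen.
    """
    per_participant = defaultdict(list)
    for key, correct in guesses:
        per_participant[key].append(bool(correct))

    hits = [0] * max_position
    counts = [0] * max_position
    for outcomes in per_participant.values():
        if len(outcomes) < min_guesses:
            continue  # keep only participants with enough guesses
        for pos, correct in enumerate(outcomes[:max_position]):
            hits[pos] += correct
            counts[pos] += 1
    return [h / c if c else float("nan") for h, c in zip(hits, counts)]

# Toy log: two participants with three guesses each, one with a single guess.
log = [
    ("10.0.0.1", False), ("10.0.0.2", True), ("10.0.0.1", True),
    ("10.0.0.2", True), ("10.0.0.3", False), ("10.0.0.1", True),
    ("10.0.0.2", False),
]
curve = accuracy_by_position(log, min_guesses=3, max_position=3)
print(curve)  # [0.5, 1.0, 0.5]
```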
Learning rate. With 242,216 observations, we run an ordinary least-squares regression with participant- and image-fixed effects on the likelihood of correctly guessing the manipulated image. The results of these regressions are presented in Tables 1 and 2 in the online appendix (https://dl.acm.org/doi/10.1145/3445972). Each column in Tables 1 and 2 adds an incremental filter to offer a series of robustness checks. The first column shows all observations. The second column drops all participants who submitted fewer than 10 guesses and removes all control images where nothing was removed. The third column drops all observations where a participant has already seen an image. The fourth column drops all images qualitatively judged as below very high quality.
Across all four robustness checks with and without fixed effects, our models show a positive and statistically significant relationship between $T_{i,n}$ and $y_{i,j}$. In the linear-log model, a one-unit increase in $\log(T_{i,n})$ is associated with a 3% increase in $y_{i,j}$. This effect is significant at the p<.01 level. In the model that estimates Equation 2, we find a 1% average marginal treatment effect size of image position on $y_{i,j}$. This effect is also significant at the p<.01 level. In other words, participants improve their ability to guess by 1% for each of the first 10 guesses. Figure 3b shows these results graphically.
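As a rough reading of the linear-log coefficient, assume the logarithm is natural and the 3% estimate corresponds to a coefficient of about 0.03 in probability units (both are assumptions here, and $\hat{\beta}$ is a symbol introduced for illustration). Moving from the first to the tenth image then implies:

```latex
\Delta y \approx \hat{\beta}\,\bigl[\ln(10) - \ln(1)\bigr] = 0.03 \times 2.303 \approx 0.069
```

That is about 7 percentage points, the same order of magnitude as the 10-point gap between first-image and tenth-image accuracy in the raw means.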
Within the context of object removal manipulations, exposure to media manipulation and feedback on what has been manipulated improves a participant's ability to recognize faked media. After getting feedback on 10 pairs of images for an average of 1 min., 14 sec., a participant's ability to detect manipulations improves by 10%. With clear evidence that human detection of machine-manipulated media can improve, the next question is: what is the mechanism that drives participant learning rates? How do feedback, image characteristics, and participant qualities affect learning rates?
Potential Explanatory Mechanisms
We can explore what drives the learning rate by examining heterogeneous effects of image characteristics and participant qualities. Figure 4 presents 10 plots of heterogeneous learning rates based on image-fixed effects regressions with errors clustered at the participant level.
We evaluate the quality of a manipulation across five measures: (a) a subjective quality rating, (b) 1st and 4th quartile image entropy, (c) 1st and 4th quartile proportion of area of the manipulated image, (d) 1st and 4th quartile mean identification accuracy per image, and (e) number of objects disappeared. The subjective quality rating is based on ratings provided by an outside party and is a binary rating based on whether obvious artifacts were created by the image manipulation.
Image entropy is measured based on delentropy, an extension of Shannon entropy for images.20 To help understand delentropy, Figure 4 presents three pairs of images subjectively rated as high quality. Their corresponding entropy scores are included, along with the proportion of the image transformed, mean accuracy of participants' first guesses, and mean accuracy of subsequent participant guesses to exemplify what study participants learned.
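For intuition, delentropy can be sketched as the Shannon entropy of the joint histogram of an image's two gradient components, scaled by one half. The version below is an illustrative approximation assuming NumPy and an 8-bit grayscale input; the bin count, the central-difference gradient, and the fixed histogram range are choices made here and may differ from the implementation used in the study.

```python
import numpy as np

def delentropy(gray, bins=64, value_range=255.0):
    """Approximate delentropy of a 2-D grayscale image, in bits.

    Builds the joint histogram of the horizontal and vertical gradients and
    returns -1/2 * sum(p * log2(p)) over the non-empty bins.
    """
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)  # derivatives along rows (y), then columns (x)
    limits = [[-value_range, value_range], [-value_range, value_range]]
    hist, _, _ = np.histogram2d(gx.ravel(), gy.ravel(), bins=bins, range=limits)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the sum
    return float(-0.5 * np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
ramp = np.tile(np.arange(64.0), (64, 1))           # constant gradient everywhere
noise = rng.integers(0, 256, size=(64, 64))         # gradients spread widely
print(delentropy(ramp), delentropy(noise))
```

A smooth ramp has a single gradient value and therefore scores zero, while noisy texture spreads mass across many gradient bins and scores high, which matches the intuition that low-entropy images have plainer backgrounds.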
For images subjectively marked as high quality, participants correctly discern 75% for the first image and 83% for the tenth image seen. In contrast, participant accuracy on the low-quality images is higher, at 82% and 94% for the first and tenth image seen, respectively. Table 3 (see online appendix) shows that the difference in means across the subjective quality measure is statistically significant at the 99% confidence level (p <.01), but we do not find a statistically significant difference in learning rates.
As seen in Figure 4a, there is evidence that participants learn to identify low-quality images faster than high-quality images if only looking at the first five images seen. When examining the first 10 images seen, we do not find a statistically significant difference in the interaction between subjective quality and the logarithm of the image position. These results indicate that the main effect is not simply driven by participants becoming proficient at guessing low-quality images in our data.
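One way to write the interaction test described here is to add a product term to the image-fixed-effects specification. In this sketch $Q_i$ is an assumed binary indicator for a high-quality manipulation, and $\beta$ and $\delta$ are coefficient names introduced for illustration:

```latex
y_{i,j} = \beta \log(T_{i,n}) + \delta \,\bigl[Q_i \times \log(T_{i,n})\bigr] + \mu_i + \epsilon_{i,j}
```

Because $Q_i$ varies only across images, its main effect is absorbed by $\mu_i$. A difference in learning rates between quality groups would show up as $\delta \neq 0$.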
The other proxies for image quality provide insight into how subtleties play a role in discerning image manipulations. Participants learn to identify low-entropy images faster than high-entropy images, and they recognize images with a large masked area faster than images with a small masked area. Table 3 shows that this difference in learning rates is statistically significant at the 95% (p <.05) and 90% (p <.10) levels, respectively.
Smaller masked areas and lower entropy are associated with less stark and more subtle changes between original and manipulated images. This relationship may indicate that participants learn more from subtle changes than from more obvious manipulations. It may even mean people are learning to detect which kinds of images are hard to discern and, therefore, potentially likely to contain a manipulation when no obvious manipulation is apparent. It is important to note that neither the split between the 1st and 4th quartiles of mean accuracy per image nor the split between one object and many disappeared objects has a statistically significant effect on the learning rates. This means we find no association between overall manipulation discernment difficulty and learning rates.
A participant's initial performance is indicative of his or her future performance. In Figure 4, we compare subsequent learning rates of participants who correctly identified a manipulation on their first attempt to participants who failed on their first attempt and succeeded on their second. In this comparison, the omitted position for each learning curve represents perfect accuracy, which makes the marginal effects of subsequent image positions negative relative to these omitted image positions. On the first three of four image positions in this comparison, which correspond to the third through sixth image positions, we find that initially successful participants learn faster than participants who were initially unsuccessful. This heterogeneous effect does not persist in the seventh position or beyond. Overall, this heterogeneous effect is statistically significant at the 99% level (p <.01), suggesting that people who are better at discerning manipulations are also faster at learning to discern manipulations.
We find participants learn to discern manipulations involving disappeared people faster than images with any other object removed. This difference is statistically significant at the 95% confidence level (p <.05) in the linear-log regression as shown in Table 3. Figure 4 also shows this difference as statistically significant in two of the 10 image positions, suggesting that participants may be learning to detect the kinds of images that are conducive to plausible object removals.
There is a clear difference in the learning rate of participants based on whether they participated with mobile phones or computers. Participants on mobile phones learn at a consistently faster rate than participants on computers, and this difference is statistically significant as shown in Table 3 and displayed across nine of 10 image positions in Figure 4. It is possible that the seamlessness of the zoom feature on a phone relative to a computer enables mobile participants to inspect each image more closely. We do not find evidence that image placement on the website correlates with overall accuracy.
No strong evidence suggests that the speed with which a participant rated 11 images is related to the learning rate, but we do find evidence of an interaction between answering speed and wrong guesses on high-quality images. In Table 4 (see online appendix), we present a regression of participant accuracy on current and lagged features. It is important to note that we find high-quality images reduce participant accuracy by 4%, which is significant at the 99% confidence level (p <.01), but we do not find a relationship between whether the previous image was high quality and participant accuracy on the current image. However, the interaction of seconds, guessing the previous answer incorrectly, and the previous image being high quality is associated with a 0.3% increase in participant accuracy for every marginal second (p <.05). This correlational evidence suggests that when participants slow down after guessing incorrectly on high-quality, harder-to-guess images, they perform better.
While AI models can improve clinical diagnoses9,19,30 and enable autonomous driving,6 they also have the potential to scale censorship,32 amplify polarization,4 and spread fake news and manipulated media.38 We present results from a large-scale, randomized experiment showing that the combination of exposure to manipulated media and feedback on which media has been manipulated improves an individual's ability to detect media manipulations.
Direct interaction with cutting-edge technologies for content creation might enable more discerning media consumption across society. In practice, the news media has exposed high-profile, AI-manipulated media, including fake videos of the Speaker of the House of Representatives Nancy Pelosi and Facebook CEO Mark Zuckerberg, which serves as feedback to everyone on what manipulations look like.24,25 Our results build on recent research showing that people can detect low-quality news,29 human intuition can be a reliable source of information about adversarial perturbations to images,42 and familiarizing people with how fake news is produced may confer them with cognitive immunity when they are later exposed to misinformation.33 Our results also offer suggestive evidence for what drives learning to detect fake content. In this experiment, presenting participants with low-entropy images with minor manipulations on mobile devices increased learning rates at statistically significant levels. Participants appear to learn best from the most subtle manipulations.
Our results focus on a bespoke, custom-designed, neural-network architecture in a controlled, two-alternative, forced-choice experimental setting. The external validity of our findings should be further explored in different domains, using different generative models, and in settings where people are not instructed explicitly to look out for fakes, but rather encounter them in a more naturalistic social media feed, and in the context of reduced attention span. Likewise, future research in human perception of manipulated media should explore to what degree an individual's ability to adaptively detect manipulated media comes from learning by doing, direct feedback, and awareness that anything is manipulated at all.
Our results suggest a need to re-examine the precautionary principle that is commonly applied to content-generation technologies. In 2018, Google published BigGAN, which can generate realistic-appearing objects in images, but while the company hosted the generator for anyone to explore, it explicitly withheld the discriminator for its model.5 Similarly, OpenAI restricted access to its GPT-2 model, which can generate plausible long-form stories given an initial text prompt, by only providing a pared-down model of GPT-2 trained with fewer parameters.31 If exposure to manipulated content can prepare people to detect future manipulations, then censoring dissemination of AI research on content generation may prove harmful to society by leaving it unprepared for a future of ubiquitous AI-mediated content.
We developed a Target Object Removal architecture, combining instance segmentation with image inpainting to remove objects in images and replace those objects with a plausible background. Technically, we combine a convolutional neural network (CNN) trained to detect objects with a generative adversarial network (GAN) trained to inpaint missing pixels in an image.12,13,16,22 Specifically, we generate object masks with a CNN based on a RoIAlign bilinear interpolation on nearby points in the feature map.13 We crop the object masks from the image and apply a generative inpainting architecture to fill in the object masks.15,39 The generative inpainting architecture is based on dilated CNNs with an adversarial loss function, allowing the generative inpainting architecture to learn semantic information from large-scale datasets and generate missing content that makes contextual sense in the masked portion of the image.39
Target Object Removal Pipeline
Our end-to-end, targeted object removal pipeline consists of three interfacing neural networks:
- Object Mask Generator (G): This network creates a segmentation mask X' = G(X, y) given an input image X and a target class y. In our experiments, we initialize G from a semantic segmentation network trained on the 2014 MS-COCO dataset following the Mask-RCNN algorithm.13 The network generates masks for all object classes present in an image, and we select only the correct masks based on input y. This network was trained on 60 object classes.
- Generative Inpainter (I): This network creates an inpainted version Z = I(X', X) of the input image X and the object mask X'. I is initialized following the DeepFill algorithm trained on the MIT Places 2 dataset.39,41
- Local Discriminator (D): The final discriminator network takes in the inpainted image and determines its validity. Following the standard training of a GAN discriminator, D is trained simultaneously with I, where X are images from the MIT Places 2 dataset and X' are the same images with randomly assigned holes, following prior work.39,41
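The data flow through the first two networks at inference time, X' = G(X, y) followed by Z = I(X', X), can be sketched with simple stand-ins. Everything below is hypothetical scaffolding assuming NumPy: the real G is a Mask R-CNN-style segmenter and the real I is a DeepFill-style generative inpainter, whereas here G reads a precomputed per-pixel label map and I fills the hole with the mean background color. The discriminator D is used only during training and is omitted.

```python
import numpy as np

def mask_generator(image, target_class, label_map):
    """Stand-in for G: returns a boolean mask X' for pixels of the target class.

    label_map is a hypothetical per-pixel class map replacing a trained segmenter;
    image is accepted only to mirror the signature X' = G(X, y).
    """
    return label_map == target_class

def inpainter(mask, image):
    """Stand-in for I: fills masked pixels with the mean color of unmasked pixels."""
    out = image.astype(float).copy()
    if mask.any() and (~mask).any():
        out[mask] = image[~mask].mean(axis=0)  # (3,) broadcast over masked pixels
    return out

def remove_target(image, target_class, label_map):
    """Compose the two stages: segment the target, then fill in its pixels."""
    mask = mask_generator(image, target_class, label_map)
    return inpainter(mask, image), mask

# Toy example: a uniform background with a bright 2x2 "object" of class 1.
image = np.full((4, 4, 3), 10, dtype=np.uint8)
image[1:3, 1:3] = 200
label_map = np.zeros((4, 4), dtype=int)
label_map[1:3, 1:3] = 1

result, mask = remove_target(image, target_class=1, label_map=label_map)
print(int(mask.sum()), result[1, 1])
```

Keeping the mask generator and the inpainter behind separate functions mirrors the modular design described above, where either network could be swapped for a different model without changing the other.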
The Deep Angel website enabled us to make the Target Object Removal architecture publicly available. We hosted the architecture API on a single NVIDIA GeForce GTX Titan X; anyone could upload an image to the site and select an object to be removed from the image.
Participants uploaded 18,152 unique images from mobile phones and computers; they also directed the crawling of 12,580 unique images from Instagram. We can surface the most plausible object removal manipulations by examining the images with the lowest guessing accuracy. The Target Object Removal architecture can produce plausible content, but the plausibility is largely image dependent and constrained to specific domains, where objects are a small portion of the image, and the background is natural and uncluttered by other objects.
Data availability: The data and replication code are available at: https://github.com/mattgroh/human-detection-machine-manipulated-media-data-code.
Acknowledgments. We thank Abhimanyu Dubey, Mohit Tiwari, and David McKenzie for their helpful comments and feedback.
Figure. Watch the authors discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/machine-manipulated-media
6. Chen, C., Seff, A., Kornhauser, A., and Xiao, J. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proc. of the IEEE Intern. Conf. on Computer Vision (2015), 2722–2730.
11. Garrido, P., Valgaerts, L., Sarmadi, H., Steiner, I., Varanasi, K., Perez, P., and Theobalt, C. Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. Computer Graphics Forum 34. Wiley Online Library (2015), 193–204.
19. Kooi, T., Litjens, G., Van Ginneken, B., Gubern-Mérida, A., Sánchez, C.I., Mann, R., den Heeten, A., and Karssemeijer, N. Large scale deep learning for computer aided detection of mammographic lesions. Medical Image Analysis 35 (2017), 303–312.
21. Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Metzger, M.J., Nyhan, B., Pennycook, G., Rothschild, D., et al. The science of fake news. Science 359, 6380 (2018), 1094–1096.
23. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C.L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, Springer. (2014), 740–755.
24. Mervosh, S. Distorted videos of Nancy Pelosi spread on Facebook and Twitter, helped by Trump. (May 2019). https://www.nytimes.com/2019/05/24/us/politics/pelosi-doctored-video.html.
25. Metz, C. A fake Zuckerberg video challenges Facebook's rules. (June 2019). https://www.nytimes.com/2019/06/11/technology/fake-zuckerberg-video-facebook.html.
30. Poplin, R., Varadarajan, A.V., Blumer, K., Liu, Y., McConnell, M.V., Corrado, G.S., Peng, L., and Webster, D.R. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering 2, 3 (2018), 158.
36. Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., and Nießner, M. Face2face: Real-time face capture and reenactment of RGB videos. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. (2016), 2387–2395.
Author contributions. M.G. implemented the methods, M.G., Z.E., N.O. analyzed data and wrote the article. All authors conceived the original idea, designed the research, and provided critical feedback on the analysis and manuscript.
This work is licensed under a Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/
The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.