Artificial Intelligence and Machine Learning Research highlights

Face2Face: Real-Time Face Capture and Reenactment of RGB Videos

By Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner

Posted Jan 1 2019

Abstract
1. Introduction
2. Related Work
3. Use Cases
4. Method Overview
5. Synthesis of Facial Imagery
6. Energy Formulation
7. Data-Parallel Optimization
8. Non-Rigid Model-Based Bundling
9. Expression Transfer
10. Mouth Retrieval
11. Results
12. Limitations
13. Discussion
14. Conclusion
Acknowledgments
References
Authors
Footnotes

Face2Face is an approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time. This live setup has also been shown at SIGGRAPH Emerging Technologies 2016, by Thies et al.²⁰ where it won the Best in Show Award.

1. Introduction

In recent years, real-time markerless facial performance capture based on commodity sensors has been demonstrated. Impressive results have been achieved, both based on Red-Green-Blue (RGB) as well as RGB-D data. These techniques have become increasingly popular for the animation of virtual Computer Graphics (CG) avatars in video games and movies. It is now feasible to run these face capture and tracking algorithms from home, which is the foundation for many Virtual Reality (VR) and Augmented Reality (AR) applications, such as teleconferencing.

In this paper, we employ a new dense markerless facial performance capture method based on monocular RGB data, similar to state-of-the-art methods. However, instead of transferring facial expressions to virtual CG characters, our main contribution is monocular facial reenactment in real-time. In contrast to previous reenactment approaches that run offline, our goal is the online transfer of facial expressions of a source actor captured by an RGB sensor to a target actor. The target sequence can be any monocular video; for example, legacy video footage downloaded from Youtube with a facial performance. We aim to modify the target video in a photo-realistic fashion, such that it is virtually impossible to notice the manipulations. Faithful photo-realistic facial reenactment is the foundation for a variety of applications; for instance, in video conferencing, the video feed can be adapted to match the face motion of a translator, or face videos can be convincingly dubbed to a foreign language.

In our method, we first reconstruct the shape identity of the target actor using a new global non-rigid model-based bundling approach based on a prerecorded training sequence. As this preprocess is performed globally on a set of training frames, we can resolve geometric ambiguities common to monocular reconstruction. At runtime, we track the expressions of both the source and the target actor by a dense analysis-by-synthesis approach based on a statistical facial prior. We demonstrate that our RGB tracking accuracy is on par with the state of the art, even with online tracking methods relying on depth data. In order to transfer expressions from the source to the target actor in real-time, we propose a novel transfer function that efficiently applies deformation transfer¹⁸ directly in the used low-dimensional expression space. For final image synthesis, we re-render the target’s face with transferred expression coefficients and composite it with the target video’s background under consideration of the estimated environment lighting. Finally, we introduce a new image-based mouth synthesis approach that generates a realistic mouth interior by retrieving and warping best matching mouth shapes from the offline sample sequence. It is important to note that we maintain the appearance of the target mouth shape; in contrast, existing methods either copy the source mouth region onto the target²³ or render a generic teeth proxy,⁸,¹⁹ both of which lead to inconsistent results. Figure 2 shows an overview of our method.

Figure 1. Proposed online reenactment setup: A monocular target video sequence (e.g., from Youtube) is reenacted based on the expressions of a source actor who is recorded live with a commodity webcam.

Figure 2. An overview of our reenactment approach: In a preprocessing step we analyze and reconstruct the face of the target actor. During live reenactment, we track the expressions of the source actor and transfer them to the reconstructed target face. Finally, we composite a novel image of the target person using a mouth interior of the target sequence that best matches the new expression.

We demonstrate highly convincing transfer of facial expressions from a source to a target video in real time. We show results with a live setup where a source video stream, which is captured by a webcam, is used to manipulate a target Youtube video (see Figure 1). In addition, we compare against state-of-the-art reenactment methods, which we outperform both in terms of resulting video quality and runtime (we are the first real-time RGB reenactment method). In summary, our key contributions are:

dense, global non-rigid model-based bundling,
accurate tracking, appearance, and lighting estimation in unconstrained live RGB video,
person-dependent expression transfer using subspace deformations,
and a novel mouth synthesis approach.

2. Related Work

2.1. Offline RGB performance capture

Recent offline performance capture techniques approach the hard monocular reconstruction problem by fitting a blendshape or a multilinear face model to the input video sequence. Even geometric fine-scale surface detail is extracted via inverse shading-based surface refinement. Shi et al.¹⁶ achieve impressive results based on global energy optimization of a set of selected keyframes. Our model-based bundling formulation to recover actor identities is similar to their approach; however, we use robust and dense global photometric alignment, which we enforce with an efficient data-parallel optimization strategy on the Graphics Processing Unit (GPU).

2.2. Online RGB-D performance capture

Weise et al.²⁵ capture facial performances in real-time by fitting a parametric blendshape model to RGB-D data, but they require a professional, custom capture setup. The first real-time facial performance capture system based on a commodity depth sensor has been demonstrated by Weise et al.²⁴ Follow-up work focused on corrective shapes,² dynamically adapting the blend-shape basis,¹¹ and non-rigid mesh deformation.⁶ These works achieve impressive results, but rely on depth data, which is unavailable in most video footage.

2.3. Online RGB performance capture

While many sparse real-time face trackers exist, for example, Saragih et al.,¹⁵ real-time dense monocular tracking is the basis of realistic online facial reenactment. Cao et al.⁵ propose a real-time regression-based approach to infer 3D positions of facial landmarks which constrain a user-specific blendshape model. Follow-up work⁴ also regresses fine-scale face wrinkles. These methods achieve impressive results, but are not directly applicable as a component in facial reenactment, since they do not facilitate dense, pixel-accurate tracking.

2.4. Offline reenactment

Vlasic et al.²³ perform facial reenactment by tracking a face template, which is re-rendered under different expression parameters on top of the target; the mouth interior is directly copied from the source video. Image-based offline mouth re-animation was shown in Bregler et al.³ Garrido et al.⁷ propose an automatic purely image-based approach to replace the entire face. These approaches merely enable self-reenactment; that is, when source and target are the same person; in contrast, we perform reenactment of a different target actor. Recent work presents virtual dubbing,⁸ a problem similar to ours; however, the method runs at slow offline rates and relies on a generic teeth proxy for the mouth interior. Li et al.¹² retrieve frames from a database based on a similarity metric. They use optical flow as appearance and velocity measure and search for the k-nearest neighbors based on time stamps and flow distance. Saragih et al.¹⁵ present a real-time avatar animation system from a single image. Their approach is based on sparse landmark tracking, and the mouth of the source is copied to the target using texture warping.

2.5. Online reenactment

Recently, first online facial reenactment approaches based on RGB-(D) data have been proposed. Kemelmacher-Shlizerman et al.¹⁰ enable image-based puppetry by querying similar images from a database. They employ an appearance cost metric and consider rotation angular distance. While they achieve impressive results, the retrieved stream of faces is not temporally coherent. Thies et al.¹⁹ show the first online reenactment system; however, they rely on depth data and use a generic teeth proxy for the mouth region. In this paper, we address both shortcomings: (1) our method is the first real-time RGB-only reenactment technique; (2) we synthesize the mouth regions exclusively from the target sequence (no need for a teeth proxy or direct source-to-target copy).

2.6. Follow-up work

The core component of the proposed approach is the dense face reconstruction algorithm. It has already been adapted for several applications, such as head mounted display removal,²² facial projection mapping,¹⁷ and avatar digitization.⁹ FaceVR²² demonstrates self-reenactment for head mounted display removal, which is particularly useful for enabling natural teleconferences in virtual reality. The FaceForge¹⁷ system enables real-time facial projection mapping to dynamically alter the appearance of a person in the real world. The avatar digitization approach of Hu et al.⁹ reconstructs a stylized 3D avatar that includes hair and teeth, from just a single image. The resulting 3D avatars can for example be used in computer games.

3. Use Cases

The proposed facial tracking and reenactment has several use cases that we want to highlight in this section. In movie productions, facial reenactment can be used as a video editing tool, for example, to change the expression of an actor in a particular shot. Using the estimated geometry of an actor, it can also be used to modify the appearance of a face in a post-process, for example, by changing the illumination. Another field in post-production is the synchronization of an audio channel to the video. If a movie is translated to another language, the movements of the mouth do not match the audio of the so-called dubber. Nowadays, to match the video, the audio including the spoken text is adapted, which might result in a loss of information. Using facial reenactment instead, the expressions of the dubber can be transferred to the actor in the movie, and thus the audio and video are synchronized. Since our reenactment approach runs in real time, it is also possible to set up a teleconferencing system with a live interpreter who simultaneously translates the speech of a person to another language.

In contrast to state-of-the-art movie production setups that work with markers and complex camera setups, our system presented in this paper only requires commodity hardware without the need for markers. Our tracking results can also be used to animate virtual characters. These virtual characters can be part of animation movies, but can also be used in computer games. With the introduction of virtual reality glasses, also called Head Mounted Displays (HMDs), the realistic animation of such virtual avatars becomes more and more important for immersive gameplay. FaceVR²² demonstrates that facial tracking is also possible if the face is almost completely occluded by such an HMD. The project also paves the way to new applications like teleconferencing in VR based on HMD removal.

Besides these consumer applications, numerous medical applications are also conceivable. For example, one can build a training system that helps patients to train expressions after a stroke.

4. Method Overview

In the following, we describe our real-time facial reenactment pipeline (see Figure 2). Input to our method is a monocular target video sequence and a live video stream captured by a commodity webcam. First, we describe how we synthesize facial imagery using a statistical prior and an image formation model (see Section 5). We find optimal parameters that best explain the input observations by solving a variational energy minimization problem (see Section 6). We minimize this energy with a tailored, data-parallel GPU-based Iteratively Reweighted Least Squares (IRLS) solver (see Section 7). We employ IRLS for off-line non-rigid model-based bundling (see Section 8) on a set of selected keyframes to obtain the facial identity of the source as well as of the target actor. This step jointly recovers the facial identity, expression, skin reflectance, and illumination from monocular input data. At runtime, both source and target animations are reconstructed based on a model-to-frame tracking strategy with a similar energy formulation. For reenactment, we propose a fast and efficient deformation transfer approach that directly operates in the subspace spanned by the used statistical prior (see Section 9). The mouth interior that best matches the re-targeted expression is retrieved from the input target sequence (see Section 10) and is warped to produce an accurate fit. We demonstrate our complete pipeline in a live reenactment setup that enables the modification of arbitrary video footage and perform a comparison to state-of-the-art tracking as well as reenactment approaches (see Section 11). In Section 12, we show the limitations of our proposed method.

Since we are aware of the implications of a video editing tool like Face2Face, we included a section in this paper that discusses the potential misuse of the presented technology (see Section 13). Finally, we conclude with an outlook on future work (see Section 14).

5. Synthesis of Facial Imagery

The synthesis of facial imagery is based on a multi-linear face model (see the original Face2Face paper for more details). The first two dimensions represent facial identity (that is, geometric shape and skin reflectance), and the third dimension controls the facial expression. Hence, we parametrize a face as:

$$M_{geo}(\alpha, \delta) = a_{id} + E_{id}\,\alpha + E_{exp}\,\delta,$$
$$M_{alb}(\beta) = a_{alb} + E_{alb}\,\beta.$$

This prior assumes a multivariate normal probability distribution of shape and reflectance around the average shape a_id ∈ R^{3n} and reflectance a_alb ∈ R^{3n}. The shape E_id ∈ R^{3n×80}, reflectance E_alb ∈ R^{3n×80}, and expression E_exp ∈ R^{3n×76} bases and the corresponding standard deviations σ_id ∈ R⁸⁰, σ_alb ∈ R⁸⁰, and σ_exp ∈ R⁷⁶ are given. The model has 53K vertices and 106K faces. A synthesized image C_S is generated through rasterization of the model under a rigid model transformation Φ(v) and the full perspective transformation Π(v). Illumination is approximated by the first three bands of Spherical Harmonics (SH)¹³ basis functions, assuming Lambertian surfaces and smooth distant illumination, neglecting self-shadowing.
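As a rough illustration of the linear prior described above, the following NumPy sketch builds per-vertex geometry and reflectance as the mean plus a linear combination of the basis vectors. The tiny vertex count, the random bases, and the function name `synthesize_face` are assumptions made for illustration only; whether the coefficients are additionally scaled by the standard deviations is left open here.

```python
import numpy as np

def synthesize_face(a_id, E_id, E_exp, a_alb, E_alb, alpha, beta, delta):
    """Evaluate a linear face prior: mean plus weighted basis vectors.

    Returns per-vertex positions and albedo, each of shape (n, 3).
    """
    geometry = a_id + E_id @ alpha + E_exp @ delta   # shape identity + expression
    albedo = a_alb + E_alb @ beta                    # skin reflectance
    return geometry.reshape(-1, 3), albedo.reshape(-1, 3)

# Toy stand-in for the real model (which has 53K vertices).
rng = np.random.default_rng(0)
n = 5
a_id, a_alb = rng.normal(size=3 * n), rng.uniform(size=3 * n)
E_id, E_alb = rng.normal(size=(3 * n, 80)), rng.normal(size=(3 * n, 80))
E_exp = rng.normal(size=(3 * n, 76))

# With all coefficients at zero, the model reproduces the average face.
verts, colors = synthesize_face(a_id, E_id, E_exp, a_alb, E_alb,
                                np.zeros(80), np.zeros(80), np.zeros(76))
```

Because the model is linear in α, β, and δ, its derivatives with respect to these coefficients are simply the basis matrices, which is convenient for the Gauss-Newton style optimization used later.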

Synthesis is dependent on the face model parameters α, β, δ, the illumination parameters γ, the rigid transformation R, t, and the camera parameters κ defining Π. The vector of unknowns P is the union of these parameters.
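The illumination model can be sketched in a few lines: the first three SH bands give nine basis functions of the surface normal, and a Lambertian surface point is shaded as albedo times the SH-approximated irradiance. The constants are the standard real spherical harmonics normalization factors; storing nine coefficients per color channel (a 9 × 3 array `gamma`) and the function names are assumptions of this sketch, not details confirmed by the text.

```python
import numpy as np

def sh_basis(normals):
    """First three bands (9 functions) of real spherical harmonics.

    normals: (m, 3) array of unit vectors. Returns an (m, 9) array.
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),                   # band 0
        0.488603 * y, 0.488603 * z, 0.488603 * x,     # band 1
        1.092548 * x * y, 1.092548 * y * z,           # band 2 ...
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ], axis=1)

def shade(albedo, normals, gamma):
    """Lambertian shading: albedo times SH-approximated irradiance.

    albedo: (m, 3), normals: (m, 3), gamma: (9, 3) coefficients per channel.
    """
    return albedo * (sh_basis(normals) @ gamma)

normals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
albedo = np.array([[0.8, 0.6, 0.5], [0.8, 0.6, 0.5]])
gamma = np.zeros((9, 3))
gamma[0] = 1.0 / 0.282095          # constant (ambient) light only
colors = shade(albedo, normals, gamma)
```

With only the constant band active, the shaded color equals the albedo regardless of the normal; the higher bands add smooth directional variation, which is why hard shadows and specular highlights are outside the model.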

6. Energy Formulation

Given a monocular input sequence, we reconstruct all unknown parameters P jointly with a robust variational optimization. The proposed objective is highly non-linear in the unknowns and has the following components:

$$E(P) = w_{col} E_{col}(P) + w_{lan} E_{lan}(P) + w_{reg} E_{reg}(P).$$

The data term measures the similarity between the synthesized imagery and the input data in terms of photo-consistency E_col and facial feature alignment E_lan. The likelihood of a given parameter vector P is taken into account by the statistical regularizer E_reg. The weights w_col, w_lan, and w_reg balance the three different sub-objectives. In all of our experiments, we set w_col = 1, w_lan = 10, and w_reg = 2.5 · 10⁻⁵. In the following, we introduce the different sub-objectives.

Photo-Consistency. In order to quantify how well the input data is explained by a synthesized image, we measure the photometric alignment error on pixel level:

$$E_{col}(P) = \frac{1}{|V|} \sum_{p \in V} \| C_S(p) - C_I(p) \|_2,$$

where C_S is the synthesized image, C_I is the input RGB image, and p ∈ V denote all visible pixel positions in C_S. We use the ℓ_2,1-norm instead of a least-squares formulation to be robust against outliers. In our scenario, distance in color space is based on ℓ₂ while in the summation over all pixels an ℓ₁-norm is used to enforce sparsity.
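A minimal sketch of this robust photometric measure, assuming float RGB images and a boolean visibility mask, with the sum averaged over the visible pixels; the function name `photo_consistency` and the toy images are hypothetical:

```python
import numpy as np

def photo_consistency(synth, image, visible):
    """Mean per-pixel Euclidean color distance over visible pixels.

    synth, image: (H, W, 3) float arrays; visible: (H, W) boolean mask.
    """
    diff = synth[visible] - image[visible]            # (|V|, 3) color residuals
    per_pixel = np.linalg.norm(diff, axis=1)          # l2 norm in color space
    return per_pixel.sum() / max(len(per_pixel), 1)   # l1-style sum over pixels

synth = np.zeros((2, 2, 3))
image = np.zeros((2, 2, 3))
image[..., 0], image[..., 1] = 3.0, 4.0               # every pixel differs by (3, 4, 0)
image[1, 1] = 100.0                                   # gross error outside the face mask
visible = np.array([[True, True], [True, False]])
e_col = photo_consistency(synth, image, visible)
```

Because the per-pixel norms are summed without squaring, a single badly explained pixel contributes linearly rather than quadratically, which is what makes the measure less sensitive to outliers than plain least squares.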

Feature Alignment. In addition, we enforce feature similarity between a set of salient facial feature point pairs detected in the RGB stream:

$$E_{lan}(P) = \frac{1}{|F|} \sum_{f_j \in F} w_{conf,j} \, \| f_j - \Pi(\Phi(v_j)) \|_2^2.$$

To this end, we employ a state-of-the-art facial landmark tracking algorithm by Saragih et al.¹⁴ Each feature point f_j ∈ F ⊂ R² comes with a detection confidence w_conf,j and corresponds to a unique vertex v_j = M_geo(α, δ) ∈ R³ of our face prior. This helps avoid local minima in the highly complex energy landscape of E_col(P).
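The landmark term can be sketched as a confidence-weighted squared distance between detected 2D features and the projections of their associated model vertices. The simple pinhole camera (`focal`, `center`) used for Π, the averaging over features, and all function names are assumptions of this sketch; the actual camera model is the full perspective transformation with estimated intrinsics.

```python
import numpy as np

def project(points, R, t, focal, center):
    """Rigid transform followed by a simple pinhole projection (assumed camera)."""
    cam = points @ R.T + t
    return focal * cam[:, :2] / cam[:, 2:3] + center

def landmark_energy(features, conf, verts, R, t, focal, center):
    """Confidence-weighted mean squared 2D distance between detected
    features and the projections of their corresponding model vertices."""
    residual = features - project(verts, R, t, focal, center)
    return np.sum(conf * np.sum(residual ** 2, axis=1)) / len(features)

verts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.05]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
focal, center = 500.0, np.array([320.0, 240.0])
conf = np.array([1.0, 0.5, 0.8])

perfect = project(verts, R, t, focal, center)
e_zero = landmark_energy(perfect, conf, verts, R, t, focal, center)
e_shift = landmark_energy(perfect + np.array([2.0, 0.0]), conf,
                          verts, R, t, focal, center)
```

Shifting every detection by two pixels yields an energy of 4 times the mean confidence, so low-confidence detections pull on the model proportionally less.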

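Statistical Regularization. We enforce plausibility of the synthesized faces based on the assumption of a normally distributed population. To this end, we enforce the parameters to stay statistically close to the mean:

$$E_{reg}(P) = \sum_{i=1}^{80} \left[ \left( \frac{\alpha_i}{\sigma_{id,i}} \right)^2 + \left( \frac{\beta_i}{\sigma_{alb,i}} \right)^2 \right] + \sum_{i=1}^{76} \left( \frac{\delta_i}{\sigma_{exp,i}} \right)^2.$$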

This commonly used regularization strategy prevents degenerations of the facial geometry and reflectance, and guides the optimization strategy out of local minima.¹
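To make the balance between the three sub-objectives concrete, the sketch below evaluates the statistical regularizer and combines it with example data-term values using the weights reported above. The standard deviations are random placeholders and the data-term values 0.2 and 0.01 are made up for illustration.

```python
import numpy as np

W_COL, W_LAN, W_REG = 1.0, 10.0, 2.5e-5   # weights reported in the text

def regularizer(alpha, beta, delta, sigma_id, sigma_alb, sigma_exp):
    """Sum of squared coefficients, each measured in standard deviations."""
    return (np.sum((alpha / sigma_id) ** 2)
            + np.sum((beta / sigma_alb) ** 2)
            + np.sum((delta / sigma_exp) ** 2))

def total_energy(e_col, e_lan, e_reg):
    """Weighted sum of the two data terms and the statistical prior."""
    return W_COL * e_col + W_LAN * e_lan + W_REG * e_reg

rng = np.random.default_rng(1)
sigma_id, sigma_alb = rng.uniform(0.5, 2.0, 80), rng.uniform(0.5, 2.0, 80)
sigma_exp = rng.uniform(0.5, 2.0, 76)

# Coefficients exactly one standard deviation from the mean: each term adds 1.
e_reg = regularizer(sigma_id, sigma_alb, sigma_exp, sigma_id, sigma_alb, sigma_exp)
e_total = total_energy(e_col=0.2, e_lan=0.01, e_reg=e_reg)
```

With every coefficient one standard deviation from the mean, the regularizer sums to 80 + 80 + 76 = 236, which the small weight w_reg scales down to 0.0059; the prior therefore only dominates when the data terms are weak or the coefficients drift far from the mean.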

7. Data-Parallel Optimization

The proposed robust tracking objective is a general unconstrained non-linear optimization problem. We use IRLS to minimize this objective in real-time using a novel data-parallel GPU-based solver. The key idea of IRLS is to transform the problem, in each iteration, to a non-linear least-squares problem by splitting the norm in two components:

$$\| r(P) \|_2 = \left( \| r(P_{old}) \|_2 \right)^{-1} \cdot \| r(P) \|_2^2.$$

Here, r(·) is a general residual and P_old is the solution computed in the last iteration. Thus, the first part is kept constant during one iteration and updated afterwards. Close in spirit to Thies et al.,¹⁹ each single iteration step is implemented using the Gauss-Newton approach. We take a single GN step in every IRLS iteration and solve the corresponding system of normal equations JᵀJδ* = −JᵀF based on PCG (Preconditioned Conjugate Gradient) to obtain an optimal linear parameter update δ*. The Jacobian J and the system’s right-hand side −JᵀF are precomputed and stored in device memory for later processing as proposed by Thies et al.¹⁹ For more details we refer to the original paper.²¹

Note that our complete framework is implemented using DirectX for rendering and DirectCompute for optimization. The joint graphics and compute capability of DirectX11 enables us to execute the analysis-by-synthesis loop without any resource mapping overhead between these two stages. In the case of an analysis-by-synthesis approach, this is essential for runtime performance, since many rendering-to-compute switches are required. To compute the Jacobian J we developed a differential renderer that is based on the standard rasterizer of the graphics pipeline. To this end, during the synthesis stage, we additionally store the vertex and triangle attributes that are required for computing the partial derivatives to dedicated rendertargets. Using this information, a compute shader calculates the final derivatives that are needed for the optimization.
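The interplay of IRLS, Gauss-Newton, and PCG can be illustrated on a toy problem. The sketch below is a CPU-only NumPy stand-in, not the GPU solver described above: it fits a line to data with two gross outliers by minimizing the sum of absolute residuals, reweighting with the residuals of the previous iterate and taking one Gauss-Newton step per IRLS iteration, solved with four Jacobi-preconditioned CG steps. The problem, the Jacobi preconditioner, and all names are assumptions chosen for illustration.

```python
import numpy as np

def pcg(A, b, n_steps):
    """A few Jacobi-preconditioned conjugate gradient steps for A x = b."""
    inv_diag = 1.0 / np.diag(A)
    x = np.zeros_like(b)
    r = b.copy()
    z = inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(n_steps):
        if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(b):
            break                                  # already converged
        Ap = A @ p
        step = rz / (p @ Ap)
        x += step * p
        r -= step * Ap
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def irls_line_fit(x, y, n_irls=30, n_pcg=4, eps=1e-6):
    """Minimize sum_i |a*x_i + b - y_i| with IRLS, one Gauss-Newton step each."""
    P = np.zeros(2)
    J = np.stack([x, np.ones_like(x)], axis=1)     # Jacobian of the residuals
    for _ in range(n_irls):
        F = J @ P - y                              # residuals at P_old
        w = 1.0 / np.maximum(np.abs(F), eps)       # constant part of the split norm
        JtWJ = J.T @ (w[:, None] * J)
        rhs = -J.T @ (w * F)
        P = P + pcg(JtWJ, rhs, n_pcg)              # approximate Gauss-Newton update
    return P

x = np.linspace(0.0, 1.0, 21)
y = 2.0 * x + 1.0
y[3] += 10.0                                       # two gross outliers
y[15] -= 8.0
P_robust = irls_line_fit(x, y)
P_lsq = np.linalg.lstsq(np.stack([x, np.ones_like(x)], axis=1), y, rcond=None)[0]
```

The reweighted solution stays close to the true line (slope 2, intercept 1), whereas the ordinary least-squares fit `P_lsq` is pulled toward the outliers. Running only a fixed, small number of PCG steps per Gauss-Newton iteration trades exactness of each linear solve for a predictable per-frame cost.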

8. Non-Rigid Model-Based Bundling

To estimate the identity of the actors in the heavily underconstrained scenario of monocular reconstruction, we introduce a non-rigid model-based bundling approach. Based on the proposed objective, we jointly estimate all parameters over k keyframes of the input video sequence. The estimated unknowns are the global identity {α, β} and intrinsics κ as well as the unknown per-frame pose {δ^k, R^k, t^k}_k and illumination parameters {γ^k}_k. We use a similar data-parallel optimization strategy as proposed for model-to-frame tracking, but jointly solve the normal equations for the entire keyframe set. For our non-rigid model-based bundling problem, the non-zero structure of the corresponding Jacobian is block dense. Our PCG solver exploits the non-zero structure for increased performance (see original paper). Since all keyframes observe the same face identity under potentially varying illumination, expression, and viewing angle, we can robustly separate identity from all other problem dimensions. Note that we also solve for the intrinsic camera parameters of Π, thus being able to process uncalibrated video footage.

The employed Gauss-Newton framework is embedded in a hierarchical solution strategy (see Figure 3). The underlying hierarchy enables faster convergence and avoids getting stuck in local minima of the optimized energy function. We start optimizing on a coarse level and lift the solution to the next finer level using the parametric face model. In our experiments we used three levels with 25, 5, and 1 Gauss-Newton iterations for the coarsest, the medium, and the finest level, respectively. In each Gauss-Newton iteration, we employ 4 PCG steps to efficiently solve the underlying normal equations. Our implementation does not restrict the number k of keyframes, but the processing time increases linearly with k. In our experiments we used k = 6 keyframes for the estimation of the identity parameters, which results in a processing time of only a few seconds (∼20 s).
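The solver budget implied by this coarse-to-fine schedule is easy to tally. The snippet below only does the arithmetic for the iteration counts stated above; the level names and the helper `solver_budget` are illustrative.

```python
# Coarse-to-fine schedule reported in the text: (level name, Gauss-Newton iterations).
SCHEDULE = [("coarse", 25), ("medium", 5), ("fine", 1)]
PCG_STEPS_PER_ITERATION = 4

def solver_budget(schedule, pcg_steps):
    """Return total Gauss-Newton iterations and total PCG steps for a schedule."""
    gn_total = sum(iters for _, iters in schedule)
    return gn_total, gn_total * pcg_steps

gn_total, pcg_total = solver_budget(SCHEDULE, PCG_STEPS_PER_ITERATION)
```

This gives 31 Gauss-Newton iterations and 124 PCG steps in total, with most of the work spent on the coarsest level, where each iteration is cheapest.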

Figure 3. Non-rigid model-based bundling hierarchy: The top row shows the hierarchy of the input video and the second row the overlaid face model.

9. Expression Transfer

To transfer the expression changes from the source to the target actor while preserving person-specificness in each actor’s expressions, we propose a sub-space deformation transfer technique. We are inspired by the deformation transfer energy of Sumner and Popović,¹⁸ but operate directly in the space spanned by the expression blend-shapes. This not only allows for the precomputation of the pseudo-inverse of the system matrix, but also drastically reduces the dimensionality of the optimization problem, allowing for fast real-time transfer rates. Assuming source identity α^S and target identity α^T fixed, transfer takes as input the neutral source expression δ^S_N, the deformed source expression δ^S, and the neutral target expression δ^T_N. Output is the transferred facial expression δ^T directly in the reduced sub-space of the parametric prior.

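As proposed by Sumner and Popović,¹⁸ we first compute the source deformation gradients A_i ∈ R^{3×3} that transform the source triangles from neutral to deformed. The deformed target is then found based on the undeformed state by solving a linear least-squares problem. Let (i₀, i₁, i₂) be the vertex indices of the i-th triangle, V = [v_{i₁} − v_{i₀}, v_{i₂} − v_{i₀}] the edges of the undeformed target triangle, and V̂ = [v̂_{i₁} − v̂_{i₀}, v̂_{i₂} − v̂_{i₀}] the edges of the deformed target triangle; then the optimal unknown target deformation δ^T is the minimizer of:

$$E(\delta^T) = \sum_{i=1}^{|F|} \| A_i V - \hat{V} \|_F^2.$$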

This problem can be rewritten in the canonical least-squares form by substitution:

$$E(\delta^T) = \| A \delta^T - b \|_2^2.$$

The matrix A ∈ R^{6|F|×76} is constant and contains the edge information of the template mesh projected to the expression sub-space. Edge information of the target in neutral expression is included in the right-hand side b ∈ R^{6|F|}. The vector b varies with δ^S and is computed on the GPU for each new input frame. The minimizer of the quadratic energy can be computed by solving the corresponding normal equations. Since the system matrix is constant, we can precompute its pseudo-inverse using a Singular Value Decomposition (SVD). Later, the small 76 × 76 linear system is solved in real-time. No additional smoothness term as in Bouaziz et al.² and Sumner and Popović¹⁸ is needed, since the blendshape model implicitly restricts the result to plausible shapes and guarantees smoothness.
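The offline/online split described here can be sketched as follows: the pseudo-inverse of the constant matrix is computed once via an SVD, and each frame then costs a single small matrix-vector product. The matrix `A` below is a random stand-in with the stated shape (6|F| × 76), not real mesh edge data, and the function names are hypothetical.

```python
import numpy as np

def precompute_pseudo_inverse(A, rcond=1e-10):
    """Pseudo-inverse of the constant system matrix via an SVD (done once)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)   # drop tiny singular values
    return (Vt.T * s_inv) @ U.T

def transfer_expression(A_pinv, b):
    """Per-frame least-squares solve: one small matrix-vector product."""
    return A_pinv @ b

rng = np.random.default_rng(2)
num_triangles, num_blendshapes = 40, 76
A = rng.normal(size=(6 * num_triangles, num_blendshapes))   # random stand-in
delta_true = rng.normal(size=num_blendshapes)
b = A @ delta_true                                          # consistent right-hand side

A_pinv = precompute_pseudo_inverse(A)        # offline
delta_T = transfer_expression(A_pinv, b)     # online, once per frame
```

Since only `b` changes from frame to frame, the per-frame cost is independent of how expensive the decomposition was, which is what makes the transfer suitable for real-time use.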

10. Mouth Retrieval

For a given transferred facial expression, we need to synthesize a realistic target mouth region. To this end, we retrieve and warp the best matching mouth image from the target actor sequence (see Figure 4). We assume that sufficient mouth variation is available in the target video, that is, we assume that the entire target video is known or at least a short part of it. It is also important to note that we maintain the appearance of the target mouth. This leads to much more realistic results than either copying the source mouth region²³ or using a generic 3D teeth proxy.⁸,¹⁹ For detailed information on the mouth retrieval process, we refer to the original paper.
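The retrieval idea can be illustrated with a deliberately simplified stand-in: pick the target frame whose tracked expression coefficients are nearest to the transferred expression. This plain Euclidean nearest-neighbor lookup is an assumption made for illustration; the actual similarity measure and the warping step are described in the original paper and are not reproduced here.

```python
import numpy as np

def retrieve_mouth_frame(target_expressions, query_expression):
    """Index of the target frame whose expression coefficients are closest
    (Euclidean distance) to the transferred expression."""
    distances = np.linalg.norm(target_expressions - query_expression, axis=1)
    return int(np.argmin(distances))

rng = np.random.default_rng(3)
target_expressions = rng.normal(size=(300, 76))     # one row per target frame
query = target_expressions[42] + 0.01 * rng.normal(size=76)
best = retrieve_mouth_frame(target_expressions, query)
```

The sketch also hints at why sufficient mouth variation in the target video matters: if no stored frame lies near the query, even the best match will be a poor fit.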

Figure 4. Mouth Database: We use the appearance of the mouth of a person that has been captured in the target video sequence.

11. Results

11.1. Live reenactment setup

Our live reenactment setup consists of standard consumer-level hardware. We capture a live video with a commodity webcam (source), and download monocular video clips from Youtube (target). In our experiments, we use a Logitech HD Pro C920 camera running at 30 Hz at a resolution of 640 × 480, although our approach is applicable to any consumer RGB camera. Overall, we show highly realistic reenactment examples of our algorithm on a variety of target Youtube videos at a resolution of 1280 × 720. The videos show different subjects in different scenes filmed from varying camera angles; each video is reenacted by several volunteers as source actors. Reenactment results are generated at a resolution of 1280 × 720. We show real-time reenactment results in Figure 5 and in the accompanying video.

Figure 5. Results of our reenactment system. Corresponding run times are listed in Table 1. The length of the source and resulting output sequences is 965, 1436, and 1791 frames, respectively; the length of the input target sequences is 431, 286, and 392 frames, respectively.

11.2. Runtime

For all experiments, we use three hierarchy levels for tracking (source and target). In pose optimization, we only consider the second and third level, where we run one and seven Gauss-Newton steps, respectively. Within a Gauss-Newton step, we always run four PCG steps. In addition to tracking, our reenactment pipeline has additional stages whose timings are listed in Table 1. Our method runs in real time on a commodity desktop computer with an NVIDIA Titan X and an Intel Core i7-4770.

Table 1. Avg. run times for the three sequences of Figure 5, from top to bottom.^a

11.3. Tracking comparison to previous work

Face tracking alone is not the main focus of our work, but the following comparisons show that our tracking is on par with or exceeds the state of the art. Here we show some of the comparisons that we conducted in the original paper.

Cao et al. 2014:⁵ They capture face performance from monocular RGB in real time. In most cases, our and their method produce similar high-quality results (see Figure 6); our identity and expression estimates are slightly more accurate though.

Figure 6. Comparison of our RGB tracking to Cao et al.⁵ and to RGB-D tracking by Thies et al.¹⁹

Thies et al. 2015:¹⁹ Their approach captures face performance in real-time from RGB-D (see Figure 6). While we do not require depth data, results of both approaches are similarly accurate.

11.4. Reenactment evaluation

In Figure 7, we compare our approach against state-of-the-art reenactment by Garrido et al.⁸ Both methods provide highly realistic reenactment results; however, their method is fundamentally offline, as they require all frames of a sequence to be present at any time. In addition, they rely on a generic geometric teeth proxy, which in some frames makes reenactment less convincing. In Figure 8, we compare against the work by Thies et al.¹⁹ Runtime and visual quality are similar for both approaches; however, their geometric teeth proxy leads to an undesired appearance of the reenacted mouth. Thies et al. use an RGB-D camera, which limits the application range; they cannot reenact Youtube videos.

Figure 7. Dubbing: Comparison to Garrido et al.⁸

Figure 8. Comparison of the proposed RGB reenactment to the RGB-D reenactment of Thies et al.¹⁹

12. Limitations

The assumption of Lambertian surfaces and smooth illumination is limiting, and may lead to artifacts in the presence of hard shadows or specular highlights, a limitation shared by most state-of-the-art methods. Scenes with face occlusions by long hair or a beard are challenging. Furthermore, we only reconstruct and track a low-dimensional blend-shape model (76 coefficients), which omits fine-scale static and transient surface details. Our retrieval-based mouth synthesis assumes sufficient visible expression variation in the target sequence. If the sequence is too short, or if the target remains static, we cannot learn the person-specific mouth behavior. In this case, temporal aliasing can be observed, as the target space of the retrieved mouth samples is too sparse. Another limitation is caused by our commodity hardware setup (webcam, USB, and PCI), which introduces a small delay of ≈ 3 frames.

13. Discussion

Our face reconstruction and photo-realistic re-rendering approach enables the manipulation of videos at real-time frame rates. In addition, the combination of the proposed approach with a voice impersonator or a voice synthesis system would enable the generation of made-up video content that could potentially be used to defame people or to spread so-called “fake-news.” We want to emphasize that computer-generated content has been a big part of feature-film movies for over 30 years. Virtually every high-end movie production contains a significant percentage of synthetically generated content (from Lord of the Rings to Benjamin Button). These results are already hard to distinguish from reality and it often goes unnoticed that the content is not real. Thus, the synthetic modification of video clips has been possible for a long time, but it was a time-consuming process and required domain experts. Our approach is a game changer, since it enables editing of videos in real time on a commodity PC, which makes this technology accessible to non-experts. We hope that the numerous demonstrations of our reenactment systems will teach people to think more critically about the video content they consume every day, especially if there is no proof of origin. The presented system also demonstrates the need for sophisticated fraud detection and watermarking algorithms. We believe that the field of digital forensics will receive a lot of attention in the future.

14. Conclusion

The presented approach is the first real-time facial reenactment system that requires just monocular RGB input. Our live setup enables the animation of legacy video footage (for example, from Youtube) in real time. Overall, we believe our system will pave the way for many new and exciting applications in the fields of VR/AR, teleconferencing, or on-the-fly dubbing of videos with translated audio. One direction for future work is to provide full control over the target head. A properly rigged mouth and tongue model reconstructed from monocular input data will provide control over the mouth cavity; a wrinkle formation model will provide more realistic results by adding fine-scale surface detail; and eye-tracking will enable control over the target’s eye movement.

Acknowledgments

We would like to thank Chen Cao and Kun Zhou for the blendshape models and comparison data, as well as Volker Blanz, Thomas Vetter, and Oleg Alexander for the provided face data. The facial landmark tracker was kindly provided by TrueVisionSolution. We thank Angela Dai for the video voice over and Daniel Ritchie for video reenactment. This research is funded by the German Research Foundation (DFG), grant GRK-1773 Heterogeneous Image Systems, the ERC Starting Grant 335545 CapReal, and the Max Planck Center for Visual Computing and Communications (MPC-VCC). We also gratefully acknowledge the support from NVIDIA Corporation for hardware donations.

Figure. Watch the authors discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/face2face

Copyright held by authors/owners. Publication rights licensed to ACM.
Request permission to publish from permissions@acm.org

DOI

10.1145/3292039

January 2019 Issue

Published: January 1, 2019

Vol. 62 No. 1

Pages: 96-104
