Title: Audiovisual Persona Reconstruction

Advisors: Steven Seitz and Irena Kemelmaher

Supervisory Committee: Steven Seitz (co-Chair), Irena Kemelmaher (co-Chair), Duane Storti (GSR, ME), and Richard Szeliski (Facebook)

Abstract: How can Tom Hanks come across so differently in "Forrest Gump" and "Catch Me If You Can"? What makes him unique in each of his roles? Is it his appearance? The way he talks? The way he moves? In my thesis, I introduce the problem of persona reconstruction. I define it as a modeling process that accurately represents the likeness of a person, and I propose solutions to this problem with the goal of creating a model that looks, talks, and acts like the recorded person. The specific aspects of persona modeled in this thesis include facial shape and appearance, motion and expression dynamics, the aging process, and speaking style, the last of which is addressed by solving the visual speech synthesis problem.

These goals are highly ambitious. The key idea of this thesis is that using a large amount of unconstrained data makes it possible to overcome many of the challenges. Unlike most traditional modeling techniques, which require a sophisticated capture process, my solutions to these tasks operate only on unconstrained data such as an uncalibrated personal photo and video collection, and thus can be scaled to virtually anyone, even historical figures, with minimal effort.
In particular, I first propose new techniques to reconstruct time-varying facial geometry equipped with expression-dependent texture that captures even minute shape variations such as wrinkles and creases. These techniques combine uncalibrated photometric stereo, novel 3D optical flow, dense pixel-level face alignment, and frequency-based image blending. I then demonstrate a way to drive or animate the reconstructed model with a source video of another actor by transferring the expression dynamics while preserving the likeness of the person. Together, these techniques represent the first system that allows reconstruction of a controllable 3D model of any person from just a photo collection.
Next, I model facial changes due to aging by learning the aging transformation from unstructured Internet photos using a novel illumination-subspace matching technique. I then apply this transformation in an application that takes a photograph of a child as input and produces a series of age-progressed outputs between 1 and 80 years of age. The proposed technique establishes a new state of the art for the most difficult aging case, babies to adults, as demonstrated by an extensive evaluation of age progression techniques in the literature.
Finally, I model how a person talks via a system that can synthesize a realistic video of a person speaking given only input audio. Unlike prior work, which requires a carefully constructed speech database from many individuals, my solution to the visual speech synthesis problem requires only existing video footage of a single person. Specifically, it focuses on a single person (Barack Obama) and relies on an LSTM-based recurrent neural network trained on Obama's footage to synthesize a high-quality mouth video of him speaking. My approach generates compelling and believable videos from audio that enable a range of important applications such as lip-reading for hearing-impaired people, video bandwidth reduction, and creating digital humans, which are central to entertainment applications like special effects in films.
Place: CSE 203

When: Tuesday, June 20, 2017 - 12:00 to 14:00