When the Microsoft Kinect for Xbox 360 was introduced in November 2010, it was an instant success. Via the Kinect, users can control their Xbox through natural body gestures and commands thanks to a depth camera that enables gesture recognition. In contrast to a conventional camera, which measures the color at each pixel location, a depth camera returns the distance to that point in the scene. Depth cameras make it easy to separate the Xbox user from the background of the room, and reduce the complexities caused by color variation, for example, in clothing.
While the role of the depth camera in the success of the Kinect is well-known, what is less well-known is the innovative computer vision technology that underlies the Kinect's gesture recognition capabilities. The following article by Shotton et al. describes a landmark computer vision system that takes a single depth image containing a person and automatically estimates the pose of the person's body in 3D. This novel method for pose estimation is the key to the Kinect's success.
Three important ideas define the Kinect architecture: tracking by detection, data-driven learning, and discriminative part models. These ideas have their origin in object recognition and tracking research from the computer vision community over the past 10 years. Their development in the Kinect has led to some exciting and innovative work on feature representations and training methods. The resulting system is a dramatic improvement over the previous state of the art.
In order to recognize a user's gesture, the Kinect must track the user's motion in a sequence of depth images. An important aspect of the Kinect architecture is that body poses are detected independently in each frame, without incorporating information from previous frames. This tracking by detection approach has the potential for greater robustness because errors made over time are less likely to accumulate. It is enabled by an extremely efficient and reliable solution to the pose estimation problem.
The challenge of pose estimation, as in other vision problems, is to reliably measure the desired variables, while remaining unaffected by other sources of variability. Body pose is described by a vector of joint angles. When you bend your elbow, for example, you are changing one joint angle. However, the appearance of your elbow in a sequence of depth images is affected by many factors: your position and orientation with respect to the camera, the clothing you are wearing, whether your build is thin or stocky, and so forth. An additional challenge comes from the large number of pose variables. Around 30 joint angles are needed to describe the basic configurations of the human body. If each joint could assume only five positions, this would result in 5^30 possible poses. Fortunately, joints are coupled during coordinated movement, and many achievable poses, such as those found in yoga, are rarely encountered in general settings.
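The combinatorial explosion above is easy to verify with a few lines of arithmetic:

```python
# Back-of-the-envelope count of the pose space described above:
# roughly 30 joint angles, each quantized to only 5 positions.
n_joints = 30
positions_per_joint = 5

total_poses = positions_per_joint ** n_joints
print(total_poses)  # 931322574615478515625, on the order of 10**21
```

Even under this extremely coarse quantization, the pose space is far too large to enumerate, which is why the coupling between joints during coordinated movement matters so much.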
The authors employ data-driven learning to address the tremendous variability in pose and appearance. Motion capture data was used to characterize the space of possible poses: actors performed gestures used in gaming (for example, dancing or kicking) and their joint angles were measured, resulting in a dataset of 100,000 poses. Given a single pose, a simulated depth image can be produced by transferring the pose to a character model and rendering the clothing and hair. By varying body types and sizes, and by sampling different clothing and hairstyles, the authors automatically obtained a huge training dataset of depth images.
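The augmentation recipe above, sampling combinations of pose, body type, clothing, and hairstyle for rendering, can be sketched as follows. All category names and the function below are hypothetical illustrations, not the authors' actual pipeline; the rendering step itself is out of scope here.

```python
import random

random.seed(0)

N_POSES = 100_000                             # mocap poses, per the article
BODY_TYPES = ["thin", "average", "stocky"]    # hypothetical categories
CLOTHING = ["tight", "loose", "robe"]         # hypothetical
HAIRSTYLES = ["short", "long", "tied"]        # hypothetical

def sample_render_spec():
    """Pick one (pose, body, clothing, hair) combination to render into
    a synthetic depth image. Sampling the factors independently multiplies
    the effective size of the training set far beyond the pose count alone."""
    return {
        "pose_id": random.randrange(N_POSES),
        "body": random.choice(BODY_TYPES),
        "clothing": random.choice(CLOTHING),
        "hair": random.choice(HAIRSTYLES),
    }

dataset_specs = [sample_render_spec() for _ in range(10)]
```

The design point is that labels come for free: because each depth image is rendered from a known character model, the ground-truth part label of every pixel is known exactly, with no manual annotation.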
The final idea is the use of discriminative part models to represent the body pose. Parts are crucial. They decompose the problem of predicting the pose into a series of independent subproblems: given an input depth image, each pixel is labeled with its corresponding part, and the parts are grouped into hypotheses about joint locations. Each pixel can be processed independently in this approach, making it possible to leverage the Xbox GPU and obtain real-time performance. This efficiency is enhanced by a clever feature design.
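The "clever feature design" is, in the Shotton et al. paper, a depth-difference feature: the depth image is probed at two offsets around a pixel, with the offsets scaled by the inverse of the depth at that pixel so the response is approximately invariant to how far the user stands from the camera. A minimal NumPy sketch (function name and out-of-bounds convention are illustrative):

```python
import numpy as np

def depth_feature(depth, x, y, u, v):
    """Depth-difference feature in the style of Shotton et al.
    depth : 2D array of distances in arbitrary units
    (x, y): probe pixel; u, v: 2D offsets, scaled here by 1/depth[y, x]
    so that the feature is roughly invariant to the user's distance."""
    d = depth[y, x]

    def probe(offset):
        ox, oy = offset
        px, py = int(x + ox / d), int(y + oy / d)
        if 0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]:
            return depth[py, px]
        return 1e6  # large constant: off-image probes read as background

    return probe(u) - probe(v)
```

Each feature is trivially cheap (two depth lookups and a subtraction) and depends only on the pixel being classified, which is exactly what allows every pixel to be processed independently and in parallel on the GPU.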
The Kinect's impact has extended well beyond the gaming market. It has become a popular sensor in the robotics community, where its low cost and ability to support human-robot interaction are hugely appealing. A survey of the two main robotics conferences in 2012 (IROS and ICRA) reveals that among the more than 1,600 papers, 9% mentioned the Kinect. At Georgia Tech, we are using the Kinect to measure children's behavior, in order to support the research and treatment of autism and other developmental and behavioral disorders.
In summary, the Kinect is a potent combination of innovative hardware and software design, informed by decades of computer vision research. The proliferation of depth camera technology in the coming years will enable new advances in vision-based sensing and support an increasingly diverse set of applications.
©2013 ACM 0001-0782/13/01
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected] or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2013 ACM, Inc.