Imagine taking a virtual flight over the Yosemite Valley. You are in control of the flight: you soar up through the clouds, dive deep into the ravines, and fly in whichever direction you please, all by simply moving your hand through the air. No gloves to wear and no complex keyboard commands to remember. You have absolutely no devices to control; you point a finger, pretend your hand is a fighter jet, and just fly.
Today's home computers are becoming increasingly powerful while also becoming very affordable. Complex real-time animations that required high-end graphics workstations a few years ago can now be done on home computers equipped with inexpensive graphics cards. A whole suite of interactive 3D applications is now accessible to the average user. Examples range from molecular simulators to 3D presentation tools that allow manipulation of virtual objects. However, what has not changed is the complex interface to these applications. 3D fly-thru's, for example, require intricate maneuvers and learning the input controls is extremely painstaking unless, of course, you are a teenager. The problem is with the way we communicate with computers.
Traditional computer interfaces (keyboard, mouse, joystick) are not always appropriate for interacting with 3D applications. First of all, they control too few parameters. For instance, while placing an observer in a 3D space requires six degrees of freedom, a mouse allows us to control only two. The issue here is not only the number of parameters but also the ease with which these parameters can be manipulated. Certain new devices like the SpaceBall provide more parameters but their control involves actions like pulling and twisting, which are not quite natural.
One way to provide intuitive 3D controls is to allow users to mimic natural gestures like pointing and grasping. Users can then interact with a virtual world in the same way they interact with the real world. Such controls are possible with devices like the DataGlove, but these devices are too restrictive. Merely wearing a glove, for instance, makes it clumsy to perform mundane tasks like holding a coffee mug.
Speech recognition algorithms are very useful for certain tasks but they cannot be used to input spatial parameters that have to be varied quickly and continuously. In the 3D fly-thru example, for instance, it would be very difficult to pilot the flight with discrete voice commands. However, it would be both easy and natural to manipulate the flight parameters by moving the hand in 3D. Such control is possible if the computer can see the user's hand and determine the hand's configuration in real time.
A computer can "see" a scene using one or more video cameras, but making it understand what it is seeing is a complex task. Visual information processing requires a great deal of computational power. This is true not only for machines but also for humans, in whom visual processing occupies a large part of the cerebral cortex. However, with the development of newer algorithms and faster computers, it is becoming practical to perform real-time tasks on the computer using visual input.
This article describes a vision-based input interface system developed at Bell Labs. The system uses two video cameras focused on the desktop, which can be mounted on the computer monitor or attached to the ceiling. The common field of view of the two cameras defines a 3D volume that serves as the 3D analogue of the mousepad. Users can control appropriate applications by moving their hands in this 3D region. Figure 1 shows an example where the user controls a virtual robot hand using a grasping gesture.
Our system has been designed specifically to aid tasks like 3D navigation, object manipulation and visualization. The system operates in real time at 60Hz and recognizes three simple gestures that can be treated as input commands. In addition, the system tracks the movement of the user's thumb and index finger. For each of these two fingers the system computes a five-parameter pose that consists of the 3D position (X, Y, Z) of the fingertip and the azimuth and elevation angles (α, ε) of the finger's axis. When the interface is connected to an application program, the recognized gesture is mapped to a command or action, while the pose parameters give the numerical arguments of the command. For example, the gesture shown in Figure 1 invokes a "Move Gripper" command that controls a simulated robot gripper whose position, orientation and jaw width are given by the pose parameters.
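The mapping from a recognized gesture to a command, with the pose supplying the arguments, can be sketched as follows. This is only an illustration under stated assumptions: the names `Gesture`, `FingerPose`, `Frame`, `dispatch` and `move_gripper` are hypothetical and do not come from the system itself, and binding the "Move Gripper" command to the point gesture is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class Gesture(Enum):
    POINT = "point"
    REACH = "reach"
    CLICK = "click"
    GROUND = "ground"  # any other hand shape, or no hand in view


@dataclass(frozen=True)
class FingerPose:
    """Five-parameter pose of one finger: fingertip position plus axis angles."""
    x: float
    y: float
    z: float
    azimuth: float    # degrees
    elevation: float  # degrees


@dataclass(frozen=True)
class Frame:
    """One recognition result (hypothetical container, not the authors' API)."""
    gesture: Gesture
    index: Optional[FingerPose] = None
    thumb: Optional[FingerPose] = None


Handler = Callable[[Frame], str]


def dispatch(frame: Frame, handlers: Dict[Gesture, Handler]) -> Optional[str]:
    """Look up the command bound to the gesture and hand it the pose arguments."""
    handler = handlers.get(frame.gesture)
    return handler(frame) if handler is not None else None


def move_gripper(frame: Frame) -> str:
    """Illustrative command: formats the index-finger pose as its arguments."""
    p = frame.index
    return f"MoveGripper(x={p.x}, y={p.y}, z={p.z}, az={p.azimuth}, el={p.elevation})"


# Assumed binding for illustration only.
handlers = {Gesture.POINT: move_gripper}
```

An application would call `dispatch` once per processed frame; gestures with no bound handler, such as ground here, are simply ignored.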
The gestures recognized by the system are shown in Figure 2. They include:
- Point: Extended index finger (thumb may or may not be extended); other fingers are closed.
- Reach: All fingers stretched out.
- Click: Quick bending of pointing finger (like pulling the trigger of a gun).
All other gestures (as well as the absence of the hand in the scene) are referred to as ground. Our input interface can be used with most applications that require smooth multidimensional control. These include computer games, virtual fly-thru's and 3D design tools. In the following, we describe a few applications. The main themes are data visualization and data manipulation.
3D Virtual Fly-thru
This application allows the user to fly over a graphically generated terrain. The terrain, in our case, is generated using actual elevation data of the Yosemite Valley in California. The user stretches out the thumb and the pointing finger and imitates flying by moving the hand. The velocity of flight is controlled by the position of the pointing finger along a line that runs from the near end of the desk to the far end. The velocity is zero if the fingertip is at the center of the desktop and increases as the user pushes his hand forward. When the user pulls the hand backward, the velocity decreases to zero and then becomes negative, allowing backward motion. If the hand is held stationary at a particular location, the user will continue to fly at a constant speed. The reach gesture is mapped to the stop command: the user simply opens the hand to stop the flight.
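The velocity rule above can be written as a small function. This is a minimal sketch, assuming a linear relationship between fingertip position and speed (the text only says that the velocity increases as the hand moves forward); the function name, the `gain` parameter and the coordinate convention are hypothetical.

```python
def flight_velocity(finger_y: float, desk_center_y: float,
                    gain: float, reach: bool = False) -> float:
    """Signed flight speed from the fingertip position along the near-far desk axis.

    finger_y      -- fingertip coordinate along the near-to-far line of the desk
    desk_center_y -- coordinate of the desk center, where the speed is zero
    gain          -- assumed linear scale factor (speed units per length unit)
    reach         -- True when the reach gesture (open hand) is shown: stop
    """
    if reach:
        return 0.0
    # Forward of center gives positive speed, behind center gives negative speed.
    return gain * (finger_y - desk_center_y)
```

Because the speed depends only on where the fingertip is, holding the hand still keeps the speed constant, which matches the behavior described in the text.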
The direction of the pointing finger controls the direction of the flight. This control is incremental and is somewhat like controlling a car's steering wheel: when the finger points away from the zero direction, the flight path turns until the finger is straightened. This means that, to reverse the direction of flight, the user does not have to point backward; he points at a convenient angle and turns around smoothly. The pitch and yaw angles of the flight are proportional to the elevation and azimuth angles of the pointing finger. The roll is determined by the line joining the two fingertips.
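One reading of this steering rule is rate control: the finger angles set how fast the heading changes rather than the heading itself. The sketch below follows that reading and is not the authors' code. The names `Attitude`, `steer` and `roll_from_fingertips`, the `rate_gain` parameter, and the axis convention (X lateral, Y along the pointing direction, Z up, with the index fingertip at a larger X than the thumb when the hand is level) are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point3 = Tuple[float, float, float]  # (X lateral, Y forward, Z up), assumed


@dataclass(frozen=True)
class Attitude:
    yaw: float = 0.0    # degrees
    pitch: float = 0.0  # degrees
    roll: float = 0.0   # degrees


def roll_from_fingertips(index_tip: Point3, thumb_tip: Point3) -> float:
    """Roll angle (degrees) of the line joining the fingertips, seen in the XZ plane."""
    dx = index_tip[0] - thumb_tip[0]
    dz = index_tip[2] - thumb_tip[2]
    return math.degrees(math.atan2(dz, dx))


def steer(att: Attitude, azimuth: float, elevation: float,
          roll: float, rate_gain: float, dt: float) -> Attitude:
    """Advance the flight attitude by one time step of length dt seconds.

    The finger's azimuth and elevation act as turn rates, so the path keeps
    turning while the finger points away from the zero direction and holds
    its heading once the finger is straightened.
    """
    return Attitude(
        yaw=att.yaw + rate_gain * azimuth * dt,
        pitch=att.pitch + rate_gain * elevation * dt,
        roll=roll,
    )
```

With this scheme a modest finger angle held for long enough turns the flight all the way around, so the user never needs to point backward.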
Figure 3 shows three snapshots through a flight. In Figure 3a the user is controlling the roll of the flight by rolling his hand and in Figure 3b, he is diving into a canyon by pointing his finger down. In Figure 3c, he is simply cruising through the canyon. Note that this application is not limited to just flying over various terrain. In general, one can navigate any 3D database while zooming in and out. One can then enter smaller data blocks and query or modify their properties. Modifying data properties calls for data interaction, which is what the next application does.
3D Scene Composer
This application allows users to produce and interact with complex 3D scenes by selecting and manipulating simple object primitives. The system has the following features:
- It is entirely gesture-driven.
- The user can draw curves and ribbons in 3D by tracing out the curve or ribbon with a finger.
- The user can select objects from a palette and place them in the scene using gestures.
- The user can drive a robot hand to pick up, move, and rotate objects in the scene. This allows the user to build complex objects from simpler primitives.
- The user can translate and rotate the entire 3D scene and look at it from different vantage points.
- The user can switch between modes either using direct gesture commands, or by picking the desired mode from a menu. Menu picking is also controlled by gestures.
Figure 4a shows the 3D Composer with a few user-selected objects. The mesh in the figure denotes a wall surrounding the virtual world, and shadows of the objects are projected onto this wall to provide a better sense of depth. Overlaid on the editor window is a menu window that allows the user to select from one of many operating modes. For example, the "objects-mode" allows the user to select an object primitive from a palette of primitives and the "gripper-mode" allows the user to control the robot hand and move/rotate the object primitives. The "draw-mode" allows the user to draw curves and ribbons in 3D by tracing out the curve or ribbon with the pointing finger (see Figure 4b). To select an item from the menu, the user moves the cursor over the item with the point gesture and then opens the hand (reach gesture) to activate the item.
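The point-then-reach menu selection can be sketched as a small state machine. This is an illustration only: the class `MenuPicker`, the rectangle layout of menu items and the rule that a selection fires only on the transition from point to reach are assumptions, not details given in the text.

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def item_under_cursor(cursor: Tuple[float, float],
                      items: Dict[str, Rect]) -> Optional[str]:
    """Return the name of the menu item containing the cursor, if any."""
    cx, cy = cursor
    for name, (x0, y0, x1, y1) in items.items():
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return name
    return None


class MenuPicker:
    """Point moves the cursor; opening the hand (reach) activates the item."""

    def __init__(self, items: Dict[str, Rect]) -> None:
        self.items = items
        self.cursor = (0.0, 0.0)
        self.prev_gesture = "ground"

    def update(self, gesture: str,
               fingertip_xy: Optional[Tuple[float, float]] = None) -> Optional[str]:
        """Process one frame; return the activated item name or None."""
        selected = None
        if gesture == "point" and fingertip_xy is not None:
            self.cursor = fingertip_xy
        elif gesture == "reach" and self.prev_gesture == "point":
            # Fire once on the point-to-reach transition, not while the hand stays open.
            selected = item_under_cursor(self.cursor, self.items)
        self.prev_gesture = gesture
        return selected
```

Triggering on the transition rather than on every reach frame keeps an open hand from activating the same item repeatedly at the frame rate.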
Figures 4c and 4d illustrate the control of the robot hand. The distance between the two jaws is variable and is proportional to the distance between the fingertips. The two jaws are forced to be parallel; their azimuth and elevation angles are determined by those of the pointing finger and their roll angle is determined by the line joining the fingertips. This is an example of a seven degrees-of-freedom (7-DOF) control where the user can translate and rotate an object while simultaneously changing its size. This application allows users to build complex scenes from simpler primitives. Two examples are shown in Figures 4e and 4f.
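The seven gripper parameters can be derived from the two fingertip positions and the pointing finger's angles, as in the sketch below. Several details are assumptions made for illustration: placing the gripper at the midpoint of the fingertips, the axis convention (X lateral, Y forward, Z up), the `width_gain` scale factor, and the names `GripperPose` and `gripper_from_fingers`.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point3 = Tuple[float, float, float]  # (X lateral, Y forward, Z up), assumed


@dataclass(frozen=True)
class GripperPose:
    """Seven degrees of freedom: position, orientation and jaw width."""
    x: float
    y: float
    z: float
    azimuth: float    # degrees, taken from the pointing finger
    elevation: float  # degrees, taken from the pointing finger
    roll: float       # degrees, from the line joining the fingertips
    jaw_width: float  # proportional to the fingertip distance


def gripper_from_fingers(index_tip: Point3, thumb_tip: Point3,
                         index_azimuth: float, index_elevation: float,
                         width_gain: float = 1.0) -> GripperPose:
    """Map the tracked thumb and index finger to a parallel-jaw gripper pose."""
    dx = index_tip[0] - thumb_tip[0]
    dz = index_tip[2] - thumb_tip[2]
    # Assumed: the gripper sits midway between the two fingertips.
    mid = tuple((a + b) / 2.0 for a, b in zip(index_tip, thumb_tip))
    return GripperPose(
        x=mid[0], y=mid[1], z=mid[2],
        azimuth=index_azimuth,
        elevation=index_elevation,
        roll=math.degrees(math.atan2(dz, dx)),
        jaw_width=width_gain * math.dist(index_tip, thumb_tip),
    )
```

Both jaws share one orientation in this sketch, which reflects the constraint that the jaws are kept parallel.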
Modern computer games offer rich 3D environments for players to interact with and our system provides a natural and intuitive medium for such interactions. Even in games where the navigation is limited to two dimensions, our system provides an interesting alternative to traditional devices. For example, we have developed a one-camera setup with which users can play 2D games. In this setup, the system tracks the pointing finger in 2D and estimates three parameters, namely, the orientation of the finger and the (X, Y) coordinates of the tip. This interface allows the user to navigate corridors by pointing, open doors by opening the hand (reach gesture) and shoot at aliens by simply clicking the finger.
A number of users have tried the preceding applications and consistently agree that our interface is much more natural to use than traditional interfaces.
Accuracy, Stability, and Performance
To estimate the system's accuracy and usability, we conducted over 200 qualitative and quantitative trials with users of different age, gender, skin complexion and hand size. We used the three static gestures: "point," "reach," and "ground," where "ground" was represented by a closed fist. Users were asked to show one gesture at a time, and to move the hand and fingers without changing the gesture. The system then counted the answers that disagreed with the shown gesture. The resulting error rate was on the order of 1/500, with most errors attributable to motion blur. The errors increased with the angle between the normal to the palm and the midline between the axes of the cameras.
The reliability of the "click" recognition was measured by counting the errors manually. The results varied widely among users. Some users easily reached a 90% recognition rate, while for others it was worse than 50%. However, most users could perfect this gesture within a few minutes of training.
Stability of the pose estimation was computed as a standard deviation over short (35 sec) epochs, with the hand held steady. The planar jitter was about one pixel in position and less than half a degree in the angle θ. The jitter in 3D pose varied with the position and orientation of the hand. In the best configuration (hand held horizontally, pointing ahead), the jitter was less than 2 degrees in orientation and under 5mm in height. Figure 5 illustrates the stability of the 3D position. A circle and a figure "8" were drawn in the XZ plane (vertical plane normal to the pointing direction) using the "point" gesture, with the hand in the air. The raw data points shown in the figure demonstrate the precision of the input.
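The jitter measure described at the start of this section, a standard deviation over short epochs of a steadily held pose, could be computed along these lines. This is a sketch under assumptions: the function name `epoch_jitter`, the use of non-overlapping epochs and the choice of the population standard deviation are not specified in the text.

```python
import statistics
from typing import List, Sequence


def epoch_jitter(samples: Sequence[float], rate_hz: float,
                 epoch_s: float) -> List[float]:
    """Standard deviation of one pose parameter over consecutive epochs.

    samples -- one pose parameter (for example fingertip height) per frame,
               recorded while the hand is held steady
    rate_hz -- frame rate of the recording
    epoch_s -- epoch length in seconds
    Trailing samples that do not fill a whole epoch are ignored.
    """
    n = int(round(rate_hz * epoch_s))
    if n < 1:
        raise ValueError("epoch must contain at least one sample")
    return [
        statistics.pstdev(samples[i:i + n])
        for i in range(0, len(samples) - n + 1, n)
    ]
```

Running it separately on each pose parameter gives per-parameter jitter figures that can be compared across hand positions and orientations.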
As for usability, fatigue can become an issue if the hand is held above the table for extended periods of time. This is not a problem for applications like the fly-thru where the arm rests on the table and wrist movements control flight parameters. But, in applications involving object manipulation where the hand mimics real actions like grasping and moving, the hand is usually held above the table and performing such actions continuously over long periods of time can tire the hand. In these cases, the elbow can rest on the table and provide some support.
Our gesture recognition and hand tracking algorithm is computationally inexpensive. Running on a 400MHz Pentium, the software processes video streams from two cameras at the rate of 60Hz with over 50% of CPU power to spare (the 60Hz rate is imposed by the video hardware). The system has been field tested and is ready for use. In addition to the computer, the system requires a pair of cameras that cost approximately $100 each.
In order to make the system fast and robust, we have assumed some control over the environment. Specifically, the system requires a high-contrast stationary background and stable ambient illumination. The most noticeable restriction in usability is the limit on the angular ranges. These ranges are determined by camera placement and, in our prototype, the range of the elevation angle is about ±40° and that of the roll angle is about ±30°. One way to extend this limit is to add more cameras such that, at any given time, there are at least two cameras that give a good view.
Gesture recognition technology is an active area of research today and a number of articles have been published on this topic. A pioneer in this area is Myron W. Krueger, whose book Artificial Reality II  describes his vision of human-computer interaction. The Proceedings of the International Conference on Face and Gesture Recognition contains recent articles in this area and a literature review can be found in . The gesture-based interfaces reported in the literature differ in many aspects, ranging from the type of applications they address to the accuracy and the number of parameters they control. Another critical difference among these systems is the computational speed: some of them run in real time and others do not. One excellent real-time system is Sivit, developed by Maggioni  and his colleagues at Siemens. Visit the Web site www.atd.siemens.de/news/0398038e.html for more information on Sivit.
Video input interfaces based on computer vision techniques introduce natural and non-confining means for interacting with complex 3D applications. The system we described here offers robust and precise control of many 3D applications using hand gestures. However, much research needs to be done to fully utilize the potential of such interfaces. A system that combines speech and vision will provide a more complete answer to the problem of designing natural interfaces.
The technical aspects of the system described in this article can be found in  and . An online version of  containing some video clips is available at www.acm.org/sigmm/MM98/electronic/_proceedings/segen/index.html.
©2000 ACM 0002-0782/00/0700 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.