Our visual system helps us carry out our daily business: walking, driving, reading, playing sports, or socializing. It is difficult to think of an activity that does not depend on vision. Our eyes and brain help us by measuring shapes, trajectories, and distances in the world around us, and by recognizing materials, objects, and scenes. How is this done? Can we reproduce these abilities in a machine?
The following paper by Felzenszwalb et al. describes what is currently the best system for detecting object categories (a pedestrian, a bottle, a cat) in images. Like much work in computer vision, their system builds upon insights from a diverse set of areas of science and engineering: biological vision, geometry, signal processing, machine learning, and computer algorithms.
Three ingredients make their system successful. First, objects are described as collections of visually distinctive parts (for example, eyes, nose, and mouth in a face) that appear in a consistent, although not rigid, mutual position, or shape. This idea may be traced back to Fischler and Elschlager,6 although much work was needed to make it practical: for example, making representations invariant to scale, representing the fact that parts are sometimes occluded and thus invisible, and giving shape and occlusion a probabilistic interpretation.2
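In rough terms, a parts-based detector of this kind scores a candidate placement of the object by trading appearance evidence against deformation cost. A simplified form of the score (notation loosely following the deformable part model; bias terms and multiple scales omitted) is:

```latex
\mathrm{score}(p_0, \dots, p_n) \;=\;
\sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
\;-\;
\sum_{i=1}^{n} d_i \cdot \phi_d(\mathit{dx}_i, \mathit{dy}_i)
```

Here $p_0$ is the root placement and $p_1, \dots, p_n$ the part placements, $F_i \cdot \phi(H, p_i)$ measures how well filter $F_i$ matches the image features at placement $p_i$, and the second sum penalizes each part's displacement $(\mathit{dx}_i, \mathit{dy}_i)$ from its preferred anchor position, with learned deformation parameters $d_i$. Detection amounts to maximizing this score over placements.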
The second ingredient is representing parts (eyes, among others) using patterns of local orientations in the image. This simple idea makes a big difference. It turns out that orientation is less sensitive to changes in lighting conditions and viewpoint than pixel values. This observation comes from studying biological vision systems4 and is the foundation of the most successful descriptors for image patches: shape contexts, SIFT, and HOG.1,3,7 The authors here add one twist to the idea: rather than building detectors based on what a part looks like, it is better to build detectors as discriminative classifiers; that is, to optimize their ability to tell the difference between a given part (for example, the head of a pedestrian) and the environment that typically surrounds it (bookshelves, or the shoulders and arms of the pedestrian).
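The essence of such orientation-based descriptors can be sketched in a few lines. The toy function below (illustrative only; it omits the block normalization, interpolation, and gradient details of the actual HOG descriptor) bins gradient orientations, weighted by gradient magnitude, into per-cell histograms; per-cell normalization is what gives the lighting insensitivity mentioned above:

```python
import numpy as np

def orientation_histograms(img, cell=8, bins=9):
    """Toy HOG-style descriptor: per-cell histograms of gradient
    orientation, weighted by gradient magnitude and normalized."""
    gy, gx = np.gradient(img.astype(float))      # image gradients
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation in [0, pi)
    h, w = img.shape
    H = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            b = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            for k in range(bins):
                H[i, j, k] = m[b == k].sum()
            # normalizing each cell makes the descriptor insensitive
            # to overall brightness/contrast changes
            H[i, j] /= np.linalg.norm(H[i, j]) + 1e-6
    return H
```

Because of the normalization, multiplying the image intensities by a constant leaves the descriptor essentially unchanged, which is exactly the robustness to lighting that makes orientation patterns preferable to raw pixel values.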
The third ingredient is an efficient search algorithm, originating with Felzenszwalb's thesis,5 which detects an object in a handful of seconds, focusing computation only on the most promising areas of the image.
Is detecting visual categories a solved problem? The reader will be amused by how poorly our best algorithms work. A quick perusal of Table 1 in Felzenszwalb et al. will reveal that, on a good day, fewer than half of the people are detected in the PASCAL VOC dataset. Boats and birds are even more difficult to find. This is precisely what makes computer vision an exciting field of research today: there is much progress to be made; we are still a few big ideas away from the ultimate design. Twenty years ago we had only nebulous ideas about how to approach visual categorization, and 10 years ago the performance numbers would probably have been in the low single-digit percent.
What is missing? Quite a few things; I will mention a couple. First of all, our models are purely phenomenological, based on statistics of how objects look in 2D images. We do not take into account 3D geometry, nor the properties and materials of surfaces. Second, today's goal is to recognize widely different categories: bottle vs. cat vs. person. There is a whole world of fine distinctions, for example, Anopheles vs. Culex mosquito, Siamese vs. Burmese cat. We do not yet know how to handle such fine-grained classifications. Third, people can learn to recognize new categories with just a few training examples; how many femurs does a medical student need to see to learn the category? Our algorithms must see thousands of training examples to become halfway decent. The mother of all challenges is scaling: there are millions of meaningful visual categories to recognize (10⁵ vertebrate species, 10⁷ insect species, not to speak of shoes, wristwatches, and handbags). We need to develop systems able to train themselves by using information available on the Web, and that are able to tap into the expertise of knowledgeable humans by asking them intelligent questions.
A growing number of talented researchers are hard at work tackling these questions. It is an exciting moment for computer vision. Stay tuned.
3. Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society (2005), 886–893.
4. Edelman, S., Intrator, N., and Poggio, T. Complex cells and object recognition. Unpublished; http://cogprints.org/561/2/199710003.ps.
Copyright © 2013 ACM, Inc.