Visual object recognition, a subdiscipline of machine vision, addresses the problem of finding and identifying objects in images. Research in this area primarily focuses on techniques that use models of specific objects, based on properties such as shape and appearance. Such techniques are referred to as model-based recognition methods, because of the strong reliance on prior models of specific objects. In contrast, human visual recognition is characterized by an ability to recognize novel objects for which the observer has no specific prior model. Such generic recognition involves the ability to perform CATEGORIZATION on the basis of abstract reasoning about objects, such as inferring their form from how they function. While there has been some study of generic object recognition in MACHINE VISION, the primary focus has been on model-based recognition.
Most approaches to model-based object recognition compare an unknown image against stored object models, in order to determine whether any of the models are present in the image. Many techniques perform both recognition and localization: identifying which objects are present in the image and recovering their locations in the image or in the world. Object recognition is often posed as a search problem involving several kinds of search: over possible locations of the object in the image, over possible viewpoints of the observer with respect to the object, and over possible object models. Not all recognition tasks involve all of these kinds of search. For example, recognizing faces in a database of mug shots need not involve search over possible viewpoints, because the pictures are all frontal views.
A number of factors contribute to the difficulty of OBJECT RECOGNITION tasks. One factor is the complexity of the scene: the number of objects in the image, the presence of objects that touch and partly occlude one another, backgrounds that are highly textured or cluttered, and poor lighting conditions. Another factor is the generality of the object models. Objects composed of rigid subparts that can move with respect to one another, such as a pair of scissors, are harder to recognize than rigid objects such as a car. Nonrigid objects, such as a cat, are more difficult still. A third factor is the number of object models that a recognition system must consider. Many systems can handle only a small number of objects, in effect considering each model separately. A fourth factor is the complexity of the viewing transformation that maps the model coordinate frame to the image coordinate frame. For example, if an object can be viewed from an arbitrary 3-D position, then the different views of the object may look very different.
There is a trade-off in current approaches to object recognition: either a system can recognize objects from a small set of models appearing in complex scenes (with clutter and unknown viewpoints), or it can recognize objects from a large set of models appearing in simple scenes (with a uniform background and known viewpoint). The remainder of this article provides a brief overview of some of the major approaches used in object recognition. First, we consider search-based techniques, which operate by comparing local features of the model and image. These techniques are generally limited to a small set of object models, but handle complex images. Then we consider indexing approaches, which operate by computing a key that is used as an index into a large table or database of models. These techniques are generally limited to simple scenes.
Feature-based approaches to object recognition generally operate by recovering a correspondence between local attributes, or features, of an image and an object model. The features are usually geometrical, and are often based on detecting intensity edges in the image (places where there is a large change in image brightness). Brightness changes often correspond to the boundaries of objects or to surface markings on the objects. Local geometrical features can be simple, such as corners, or can involve fitting more complex geometrical primitives, such as quadratic curves. The 2-D geometrical descriptions extracted from an image are compared with geometrical models, which may be either 2-D or 3-D.
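As a minimal illustration of the first step described above, intensity edges can be approximated by thresholding the finite-difference gradient magnitude of an image. The function below is a hypothetical stand-in for a real edge detector (such as Canny's), not a method from the cited literature; the threshold value is an assumption.

```python
import numpy as np

def edge_points(image, thresh=0.4):
    """Return (row, col) coordinates of intensity edges: pixels where
    the finite-difference gradient magnitude exceeds a threshold.
    A crude stand-in for a real edge detector such as Canny's."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    return np.argwhere(mag > thresh)

# A synthetic image: dark background with a bright square.
img = np.zeros((10, 10))
img[3:7, 3:7] = 1.0
pts = edge_points(img)
# The detected points lie along the square's boundary (where brightness
# changes), not in its uniform interior.
```

In a full feature-based system, such edge points would then be grouped into higher-level features (corners, curve segments) before matching against model features.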
Three major classes of feature-matching recognition methods can be identified, based on how the search for possible matches between model and image features is performed: (1) correspondence methods consider the space of possible corresponding features; (2) transformation space methods consider the space of possible transformations mapping the model to the image; and (3) hypothesize and test methods use small corresponding tuples of model and image features to hypothesize a transformation, which is then verified against the remaining features. A more detailed treatment of geometrical search methods can be found in Grimson (1990). In addition, there are geometrical matching methods that make use of more global shape descriptors, such as entire silhouettes of objects (e.g., Kriegman and Ponce 1990).
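The hypothesize-and-test strategy can be sketched for the simplest viewing transformation, a pure 2-D translation: each pairing of a model feature with an image feature hypothesizes a translation, which is then tested by counting how many transformed model points land near image points. This is an illustrative toy, not an implementation from the cited sources; the function, tolerance, and point sets are hypothetical.

```python
import numpy as np

def hypothesize_and_test(model_pts, image_pts, tol=1.0):
    """Toy hypothesize-and-test matcher for a translation-only viewing
    transformation: each (model point, image point) pairing hypothesizes
    a translation, which is verified by counting how many transformed
    model points land within tol of some image point."""
    best_t, best_score = None, -1
    for m in model_pts:
        for d in image_pts:
            t = d - m                      # hypothesized translation
            moved = model_pts + t          # apply it to the whole model
            # verification: model points near their nearest image point
            dists = np.linalg.norm(
                moved[:, None, :] - image_pts[None, :, :], axis=2)
            score = int((dists.min(axis=1) <= tol).sum())
            if score > best_score:
                best_t, best_score = t, score
    return best_t, best_score

model = np.array([[0., 0.], [2., 0.], [0., 3.]])
scene = model + np.array([5., 4.])         # model translated in the image
t, score = hypothesize_and_test(model, scene)
```

Richer transformations (rotation, scale, 3-D pose) require larger tuples of corresponding features to determine each hypothesis, but the hypothesize-then-verify structure is the same.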
Indexing-based approaches to object recognition are based on computing numerical descriptors of an image or portion of an image. These descriptors are then used as keys to index (or hash) into a table of object models. The most effective such methods are based on storing many 2-D views of each object. Such approaches are generally referred to as view-based because they explicitly store images or keys corresponding to each viewpoint from which an object could be seen. Another kind of indexing-based approach to recognition is based on computing invariant descriptions of objects that do not change as the viewpoint changes. The invariant properties of objects are generally geometrical. More information about invariant-based recognition methods can be found in Mundy and Zisserman (1992).
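The indexing idea can be sketched under simplifying assumptions: each stored view is reduced to a small quantized key (here a crude intensity-histogram key, chosen only for illustration and not a key used in the cited work), and keys are hashed into a table mapping key to (object, view). Recognition is then a table lookup rather than a search over all models. All names and images below are hypothetical.

```python
import numpy as np

def view_key(image, bins=4):
    """Crude illustrative key: a coarse intensity histogram,
    normalized and quantized to a tuple of small integers."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    return tuple((hist * bins // max(hist.sum(), 1)).astype(int))

def build_index(views):
    """views: dict mapping (object_name, view_id) -> image array.
    Returns a hash table from quantized key to matching labels."""
    table = {}
    for label, img in views.items():
        table.setdefault(view_key(img), []).append(label)
    return table

views = {
    ("mug", 0): np.full((8, 8), 0.2),
    ("mug", 1): np.full((8, 8), 0.3),
    ("phone", 0): np.full((8, 8), 0.9),
}
index = build_index(views)
query = np.full((8, 8), 0.9)     # resembles the stored phone view
matches = index.get(view_key(query), [])
```

The appeal of this scheme is that lookup cost is essentially independent of the number of stored models, which is why indexing methods scale to large model databases.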
The most successful view-based approaches to object recognition are based on subspace techniques, which use principal components (or eigenvector) analysis to produce keys that form a concise description of a given set of images (e.g., Murase and Nayar 1995). The main advantage of such methods is that they are useful for tasks in which there is a large database of objects to be searched. The main disadvantage is that in general they do not work well with occlusion or with complex scenes and cluttered backgrounds, because the measure of similarity is sensitive to such variation. A different view-based approach is taken by Huttenlocher, Klanderman, and Rucklidge (1993), which compares point sets using a measure of image similarity derived from the Hausdorff distance. This similarity measure is designed to allow for partial occlusion and the presence of background clutter.
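For concreteness, the classical (undirected) Hausdorff distance between finite point sets can be computed as below. Note that Huttenlocher, Klanderman, and Rucklidge (1993) actually use a generalized partial variant, based on ranked rather than maximum distances, to tolerate occlusion and clutter; this sketch shows only the basic distance.

```python
import numpy as np

def directed_hausdorff(A, B):
    """Directed Hausdorff distance h(A, B): the largest distance from
    any point of A to its nearest neighbor in B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return d.min(axis=1).max()

def hausdorff(A, B):
    """Undirected Hausdorff distance: the larger of the two directed
    distances, so it is symmetric in A and B."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = np.array([[0., 0.], [1., 0.], [0., 1.]])
B = np.array([[0., 0.], [1., 0.], [0., 3.]])
# A and B agree on two points; the outlier [0, 3] dominates the distance.
```

Because a single outlying point determines the classical distance, replacing the maximum with a rank-order statistic (the partial Hausdorff distance) is what makes the measure robust to partial occlusion.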
Grimson, W. E. L. (1990). Object Recognition by Computer: The Role of Geometric Constraints. Cambridge, MA: MIT Press.
Huttenlocher, D. P., G. A. Klanderman, and W. J. Rucklidge. (1993). Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9):850-863.
Kriegman, D. J., and J. Ponce. (1990). On recognizing and positioning curved 3-D objects from image contours. IEEE Transactions on Pattern Analysis and Machine Intelligence 12:1127-1137.
Mundy, J. L., and A. Zisserman. (1992). Geometric Invariants in Computer Vision. Cambridge, MA: MIT Press.
Murase, H., and S. K. Nayar. (1995). Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision 14:5-24.