The analysis of a visual image yields a rich understanding of what is in the world, where objects are located, and how they are changing with time, allowing a biological or machine system to recognize and manipulate objects and to interact physically with its environment. The computational approach to the study of vision explores the information-processing mechanisms needed to extract this important information. The integration of a computational perspective with experimental studies of biological vision systems from psychology and neuroscience can ultimately yield a more complete functional understanding of the neural mechanisms underlying visual processing.
Vision begins with a large array of measurements of the light reflected from object surfaces onto the eye. Analysis then proceeds in multiple stages, each producing increasingly useful representations of information in the scene. Computational studies suggest three primary representational stages. Early representations may capture information such as the location, contrast, and sharpness of significant intensity changes or edges in the image. Such changes correspond to physical features such as object boundaries, texture contours, markings on object surfaces, shadow boundaries, and highlights. In the case of a dynamically changing scene, the early representations may also describe the direction and speed of movement of image intensity changes. Intermediate representations describe information about the three-dimensional (3-D) shape of object surfaces from the perspective of the viewer, such as the orientation of small surface regions or the distance to surface points from the eye. Such representations may also describe the motion of surface features in three dimensions. Visual processing may then proceed to higher-level representations of objects that describe their 3-D shape, form, and orientation relative to a coordinate frame based on the objects or on a fixed location in the world. Tasks such as object recognition, object manipulation, and navigation may operate from the intermediate or higher-level representations of the 3-D layout of objects in the world. (See also MACHINE VISION for a discussion of representations for visual processing.)
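To make these stages concrete, the sketch below shows one way the three representations might be organized as data structures. It is purely illustrative: the class and field names are hypothetical, and the fields simply mirror the quantities mentioned above (edge location, contrast, and orientation; viewer-centered depth and surface orientation; object-centered shape and pose).

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

# Hypothetical containers for the three representational stages described above.

@dataclass
class EdgeMap:                             # early representation
    position: np.ndarray                   # (N, 2) image locations of intensity changes
    contrast: np.ndarray                   # (N,) contrast of each edge
    orientation: np.ndarray                # (N,) local edge orientation (radians)
    velocity: Optional[np.ndarray] = None  # (N, 2) image motion, for dynamic scenes

@dataclass
class SurfaceMap:                          # intermediate, viewer-centered representation
    depth: np.ndarray                      # (H, W) distance from the eye to each surface point
    normal: np.ndarray                     # (H, W, 3) orientation of small surface patches

@dataclass
class ObjectModel:                         # higher-level, object-centered representation
    shape: np.ndarray                      # (M, 3) 3-D shape in an object-based coordinate frame
    pose: np.ndarray                       # (4, 4) position and orientation in the world
```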
Models for computing the early representations of intensity edges typically begin by applying filters that smooth and differentiate the image intensities. Smoothing at multiple spatial scales allows the gross structure of image contours to be represented while preserving the fine detail of surface markings and TEXTURE. The differentiation operation transforms the image into a representation that facilitates the localization of edge contours and the computation of properties such as their sharpness and contrast. Significant intensity changes may correspond to maxima, or peaks, in the first derivative, or to zero-crossings in the second derivative, of the image intensities. Subsequent image analysis may operate on a representation of image contours. Alternative models suggest that later processes operate directly on the result of the filtering stage.
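The following is a minimal sketch of this idea in one dimension, assuming Marr-Hildreth-style edge localization at zero-crossings of the second derivative of a Gaussian-smoothed intensity profile; the function names and the choice of scale are illustrative only.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian used to smooth the intensity profile before differentiation."""
    x = np.arange(-int(3 * sigma), int(3 * sigma) + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def edge_locations(intensity, sigma):
    """Candidate edges: zero-crossings in the second derivative of the smoothed profile."""
    smoothed = np.convolve(intensity, gaussian_kernel(sigma), mode="same")
    second = np.diff(smoothed, n=2)          # discrete second derivative
    signs = np.sign(second)
    # A zero-crossing occurs wherever the sign of the second derivative flips.
    return np.where(signs[:-1] * signs[1:] < 0)[0] + 1

# A blurred step edge centered near index 50 yields a zero-crossing near 50
# (convolution at the array borders may add spurious crossings there).
profile = np.convolve(np.r_[np.zeros(50), np.ones(50)], gaussian_kernel(2.0), mode="same")
print(edge_locations(profile, sigma=3.0))
```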
Several sources of information are used to compute the 3-D shape of object surfaces. Binocular stereo uses the relative location of corresponding features in the images seen by the left and right eyes to infer the distance to object surfaces. Abrupt changes in motion between adjacent image regions indicate object boundaries, while smooth variations in the direction and speed of motion within image regions can be used to recover surface shape. Other cues include systematic variations in the geometric structure of image textures, such as changes in the orientation, size, or density of texture elements; image shading, which refers to smooth variations of intensity that occur as surfaces bend toward or away from a light source; and perspective, which refers to the distortion of object contours that results from the perspective projection of the 3-D scene onto the two-dimensional (2-D) image. (See STRUCTURE FROM VISUAL INFORMATION SOURCES and STEREO AND MOTION PERCEPTION for further discussion of visual cues to structure and form.)
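For the binocular stereo cue in particular, the geometric relationship between disparity and distance can be sketched as follows, assuming a simplified geometry with parallel viewing directions; the baseline, focal length, and numbers in the example are illustrative, not values from the text.

```python
def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Distance to a surface point from the horizontal disparity of its two images,
    assuming two parallel viewing directions separated by a known baseline.
    disparity_px : difference in image position of corresponding features (pixels)
    baseline_m   : separation of the two eyes or cameras (meters)
    focal_px     : focal length expressed in pixels
    """
    return baseline_m * focal_px / disparity_px

# A feature with 8 pixels of disparity, a 6.5 cm baseline, and a 700-pixel focal
# length lies roughly 5.7 m away; larger disparities indicate nearer surfaces.
print(depth_from_disparity(8.0, 0.065, 700.0))
```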
The computation of 3-D structure cannot proceed unambiguously from the 2-D image alone. Models also incorporate physical constraints that capture the typical behavior of objects in the world. For the early and intermediate stages of processing, these constraints are as general as possible. Existing models use constraints based on the following typical behaviors: object surfaces are coherent and typically vary smoothly and continuously from one image location to the next; objects usually move rigidly, at least within small image regions; illumination usually comes from above the observer; and changes in the reflectance properties of a surface (such as its color) usually occur abruptly, while illumination may vary slowly across the image. Models also incorporate the known physics of how the image is formed from the perspective projection of light reflected from surfaces onto the eyes. Computational studies of vision identify appropriate physical constraints and show how they can be built into specific algorithms for computing the image representations.
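As an illustration of how such a constraint can be built into an algorithm, the toy example below fills in a one-dimensional depth profile from two sparse measurements using only the surface-smoothness assumption; the simple relaxation scheme and all names are illustrative rather than drawn from any particular model.

```python
import numpy as np

def fill_depth_with_smoothness(measured_depth, known, iterations=500):
    """Interpolate a 1-D depth profile from sparse measurements by repeatedly
    replacing each unknown value with the average of its two neighbors, a
    relaxation scheme that embodies the surface-smoothness constraint."""
    depth = np.where(known, measured_depth, measured_depth[known].mean())
    for _ in range(iterations):
        neighbor_average = 0.5 * (np.roll(depth, 1) + np.roll(depth, -1))
        depth = np.where(known, measured_depth, neighbor_average)  # keep measured values fixed
    return depth

# Depth is measured only at the two ends (1.0 m and 3.0 m); smoothness fills
# the gap with a gradual ramp rather than an arbitrary or discontinuous surface.
measured = np.zeros(11)
known = np.zeros(11, dtype=bool)
measured[0], known[0] = 1.0, True
measured[10], known[10] = 3.0, True
print(np.round(fill_depth_with_smoothness(measured, known), 2))
```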
Among cues for recovering 3-D structure from 2-D images, the two most extensively studied by computational and biological researchers are binocular stereo and motion. For both stereo and motion measurement, the most challenging computational problem is the correspondence problem. Given a representation of features in the left and right images, or two images displaced in time, a matching process must identify pairs of features in the two images that are projections of the same physical structure in space. Many models attempt to match edge features in the two images. Some models, such as an early model of human stereo vision proposed by MARR and Poggio (Marr 1982), simultaneously match image edge representations at multiple spatial scales. The correspondence of features at a coarse scale can provide a rough 3-D layout of a scene that can guide the correspondence of features at finer scales. Information such as the orientation or contrast of edge features can help identify pairs of similar features likely to correspond to one another. Stereo and motion models also typically use physical constraints such as uniqueness (i.e., features in one image have a unique corresponding feature in the other) and continuity or smoothness (i.e., nearby features in the image lie at similar depths or have a similar direction and speed of motion). Many models incorporate some form of optimization: a solution is found that best satisfies a complex set of criteria based on all of the physical constraints taken together. In the case of motion processing, the analysis of the movement of features in the changing 2-D image is followed by a process that infers the 3-D structure of the moving features. Most computational models of this inference use the rigidity constraint: they attempt to find a rigidly moving 3-D structure consistent with the computed 2-D image motion. (For specific models of stereo and motion processing, see Faugeras 1993; Hildreth and Ullman 1989; Kasturi and Jain 1991; Landy and Movshon 1991; Marr 1982; Martin and Aggarwal 1988; and Wandell 1995.)
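The toy sketch below illustrates the correspondence problem in its simplest form: edge features from the left and right images are paired greedily by similarity of contrast, subject to a uniqueness constraint and a limit on allowable disparity. It is not a model of the algorithms cited above; multi-scale matching, the smoothness constraint, and global optimization are all omitted, and the feature values are invented for the example.

```python
def match_edge_features(left_pos, left_contrast, right_pos, right_contrast,
                        max_disparity=10.0):
    """Greedy sketch of stereo correspondence: pair each left-image edge with the
    unmatched right-image edge of most similar contrast that lies within the
    allowed disparity range (uniqueness: each right edge is used at most once)."""
    matches, used = [], set()
    for i, (xl, cl) in enumerate(zip(left_pos, left_contrast)):
        best_j, best_cost = None, float("inf")
        for j, (xr, cr) in enumerate(zip(right_pos, right_contrast)):
            if j in used or not (0.0 <= xl - xr <= max_disparity):
                continue                     # already matched, or outside disparity range
            cost = abs(cl - cr)              # similarity of edge contrast
            if cost < best_cost:
                best_j, best_cost = j, cost
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j, xl - right_pos[best_j]))  # (left, right, disparity)
    return matches

# Three edges whose image positions shift by 3, 5, and 2 pixels between the views.
left_pos, left_con = [12.0, 30.0, 47.0], [0.9, 0.4, 0.7]
right_pos, right_con = [9.0, 25.0, 45.0], [0.85, 0.45, 0.65]
print(match_edge_features(left_pos, left_con, right_pos, right_con))
```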
Much attention has been devoted to the higher-level problem of object recognition, which requires that a representation derived from a viewed object in the image be matched with internal representations of similar objects stored in memory. Most computational models consider the recognition of objects on the basis of their 2-D or 3-D shape. Recognition is difficult because a given 3-D object can have many appearances in the 2-D image. Most recognition models can be classified into three main approaches. The first assumes that objects have certain invariant properties that are common to all of their views. Recognition typically proceeds in this case by first computing a set of simple geometric properties of a viewed object from image information, and then selecting an object model that offers the closest fit to the set of observed property values. The second approach focuses on the decomposition of objects into primitive, salient parts. In this case, models first find primitive parts in an image, and then identify objects on the basis of the detected parts and their spatial arrangement. The best-known model of this type was proposed by Biederman (1985; see Ullman 1996). The third major approach to object recognition uses a process that explicitly compensates for the transformation between a viewed object and its stored model. One example of this approach, proposed by Ullman (1996), first computes the geometric transformations that best explain the mapping between a viewed object and each object model in a database. A second stage then recognizes the object by finding which combination of object model and transformation best matches the viewed object. (Some specific models of recognition are described in Faugeras 1993 and Ullman 1996.)
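A minimal sketch of the third, alignment-style approach appears below. It assumes the strong simplification that point correspondences between the viewed object and each stored model are already known, fits a 2-D affine transformation by least squares, and picks the model whose transformed points best match the view; the model library and point sets are invented for the example.

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares 2-D affine transformation mapping model points onto image points."""
    A = np.hstack([model_pts, np.ones((len(model_pts), 1))])    # rows of [x, y, 1]
    params, *_ = np.linalg.lstsq(A, image_pts, rcond=None)      # 3 x 2 transformation
    return params

def recognize(image_pts, model_library):
    """Alignment-style recognition: fit a transformation for every stored model,
    then report the model whose aligned points best match the viewed object."""
    best_name, best_err = None, float("inf")
    for name, model_pts in model_library.items():
        T = fit_affine(model_pts, image_pts)
        aligned = np.hstack([model_pts, np.ones((len(model_pts), 1))]) @ T
        err = np.mean(np.linalg.norm(aligned - image_pts, axis=1))
        if err < best_err:
            best_name, best_err = name, err
    return best_name, best_err

# A square seen after a 30-degree rotation and a translation is still matched to
# the stored square model, because the fitted transformation compensates for the view.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
wedge = np.array([[0, 0], [1, 0], [0.5, 1], [0.4, 0.9]], dtype=float)
angle = np.deg2rad(30)
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
viewed = square @ R.T + np.array([2.0, 1.0])
print(recognize(viewed, {"square": square, "wedge": wedge}))
```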
Biederman, I. (1985). Human image understanding: Recent research and a theory. Computer Vision, Graphics, and Image Processing 32:29-73.
Faugeras, O. (1993). Three-Dimensional Computer Vision: A Geometric Viewpoint. Cambridge, MA: MIT Press.
Haralick, R. M., and L. G. Shapiro. (1992). Computer and Robot Vision. 2 vols. Reading, MA: Addison-Wesley.
Hildreth, E. C., and S. Ullman. (1989). The computational study of vision. In M. Posner, Ed., Foundations of Cognitive Science. Cambridge, MA: MIT Press, pp. 581-630.
Horn, B. K. P. (1989). Shape from Shading. Cambridge, MA: MIT Press.
Kasturi, R., and R. C. Jain, Eds. (1991). Computer Vision: Principles. Los Alamitos, CA: IEEE Computer Society Press.
Landy, M. S., and J. A. Movshon, Eds. (1991). Computational Models of Visual Processing. Cambridge, MA: MIT Press.
Marr, D. (1982). Vision. San Francisco: Freeman.
Martin, W. N., and J. K. Aggarwal, Eds. (1988). Motion Understanding: Robot and Human Vision. Boston: Kluwer.
Ullman, S. (1996). High-level Vision: Object Recognition and Visual Cognition. Cambridge, MA: MIT Press.
Wandell, B. A. (1995). Foundations of Vision. Sunderland, MA: Sinauer.