Structure from Visual Information Sources

Looking about, it is obvious to us that the surrounding world contains many objects, each with a particular identity and location in the environment. The primate nervous system effortlessly determines both the structure of an object and its location from multiple visual sources. The motion of the object, its reflectance and obscuring of light, and the fact that we view it with two eyes are all combined to form its internal representation in the brain. These sources of visual information have been under intense scrutiny for more than a century, beginning with the work of the physicist and physiologist HELMHOLTZ (Helmholtz 1962). However, only relatively recently has it been realized that there are substantial differences in how the primate brain encodes information from these sources to derive the shapes of objects. This development was driven by the psychophysical, physiological, and anatomical identification of two VISUAL PROCESSING STREAMS in VISUAL CORTEX, one specialized for object shape and one for SPATIAL PERCEPTION. In the analysis of the spatial environment, large portions of the visual field are typically analyzed by neurons in a pathway leading to the parietal lobe. In contrast, the analysis of object shape requires fine details of each object to be integrated over space and time, and occurs in the temporal cortex.

There are three major cues for the analysis of object shape. The most thoroughly studied is derived from motion. Almost two decades before the scientific publications cited here, Ginger Rogers and Fred Astaire exploited the illusion of depth from motion in the black-and-white film The Gay Divorcee (1934). The shadow from a cardboard cutout of dancers rotating on a phonograph turntable fooled a gigolo into thinking the young couple was pirouetting behind closed doors. A number of papers by the Gestalt psychologists (see GESTALT PERCEPTION) thoroughly explored the ability to extract 3-D shape-from-motion, typified by the work of Gibson (1966) and Wallach and O'Connell (1953). Subjects were able to reconstruct the 3-D shape of a bent wire from its shadow only when the wire was rotated. Further studies found that 3-D structural cues such as the lengths of segments and the angles between them could be used to extract the 3-D shape, as reviewed by Ullman (1979). These studies provided a number of different explanations for the extraction of structure-from-motion, pitting purely motion cues against various recognizable form cues. This bottom-up vs. top-down controversy was further examined using computer-generated displays in which the motion of each element could be individually controlled (Ullman 1979). These studies have shown that pure motion is sufficient to extract form information in both human and nonhuman primates (Siegel and Andersen 1988), although they do not explicitly exclude the possibility that form cues may supplement motion.
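The claim that pure motion carries depth information can be illustrated with a small numerical sketch. It is not Ullman's algorithm or the stimulus code of any cited study. It assumes random dots on a rotating cylinder, orthographic projection, and a known vertical rotation axis and rotation rate, and all function names are hypothetical. Under those assumptions the horizontal image velocity of each dot, per unit rotation, equals the dot's depth.

```python
import math
import random

def rotate_about_y(point, angle):
    """Rotate a 3-D point (x, y, z) about the vertical (y) axis by `angle` radians."""
    x, y, z = point
    return (x * math.cos(angle) + z * math.sin(angle),
            y,
            -x * math.sin(angle) + z * math.cos(angle))

def project(point):
    """Orthographic projection: discard depth, keep image coordinates (x, y)."""
    return (point[0], point[1])

def depth_from_motion(points, d_angle=1e-3):
    """Estimate each dot's depth from its horizontal image velocity.

    For rotation about the y axis, d(x_image)/d(angle) equals z, so a
    central difference of the projected x position approximates depth.
    Assumes the rotation axis and rotation step are known.
    """
    depths = []
    for p in points:
        x_fwd = project(rotate_about_y(p, +d_angle))[0]
        x_back = project(rotate_about_y(p, -d_angle))[0]
        depths.append((x_fwd - x_back) / (2.0 * d_angle))
    return depths

def random_cylinder_dots(n, radius=1.0, height=2.0, seed=0):
    """Scatter n dots on the surface of an upright cylinder (illustrative stimulus)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        y = rng.uniform(-height / 2.0, height / 2.0)
        dots.append((radius * math.sin(theta), y, radius * math.cos(theta)))
    return dots

dots = random_cylinder_dots(200)
estimated = depth_from_motion(dots)
# Largest discrepancy between motion-derived depth and true depth.
max_error = max(abs(est - p[2]) for est, p in zip(estimated, dots))
```

In this sketch the sign of the recovered depth depends on the assumed direction of rotation. If the direction is unknown, the same image motion is consistent with a depth-reversed object rotating the opposite way.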

Another source of visual information arises from the horizontal separation between the two eyes (figure 1). The binocular depth effect is seen with old-style Wheatstone stereoscopes, in which the scene leaps into depth when viewed with both eyes (Wade and Ono 1985). Sufficient information exists in the disparity between the two images on the RETINA to provide depth profiles. As in the study of structure-from-motion, the issue arises as to whether recognizable details are needed to extract structure-from-disparity. Julesz (1995) generated computer displays in which all visual cues except disparity were removed, using stimuli that appeared to each eye as random visual noise. The fusion of these random-dot stereograms demonstrated that depth could be derived unambiguously from the disparity of the retinal images.
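The construction of a random-dot stereogram can be sketched in a few lines. This is a minimal illustration of the general technique, not Julesz's original procedure. The image size, square size, disparity, and the function name are assumptions chosen for the example. Each eye's image on its own is random noise; the only difference between the two is a horizontal shift of a central square.

```python
import random

def random_dot_stereogram(size=64, square=24, disparity=4, seed=0):
    """Return (left, right, (lo, hi)) for a random-dot stereogram.

    `left` and `right` are size x size grids of 0/1 dots. A central
    square spanning rows and columns lo..hi-1 is shifted leftward by
    `disparity` pixels in the right eye's image.
    """
    rng = random.Random(seed)
    left = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    right = [row[:] for row in left]          # start as an identical copy
    lo = (size - square) // 2
    hi = lo + square
    for y in range(lo, hi):
        # Copy the square from the left image into shifted positions.
        for x in range(lo, hi):
            right[y][x - disparity] = left[y][x]
        # Fill the strip uncovered by the shift with fresh random dots.
        for x in range(hi - disparity, hi):
            right[y][x] = rng.randint(0, 1)
    return left, right, (lo, hi)

left_image, right_image, (lo, hi) = random_dot_stereogram()
```

Neither image contains a visible square. With the shift direction used here, which under the usual convention corresponds to crossed disparity, the square should appear in front of the background when the two images are fused.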

The third major source of information about the shape of an object arises from the reflectance of light from its surface (see figure 1). Different surface characteristics provide shape information. Specular highlights arise from shiny surfaces and may accentuate regions of high curvature, while color may also help in determining object shape. A well-studied luminance cue is shape-from-shading, in which light falls across an object so that part of the object is illuminated while part remains in shadow. Unlike structure-from-motion and structure-from-disparity, a top-down assumption that the light source is above the object is needed to explain much of the psychophysical data.

Figure 1. In both pairs of figures, the shape is easily determined from the shading cues, although the lower figure incorrectly appears to have a ball at one end. This is due to the expectation that all light sources are above. The pairs of figures may be fused by using a stereoviewer or by free-fusing to provide unambiguous depth cues. In the upper pair, where the stereo and shape-from-shading cues agree, the right end of the tube is correctly seen as an opening. In the lower pair of figures, the disparity cues and the shape-from-shading cues are in opposition. Even with the presumably unambiguous disparity cues, the ball remains at the end of the tube. The shape-from-shading cues are dominant, suggesting that they are processed prior to disparity. Studies in which one type of shape cue is pitted against another are often used in psychophysics, with the actual physiological measurements lagging behind.

An assumption, such as the invocation of a highly elevated light source in structure-from-shading, is called a constraint by those in COMPUTATIONAL NEUROSCIENCE. Constraints are often invoked in shape analysis because the raw measurements of motion, disparity, or luminance are insufficient to define objects in the world (MARR 1982; Poggio and Koch 1985). This insufficiency arises because too many different objects can give rise to the same measured motion, disparity, or shaded images. The constraints may be explicit or implicit in the theories, algorithms, or implementations (Marr 1982) that solve problems of shape recognition. For example, given a black-and-white photograph of an egg, most persons assume that the egg is lit from above, and thus the egg's surface is perceived as convex. In fact, if the photograph is simply turned over, most observers will describe the surface as concave, their perception seemingly fixed on the idea that the light source is overhead (Ramachandran 1988). The rigidity of an object is often assumed in the analysis of structure-from-motion (Ullman 1979; Bennett and Hoffman 1985). It is still unknown how constraints are expressed in the brain and whether they are innate or develop through experience.
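The reason a light-source constraint is needed can be shown with a small numerical sketch. It assumes a simple Lambertian (matte) reflectance model, a Gaussian bump as the surface, and two illustrative light directions. None of these choices come from the cited studies, and the function names are hypothetical. Under these assumptions a convex bump lit from above and a concave dent lit from below produce the same image, so the shading alone cannot decide between them.

```python
import math

def lambertian_shading(height_fn, light, x, y, eps=1e-5):
    """Lambertian intensity at (x, y) for the surface z = height_fn(x, y).

    The viewer looks down the z axis. `light` is a vector pointing
    from the surface toward the light source.
    """
    # Surface slopes by central differences.
    dzdx = (height_fn(x + eps, y) - height_fn(x - eps, y)) / (2.0 * eps)
    dzdy = (height_fn(x, y + eps) - height_fn(x, y - eps)) / (2.0 * eps)
    # Unnormalized surface normal for z = f(x, y).
    nx, ny, nz = -dzdx, -dzdy, 1.0
    n_len = math.sqrt(nx * nx + ny * ny + nz * nz)
    lx, ly, lz = light
    l_len = math.sqrt(lx * lx + ly * ly + lz * lz)
    cosine = (nx * lx + ny * ly + nz * lz) / (n_len * l_len)
    return max(0.0, cosine)   # surfaces facing away from the light are dark

def bump(x, y):
    """Convex surface bulging toward the viewer."""
    return math.exp(-(x * x + y * y))

def dent(x, y):
    """Concave surface: the same profile pushed away from the viewer."""
    return -bump(x, y)

light_above = (0.0, 1.0, 1.0)    # +y is "up" in the image
light_below = (0.0, -1.0, 1.0)

grid = [(-1.5 + 0.25 * i, -1.5 + 0.25 * j) for i in range(13) for j in range(13)]
image_bump_lit_above = [lambertian_shading(bump, light_above, x, y) for x, y in grid]
image_dent_lit_below = [lambertian_shading(dent, light_below, x, y) for x, y in grid]

# The two images agree to within rounding error.
max_difference = max(abs(a - b)
                     for a, b in zip(image_bump_lit_above, image_dent_lit_below))
```

An observer who assumes overhead lighting will choose the convex interpretation of this image, and will switch to the concave one when the picture is inverted, as in the egg example above.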

These assumptions, combined with a geometrical description of the problem, have led to a number of theorems that demonstrate the feasibility of obtaining shape information from the visual input. As a test of these theorems, algorithms are implemented on digital computers. These have been successful to some extent, in that certain problems in MACHINE VISION, such as automobile part recognition, can be solved. However, the question of whether the primate brain implements mathematically based approaches remains open.

Anatomical and physiological studies of the CEREBRAL CORTEX have been able to determine some of the actual processes and brain regions that are involved in shape recognition. Both hierarchical and parallel processes are involved in the representation of shape. The best-understood recognition process is the motion pathway that passes through striate cortex to the middle temporal motion area (MT/V5). In MT/V5 there is a representation of the velocity of the image for different parts of the visual field (Albright, Desimone, and Gross 1984). This motion representation is further developed in the medial superior temporal area (MST), in which neurons are found that respond to environmental optic flow for spatial vision (Tanaka et al. 1986; Duffy and Wurtz 1991). Beyond MST, the motion signal passes to the parietal cortex (area 7a; Siegel and Read 1997), in which optic flow signals are further processed and combined with eye position information. Both 7a and MST project to the anterior polysensory temporal area (STPa), which has neurons that represent both flow and, apparently, 3-D shape (Bruce, Desimone, and Gross 1981; Anderson and Siegel submitted).

Running roughly in parallel to the processing of visual motion is the analysis of disparity cues. At each step from striate to MT/V5 to MST to 7a, neurons are found that are tuned to disparity (Poggio and Poggio 1984; Roy, Komatsu, and Wurtz 1992; Gnadt and Mays 1995). Little is known as to how binocular cues are used for shape representation.

A more temporal cortical stream represents shape using luminance and color cues. Neurons have been described that represent a variety of luminance cues, such as orientation (Hubel and Wiesel 1977) and borders (von der Heydt, Peterhans, and Baumgartner 1984). Geometrical figures (Tanaka 1993), as well as shapes as complex as faces (Gross 1973), may be represented by temporal cortical neurons. Color analysis surely is used in object identification, although little formal work has been done. Surprisingly, the dependence of these neurons upon parameters of motion (Perrett et al. 1985) and disparity is as yet little explored. Such studies are crucial, as the psychophysical ability to describe shape (a putative temporal stream analysand) does not deteriorate when motion or disparity (a putative dorsal stream analysand) is the underlying representation.

In summary, the visual perception of 3-D structure utilizes motion, disparity, and luminance. Psychophysical studies have defined the limits of our ability, while computational studies have developed a formal framework to describe the perceptual process as well as to test hypotheses. Anatomical and physiological results have provided essential cues from functional systems.

References

Albright, T. D., R. Desimone, and C. G. Gross. (1984). Columnar organization of directionally selective cells in visual area MT of the macaque. J. Neurophysiol. 51:16-31.

Anderson, K. A., and R. M. Siegel. (Submitted). Representation of three-dimensional structure-from-motion in STPa of the behaving monkey. Cereb. Cortex.

Bennett, B. M., and D. D. Hoffman. (1985). The computation of structure from fixed-axis motion: Nonrigid structures. Biol. Cybern. 51:293-300.

Bruce, C., R. Desimone, and C. G. Gross. (1981). Visual properties of neurons in a polysensory area in superior temporal sulcus of the macaque. J. Neurophysiol. 46:369-384.

Duffy, C. J., and R. H. Wurtz. (1991). Sensitivity of MST neurons to optic flow stimuli. 1. A continuum of response selectivity to large-field stimuli. J. Neurophysiol. 65:1329-1345.

Gibson, J. J. (1966). The Senses Considered as Perceptual Systems. Boston: Houghton Mifflin.

Gnadt, J. W., and L. E. Mays. (1995). Neurons in monkey parietal area LIP are tuned for eye-movement parameters in 3-D space. J. Neurophysiol. 73:280-297.

Gross, C. G. (1973). Visual functions of inferotemporal cortex. In H. Autrum, R. Jung, W. R. Loewenstein, D. M. McKay, and H. L. Teuber, Eds., Handbook of Sensory Physiology 7/3B. Berlin: Springer, pp. 451-482.

Helmholtz, H. L. F. von. (1962). Helmholtz's Treatise on Physiological Optics. Translated from the 3rd German ed., James P. C. Southall, Ed. New York: Dover Publications.

Hubel, D. H., and T. N. Wiesel. (1977). The Ferrier Lecture. Functional architecture of macaque monkey visual cortex. Proc. R. Soc. Lond. B. Biol. Sci. 198:1-59.

Julesz, B. (1995). Dialogues on Perception. Cambridge, MA: MIT Press.

Marr, D. (1982). Vision. San Francisco: W. H. Freeman.

Perrett, D. I., P. A. Smith, A. J. Mistlin, A. J. Chitty, A. S. Head, D. D. Potter, R. Broennimann, A. D. Milner, and M. A. Jeeves. (1985). Visual analysis of body movements by neurones in the temporal cortex of the macaque monkey: A preliminary report. Behav. Brain Res. 16:153-170.

Poggio, G. F., and T. Poggio. (1984). The analysis of stereopsis. Ann. Rev. Neurosci. 7:379-412.

Poggio, T., and C. Koch. (1985). Ill-posed problems in early vision: From computational theory to analogue networks. Proc. R. Soc. Lond. B. Biol. Sci. 226:303-323.

Ramachandran, V. S. (1988). Perceiving shape from shading. Sci. Am. 259(2):76-83.

Roy, J. P., H. Komatsu, and R. H. Wurtz. (1992). Disparity sensitivity of neurons in monkey extrastriate area MST. J. Neurosci. 12:2478-2492.

Siegel, R. M., and R. A. Andersen. (1988). Perception of three-dimensional structure from two-dimensional motion in monkey and man. Nature 331:259-261.

Siegel, R. M., and H. L. Read. (1997). Analysis of optic flow in the monkey parietal area 7a. Cereb. Cortex 7:327-346.

Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science 262:685-688.

Tanaka, K., K. Hikosaka, H. Saito, M. Yukie, Y. Fukada, and E. Iwai. (1986). Analysis of local and wide-field movements in the superior temporal visual areas of the macaque monkey. J. Neurosci. 6:134-144.

Ullman, S. (1979). The Interpretation of Visual Motion. Cambridge, MA: MIT Press.

von der Heydt, R., E. Peterhans, and G. Baumgartner. (1984). Illusory contours and cortical neuron responses. Science 224:1260-1262.

Wade, N. J., and H. Ono. (1985). The stereoscopic views of Wheatstone and Brewster. Psychol. Res. 47(3):125-133.

Wallach, H., and D. N. O'Connell. (1953). The kinetic depth effect. J. Exp. Psychol. 45:205-217.
