Vision and Learning

Learning is now perceived as the gateway to understanding the problem of INTELLIGENCE. Because seeing is a factor in intelligence, learning is also becoming a key to the study of artificial and biological vision. In the last few years both computer vision -- which attempts to build machines that see -- and visual neuroscience -- which aims to understand how our visual system works -- have been undergoing fundamental changes in their approaches. Visual neuroscience is beginning to focus on the mechanisms that allow the CEREBRAL CORTEX to adapt its circuitry and learn a new task. Instead of building a hardwired machine or program to solve a specific visual task, computer vision is trying to develop systems that can be trained with examples to perform any of a number of visual tasks. The challenge is to develop machines that learn to perform tasks such as visual inspection and recognition from a set of training examples, or even in an unsupervised way from visual experience.

This reflects an overall trend -- to make intelligent systems that do not need to be fully and painfully programmed for specific tasks. In other words, computers will have to be much more like our brain, learning to see rather than being programmed to see. Biological visual systems are more robust and flexible than machine vision mainly because they continuously adapt and learn from experience. At stake are engineering as well as scientific issues. On the engineering side, the possibility of building vision systems that can adapt to different tasks could have enormous impact in many areas, such as automatic inspection, image processing, video editing, virtual reality, multimedia databases, computer graphics, and man-machine interfaces. On the biological side, our present understanding of how the cortex works may radically change if adaptation and learning turn out to play a key role. Instead of the hardwired cortical structures implied by classical work, such as that of Harvard's David Hubel and Torsten Wiesel, we may be confronted with significant NEURAL PLASTICITY -- that is, neuron properties and connectivity that change as a function of visual experience, over time scales as short as a few minutes or even seconds.

There are two main classes of learning techniques being applied to machine vision: supervised and unsupervised learning algorithms (see UNSUPERVISED LEARNING). Supervised learning -- or learning-from-examples -- refers to a system that is trained, instead of programmed, by a set of examples; the training set can thus be thought of as a set of input-output pairs. At run-time the trained system should provide the correct output for a new input not contained in the training set. The underlying theory makes use of function approximation techniques, neural network architectures, and statistical methods. Systems have been developed that learn to recognize objects, in particular faces (see FACE RECOGNITION), systems that learn to find specific objects in cluttered scenes, software that learns to draw cartoon characters from an artist's drawings, and algorithms that learn to synthesize novel image sequences from a few real pictures, thereby promising extremely high compression in applications such as videoconferencing and video e-mail. So far the most ambitious unsupervised learning techniques have been used only in simple, "toy" examples, but they represent the ultimate goal: learning to see, from experience, without a teacher.
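
Since supervised learning here amounts to approximating a function from input-output pairs, a small sketch can make the framework concrete. The Python fragment below fits a network of Gaussian radial basis functions, one simple instance of the function approximation techniques just mentioned; the function names and parameter values are illustrative, not those of any particular published system.

    import numpy as np

    def train_rbf(X, y, sigma=0.5, reg=1e-6):
        # Fit coefficients c so that f(x) = sum_i c_i exp(-|x - x_i|^2 / (2 sigma^2)).
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        G = np.exp(-d2 / (2 * sigma ** 2))                   # Gaussian kernel matrix
        return np.linalg.solve(G + reg * np.eye(len(X)), y)  # lightly regularized fit

    def predict_rbf(X_train, c, x_new, sigma=0.5):
        d2 = ((X_train - x_new) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2)) @ c

    # Usage: learn y = sin(x) from eight input-output pairs, then
    # generalize to a new input not contained in the training set.
    X = np.linspace(0, np.pi, 8)[:, None]
    c = train_rbf(X, np.sin(X[:, 0]))
    print(predict_rbf(X, c, np.array([1.0])))  # close to sin(1.0) ~ 0.84

In a vision setting the inputs would be image vectors and the outputs labels or pose parameters; the machinery is the same.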

In computer vision tasks (see COMPUTATIONAL VISION) the input to the supervised learning system is a digitized image or an image sequence, and the output is a set of parameters estimated from the image. For instance, in the ALVINN system, developed by Dean Pomerleau (1993) at Carnegie Mellon University for the task of driving a car, the input is a series of images of the road and the output is the steering angle. In recognition tasks the output is a label identifying the object in the image (see VISUAL OBJECT RECOGNITION, AI).

The analysis problem of estimating object labels and other parameters from images is the problem of vision. It is the inverse of the problem of classical optics and modern computer graphics, where the question is how to synthesize images of given 3-D surfaces as a function of parameters such as the direction of the illuminant, the position of the camera, and the material properties of the object. In the supervised learning framework it is natural to use a learning module to associate input parameters with output images; the trained module can then synthesize new images. Traditional 3-D computer graphics simulates the physics of the world by building 3-D models, transforming them in 3-D space, simulating their physical properties, and finally rendering them by simulating geometrical optics. The learning-from-examples paradigm suggests a rather different and unconventional approach: take several real images of a 3-D object and create new images by generalizing from those views, under the control of appropriate pose and expression parameters assigned by the user during the training phase.
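
The same regression machinery can sketch the synthesis direction: train on pairs of pose parameter and example view, then ask for the image at a new pose. Published systems (e.g., Beymer and Poggio 1996) work on correspondence-based representations of the images rather than raw pixels, so the raw-pixel toy below is only an illustration of the idea, with hypothetical names throughout.

    import numpy as np

    def synthesize(poses, views, pose_new, sigma=0.3, reg=1e-6):
        # poses: (n,) training pose parameters; views: (n, npix) example images as vectors.
        G = np.exp(-(poses[:, None] - poses[None, :]) ** 2 / (2 * sigma ** 2))
        C = np.linalg.solve(G + reg * np.eye(len(poses)), views)  # coefficients, one row per example
        k = np.exp(-(poses - pose_new) ** 2 / (2 * sigma ** 2))
        return k @ C  # image vector at the requested pose

    # Usage: three example views at poses 0, 0.5, and 1 (random stand-ins
    # for real images) yield an intermediate view at pose 0.25.
    views = np.random.rand(3, 64 * 64)
    print(synthesize(np.array([0.0, 0.5, 1.0]), views, 0.25).shape)  # (4096,)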

A large class of SUPERVISED LEARNING schemes thus directly suggests a view-based approach to computer vision and to computer graphics. Though it cannot be seen as a substitute for the more traditional approaches, the learning-from-examples approach to vision and graphics may represent an effective shortcut to several problems.

An obvious application of the supervised learning framework is the recognition of 3-D objects. The idea is to train the learning module with a few views of the object to be recognized as input -- in general, from different viewpoints and under different illuminations -- and the corresponding label as output, without any explicit 3-D model. This is a classification problem, as opposed to the regression problem of estimating real-valued parameters associated with the image. An interesting demonstration of the power of this view-based paradigm is the development of several successful face recognition systems.
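
In its simplest form such a view-based recognizer just stores the labeled example views and assigns a new image the label of the closest stored view. The sketch below is one minimal instance of this classification scheme, with illustrative names; practical systems match preprocessed features rather than raw pixels.

    import numpy as np

    def train(views_by_label):
        # views_by_label: dict mapping an object label to a list of image vectors.
        X = np.stack([v for vs in views_by_label.values() for v in vs])
        y = [lab for lab, vs in views_by_label.items() for _ in vs]
        return X, y

    def recognize(X, y, image):
        d = ((X - image) ** 2).sum(axis=1)  # distance to every stored view
        return y[int(d.argmin())]           # label of the nearest view

    # Usage: two objects with three stored views each (random stand-ins).
    db = {"cup": [np.random.rand(256) for _ in range(3)],
          "phone": [np.random.rand(256) for _ in range(3)]}
    X, y = train(db)
    print(recognize(X, y, db["cup"][0] + 0.01 * np.random.rand(256)))  # "cup"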

Even more difficult than recognizing an isolated specific object is detecting an object of a certain class in a cluttered image. Again, supervised learning systems have been developed that can be trained to detect faces, cars, and people in complex images.
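
One common way to turn a trained classifier into a detector, used for instance in neural network face detectors such as that of Rowley, Baluja, and Kanade (1995), is to sweep a fixed-size window across the image and classify every window. In the sketch below, classify stands for any trained module, and the window size and step are arbitrary.

    import numpy as np

    def detect(image, classify, win=24, step=4):
        # Return the top-left corners of windows the trained classifier accepts.
        h, w = image.shape
        hits = []
        for r in range(0, h - win + 1, step):
            for c in range(0, w - win + 1, step):
                if classify(image[r:r + win, c:c + win]):
                    hits.append((r, c))
        return hits

    # Usage with a stand-in classifier that fires on bright windows.
    print(detect(np.random.rand(64, 64), lambda w: w.mean() > 0.6))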

The key problem for the practical use of most learning-from-examples schemes is often the insufficient size of the training set. Because input vectors typically have a high dimension (equal, for instance, to the number of pixels in an image), the required number of training examples is very high: this is the so-called curse of dimensionality. A natural idea is to exploit prior information to generate additional virtual examples from a small set of real example images. For instance, knowledge of the symmetry properties of a class of 3-D objects allows the synthesis of additional examples, as sketched below. More generally, it is possible to learn the legal transformations typical of a certain class of objects from example images of other objects of the same class.
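
The symmetry case fits in a few lines: for a bilaterally symmetric class such as faces, the mirror image of every training view is itself a legal example, doubling the training set for free. A minimal sketch, assuming images are stored as arrays with a vertical symmetry axis:

    import numpy as np

    def add_mirror_examples(images, labels):
        # Double a training set using left-right symmetry.
        mirrored = [np.fliplr(im) for im in images]
        return images + mirrored, labels + labels

    # Usage: ten real views become twenty training examples.
    images = [np.random.rand(32, 32) for _ in range(10)]
    images2, labels2 = add_mirror_examples(images, ["face"] * 10)
    print(len(images2))  # 20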

The example-based approach has proved successful in practical problems of object recognition, image analysis, and image synthesis. It is therefore natural to ask whether a similar approach may be used by our brain. Networks that learn from examples have an obvious appeal, given our knowledge of neural mechanisms. Over the last four years psychophysical experiments have indeed supported view-based schemes, and physiological experiments have provided a suggestive glimpse of how neurons in inferotemporal (IT) cortex may represent objects for recognition (Logothetis, Pauls, and Poggio 1995, and references therein). The experimental results seem to agree to a surprising extent with the view-based models.

See also

-- Tomaso Poggio

References

Logothetis, N. K., J. Pauls, and T. Poggio. (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology 5(5):552-563.

Pomerleau, D. A. (1993). Neural Network Perception for Mobile Robot Guidance. Dordrecht: Kluwer.

Further Readings

Beymer, D., and T. Poggio. (1996). Image representation for visual learning. Science 272:1905-1909.

Murase, H., and S. K. Nayar. (1995). Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision 14(1):5-24.

Nayar, S. K., and T. Poggio, Eds. (1996). Early Visual Learning. Oxford: Oxford University Press.

Poggio, T., and D. Beymer. (1996). Learning to see. IEEE Spectrum: 60-69.

Poggio, T., and S. Edelman. (1990). A network that learns to recognize 3-D objects. Nature 343:263-266.

Rowley, H. A., S. Baluja, and T. Kanade. (1995). Human face detection in visual scenes. Technical Report CMU-CS-95-158R, School of Computer Science, Carnegie Mellon University.

Turk, M., and A. Pentland. (1991). Face recognition using eigenfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA: IEEE Computer Society Press, pp. 586-591.