Stereo and Motion Perception

Given several views of a scene, how does one recover its structure and, possibly, the pose of the cameras that have acquired the views? In the case of two views, this is known as the stereo problem ; when there are more than two views, it is the discrete structure from motion (DSFM) problem. These are two of the subjects of COMPUTATIONAL VISION. Many applications of DSFM are found in robotics in general and in MOBILE ROBOTS in particular, as well as in VISUAL OBJECT RECOGNITION.

A general solution to the DSFM problem requires the availability of correspondences, based, for example, on TEXTURE, between some of the views. How to obtain those correspondences is itself a formidable problem that completely determines the solution to the DSFM problem.

Why is this a formidable task? Suppose that there are M views and the number of pixels per view is N. A correspondence is an M-tuple of features and there are NM such tuples. If we take N = 1 million, M = 10, the number of possible tuples is 1060!

To reduce this number, we must exploit the geometrical constraints that are induced by the fact that a camera is a projective engine. Projective geometry is a rich source of constraints on the number of possible correspondences and, as we will show, is a fundamental tool for determining the sort of geometrical structure that can be recovered from the views. DFSM can be thought of as a combination of a HEURISTIC SEARCH and CONSTRAINT SATISFACTION problem.

A camera is a projective engine because it can usually be accurately modeled as a pinhole camera. In the pinhole camera, an image is formed on the retinal plane by perspective projection with respect to the optical center (Faugeras 1993). Since this is a projective construction, it implies that the tools of projective geometry (Semple and Kneebone 1952) can immediately be brought to bear on the DSFM problem. In particular, projective geometry allows us to represent a camera by a matrix, called the perspective projection matrix.

Projective geometry is a rich source of constraints for the DSFM problem because M image points are the images of a single 3-D point, if and only if their image coordinates satisfy a number of polynomial constraints. The coefficients of those polynomials are functions of the (unknown) coefficients of the perspective projection matrices.

The simplest example occurs in the case of two views, that is, stereo, where the coordinates of two corresponding points satisfy a quadratic polynomial. This polynomial arises from the beautiful epipolar geometry of two views that is routinely used in common stereo programs (Grimson 1985).

The next example is the case of three views (M = 3), which offers an even richer geometry and algebra. The coordinates of three corresponding points satisfy four algebraically independent polynomials of degree 3, the trilinear constraints (Hartley 1997; Shashua 1995).

When does this process stop? There are two phenomena that take place. First, when M = 4 the constraints of degree 4 are algebraically dependent on the polynomials of degrees 2 and 3 obtained from subsets of 2 and 3 views of the M views (Faugeras and Mourrain 1995). Second, the case M > 4 does not bring in any new constraints. Since, moreover, the constraints of degree 3 imply all constraints of degree 2 (Heyden 1995), the geometrical constraints between M > 3 views are completely described by the trilinear constraints between all triples of views among the M.

In order to use the constraints to eliminate wrong correspondences or to reconstruct the scene, their coefficients have to be estimated. This can only be achieved if "correct" correspondences can be obtained. This sounds like a chicken-and-egg problem since correct correspondences are needed to estimate the constraints that will then be used to extract correct correspondences. RANSACK -like techniques (Fischler and Bolles 1981) can be used successfully to break that vicious circle and produce reliable estimates of the constraints (Zhang et al. 1995; Torr and Zissermann 1997).

The geometry of the M views being represented by the trilinear constraints, can we recover the perspective projection matrices of the cameras from them? There is no unique solution to this problem in the absence of further information; given all the trilinear constraints (in practice only a small subset of those are necessary), a four parameter family of perspective projection matrices of the M cameras can be computed. The four parameters correspond to the global scale of the scene and to the choice of a plane of reference (Luong and Viéville 1994).

Any element of this family of perspective projection matrices allows us to recover the projective structure of the scene (Faugeras 1992; Hartley and Gupta 1993). By projective structure we mean that the 3-D coordinates of the reconstructed points are defined up to an arbitrary projective transformation of the 3-D space, that is, up to fifteen parameters.

How can the Euclidean coordinates of the points be obtained? It is in general possible, even without any a priori information on the scene, thanks to the use of group invariance theory. The information contained in the perspective projection matrix of a camera can be decomposed into two parts, one that encodes the internal parameters of the camera, for instance its focal length, and the other that encodes its pose. In order to access the Euclidean coordinates of the scene it is only necessary to recover the internal parameters since the camera poses can be computed from them.

The internal parameters of the cameras can be recovered from feature correspondences in at least two cases: first, if the images were acquired by the same camera (Maybank and Faugeras 1992) with the same internal parameters; and second, if the internal parameters are different but satisfy some constraints which are true for most cameras (Heyden and Åström 1997). In both cases, a simple mathematical entity called the absolute conic, or umbilic, plays the role of a calibration grid. The difference between this conic and the usual calibration grids is that the umbilic is located in the plane at infinity, a plane that defines the affine geometry of three-space (it is invariant to affine transformations), and has only complex points. The umbilic defines the Euclidean geometry of three-space (it is invariant to Euclidean transformations). The process can be thought of as first recovering the projective, then affine, and finally Euclidean structure of the scene (Faugeras 1995). The scene is then defined up to an arbitrary rigid transformation and scaling of the 3-D space, that is, up to six parameters.

It is worth noting that this stratified way of solving the DSFM problem is a faint echo of the Erlangen program put forward by the two German mathematicians, Felix Klein and Herman Weyl, in 1872 in which they suggested the study of geometry from the viewpoint of invariance of geometrical figures to the action of groups of transformations (Mundy and Zisserman 1992).

See also

-- Olivier D. Faugeras

References

Faugeras, O. (1992). What can be seen in three dimensions with an uncalibrated stereo rig. In G. Sandini, Ed., Proceedings of the 2nd European Conference on Computer Vision. Vol. 588, Lecture Notes in Computer Science. Santa Margherita Ligure, Italy: Springer, pp. 563-578.

Faugeras, O. (1993). Three-Dimensional Computer Vision: A Geometric Viewpoint. Cambridge, MA: MIT Press.

Faugeras, O. (1995). Stratification of 3-D vision: Projective, affine, and metric representations. Journal of the Optical Society of America A.Optics and Image Science 12(3):465-484.

Faugeras, O., and B. Mourrain. (1995). About the correspondences of points between n images. In Proceedings of the Workshop on the Representation of Visual Scenes. Cambridge, MA: IEEE Computer Society.

Fischler, M., and R. Bolles. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24:381-385.

Grimson, W. (1985). Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 7(1):17-34.

Hartley, R. I. (1997). Lines and points in three views and the trifocal tensor. The International Journal of Computer Vision 22(2):125-140.

Hartley, R. I., and R. Gupta. (1993). Computing matched-epipolar projections. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, New York: IEEE Computer Society, pp. 549-555.

Heyden, A. (1995). Geometry and Algebra of Multiple Projective Transformations. Ph.D. diss., Lund University, Sweden.

Heyden, A., and K. Åström. (1997). Euclidean reconstruction from image sequences with varying and unknown focal length and principal point. In Computer Vision and Pattern Recognition IEEE Computer Society Press, pp. 438-443.

Luong, Q.-T., and T. Viéville. (1994). Canonic representations for the geometries of multiple projective views. In J.-O. Eklundh, Ed., Proceedings of the 3rd European Conference on Computer Vision. Vol. 1, Lecture Notes in Computer Science. Stockholm: Springer, pp. 589-599.

Maybank, S. J., and O. D. Faugeras. (1992). A theory of self- calibration of a moving camera. The International Journal of Computer Vision 8(2):123-152.

Mundy, J. L., and A. Zisserman, Eds. (1992). Geometric Invariance in Computer Vision. Cambridge, MA: MIT Press.

Semple, J., and G. Kneebone. (1952/1979). Algebraic Projective Geometry. Oxford: Clarendon Press.

Shashua, A. (1995). Algebraic functions for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(8):779-789.

Torr, P., and A. Zissermann. (1997). Performance characterization of fundamental matrix estimation under image degradation. Machine Vision and Applications 9:321-333.

Zhang, Z., R. Deriche, O. Faugeras, and Q.-T. Luong. (1995). A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intel ligence Journal 78:87-119.