Statistical learning theory addresses a key question that arises when constructing predictive models from data -- how to decide whether a particular model is adequate or whether a different model would produce better predictions. Whereas classical statistics typically assumes that the form of the correct model is known and the objective is to estimate the model parameters, statistical learning theory presumes that the correct form is completely unknown and the goal is to identify the best possible model from a set of competing models. The models need not have the same mathematical form and none of them need be correct. The theory provides a sound statistical basis for assessing model adequacy under these circumstances, which are precisely the circumstances encountered in MACHINE LEARNING, PATTERN RECOGNITION, and exploratory data analysis.

Estimating the performance of competing
models is the central issue in statistical learning theory. Performance
is measured through the use of *loss functions.* The loss
*Q*(**z**, α) between
a data vector **z** and a specific model α (one
with values assigned to all parameters) is a score that indicates how
well α performs on **z**,
with lower scores indicating better performance. The squared-error
function for regression models, the 0/1 loss function for
classification models, and the negative log likelihood for more
general statistical models are all examples of loss functions.
The choice of loss function depends on the nature of the modeling
problem.
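
As a concrete illustration (a minimal Python sketch, not part of the original entry; the function names are ours), each of these losses scores a single model prediction against a single observed data vector:

```python
import numpy as np

def squared_error_loss(y_observed, y_predicted):
    """Squared-error loss for a regression model's prediction."""
    return (y_observed - y_predicted) ** 2

def zero_one_loss(label_observed, label_predicted):
    """0/1 loss for a classification model: 0 if correct, 1 otherwise."""
    return 0.0 if label_observed == label_predicted else 1.0

def negative_log_likelihood(prob_of_observed):
    """Negative log likelihood: the loss for a general statistical model
    that assigns probability prob_of_observed to the observed data vector."""
    return -np.log(prob_of_observed)
```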

From the point of view of UTILITY THEORY, α is a decision variable, **z** is
an outcome, and *Q*(**z**,
α) is the negative utility of the outcome
given the decision. Hence, if the statistical properties of the
data were already known, the optimum model would be the α that
minimizes the expected loss *R*(α):

*R*(α) = ∫ *Q*(**z**, α) *dF*(**z**),

where *F*(**z**) is the probability measure that
defines the true statistical properties of the
data. *R*(α) is also referred to as the *risk* of
α. In learning situations, however,
*F*(**z**) is unknown and one must choose a model
based on a set of observed data vectors
**z**_{i}, *i* = 1, . . . ,
*l*, that are assumed to be random samples of
*F*(**z**). The average loss
*R*_{emp}(α, *l*) on the observed data is
used as an empirical estimate of the expected loss, where

*R*_{emp}(α, *l*) = (1/*l*) Σ_{i=1}^{*l*} *Q*(**z**_{i}, α).

*R*_{emp}(α, *l*) is also referred to as
the *empirical risk* of α.
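
In code, the empirical risk is just this average; a minimal sketch, assuming the loss *Q* is available as a Python callable and a specific model α is whatever object that callable accepts:

```python
def empirical_risk(Q, alpha, data):
    """R_emp(alpha, l): average loss of the specific model alpha over the
    l observed data vectors in `data`."""
    return sum(Q(z, alpha) for z in data) / len(data)
```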

The fundamental question in statistical learning theory is the
following: under what conditions does minimizing
*R*_{emp}(α, *l*) yield models that also
minimize *R*(α), inasmuch as the latter is what we
actually want to accomplish? This question is answered by considering
the accuracy of the empirical loss estimate.

As in classical statistics, accuracy is expressed in terms of
confidence regions; that is, how far can
*R*_{emp}(α, *l*) be expected to deviate
from *R*(α), and with what probability? One of the
fundamental theorems of statistical learning theory shows that the
size of the confidence region is governed by the maximum difference
between the two losses over all models being considered:

sup_{α ∈ Λ} |*R*(α) - *R*_{emp}(α, *l*)|,

where Λ is the set of competing models. The maximum difference dominates because of the phenomenon of overfitting.

Overfitting occurs when the best model relative to the training data
tends to perform significantly worse when applied to new data. This
mathematically corresponds to a situation in which the average loss
*R*_{emp}(α, *l*) substantially
underestimates the expected loss *R*(α). Although there
is always some probability that underestimation will occur for a fixed
model α, both the probability and the degree of underestimation
are increased by the fact that we explicitly search for the α that
minimizes *R*_{emp}(α, *l*). This
search biases the difference between *R*(α) and
*R*_{emp}(α, *l*) toward the maximum
difference among competing models. If the maximum difference does not
converge to zero as the number of data vectors *l* increases,
then overfitting will occur with probability one.
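
The following toy simulation (illustrative only; the lookup-table "model class" and all names are ours) makes the underestimation concrete: because the class can fit any labeling of the training vectors, searching it drives the empirical 0/1 risk to zero even though the expected risk stays at one half.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 20
x = rng.normal(size=(l, 50))          # 50-dimensional input vectors
y = rng.integers(0, 2, size=l)        # labels independent of the inputs

# A very rich model class -- here, lookup tables over the training inputs --
# contains a model that reproduces any labeling exactly, so the minimum
# empirical risk over the class is zero.
best_model = {tuple(xi): yi for xi, yi in zip(x, y)}
R_emp = np.mean([best_model[tuple(xi)] != yi for xi, yi in zip(x, y)])

# On fresh data from the same source the chosen model is no better than
# guessing, so its expected risk R(alpha) is about 0.5.
x_new = rng.normal(size=(10000, 50))
y_new = rng.integers(0, 2, size=10000)
R_new = np.mean([best_model.get(tuple(xi), 0) != yi
                 for xi, yi in zip(x_new, y_new)])

print(R_emp, R_new)   # approximately 0.0 and 0.5
```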

The core results in statistical learning theory are a series of
probability bounds developed by Vapnik and Chervonenkis (1971, 1981,
1991) that define small-sample confidence regions for the maximum
difference between *R*(α) and
*R*_{emp}(α, *l*). The confidence
regions differ from those obtained in classical statistics in three
respects. First, they do not assume that the models are
correct. Second, they are based on small-sample statistics and are not
asymptotic approximations. Third, a uniform method is used to take
into account the degree to which overfitting can occur for a given set
of competing models. This method is based on a measure known as
the Vapnik-Chervonenkis (VC) dimension.

Conceptually speaking, the VC dimension of a set of models is the
maximum number of data vectors for which overfitting is virtually
guaranteed in the sense that one can always find a specific model that
fits the data exactly. For example, the VC dimension of the family of
linear discriminant functions with *n* parametric terms is
*n*, because *n* linear terms can be used to
discriminate exactly *n* points in general position for any
two-class labeling of the points. This conceptual definition of VC
dimension accurately reflects the formal definition in the case of 0/1
loss functions, wherein *Q*(**z**, α) = 0 if
α correctly predicts **z** and
*Q*(**z**, α) = 1 otherwise. However, the
formal definition is more general in that it considers arbitrary loss
functions and does not require exact fits.
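
A small numerical check of the linear-discriminant example (illustrative, with randomly drawn points; not from the original entry): with *n* parameters there is always a weight vector reproducing any two-class labeling of *n* points in general position.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 4                                        # parameters: 3 inputs plus a bias
points = rng.normal(size=(n, n - 1))         # n points in general position
X = np.hstack([points, np.ones((n, 1))])     # augment each point with a bias term

# For every two-class labeling there is a weight vector w with sign(X @ w)
# equal to the labeling: since the points are affinely independent, the
# n x n system X @ w = labels has an exact solution.
for labels in product([-1.0, 1.0], repeat=n):
    w = np.linalg.solve(X, np.array(labels))
    assert np.all(np.sign(X @ w) == np.array(labels))
print("all", 2 ** n, "labelings realized")
```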

In the probability bounds obtained by Vapnik and Chervonenkis, the
size of the confidence region is largely determined by the ratio of
the VC dimension *h* to the number of data vectors
*l*. For example, if *Q*(**z**, α)
is the 0/1 loss function used for classification models, then with
probability at least 1 - η,

*R*(α) ≤ *R*_{emp}(α, *l*) + (ε/2)(1 + √(1 + 4*R*_{emp}(α, *l*)/ε)),

where

ε = 4[*h*(ln(2*l*/*h*) + 1) - ln(η/4)]/*l*.

Note that the ratio of *h* to *l* is the dominant
term in the definition of ε and, hence, in the size of the
confidence region for *R*(α).
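
Assuming the standard form of the bound quoted above, the following sketch (the function name is ours) evaluates the guaranteed risk and shows how the ratio *h*/*l* drives it:

```python
import numpy as np

def vc_bound(emp_risk, h, l, eta=0.05):
    """Upper confidence bound on R(alpha) for the 0/1 loss, with VC
    dimension h, sample size l, and confidence parameter eta."""
    eps = 4.0 * (h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l
    return emp_risk + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * emp_risk / eps))

print(vc_bound(0.10, h=10, l=10000))   # modest h/l: relatively tight guarantee
print(vc_bound(0.10, h=10, l=100))     # large h/l: much looser guarantee
```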

V. N. Vapnik (1982, 1995, 1998) has
reported probability bounds for other families of loss functions
that yield analogous confidence regions based on VC dimension. Bounds
also exist for the special case in which the set of competing models
is finite (continuous parameters typically imply an infinite number
of specific models). These bounds avoid explicit calculation of
VC dimension and are useful in validation-set methods.
A remarkable property shared by all of the bounds is that they make
either no assumptions at all or only very weak assumptions about the
underlying probability distribution *F*(**z**). In addition,
they are valid for small sample sizes and they depend only on the
VC dimension of the set of competing models Λ,
or on its size, and on the properties of the loss function *Q*(**z**,
α). All bounds are independent of the mathematical
forms of the models -- the VC dimension and/or the
number of specific models summarizes all relevant information. Thus,
the bounds are equally applicable to both nonlinear and nonparametric models,
and to combinations of dissimilar model families. This includes NEURAL NETWORKS, DECISION TREES and rules, regression trees
and rules, radial basis functions, BAYESIAN NETWORKS,
and so on.

When using statistical learning theory to identify the best model from
a set of competing models, the models must first be ordered according
to preference. The most preferable model that best explains the data
is then selected. The preference order corresponds to the notion of
learning bias found in machine learning. No restrictions are placed
on the order other than that it must be fixed prior to model selection. The
ordering itself is referred to as a *structure* and the process
of selecting models is called *structural risk minimization.*

Structural risk minimization has two components: one is to determine a
cutoff point in the preference ordering, the other is to select the
best model from among those that occur before the cutoff. As the
cutoff point is advanced through the ordering, both the subset of
models that appear before the cutoff and the VC dimension of this
subset steadily increase. With more models to choose from, the minimum
average loss *R*_{emp}(α, *l*) for all
models α before the cutoff tends to decrease. However, the size
of the confidence region for *R*(α) tends to increase
because the size is governed by the VC dimension. The cutoff point is
selected by minimizing the upper bound on the confidence region for
*R*(α), with the corresponding α chosen as the most
suitable model given the available data. For example, for
classification problems one would choose the cutoff and the associated
model α so as to minimize the right hand side of the inequality
presented above for a desired setting of the confidence parameter
η.
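
A schematic sketch of this selection procedure (all names are ours, and it reuses the hypothetical `vc_bound` function sketched earlier), assuming a nested sequence of model subsets with known VC dimensions and a routine that minimizes the empirical risk within each subset:

```python
def structural_risk_minimization(subsets, fit_best, data, eta=0.05):
    """subsets: cutoffs in preference order, each a (subset, vc_dim) pair.
    fit_best(subset, data) returns the model in the subset with minimum
    empirical risk, together with that risk."""
    l = len(data)
    best = None
    best_guarantee = float("inf")
    for subset, h in subsets:
        model, emp_risk = fit_best(subset, data)
        guarantee = vc_bound(emp_risk, h, l, eta)   # guaranteed risk at this cutoff
        if guarantee < best_guarantee:              # keep the smallest guarantee
            best, best_guarantee = model, guarantee
    return best, best_guarantee
```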

The overall approach is illustrated by the graph in figure 1. The process balances the ability to find increasingly better fits to the data against the danger of overfitting and thereby selecting a poor model. The preference order provides the necessary structure in which to compare competing models. Judicious choice of the model order enables one to avoid overfitting even in high-dimensional spaces. For example, Vapnik and others (Cortes and Vapnik 1995; Vapnik 1995, 1998) order models within parametric families according to the magnitudes of the parameters. Each preference cutoff then limits the parameter magnitudes, which in turn limits the VC dimension of the corresponding subset of models. Reliable models can thus be obtained using structural risk minimization even when the number of data samples is orders of magnitude less than the number of parameters.
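
For linear models, one way to realize such a magnitude-based ordering (a sketch under our own assumptions, not the authors' construction) is to sweep a bound on the weight norm, implemented here through an equivalent ridge penalty; earlier positions in the ordering correspond to tighter magnitude limits and hence lower capacity:

```python
import numpy as np

def magnitude_ordered_fits(X, y, penalties):
    """Least-squares fits under progressively weaker weight-magnitude
    constraints; larger ridge penalties come first in the preference order."""
    d = X.shape[1]
    return [np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
            for lam in sorted(penalties, reverse=True)]
```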

- PROBABILITY, FOUNDATIONS OF
- PROBABILISTIC REASONING
- STATISTICAL TECHNIQUES IN NATURAL LANGUAGE PROCESSING

- A Statistical Learning/Pattern Recognition Glossary
- Vapnik, The Nature of Statistical Learning Theory

Cortes, C., and V. N. Vapnik. (1995). Support-vector networks. Machine Learning 20:273-297.

Vapnik, V. N., and A. J. Chervonenkis. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16:264-280. Original work published 1968 as Doklady Akademii Nauk USSR 181(4).

Vapnik, V. N., and A. J. Chervonenkis. (1981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and Its Applications 26:532-553.

Vapnik, V. N., and A. J. Chervonenkis. (1991). The necessary and sufficient conditions for consistency of the method of empirical risk minimization. Pattern Recognition and Image Analysis 1:284-305. Original work published as Yearbook of the Academy of Sciences of the USSR on Recognition, Classification, and Forecasting 2 (1989).

Vapnik, V. N. (1982). Estimation of Dependencies Based on Empirical Data. New York: Springer.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York: Springer.

Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley.

Biggs, N., and M. H. G. Anthony. (1992). Computational Learning Theory: An Introduction. Cambridge Tracts in Theoretical Computer Science, No. 30. Cambridge: Cambridge University Press.

Devroye, L., L. Györfi, and G. Lugosi. (1996). A Probabilistic Theory of Pattern Recognition. New York: Springer.

Dudley, R. M. (1984). A course on empirical processes. In R. M. Dudley, H. Kunita, and F. Ledrappier, Eds., École d'Été de Probabilités de Sainte-Flour XII - 1982. Lecture Notes in Mathematics, vol. 1097. New York: Springer, pp. 1-142.

Kearns, M. J., and U. V. Vazirani. (1994). An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press.

Parrondo, J. M. R., and C. Van den Broeck. (1993). Vapnik-Chervonenkis bounds for generalization. Journal of Physics A 26:2211-2223.

Pollard, D. (1984). Convergence of Stochastic Processes. New York: Springer.

Pollard, D. (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 2. Hayward, CA: Institute of Mathematical Statistics.

Shawe-Taylor, J. S., A. Macintyre, and M. Jerrum, Eds. (1998). Special Issue on the Vapnik-Chervonenkis Dimension. Discrete Applied Mathematics 86(1).

Vidyasagar, M. (1997). A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems (Communications and Control Engineering). New York: Springer.