Computational learning theory is the study of a collection of mathematical models of machine learning, and has among its goals the development of new algorithms for learning from data, and the elucidation of computational and information-theoretic barriers to learning. As such, it is closely related to disciplines with similar aims, such as statistics and MACHINE LEARNING (see also NEURAL NETWORKS; PATTERN RECOGNITION AND FEEDFORWARD NETWORKS; and BAYESIAN LEARNING). However, it is perhaps coarsely distinguished from these other areas by an explicit and central concern for computational efficiency and the COMPUTATIONAL COMPLEXITY of learning problems.

Most of the problems receiving the
greatest scrutiny to date can be traced back to the seminal paper
of L. G. Valiant (1984), where a simple and appealing model is proposed
that emphasizes three important notions. First, learning is probabilistic:
a finite sample of data is generated randomly according to an unknown
process, thus necessitating that we tolerate some error, hopefully
quantifiable, in the hypothesis output by a learning algorithm.
Second, learning algorithms must be computationally efficient, in
the standard complexity-theoretic sense: given a sample of *m* observations
from the unknown random process, the execution time of a successful
learning algorithm will be bounded by a fixed polynomial in *m*.
Third, learning algorithms should be appropriately general: they
should process the finite sample to obtain a hypothesis with good generalization
ability under a reasonably large set of circumstances.

In Valiant's original paper,
the random process consisted of an unknown distribution or density *P* over
an input space *X*, and an unknown Boolean (two-valued) target
function *f* over *X*, chosen from a known class *F* of
such functions. The finite sample given to the learning algorithm
consists of pairs <*x, y*>, where *x* is
distributed according to *P* and *y = f(x)*.
The demand that learning algorithms be general is captured by the
fact that the distribution *P* is arbitrary, and the function
class *F* is typically too large to permit an exhaustive
search for a good match to the observed data. A typical example
sets *F* to be the class of all linear-threshold functions
(perceptrons) over *n*-dimensional real inputs. In this case,
the model would ask whether there is a learning algorithm that,
for any input dimension *n* and any desired error
ε > 0, requires a sample size and execution time bounded
by fixed polynomials in *n* and *1/ε,* and
produces (with high probability) a hypothesis function *h* such
that the probability that *h(x)* ≠ *f(x)* is smaller
than ε under *P*. Note that we demand that the hypothesis
function *h* generalize to unseen data (as represented by
the distribution *P*), and not simply fit the observed training data.
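As a concrete (and deliberately trivial) illustration of the model, the sketch below learns a one-dimensional threshold function from random <*x, y*> pairs and then estimates the generalization error of its hypothesis under *P*. The uniform choice of *P*, the hidden threshold value, and all names are illustrative assumptions, not part of the model.

```python
import random

# A deliberately simple instance of the model: the unknown target f is a
# threshold function on [0, 1], the unknown distribution P is uniform,
# and the learner outputs the smallest consistent threshold hypothesis.
# All names and numeric choices here are illustrative assumptions.

THETA = 0.37                            # hidden parameter of the target f

def f(x):                               # unknown Boolean target function
    return x >= THETA

def draw_sample(m, rng):                # m pairs <x, f(x)> with x drawn from P
    return [(x, f(x)) for x in (rng.random() for _ in range(m))]

def learn_threshold(sample):            # hypothesis: smallest positive example
    positives = [x for x, y in sample if y]
    t = min(positives) if positives else 1.0
    return lambda x: x >= t

rng = random.Random(0)
h = learn_threshold(draw_sample(200, rng))

# Generalization error Pr_{x ~ P}[h(x) != f(x)], estimated on fresh draws
# from P rather than on the training sample itself.
fresh = [rng.random() for _ in range(10000)]
err = sum(h(x) != f(x) for x in fresh) / len(fresh)
assert err < 0.1                        # small error, not merely a fit to the data
```

Note that the sample size needed for a given error scales polynomially here, in keeping with the model's efficiency demand.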

The last decade has yielded a wealth
of results and analyses in the model sketched above and related
variants. Many of the early papers demonstrated that simple and
natural learning problems can be computationally difficult for a
variety of interesting reasons. For instance, Pitt and Valiant (1988) showed
learning problems for which the "natural" choice for
the form of the hypothesis function *h* leads to an NP-hard
problem, but for which a more general hypothesis representation
leads to an efficient algorithm. Kearns and Valiant (1994) exhibited
close connections between hard learning problems and cryptography
by showing that several natural problems, including learning finite automata
and Boolean formulae, are computationally difficult regardless of
the method used to represent the hypothesis.

These results demonstrated that powerful
learning algorithms were unlikely to be developed within Valiant's
original framework without some modifications, and researchers turned
to a number of reasonable relaxations. One fruitful variant has been
to supplement the random sample given to the learning algorithm
with a mechanism to answer *queries* -- that is, rather than simply passively receiving <*x,
y*> pairs drawn from some distribution, the algorithm
may now actively request the classification of any desired *x* under
the unknown target function *f*. With this additional mechanism, a
number of influential and elegant algorithms have been discovered,
including algorithms for finite automata by Angluin (1987, 1988; to be contrasted
with the intractability without the query mechanism, mentioned above)
and for decision trees by Kushilevitz and Mansour (1993).
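To see why active queries can help, consider the toy analogue below (it is not Angluin's automaton algorithm, and the oracle and names are hypothetical): an unknown threshold concept on *N* points is identified by binary search using about log *N* membership queries, where passive random examples would need on the order of *N* draws to pin it down.

```python
# A toy analogue (not Angluin's automaton algorithm) of learning with
# membership queries: an unknown threshold concept on {0, ..., N-1} is
# identified by binary search using about log2(N) active queries to the
# target, whereas passively drawn random examples would need on the
# order of N draws. The oracle and all names here are hypothetical.

def learn_by_queries(query, n):
    """Return the least i in [0, n) with query(i) == True (n if none)."""
    lo, hi = 0, n                  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if query(mid):             # active membership query to the target
            hi = mid
        else:
            lo = mid + 1
    return lo

N = 1_000_000
SECRET = 123_456                   # the unknown target's threshold

calls = 0
def oracle(i):                     # counts how often the learner queries
    global calls
    calls += 1
    return i >= SECRET

assert learn_by_queries(oracle, N) == SECRET
assert calls <= 20                 # roughly log2(N) queries sufficed
```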

One drawback to the query mechanism
is the difficulty of simulating such a mechanism in real machine
learning applications, where typically only passive observations
of the kind posited in Valiant's original model are available.
An alternative but perhaps more widely applicable relaxation of
this model is known as the *weak learning* or *boosting* model.
Here it is assumed that we already have possession
of an algorithm that is efficient, but meets only very weak (but
still nontrivial) generalization guarantees. This is formalized
by asserting that the weak learning algorithm always outputs a hypothesis
whose error with respect to the unknown target function is slightly
better than "random guessing." The goal of a boosting
algorithm is then to combine the many mediocre hypotheses output
by several executions of the weak learning algorithm into a single
hypothesis that is much better than random guessing. This is made
possible by assuming that the weak learning algorithm will perform
better than random guessing on many *different* distributions
on the inputs. Initially proposed to settle some rather theoretical
questions in Valiant's original model (Schapire 1990), the boosting framework
has recently resulted in new learning algorithms enjoying widespread
experimental success and influence (Freund and Schapire 1997), as
well as new analyses of some classic machine learning heuristics,
such as the CART and C4.5 decision tree programs (Kearns and Mansour 1996).
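The loop below is a minimal AdaBoost-style sketch in the spirit of Freund and Schapire (1997), with a decision-stump weak learner on a toy one-dimensional data set; the data, the stump learner, and all names are illustrative assumptions, not the authors' implementation. Each round reweights the sample so that successive runs of the weak learner face different distributions, and the final hypothesis is a weighted vote over the mediocre ones.

```python
import math

# AdaBoost-style sketch: reweight the sample each round so the weak
# learner faces a new distribution, then combine the weak hypotheses
# by a weighted vote. Data and names are illustrative assumptions.

def weak_learn(xs, ys, w):
    """Best 1-D decision stump under sample weights w (the weak learner)."""
    best = None
    for t in xs:
        for sign in (1, -1):
            h = lambda x, t=t, s=sign: s if x >= t else -s
            err = sum(wi for xi, yi, wi in zip(xs, ys, w) if h(xi) != yi)
            if best is None or err < best[0]:
                best = (err, h)
    return best

def boost(xs, ys, rounds=10):
    m = len(xs)
    w = [1.0 / m] * m                      # initial distribution on examples
    ensemble = []
    for _ in range(rounds):
        err, h = weak_learn(xs, ys, w)
        err = max(err, 1e-10)              # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Reweight: increase the weight of examples this h got wrong.
        w = [wi * math.exp(-alpha * yi * h(xi))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]           # renormalize to a distribution
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# An "interval" concept that no single stump can represent: +1 on [3, 7).
xs = list(range(10))
ys = [1 if 3 <= x < 7 else -1 for x in xs]
H = boost(xs, ys, rounds=10)
assert all(H(x) == y for x, y in zip(xs, ys))
```

No single stump classifies this data correctly (the best achieves error 0.3), yet the weighted vote over stumps trained on reweighted samples fits it exactly.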

Although computational considerations are the primary distinguishing feature of computational learning theory, and have been the emphasis here, it should be noted that a significant fraction of the work and interest in the field is devoted to questions of a primarily statistical or information-theoretic nature. Thus, characterizations of the number of observations required by any algorithm (computationally efficient or otherwise) for good generalization have been the subject of intense and prolonged scrutiny (Blumer et al. 1989), building on the foundational work of Vapnik (1982; see STATISTICAL LEARNING THEORY). Recent work has also sought similar characterizations without probabilistic assumptions on the generation of observations (Littlestone 1988), and has led to a series of related and experimentally successful algorithms.
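One concrete form these characterizations take, sketched here from the bounds of Blumer et al. (1989) with *d* the VC dimension of the class *F*: any algorithm outputting a hypothesis in *F* consistent with the sample achieves error at most ε with probability at least 1 − δ once the sample size *m* satisfies the first bound below, while no algorithm, efficient or otherwise, can succeed with fewer examples than the second.

```latex
m \;=\; O\!\left(\frac{1}{\epsilon}\Bigl(d\,\log\frac{1}{\epsilon} \;+\; \log\frac{1}{\delta}\Bigr)\right),
\qquad
m \;=\; \Omega\!\left(\frac{d \;+\; \log(1/\delta)}{\epsilon}\right).
```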

A more extensive bibliography and an introduction to some of the central topics of computational learning theory are contained in Kearns and Vazirani (1994).

Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation 75(2):87-106.

Angluin, D. (1988). Queries and concept learning. Machine Learning 2(4):319-342.

Blumer, A., A. Ehrenfeucht, D. Haussler, and M. Warmuth. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36(4):929-965.

Freund, Y., and R. Schapire. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119-139.

Kearns, M., and Y. Mansour. (1996). On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing. Forthcoming in Journal of Computer and System Sciences.

Kearns, M., and L. G. Valiant. (1994). Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM 41(1):67-95.

Kearns, M., and U. Vazirani. (1994). An Introduction to Computational Learning Theory. Cambridge: MIT Press.

Kushilevitz, E., and Y. Mansour. (1993). Learning decision trees using the Fourier spectrum. SIAM Journal on Computing 22(6):1331-1348.

Littlestone, N. (1988). Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2:285-318.

Pitt, L., and L. G. Valiant. (1988). Computational limitations on learning from examples. Journal of the ACM 35:965-984.

Schapire, R. (1990). The strength of weak learnability. Machine Learning 5(2):197-227.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM 27(11):1134-1142.

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.