Computational Learning Theory

Computational learning theory is the study of a collection of mathematical models of machine learning, and has among its goals the development of new algorithms for learning from data, and the elucidation of computational and information-theoretic barriers to learning. As such, it is closely related to disciplines with similar aims, such as statistics and MACHINE LEARNING (see also NEURAL NETWORKS; PATTERN RECOGNITION AND FEEDFORWARD NETWORKS; and BAYESIAN LEARNING). However, it is perhaps coarsely distinguished from these other areas by an explicit and central concern for computational efficiency and the COMPUTATIONAL COMPLEXITY of learning problems.

Most of the problems receiving the greatest scrutiny to date can be traced back to the seminal paper of L. G. Valiant (1984), where a simple and appealing model is proposed that emphasizes three important notions. First, learning is probabilistic: a finite sample of data is generated randomly according to an unknown process, thus necessitating that we tolerate some error, hopefully quantifiable, in the hypothesis output by a learning algorithm. Second, learning algorithms must be computationally efficient, in the standard complexity-theoretic sense: given a sample of m observations from the unknown random process, the execution time of a successful learning algorithm will be bounded by a fixed polynomial in m. Third, learning algorithms should be appropriately general: they should process the finite sample to obtain a hypothesis with good generalization ability under a reasonably large set of circumstances.

In Valiant's original paper, the random process consisted of an unknown distribution or density P over an input space X, and an unknown Boolean (two-valued) target function f over X, chosen from a known class F of such functions. The finite sample given to the learning algorithm consists of pairs <x, y>, where x is distributed according to P and y = f(x). The demand that learning algorithms be general is captured by the fact that the distribution P is arbitrary, and the function class F is typically too large to permit an exhaustive search for a good match to the observed data. A typical example sets F to be the class of all linear-threshold functions (perceptrons) over n-dimensional real inputs. In this case, the model would ask whether there is a learning algorithm that, for any input dimension n and any desired error e > 0, requires a sample size and execution time bounded by fixed polynomials in n and 1/e, and produces (with high probability) a hypothesis function h such that the probability that h(x) ≠ f(x) is smaller than e under P. Note that we demand that the hypothesis function h generalize to unseen data (as represented by the distribution P), and not simply fit the observed training data.
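As an illustration (not drawn from the original papers), the following sketch instantiates the model for an even simpler class than perceptrons: one-dimensional threshold functions over [0, 1]. The target threshold 0.3, the uniform choice of P, and the sample sizes are arbitrary assumptions made for the example; the learner simply outputs a hypothesis consistent with the sample, and its error is then estimated on fresh draws from P.

```python
import random

def f(x, theta=0.3):
    # The unknown Boolean target: f(x) = 1 iff x >= theta.
    return int(x >= theta)

def learn_threshold(sample):
    # A consistent learner: set the hypothesis threshold at the smallest
    # positively labelled example (1.0 if no positives were observed).
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

random.seed(0)
m = 1000
# m pairs <x, y>, with x drawn i.i.d. from the (here uniform) distribution P.
sample = []
for _ in range(m):
    x = random.random()
    sample.append((x, f(x)))

theta_hat = learn_threshold(sample)
h = lambda x: int(x >= theta_hat)

# Estimate the generalization error Pr_P[h(x) != f(x)] on fresh draws from P.
fresh = [random.random() for _ in range(100000)]
error = sum(h(x) != f(x) for x in fresh) / len(fresh)
```

With high probability over the sample, the estimated error is small, and it shrinks polynomially (here, linearly) in 1/m, in keeping with the model's demands.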

The last decade has yielded a wealth of results and analyses in the model sketched above and related variants. Many of the early papers demonstrated that simple and natural learning problems can be computationally difficult for a variety of interesting reasons. For instance, Pitt and Valiant (1988) exhibited learning problems for which the "natural" choice for the form of the hypothesis function h leads to an NP-hard problem, but for which a more general hypothesis representation leads to an efficient algorithm. Kearns and Valiant (1994) exhibited close connections between hard learning problems and cryptography by showing that several natural problems, including learning finite automata and Boolean formulae, are computationally difficult regardless of the method used to represent the hypothesis.

These results demonstrated that powerful learning algorithms were unlikely to be developed within Valiant's original framework without some modifications, and researchers turned to a number of reasonable relaxations. One fruitful variant has been to supplement the random sample given to the learning algorithm with a mechanism to answer queries -- that is, rather than simply passively receiving <x, y> pairs drawn from some distribution, the algorithm may now actively request the classification of any desired x under the unknown target function f. With this additional mechanism, a number of influential and elegant algorithms have been discovered, including Angluin's algorithm for finite automata (1987, 1988; to be contrasted with the intractability without the query mechanism mentioned above) and the algorithm of Kushilevitz and Mansour (1993) for decision trees.
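A toy example (far simpler than the automata and decision-tree algorithms just cited, and introduced here only to illustrate the query mechanism) is learning an unknown monotone conjunction over n Boolean variables: n well-chosen queries suffice, whereas a passive learner must wait for informative examples to arrive by chance. The oracle below is a hypothetical stand-in for the classification mechanism.

```python
def learn_monotone_conjunction(query, n):
    # Membership-query learner for an unknown monotone conjunction over n
    # Boolean variables: variable i belongs to the conjunction iff turning
    # it off in the all-ones assignment turns the function off.
    relevant = []
    for i in range(n):
        probe = [1] * n
        probe[i] = 0
        if query(probe) == 0:
            relevant.append(i)
    return relevant

# Hypothetical unknown target: the conjunction x0 AND x2 AND x5 over 8 variables.
target_vars = {0, 2, 5}
oracle = lambda x: int(all(x[i] for i in target_vars))
print(learn_monotone_conjunction(oracle, 8))  # -> [0, 2, 5]
```

The point of the example is that each query is chosen actively by the learner, exactly the power that the passive sampling model withholds.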

One drawback to the query mechanism is the difficulty of simulating such a mechanism in real machine learning applications, where typically only passive observations of the kind posited in Valiant's original model are available. An alternative but perhaps more widely applicable relaxation of this model is known as the weak learning or boosting model. Here it is assumed that we already possess an algorithm that is efficient, but meets only very weak (though still nontrivial) generalization guarantees. This is formalized by asserting that the weak learning algorithm always outputs a hypothesis whose error with respect to the unknown target function is slightly better than "random guessing." The goal of a boosting algorithm is then to combine the many mediocre hypotheses output by several executions of the weak learning algorithm into a single hypothesis that is much better than random guessing. This is made possible by assuming that the weak learning algorithm will perform better than random guessing on many different distributions on the inputs. Initially proposed to settle some rather theoretical questions in Valiant's original model (Schapire 1990), the boosting framework has recently resulted in new learning algorithms enjoying widespread experimental success and influence (Freund and Schapire 1997), as well as new analyses of some classic machine learning heuristics, such as the CART and C4.5 decision tree programs (Kearns and Mansour 1996).
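The reweighting idea can be sketched concretely. The following is a minimal version of the AdaBoost algorithm of Freund and Schapire (1997) in which the weak learner exhaustively picks the best decision "stump" (a one-dimensional threshold); the toy data set and the choice of stumps as weak hypotheses are assumptions made for the illustration, not details from the paper.

```python
import math

def adaboost(xs, ys, rounds):
    # ys are +1/-1 labels; weak hypotheses are decision stumps on the line.
    m = len(xs)
    D = [1.0 / m] * m                # current distribution over the sample
    ensemble = []                    # list of (alpha, threshold, sign)
    for _ in range(rounds):
        # Weak learner: the stump with the smallest error under D.
        best = None
        for t in sorted(set(xs)):
            for s in (1, -1):
                preds = [s if x >= t else -s for x in xs]
                err = sum(d for d, p, y in zip(D, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, s, preds)
        err, t, s, preds = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, t, s))
        # Reweight: misclassified examples gain weight, so the next weak
        # hypothesis must do well on a different distribution.
        D = [d * math.exp(-alpha * y * p) for d, y, p in zip(D, ys, preds)]
        z = sum(D)
        D = [d / z for d in D]
    # The final hypothesis is a weighted majority vote of the weak hypotheses.
    return lambda x: 1 if sum(a * (s if x >= t else -s)
                              for a, t, s in ensemble) >= 0 else -1

# No single stump fits this labelling, but three boosted stumps do.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
H = adaboost(xs, ys, rounds=3)
print([H(x) for x in xs])  # -> [1, 1, -1, -1, 1, 1]
```

Note that the combined vote fits a labelling that no individual weak hypothesis can, which is precisely the boosting phenomenon described above.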

Although computational considerations are the primary distinguishing feature of computational learning theory, and have been the emphasis here, it should be noted that a significant fraction of the work and interest in the field is devoted to questions of a primarily statistical or information-theoretic nature. Thus, characterizations of the number of observations required by any algorithm (computationally efficient or otherwise) for good generalization have been the subject of intense and prolonged scrutiny, building on the foundational work of Vapnik (1982; see STATISTICAL LEARNING THEORY) and its extension to the learning framework considered here (Blumer et al. 1989). Recent work has also sought similar characterizations without probabilistic assumptions on the generation of observations (Littlestone 1988), and has led to a series of related and experimentally successful algorithms.
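The flavor of such sample-size characterizations can be seen in the simplest, finite-class special case (the VC-dimension analyses extend it to infinite classes): for a finite class H, any hypothesis consistent with m ≥ (ln|H| + ln(1/δ))/ε examples has error at most ε with probability at least 1 − δ. The sketch below just evaluates this classical bound; the particular class and parameters are chosen for illustration.

```python
import math

def sample_size_bound(hypothesis_count, eps, delta):
    # Classical bound for a finite hypothesis class H: with probability at
    # least 1 - delta, any hypothesis consistent with
    # m >= (ln|H| + ln(1/delta)) / eps random examples has error at most eps.
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / eps)

# e.g. monotone conjunctions over n = 20 variables, so |H| = 2**20:
print(sample_size_bound(2 ** 20, eps=0.1, delta=0.05))  # -> 169
```

The bound is information-theoretic: it says nothing about how to find a consistent hypothesis efficiently, which is exactly the gap the computational side of the field addresses.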

A more extensive bibliography and an introduction to some of the central topics of computational learning theory are contained in Kearns and Vazirani 1994.

See also

INDUCTION THEORY; INFORMATION THEORY; LEARNING SYSTEMS; STATISTICAL TECHNIQUES IN NATURAL LANGUAGE PROCESSING

-- Michael Kearns

References

Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation 75(2):87-106.

Angluin, D. (1988). Queries and concept learning. Machine Learning 2(4):319-342.

Blumer, A., A. Ehrenfeucht, D. Haussler, and M. Warmuth. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36(4):929-965.

Freund, Y., and R. Schapire. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119-139.

Kearns, M., and Y. Mansour. (1996). On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing. Forthcoming in Journal of Computer and System Sciences.

Kearns, M., and L. G. Valiant. (1994). Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM 41(1):67-95.

Kearns, M., and U. Vazirani. (1994). An Introduction to Computational Learning Theory. Cambridge: MIT Press.

Kushilevitz, E., and Y. Mansour. (1993). Learning decision trees using the Fourier spectrum. SIAM Journal on Computing 22(6):1331-1348.

Littlestone, N. (1988). Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2:285-318.

Pitt, L., and L. G. Valiant. (1988). Computational limitations on learning from examples. Journal of the ACM 35:965-984.

Schapire, R. (1990). The strength of weak learnability. Machine Learning 5(2):197-227.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM 27(11):1134-1142.

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.