Machine Learning

The goal of machine learning is to build computer systems that can adapt and learn from their experience. Different learning techniques have been developed for different performance tasks. The primary tasks that have been investigated are SUPERVISED LEARNING for discrete decision-making, supervised learning for continuous prediction, REINFORCEMENT LEARNING for sequential decision making, and UNSUPERVISED LEARNING.

The best-understood task is one-shot decision making: the computer is given a description of an object (event, situation, etc.) and it must output a classification of that object. For example, an optical character recognizer must input a digitized image of a character and output the name of that character ("A" through "Z"). A machine learning approach to constructing such a system would begin by collecting training examples, each consisting of a digitized image of a character and the correct name of that character. These would be analyzed by a learning ALGORITHM to produce an optical character recognizer for classifying new images.

Machine learning algorithms search a space of candidate classifiers for one that performs well on the training examples and is expected to generalize well to new cases. Learning methods for classification problems include DECISION TREES, NEURAL NETWORKS, rule-learning algorithms (Cohen 1995), nearest-neighbor methods (Dasarathy 1991), and certain kinds of BAYESIAN NETWORKS.

There are four questions to answer when developing a machine learning system:

  1. How is the classifier represented?
  2. How are the examples represented?
  3. What objective function should be employed to evaluate candidate classifiers?
  4. What search algorithm should be used?

Let us illustrate these four questions using two of the most popular learning algorithms, C4.5 and backpropagation.

The C4.5 algorithm (Quinlan 1993) represents a classifier as a decision tree. Each example is represented as a vector of features. For example, one feature describing a printed character might be whether it has a long vertical line segment (such as the letters B, D, E, etc.).

Each node in the decision tree tests the value of one of the features and branches to one of its children, depending on the result of the test. A new example is classified by starting at the root of the tree and applying the test at that node. If the test is true, it branches to the left child; otherwise, it branches to the right child. The test at the child node is then applied, recursively, until one of the leaf nodes of the tree is reached. The leaf node gives the predicted classification of the example.

C4.5 searches the space of decision trees through a constructive search. It first considers all trees consisting of only a single root node and chooses one of those. Then it considers all trees having that root node and various left children, and chooses one of those, and so on. This process constructs the tree incrementally with the goal of finding the decision tree that minimizes the so-called pessimistic error, which is an estimate of classification error of the tree on new training examples. It is based on taking the upper endpoint of a confidence interval for the error of the tree (computed separately for each leaf).

Although C4.5 constructs its classifier, other learning algorithms begin with a complete classifier and modify it. For example, the backpropagation algorithm for learning neural networks begins with an initial neural network and computes the classification error of that network on the training data. It then makes small adjustments in the weights of the network to reduce this error. This process is repeated until the error is minimized.

There are two fundamentally different theories of machine learning. The classical theory takes the view that, before analyzing the training examples, the learning algorithm makes a "guess" about an appropriate space of classifiers to consider (e.g., it guesses that decision trees will be better than neural networks). The algorithm then searches the chosen space of classifiers hoping to find a good fit to the data. The Bayesian theory takes the view that the designer of a learning algorithm encodes all of his or her prior knowledge in the form of a prior probability distribution over the space of candidate classifiers. The learning algorithm then analyzes the training examples and computes the posterior probability distribution over the space of classifiers. In this view, the training data serve to reduce our remaining uncertainty about the unknown classifier.

These two theories lead to two different practical approaches. The first theory encourages the development of large, flexible hypothesis spaces (such as decision trees and neural networks) that can represent many different classifiers. The second theory encourages the development of representational systems that can readily express prior knowledge (such as Bayesian networks and other stochastic models).

The discussion thus far has focused on discrete classification, but the same issues arise for the second learning task: supervised learning for continuous prediction (also called "regression"). In this task, the computer is given a description of an object and it must output a real-valued quantity. For example, given a description of a prospective student (high school grade-point average, SAT scores, etc.), the system must predict the student's college grade-point average (GPA). The machine learning approach is the same: a collection of training examples describing students and their college GPAs is provided to the learning algorithm, which outputs a predictor to predict college GPA. Learning methods for continuous prediction include neural networks, regression trees (Breiman et al. 1984), linear and additive models (Hastie and Tibshirani 1990), and kernel regression methods (Cleveland and Devlin 1988). Classification and prediction are often called "supervised learning" tasks, because the training data include not only the input objects but also the corresponding output values (provided by a "supervisor").

We now turn from supervised learning to reinforcement learning tasks, most of which involve sequential decision making. In these tasks, each decision made by the computer affects subsequent decisions. For example, consider a computer-controlled robot attempting to navigate from a hospital kitchen to a patient's room. At each point in time, the computer must decide whether to move the robot forward, left, right, or backward. Each decision changes the location of the robot, so that the next decision will depend on previous decisions. After each decision, the environment provides a real-value reward. For example, the robot may receive a positive reward for delivering a meal to the correct patient and a negative reward for bumping into walls. The goal of the robot is to choose sequences of actions to maximize its long-term reward. This is very different from the standard supervised learning task, where each classification decision is completely independent of other decisions.

The final learning task we will discuss is unsupervised learning, where the computer is given a collection of objects and is asked to construct a model to explain the observed properties of these objects. No teacher provides desired outputs or rewards. For example, given a collection of astronomical objects, the learning system should group the objects into stars, planets, and galaxies and describe each group by its electromagnetic spectrum, distance from earth, and so on.

Although often called "cluster analysis" (Everitt 1993), unsupervised learning arises in a much wider range of tasks. Indeed, there is no single agreed-upon definition of unsupervised learning, but one useful formulation views unsupervised learning as density estimation. Define a probability distribution P(X) to be the probability that an object X will be observed. The goal of unsupervised learning is to find this probability distribution, given a sample of the Xs. This is typically accomplished by defining a family of possible stochastic models and choosing the model that best accounts for the data.

For example, one might model each group of astronomical objects as having a spectrum that is a multivariate normal distribution (centered at a "typical" spectrum). The probability distribution P(X) describing the whole collection of astronomical objects could then be modeled as a mixture of normal distributions -- one distribution for each group of objects. The learning process determines the number of groups and the mean and covariance matrix of each multivariate distribution. The Autoclass program discovered a new class of astronomical objects in just this way (Cheeseman et al. 1988).

Another widely used unsupervised learning model is the HIDDEN MARKOV MODEL (HMM). An HMM is a stochastic finite-state machine that generates strings. In speech recognition applications, the strings are speech signals, and one HMM is trained for each word in the vocabulary (Rabiner 1989).

Once a stochastic model has been fitted to a collection of objects, that model can be applied to classify new objects. Given a new astronomical object, we can determine which multivariate Gaussian is most likely to have generated it, and we can assign it to the corresponding group. Similarly, given a new speech signal, we can determine which HMM is most likely to have generated it, and thereby guess which word was spoken. A general algorithm schema for training both HMMs and mixture models is the expectation maximization (EM) algorithm (Dempster, Laird, and Rubin 1976).

See also

Additional links

-- Tom Dietterich

References

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. (1984). Classification and Regression Trees. Monterey, CA: Wads-worth.

Cheeseman, P., J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. (1988). Autoclass: A Bayesian classification system. Proceedings of the Fifth International Conference on Machine Learning. San Francisco: Kaufmann, pp. 54-64.

Cleveland, W. S., and S. J. Devlin. (1988). Locally-weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association 83:596-610.

Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning. San Francisco: Kaufmann, pp. 115-123.

Dasarathy, B. V., Ed. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.

Dempster, A. P., N. M. Laird, and D. B. Rubin. (1976). Maximum likelihood from incomplete data via the EM algorithm. Proceedings of the Royal Statistical Society B39:1-38.

Everitt, B. (1993). Cluster Analysis. London: Halsted Press.

Hastie, T. J., and R. J. Tibshirani. (1990). Generalized Additive Models. London: Chapman and Hall.

Quinlan, J. R. (1993). C4.5: Programs for Empirical Learning. San Francisco: Kaufmann.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257-286.

Further Readings

Bishop, C. M. (1996). Neural Networks for Pattern Recognition. Oxford: Oxford University Press.

Mitchell, T. M. (1997). Machine Learning. New York: McGraw-Hill.