Decision Trees

A decision tree is a graphical representation of a procedure for classifying or evaluating an item of interest. For example, given a patient's symptoms, a decision tree could be used to determine the patient's likely diagnosis, expected outcome, or recommended treatment. Figure 1 shows a decision tree for forecasting whether a patient will die from hepatitis, based on data from the University of California at Irvine repository (Murphy and Aha 1994). A decision tree represents a function that maps each element of its domain to an element of its range, which is typically a class label or numerical value. At each leaf of a decision tree, one finds an element of the range. At each internal node of the tree, one finds a test that has a small number of possible outcomes. By branching according to the outcome of each test, one arrives at a leaf that contains the class label or numerical value for the item at hand. In the figure, each leaf shows the number of examples of each class that fall to that leaf. The instances at a leaf are usually not all of one class, so one typically labels the leaf with the most frequently occurring class.
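
The branching procedure can be made concrete with a small sketch in Python. The nested-tuple tree representation, the test variables, and the patient record below are assumptions chosen for illustration; they do not reproduce the tree of figure 1.

```python
# A minimal sketch of classification with a decision tree. Internal nodes are
# ('test', variable, {outcome: subtree}); leaves are ('leaf', label).
# The example tree and patient record are hypothetical.

def classify(tree, instance):
    """Follow test outcomes from the root until a leaf's label is reached."""
    while tree[0] == 'test':
        variable, outcomes = tree[1], tree[2]
        tree = outcomes[instance[variable]]
    return tree[1]

tree = ('test', 'ascites', {
    'no':  ('leaf', 'live'),
    'yes': ('test', 'varices', {
        'no':  ('leaf', 'die'),
        'yes': ('leaf', 'live')})})

patient = {'ascites': 'yes', 'varices': 'no'}
print(classify(tree, patient))   # prints 'die'
```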


Figure 1. A decision tree for forecasting whether a patient will die from hepatitis (UCI hepatitis data; Murphy and Aha 1994).

A decision tree with a range of discrete (symbolic) class labels is called a classification tree, whereas a decision tree with a range of continuous (numeric) values is called a regression tree. A domain element is called an instance or an example or a case, or sometimes by another name appropriate to the context. An instance is represented as a conjunction of variable values. Each variable has its own domain of possible values, typically discrete or continuous. The space of all possible instances is defined by the set of all instances that one could generate from these variables and their possible values (the cross product of the variable domains). For example, three binary variables and one three-valued variable define a space of 2 × 2 × 2 × 3 = 24 possible instances.

Decision trees are attractive because they show clearly how to reach a decision, and because they are easy to construct automatically from labeled instances. Two well-known programs for constructing decision trees are C4.5 (Quinlan 1993) and CART (Classification and Regression Trees) (Breiman et al. 1984). The tree shown in figure 1 was generated by the ITI (Incremental Tree Inducer) program (Utgoff, Berkman, and Clouse 1997). These programs usually make quick work of training data, constructing a tree in a matter of a few seconds to a few minutes. For those who prefer to see a list of rules, there is a simple conversion, which is available in the C4.5 program: for each leaf of the tree, place its label on the right-hand side of a rule, and place on the left-hand side the conjunction of all the test outcomes that must hold to reach that leaf from the root.
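
The conversion can be sketched as follows, again using the illustrative nested-tuple representation rather than C4.5's own data structures; the example tree is hypothetical.

```python
# Convert each root-to-leaf path into a rule: the conjunction of test
# outcomes along the path forms the left-hand side, the leaf label the right.

def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) pair per leaf of the tree."""
    if tree[0] == 'leaf':
        return [(list(conditions), tree[1])]
    _, variable, branches = tree
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + (f"{variable} = {outcome}",)))
    return rules

example = ('test', 'ascites', {
    'no':  ('leaf', 'live'),
    'yes': ('test', 'varices', {
        'no':  ('leaf', 'die'),
        'yes': ('leaf', 'live')})})

for conditions, label in tree_to_rules(example):
    print('IF ' + ' AND '.join(conditions) + ' THEN ' + label)
```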

Decision trees are useful for automating decision processes that are part of an application program. For example, for the optical character recognition (OCR) task, one needs to map the optical representation of a symbol to a symbol name. The optical representation might be a grid of pixel values. The tree could attempt to map these pixel values to a symbol name. Alternatively, the designer of the system might include the computation of additional variables, also called features, that make the mapping process simpler. Decision trees are used in a large number of applications, and the number continues to grow as practitioners gain experience in using trees to model decision making processes. Present applications include various pixel classification tasks, language understanding tasks such as pronoun resolution, fault diagnosis, control decisions in search, and numerical function approximation.

A decision tree is typically constructed recursively in a top-down manner (Friedman 1977; Quinlan 1986). If a set of labeled instances is sufficiently pure, then the tree is a leaf, with the assigned label being that of the most frequently occurring class in that set. Otherwise, a test is constructed and placed at an internal node, which serves as the root of the tree built so far. The test defines a partition of the instances according to the outcome of the test as applied to each instance. A branch is created for each block of the partition, and for each block a decision tree is constructed recursively.
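
The recursion can be sketched as below, assuming instances are (attributes, label) pairs with discrete attribute values; the test-selection heuristic is supplied as a function (the information-gain sketch later in this article is one candidate), and none of this is taken from a particular induction program.

```python
# Top-down induction of a classification tree in the tuple form used above.

from collections import Counter

def majority_label(instances):
    return Counter(label for _, label in instances).most_common(1)[0][0]

def is_pure(instances):
    return len({label for _, label in instances}) == 1

def build_tree(instances, attributes, choose_test):
    """Recursively grow a tree; choose_test(instances, attributes) returns the
    attribute to test at this node, or None when no useful test remains."""
    if is_pure(instances) or not attributes:
        return ('leaf', majority_label(instances))
    attribute = choose_test(instances, attributes)
    if attribute is None:
        return ('leaf', majority_label(instances))
    # Partition the instances by their value for the chosen attribute.
    blocks = {}
    for attrs, label in instances:
        blocks.setdefault(attrs[attribute], []).append((attrs, label))
    remaining = [a for a in attributes if a != attribute]
    branches = {value: build_tree(block, remaining, choose_test)
                for value, block in blocks.items()}
    return ('test', attribute, branches)
```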

One needs to define when a set of instances is to be considered sufficiently pure to constitute a leaf. One choice would be to require absolute purity, meaning that all the instances must be of the same class. Another choice would be to require only that the class distribution be significantly lopsided, a less stringent requirement than the complete lopsidedness of a pure leaf. This is also known as prepruning, because the growth of the tree is restricted before it occurs.
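
One concrete, if arbitrary, way to cash out "significantly lopsided" is to stop splitting once the majority class covers a fixed fraction of the instances; the 0.95 threshold below is purely illustrative, not a recommended value.

```python
# Stop splitting when the most frequent class already accounts for at least
# a given fraction of the instances at the node.

from collections import Counter

def sufficiently_pure(labels, threshold=0.95):
    """Return True if the majority class covers at least `threshold` of labels."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels) >= threshold

print(sufficiently_pure(['live'] * 19 + ['die']))      # True  (19/20 = 0.95)
print(sufficiently_pure(['live'] * 3 + ['die'] * 2))   # False (3/5 = 0.6)
```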

One also needs a method for constructing and selecting a test to place at an internal node. If the test is to be based on just one variable, called a univariate test, then one needs to be able to enumerate possible tests based on that variable. If the variable is discrete, then the possible outcomes could be the possible values of that variable. Alternatively, a test could ask whether the variable has a particular value, giving just two possible outcomes, as is the case in figure 1. If the variable is continuous, then some form of discretization needs to be done, so that only a manageable number of outcomes is possible. One can accomplish this by searching for a cutpoint and then forming a test of whether the variable's value is less than the cutpoint, as shown in figure 1.
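
A cutpoint search of this kind can be sketched as follows, assuming numeric values paired with class labels; Gini impurity is used to score candidate splits here only as one common choice among several.

```python
# Search for the cutpoint whose "value < cutpoint" test yields the purest
# two-block partition, scoring each candidate by weighted Gini impurity.

from collections import Counter

def gini(labels):
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_cutpoint(values, labels):
    """Return (cutpoint, weighted_impurity) minimizing the impurity of the split."""
    pairs = sorted(zip(values, labels))
    best = (None, float('inf'))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for val, lab in pairs if val < cut]
        right = [lab for val, lab in pairs if val >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (cut, score)
    return best

# A perfectly separable variable: the best cutpoint falls at 2.5.
print(best_cutpoint([1.0, 2.0, 3.0, 4.0], ['live', 'live', 'die', 'die']))
```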

If the test is to be based on more than one variable, called a multivariate test, then one needs to be able to search quickly for a suitable test. This is often done by mapping the discrete variables to continuous variables, and then finding a good linear combination of those variables. A univariate test is also known as an axis-parallel split because in a geometric view of the instance space, the partition formed by a univariate test is parallel to the axes of the other variables. A multivariate test is also known as an oblique split because it need not have any particular relationship to the axes (Murthy, Kasif, and Salzberg 1994).
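
One simple way to obtain such a linear combination, sketched below, is to fit a linear discriminant by least squares and split on its sign; systems such as OC1 (Murthy, Kasif, and Salzberg 1994) search for the split more carefully, so this is only an illustration, and the example data are hypothetical.

```python
# Fit w and b by least squares so that the oblique test is  w . x + b >= 0.

import numpy as np

def oblique_split(X, y):
    """X: (n, d) array of numeric attribute values; y: array of +1/-1 labels."""
    # Append a constant column so the intercept b is fit along with w.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

# Two classes that no single-variable threshold separates, but the oblique
# test x1 + x2 >= 4 does.
X = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = oblique_split(X, y)
print(np.sign(X @ w + b))   # reproduces y: [-1. -1.  1.  1.]
```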

One must choose the best test from among those that are allowed at an internal node. This is typically done in a greedy manner by ranking the candidate tests according to a heuristic function and picking the test that is ranked best. Many heuristic functions have been suggested, and this problem is still being studied. For classification trees, most are based on entropy minimization. By picking a test that maximizes the purity of the blocks, one will probably obtain a smaller tree than otherwise, and researchers and practitioners alike have a longstanding preference for smaller trees. Popular heuristic functions include information gain, gain ratio, the Gini index, and Kolmogorov-Smirnov distance. For regression trees, most heuristics are based on variance minimization. A test that minimizes the variance within the resulting blocks will also tend to produce a smaller tree than one would obtain otherwise.
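
For example, information gain is the entropy of the class labels at a node minus the weighted entropy of the blocks that a candidate test would produce; a sketch follows, with a hypothetical two-class node.

```python
# Information gain: entropy of the node's labels minus the size-weighted
# entropy of the blocks induced by a candidate test.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, blocks):
    """labels: all class labels at the node; blocks: lists of labels, one per outcome."""
    total = len(labels)
    remainder = sum(len(b) / total * entropy(b) for b in blocks)
    return entropy(labels) - remainder

labels = ['die'] * 4 + ['live'] * 4
# A perfectly informative binary test earns the full entropy of the node.
print(information_gain(labels, [['die'] * 4, ['live'] * 4]))                 # 1.0
# A useless test that leaves each block as mixed as the node earns nothing.
print(information_gain(labels, [['die', 'live'] * 2, ['die', 'live'] * 2]))  # 0.0
```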

It is quite possible that a tree will overfit the data. The tree may have more structure than is helpful, because it attempts to produce several purer blocks where one less pure block would give higher accuracy on unlabeled instances (instances not used in training). This can come about due to inaccurate variable measurements or inaccurate label or value assignments. A host of postpruning methods are available that reduce the size of the tree after it has been grown. A simple method is to set aside some of the training instances, called the pruning set, before building the tree. Then, after the tree has been built, do a postorder traversal of the tree, reducing each subtree to a leaf if the proposed leaf would not be significantly less accurate on the pruning set than the subtree it would replace. This issue of balancing the desire for purity with the desire for accuracy is also known as the bias-variance trade-off: a smaller tree has higher bias because the partition is coarser, but lower variance because each leaf is based on more training instances.
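
The pruning-set method, often called reduced-error pruning, can be sketched as follows; the tuple tree representation and (attributes, label) instance pairs match the earlier sketches and are assumptions here, not the procedure of any particular pruning program.

```python
# Postorder traversal: collapse a subtree to a leaf labeled with the majority
# training class whenever the leaf does no worse on the pruning set.

from collections import Counter

def classify(tree, attrs):
    while tree[0] == 'test':
        tree = tree[2][attrs[tree[1]]]
    return tree[1]

def errors(tree, instances):
    return sum(classify(tree, attrs) != label for attrs, label in instances)

def prune(tree, training_set, pruning_set):
    if tree[0] == 'leaf' or not training_set:
        return tree
    _, attribute, branches = tree
    # Recurse first, routing each instance set down its matching branch.
    new_branches = {}
    for value, subtree in branches.items():
        train_block = [(a, l) for a, l in training_set if a[attribute] == value]
        prune_block = [(a, l) for a, l in pruning_set if a[attribute] == value]
        new_branches[value] = prune(subtree, train_block, prune_block)
    candidate = ('test', attribute, new_branches)
    leaf = ('leaf', Counter(l for _, l in training_set).most_common(1)[0][0])
    return leaf if errors(leaf, pruning_set) <= errors(candidate, pruning_set) else candidate
```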

During the mid-1990s, researchers developed methods for using ensembles of decision trees to improve accuracy (Dietterich and Bakiri 1995; Kong and Dietterich 1995; Breiman 1996). To the extent that different decision trees for the same task make independent errors, a vote of the set of decision trees can correct the errors of the individual trees.
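
In the spirit of bagging (Breiman 1996), for instance, each tree can be grown on a bootstrap resample of the training data and the trees then vote; the sketch below leaves the tree inducer and classifier as parameters (the earlier sketches would serve), and the choice of 25 trees is arbitrary.

```python
# Grow an ensemble on bootstrap resamples and label instances by majority vote.

import random
from collections import Counter

def bagged_ensemble(training_set, grow, n_trees=25, seed=0):
    """Return n_trees trees, each grown on a bootstrap resample of the data."""
    rng = random.Random(seed)
    return [grow([rng.choice(training_set) for _ in training_set])
            for _ in range(n_trees)]

def ensemble_vote(trees, instance, classify):
    """Label the instance with the class predicted by the most trees."""
    return Counter(classify(tree, instance) for tree in trees).most_common(1)[0][0]
```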


-- Paul Utgoff

References

Breiman, L. (1996). Bagging predictors. Machine Learning 24:123-140.

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.

Dietterich, T. G., and G. Bakiri. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2:263-286.

Friedman, J. H. (1977). A recursive partitioning decision rule for nonparametric classification. IEEE Transactions on Computers, C-26:404-408.

Kong, E. B., and T. G. Dietterich. (1995). Error-correcting output coding corrects bias and variance. Machine Learning: Proceedings of the Twelfth International Conference. Tahoe City, CA: Morgan Kaufmann, pp. 313-321.

Murphy, P. M., and D. W. Aha. (1994). UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.

Murthy, S. K., S. Kasif, and S. Salzberg. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research 2:1-32.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning 1:81-106.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Utgoff, P. E., N. C. Berkman, and J. A. Clouse. (1997). Decision tree induction based on efficient tree restructuring. Machine Learning 29:5-44.

Further Readings

Brodley, C. E., and P. E. Utgoff. (1995). Multivariate decision trees. Machine Learning 19:45-77.

Buntine, W., and T. Niblett. (1992). A further comparison of splitting rules for decision-tree induction. Machine Learning 8:75-85.

Chou, P. A. (1991). Optimal partitioning for classification and regression trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 13:340-354.

Draper, B. A., C. E. Brodley, and P. E. Utgoff. (1994). Goal-directed classification using linear machine decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 16:888-893.

Jordan, M. I. (1994). A statistical approach to decision tree modeling. Machine Learning: Proceedings of the Eleventh International Conference. New Brunswick, NJ: Morgan Kaufmann, pp. 363-370.

Moret, B. M. E. (1982). Decision trees and diagrams. Computing Surveys 14:593-623.

Murphy, P., and M. Pazzani. (1994). Exploring the decision forest: An empirical investigation of Occam's Razor in decision tree induction. Journal of Artificial Intelligence Research 1:257-275.

Pagallo, G., and D. Haussler. (1990). Boolean feature discovery in empirical learning. Machine Learning 5:71-99.

Quinlan, J. R., and R. L. Rivest. (1989). Inferring decision trees using the minimum description length principle. Information and Computation 80:227-248.

Safavian, S. R., and D. Landgrebe. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics 21:660-674.

Utgoff, P. E. (1989). Incremental induction of decision trees. Machine Learning 4:161-186.

White, A. P., and W. Z. Liu. (1994). Bias in information-based measures in decision tree induction. Machine Learning 15:321-329.