Parsimony and Simplicity

The law of parsimony, or Ockham's razor (also spelled "Occam"), is named after William of Ockham (1285-1347/49). His statement that "entities are not to be multiplied beyond necessity" is notoriously vague. What counts as an entity? For what purpose are entities necessary? Most agree that the aim of postulating entities is to represent reality, or to "get at the truth" in some sense. But when are entities postulated beyond necessity?

Parsimony and simplicity play an important role in INDUCTION, LEARNING, and STATISTICAL LEARNING THEORY, and in the debate about RATIONALISM VS. EMPIRICISM. However, Ockham's razor is best known for its role in scientific theorizing, where it has two aspects. First, there is the idea that one should not postulate entities that make no observable difference. For example, Gottfried Wilhelm Leibniz objected to Isaac Newton's absolute space because one absolute velocity of the solar system would produce the same observable behavior as any other absolute velocity.

The second aspect of Ockham's razor is that the number of postulated entities should be minimized. One of the earliest known examples is Copernicus's (1473-1543) argument for his stationary-sun theory of planetary motion: it endowed one cause (the motion of the earth around the sun) with many effects (the apparent motions of the planets). In contrast, his predecessor Ptolemy unwittingly duplicated the earth's motion many times in order to "explain" the same effects. Newton's version of Ockham's razor appeared in his first and second rules of philosophizing: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. Therefore, to the same natural effects we must, as far as possible, assign the same causes."

Both aspects arise in modern empirical science, including psychology. When data are modeled by an equation or set of equations, the equations are usually designed so that all theoretical parameters can be uniquely estimated from the data using standard statistical estimation techniques (such as the method of least squares, or maximum likelihood estimation). When this condition is satisfied, the parameters are said to be identifiable. This practice ensures the satisfaction of Ockham's razor in its first aspect. For example, suppose we have a set of data consisting of n pairs of (x, y) values: {(x1, y1), (x2, y2), ..., (xn, yn)}. The model yi = a + bxi, for i = 1, 2, ..., n, is identifiable because the parameters a and b can be uniquely estimated from sufficiently varied data. But a model like yi = a + (b + c)xi is not identifiable, because many pairs of values of b and c fit the data equally well. Different parameter values make no empirical difference.
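
A minimal numerical sketch of this contrast (assuming Python with NumPy and simulated data; the numbers are invented for illustration): least squares returns a unique estimate of (a, b) in the first model, whereas in the second model different values of b and c with the same sum make exactly the same predictions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 20)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)   # "true" a = 2, b = 3, plus noise

    # Identifiable model: yi = a + b*xi.  For sufficiently varied x the design
    # matrix [1, x] has full column rank, so least squares gives a unique (a, b).
    X = np.column_stack([np.ones_like(x), x])
    (a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(a_hat, b_hat)

    # Nonidentifiable model: yi = a + (b + c)*xi.  Only the sum b + c is
    # constrained by the data; (b, c) = (3, 0) and (b, c) = (1, 2) yield
    # exactly the same predictions, so the data cannot tell them apart.
    pred_1 = a_hat + (3.0 + 0.0) * x
    pred_2 = a_hat + (1.0 + 2.0) * x
    print(np.allclose(pred_1, pred_2))   # True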

Normally, this desideratum is so natural and commonsensical that nonidentifiable models are not used. B. F. Skinner resisted the introduction of intervening variables in BEHAVIORISM for this reason. However, nonidentifiable models do arise in NEURAL NETWORKS (see also COGNITIVE MODELING, CONNECTIONIST). In the simplest possible two-layered network, one would have one input neuron, or node, with activation x, one hidden node, with activation y, and one output node, with activation z, where the output is a function of the hidden node activation, which is in turn a function of the input activation. In a simple linear network, this would mean that z = ay and y = bx, where a and b are the connection weights between the layers. The connection weights are the parameters of the model. But the hidden activations are not observed, and so the only testable consequence of the model is the input-output function z = (ab)x. Different pairs of values of the parameters a and b lead to the same input-output function. Therefore, the model is not identifiable.
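
The same point in a minimal sketch (assuming Python with NumPy; the weights are arbitrary): any two weight settings with the same product ab produce identical input-output behavior.

    import numpy as np

    x = np.linspace(-1.0, 1.0, 5)       # input activations

    def output(a, b, x):
        y = b * x                       # hidden activation (not observed)
        z = a * y                       # output activation
        return z

    # Two different weight settings with the same product a*b = 6 produce
    # identical outputs for every input, so (a, b) is not identifiable.
    z_1 = output(a=2.0, b=3.0, x=x)
    z_2 = output(a=6.0, b=1.0, x=x)
    print(np.allclose(z_1, z_2))        # True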

Perhaps the more difficult problem is to understand how to draw the line in cases in which extra parameters make a difference, but only a very small one. This is the second aspect of Ockham's razor. For example, how do we select between competing models like y = a + bx1 and y = a + bx1 + cx2, where a, b, and c are adjustable parameters that range over a set of possible values? Each equation represents a different model, which may be thought of as a family of curves. Under one common notion of simplicity, the first model is simpler than the second because it has fewer adjustable parameters. Simplicity is measured by the size, or dimension, of the family of curves. (Note that, on this definition, models are of greater or lesser simplicity, but all individual curves are equally simple because their equations have zero adjustable parameters.)

How does one decide when an additional parameter makes "enough" of an empirical difference to justify its inclusion, or when an additional parameter is "beyond necessity"? If the choice is among models that fit the data equally well (where the fit of a model is given by the fit of its best case), then the answer is that simplicity should break the tie. But in practice, competing models do not fit equally well. For instance, when one model is a special case of, or nested in, another (as in the previous example), the more complex model will always fit better (if only because it is able to better fit the noise in the data).
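
A small simulation of this point (assuming Python with NumPy, data in which the extra variable x2 is in fact irrelevant, and fit measured by the residual sum of squares of each model's best-fitting curve):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    x1 = rng.uniform(0.0, 10.0, n)
    x2 = rng.uniform(0.0, 10.0, n)
    y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)   # data generated with c = 0

    def best_fit_rss(X, y):
        # Residual sum of squares of the least-squares (best-fitting) member
        # of the family of curves with design matrix X.
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    ones = np.ones(n)
    rss_simple = best_fit_rss(np.column_stack([ones, x1]), y)        # y = a + b*x1
    rss_complex = best_fit_rss(np.column_stack([ones, x1, x2]), y)   # y = a + b*x1 + c*x2

    # Even though x2 is irrelevant, the extra parameter lets the complex model
    # absorb some of the noise, so its best case never fits worse.
    print(rss_simple, rss_complex, rss_complex <= rss_simple)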

So, the real question is: How much better must the complex model fit before we say that the extra parameter is necessary? Or, when should the better fit of the complex model be "explained away" as arising from the greater tendency of complex models to fit noise? How do we trade off fit with simplicity? That is the motivation for standard significance testing in statistics. Notice that significance testing does not always favor the simpler model. Nor is the practice motivated by any belief in the simplicity of nature. In fact, when enough data accumulate, the choice will eventually favor the complex model, even if the added parameters have very small (but nonzero) values.
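
A sketch of such a test (assuming Python with NumPy and SciPy, Gaussian errors, and the standard F-test for nested linear models; the data and the small value of c are invented for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n = 50
    x1 = rng.uniform(0.0, 10.0, n)
    x2 = rng.uniform(0.0, 10.0, n)
    y = 1.0 + 2.0 * x1 + 0.05 * x2 + rng.normal(scale=1.0, size=n)   # tiny but nonzero c

    def best_fit_rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    ones = np.ones(n)
    rss_1 = best_fit_rss(np.column_stack([ones, x1]), y)        # simple model, 2 parameters
    rss_2 = best_fit_rss(np.column_stack([ones, x1, x2]), y)    # complex model, 3 parameters

    # F-statistic: is the gain in fit larger than what the complex model's
    # extra freedom to fit noise would lead us to expect anyway?
    F = ((rss_1 - rss_2) / (3 - 2)) / (rss_2 / (n - 3))
    p_value = stats.f.sf(F, 3 - 2, n - 3)
    print(F, p_value)   # a large p-value "explains away" the improvement as noise;
                        # as n grows, even c = 0.05 eventually looks significant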

In recent years, many new model selection criteria have been developed in statistics, all of which define simplicity in terms of the paucity of parameters, or the dimension of a model (see Forster and Sober 1994 for a nontechnical introduction). These include Akaike's Information Criterion (AIC; Akaike 1974, 1985), the Bayesian Information Criterion (BIC; Schwarz 1978), and MINIMUM DESCRIPTION LENGTH (MDL; Rissanen 1989). They trade off simplicity and fit a little differently, but all of them address the same problem as significance testing: Which of the estimated "curves" from competing models best represents reality? This work has led to a clear understanding of why this form of simplicity is relevant to that question.
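
A rough sketch of how two of these criteria behave on the earlier pair of models (assuming Python with NumPy and the standard Gaussian least-squares forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), where k is the number of adjustable parameters and lower scores are preferred):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 50
    x1 = rng.uniform(0.0, 10.0, n)
    x2 = rng.uniform(0.0, 10.0, n)
    y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)   # true c = 0

    def best_fit_rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    def aic(rss, n, k):
        return n * np.log(rss / n) + 2 * k          # fixed penalty per parameter

    def bic(rss, n, k):
        return n * np.log(rss / n) + k * np.log(n)  # penalty grows with sample size

    ones = np.ones(n)
    rss_simple = best_fit_rss(np.column_stack([ones, x1]), y)
    rss_complex = best_fit_rss(np.column_stack([ones, x1, x2]), y)

    # The complex model always has the smaller RSS, but it pays 2 (AIC) or
    # log(n) (BIC) per extra parameter; here the simpler model typically wins.
    print(aic(rss_simple, n, 2), aic(rss_complex, n, 3))
    print(bic(rss_simple, n, 2), bic(rss_complex, n, 3))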

However, the paucity of parameters is a limited notion of simplicity. It does not mark a difference in simplicity between a wiggly curve and a straight curve. Nor does it capture the idea that simpler theories posit fewer fundamental principles or laws. Nor does it reward the repeated use of equations of the same form. A natural response is to insist that there must be other kinds of simplicity or unification that are relevant to theory choice. But there are well-known problems in defining these alternative notions of simplicity (e.g., Priest 1976; Kitcher 1976). Moreover, there are no precise proposals about how these notions of simplicity are traded off with fit. Nor are there any compelling ideas about why such properties should count in favor of one theory being closer to the truth than another.

-- Malcolm R. Forster

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control AC-19:716-723.

Akaike, H. (1985). Prediction and entropy. In A. C. Atkinson and S. E. Fienberg, Eds., A Celebration of Statistics. New York: Springer, pp. 1-24.

Forster, M. R., and E. Sober. (1994). How to tell when simpler, more unified, or less ad hoc theories will provide more accurate predictions. British Journal for the Philosophy of Science 45:1-35.

Kitcher, P. (1976). Explanation, conjunction and unification. Journal of Philosophy 73:207-212.

Priest, G. (1976). Gruesome simplicity. Philosophy of Science 43:432-437.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6:461-465.

Further Readings

Forster, M. R. (1995). The curve-fitting problem. In R. Audi, Ed., The Cambridge Dictionary of Philosophy. Cambridge: Cambridge University Press.

Forster, M. R. (1999). The new science of simplicity. In H. A. Keuzenkamp, M. McAleer, and A. Zellner, Eds., Simplicity, Inference and Econometric Modelling. Cambridge: Cambridge University Press.

Gauch, H. G., Jr. (1993). Prediction, parsimony and noise. American Scientist 81:468-478.

Geman, S., E. Bienenstock, and R. Doursat. (1992). Neural networks and the bias/variance dilemma. Neural Computation 4:1-58.

Jefferys, W., and J. Berger. (1992). Ockham's razor and Bayesian analysis. American Scientist 80:64-72.

Popper, K. (1959). The Logic of Scientific Discovery. London: Hutchinson.

Sakamoto, Y., M. Ishiguro, and G. Kitagawa. (1986). Akaike Information Criterion Statistics. Dordrecht: Kluwer.

Sober, E. (1990). Let's razor Ockham's razor. In D. Knowles, Ed., Explanation and Its Limits. Royal Institute of Philosophy supp. vol. 27. Cambridge: Cambridge University Press, pp. 73-94.