*
Minimum message length* (MML) is a criterion for comparing competing
theories about, or inductive inferences from, a given body of data.
A very similar criterion, also to be described, is *minimum description length* (MDL).
The basic concept behind both criteria is an operational form of Occam's
razor (see PARSIMONY AND SIMPLICITY). A "good" theory
induced from some data should enable the data to be encoded briefly.
In human terms, a good theory introduces concepts, assumptions,
and inference rules or "laws" that, if taken as
true, allow much of the data to be deduced.

For example, suppose the data to be measurements of the forces applied to several physical bodies, and their resulting accelerations, where each body is subjected to a number of experiments using different forces. If we propose the concept that each body has a "mass," and the law that acceleration is given by force divided by mass, then the given data may be restated more briefly. If for each body, we state an assumed value for its mass, then for each experiment on that body we need only state the applied force because its acceleration can then be deduced. In practice, matters are a little more complicated: we cannot expect the measured acceleration in each case to equal the deduced value exactly because of inaccuracies of measurement (and because our proposed "law" is not quite right). Thus, for each experiment, the restatement of the data must include a statement of the small amount by which the measured acceleration differs from the deduced value, but if these corrections are sufficiently small, writing them out will need much less space than writing out the original data values.

Note that the restated data are unintelligible to a reader who does not know the "law" we have induced and the body masses we have estimated from the original data. Thus we insist that the restated data be preceded by a statement of whatever theory, laws, and quantities we have inferred and assumed in the restating.

This leads to the idea of a special
form of message for encoding data, termed an EXPLANATION.
An *
explanation* is a message that first states a theory (or hypothesis)
about the data, and perhaps also some values estimated for unobserved
concepts in the theory (e.g., the masses of the bodies in our example),
and that then states the data in a short encoding that assumes the
theory and values are correct. Equivalently, the second part of
the message states all those details of the data which cannot be deduced
from the first part of the message.

The MML principle is to prefer that theory which leads to the shortest explanation of the data. As the explanation contains a statement of the theory as well as the data encoded by assuming the theory, overly complex theories will not give the shortest explanations. A complex theory, by implying much about the data, gives a short second part, but a long first part. An overly simple theory needs only a short first part for its assertion, but has few or imprecise implications for the data, needing a long second part.

Message length can be quantified using INFORMATION THEORY: An event of probability P can be encoded in - log P binary digits, or bits, using base 2 logs. Hence the length of the second part of an explanation is just - log (probability of data given the theory). Also, if a Bayesian "prior probability" over theories is assumed, the length of the first part is - log (prior probability of theory). The shortest explanation then yields the theory of highest posterior probability, as do some other Bayesian methods. However, MML actually achieves a shorter message by choosing a code for theories that, in general, does not provide for all possible theories. Theories so similar that the available data cannot be expected to reliably distinguish between them are amalgamated and represented in the code by a single theory, credited with the sum of their prior probabilities. In particular, when a theory includes a real-valued parameter, only a subset of the possible values for the parameter is allowed for in the code. Although the prior over the parameter must be a density, this amalgamation gives a nonzero prior probability to each of the subset of parameter values. This makes MML invariant under transformations of the parameter space, unlike Bayesian maximum A-posterior density (MAP) methods.

MDL differs from MML in replacing
Bayesian coding of theories with coding deemed to be efficient and "natural." In
practice, there is usually little difference in the two approaches,
but in some cases MML gives better parameter estimates than MDL,
which usually relies on *maximum likelihood* estimates.

General theoretical results show MML optimally separates information about general patterns in the data, which appears in the first part, from patternless "noise," which appears in the second part. The method is statistically consistent: if the "true" model of the data is in the set of theories considered, enough data will reveal it. It also is not misleading: If there is no pattern in the data, no theory-based explanation will be shorter than the original statement of the data, thus no theory will be inferred. Successful applications of MML in MACHINE LEARNING include clustering, DECISION TREES, factor analysis, ARMA processes, function estimation, and BAYESIAN NETWORKS.

Message length can also be quantified via Kolmogorov complexity: The information in a given binary string is the length of the shortest input to a TURING machine causing output of the given string. This approach is equivalent to MML, with the set of possible "theories" including all computable functions. Practical applications, while very general, are limited by the undecidability of the "halting problem," although such quantification may provide a basis for a descriptive account of scientific process, which seems to evolve and retain those theories best able to give short explanations.

- On Beauty and Minimum Description Length
- The Minimum Description Length Principle Applied to Feature Learning and Analogical Mapping

Barron, A. R., and T. M. Cover. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory 37:1034-1054.

Dowe, D. L., K. B. Korb, and J. J. Oliver, Eds. (1996). ISIS: Information, Statistics and Induction in Science. Singapore: World Scientific.

Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission 1:4-7.

Li, M., and P. M. B. Vitanyi. (1997). An Introduction to Kolmagorov Complexity and its Applications. 2nd ed. New York: Springer.

Rissanen, J. (1978). Modeling by shortest data description. Automatica 14:465-471.

Rissanen, J., C. S. Wallace, and P. R. Freeman. (1987). Stochastic complexity: Estimation and inference by compact coding. J. Statist. Soc. B49(3):223-265.

Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

Solomonoff, R. J. (1964). A formal theory of inductive inference, parts 1 and 2. Information and Control 7:1-22, 224 - 254.

Wallace, C. S., and D. M. Boulton. (1968). An information mea sure for classification. Computer Journal 11:185-194.