A grammar
is a definition of a set of linguistic structures, where the linguistic
structures could be sequences of words (sentences), or "sound-meaning" pairs
(that is, pairs <*s,m*> where *s* is
a representation of phonetic properties and *m* is a representation
of semantic properties), or pairs <*t,p*> where *t* is
a tree and *p* is a probability of occurrence in a discourse.
A *formal grammar,* then, is a grammar that is completely
clear and unambiguous. Obviously, this account of what qualifies
as "formal" is itself neither formal nor rigorous, but
in practice there is little dispute.

It might seem that formalization would always be desirable in linguistic theory, but there is little point in spelling out an informal hypothesis in full detail when its weaknesses can readily be ascertained and addressed without doing so. In fact, there is considerable variation in the degree to which empirical proposals about human grammars are formalized, and there are disputes in the literature about how much formalization is appropriate at this stage of linguistic theory (Pullum 1989; Chomsky 1990).

Given this controversy, and given
the preliminary and changing nature of linguistic theories, formal
studies of grammar have been most significant when they have focused
not on the details of any particular grammar, but rather on the
fundamental properties of various *kinds* of grammars. Taking
this abstract, metagrammatical approach, formal studies have identified
a number of basic properties of grammars that raise new questions
about human languages.

One basic division among the various
ways of defining sets of linguistic structures classifies them as *generative* or *constraint-based.* A GENERATIVE GRAMMAR defines a set of structures by providing some
basic elements and applying rules to derive new elements. There are,
in turn, two basic ways of doing this. The first, "top-down" approach, common
in "formal language theory," involves beginning
with a "category" like "sentence" and applying
rules that define what parts the sentence has, what parts those
parts have, and so on, until the sentence has been specified all
the way down to the level of words. This style of language definition
has proven to be very useful, and many fundamental results have
been established (Harrison 1978; Rozenberg and Salomaa 1997).
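To make the top-down style concrete, here is a minimal sketch in Python; the rules and lexicon are illustrative assumptions, not a serious grammar of English:

```python
import random

# Each category rewrites as a sequence of parts; symbols with no rule
# of their own (like "the" or "cat") are words, where derivation stops.
RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["cat"], ["dog"]],
    "V":   [["saw"]],
}

def derive(category):
    """Expand a category top-down until only words remain."""
    if category not in RULES:
        return [category]                  # a word: the bottom of the derivation
    expansion = random.choice(RULES[category])
    words = []
    for part in expansion:
        words.extend(derive(part))
    return words

print(" ".join(derive("S")))   # e.g. "the cat saw the dog"
```

Every derivation begins with the designated category "S" and ends in a string of words, which is the defining shape of a top-down generative definition.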

A second "bottom-up" approach, more common in CATEGORIAL GRAMMAR and some related traditions, involves starting with some lexical items ("generators") and then applying rules to assemble them into more complex structures. This style of language definition comes from LOGIC and algebra, where certain sets are similarly defined by "closing" a set of basic elements with respect to some generating relations. In these formal grammars, the structures of the defined language are analogous to the theorems of formal logic in that they are derived from some specified basic elements by rigorously specified rules. A natural step from this idea is to treat a grammar explicitly as a logic (Lambek 1958; Moortgat 1997).
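The bottom-up style can be sketched the same way. In the fragment below the lexicon assigns each word a type (the type assignments are illustrative assumptions), and the only rules are the two application schemas of classical categorial grammar:

```python
# Types: a basic category is a string; a functor type is a triple
# (result, slash, argument), so ("S", "\\", "NP") stands for S\NP
# and (("S", "\\", "NP"), "/", "NP") for (S\NP)/NP.
LEXICON = {
    "Mary":   "NP",
    "John":   "NP",
    "sleeps": ("S", "\\", "NP"),
    "likes":  (("S", "\\", "NP"), "/", "NP"),
}

def combine(left, right):
    """Forward application A/B + B => A; backward application B + A\\B => A."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def parse(words):
    """Greedily reduce adjacent types; enough for these small examples."""
    types = [LEXICON[w] for w in words]
    reduced = True
    while reduced and len(types) > 1:
        reduced = False
        for i in range(len(types) - 1):
            result = combine(types[i], types[i + 1])
            if result is not None:
                types[i:i + 2] = [result]
                reduced = True
                break
    return types[0] if len(types) == 1 else None

print(parse(["Mary", "likes", "John"]))   # S
```

The defined language is the closure of the lexicon under `combine`, exactly the algebraic picture of generators closed under relations described above.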

Unlike the generative methods, which
define a language by applying rules to a set of initial elements
of some kind, a *constraint grammar* specifies a set by saying
what properties the elements of that set must have. In this sort
of definition, the structures in the language are not like the (generated,
enumerable) theorems of a logic, but more like the sentences that
could possibly be true (the "satisfiable" sentences
of a logic). This approach to grammar is particularly prominent
in linguistic traditions like HEAD-DRIVEN PHRASE STRUCTURE GRAMMAR (Pollard and Sag 1994). However, most linguistic
theories use both generative and constraint-based specifications
of structure.
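The difference is easy to see in a sketch. A constraint-based definition mentions no derivations at all; it simply tests properties. The feature structures and constraints below are illustrative assumptions:

```python
# Structures are feature dictionaries; constraints are predicates that
# every well-formed structure must satisfy.
CONSTRAINTS = [
    lambda s: s.get("has_verb", False),                          # a clause has a verb
    lambda s: s.get("subject_number") == s.get("verb_number"),   # agreement
]

def grammatical(structure):
    """A structure is in the language iff it satisfies every constraint."""
    return all(constraint(structure) for constraint in CONSTRAINTS)

good = {"has_verb": True, "subject_number": "sg", "verb_number": "sg"}
bad  = {"has_verb": True, "subject_number": "pl", "verb_number": "sg"}
print(grammatical(good), grammatical(bad))   # True False
```

Nothing here generates or enumerates the language; the grammar only characterizes membership, in the manner of the "satisfiable" sentences of a logic.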

Recently, linguists have also shown interest in a special variety of constraint grammar that is sometimes called "over-constrained." These grammars impose constraints that cannot all be met, and the interest then is in the linguistic structures that meet the constraints of the grammar to the greatest degree, the structures that are "most economical" or "most optimal" in some sense. Systems of this kind have been studied in a variety of contexts, but have recently become prominent in OPTIMALITY THEORY. Parts of the transformational grammar tradition called MINIMALISM have a fundamentally similar character, with constraints that can be violated when there is no better option.
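Schematically, an optimality-theoretic evaluation ranks the violable constraints and selects the candidate whose violations are least serious on the highest-ranked constraint where the candidates differ. Everything in the sketch below (the constraints, the candidate forms, and their notation) is an illustrative assumption:

```python
def evaluate(candidates, ranked_constraints):
    """Pick the candidate whose violation counts, ordered by constraint
    rank, are lexicographically smallest."""
    def profile(candidate):
        return tuple(c(candidate) for c in ranked_constraints)
    return min(candidates, key=profile)

# Toy constraints: NO_CODA penalizes a marked ending (written "C$"),
# FAITH penalizes inserted material (written "+").
NO_CODA = lambda form: form.count("C$")
FAITH   = lambda form: form.count("+")

candidates = ["batC$", "ba+tV"]                  # two schematic output forms
print(evaluate(candidates, [NO_CODA, FAITH]))    # violating FAITH is better here
print(evaluate(candidates, [FAITH, NO_CODA]))    # reranking reverses the outcome
```

Note that every candidate violates some constraint; the grammar demands not perfection but optimality, which is the sense in which such systems are over-constrained.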

The various different kinds of formal grammars have all been studied formally, particularly with regard to the complexities of sets they can define, and with regard to the succinctness of the definitions of those sets.

The study of various kinds of formal grammars has led linguists to consider whether we can determine, in advance of knowing in full detail what the grammars for human languages are like, whether human languages have one or another of the fundamental properties that are well understood in the context of artificial languages. Regardless of how a set of linguistic structures is defined, whether by a generative grammar, a constraint grammar, or an over-constrained grammar, we can consider questions like the following about the complexity of the defined sets of "grammatical" structures. (Of course, such questions apply only to formal grammars for which some significant criterion of "grammaticality" can be formulated, and whether such a criterion exists is itself a controversial empirical question.)

*Is the set of linguistic structures
finite?* Although there are obvious practical limitations on
the lengths of sentences that any human will ever pronounce, these
bounds do not seem to be linguistic in nature, but rather derive
from limitations in our life span, requirements for sleep, and so
on. As far as the grammars of natural languages go, there seems
to be no longest sentence, and consequently no maximally complex linguistic
structure; since a finite language would have to contain a longest sentence, we can conclude that human languages are infinite.
This assumption makes all of the following questions more difficult,
because it means that the languages that people speak contain structures
that no human will ever produce or understand. This basic point
is also one of the basic motivations for the competence/performance
distinction.
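The underlying point is that one recursive rule suffices for infinity: if any sentence can be extended, say by coordination, then for every sentence there is a longer one, so the set has no largest member. A sketch, with an invented base sentence:

```python
def sentence(depth):
    """A grammatical sentence with `depth` coordinations: the rule
    S -> S "and" S applied `depth` times to a base sentence."""
    s = "birds sing"
    for _ in range(depth):
        s += " and birds sing"
    return s

# Each depth yields a strictly longer sentence, so no longest one exists.
for d in range(3):
    print(sentence(d))
```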

*Is the set of linguistic structures "recursive"?* That
is, is there an algorithm that can effectively decide whether a given
structure is in the set or not? This question is more interesting
than it looks, because the mere fact that humans use languages does
not show that those languages are recursive (Matthews 1979). However, there seems
to be no good reason to assume that languages are not recursive.
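The distinction at issue can be sketched directly. A recursive set has a total decision procedure, one that halts with yes or no on every input; a merely generable (recursively enumerable) set may only have a procedure that lists the members, which never delivers a "no." Balanced-parenthesis strings stand in for grammatical structures here, an illustrative choice:

```python
from itertools import count, product

def decide_balanced(s):
    """Total decision procedure: halts with True or False on any string."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0

def enumerate_language():
    """Generator: eventually lists every member, but for a non-member
    it never announces anything; you simply wait forever."""
    for n in count(0):
        for chars in product("()", repeat=n):
            candidate = "".join(chars)
            if decide_balanced(candidate):
                yield candidate

gen = enumerate_language()
print([next(gen) for _ in range(4)])   # ['', '()', '(())', '()()']
```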

*Is the set of linguistic structures
one that is recognized by a finite computing device?* As Chomsky
(1956) pointed out, as far as the principles of language are concerned,
not only are there sentences that are too long for a human to pronounce,
but there are sentences that require more memory to recognize than
humans have. To recognize an arbitrary sentence of a human language
requires unbounded memory, just as computing arbitrary multiplication
problems does. (In a range of important cases, the complexity of
grammatical descriptions of formal languages corresponds to the
complexity of the machines needed to recognize those languages,
as we see here. The study of formal languages overlaps extensively
with the theory of AUTOMATA.)
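The classic illustration is the nested-dependency pattern a^n b^n, the shape of center-embedded clauses: recognizing it requires counting, and a machine with any fixed finite number of states loses count for large enough n. The counter in this sketch grows without bound, which is exactly what a finite device lacks:

```python
def is_anbn(string):
    """Decide { a^n b^n : n >= 0 } with an unbounded counter."""
    pending = 0
    seen_b = False
    for ch in string:
        if ch == "a":
            if seen_b:
                return False   # an 'a' after a 'b' breaks the pattern
            pending += 1
        elif ch == "b":
            seen_b = True
            pending -= 1
            if pending < 0:
                return False   # more b's than a's so far
        else:
            return False
    return pending == 0

print(is_anbn("aaabbb"), is_anbn("aabbb"))   # True False
```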

*Is the set of linguistic structures
generated by a "context-free grammar"?* Because context-free
grammars can define languages that cannot be accepted
by a finite machine, this more technical question has received a
lot of attention, particularly in studies of the syntax of human
languages. The received view is that the sentences of natural languages
are not generally definable by context-free grammars (Chomsky 1956; Savitch
et al. 1987). So human languages seem to be more complex than
context-free languages, but still recursive. There have been a number
of attempts to pin the matter down more precisely (Joshi 1985; Vijay-Shanker
and Weir 1994).
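A standard concrete example of a language beyond context-free power is a^n b^n c^n, whose three-way matched dependencies resemble the cross-serial constructions discussed in the literature cited above. Note that it is trivially decidable, so being non-context-free is compatible with being recursive:

```python
def is_anbncn(string):
    """Decide { a^n b^n c^n : n >= 0 }: no context-free grammar defines
    this set, yet membership is easy to decide."""
    if len(string) % 3 != 0:
        return False
    k = len(string) // 3
    return string == "a" * k + "b" * k + "c" * k

print(is_anbncn("aabbcc"), is_anbncn("aabbc"))   # True False
```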

*Is the set of linguistic structures "efficiently
parsable"?* That is, roughly, can the structures be
computed efficiently from their spoken or written forms? For human
languages, the answer appears to be negative, as argued for example
in Barton, Berwick, and Ristad (1987) and in Ristad (1993), but
the matter remains controversial. Questions like this one are at
the foundations of work in NATURAL LANGUAGE PROCESSING.
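For context-free languages, "efficiently parsable" has a precise benchmark: the CYK algorithm decides membership in time cubic in the sentence length, given a grammar in Chomsky normal form. The toy grammar and lexicon below are illustrative assumptions:

```python
from itertools import product

# A Chomsky-normal-form grammar: binary rules over categories, plus a
# lexicon mapping words to categories.
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICAL = {"the": "Det", "cat": "N", "dog": "N", "saw": "V"}

def cyk_recognize(words):
    """O(n^3) CYK recognition: table[i][j] holds every category that
    spans words[i..j] inclusive."""
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        if w in LEXICAL:
            table[i][i].add(LEXICAL[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                     # split point
                for left, right in product(table[i][k], table[k + 1][j]):
                    if (left, right) in BINARY:
                        table[i][j].add(BINARY[(left, right)])
    return "S" in table[0][n - 1]

print(cyk_recognize("the cat saw the dog".split()))   # True
print(cyk_recognize("cat the saw".split()))           # False
```

The arguments cited above concern whether anything comparably efficient is possible for the full complexity of human languages.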

*Is a set of formal languages "learnable"?* Results
in formal studies of learning often depend on the particular sets
that are learned (in some precise sense of "learned"),
and so formal grammars play a fundamental role in LEARNING SYSTEMS; COMPUTATIONAL LEARNING THEORY; SUPERVISED LEARNING; UNSUPERVISED LEARNING; MACHINE LEARNING; STATISTICAL LEARNING THEORY.
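A minimal formal-learning sketch, in Gold's "identification in the limit" paradigm: the learner sees an unbounded stream of grammatical examples and conjectures a language after each one. For the class of finite languages, the conservative learner below (guess exactly the strings seen so far) provably converges; the target language and its presentation are illustrative assumptions:

```python
def learner(examples_seen):
    """Conjecture: the language is exactly the set of strings seen so far."""
    return frozenset(examples_seen)

target = {"ab", "aabb", "b"}                      # a finite target language
presentation = ["ab", "b", "ab", "aabb", "b"]     # every member eventually appears

seen, guesses = [], []
for example in presentation:
    seen.append(example)
    guesses.append(learner(seen))

print(guesses[-1] == frozenset(target))   # True: the learner has converged
```

Once every member has appeared, further examples never change the conjecture, which is what convergence in the limit requires for this class; whether any comparably well-behaved learner exists for the class of human languages is exactly the formal question.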

Barton, E., R. C. Berwick, and E. S. Ristad (1987). Computational Complexity and Natural Language. Cambridge, MA: MIT Press.

Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory IT-2:113-124.

Chomsky, N. (1990). On formalization and formal linguistics. Natural Language and Linguistic Theory 8:143-147.

Harrison, M. A. (1978). Introduction to Formal Language Theory. Reading, MA: Addison-Wesley.

Joshi, A. (1985). How much context-sensitivity is necessary for characterizing structural descriptions? In D. Dowty, L. Karttunen, and A. Zwicky, Eds., Natural Language Processing: Theoretical, Computational and Psychological Perspectives. New York: Cambridge University Press.

Lambek, J. (1958). The mathematics of sentence structure. American Mathematical Monthly 65:154-170.

Matthews, R. (1979). Do the grammatical sentences of language form a recursive set? Synthese 40:209-224.

Moortgat, M. (1997). Categorial type logics. In J. van Benthem and A. ter Meulen, Eds., Handbook of Logic and Language. New York: Elsevier.

Pollard, C., and I. Sag. (1994). Head-driven Phrase Structure Grammar. Chicago: University of Chicago Press.

Pullum, G. K. (1989). Formal linguistics meets the boojum. Natural Language and Linguistic Theory 7:137-143.

Ristad, E. S. (1993). The Language Complexity Game. Cambridge, MA: MIT Press.

Rozenberg, G., and A. Salomaa, Eds. (1997). Handbook of Formal Languages. New York: Springer.

Savitch, W. J., E. Bach, W. Marsh, and G. Safran-Naveh, Eds. (1987). The Formal Complexity of Natural Language. Boston: Reidel.

Vijay-Shanker, K., and D. Weir (1994). The equivalence of four extensions of context-free grammar formalisms. Mathematical Systems Theory 27:511-545.

Hopcroft, J. E., and J. D. Ullman. (1979). Introduction to Automata Theory, Languages and Computation. Reading, MA: Addison-Wesley.

Moll, R. N., M. A. Arbib, and A. J. Kfoury (1988). An Introduction to Formal Language Theory. New York: Springer.

Perrault, R. C. (1983). On the mathematical properties of linguistic theories. Proceedings of the Association for Computational Linguistics 21:98-105.

Jampel, M., E. Freuder, and M. Maher, Eds. (1996). Over-Constrained Systems. New York: Springer.

Prince, A., and P. Smolensky. (1993). Optimality Theory: Constraint Interaction in Generative Grammar. Technical Report 2. Center for Cognitive Science, Rutgers University.