Formal Grammars

A grammar is a definition of a set of linguistic structures, where the linguistic structures could be sequences of words (sentences), or "sound-meaning" pairs (that is, pairs <s,m> where s is a representation of phonetic properties and m is a representation of semantic properties), or pairs <t,p> where t is a tree and p is a probability of occurrence in a discourse. A formal grammar, then, is a grammar that is completely clear and unambiguous. Obviously, this account of what qualifies as "formal" is neither formal nor rigorous, but in practice there is little dispute.

It might seem that formalization would always be desirable in linguistic theory, but there is little point in spelling out the details of informal hypotheses when their weaknesses can readily be ascertained and addressed without doing so. In fact, there is considerable variation in the degree to which empirical proposals about human grammars are formalized, and there are disputes in the literature about how much formalization is appropriate at this stage of linguistic theory (Pullum 1989; Chomsky 1990).

Given this controversy, and given the preliminary and changing nature of linguistic theories, formal studies of grammar have been most significant when they have focused not on the details of any particular grammar, but rather on the fundamental properties of various kinds of grammars. Taking this abstract, metagrammatical approach, formal studies have identified a number of basic properties of grammars that raise new questions about human languages.

One basic division among the various ways of defining sets of linguistic structures classifies them as generative or constraint-based. A GENERATIVE GRAMMAR defines a set of structures by providing some basic elements and applying rules to derive new elements. There are, in turn, two basic ways of doing this. The first, "top-down" approach, common in "formal language theory," involves beginning with a "category" like "sentence" and applying rules that define what parts the sentence has, what parts those parts have, and so on, until the sentence has been specified all the way down to the level of words. This style of language definition has proven very useful, and many fundamental results have been established (Harrison 1978; Rozenberg and Salomaa 1997).
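For concreteness, here is a minimal sketch of this top-down style of definition. The toy rewrite rules and vocabulary are invented for illustration and are not a proposal about any human language; the point is only that a "sentence" category is rewritten, step by step, down to words.

```python
import random

# A toy grammar in the "top-down" style: rules rewrite a category into a
# sequence of categories and/or words, starting from S ("sentence") and
# stopping when only words remain.  The rules are illustrative assumptions.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["V", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["linguist"], ["grammar"]],
    "V":  [["studies"], ["defines"]],
}

def rewrite(category):
    """Expand a category into a list of words by repeatedly applying rules."""
    if category not in RULES:          # a word: nothing left to rewrite
        return [category]
    expansion = random.choice(RULES[category])
    words = []
    for part in expansion:
        words.extend(rewrite(part))
    return words

if __name__ == "__main__":
    print(" ".join(rewrite("S")))      # e.g. "the linguist studies a grammar"
```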

A second "bottom-up" approach, more common in CATEGORIAL GRAMMAR and some related traditions, involves starting with some lexical items ("generators") and then applying rules to assemble them into more complex structures. This style of language definition comes from LOGIC and algebra, where certain sets are similarly defined by "closing" a set of basic elements with respect to some generating relations. In these formal grammars, the structures of the defined language are analogous to the theorems of formal logic in that they are derived from some specified basic elements by rigorously specified rules. A natural step from this idea is to treat a grammar explicitly as a logic (Lambek 1958; Moortgat 1997).

Unlike the generative methods, which define a language by applying rules to a set of initial elements of some kind, a constraint grammar specifies a set by saying what properties the elements of that set must have. In this sort of definition, the structures in the language are not like the (generated, enumerable) theorems of a logic, but more like the sentences that could possibly be true (the "satisfiable" sentences of a logic). This approach to grammar is particularly prominent in linguistic traditions like HEAD-DRIVEN PHRASE STRUCTURE GRAMMAR (Pollard and Sag 1994). However, most linguistic theories use both generative and constraint-based specifications of structure.
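A constraint-based definition can be sketched by listing the constraints directly and taking the language to be whatever satisfies them. The features and constraints below are invented for illustration; note that nothing is derived, candidates are simply checked for satisfaction.

```python
from itertools import product

# Instead of deriving structures by rule, we say what properties a well-formed
# structure must have and take the language to be everything that satisfies
# all the constraints.  Features and constraints are illustrative assumptions.

def subject_agrees(s):
    return s["subj_num"] == s["verb_num"]        # subject-verb agreement

def has_head(s):
    return s["head"] in ("V", "N")               # every structure needs a head

CONSTRAINTS = [subject_agrees, has_head]

# Enumerate a small candidate space and keep the satisfying structures.
CANDIDATES = [dict(zip(("head", "subj_num", "verb_num"), vals))
              for vals in product(("V", "N", "P"), ("sg", "pl"), ("sg", "pl"))]

well_formed = [c for c in CANDIDATES if all(con(c) for con in CONSTRAINTS)]
print(len(well_formed), "of", len(CANDIDATES), "candidates satisfy all constraints")
```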

Recently, linguists have also shown interest in a special variety of constraint grammar that is sometimes called "over-constrained." These grammars impose constraints that cannot all be met, and the interest then is in the linguistic structures that meet the constraints of the grammar to the greatest degree, the structures that are "most economical" or "optimal" in some sense. Systems of this kind have been studied in a variety of contexts, but have recently become prominent in OPTIMALITY THEORY. Parts of the transformational grammar tradition called MINIMALISM have a fundamentally similar character, with constraints that can be violated when there is no better option.
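In an over-constrained system the constraints are violable and ranked, and the grammar selects the candidates that fare best. The following Optimality-Theory-style sketch uses two invented constraints and toy candidate syllabifications; violation profiles are compared lexicographically, so satisfying a higher-ranked constraint matters more than any number of lower-ranked violations.

```python
# A toy over-constrained (Optimality-Theory-style) evaluation.  The constraints
# and candidates are assumptions for illustration, not a real phonological
# analysis.

VOWELS = set("aeiou")

def onset(cand):
    """ONSET: one violation per syllable that does not begin with a consonant."""
    return sum(1 for syll in cand if syll[0] in VOWELS)

def no_coda(cand):
    """NO-CODA: one violation per syllable that ends in a consonant."""
    return sum(1 for syll in cand if syll[-1] not in VOWELS)

RANKING = [onset, no_coda]   # higher-ranked constraints come first

def evaluate(candidates):
    """Return the candidate whose violation profile is best under the ranking."""
    return min(candidates, key=lambda c: tuple(con(c) for con in RANKING))

# Three toy syllabifications of the same input; none satisfies every constraint.
print(evaluate([["at", "lan", "ta"], ["a", "tlan", "ta"], ["at", "lant", "a"]]))
```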

The various kinds of formal grammars have all been studied formally, particularly with regard to the complexity of the sets they can define, and with regard to the succinctness of the definitions of those sets.

The study of various kinds of formal grammars has led linguists to ask whether, in advance of knowing in full detail what the grammars of human languages are like, we can determine whether human languages have one or another of the fundamental properties that are well understood in the context of artificial languages. Regardless of how a set of linguistic structures is defined, whether by a generative grammar, a constraint grammar, or an over-constrained grammar, we can consider questions like the following about the complexity of the defined sets of "grammatical" structures. (Of course, such questions apply only to formal grammars for which some significant criterion of "grammaticality" can be formulated, and whether such a criterion exists for human languages is itself a controversial empirical question.)

Is the set of linguistic structures finite? Although there are obvious practical limitations on the lengths of sentences that any human will ever pronounce, these bounds do not seem to be linguistic in nature, but rather derive from limitations in our life span, requirements for sleep, and so on. As far as the grammars of natural languages go, there seems to be no longest sentence, and consequently no maximally complex linguistic structure, and so we conclude that human languages are infinite. This conclusion makes all of the following questions more difficult, because it means that the languages that people speak contain structures that no human will ever produce or understand. This point is also one of the basic motivations for the competence/performance distinction.
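The reasoning behind the infinitude claim can be made concrete with a single recursive rule: if some rule (here, clause coordination) builds a longer sentence out of any sentence, then no sentence is longest and the defined set cannot be finite. The rule and the seed sentence below are illustrative assumptions.

```python
# One recursive rule is enough to show there is no longest sentence: from any
# sentence it yields a strictly longer one.  The rule and seed are toy examples.

def coordinate(sentence):
    """S -> S and S : build a longer sentence from any given sentence."""
    return sentence + " and " + sentence

s = "it rained"
for _ in range(3):
    print(s)
    s = coordinate(s)
```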

Is the set of linguistic structures "recursive"? That is, is there an algorithm that can effectively decide whether a given structure is in the set or not? This question is more interesting than it looks, because the mere fact that humans use languages does not show they are recursive (Matthews 1979). However, there seems to be no good reason to assume that languages are not recursive.

Is the set of linguistic structures one that can be recognized by a finite computing device? As Chomsky (1956) pointed out, as far as the principles of language are concerned, not only are there sentences that are too long for a human to pronounce, but there are sentences that require more memory to recognize than humans have. To recognize an arbitrary sentence of a human language requires unbounded memory, just as computing arbitrary multiplication problems does. (In a range of important cases, the complexity of grammatical descriptions of formal languages corresponds to the complexity of the machines needed to recognize those languages, as in this case. The study of formal languages thus overlaps extensively with the theory of AUTOMATA.)
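The memory point can be illustrated schematically. Recognizing nested dependencies of the form a^n b^n, which stand in here for constructions like center-embedding, requires keeping a count that grows with the input; a device with any fixed, finite memory will fail once the nesting is deep enough. The recognizer below simply makes the needed counter explicit.

```python
# Recognizing a^n b^n requires a counter (or stack) whose value is unbounded
# in the length of the input; no fixed finite-state device suffices.  The
# pattern is a schematic stand-in for nested constructions in natural language.

def recognize_nested(string):
    """Accept a^n b^n (n >= 0), using an unbounded counter."""
    count = 0
    seen_b = False
    for ch in string:
        if ch == "a":
            if seen_b:            # an "a" after a "b": wrong shape
                return False
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:         # more b's than a's so far
                return False
        else:
            return False
    return count == 0

print(recognize_nested("aaabbb"))   # True
print(recognize_nested("aabbb"))    # False
```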

Is the set of linguistic structures generated by a "context-free grammar"? Because context-free grammars can define languages that cannot be accepted by any finite machine, this more technical question has received a lot of attention, particularly in studies of the syntax of human languages. The received view is that the sentences of natural languages are not in general definable by context-free grammars (Chomsky 1956; Savitch et al. 1987). So human languages seem to be more complex than the context-free languages, but still recursive. There have been a number of attempts to pin the matter down more precisely (Joshi 1985; Vijay-Shanker and Weir 1994).
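A standard illustration of structure beyond context-free power is the "copy language" {ww}, which schematizes cross-serial dependencies. It is easy to decide by algorithm, so it is recursive, but it is provably not definable by any context-free grammar; mildly context-sensitive formalisms of the kind discussed by Joshi (1985) can generate it. The checker below is meant only to show that deciding membership is straightforward.

```python
# The copy language {ww : w any string} is decidable by a trivial algorithm,
# yet no context-free grammar defines it.  It schematizes cross-serial
# dependencies; the nested pattern (e.g. "abba") is rejected by contrast.

def is_copy(string):
    """Decide membership in {ww}: split in half and compare the halves."""
    half, rem = divmod(len(string), 2)
    return rem == 0 and string[:half] == string[half:]

print(is_copy("abab"))    # True:  "ab" + "ab" (cross-serial pattern)
print(is_copy("abba"))    # False: nested, not cross-serial
```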

Is the set of linguistic structures "efficiently parsable"? That is, roughly, can the structures be computed efficiently from their spoken or written forms? For human languages, the answer appears to be negative, as argued for example in Barton, Berwick, and Ristad (1987) and in Ristad (1993), but the matter remains controversial. Questions like this one are at the foundations of work in NATURAL LANGUAGE PROCESSING.

Is a set of formal languages "learnable"? Results in formal studies of learning often depend on the particular sets that are learned (in some precise sense of "learned"), and so formal grammars play a fundamental role in LEARNING SYSTEMS; COMPUTATIONAL LEARNING THEORY; SUPERVISED LEARNING; UNSUPERVISED LEARNING; MACHINE LEARNING; STATISTICAL LEARNING THEORY.
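One precise sense of "learned" is Gold's identification in the limit, and the dependence on the particular class of languages is easy to see in a small sketch. The learner below, which simply conjectures the set of strings observed so far, identifies every finite language in the limit; it is a classical result that no learner succeeds in this sense, from positive data alone, on a class containing all the finite languages together with even one infinite language. The target language and text here are invented for the example.

```python
# A sketch of identification in the limit for the class of finite languages:
# the learner conjectures exactly the strings seen so far, and once every
# member of the (finite) target has appeared, its conjecture is correct and
# never changes.  Target and text are illustrative assumptions.

def learner(examples_so_far):
    """Conjecture: the language is exactly the set of strings observed."""
    return set(examples_so_far)

target = {"aa", "ab", "ba"}                  # a finite target language
text = ["aa", "ab", "aa", "ba", "ab", "ba"]  # an enumeration of the target

seen = []
for example in text:
    seen.append(example)
    guess = learner(seen)
    print(guess == target, sorted(guess))
```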


-- Edward Stabler

References

Barton, E., R. C. Berwick, and E. S. Ristad (1987). Computational Complexity and Natural Language. Cambridge, MA: MIT Press.

Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory IT-2:113-124.

Chomsky, N. (1990). On formalization and formal linguistics. Natural Language and Linguistic Theory 8:143-147.

Harrison, M. A. (1978). Introduction to Formal Language Theory. Reading, MA: Addison-Wesley.

Joshi, A. (1985). How much context-sensitivity is necessary for characterizing structural descriptions? In D. Dowty, L. Karttunen, and A. Zwicky, Eds., Natural Language Processing: Theoretical, Computational and Psychological Perspectives. New York: Cambridge University Press.

Lambek, J. (1958). The mathematics of sentence structure. American Mathematical Monthly 65:154-170.

Matthews, R. (1979). Do the grammatical sentences of language form a recursive set? Synthese 40:209-224.

Moortgat, M. (1997). Categorial type logics. In J. van Benthem and A. ter Meulen, Eds., Handbook of Logic and Language. New York: Elsevier.

Pollard, C., and I. Sag. (1994). Head-driven Phrase Structure Grammar. Chicago: University of Chicago Press.

Pullum, G. K. (1989). Formal linguistics meets the boojum. Natural Language and Linguistic Theory 7:137-143.

Ristad, E. S. (1993). The Language Complexity Game. Cambridge, MA: MIT Press.

Rozenberg, G., and A. Salomaa, Eds. (1997). Handbook of Formal Languages. New York: Springer.

Savitch, W. J., E. Bach, W. Marsh, and G. Safran-Naveh, Eds. (1987). The Formal Complexity of Natural Language. Boston: Reidel.

Vijay-Shanker, K., and D. Weir (1994). The equivalence of four extensions of context free grammar formalisms. Mathematical Systems Theory 27:511-545.

Further Readings

Harrison, M. A. (1978). Introduction to Formal Language Theory. Reading, MA: Addison-Wesley.

Hopcroft, J. E., and J. D. Ullman. (1979). Introduction to Automata Theory, Languages and Computation. Reading, MA: Addison-Wesley.

Moll, R. N., M. A. Arbib, and A. J. Kfoury (1988). An Introduction to Formal Language Theory. New York: Springer.

On Formal Properties of Human Languages

Barton, E., R. C. Berwick, and E. S. Ristad (1987). Computational Complexity and Natural Language. Cambridge, MA: MIT Press.

Savitch, W. J., E. Bach, W. Marsh, and G. Safran-Naveh, Eds. (1987). The Formal Complexity of Natural Language. Boston: Reidel.

Basic Mathematical Properties of Some Linguistic Theories

Perrault, R. C. (1983). On the mathematical properties of linguistic theories. Proceedings of the Association for Computational Linguistics 21:98-105.

Recent Work on Grammars-as-Logic

Moortgat, M. (1997). Categorial type logics. In J. van Benthem and A. ter Meulen, Eds., Handbook of Logic and Language. New York: Elsevier.

Recent Work on Over-Constrained Systems

Jampel, M., E. Freuder, and M. Maher, Eds. (1996). Over-Constrained Systems. New York: Springer.

Prince, A., and P. Smolensky. (1993). Optimality Theory: Constraint Interaction in Generative Grammar. Technical Report 2. Center for Cognitive Science, Rutgers University.