Computational linguistics (CL; also called natural language processing, or NLP) is concerned with (1) the study of computational models of the structure and function of language, its use, and its acquisition; and (2) the design, development, and implementation of a wide range of systems such as SPEECH RECOGNITION, language understanding, and NATURAL LANGUAGE GENERATION. CL applications include interfaces to databases, text processing and message understanding, multilingual interfaces as aids for foreign-language correspondence and web pages, and speech-to-speech translation in limited domains. On the theoretical side, CL uses computational modeling to investigate syntax, semantics, pragmatics (that is, certain aspects of the relationship of the speaker and the hearer, or of the user and the system in the case of a CL system), and discourse aspects of language. These investigations are interdisciplinary and involve concepts from artificial intelligence (AI), linguistics, logic, and psychology. By connecting the closely interrelated fields of computer science and linguistics, CL plays a key role in cognitive science.
Because it is impossible to cover the whole range of theoretical and practical issues in CL in the limited space available, only a few related topics are discussed here in some detail, to give the reader a sense of the important issues. Fortunately, there exists a comprehensive source documenting all the major aspects of CL (Survey of the State of the Art in Human Language Technology forthcoming).
Almost every NLP system has a grammar and an associated parser. A grammar is a finite specification of a potentially infinite number of sentences, and a parser for the grammar is an ALGORITHM that analyzes a sentence and, if possible, assigns one or more grammar-based structural descriptions to the sentence. The structural descriptions are necessary for further processing -- for example, for semantic interpretation.
Many CL systems are based on context-free grammars (CFGs). One such grammar, G, consists of a finite set of nonterminals (for example, S: sentence; NP: noun phrase; VP: verb phrase; V: verb; ADV: adverb), a finite set of terminals, or lexical items, and a finite set of rewrite rules of the form A → W, where A is a nonterminal and W is a string of nonterminals and terminals. S is a special nonterminal called the "start symbol."
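To make the definition concrete, here is a minimal sketch in Python of one way such a grammar might be encoded; the particular rules and lexical items are illustrative assumptions, not drawn from any actual system.

```python
import random

# A toy context-free grammar, encoded as a dictionary mapping each
# nonterminal to the list of its alternative right-hand sides. The
# nonterminals (S, NP, VP, V, ADV) follow the text; the lexical items
# are illustrative assumptions.
TOY_CFG = {
    "S":   [["NP", "VP"]],
    "NP":  [["Harry"], ["peanuts"]],
    "VP":  [["V"], ["V", "NP"], ["VP", "ADV"]],
    "V":   [["sleeps"], ["likes"]],
    "ADV": [["soundly"]],
}

def derive(symbol="S"):
    """Expand a symbol by randomly chosen rules until only terminals
    (lexical items) remain. The recursive rule VP -> VP ADV is what lets
    this finite rule set specify a potentially infinite set of sentences."""
    if symbol not in TOY_CFG:             # terminal: a lexical item
        return [symbol]
    rhs = random.choice(TOY_CFG[symbol])  # pick one rewrite rule for symbol
    return [word for part in rhs for word in derive(part)]

print(" ".join(derive()))  # e.g. "Harry likes peanuts" or "Harry sleeps soundly"
```

Note that this toy grammar happily derives strings such as peanuts likes Harry; enforcing agreement and subcategorization is precisely what the augmentations discussed next are for.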
CFGs are inadequate for natural language and need to be augmented, for a variety of reasons. For one, the information associated with a phrase (a string of terminals) is not just an atomic nonterminal symbol but a complex bundle of information, a set of attribute-value pairs called a "feature structure," that needs to be associated with strings; the categories of the grammar are therefore augmented with feature structures, which are combined by an operation called "unification," on top of the CFG skeleton. For another, the string-combining operation in a CFG is concatenation (for example, if u and v are strings, u concatenated with v gives the string w = uv, that is, u followed by v), and more complex string-combining as well as tree-combining operations are needed to describe various linguistic phenomena. A variety of grammars such as generalized phrase structure grammar (GPSG), HEAD-DRIVEN PHRASE STRUCTURE GRAMMAR (HPSG), and LEXICAL FUNCTIONAL GRAMMAR (LFG) are essentially CFG-based unification grammars (Bresnan and Kaplan 1983; Gazdar et al. 1985; Pollard and Sag 1994).
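As a rough illustration of unification, the following sketch unifies feature structures represented as nested Python dictionaries. It is a simplification: full unification formalisms also require reentrancy (shared substructures), which plain dictionaries do not capture, and the particular features shown (person, number, tense) are merely the agreement examples from the text.

```python
def unify(fs1, fs2):
    """Unify two feature structures, represented as nested dicts of
    attribute-value pairs. Returns the merged structure, or None if
    the structures carry conflicting values (unification failure)."""
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        return fs1 if fs1 == fs2 else None       # atomic values must match
    result = dict(fs1)
    for attr, val in fs2.items():
        if attr in result:
            merged = unify(result[attr], val)
            if merged is None:
                return None                       # conflict, e.g. sg vs. pl
            result[attr] = merged
        else:
            result[attr] = val
    return result

# Subject-verb agreement as unification: compatible structures merge...
subj = {"agr": {"person": 3, "number": "sg"}}
verb = {"agr": {"number": "sg"}, "tense": "pres"}
print(unify(subj, verb))   # {'agr': {'person': 3, 'number': 'sg'}, 'tense': 'pres'}

# ...while conflicting ones fail.
plural_verb = {"agr": {"number": "pl"}}
print(unify(subj, plural_verb))  # None
```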
Computational grammars need to describe a wide range of dependencies among the different elements in the grammar. Some of these dependencies involve agreement features, such as person, number, and gender. For example, in English, the verb agrees with the subject in person and number. Others involve verb subcategorization, in which each verb specifies one (or more) subcategorization frames for its complements. For instance, sleep (as in Harry sleeps) does not require any complement, while like (as in Harry likes peanuts) requires one complement; give (as in Harry gives Susan a flower) requires two complements; and so forth (a toy encoding of such frames is sketched after this paragraph). Sometimes the dependent elements do not appear in their normal positions. In Who_i did John invite e_i, where e_i is a stand-in for who_i, who_i is the filler for the gap e_i (see WH-MOVEMENT). The filler and the gap need not be at a fixed distance. Thus in Who_i did Bill ask John to invite e_i, the filler and the gap are more distant than in the previous sentence. Sometimes the dependencies are nested. In German, for example, one could have Hans_i Peter_j Marie_k schwimmen_k lassen_j sah_i (Hans saw Peter make Marie swim), where the nouns (arguments) and verbs are in nested order, as the subscripts indicate. And sometimes they are crossed. In Dutch, for example, one could have Jan_i Piet_j Marie_k zag_i laten_j zwemmen_k (Jan saw Piet make Marie swim). There are, of course, situations where the dependencies have more complex patterns. Precise statements of such dependencies and the domains over which they operate constitute the major activity in the specification of a grammar. Computational modeling of these dependencies is one of the key areas in CL. Many such dependencies (for example, the crossed dependencies discussed above) cannot be described by context-free grammars and require grammars with larger domains of locality. One such class is constituted by the so-called mildly context-sensitive grammars (Joshi and Schabes 1997). Two grammar formalisms in this class are tree-adjoining grammars (TAGs) and combinatory categorial grammars (CCGs; Joshi and Schabes 1997; Steedman 1997; see also CATEGORIAL GRAMMAR). TAGs are also unification-based grammars.
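Returning to subcategorization: such frames can simply be listed in the lexicon and checked against a verb's observed complements. The following toy encoding (with the hypothetical helper frames_satisfied) is a sketch of the idea, not the mechanism of any particular formalism.

```python
# A toy lexicon pairing each verb with its subcategorization frame(s):
# each frame lists the complement categories the verb requires.
# The entries mirror the examples in the text; the encoding is hypothetical.
SUBCAT = {
    "sleeps": [[]],                  # no complement: "Harry sleeps"
    "likes":  [["NP"]],              # one complement: "Harry likes peanuts"
    "gives":  [["NP", "NP"],         # two complements: "Harry gives Susan a flower"
               ["NP", "PP"]],        # or NP + PP: "gives a flower to Susan"
}

def frames_satisfied(verb, complements):
    """Check whether the observed complement categories match one of
    the verb's subcategorization frames."""
    return any(frame == complements for frame in SUBCAT.get(verb, []))

print(frames_satisfied("sleeps", []))      # True
print(frames_satisfied("likes", ["NP"]))   # True
print(frames_satisfied("likes", []))       # False: missing complement
```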
Parsing sentences according to different grammars is another important research area in CL. Efficient parsing algorithms are known for almost all the context-free grammars used in CL. The time required to parse a sentence of length n is at most Kn^3, where K depends on the size of the grammar and can become very large in the worst case. Most parsers perform much better than the worst case on typical sentences, however, even though there are, as yet, no computational results characterizing their behavior on typical sentences. Grammars that are more powerful than CFGs are, of course, harder to parse, as far as the worst case is concerned. Like CFGs, mildly context-sensitive grammars can all be parsed in polynomial time, although the exponent for n is 6 instead of 3. A crucial problem in parsing is not just to get all possible parses for a sentence but to rank the parses according to some criteria. A grammar can be combined with statistical information to provide this ranking.
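The classic example of such an algorithm is the CKY recognizer for grammars in Chomsky normal form. The sketch below shows where the n^3 bound comes from: three nested loops over span length, start position, and split point, with the constant K reflecting the grammar-rule lookups in the innermost loop. The toy grammar is an assumption for illustration.

```python
from itertools import product

# CKY recognition for a toy grammar in Chomsky normal form
# (rules of the form A -> B C or A -> terminal).
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
LEXICAL = {"Harry": {"NP"}, "peanuts": {"NP"}, "likes": {"V"}}

def cky_recognize(words, start="S"):
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICAL.get(w, ()))
    for width in range(2, n + 1):              # span length
        for i in range(n - width + 1):         # span start
            j = i + width
            for k in range(i + 1, j):          # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((b, c), set())
    return start in chart[0][n]

print(cky_recognize("Harry likes peanuts".split()))  # True
print(cky_recognize("likes Harry".split()))          # False
```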
Thus far, we have assumed that the parser only handles complete sentences and either succeeds or fails to find the parses. In practice, we want the parser to be flexible. It should be able to handle fragments of sentences, or if it fails, it should fail gracefully, providing as much analysis as possible for as many fragments of the sentence as possible, even if it cannot glue all the pieces together.
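One simple way to realize such graceful failure, sketched below under the assumption that a chart parser has already recorded the constituents it could recognize as (start, end, label) spans, is to stitch together the non-overlapping fragments that cover as many words as possible.

```python
def best_fragment_cover(spans, n):
    """Given recognized constituent spans over an n-word sentence that has
    no complete parse, pick non-overlapping fragments covering the most
    words, via dynamic programming over prefix lengths."""
    # best[j] = (words covered, chosen spans) for the prefix words[0:j]
    best = [(0, [])] * (n + 1)
    for j in range(1, n + 1):
        best[j] = best[j - 1]                  # option: leave word j-1 uncovered
        for (i, k, label) in spans:
            if k == j:                         # option: end a fragment at j
                covered, chosen = best[i]
                cand = (covered + (k - i), chosen + [(i, k, label)])
                if cand[0] > best[j][0]:
                    best[j] = cand
    return best[n]

# Hypothetical fragments recovered from an unparsable six-word input:
spans = [(0, 2, "NP"), (2, 5, "VP"), (4, 6, "NP")]
print(best_fragment_cover(spans, 6))  # (5, [(0, 2, 'NP'), (2, 5, 'VP')])
```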
Finally, even though the actual grammars in major CL systems are large, their coverage is not adequate. Building a grammar by hand soon reaches its limit in coping with free text (say, text from a newspaper). Increasing attention is therefore being paid both to automatically acquiring grammars from large corpora and to statistical parsers that produce parses directly, trained on parsed (annotated) corpora.
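The simplest technique of this kind is the relative-frequency estimate used in probabilistic CFGs: count how often each rule occurs in the parsed corpus and normalize by the count of its left-hand side. A minimal sketch, assuming trees encoded as nested tuples:

```python
from collections import Counter

def count_rules(tree, counts):
    """Count each rule occurrence A -> W in a tree encoded as nested
    tuples, e.g. ("S", ("NP", "Harry"), ("VP", ("V", "sleeps")))."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            count_rules(c, counts)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(A -> W) = count(A -> W) / count(A)."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    totals = Counter()
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

# Two hand-parsed toy sentences standing in for an annotated corpus.
treebank = [("S", ("NP", "Harry"), ("VP", ("V", "sleeps"))),
            ("S", ("NP", "Harry"), ("VP", ("V", "likes"), ("NP", "peanuts")))]
for rule, p in estimate_pcfg(treebank).items():
    print(rule, round(p, 2))
```

The resulting rule probabilities supply exactly the kind of ranking criterion for competing parses mentioned earlier.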
Although the idea of somehow combining structural and statistical information was suggested as early as the late 1950s, it is only now, when formal frameworks exist for combining the two in a principled manner and very large corpora are available for reliably estimating the statistics needed to deduce linguistic structure, that this idea has blossomed in CL.
STATISTICAL TECHNIQUES IN NATURAL LANGUAGE PROCESSING, in conjunction with large corpora (raw texts or texts annotated in various ways), have also been used to automatically acquire linguistic information about MORPHOLOGY (prefixes, suffixes, inflected forms), subcategorization, semantic classes (for example, classifying nouns by the predicates they occur with; interpreting compound nouns such as jet engines or stock market prices; classifying verbs as denoting states, for example, to know, or events, for example, to look; and so on), and, of course, grammatical structure itself. Such results have opened up a new direction of research in CL, often described as "corpus-based CL."
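As an illustration of how such semantic classes might be induced, the following sketch groups nouns by the predicates they occur with, assuming (verb, object) pairs have already been extracted from a parsed corpus; the data and the Jaccard similarity measure are illustrative assumptions, not a specific published method.

```python
from collections import defaultdict

# Hypothetical (verb, object-noun) pairs extracted from a parsed corpus.
pairs = [("drink", "coffee"), ("drink", "tea"), ("drink", "water"),
         ("fly", "plane"), ("fly", "kite"),
         ("brew", "coffee"), ("brew", "tea")]

# For each noun, collect the set of predicates it occurs with.
noun_contexts = defaultdict(set)
for verb, noun in pairs:
    noun_contexts[noun].add(verb)

def overlap(n1, n2):
    """Jaccard overlap of predicate contexts: a crude similarity measure
    under which nouns with shared predicates cluster into classes."""
    a, b = noun_contexts[n1], noun_contexts[n2]
    return len(a & b) / len(a | b)

print(overlap("coffee", "tea"))   # 1.0 -> same predicate class
print(overlap("coffee", "kite"))  # 0.0 -> unrelated
```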
It should be clear from the previous discussion that the development of corpus-based CL requires very large quantities of data (the Brown corpus from the 1960s is about 1 million words); researchers estimate that about 100 million words will be required for some tasks. The technologies that will benefit from corpus-based NLP include speech recognition, SPEECH SYNTHESIS, MACHINE TRANSLATION, full-text information retrieval, and message understanding, among others. The need for very large text and speech databases, annotated in various ways, is now well understood. One such resource is the Linguistic Data Consortium (LDC), a national and international resource supported by federal and industrial grants, which is useful also for psycholinguistic research (especially in experimental design for controlling lexical and grammatical frequency biases; see http://www.ldc.upenn.edu).
Bresnan, J., and R. M. Kaplan. (1983). The Mental Representation of Grammatical Relations. Cambridge, MA: MIT Press.
Gazdar, G., E. Klein, G. K. Pullum, and I. A. Sag. (1985). Generalized Phrase Structure Grammar. Cambridge, MA: Harvard University Press.
Joshi, A. K., and Y. Schabes. (1997). Tree-adjoining grammars. In G. Rozenberg and A. Salomaa, Eds., Handbook of Formal Languages. Berlin: Springer.
Pollard, C., and I. A. Sag. (1994). Head-Driven Phrase Structure Grammar. Chicago: University of Chicago Press.
Steedman, M. (1997). Surface Structure and Interpretation. Cambridge, MA: MIT Press.
Survey of the State of the Art in Human Language Technology. (Forthcoming). Cambridge: Cambridge University Press. A survey sponsored by the National Science Foundation, the Directorate XIII-E of the Commission of the European Communities, and the Center for Spoken Language Understanding, Oregon Graduate Institute, Corvallis. Currently available at http://www.cse.ogi.edu/CSLU/HLTsurvey.