Natural Language Processing

Natural language processing is a subfield of artificial intelligence concerned with the development and use of computational models for processing language. Within it, there are two general areas of research: comprehension, which deals with processes that extract information from language (e.g., natural language understanding, information retrieval), and generation, which deals with processes that convey information using language. Traditionally, work dealing with speech has been treated as belonging to the separate fields of SPEECH RECOGNITION and SPEECH SYNTHESIS. We will continue with this separation here, and the issues of mapping sound to words and words to sound will not be considered further.

There are two main motivations underlying work in this area. The first is the technological goal of producing automated systems that perform various language-related tasks, such as building automated interactive systems (e.g., automated telephone-operator services) or systems that scan databases of documents to find articles on a certain topic (e.g., finding relevant pages on the world wide web). The second, the one most relevant to cognitive science, seeks to better understand how language comprehension and generation occur in humans. Rather than performing experiments on humans as done in psycholinguistics, or developing theories that account for the data with a focus on handling possible counterexamples as in linguistics and philosophy, researchers in natural language processing test theories by building explicit computational models to see how well they behave. Most research in the field is still at the exploratory stage of this endeavor, trying to construct "existence proofs" (i.e., find any mechanism that can understand language within limited scenarios), rather than building computational models and comparing them to human performance. But once such existence-proof systems are completed, the stage will be set for more detailed comparative study between human and computational models. Whatever the motivation behind the work in this area, however, computational models have provided the inspiration and starting point for much work in psycholinguistics and linguistics in the last twenty years.

Although there is a diverse set of methods used in natural language processing, the techniques can be broadly classified into three general approaches: statistical methods, structural/pattern-based methods, and reasoning-based methods. It is important to note that these approaches are not mutuallyexclusive. In fact, the most comprehensive models combine all three techniques. The approaches differ in the kind of processing tasks they can perform and in the degree to which systems require handcrafted rules as opposed to automatic training/learning from language data. A good source that gives an overview of the field involving all three approaches is Allen 1995.

Statistical methods involve using large corpora of language data to compute statistical properties such as word co-occurrence and sequence information (see also STATISTICAL TECHNIQUES IN NATURAL LANGUAGE PROCESSING). For instance, a bigram statistic captures the probability of a word with certain properties following a word with other properties. This information can be estimated from a corpus that is labeled with the properties needed, and used to predict what properties a word might have based on its preceding context. Although limited, bigram models can be surprisingly effective in many tasks. For instance, bigram models involving part-of-speech labels (e.g., noun, verb) typically predict the correct part of speech for over 95 percent of words in general text. Statistical models are not restricted to part-of-speech tagging, however, and they have been used for semantic disambiguation, structural disambiguation (e.g., prepositional phrase attachment), and many other properties. Much of the initial work in statistical language modeling was performed for automatic speech-recognition systems, where good word prediction can double the word-recognition accuracy rate. The techniques have also proved effective in tasks such as information retrieval and producing rough "first-cut" drafts in machine translation. A big advantage of statistical techniques is that they can be automatically trained from language corpora. The challenge for statistical models concerns how to capture higher-level structure, such as semantic information, and structural properties, such as sentence structure. In general, the most successful approaches to these problems involve combining statistical approaches with other approaches. A good introduction to statistical approaches is Charniak 1993.
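The bigram idea can be sketched in a few lines of Python. The tiny tagged corpus and tag names below are invented for illustration; a real model would be estimated from a large labeled corpus.

```python
from collections import Counter, defaultdict

# Toy corpus of (word, part-of-speech) pairs -- purely illustrative.
corpus = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
    ("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB"),
    ("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("runs", "VERB"),
]

# Count how often each tag follows each preceding tag.
tag_bigrams = defaultdict(Counter)
tags = [t for _, t in corpus]
for prev, nxt in zip(tags, tags[1:]):
    tag_bigrams[prev][nxt] += 1

def p_next_tag(prev, nxt):
    """Estimate P(next tag | previous tag) by relative frequency."""
    total = sum(tag_bigrams[prev].values())
    return tag_bigrams[prev][nxt] / total if total else 0.0

print(p_next_tag("DET", "NOUN"))  # 0.75 -- determiners usually precede nouns here
print(p_next_tag("NOUN", "VERB"))  # 1.0
```

A tagger would combine such tag-transition probabilities with word-tag probabilities to choose the most likely tag sequence for a sentence.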

Structural and pattern-based approaches have the closest connection to traditional linguistic models. These approaches involve defining structural properties of language, such as defining FORMAL GRAMMARS for natural languages. Active research issues include the design of grammatical formalisms that capture natural language structure yet retain good computational properties, and the design of efficient parsing algorithms to interpret sentences with respect to a grammar. Structural approaches are not limited solely to syntax, however. Many more practical systems use semantically based grammars, where the primitive units in the grammar are semantic classes rather than syntactic ones. Other approaches dispense with fully analyzing sentence structure altogether, using simpler patterns of lexical, syntactic, and semantic information that match sentence fragments. Such techniques are especially useful in limited-domain speech-driven applications where errors in the input can be expected. Because the domain is limited, certain phrases (e.g., a prepositional phrase) may have only one possible interpretation in the application. Structural models also appear at the DISCOURSE level, where models are developed that capture the interrelationships between sentences and build models of topic flow. Structural models provide a capability for detailed analysis of linguistic phenomena, but the more detailed the analysis, the more one must rely on hand-constructed rules rather than automatic training from data. An excellent collection of papers on structural approaches, though missing recent work, is Grosz, Sparck Jones, and Webber 1986.
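To make the structural approach concrete, here is a minimal sketch of parsing with a formal grammar: a naive recursive-descent recognizer checks whether a sentence can be derived from a small context-free grammar. The grammar, lexicon, and category names are invented for illustration; practical parsers use far more efficient algorithms (e.g., chart parsing) and much richer formalisms.

```python
# Toy context-free grammar: each nonterminal maps to its possible
# right-hand sides. All rules and words here are illustrative only.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "N", "cat": "N",
    "chased": "V", "sleeps": "V",
}

def parse(symbol, words):
    """Yield the number of words consumed by each derivation of
    `symbol` from a prefix of `words` (naive recursive descent)."""
    if symbol in LEXICON.values():          # preterminal: match one word
        if words and LEXICON.get(words[0]) == symbol:
            yield 1
        return
    for rhs in GRAMMAR.get(symbol, []):
        def expand(syms, consumed):
            if not syms:                    # whole right-hand side matched
                yield consumed
                return
            for n in parse(syms[0], words[consumed:]):
                yield from expand(syms[1:], consumed + n)
        yield from expand(rhs, 0)

def accepts(sentence):
    """Is the sentence derivable from S, using all of its words?"""
    words = sentence.split()
    return any(n == len(words) for n in parse("S", words))

print(accepts("the dog chased a cat"))  # True
print(accepts("dog the chased"))        # False
```

A semantically based grammar would look the same structurally, but with categories like FLIGHT or CITY in place of NP and N.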

Reasoning-based approaches involve encoding knowledge and reasoning processes and using them to interpret language. This work has much in common with work in KNOWLEDGE REPRESENTATION as well as work in the philosophy of language. The idea here is that the interpretation of language is highly dependent on the context in which the language appears. By trying to capture the knowledge a human may have in a situation, and to model common-sense reasoning, problems such as word-sense and sentence-structure disambiguation, analysis of referring expressions, and the recognition of the intentions behind language can be addressed. These techniques become crucial in discourse, whether it be extended text that needs to be understood or a dialogue that needs to be engaged in. Most dialogue-based systems use a speech-act-based approach to language and computational models of PLANNING and plan recognition to define a conversational agent. Specifically, such systems first attempt to recognize the intentions underlying the utterances they hear, and then plan their own utterances based on their goals and knowledge (including what was just recognized about the other agent). The advantage of this approach is that it provides a mechanism for contextual interpretation of language. The disadvantage is the complexity of the models required to define the conversational agent. Two good sources for work in this area are Cohen, Morgan, and Pollack 1990 and Carberry 1991.
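The recognize-then-respond cycle of such a conversational agent can be caricatured in a few lines. The surface patterns, speech-act labels, and response rules below are invented placeholders; real plan-based systems infer intentions from context and the agent's model of the speaker's goals, not from keyword rules.

```python
# Toy speech-act recognition and response planning -- illustrative only.
def recognize_act(utterance):
    """Guess the speech act behind an utterance from surface cues."""
    u = utterance.lower().rstrip("?!. ")
    if u.startswith(("can you", "could you", "please")):
        return "REQUEST"
    if utterance.strip().endswith("?"):
        return "QUESTION"
    return "INFORM"

def plan_response(act):
    """Choose a response goal given the recognized intention."""
    return {
        "REQUEST":  "attempt the requested action and report the result",
        "QUESTION": "look up an answer and assert it",
        "INFORM":   "update beliefs and acknowledge",
    }[act]

print(recognize_act("Can you book me a flight?"))   # REQUEST
print(plan_response("QUESTION"))
```

The point of the sketch is the control structure, recognize the intention first, then plan a response from the agent's own goals, rather than the trivial rules themselves.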

There are many applications for natural language processing research, which can be roughly categorized into three main areas:

Information Extraction and Retrieval Given that much of human knowledge is encoded in textual form, work in this area attempts to analyze such information automatically and develop methods for retrieving information as needed. The most obvious application area today is in developing Web search tools, where one wants to find web pages that contain specific information. While most web-based techniques today involve little more than sophisticated keyword matching, there is considerable research in using more sophisticated techniques, such as classifying the information in documents based on their statistical properties (e.g., how often certain word patterns appear) and using robust parsing to extract information. A good survey of applications for information retrieval can be found in Lewis and Sparck Jones (1996). Many of the researchers in this area have participated in annual evaluations and present their work at the MUC conferences (Chinchor, Hirschman, and Lewis 1993).
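The idea of ranking documents by their statistical properties can be sketched with a simple tf-idf weighting. The three-document collection below is invented for illustration; real retrieval systems index millions of documents and use more refined weighting and normalization.

```python
import math
from collections import Counter

# Tiny illustrative document collection.
docs = {
    "d1": "machine translation of technical manuals",
    "d2": "statistical methods for speech recognition",
    "d3": "parsing and machine learning for dialogue",
}

def tf_idf_score(query, doc_id):
    """Score a document against a query: term frequency weighted by
    inverse document frequency, so rare terms count for more."""
    tf = Counter(docs[doc_id].split())
    score = 0.0
    for term in query.split():
        df = sum(1 for text in docs.values() if term in text.split())
        if df:
            score += tf[term] * math.log(len(docs) / df)
    return score

ranked = sorted(docs, key=lambda d: tf_idf_score("machine translation", d),
                reverse=True)
print(ranked[0])  # d1 -- the only document containing both query terms
```

Here "translation" occurs in only one document, so it receives a higher weight than the more common "machine", and the document containing both terms ranks first.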

Machine Translation Given the great demand for translation services, automatic translation of text and speech (in simultaneous translation) is a critical application area. This is one area where there is an active market for commercial products, although the most useful products to date have aimed to enhance human translators rather than to replace them. They provide automated dictionary/translation aids and provide rough initial translations that can be post-edited. In applications where the content is stylized, such as technical and user manuals for products, it is becoming feasible to produce reasonable-quality translations automatically. A good reference for the machine-translation area is Hutchins and Somers 1992.

Human-Machine Interfaces Given the increased availability of computers in all aspects of everyday life, there are immense opportunities for defining language-based interfaces. A prime area for commercial application is in telephone applications for customer service, replacing the touch-tone menu-driven interfaces with speech-driven language-based interfaces. Even the simplest applications, such as a ten-word automated operator service for long-distance calls, can save companies millions of dollars a year. Another important but longer-term application concerns the computer interface itself, replacing current interfaces with multimedia language-based interfaces that enhance the usability and accessibility of personal computers for the general public. Although general systems are a long way off, it will soon be feasible to define such interfaces for limited applications.

Although natural language processing is an area of great practical importance and commercial application, it is important to remember that its main contribution to cognitive science will remain the powerful metaphor that the computer provides for understanding human language processing. It allows us to specify models at a level of detail that would otherwise be unimaginable. We are now at the stage where end-to-end models of conversational agents can be constructed in simple domains. Work in this area will continue to further our knowledge of language processing and suggest novel ideas for experimentation.


-- James Allen


Allen, J. F. (1995). Natural Language Understanding. 2nd ed. Menlo Park, CA: Benjamin-Cummings.

Carberry, S. (1991). Plan Recognition in Natural Language Dialogue. Cambridge, MA: MIT Press.

Charniak, E. (1993). Statistical Language Learning. Cambridge, MA: MIT Press.

Cohen, P., J. Morgan, and M. Pollack. (1990). Intentions in Communication. Cambridge, MA: MIT Press.

Chinchor, N., L. Hirschman, and D. Lewis. (1993). Evaluating message understanding systems. Computational Linguistics 19(3):409-450.

Grosz, B., K. Sparck Jones, and B. Webber. (1986). Readings in Natural Language Processing. San Francisco: Morgan Kaufmann.

Hutchins, W. J., and H. Somers. (1992). An Introduction to Machine Translation. New York: Academic Press.

Lewis, D. D., and K. Sparck Jones. (1996). Natural language processing for information retrieval. Communications of the ACM 39(1):92-101.