Statistical Techniques in Natural Language Processing

Statistical natural language processing is concerned with the creation of computer programs that can perform language-processing tasks by virtue of information gathered from (typically large) corpora. Usually this information is in the form of statistics, but it may be distilled into other forms, such as DECISION TREES, dictionary entries, or various kinds of rules (see also NATURAL LANGUAGE PROCESSING). The techniques have been applied to (or have correlates in) areas as diverse as lexicography and document retrieval (Church and Mercer 1993). Here, however, we concentrate on subareas closer to cognitive science.

Most statistical language programs sacrifice depth or accuracy for breadth of coverage. That is, statistical NLP programs typically work on almost anything one throws at them, but they do not provide very deep analyses and/or may make occasional mistakes. This has the salutary effect of making it easier to compare competing techniques, since they will work on the same body of data (i.e., "everything"). They work as well as they do because one can get a great deal of leverage from relatively weak (and easily collected) pieces of information -- combined with the fact that, at least at the shallower linguistic depths, computers seem better at collecting linguistic phenomena than people are at thinking them up.

Historically, the push behind statistical techniques came primarily from the speech-recognition community, which discovered in the early 1970s that programs trained automatically from corpora worked better than their hand-tooled versions. These programs exploited the training properties of HIDDEN MARKOV MODELS (HMMs; Levinson, Rabiner, and Sondhi 1983). These models can be thought of as finite-state machines in which the transitions between states have probabilities. There are well-understood mathematical techniques for training them -- adjusting the probabilities to better fit the observed data. The "hidden" in their name comes from the fact that in such models one cannot observe the sequence of states that produced the output (or, equivalently, accepted the input), but there are linear-time techniques for finding the most probable such state sequence. The trick is then to identify states with the properties one wishes to infer (e.g., the word uttered).
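
For concreteness, here is a sketch of that linear-time computation (the Viterbi algorithm) in Python; the dictionaries standing in for the model's start, transition, and emission probabilities are illustrative assumptions, not any particular system's.

    # A sketch of Viterbi decoding: find the most probable sequence of hidden
    # states for a sequence of observations under an HMM whose probabilities
    # are given as dictionaries (an illustrative toy representation).
    def viterbi(observations, states, start_p, trans_p, emit_p):
        # best[t][s]: probability of the best state sequence ending in s at time t
        best = [{s: start_p.get(s, 0.0) * emit_p.get(s, {}).get(observations[0], 0.0)
                 for s in states}]
        back = [{}]
        for t in range(1, len(observations)):
            best.append({})
            back.append({})
            for s in states:
                prob, prev = max(
                    (best[t - 1][r] * trans_p.get(r, {}).get(s, 0.0)
                     * emit_p.get(s, {}).get(observations[t], 0.0), r)
                    for r in states)
                best[t][s] = prob
                back[t][s] = prev
        # Recover the best state sequence by following the back-pointers.
        state = max(states, key=lambda s: best[-1][s])
        path = [state]
        for t in range(len(observations) - 1, 0, -1):
            state = back[t][state]
            path.append(state)
        return list(reversed(path))

The running time grows linearly with the length of the observation sequence (and quadratically with the number of states), which is the sense in which the computation is linear.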

These speech programs also typically incorporate language models, assignments of probabilities to all sequences of words in the language. The most popular such model is the simple but remarkably accurate trigram model (Jelinek 1985), in which the probability of the next word is conditioned on just the two previous words. The language model enables the speech recognizer to pick the best word in context when the speech signal analysis by itself is insufficient (Church and Mercer 1993).
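
A minimal sketch of such a model uses unsmoothed maximum-likelihood estimates; real recognizers smooth the counts (by backing off or interpolating with bigram and unigram estimates) so that unseen trigrams do not receive zero probability.

    from collections import defaultdict

    # Maximum-likelihood trigram model: P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).
    def train_trigrams(sentences):
        bigram_counts = defaultdict(int)
        trigram_counts = defaultdict(int)
        for sentence in sentences:
            words = ["<s>", "<s>"] + sentence + ["</s>"]   # pad with boundary markers
            for w1, w2, w3 in zip(words, words[1:], words[2:]):
                bigram_counts[(w1, w2)] += 1
                trigram_counts[(w1, w2, w3)] += 1
        return bigram_counts, trigram_counts

    def trigram_prob(w1, w2, w3, bigram_counts, trigram_counts):
        if bigram_counts[(w1, w2)] == 0:
            return 0.0
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]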

An early and successful application of HMM technology to language tasks was the HMM "tagger": a program that tries to assign the correct part of speech to each word in a text. For example, in "the can will rust" the program should identify "can" as a noun (not a modal verb) and "will" as a modal verb (not a noun). In these programs, the states of the HMM correspond to the parts of speech, so that finding the most probable sequence of states when the HMM is driven by the input gives us the desired tagging (Church 1988). Other schemes compile the statistics into rules (Brill 1995). Moderately well-crafted programs achieve about 96 percent accuracy. For comparison, human taggers are consistent with one another at a 98 percent level.
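
The probabilities such a tagger needs can be estimated from a hand-tagged corpus by simple counting. The sketch below assumes the corpus is available as lists of (word, tag) pairs -- an illustrative format -- and produces transition and emission tables that could be handed to the Viterbi sketch given earlier, with the transitions out of the sentence-start marker serving as the start distribution.

    from collections import defaultdict

    # Estimate HMM tagger parameters from a hand-tagged corpus (a sketch).
    def train_tagger(tagged_sentences):
        trans_counts = defaultdict(lambda: defaultdict(int))   # previous tag -> next tag
        emit_counts = defaultdict(lambda: defaultdict(int))    # tag -> word
        for sentence in tagged_sentences:
            prev = "<s>"                                       # sentence-start marker
            for word, tag in sentence:
                trans_counts[prev][tag] += 1
                emit_counts[tag][word] += 1
                prev = tag
        # Turn the counts into conditional probabilities.
        trans_p = {prev: {t: c / sum(nxt.values()) for t, c in nxt.items()}
                   for prev, nxt in trans_counts.items()}
        emit_p = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                  for t, ws in emit_counts.items()}
        return trans_p, emit_p

Tagging the words of a sentence is then a call such as viterbi(words, list(emit_p), trans_p["<s>"], trans_p, emit_p).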

Moving up the linguistic food chain, statistical techniques are now being applied to wide-coverage syntactic parsing -- determining the syntactic structure of a sentence. Such parsers are able to come up with at least a semi-reasonable structure for, say, every sentence on the front page of today's New York Times. In this area, statistical parsers rule the roost. One simple idea here is the probabilistic context-free grammar (PCFG; Charniak 1993). PCFGs are context-free grammars in which each rule is assigned a probability. Alternatively, they can be thought of as the context-free generalization of HMMs. PCFGs can be parsed using the standard techniques applied to their nonprobabilistic brethren, and the most probable parse for a sentence can be found in n-cubed time, where n is the length of the sentence.
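
The n-cubed computation can be made concrete with a probabilistic version of the CKY chart-parsing algorithm. The sketch below assumes a toy PCFG in Chomsky normal form, with binary rules and lexical rules held in dictionaries mapping each rule to its probability -- an illustrative representation only.

    # Probabilistic CKY: the most probable parse of a sentence under a PCFG
    # in Chomsky normal form. binary_rules maps (A, B, C) to P(A -> B C);
    # lexical_rules maps (A, word) to P(A -> word).
    def cky_parse(words, binary_rules, lexical_rules, start="S"):
        n = len(words)
        # best[i][j][A] = (probability, tree) of the best A spanning words[i:j]
        best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            for (A, word), p in lexical_rules.items():
                if word == w and p > best[i][i + 1].get(A, (0.0,))[0]:
                    best[i][i + 1][A] = (p, (A, w))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):                      # split point
                    for (A, B, C), p in binary_rules.items():
                        if B in best[i][k] and C in best[k][j]:
                            prob = p * best[i][k][B][0] * best[k][j][C][0]
                            if prob > best[i][j].get(A, (0.0,))[0]:
                                best[i][j][A] = (prob,
                                                 (A, best[i][k][B][1], best[k][j][C][1]))
        return best[0][n].get(start)    # (probability, tree), or None if no parse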

Statistical parsing researchers have greatly benefited from moderate-size corpora of hand-parsed sentences, the most notable being the one-million-word Penn tree-bank (Marcus, Santorini, and Marcinkiewicz 1993). For example, one very simple technique for "learning" a PCFG grammar is to read it directly off the parse trees in the corpus. One estimates the rule probabilities by counting how often each rule is used and dividing this by the number of times its left-hand-side nonterminal is expanded (Charniak 1996). Programs using this technique, or others using similar sources of information, parse at about a 75 percent level of accuracy. That is, 75 percent of the constituents produced are correct, where correct means that the constituent starts at the correct place, ends at the correct place, and has the right "label" (e.g., "noun phrase"). Researchers are exploring techniques that exploit more of the information in the hand-parsed corpora. One popular idea involves keeping track of the "head" of a constituent, the constituent's most important lexical item (e.g., the main noun in a noun phrase), and then conditioning probabilities on this head. These techniques are now achieving an 87 percent accuracy level (Charniak 1997; Collins 1997).
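
The counting involved can be sketched as follows, assuming the hand-parsed trees are represented as nested tuples such as ("S", ("NP", ("DT", "the"), ("NN", "can")), ...) -- an illustrative format rather than the treebank's own bracketing.

    from collections import defaultdict

    # Read a PCFG off a corpus of hand-parsed trees: count each rule and divide
    # by the number of expansions of its left-hand-side nonterminal.
    def treebank_grammar(trees):
        rule_counts = defaultdict(int)
        lhs_counts = defaultdict(int)

        def visit(node):
            if isinstance(node, str):      # a bare word; no rule to record
                return
            lhs, children = node[0], node[1:]
            rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
            for child in children:
                visit(child)

        for tree in trees:
            visit(tree)
        return {rule: count / lhs_counts[rule[0]]
                for rule, count in rule_counts.items()}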

Statistical word-sense disambiguation is yet another thriving area. Here we can exploit the fact that very different senses of a word (e.g., river bank vs. savings bank) typically occur in texts concerning quite different topics, and this in turn is signaled by the occurrence of different content words in the surrounding text. So if we see "leaf" and "water" near "bank," river bank is a good guess. Thus, to a first approximation, statistical word-sense disambiguation programs work by taking a window of, say, the 100 words on either side of the ambiguous word and determining the probability of seeing these words under each of the candidate senses. To restate more carefully what we said informally about "bank": if the words in the window include "leaf" and "water" but do not include, say, "money" or "loan," then the probability of the words is higher under the river bank interpretation. The hard part in all of this is collecting the statistics needed without access to a large corpus in which the words are marked with their senses. Researchers have solved this in a variety of interesting ways: using extra sources of information such as bilingual corpora (Gale, Church, and Yarowsky 1992), looking for clusters in probability space that indicate an identifiable sense (Schütze 1992), or using dictionary definitions as key words to start the process (Yarowsky 1995).
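
Given per-sense probability tables (whose estimation is, as just noted, the hard part -- here they are simply assumed to exist), the window-based choice itself is a naive-Bayes-style calculation, sketched below.

    import math

    # Pick the sense under which the words in the surrounding window are most
    # probable. prior[sense] and word_given_sense[sense][word] are assumed
    # tables; floor stands in for proper smoothing of unseen words.
    def choose_sense(window_words, senses, prior, word_given_sense, floor=1e-6):
        best_sense, best_score = None, float("-inf")
        for sense in senses:
            score = math.log(prior[sense])
            for w in window_words:
                score += math.log(word_given_sense[sense].get(w, floor))
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    # e.g., choose_sense(["leaf", "water"], ["river bank", "savings bank"],
    #                    prior, word_given_sense) should return "river bank".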

One potential application of these ideas is MACHINE TRANSLATION. In preparation for such work, there is now an active research community looking at "parallel corpora" -- articles and their translations side by side. One problem here is bringing the alignment down from the article level to the sentence level (Brown, Lai, and Mercer 1991), and another is creating bilingual lexica (Brown et al. 1993). Statistical techniques in both these areas (frequently done in parallel) are quite accurate.
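
A much-simplified sketch of sentence-level alignment scores candidate pairings by how well the sentence lengths match and picks the best overall alignment by dynamic programming; published aligners also allow two-to-one and one-to-two pairings and use probabilistic costs, all of which is omitted here, and every particular below is an illustrative assumption.

    # Length-based sentence alignment (a simplified sketch): allow 1-1 pairings
    # and unaligned sentences, scoring a pairing by the difference in length.
    def align(src, tgt, skip_cost=10.0):
        n, m = len(src), len(tgt)
        INF = float("inf")
        cost = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0 and j > 0:        # pair src[i-1] with tgt[j-1]
                    c = cost[i - 1][j - 1] + abs(len(src[i - 1].split())
                                                 - len(tgt[j - 1].split()))
                    if c < cost[i][j]:
                        cost[i][j], back[i][j] = c, (i - 1, j - 1)
                if i > 0 and cost[i - 1][j] + skip_cost < cost[i][j]:
                    cost[i][j], back[i][j] = cost[i - 1][j] + skip_cost, (i - 1, j)
                if j > 0 and cost[i][j - 1] + skip_cost < cost[i][j]:
                    cost[i][j], back[i][j] = cost[i][j - 1] + skip_cost, (i, j - 1)
        # Trace back the chosen one-to-one pairings (as index pairs).
        pairs, i, j = [], n, m
        while (i, j) != (0, 0):
            pi, pj = back[i][j]
            if (pi, pj) == (i - 1, j - 1):
                pairs.append((i - 1, j - 1))
            i, j = pi, pj
        return list(reversed(pairs))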

-- Eugene Charniak

References

Brill, E. (1995). Unsupervised learning of disambiguation rules for part of speech tagging. Proceedings of the Third Workshop on Very Large Corpora, pp. 1-13.

Brown, P., J. Lai, and R. Mercer. (1991). Aligning sentences in parallel corpora. Proceedings of the Twenty-ninth Annual Meeting of the ACL, pp. 169-176.

Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2):263-312.

Charniak, E. (1993). Statistical Language Learning. Cambridge, MA: MIT Press.

Charniak, E. (1996). Tree-bank grammars. Proceedings of the Thirteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press/MIT Press, pp. 1031-1036.

Charniak, E. (1997). Statistical parsing with a context-free grammar and word statistics. Proceedings of the Fourteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press/MIT Press, pp. 598-603.

Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136-143.

Church, K. W., and R. L. Mercer. (1993). Introduction to the special issue on computational linguistics and large corpora. Computational Linguistics 19:1-24.

Collins, M. J. (1997). Three generative lexicalised models for statistical parsing. Proceedings of the Thirty-fifth Annual Meeting of the ACL, pp. 16-23.

Gale, W. A., K. W. Church, and D. Yarowsky. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities.

Jelinek, F. (1985). Self-organized language modeling for speech recognition. Yorktown Heights, NY: IBM T. J. Watson Research Center, Continuous Speech Recognition Group.

Levinson, S. E., L. R. Rabiner, and M. M. Sondhi. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical Journal 62(4):1035-1074.

Marcus, M. P., B. Santorini, and M. A. Marcinkiewicz. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics 19:313-330.

Schütze, H. (1992). Dimensions of meaning. Proceedings of Supercomputing 92.

Yarowsky, D. (1995). Unsupervised word-sense disambiguation rivaling supervised methods. Proceedings of the Thirty-third Annual Meeting of the Association for Computational Linguistics.

Further Readings

Brown, P. F., V. J. Della Pietra, P. V. DeSouza, J. C. Lai, and R. L. Mercer. (1992). Class-based n-gram models of natural language. Computational Linguistics 18(4):467-479.

Caraballo, S., and E. Charniak. (Forthcoming). New figures of merit for best-first probabilistic chart parsing. Computational Linguistics.

Cardie, C. (1992). Corpus-based acquisition of relative pronoun disambiguation heuristics. Proceedings of the Thirtieth Annual Meeting of the ACL, pp. 216-223.

Francis, W. N., and H. Kucera. (1982). Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin.

Goodman, J. (1996). Parsing algorithms and metrics. Proceedings of the Thirty-fourth Annual Meeting of the ACL, pp. 177-183.

Jelinek, F., and J. D. Lafferty. (1991). Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics 17:315-324.

Lappin, S., and H. J. Leass. (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics 20:535-561.

Lauer, M. (1995). Corpus statistics meet the noun compound: Some empirical results. Proceedings of the Thirty-third Annual Meeting of the ACL, pp. 47-55.

Pereira, F., N. Tishby, and L. Lee. (1993). Distributional clustering of English words. Proceedings of the Thirty-first Annual Meeting of the ACL.

Riloff, E. (1996). Automatically generating extraction patterns from untagged text. Proceedings of the Thirteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press/MIT Press, pp. 1044-1049.