Computational Lexicons

A computational lexicon has traditionally been viewed as a repository of lexical information for specific tasks, such as parsing, generation, or translation. From this viewpoint, it must contain two types of knowledge: (1) knowledge needed for syntactic analysis and synthesis, and (2) knowledge needed for semantic interpretation. More recently, the definition of a computational lexicon has undergone major revision as the fields of COMPUTATIONAL LINGUISTICS and semantics have matured. In particular, two new trends have driven the design concerns of researchers:

Two new approaches to modeling the structure of the LEXICON have recently emerged in computational linguistics: (1) theoretical studies of how computations take place in the mental lexicon; (2) developments of computational models of information as structured in lexical databases. The differences between the computational study of the lexicon and more traditional linguistic approaches can be summarized as follows:

(1) Lexical representations must be explicit. The knowledge contained in them must be sufficiently detailed to support one or more processing applications.
(2) The global structure of the lexicon must be modeled. Real lexicons are complex knowledge bases, and hence the structures relating entire words are as important as those relating components of words. Furthermore, lexical entries consisting of more than one orthographic word (collocations, idioms, and compounds) must also be represented.
(3) The lexicon must provide sufficient coverage of its domain. Real lexicons can contain up to 400,000 entries. For example, a typical configuration might be: verbs (5K), nouns (30K), adjectives (5K), adverbs (<1K), logical terms (<1K), rhetorical terms (<1K), compounds (2K), proper names (300K), and various sublanguage terms.
(4) Computational lexicons must be evaluable. They are typically evaluated in terms of: (i) coverage: both the breadth of the lexicon and the depth of lexical information; (ii) extensibility: how easily information can be added to a lexical entry, and how readily new information is made consistent with the other lexical structures; (iii) utility: how useful the lexical entries are for specific tasks and applications.
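As a rough illustration of the coverage criterion, the sketch below measures breadth as the fraction of tokens in a text that have an entry, and depth as the average fraction of expected fields filled in per entry. The field inventory, the toy lexicon, and both measures are assumptions made for illustration; they are not a standard evaluation protocol.

```python
# Assumed inventory of fields a "deep" entry would fill in (hypothetical).
EXPECTED_FIELDS = ("category", "subcat", "semantic_type")

# Hypothetical toy lexicon: each entry maps field names to values.
toy_lexicon = {
    "devour": {"category": "verb", "subcat": ["NP"], "semantic_type": "event"},
    "pebble": {"category": "noun", "semantic_type": "entity"},
    "heavy": {"category": "adjective"},
}


def breadth(lexicon, tokens):
    """Fraction of tokens that have some entry in the lexicon."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)


def depth(lexicon):
    """Average fraction of the expected fields filled in per entry."""
    if not lexicon:
        return 0.0
    filled = [
        sum(1 for f in EXPECTED_FIELDS if f in entry) / len(EXPECTED_FIELDS)
        for entry in lexicon.values()
    ]
    return sum(filled) / len(filled)


tokens = ["heavy", "smokers", "devour", "pebble"]
print(breadth(toy_lexicon, tokens))  # 3 of the 4 tokens have an entry
print(round(depth(toy_lexicon), 2))
```

Extensibility and utility are harder to reduce to a single number, since they depend on the update procedure and on the application in which the lexicon is used.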

Viewed independently of any specific application and evaluated in terms of its relevance to cognitive science, the recent work on computational lexicons makes several important points. The first is that the lexical and interlexical structures employed in computational studies have provided some of the most complete descriptions of the lexical bases of natural languages. Besides the broad descriptive coverage of these lexicons, the architectural decisions involved in these systems have important linguistic and psychological consequences. For example, the legitimacy and usefulness of many theoretical constructions and abstract descriptions can be tested and verified by attempting to instantiate them in as complete and robust a lexicon as possible. Of course, completeness does not ensure correctness, nor does it ensure a particularly interesting lexicon from a theoretical point of view, but explicit representations do reveal the limitations of a given analytical framework.

Content of a Single Lexical Entry

Although there are many competing views on the exact structure of lexical entries, there are some important common assumptions about the content of a lexical entry. It is generally agreed that there are three necessary components to the structure of a lexical item: orthographic and morphological information (i.e., how the word is spelled and what forms it appears in); syntactic information (for instance, what part of speech the word is); and semantic information (i.e., what representation the word translates to).
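The three components can be pictured as fields of a record. The following is a minimal sketch, assuming a flat record with hypothetical field names (lemma, forms, category, semantics); actual lexicons differ considerably in how each component is structured.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    # Orthographic and morphological information
    lemma: str
    forms: dict = field(default_factory=dict)  # e.g. {"past": "devoured"}
    # Syntactic information
    category: str = ""
    # Semantic information (a placeholder label, not a full representation)
    semantics: str = ""

    def surface_forms(self):
        """All spellings under which this entry can appear."""
        return {self.lemma, *self.forms.values()}


devour = LexicalEntry(
    lemma="devour",
    forms={"3sg": "devours", "past": "devoured", "prog": "devouring"},
    category="verb",
    semantics="event",
)
print("devoured" in devour.surface_forms())
```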

Syntactic information may be divided into the subtypes of category and subcategory. Category information includes traditional categories such as noun, verb, adjective, adverb, and preposition. While most systems agree on these "major" categories, there are often great differences in the ways they classify "minor" categories, such as conjunctions, quantifier elements, determiners, etc.

Subcategory information is information that divides syntactic categories into subclasses. This sort of information may be usefully separated into two types, contextual features and inherent features. The former are features that may be defined in terms of the contexts in which a given lexical entry may occur. Subcategorization information marks the local legitimate context for a word to appear in a syntactic structure. For example, the verb devour is never intransitive in English and requires a direct object; hence the lexicon tags the verb with a subcategorization requiring an NP object. Another type of context encoding is collocational information, where patterns that are not fully productive in the grammar can be tagged. For example, the adjective heavy as applied to drinker and smoker is collocational and not freely productive in nature.
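Contextual features of this kind can be sketched as simple lookup tables. In the sketch below, the frame notation, the table names, and the entry for eat (added only for contrast with devour) are illustrative assumptions rather than the encoding of any particular lexicon.

```python
# Hypothetical subcategorization frames: each verb lists the complement
# sequences it allows. An empty tuple means an intransitive use is licensed.
SUBCAT = {
    "devour": [("NP",)],    # requires a direct object
    "eat": [(), ("NP",)],   # assumed optionally transitive, for contrast
}

# Hypothetical collocation table for patterns that are not freely productive.
COLLOCATIONS = {("heavy", "drinker"), ("heavy", "smoker")}


def licenses(verb, complements):
    """True if the verb has a frame matching the given complement categories."""
    return tuple(complements) in SUBCAT.get(verb, [])


def is_collocation(adjective, noun):
    """True if the adjective-noun pair is listed as a collocation."""
    return (adjective, noun) in COLLOCATIONS


print(licenses("devour", ["NP"]))  # transitive use
print(licenses("devour", []))      # intransitive use
print(is_collocation("heavy", "smoker"))
```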

Inherent features are features of lexical entries that cannot, or cannot easily, be reduced to a contextual definition. They include such features as count/mass (e.g., pebble vs. water), abstract, animate, human, and so on.

Semantic information can also be separated into two subcategories, base semantic typing and selectional typing. The former identifies the broad semantic class that a lexical item belongs to (such as event, proposition, predicate), whereas the latter specifies the semantic features of arguments and adjuncts to the lexical item.
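Inherent features and selectional typing interact: a predicate's selectional requirements are checked against the inherent features of its arguments. The sketch below assumes features are unordered sets of labels; the particular feature assignments, including the requirement that the subject of devour be animate, are hypothetical examples.

```python
# Hypothetical inherent features for a few nouns.
NOUN_FEATURES = {
    "child": {"count", "animate", "human"},
    "pebble": {"count"},
    "water": {"mass"},
}

# Hypothetical predicate entries: a base semantic type plus the features
# that each argument slot selects for.
PREDICATES = {
    "devour": {
        "base_type": "event",
        "selects": {"subject": {"animate"}, "object": set()},
    },
}


def satisfies(predicate, role, noun):
    """True if the noun carries every feature the role selects for."""
    required = PREDICATES[predicate]["selects"][role]
    return required <= NOUN_FEATURES.get(noun, set())


print(satisfies("devour", "subject", "child"))
print(satisfies("devour", "subject", "pebble"))
```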

Global Structure of the Lexicon

From the discussion above, the entries in a lexicon would appear to encode only concepts such as category information, selectional restrictions, number, type and case roles of arguments, and so forth. While the utility of this kind of information is beyond doubt, the emphasis on the individual entry misses out on the issue of global lexical organization. This is not to dismiss ongoing work that does focus precisely on this issue; for instance, attempts to relate grammatical alternations with semantic classes (e.g., Levin 1993).

One obvious way to organize lexical knowledge is by means of lexical inheritance mechanisms. In fact, much recent work has focused on how to provide shared data structures for syntactic and morphological knowledge (Flickinger, Pollard, and Wasow 1985). Evans and Gazdar (1990) provide a formal characterization of how to perform inferences in a language for multiple and default inheritance of linguistic knowledge. The language developed for that purpose, DATR, uses value-terminated attribute trees to encode lexical information. Taking a different approach, Briscoe, de Paiva, and Copestake (1993) describe a rich system of types that admits default mechanisms in lexical type descriptions.
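The core idea of default inheritance, in which a value stated locally overrides one inherited from a more general class, can be sketched in a few lines. This is not DATR and does not reproduce its syntax or its multiple-inheritance semantics; it is a single-parent toy, and the class names, attributes, and the irregular entry for eat are assumptions made for illustration.

```python
class LexNode:
    """A node in a lexical hierarchy with default (overridable) inheritance."""

    def __init__(self, name, parent=None, **local):
        self.name = name
        self.parent = parent
        self.local = local  # attribute values stated directly on this node

    def get(self, attribute):
        """Return the most specific value, walking up toward the root."""
        node = self
        while node is not None:
            if attribute in node.local:
                return node.local[attribute]
            node = node.parent
        raise KeyError(attribute)


# General classes state defaults; individual entries state only exceptions.
verb = LexNode("VERB", category="verb", past_rule="add -ed")
transitive = LexNode("TRANSITIVE", parent=verb, subcat=("NP",))
devour = LexNode("devour", parent=transitive)
eat = LexNode("eat", parent=transitive, past_rule="irregular: ate")

print(devour.get("past_rule"))  # inherited default
print(eat.get("past_rule"))     # local override
```

The entry for devour states nothing locally, so all of its information is shared with the classes above it, which is the economy that inheritance-based lexicons aim for.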

Along a similar line, Pustejovsky and Boguraev (1993) describe a theory of shared semantic information based on orthogonal typed inheritance principles, where there are several distinct levels of semantic description for a lexical item. In particular, a set of semantic roles called qualia structure is relevant to just this issue. These roles specify the purpose (telic), origin (agentive), basic form (formal), and constitution (const) of the lexical item. In this view, a lexical item inherits information according to the qualia structure it carries, and multiple inheritance can be largely avoided because the qualia constrain the types of concepts that can be put together. For example, the predicates cat and pet refer to formal and telic qualia, respectively.
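One way to picture how qualia keep concepts from colliding is to treat each role as a separate slot and allow two descriptions to combine only when they do not fill the same slot differently. The sketch below is an illustrative assumption, not the formalism of Pustejovsky and Boguraev (1993); the combine function and the string values placed in the slots are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

ROLES = ("formal", "const", "telic", "agentive")


@dataclass
class Qualia:
    formal: Optional[str] = None    # basic form
    const: Optional[str] = None     # constitution
    telic: Optional[str] = None     # purpose
    agentive: Optional[str] = None  # origin


def combine(a, b):
    """Merge two qualia structures; fail if a role is filled differently."""
    merged = {}
    for role in ROLES:
        x, y = getattr(a, role), getattr(b, role)
        if x is not None and y is not None and x != y:
            raise ValueError(f"conflicting {role} values: {x!r} vs {y!r}")
        merged[role] = x if x is not None else y
    return Qualia(**merged)


cat = Qualia(formal="cat")  # cat contributes a formal value
pet = Qualia(telic="pet")   # pet contributes a telic value
pet_cat = combine(cat, pet)
print(pet_cat)
```

Because cat and pet fill different roles, they combine without conflict, whereas two descriptions that both fix the formal role to different values would be rejected.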

The Computational Lexicon as Knowledge Base

The interplay of the lexical needs of current language processing frameworks and contemporary lexical semantic theories strongly influences the direction of computational dictionary analysis research for lexical acquisition. Given the increasingly prominent place the lexicon is assigned in linguistic theories, in language processing technology, and in domain descriptions, it is no accident, nor is it mere rhetoric, that the term "lexical knowledge base" has become a widely accepted one. Researchers use it to refer to a large-scale repository of lexical information that incorporates more than just "static" descriptions of words (for example, clusters of properties and associated values). A lexical knowledge base should state constraints on word behavior, dependence of word interpretation on context, and distribution of linguistic generalizations.

A lexicon is essentially a dynamic object, as it incorporates, in addition to its information types, the ability to perform inference over them and thus induce word meaning in context. This is what a computational lexicon is: a theoretically sound and computationally useful resource for real application tasks and for gaining insights into human cognitive abilities.

-- James Pustejovsky


References

Briscoe, T., V. de Paiva, and A. Copestake, Eds. (1993). Inheritance, Defaults, and the Lexicon. Cambridge: Cambridge University Press.

Evans, R., and G. Gazdar. (1990). Inference in DATR. Proceedings of the Fourth European ACL Conference, April 10-12, 1989, Manchester, England.

Flickinger, D., C. Pollard, and T. Wasow. (1985). Structure-sharing in lexical representation. Proceedings of the 23rd Annual Meeting of the ACL, Chicago, IL, pp. 262-267.

Grimshaw, J. (1990). Argument Structure. Cambridge, MA: MIT Press.

Guthrie, L., J. Pustejovsky, Y. Wilks, and B. Slator. (1996). The role of lexicons in natural language processing. Communications of the ACM 39:1.

Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press.

Miller, G. (1990). WordNet: an on-line lexical database. International Journal of Lexicography 3:235-312.

Pollard, C., and I. Sag. (1987). Information-Based Syntax and Semantics. CSLI Lecture Notes Number 13. Stanford, CA: CSLI.

Pustejovsky, J., and B. Boguraev. (1993). Lexical knowledge representation and natural language processing. Artificial Intelligence 63:193-223.

Further Readings

Atkins, B. (1990). Building a lexicon: Reconciling anisomorphic sense differentiations in machine-readable dictionaries. Paper presented at BBN Symposium: Natural Language in the 90s -- Language and Action in the World, Cambridge, MA.

Boguraev, B., and E. Briscoe. (1989). Computational Lexicography for Natural Language Processing. Harlow and London: Longman.

Boguraev, B., and J. Pustejovsky. (1996). Corpus Processing for Lexical Acquisition. Cambridge, MA: Bradford Books/MIT Press.

Briscoe, E., A. Copestake, and B. Boguraev. (1990). Enjoy the paper: Lexical semantics via lexicology. Proceedings of the 13th International Conference on Computational Linguistics, Helsinki, Finland, pp. 42-47.

Calzolari, N. (1992). Acquiring and representing semantic information in a lexical knowledge base. In J. Pustejovsky and S. Bergler, Eds., Lexical Semantics and Knowledge Representation. New York: Springer Verlag.

Copestake, A., and E. Briscoe. (1992). Lexical operations in a unification-based framework. In J. Pustejovsky and S. Bergler, Eds., Lexical Semantics and Knowledge Representation. New York: Springer Verlag.

Evens, M. (1987). Relational Models of the Lexicon. Cambridge: Cambridge University Press.

Grishman, R., and J. Sterling. (1992). Acquisition of selectional patterns. Proceedings of the 14th International Conference on Computational Linguistics (COLING 92), Nantes, France.

Hirst, G. (1987). Semantic Interpretation and the Resolution of Ambiguity. Cambridge: Cambridge University Press.

Hobbs, J., W. Croft, T. Davies, D. Edwards, and K. Laws. (1987). Commonsense metaphysics and lexical semantics. Computational Linguistics 13:241-250.

Ingria, R., B. Boguraev, and J. Pustejovsky. (1992). Dictionary/Lexicon. In Stuart Shapiro, Ed., Encyclopedia of Artificial Intelligence. 2nd ed. New York: Wiley.

Miller, G. (1991). The Science of Words. Scientific American Library.

Pustejovsky, J. (1992). Lexical semantics. In Stuart Shapiro, Ed., Encyclopedia of Artificial Intelligence. 2nd ed. New York: Wiley.

Pustejovsky, J. (1995). The Generative Lexicon. Cambridge, MA: MIT Press.

Pustejovsky, J., S. Bergler, and P. Anick. (1993). Lexical semantic techniques for corpus analysis. Computational Linguistics 19 (2).

Salton, G. (1991). Developments in automatic text retrieval. Science 253:974.

Weinreich, U. (1972). Explorations in Semantic Theory. The Hague: Mouton.

Wilks, Y. (1975). An intelligent analyzer and understander for English. Communications of the ACM 18:264-274.

Wilks, Y., D. Fass, C-M. Guo, J. McDonald, T. Plate, and B. Slator. (1989). A tractable machine dictionary as a resource for computational semantics. In B. Boguraev and E. Briscoe, Eds., Computational Lexicography for Natural Language Processing. Harlow and London: Longman, pp. 193-228.