MITECS: Computational Lexicons

A computational lexicon has traditionally been viewed as a repository of lexical information for specific tasks, such as parsing, generation, or translation. From this viewpoint, it must contain two types of knowledge: (1) knowledge needed for syntactic analysis and synthesis, and (2) knowledge needed for semantic interpretation. More recently, the definition of a computational lexicon has undergone major revision as the fields of COMPUTATIONAL LINGUISTICS and semantics have matured. In particular, two new trends have driven the design concerns of researchers:

Attempts at closer integration of compositional semantic operations with the lexical information structures that bear them
A serious concern with how lexical types reflect the underlying ontological categories of the systems being modeled

Two new approaches to modeling the structure of the LEXICON have recently emerged in computational linguistics: (1) theoretical studies of how computations take place in the mental lexicon; (2) developments of computational models of information as structured in lexical databases. The differences between the computational study of the lexicon and more traditional linguistic approaches can be summarized as follows:

Viewed independently of any specific application and evaluated in terms of its relevance to cognitive science, the recent work on computational lexicons makes several important points. The first is that the lexical and interlexical structures employed in computational studies have provided some of the most complete descriptions of the lexical bases of natural languages. Besides the broad descriptive coverage of these lexicons, the architectural decisions involved in these systems have important linguistic and psychological consequences. For example, the legitimacy and usefulness of many theoretical constructions and abstract descriptions can be tested and verified by attempting to instantiate them in as complete and robust a lexicon as possible. Of course, completeness doesn"t ensure correctness nor does it ensure a particularly interesting lexicon from a theoretical point of view, but explicit representations do reveal the limitations of a given analytical framework.

Content of a Single Lexical Entry

Although there are many competing views on the exact structure of lexical entries, there are some important common assumptions about the content of a lexical entry. It is generally agreed that there are three necessary components to the structure of a lexical item: orthographic and morphological information; i.e. how the word is spelled and what forms it appears in; syntactic information; for instance, what part of speech the word is; and semantic information; i.e., what representation the word translates to.

Syntactic information may be divided into the subtypes of category and subcategory. Category information includes traditional categories such as noun, verb, adjective, adverb, and preposition. While most systems agree on these "major" categories, there are often great differences in the ways they classify "minor" categories, such as conjunctions, quantifier elements, determiners, etc.

Subcategory information is information that divides syntactic categories into subclasses. This sort of information may be usefully separated into two types, contextual features and inherent features. The former are features that may be defined in terms of the contexts in which a given lexical entry may occur. Subcategorization information marks the local legitimate context for a word to appear in a syntactic structure. For example, the verb devour is never intransitive in English and requires a direct object; hence the lexicon tags the verb with a subcategorization requiring an NP object. Another type of context encoding is collocational information, where patterns that are not fully productive in the grammar can be tagged. For example, the adjective heavy as applied to drinker and smoker is collocational and not freely productive in nature.

Inherent features are features of lexical entries that cannot, or cannot easily, be reduced to a contextual definition. They include such features as count/mass (e.g., pebble vs. water), abstract, animate, human, and so on.

Semantic information can also be separated into two subcategories, base semantic typing and selectional typing. While the former identifies the broad semantic class that a lexical item belongs to (such as event, proposition, predicate), the latter class specifies the semantic features of arguments and adjuncts to the lexical item.

Global Structure of the Lexicon

From the discussion above, the entries in a lexicon would appear to encode only concepts such as category information, selectional restrictions, number, type and case roles of arguments, and so forth. While the utility of this kind of information is beyond doubt, the emphasis on the individual entry misses out on the issue of global lexical organization. This is not to dismiss ongoing work that does focus precisely on this issue; for instance, attempts to relate grammatical alternations with semantic classes (e.g., Levin 1993).

One obvious way to organize lexical knowledge is by means of lexical inheritance mechanisms. In fact, much recent work has focused on how to provide shared data structures for syntactic and morphological knowledge (Flickinger, Pollard, and Wasow 1985). Evans and Gazdar (1990) provide a formal characterization of how to perform inferences in a language for multiple and default inheritance of linguistic knowledge. The language developed for that purpose, DATR, uses value-terminated attribute trees to encode lexical information. Taking a different approach, Briscoe, dePaiva, and Copestake (1993) describe a rich system of types for allowing default mechanisms into lexical type descriptions.

Along a similar line, Pustejovsky and Boguraev (1993) describe a theory of shared semantic information based on orthogonal typed inheritance principles, where there are several distinct levels of semantic description for a lexical item. In particular, a set of semantic roles called qualia structure is relevant to just this issue. These roles specify the purpose (telic), origin (agentive), basic form (formal), and constitution (const) of the lexical item. In this view, a lexical item inherits information according to the qualia structure it carries. In this view, multiple inheritance can be largely avoided because the qualia constrain the types of concepts that can be put together. For example, the predicates cat and pet refer to formal and telic qualia, respectively.

The Computational Lexicon as Knowledge Base

The interplay of the lexical needs of current language processing frameworks and contemporary lexical semantic theories very much influences the direction of computational dictionary analysis research for lexical acquisition. Given the increasingly prominent place the lexicon is assigned -- in linguistic theories, in language processing technology, and in domain descriptions -- it is no accident, nor is it mere rhetoric, that the term "lexical knowledge base" has become a widely accepted one. Researchers use it to refer to a large-scale repository of lexical information, which incorporates more than just "static" descriptions of words, for example, clusters of properties and associated values. A lexical knowledge base should state constraints on word behavior, dependence of word interpretation on context, and distribution of linguistic generalizations.

A lexicon is essentially a dynamic object, as it incorporates, in addition to its information types, the ability to perform inference over them and thus induce word meaning in context. This is what a computational lexicon is: a theoretically sound and computationally useful resource for real application tasks and for gaining insights into human cognitive abilities.

Computational Lexicons

Content of a Single Lexical Entry

Global Structure of the Lexicon

The Computational Lexicon as Knowledge Base

See also

Additional links

References

Further Readings