Connectionist Approaches to Language

In research on theoretical and COMPUTATIONAL LINGUISTICS and NATURAL LANGUAGE PROCESSING, the dominant formal approaches to language have traditionally been theories of RULES AND REPRESENTATIONS. These theories assume an underlying symbolic COGNITIVE ARCHITECTURE based in discrete mathematics: the theory of algorithms for manipulating symbolic data structures such as strings (e.g., of phonemes; see PHONOLOGY), trees (e.g., of nested syntactic phrases; see SYNTAX), graphs (e.g., of conceptual structures deployed in SEMANTICS), and feature structures (e.g., of phonological, syntactic, and semantic properties of nested phrases or their designations). In contrast, connectionist computation is based in the continuous mathematics of NEURAL NETWORKS: the theory of numerical vectors and tensors (e.g., of activation values), matrices (e.g., of connection weights), differential equations (e.g., for the dynamics of spreading activation or learning), and probability and statistics (e.g., for analysis of inductive and statistical inference). How can linguistic phenomena traditionally analyzed with discrete symbolic computation be analyzed with continuous connectionist computation? Two quite different strategies have been pursued for facing this challenge.

The dominant, model-centered strategy proceeds as follows (see COGNITIVE MODELING, CONNECTIONIST; see also STATISTICAL TECHNIQUES IN NATURAL LANGUAGE PROCESSING): specific data illustrating some interesting linguistic phenomena are identified; certain general connectionist principles are hypothesized to account for these data; a concrete instantiation of these principles in a particular connectionist network -- the model -- is selected; computer simulation is used to test the adequacy of the model in accounting for the data; and, if the network employs learning, the network configuration resulting from learning is analyzed to discern the nature of the account that has been learned.

For instance, a historically pivotal model (Rumelhart and McClelland 1986) addressed the data on children's overgeneralization of the regular past tense inflection to irregular verbs; connectionist induction from the statistical preponderance of regular inflection was hypothesized to account for these data; a network incorporating a particular representation of phonological strings and a simple learning rule was proposed; simulations of this model documented considerable but not complete success at learning to inflect irregular, regular, and novel stems; and limited post hoc analysis was performed of the structure that the network acquired and that was responsible for its performance.

The second, principle-centered, strategy approaches language by directly deploying general connectionist principles, without the intervention of a particular network model. Selected connectionist principles are used to directly derive a novel and general linguistic formalism, and this formalism is then used directly for the analysis of particular linguistic phenomena. An example is the "harmonic grammar" formalism (Legendre, Miyata, and Smolensky 1990), in which a grammar is a set of violable or "soft" constraints on the well-formedness of linguistic structures, each with a numerical strength: the grammatical structures are those that simultaneously best satisfy the constraints. As discussed below, this formalism is a consequence of general mathematical principles that can be shown to govern the abstract, high-level properties of the representation and processing of information in certain classes of connectionist systems.
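The soft-constraint idea can be made concrete with a small Python sketch, offered only as an illustration. It scores each candidate structure by the negative weighted sum of its constraint violations and picks the candidate with the highest harmony. The constraint names (`NoCoda`, `Max`, `Dep`), their strengths, and the candidate set are hypothetical; none of them is taken from Legendre, Miyata, and Smolensky (1990).

```python
# Hypothetical constraint strengths (a larger value makes a violation costlier).
WEIGHTS = {"NoCoda": 2.0, "Max": 3.0, "Dep": 1.5}

# Hypothetical candidate outputs for one input, with their violation counts.
CANDIDATES = {
    "pat": {"NoCoda": 1},   # keeps every segment but ends in a coda
    "pa": {"Max": 1},       # deletes a segment
    "pata": {"Dep": 1},     # inserts a segment
}


def harmony(violations, weights):
    """Harmony is the negative weighted sum of constraint violations."""
    return -sum(weights[name] * count for name, count in violations.items())


def best_candidates(candidates, weights):
    """Return the candidates that best satisfy all constraints at once."""
    scores = {name: harmony(v, weights) for name, v in candidates.items()}
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)


winners = best_candidates(CANDIDATES, WEIGHTS)
print(winners)  # ['pata'] under these assumed strengths
```

Changing the assumed strengths changes the winner, which is the sense in which a grammar of this kind is a set of constraints together with their numerical strengths.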

These two connectionist approaches to language are complementary. Although the principle-centered approach is independent of many of the details needed to define a concrete connectionist model, it can exploit only relatively basic connectionist principles. With the exception of the simplest cases, the general emergent cognitive properties of the dynamics of a large number of interacting low-level connectionist variables are not yet characterizable by mathematical analysis -- detailed computer simulation of concrete networks is required.

We now consider several connectionist computational principles and their potential linguistic implications. These principles divide into those pertaining to the learning, the processing, and the representational components of connectionist theory.

Connectionist Inductive Learning Principles

Inductive learning principles provide one class of solution to the problem of how the large numbers of interactions among independent connectionist units can be orchestrated so that their emergent effect is the computation of an interesting linguistic function. Such functions include those relating a verb stem with its past tense (MORPHOLOGY); orthographic with phonological representations of a word (READING and VISUAL WORD RECOGNITION); and a string of words with a representation of its meaning.

Many learning principles have been used to investigate what types of linguistic structure can be induced from examples. SUPERVISED LEARNING techniques learn to compute a given input/output function by adapting the weights of the network during experience with training examples so as to minimize a measure of the overall output error (e.g., each training example might be a pair consisting of a verb stem and its past tense form). UNSUPERVISED LEARNING methods extract regularities from training data without explicit information about the regularities to be learned; for example, a network trained to predict the next letter in an unsegmented stream of text extracts aspects of the distributional structure arising from the repetition of a fixed set of words, enabling the trained network to segment the stream (Elman 1990).
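Error-driven weight adaptation of the supervised kind can be sketched in a few lines of Python. The example below trains a single linear unit with the delta (least-mean-squares) rule, which is one simple error-minimizing rule and is not claimed to be the rule used in any model cited here. The two-feature inputs, the targets, the learning rate, and the number of epochs are all hypothetical.

```python
def train_delta_rule(examples, n_inputs, rate=0.1, epochs=200):
    """Adapt weights after each example to reduce the squared output error."""
    weights = [0.0] * n_inputs
    bias = 0.0
    error_per_epoch = []
    for _ in range(epochs):
        total = 0.0
        for inputs, target in examples:
            output = bias + sum(w * x for w, x in zip(weights, inputs))
            error = target - output
            total += error ** 2
            # Delta rule: change each weight in proportion to error * input.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
        error_per_epoch.append(total)
    return weights, bias, error_per_epoch


# Hypothetical training pairs: the target simply copies the first input feature.
EXAMPLES = [((1, 0), 1.0), ((0, 1), 0.0), ((1, 1), 1.0), ((0, 0), 0.0)]

weights, bias, errors = train_delta_rule(EXAMPLES, n_inputs=2)
print(errors[0], errors[-1])  # summed squared error in the first and last epoch
```

The summed error falls across epochs as the weights approach a solution in which only the first feature matters, a small-scale analogue of a network inducing a regularity from examples.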

A trained network capable of computing, to some degree, a linguistically relevant function has acquired a certain degree of internal structure, manifest in the behavior of the learned network (e.g., its pattern of generalization to novel inputs), or more directly discernible under analysis of the learned connection weights. The final network structure is jointly the product of the linguistic structure of the training examples and the a priori structure explicitly and implicitly provided to the model via the selection of architectural parameters. Linguistically relevant a priori structure includes what is implicit in the representation of inputs and outputs, the pattern of connectivity of the network, and the performance measure that is optimized during learning.

Trained networks have acquired many types of linguistically relevant structure, including nonmonotonic or "U-shaped" development (Rumelhart and McClelland 1986); categorical perception; developmental spurts (Elman et al. 1996); functional modularity (behavioral dissociations in intact or internally damaged networks; Plaut and Shallice 1994); localization of different functions to different spatial portions of the network (Jacobs, Jordan, and Barto 1991); and finite-state machine-like structure corresponding to a learned grammar (Touretzky 1991). Before a consensus can be reached on the implications of learned structure for POVERTY OF THE STIMULUS ARGUMENTS and the INNATENESS OF LANGUAGE, researchers will have to demonstrate incontrovertibly that models lacking grammatical knowledge in their a priori structure can acquire such knowledge (Elman et al. 1996; Pinker and Mehler 1988; Seidenberg 1997). In addition to this model-based research, recent formal work in COMPUTATIONAL LEARNING THEORY based in mathematical statistics has made considerable progress in the area of inductive learning, including connectionist methods, formally relating the justifiability of induction to general a priori limits on the learner's hypothesis space, and quantitatively relating the number of adjustable parameters in a network architecture to the number of training examples needed for good generalization (with high probability) to novel examples (Smolensky, Mozer, and Rumelhart 1996).

Connectionist Processing Principles

The potential linguistic implications of connectionist principles go well beyond learning and the RATIONALISM VS. EMPIRICISM debate. The processing component of connectionist theory includes several relevant principles. For example, in place of serial stages of processing, a connectionist principle that might be dubbed "parallel modularity" hypothesizes that informationally distinct modules (e.g., phonological, orthographic, syntactic, and semantic knowledge) are separate subnetworks operating in parallel with each other, under continuous exchange of information through interface subnetworks (e.g., Plaut and Shallice 1994).

Another processing principle concerns the transformations of activity patterns from one layer of units to the next: In the processing of an input, the influence exerted by a previously stored item is proportional to both the frequency of presentation of the stored item and its "similarity" to the input, where "similarity" of activity patterns is measured by a training-set-dependent metric (see PATTERN RECOGNITION AND FEEDFORWARD NETWORKS). While such frequency- and similarity-sensitive processing is readily termed associative, it must be recognized that "similarity" is defined relative to the internal activation pattern encoding of the entire set of items. This encoding may itself be sensitive to the contextual or structural role of an item (Smolensky 1990); it may be sensitive to certain complex combinations of features of its content, and insensitive altogether to other content features. For example, a representation may encode the syntactic role and category of a word as well as its phonological and semantic content, and the relevant "similarity" metric may be strongly sensitive to the syntactic information, while being completely insensitive to the phonological and semantic information.
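In the simplest case, the dependence on frequency and similarity can be seen directly in a linear associator with Hebbian weights. The Python sketch below assumes that case: similarity is the plain dot product between the input and a stored pattern, which is a simplification of the training-set-dependent metric described above. The stored items and their presentation frequencies are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hebbian_weights(items):
    """Build weights from (input, output, frequency) triples.

    Each presentation adds the outer product of output and input, so an
    item presented `frequency` times contributes that many copies.
    """
    n_in = len(items[0][0])
    n_out = len(items[0][1])
    weights = [[0.0] * n_in for _ in range(n_out)]
    for inputs, outputs, frequency in items:
        for i in range(n_out):
            for j in range(n_in):
                weights[i][j] += frequency * outputs[i] * inputs[j]
    return weights


def respond(weights, probe):
    """Output layer activity: one weighted sum per output unit."""
    return [dot(row, probe) for row in weights]


# Hypothetical stored items: (input pattern, output pattern, frequency).
ITEMS = [
    ((1, 0, 0), (1, 0), 3),  # frequent item
    ((0, 1, 0), (0, 1), 1),  # rare item
]

W = hebbian_weights(ITEMS)
# The probe is equally similar to both stored inputs (dot product 0.5 each).
print(respond(W, (0.5, 0.5, 0)))  # [1.5, 0.5]
```

Because the probe is equally similar to both stored inputs, the difference in the two output activities reflects presentation frequency alone; a probe closer to the rare item would shift the balance the other way.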

A class of RECURRENT NETWORKS with feedback connections is subject to the following principle: The network's activation state space contains a finite set of attractor states, each surrounded by a "basin of attraction"; any input pattern lying in a given basin will eventually produce the corresponding attractor state as its output (see DYNAMIC APPROACHES TO COGNITION). This principle relates a continuous space of possible input patterns and a continuous processing mechanism to a discrete set of outputs, providing the basis for many connectionist accounts of categorical perception, categorical retrieval of lexical items from memory, and categorization processes generally. For example, the pronunciation model of Plaut et al. (1996) acquires a combinatorially structured set of output attractors encoding phonological strings including monosyllabic English words, and an input encoding a letter string yields an output activation pattern that is an attractor for a corresponding pronunciation.
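The mapping from a continuous input space onto a discrete set of attractors can be illustrated with a small Hopfield-style network, written here in plain Python as a sketch. It is far simpler than the model of Plaut et al. (1996), and the two stored patterns, the Hebbian storage rule, and the sequential update order are assumptions chosen for illustration.

```python
def store(patterns):
    """Hebbian storage of +1/-1 patterns, with no self-connections."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def settle(weights, state, max_sweeps=10):
    """Update units one at a time until no unit changes (an attractor)."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            net = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if net >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Two hypothetical stored patterns (mutually orthogonal).
P1 = [1, 1, 1, 1, -1, -1, -1, -1]
P2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = store([P1, P2])

noisy_p1 = [-1, 1, 1, 1, -1, -1, -1, -1]  # P1 with its first unit flipped
noisy_p2 = [1, -1, 1, -1, 1, -1, 1, 1]    # P2 with its last unit flipped
print(settle(W, noisy_p1) == P1, settle(W, noisy_p2) == P2)  # True True
```

Each corrupted input lies in the basin of the pattern it most resembles and is restored to that pattern, so nearby inputs are treated as members of the same category.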

A related principle governing processing in a class of recurrent networks characterizes the output of the network as an optimal activation pattern: among those patterns containing the given input pattern, the output is the pattern that maximizes a numerical well-formedness measure, harmony, or that minimizes "energy" (see also CONSTRAINT SATISFACTION). This principle has been used in combination with the following one to derive a general grammar formalism, harmonic grammar, described above as an illustration of principle-centered research. Harmonic grammar is a precursor to OPTIMALITY THEORY (Prince and Smolensky 1993), which adds further strong restrictions on what constitutes a possible human grammar. These include the universality of grammatical constraints, and the requirement that the strengths of the constraints be such as to entail "strict domination": the cost of violating one constraint can never be exceeded by any amount of violation of weaker constraints.
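Strict domination can be contrasted with ordinary weighted summation in a short Python sketch. The constraint names `C1` and `C2`, the candidates, and the numerical strengths are hypothetical. The sketch treats strict domination as lexicographic comparison of violation counts, strongest constraint first, and shows that a weighted sum reproduces this only when the strengths are separated widely enough for the violation counts in play.

```python
# Hypothetical candidates with violation counts; C1 is the stronger constraint.
CANDIDATES = {
    "A": {"C1": 1},  # one violation of the stronger constraint
    "B": {"C2": 3},  # three violations of the weaker constraint
}
RANKING = ["C1", "C2"]


def strict_winner(candidates, ranking):
    """Compare violation profiles constraint by constraint, strongest first."""
    def profile(name):
        return tuple(candidates[name].get(c, 0) for c in ranking)
    return min(candidates, key=profile)


def weighted_winner(candidates, weights):
    """Pick the candidate with the lowest weighted sum of violations."""
    def cost(name):
        return sum(weights[c] * n for c, n in candidates[name].items())
    return min(candidates, key=cost)


print(strict_winner(CANDIDATES, RANKING))                  # B
print(weighted_winner(CANDIDATES, {"C1": 2.0, "C2": 1.0}))   # A
print(weighted_winner(CANDIDATES, {"C1": 10.0, "C2": 1.0}))  # B
```

With strengths 2 and 1, three violations of the weaker constraint outweigh one violation of the stronger, which strict domination forbids; with strengths 10 and 1, the weighted sum agrees with the strict ranking for these candidates.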

Connectionist Representational Principles

Research on the representational component of connectionist theory has focused on statistically based analyses of internal representations learned by networks trained on linguistic data, and on techniques for representing, in numerical activation patterns, information structured by linear precedence, attribute/value, and dominance relations (e.g., Smolensky 1990; see BINDING PROBLEM). While this research shows how complex linguistic representations may be realized, processed, and learned in connectionist networks, contributions to the theory of linguistic representation remain largely a future prospect.
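One such technique, tensor product binding (Smolensky 1990), can be sketched in plain Python for the case of linear precedence. Each filler vector is bound to a role vector by an outer product, the bindings are summed into one activation pattern, and a filler is recovered by projecting onto its role. The filler and role vectors below are hypothetical, and the unbinding step as written assumes mutually orthogonal role vectors.

```python
def outer(filler, role):
    """Outer product: one row per filler unit, one column per role unit."""
    return [[f * r for r in role] for f in filler]


def bind(pairs):
    """Sum the filler-role outer products into a single activation pattern."""
    total = None
    for filler, role in pairs:
        product = outer(filler, role)
        if total is None:
            total = product
        else:
            total = [[a + b for a, b in zip(row_a, row_b)]
                     for row_a, row_b in zip(total, product)]
    return total


def unbind(pattern, role):
    """Recover the filler bound to `role` (assumes orthogonal roles)."""
    norm_sq = sum(r * r for r in role)
    return [sum(p * r for p, r in zip(row, role)) / norm_sq for row in pattern]


# Hypothetical distributed vectors for two fillers and two positional roles.
CAT, SAT = [1, 0, 1], [0, 1, 1]
FIRST, SECOND = [1, 1], [1, -1]  # orthogonal role vectors

pattern = bind([(CAT, FIRST), (SAT, SECOND)])
print(pattern)                  # [[1, 1], [1, -1], [2, 0]]
print(unbind(pattern, FIRST))   # [1.0, 0.0, 1.0]
print(unbind(pattern, SECOND))  # [0.0, 1.0, 1.0]
```

Both fillers share the same units in the summed pattern, yet each can be read back exactly because the roles are orthogonal; with merely similar roles the recovered fillers would be blended.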

-- Paul Smolensky

References


Elman, J. L. (1990). Finding structure in time. Cognitive Science 14:179-211.

Elman, J., E. Bates, M. H. Johnson, A. Karmiloff-Smith, D. Parisi, and K. Plunkett. (1996). Rethinking Innateness: A Connectionist Perspective on Development. Cambridge, MA: MIT Press.

Jacobs, R. A., M. I. Jordan, and A. G. Barto. (1991). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science 15:219-250.

Legendre, G., Y. Miyata, and P. Smolensky. (1990). Harmonic grammar: A formal multi-level connectionist theory of linguistic well-formedness: Theoretical foundations. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society. Cambridge, MA, pp. 388-395.

Pinker, S., and J. Mehler. (1988). Connections and Symbols. Cambridge, MA: MIT Press.

Plaut, D., and T. Shallice. (1994). Connectionist Modelling in Cognitive Neuropsychology: A Case Study. Hillsdale, NJ: Erlbaum.

Plaut, D. C., J. L. McClelland, M. S. Seidenberg, and K. Patterson. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review 103:56-115.

Prince, A., and P. Smolensky. (1993). Optimality Theory: Constraint Interaction in Generative Grammar. RuCCS Technical Report 2, Rutgers Center for Cognitive Science, Rutgers University, Piscataway, NJ, and Department of Computer Science, University of Colorado at Boulder.

Rumelhart, D., and J. L. McClelland. (1986). On learning the past tenses of English verbs. In J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2, Psychological and Biological Models. Cambridge, MA: MIT Press, pp. 216-271.

Seidenberg, M. (1997). Language acquisition and use: Learning and applying probabilistic constraints. Science 275:1599-1603.

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist networks. Artificial Intelligence 46:159-216.

Smolensky, P., M. C. Mozer, and D. E. Rumelhart, Eds. (1996). Mathematical Perspectives on Neural Networks. Mahwah, NJ: Erlbaum.

Touretzky, D. S., Ed. (1991). Machine Learning 7(2/3). Special issue on connectionist approaches to language learning.

Further Readings

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition 48:71-99.

Goldsmith, J., Ed. (1993). The Last Phonological Rule: Reflections on Constraints and Derivations. Chicago: University of Chicago Press.

Hare, M., and J. L. Elman. (1994). Learning and morphological change. Cognition 49.

Hinton, G. E. (1991). Connectionist Symbol Processing. Cambridge, MA: MIT Press.

Miikkulainen, R. (1993). Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. Cambridge, MA: MIT Press.

Plunkett, K., and V. Marchman. (1993). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition 48:21-69.

Sharkey, N., Ed. (1992). Connectionist Natural Language Processing. Dordrecht: Kluwer.

Wheeler, D. W., and D. S. Touretzky. (1993). A connectionist implementation of cognitive phonology. In J. Goldsmith, Ed., The Last Phonological Rule: Reflections on Constraints and Derivations. Chicago: University of Chicago Press, pp. 146-172.