Speech Perception

The ability to comprehend spoken language derives from the operation of a highly complex set of perceptual, cognitive, and linguistic processes that permit the listener to recover the meaning of an utterance when listening to speech. The domain of speech perception concerns the earliest stages of this processing, during which the listener maps the time-varying acoustic signal of speech onto a set of discrete linguistic representations. These representations are typically (though not universally) construed in terms of sequences of phonetic segments -- consonants and vowels -- that form the words of the language. For example, the word keep is composed of three phonetic segments: an initial consonant (in phonetic notation, symbolized as /k/), a medial vowel (/i/), and a final consonant (/p/). Each phonetic segment can itself be described in terms of values on a small set of DISTINCTIVE FEATURES that recombine to form the set of segments in any given language. For example, at a featural level, the segment /k/ can be described as a voiceless stop consonant with a velar place of articulation; this segment contrasts minimally with /p/, which is also a voiceless stop consonant but has a labial place of articulation. Within this framework, the central issue in speech perception is how listeners are able to recover the phonetic structure -- the sequences of featurally defined phonetic segments -- when listening to speech, so that they can recognize the individual words that were produced and, ultimately, comprehend the meaning of the spoken utterance.
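
The featural description above can be made concrete with a small sketch. The feature names and values used here (voicing, manner, place) are a deliberately simplified inventory assumed for illustration; they are not a complete feature system, and the function names are hypothetical.

```python
# Simplified, illustrative feature inventory (an assumption for this sketch,
# not a full phonological feature system).
FEATURES = {
    "k": {"voicing": "voiceless", "manner": "stop", "place": "velar"},
    "p": {"voicing": "voiceless", "manner": "stop", "place": "labial"},
    "b": {"voicing": "voiced", "manner": "stop", "place": "labial"},
}


def contrast(seg_a, seg_b):
    """Return the features on which two segments differ, with both values."""
    a, b = FEATURES[seg_a], FEATURES[seg_b]
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}


def is_minimal_contrast(seg_a, seg_b):
    """True when the two segments differ on exactly one feature."""
    return len(contrast(seg_a, seg_b)) == 1


print(contrast("k", "p"))             # differs only in place of articulation
print(is_minimal_contrast("k", "p"))  # one differing feature
print(is_minimal_contrast("k", "b"))  # differs in both voicing and place
```

Representing each segment as a bundle of feature values makes the notion of a minimal contrast a simple count of differing features, which is all the sketch is meant to show.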

Mirroring the interdisciplinary nature of cognitive science itself, the study of speech perception has a long tradition of drawing from many diverse fields, most notably experimental psychology, linguistics, speech and hearing science, acoustics, and engineering. More than five decades of research from these disciplines have yielded a vast amount of information on the nature of the speech signal and the way in which listeners process it to derive the phonetic structure of the utterance.

One of the fundamental discoveries of this research is that there is not a simple one-to-one mapping between the phonetically relevant acoustic properties of speech and the phonetic structure of an utterance (though see Stevens and Blumstein 1981 for an alternative view). Many factors contribute to this complexity in mapping. One of the primary factors, called coarticulation, derives from the fact that when speakers talk, they do not produce the phonetic segments of a given word (such as keep) sequentially, one at a time (e.g., /k/, then /i/, then /p/; Liberman et al. 1967). Rather, phonetic segments are coarticulated, with the articulatory gestures for given segments overlapping in time; for example, the gestures for /i/ and even /p/ are in the process of being implemented during the ARTICULATION of /k/. Coarticulation allows speakers to produce sequences of segments rapidly, but it results in two major complications in the mapping between acoustic signal and phonetic structure. The first complication, called the segmentation problem, is that any given stretch of the acoustic signal contains, in parallel, information for more than one phonetic segment. Thus it is not possible to divide the acoustic signal into discrete "chunks" that correspond to individual phonetic segments. The second complication, called the lack of invariance problem, is that the precise form of a given acoustic property important for specifying a phonetic segment itself changes as a function of phonetic context (i.e., as a function of which segments precede and follow the target segment). So, for example, the form of critical acoustic information for /k/ is different when /k/ is followed by /i/ as in keep compared to when it is followed by /u/ as in cool.

To complicate matters further, many factors other than coarticulation also alter the precise form of the acoustic properties specifying phonetic segments; among the most prominent of these are changes in speaker (Nearey 1989) and speaking rate (Miller 1981). Moreover, given the nature of the articulatory process, it is nearly always the case that phonetic contrasts are specified not by a single property of the acoustic signal but by multiple acoustic properties (Lisker 1986).

Given this considerable (though, importantly, highly systematic) complexity in the mapping between acoustic signal and phonetic structure, the listener must have some means of "unpacking" the highly encoded, context-dependent speech signal. Indeed, there is now considerable evidence that listeners are exquisitely sensitive to such factors as acoustic-phonetic context, speaker, and speaking rate, and that they take into account the acoustic consequences of variation due to these factors when mapping the acoustic signal onto phonetic structure (for review, see Nygaard and Pisoni 1995). Just how this is accomplished, however, remains unknown, and current theoretical approaches are quite diverse (for review, see Remez 1994). One long-standing debate, for example, focuses on whether phonetic perception is accomplished by a modular, specialized, speech-specific mechanism that computes the intended phonetic gestures of the speaker and thereby recovers the phonetic structure of the utterance (Liberman and Mattingly 1985) or whether some form of general perceptual and/or cognitive processing is sufficient to accomplish the mapping from acoustic signal to phonetic structure, even given the complexity involved (Diehl and Kluender 1989; Pastore 1987; see also Fowler 1986).

Research on speech perception has not only revealed that the mapping between acoustic and phonetic levels is complex, but it has also shown that phonetic perception is itself influenced by input from higher-order linguistic levels, most notably information from the LEXICON. A classic example of this lexical influence is the finding that potentially ambiguous phonetic segments are typically identified so as to create real words of the language rather than nonwords (Ganong 1980). For example, a stimulus with acoustic information that is potentially ambiguous for stimulus-initial /b/ versus /p/ will be identified (under certain conditions) as /b/ in the context of -eef and as /p/ in the context of -eace, thus creating the real word beef (as opposed to peef) and the real word peace (as opposed to beace). Results such as these underscore the close tie between the processes underlying phonetic perception and those responsible for SPOKEN WORD RECOGNITION (lexical access). However, although the influence of lexical information on phonetic perception is well established, there is currently considerable controversy over the nature of this influence. One major alternative, in line with autonomous, modular models of perception, is that lexical factors operate independently of the initial acoustic-phonetic analysis to influence the final percept (e.g., Cutler et al. 1987). Another major alternative, in line with interactive approaches, is that lexical information plays a direct role in the initial acoustic-phonetic mapping per se (e.g., McClelland and Elman 1986). As in other domains of cognitive science, providing clear-cut empirical support for modular versus interactive models has proven to be extremely difficult (see Miller and Eimas 1995).
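
The lexical effect on identification described in this section can be illustrated with a toy calculation. The sketch below is not any of the cited models: it assumes a single acoustic cue (voice onset time, in milliseconds), a logistic identification function, and an additive lexical bias term, and every parameter value is arbitrary and chosen only for illustration. The additive term is a convenience and is not meant to take a side in the modular versus interactive debate.

```python
import math


def prob_b(vot_ms, lexical_bias=0.0, boundary_ms=25.0, slope=0.3):
    """Toy probability of a /b/ response for a given voice onset time (VOT).

    Shorter VOT favors /b/ and longer VOT favors /p/. A positive
    lexical_bias favors /b/; a negative one favors /p/. All defaults are
    arbitrary illustrative values, not empirical estimates.
    """
    log_odds = slope * (boundary_ms - vot_ms) + lexical_bias
    return 1.0 / (1.0 + math.exp(-log_odds))


ambiguous_vot = 25.0  # sits exactly on the assumed category boundary

neutral = prob_b(ambiguous_vot)
in_eef = prob_b(ambiguous_vot, lexical_bias=1.0)    # /b/ makes the word "beef"
in_eace = prob_b(ambiguous_vot, lexical_bias=-1.0)  # /p/ makes the word "peace"

# A clear /b/ (very short VOT) is barely affected by the same bias.
clear_b_shift = prob_b(0.0, lexical_bias=1.0) - prob_b(0.0, lexical_bias=-1.0)

print(neutral, in_eef, in_eace, clear_b_shift)
```

In this sketch the same ambiguous stimulus is pushed toward /b/ or /p/ depending on which response yields a word, while an acoustically unambiguous stimulus is left essentially unchanged, which mirrors the qualitative pattern described above.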

Finally, yet another major finding of research on speech perception is that the ability to map the acoustic signal onto linguistic representations has its origins in the perceptual processing of early infancy (see PHONOLOGY, ACQUISITION OF). It is now known that infants come to the task of speech perception with highly sophisticated abilities to process the speech signal (for review, see Jusczyk 1995). These include the ability to distinguish nearly all (if not all) of the phonetic contrasts used in the world's languages and the ability to categorize variants of speech sounds in a linguistically relevant manner (Eimas et al. 1971). For example, young infants will spontaneously group together instances of a given vowel (e.g., /i/) that are produced by different speakers and hence are quite distinct acoustically (Kuhl 1979). These initial abilities of the infant to perceive speech become tuned in accord with the sound structure of the native language over the course of development, such that infants gradually change from "language-general" to "language-specific" perceivers of speech. This attunement process begins very early -- for example, within days of birth, infants show a preference for their native language (Mehler et al. 1988). It continues to unfold in a complex manner over the course of development, with major changes occurring during the first year of life (Best 1994; Jusczyk 1993; Kuhl 1993; Werker and Pegg 1992). Understanding the nature of the earliest abilities of infants to perceive speech, the way in which these abilities become tuned in the course of learning a particular language, and the role of this attunement process in the overall course of LANGUAGE ACQUISITION remains a major challenge in the study of spoken-language processing.

Additional links

UCSC - Perceptual Science Laboratory

-- Joanne L. Miller

References

Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In J. C. Goodman and H. C. Nusbaum, Eds., The Development of Speech Perception: The Transition from Speech Sounds to Spoken Words. Cambridge, MA: MIT Press, pp. 167-224.

Cutler, A., J. Mehler, D. Norris, and J. Segui. (1987). Phoneme identification and the lexicon. Cognitive Psychology 19:141-177.

Diehl, R. L., and K. R. Kluender. (1989). On the objects of speech perception. Ecological Psychology 1:121-144.

Eimas, P. D., E. R. Siqueland, P. Jusczyk, and J. Vigorito. (1971). Speech perception in infants. Science 171:303-306.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics 14:3-28.

Ganong, W. F., III. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance 6:110-125.

Jusczyk, P. W. (1993). From general to language-specific capacities: The WRAPSA model of how speech perception develops. Journal of Phonetics 21:3-28.

Jusczyk, P. W. (1995). Language acquisition: Speech sounds and the beginning of phonology. In J. L. Miller and P. D. Eimas, Eds., Speech, Language, and Communication. San Diego, CA: Academic Press, pp. 263-301.

Kuhl, P. K. (1979). Speech perception in early infancy: Perceptual constancy for spectrally dissimilar vowel categories. Journal of the Acoustical Society of America 66:1668-1679.

Kuhl, P. K. (1993). Innate predispositions and the effects of experience in speech perception: The native language magnet theory. In B. de Boysson-Bardies, S. de Schonen, P. Jusczyk, P. MacNeilage, and J. Morton, Eds., Developmental Neurocognition: Speech and Face Processing in the First Year of Life. Dordrecht: Kluwer, pp. 259-274.

Liberman, A. M., F. S. Cooper, D. P. Shankweiler, and M. Studdert-Kennedy. (1967). Perception of the speech code. Psychological Review 74:431-461.

Liberman, A. M., and I. G. Mattingly. (1985). The motor theory of speech perception revised. Cognition 21:1-36.

Lisker, L. (1986). "Voicing" in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech 29:3-11.

McClelland, J. L., and J. L. Elman. (1986). The TRACE model of speech perception. Cognitive Psychology 18:1-86.

Mehler, J., P. Jusczyk, G. Lambertz, N. Halsted, J. Bertoncini, and C. Amiel-Tison. (1988). A precursor of language acquisition in young infants. Cognition 29:143-178.

Miller, J. L. (1981). Effects of speaking rate on segmental distinctions. In P. D. Eimas and J. L. Miller, Eds., Perspectives on the Study of Speech. Hillsdale, NJ: Erlbaum, pp. 39-74.

Miller, J. L., and P. D. Eimas. (1995). Speech perception: From signal to word. Annual Review of Psychology 46:467-492.

Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America 85:2088-2113.

Nygaard, L. C., and D. B. Pisoni. (1995). Speech perception: New directions in research and theory. In J. L. Miller and P. D. Eimas, Eds., Speech, Language, and Communication. San Diego, CA: Academic Press, pp. 63-96.

Pastore, R. E. (1987). Categorical perception: Some psychophysical models. In S. Harnad, Ed., Categorical Perception: The Groundwork of Cognition. Cambridge: Cambridge University Press, pp. 29-52.

Remez, R. E. (1994). A guide to research on the perception of speech. In M. A. Gernsbacher, Ed., Handbook of Psycholinguistics. San Diego, CA: Academic Press, pp. 145-172.

Stevens, K. N., and S. E. Blumstein. (1981). The search for invariant acoustic correlates of phonetic features. In P. D. Eimas and J. L. Miller, Eds., Perspectives on the Study of Speech. Hillsdale, NJ: Erlbaum, pp. 1-38.

Werker, J. F., and J. E. Pegg. (1992). Infant speech perception and phonological acquisition. In C. A. Ferguson, L. Menn, and C. Stoel-Gammon, Eds., Phonological Development: Models, Research, Implications. Timonium, MD: York Press, pp. 285-311.
