Prosody and Intonation

The term prosody refers to the grouping and relative prominence of the elements making up the speech signal. One reflex of prosody is the perceived rhythm of the speech. Prosodic structure may be described formally by a hierarchical structure in which the smallest units are the internal components of the syllable and the largest is the intonation phrase. Units of intermediate scale include the syllable, the metrical foot, and the prosodic word (Selkirk 1984; Hayes 1995).

Intonation refers to phrase-level characteristics of the melody of the voice. Intonation is used by speakers to mark the pragmatic force of the information in an utterance. The alignment of the intonation contour with the words is constrained by the prosody, with intonational events falling on the most prominent elements of the prosodic structure and at the edges. As a result, intonational events can often provide information to the listener about the prosodic structure, in addition to carrying a pragmatic message. The term intonation is often used, by extension, to refer to systematic characteristics of the voice melody at larger scales, such as the discourse segment or the paragraph (Beckman and Pierrehumbert 1986; Pierrehumbert and Hirschberg 1990; Ladd 1996).

The primary phonetic correlate of intonation is the fundamental frequency of the voice (F0), which is perceived as pitch and which arises from the rate of vibration of the vocal folds. The F0 is determined by the configuration of the larynx, the subglottal pressure, and the degree of oral closure (Clark and Yallop 1990; Titze 1994). Articulatory maneuvers that change the rate of vibration of the vocal folds also affect the exact shape of the glottal waveform and hence the voice timbre (or voice quality). Perceived voice quality is probably used in perception to assist in the identification of intonation patterns (Pierrehumbert 1997). Intonation is not the only source of F0 variation. Speech segments also have systematic effects on F0. However, the largest segmental effects are on the time-frequency scale of the smaller intonational effects. Thus, F0 contours can be roughly viewed as a superposition of segmental factors on the intonationally determined contour.
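The superposition view can be illustrated with a toy calculation. The sketch below is only an illustration under assumed values: the function name `superpose`, the frame-by-frame representation, and every number in it are invented for the example and are not measurements from any study.

```python
def superpose(intonation_hz, segmental_hz):
    """Add frame-by-frame segmental perturbations to an intonational contour."""
    if len(intonation_hz) != len(segmental_hz):
        raise ValueError("both contours must have the same number of frames")
    return [base + delta for base, delta in zip(intonation_hz, segmental_hz)]


# Invented values: a gently falling intonational contour over six frames (Hz).
intonation = [180.0, 176.0, 172.0, 168.0, 164.0, 160.0]
# Invented values: small local segmental perturbations (zero where none applies).
segmental = [0.0, 6.0, 2.0, 0.0, -4.0, 0.0]

f0 = superpose(intonation, segmental)
```

The additive model is a deliberate simplification of "roughly viewed as a superposition"; it says nothing about how either component would be estimated from real speech.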

Many experimental studies show that prosody affects all aspects of the speech signal (see Papers in Laboratory Phonology and references cited there). In general, elements found in prosodically prominent positions are more forcefully and fully articulated than elements in prosodically weak positions. The space of acoustic contrasts is therefore expanded in strong positions compared to that in weak positions. Edges of prosodic units also affect phonetic outcomes. Consonantal articulations tend to be strengthened at initial edges of prosodic words and intonation phrases. Final syllables of words and intonation phrases are regularly lengthened. An extensive literature on isochrony addresses the possibility that speech has a steady beat with a constant interval between the stresses. This literature has established that interstress intervals in fact vary widely as a function of the material comprising the interval. However, when the principal determinants of duration are controlled for, evidence of a tendency towards isochrony is reported in some studies.

Contextual effects related to prosody are substantial and rank with speech style and speaker characteristics as sources of variation in the realization of phonemes and DISTINCTIVE FEATURES. The variation is great enough that a token of one phoneme in one prosodic position can be identical to a token of a different phoneme in some other prosodic position. For example, in American English, a phrase-final /z/ is virtually identical to a medial /s/. Similarly, a 20-story building in Evanston, Illinois, provides an example of a "tall building," but it would be an example of a "short building" if it were in downtown Chicago. That is, the context-dependence of the phonetic realizations of phonemes is similar in character to context-dependence found in other domains, and it provides an example of the abstractness and adaptability of human cognition.

Because of intense research activity over the last two decades, the phonological theory of prosody and intonation is now well developed. It characterizes the cognitive structures that must be viewed as implicitly present in the minds of speakers in order to explain their use of prosody and intonation in both speech production and SPEECH PERCEPTION.

The central concepts of prosodic theory are the prosodic units (the syllable, the foot, the intonation phrase, etc.) and the relations defined among these units. The units are temporally ordered. Bigger units dominate smaller ones. Within each unit, a relationship of strength is available that singles out one element as more prominent than the other elements of the same type in the group. Strength is inherited through the hierarchy; the head syllable of an intonation phrase can be defined as the strongest syllable in the strongest foot in the strongest word in that phrase. Although it is generally agreed that prosodic structures are hierarchical, they contrast with syntactic structures in making much less use of recursion. In SYNTAX, we find clauses embedded within other clauses, but in PHONOLOGY, we do not find syllables embedded within other syllables. The only serious candidate for a recursive node in prosody is the prosodic word, and scholars do not agree about whether this node is recursive or not. As a consequence of the relative flatness of the prosodic structure, syntactic structures are flattened when the prosodic phrasing is computed. For example, in sentence (1), a recursive syntactic structure corresponds to a prosodic structure in which three intonation phrases are on a par with each other.

(1) This is the cat % that ate the rat % that stole the cheese.
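The way strength is inherited through the hierarchy can be sketched as a small tree traversal. This is a minimal sketch, not an established formalism: the `Unit` class, the `strong` flag, and the footing and syllabification of "tall building" are simplifying assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    """A node in a toy prosodic tree (hypothetical representation)."""
    level: str                      # e.g. "phrase", "word", "foot", "syllable"
    label: str = ""
    strong: bool = False            # True if this unit is the head of its parent
    children: List["Unit"] = field(default_factory=list)


def head_syllable(unit: Unit) -> Unit:
    """Follow the single strong child at each level down to a syllable."""
    while unit.children:
        heads = [child for child in unit.children if child.strong]
        if len(heads) != 1:
            raise ValueError(
                f"{unit.level} {unit.label!r} needs exactly one strong child"
            )
        unit = heads[0]
    return unit


# Assumed analysis of "tall building" as one intonation phrase whose
# strongest word is "building" (illustrative only).
tall = Unit("word", "tall", children=[
    Unit("foot", "tall", strong=True, children=[
        Unit("syllable", "tall", strong=True),
    ]),
])
building = Unit("word", "building", strong=True, children=[
    Unit("foot", "building", strong=True, children=[
        Unit("syllable", "buil", strong=True),
        Unit("syllable", "ding"),
    ]),
])
phrase = Unit("phrase", "tall building", children=[tall, building])

head = head_syllable(phrase)   # the syllable labelled "buil"
```

Because the tree has a fixed sequence of levels and no unit contains another unit of its own level, the traversal is a plain loop; nothing here needs the recursion that a syntactic tree would.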

The intonation system of English has been extensively studied. Points of agreement among many researchers in the English-speaking countries have recently been codified in the ToBI transcription standard, for which on-line training materials are available (Pitrelli, Beckman, and Hirschberg 1994; Beckman and Ayers Elam 1994/1997). According to this standard, intonation contours may be "spelled" using three basic tonal elements: low tone (L), high tone (H), and downstepped high tone (!H). !H represents the combination of high tone with a compression and lowering of the pitch range; a sequence of !Hs generates an F0 contour shaped like a descending staircase. Pitch accents, which mark prominently stressed syllables in the phrase, are made up of these elements. The nuclear accent is defined as the accent on the main stress of the entire phrase. The prenuclear accents fall on prominent syllables earlier in the phrase. Every complete utterance must have at least one nuclear accent, but some utterances lack prenuclear accents. In addition to the pitch accents, each contour has boundary tones, which mark the edges of the intonation phrase.
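The descending staircase produced by a sequence of !H tones can be illustrated numerically. The sketch below assumes one simple way of modeling "compression and lowering of the pitch range": each !H shrinks the distance between the high line and a fixed low reference by a constant ratio. The function name, the reference values in Hz, and the ratio of 0.8 are all hypothetical choices for the example and are not part of the ToBI standard.

```python
def downstep_targets(tones, high=200.0, low=120.0, ratio=0.8):
    """Return one F0 target in Hz for each tone label in `tones`.

    Assumed toy model: the pitch range is the span between `low` and the
    current high line; each "!H" multiplies that span by `ratio` before
    its target is computed, and the compression persists afterwards.
    """
    targets = []
    span = high - low
    for tone in tones:
        if tone == "L":
            targets.append(low)
        elif tone == "H":
            targets.append(low + span)
        elif tone == "!H":
            span *= ratio              # compress and lower the range
            targets.append(low + span)
        else:
            raise ValueError(f"unknown tone label: {tone!r}")
    return targets


staircase = downstep_targets(["H", "!H", "!H", "!H"])
```

Under these assumptions each successive target is lower than the one before it, which is the staircase shape described above; real contours also involve interpolation between targets and the timing of each tone, which this sketch leaves out.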

All languages have prosody and intonation, but there are many important differences among the systems found in various languages. They differ in the total inventory of intonational patterns and in the pragmatic meanings assigned to particular patterns. Languages with lexical tone (whether tone languages such as Mandarin or classic pitch accent languages such as Japanese) tend to have somewhat simpler intonational systems than English, presumably because much of the F0 contour is taken up with providing phonetic expression of the tones in the words (see Pierrehumbert and Beckman 1988; Hayes and Lahiri 1991; Hayes 1995; Myers 1996).

In the prosodic domain, languages differ in the constraints they impose on the composition of the various units. At the phrasal level, they differ in how they set up the correspondence between intonational phrases and syntactic and semantic structures. Some languages tend to locate prosodic breaks after a syntactic head, whereas others tend to locate breaks before. Some languages (such as English) permit the main prominence to be located anywhere in the phrase (for the purpose of highlighting or foregrounding particular words). Other languages make little or no use of variable placement of prominence within the phrase, instead moving new information to fixed prosodically prominent positions. Turning to smaller prosodic units, some languages permit syllables with complicated consonant clusters and others do not (Goldsmith 1990). Languages also differ in foot structure (Hayes 1995) and in the salience or importance of the different prosodic units (Beckman 1995). For example, in English the foot structure conspicuously shapes the lexical inventory and greatly affects how phonemes are pronounced. Foot structure exists in Japanese but smaller units (the syllable and the mora) vary much less with position in the foot and, as a result, exhibit a robustness that they lack in English.

In considering the contribution of prosody to interpretation, it is useful to separate prosodic structure within the word from prosodic structure above the word level. Prosodic structure within the word (i.e., syllable and foot structure) is an important factor in lexical access, shaping the segmentation strategy in each language and the set of active competitors for any given word at any given time (Cutler 1995). Prosodic structure above the word level (phrasing and phrasal prominence) reflects syntax, SEMANTICS, and DISCOURSE structure. As a result, it has repercussions for syntactic parsing, for the scope of operators such as "only," "even," and "not," for the understood reference of pronouns, and for the topic/comment structure of the discourse (Jackendoff 1972; Terken and Nooteboom 1987; Hirschberg and Ward 1991).

Intonation contours function as independent pragmatic morphemes. According to Pierrehumbert and Hirschberg (1990), the contour indicates the relationship of each utterance to the mutual beliefs that are developed and modified in the course of a conversation. For example, an H accent marks an intended addition to the mutual beliefs, whereas L accents flag information as salient but not to be added. The tremendous variety of understood meanings of patterns in context arises from the interplay of these factors with the goals and assumptions of the interlocutors. (For other treatments of the pragmatic meaning of intonational morphemes, see Gussenhoven 1983, Ward and Hirschberg 1985, and Morel 1995.)

Intonation and prosody are obligatory. Every single utterance has a prosodic analysis and represents some choice of intonation pattern, just as it represents some choice of phonemes and syllables. In experimental studies with aural stimuli, it is not possible to avoid or omit the contribution of intonation by using a monotone F0 contour. Similarly, experiments on words "in isolation" are in fact using words which are phrase-initial, phrase-final, and under main stress in the phrase (if the stimuli are well formed), because linguistic structure requires that every utterance no matter how short be a full intonation phrase. As a result, words produced "in isolation" also carry a complete phrasal melody. Results of experiments on words in isolation often show artifacts of this prosodic positioning and fail to generalize to words in running speech, which most often constitute only a part of a full intonation phrase.

The outcomes of experiments on syntactic processing, scope, and reference resolution in running speech are likely to be affected by the phrasal prosody of the stimuli. It is therefore desirable to control for this factor and to use an established transcriptional standard to report the prosody of the stimuli actually used. Orthogonal variation of the word string and the prosodic pattern may be used to factor out the prosodic and nonprosodic factors in the domain under investigation.
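Orthogonal variation of word string and prosodic pattern amounts to a fully crossed stimulus design. The following is a minimal sketch of how such a stimulus list might be generated; the function name, the dictionary keys, the example sentences, and the two ToBI-style contour labels are illustrative assumptions rather than materials from any actual experiment.

```python
from itertools import product


def crossed_design(word_strings, prosodic_patterns):
    """Pair every word string with every prosodic pattern (fully crossed)."""
    return [
        {"words": words, "prosody": pattern}
        for words, pattern in product(word_strings, prosodic_patterns)
    ]


# Hypothetical materials: two word strings and two transcribed contours.
word_strings = ["Anna came with Manny", "Manny came with Anna"]
prosodic_patterns = ["H* L-L%", "L* H-H%"]

stimuli = crossed_design(word_strings, prosodic_patterns)
```

Because each word string occurs with each contour, an effect that follows the contour regardless of the words can be attributed to prosody, and an effect that follows the words regardless of the contour cannot.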

Experimental work on intonational meaning is challenging because the meanings by their very nature are highly variable with context. Judgments of intonational meaning obtained for materials out of context are variable and difficult to interpret because they are affected by the subjects' uncontrolled imaginings of what the context might be. However, very good results have been achieved with experimental studies in which subjects evaluate the felicity of particular patterns for specified discourse contexts or the understood force of patterns as they are presented in context. With a careful eye to the discourse context, experimental work on intonational meaning is one of the more feasible and promising areas for experimental work in PRAGMATICS.

-- Janet Pierrehumbert

References


Beckman, M. E. (1995). On blending and the mora. Papers in Laboratory Phonology 4:157-167.

Beckman, M. E., and G. Ayers Elam. (1994/1997). Guide to ToBI Labelling. Electronic text and accompanying audio example files available at

Beckman, M. E., and J. Pierrehumbert. (1986). Intonational structure in Japanese and English. Phonology Yearbook 3:15-70.

Clark, J. E., and C. Yallop. (1990). An Introduction to Phonetics and Phonology. Oxford: Blackwell.

Cutler, A. (1995). Spoken word recognition and production. In J. Miller and P. Eimas, Eds., Speech, Language, and Communication. New York: Academic Press, pp. 97-136.

Goldsmith, J. (1990). Autosegmental and Metrical Phonology. Oxford: Blackwell.

Gussenhoven, C. (1983). On the Grammar and Semantics of Sentence Accents. Publications in the Linguistics Sciences 16. Dordrecht: Foris.

Hayes, B. (1995). Metrical Stress Theory: Principles and Case Studies. Chicago: University of Chicago Press.

Hayes, B., and A. Lahiri. (1991). Bengali intonational phonology. Natural Language and Linguistic Theory 9:47-96.

Hirschberg, J., and G. Ward. (1991). Accent and bound anaphora. Cognitive Linguistics 2:101-121.

Jackendoff, R. (1972). Semantic Interpretation in Generative Grammar. Cambridge, MA: MIT Press.

Ladd, D. R. (1996). Intonational Phonology. Cambridge: Cambridge University Press.

Morel, M.-A. (1995). Valeur énonciative des variations de hauteur mélodique en français. French Language Studies 5:189-202. Cambridge: Cambridge University Press.

Myers, S. (1996). Boundary tones and the phonetic implementation of tone in Chichewa. Studies in African Linguistics 25:29-60.

Papers in Laboratory Phonology. Cambridge: Cambridge University Press. Vol. 1, (1990). J. Kingston and M. E. Beckman, Eds.; Vol. 2, (1992). G. Docherty and D. R. Ladd, Eds.; Vol. 3, (1994). P. Keating, Ed.; Vol. 4, (1995). B. Connell and A. Arvaniti, Eds.; Vol. 5, Forthcoming, Broe and J. Pierrehumbert, Eds.; Vol. 6, Forthcoming, Ogden and Local, Eds.

Pierrehumbert, J. (1997). Consequences of intonation for the voice source. In S. Kiritani, H. Hirose, and H. Fujisaki, Eds., Speech Production and Language, Speech Research 13. Berlin: Mouton, pp. 111-131.

Pierrehumbert, J., and M. E. Beckman. (1988). Japanese Tone Structure. Cambridge, MA: MIT Press.

Pierrehumbert, J., and J. Hirschberg. (1990). The meaning of intonation contours in the interpretation of discourse. In P. Cohen, J. Morgan, and M. Pollack, Eds., Plans and Intentions in Communication. Cambridge, MA: MIT Press, pp. 271-312.

Pitrelli, J. F., M. E. Beckman, and J. Hirschberg. (1994). Evaluation of prosodic transcription labeling reliability in the ToBI framework. Proceedings of the International Conference on Spoken Language Processing, Yokohama, Japan.

Selkirk, E. O. (1984). Phonology and Syntax. Cambridge, MA: MIT Press.

Terken, J. M. B., and S. G. Nooteboom. (1987). Opposite effects of accentuation and deaccentuation on verification latencies for given and new information. Language and Cognitive Processes 2:145-163.

Titze, I. (1994). Principles of Voice Production. Englewood Cliffs, NJ: Prentice-Hall.

Ward, G., and J. Hirschberg. (1985). Implicating uncertainty: The pragmatics of fall-rise. Language 61:747-776.

Further Readings

Bird, S. (1995). Computational Phonology: A Constraint-Based Approach. Cambridge: Cambridge University Press.

Grice, M., and R. Benzmueller. (1997). Transcribing German intonation with GToBI.

Horne, M., Ed. (1998). Prosody: Theory and Experiment. Studies Presented to Gösta Bruce. Dordrecht: Kluwer.

Ladd, D. R. (1980). The Structure of Intonational Meaning. Bloomington, IN: Indiana University Press.

Levelt, W. J. M. (1989). Speaking: From Intention to Articulation. Cambridge, MA: MIT Press.

Pierrehumbert, J., and S. Steele. (1990). Categories of tonal alignment in English. Phonetica 46:181-196.

Venditti, J. (1995). Japanese ToBI Labelling Guidelines. Also in K. Ainsworth-Darnell and M. D'Imperio, Eds., Ohio State Working Papers in Linguistics. 50:127-162.