Speech Synthesis

The history of "speaking machines" goes back at least to the work of Wolfgang von Kempelen in 1791, but until the advent of the digital computer all such devices required a human operator to "play" them, rather like a musical instrument. Perhaps the best known machine of this sort was Homer Dudley's VODER, which was demonstrated at the 1939 World's Fair.

Modern speech synthesis programs, of course, can produce speechlike output from symbolic input, with no further intervention. When the symbolic input to such a program is ordinary text, the program is often called a text-to-speech (TTS) system. Most TTS systems can be viewed as having two fairly distinct halves: a first stage that analyzes the text and transforms it into some form of annotated phonetic transcription, and a second stage, often thought of as synthesis proper, that produces a sound wave from the phonetic transcription.
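The two-stage organization can be sketched in code. This is only a skeleton, with invented function names and a fake transcription; a real system's stages are of course far more elaborate.

```python
# Sketch of the two-stage TTS organization described above.
# All names and the toy transcription are invented for illustration.

def text_analysis(text):
    """First stage: turn raw text into an annotated phonetic transcription."""
    # A real system would tokenize, look up pronunciations, and add
    # prosodic annotations; here we return a fixed toy transcription
    # of four phones with a pitch accent and a phrase-final mark.
    return [("hh", {}), ("ax", {"accent": True}),
            ("l", {}), ("ow", {"phrase_final": True})]

def synthesis_proper(transcription):
    """Second stage: turn the annotated transcription into a waveform."""
    # A real synthesizer would compute acoustic parameters and render
    # sound; here we return a placeholder of 80 silent samples per phone.
    return [0.0] * (80 * len(transcription))

def text_to_speech(text):
    return synthesis_proper(text_analysis(text))

samples = text_to_speech("hello")
print(len(samples))  # 320 samples for the four-phone toy transcription
```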

Programs that generate their own sentences, such as automated information systems, can produce synthesizer input directly and avoid the difficulties of textual analysis. There is, at the beginning of 1998, no standard format for synthesizer input, and most systems have their own ad hoc notations. There is a move toward the development of standardized speech markup languages, on the model of text markup languages like LaTeX and HTML, but considerable work remains to be done.

Text analysis in TTS systems serves two primary purposes: (1) specifying the pronunciations of individual words and (2) gathering information to guide phrasing and placement of pitch accents (see PROSODY AND INTONATION).

Word pronunciations can be looked up in dictionaries, generated by spelling-to-sound rules, or produced through a combination of the two. The feasibility of relying on spelling-to-sound rules varies from language to language. Any language will need at least a small dictionary of exceptions. English spelling is sufficiently problematic that current practice is to have a dictionary with tens of thousands -- or even hundreds of thousands -- of entries, and to use rules only for words that do not occur in the dictionary and cannot be formed by regular morphological processes from words that do occur. Systems vary in the extent to which they use morphology. Some systems attempt to store all forms of all words that they may be called on to pronounce. The MITalk system had a dictionary of orthographic word fragments called "morphs" and applied rules to specify the ways in which their pronunciations were affected when they were combined into words.
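The dictionary-plus-rules strategy can be illustrated with a minimal sketch. The lexicon entries, the phone notation, and the letter-to-sound rules below are all invented for illustration; real rule sets are context-sensitive and far larger.

```python
# Sketch of word pronunciation: dictionary lookup first, spelling-to-sound
# rules only for out-of-dictionary words. Entries and rules are toy examples.

LEXICON = {
    "speech": "s p iy ch",
    "read": "r iy d",
}

# Toy context-free letter-to-sound rules, tried in order (longest first).
RULES = [("ch", "ch"), ("ee", "iy"), ("sh", "sh"),
         ("a", "ae"), ("e", "eh"), ("i", "ih"), ("o", "aa"), ("u", "ah")]

def pronounce(word):
    word = word.lower()
    if word in LEXICON:                      # dictionary lookup
        return LEXICON[word]
    phones, i = [], 0                        # fall back to rules
    while i < len(word):
        for spelling, phone in RULES:
            if word.startswith(spelling, i):
                phones.append(phone)
                i += len(spelling)
                break
        else:
            phones.append(word[i])           # default: letter stands for itself
            i += 1
    return " ".join(phones)

print(pronounce("speech"))  # from the dictionary: s p iy ch
print(pronounce("blip"))    # via the toy rules: b l ih p
```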

The parsing and morphological analysis (see NATURAL LANGUAGE PROCESSING and MORPHOLOGY) techniques used in text processing for text-to-speech are similar to those used elsewhere in computational linguistics. One reason for parsing text in text-to-speech is that the part of speech assignment performed in the course of parsing can disambiguate homographs -- forms like the verb to lead and the noun lead, or the present and past tenses of the verb to read, which are spelled the same but pronounced differently. The other main reason is that default rules for the placement of pitch accents and phrase boundaries can be formulated on the basis of syntax. Such rules allow markers to be placed in the annotated phonetic output of the text analysis stage that instruct the synthesis component to vary vocal pitch and to introduce correlates of phrasing, such as pauses and lengthening of sounds at the ends of phrases. These default rules tend to yield the rather unnatural and mechanical effect generally associated with synthetic speech, and improving the quality of synthetic prosody is one of the major items on the research agenda for speech synthesis.
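Homograph disambiguation by part of speech can be sketched as a simple table lookup, once a tag is available. The tags and phone notation below are invented; real systems obtain the tags from parsing or statistical tagging.

```python
# Sketch of homograph disambiguation via part-of-speech tags.
# (spelling, tag) -> pronunciation; tags and phones are toy notation.

HOMOGRAPHS = {
    ("lead", "VERB"):    "l iy d",   # to lead
    ("lead", "NOUN"):    "l eh d",   # the metal
    ("read", "PRESENT"): "r iy d",
    ("read", "PAST"):    "r eh d",
}

def pronounce_tagged(word, tag):
    """Pick the pronunciation selected by the part-of-speech tag."""
    return HOMOGRAPHS.get((word.lower(), tag), word)

print(pronounce_tagged("lead", "NOUN"))  # l eh d
print(pronounce_tagged("lead", "VERB"))  # l iy d
```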

Synthesis proper can itself be broken into two stages, the first of which produces a numerical/physical description of a sound wave, and the second of which converts the description to sound. In some cases, the sound is stored in the computer as a digitized waveform, to be played out through a general-purpose digital-to-analog converter, whereas in other cases, the numerical/physical description is fed to special-purpose hardware, which plays the sound directly without storing a waveform.

Synthesizers can be distinguished in two primary ways according to the nature of the numerical/physical description they employ, and the manner in which they construct it. These distinctions are acoustic vs. articulatory and stored unit vs. target interpolation. The two distinctions are largely independent and any of the four possibilities they offer is in principle possible, but in practice articulatory synthesizers use target interpolation.

The acoustic vs. articulatory distinction depends on whether the numerical/physical descriptions describe sounds or vocal tract shapes. In the acoustic case, converting the description to sound essentially means creating a sound that fits the description, and there are usually efficient algorithms to do the job. In the articulatory case, conversion involves simulating the propagation of sound through a vocal tract of the given shape, which requires a great deal more computation. Articulatory synthesis remains a research activity and is not used in practical applications. Formant synthesis, linear prediction synthesis, sinewave synthesis, and waveform concatenation are common forms of acoustic synthesis. The acoustic vs. articulatory distinction is to some extent blurred by systems such as YorkTalk that arrive at a (formant-based) acoustic description by way of an intermediate articulatory feature level.

In target interpolation, descriptions of complete utterances are built up by establishing target values to be achieved during phonetic units (see PHONETICS) and then, as the term suggests, interpolating between the targets. The first synthesizers capable of producing intelligible renditions of a wide range of sentences, such as the JSRU synthesizer of John Holmes, Ignatius Mattingly, and John Shearme, and KlattTalk, were of this type, with formant values as targets.
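A minimal sketch of target interpolation follows: each phonetic unit contributes a parameter target, and the synthesizer fills in a frame-by-frame track by interpolating linearly between successive targets. The formant values and frame counts are invented for illustration; real synthesizers use more elaborate transition shapes.

```python
# Sketch of target interpolation over formant targets.
# Target values (Hz) and frames_per_segment are invented for illustration.

def interpolate_targets(targets, frames_per_segment=5):
    """Linearly interpolate a parameter track between successive targets."""
    track = []
    for a, b in zip(targets, targets[1:]):
        for i in range(frames_per_segment):
            t = i / frames_per_segment      # fraction of the way to the next target
            track.append(a + t * (b - a))
    track.append(targets[-1])               # end exactly on the final target
    return track

# First-formant (F1) targets for a hypothetical three-vowel sequence.
f1_track = interpolate_targets([300.0, 700.0, 500.0])
print(f1_track[0], f1_track[5], f1_track[-1])  # 300.0 700.0 500.0
```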

The transitions between relatively stable regions of sound in natural speech are in fact very complex and difficult to model through interpolation. An alternative is to store the descriptions of whole stretches of speech, including the difficult transitions. This is the basis of stored unit synthesis. One popular unit is the diphone, which is essentially the transition between one stable region and the next. Many of the present generation of good quality commercial synthesizers use diphones. Other units in current use are demisyllables, syllables plus affixes, and units of varying length chosen on the spot from a large speech database to build a particular utterance.
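The bookkeeping step of diphone synthesis, covering a phone string with overlapping units, can be sketched as follows. The unit names here merely index what would be stored stretches of recorded speech; the phone labels and the silence padding convention are assumptions for the illustration.

```python
# Sketch of stored-unit (diphone) coverage: each unit spans from the middle
# of one phone to the middle of the next, so a phone string of length n
# (padded with silence at both edges) needs n + 1 diphones.

def phones_to_diphones(phones):
    """Cover a phone sequence with diphone unit names, silence at the edges."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

units = phones_to_diphones(["h", "ax", "l", "ow"])
print(units)  # ['sil-h', 'h-ax', 'ax-l', 'l-ow', 'ow-sil']
```

In a full synthesizer each of these names would select a recorded fragment, and the fragments would be concatenated (with smoothing and prosodic modification) to form the utterance.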

Nearly all systems adopt the simplifying assumption that the aspects of speech dependent on the activity of the larynx -- vocal pitch and much of what is often considered voice quality -- can be modeled independently from the aspects dependent on mouth shape, which determine which phonemes, and hence which words, are produced. In articulatory synthesis, this comes naturally in the form of separate modeling of separate articulators. In acoustic synthesis it is often done using Gunnar Fant's source-filter model of speech production, where the speech is modeled as the result of passing a sound corresponding to the larynx contribution through a filter corresponding to the mouth shape. Formant synthesis and linear prediction are based on the source-filter model, but the more recently developed PSOLA and sinewave methods are not.
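The source-filter separation can be illustrated in miniature: a periodic "larynx" source is passed through a resonant "mouth" filter. The sketch below uses a single two-pole resonator standing in for one formant; the sample rate, fundamental frequency, and formant parameters are invented for illustration, and a real formant synthesizer would cascade or parallel several such resonators.

```python
import math

# Sketch of the source-filter idea: excitation (source) * resonance (filter).
# All parameter values here are invented for illustration.

SR = 8000  # sample rate in Hz

def impulse_train(f0, n_samples, sr=SR):
    """Periodic glottal-like excitation at fundamental frequency f0 (Hz)."""
    period = int(sr / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, freq, bandwidth, sr=SR):
    """Pass the source through one formant resonance (a two-pole filter)."""
    r = math.exp(-math.pi * bandwidth / sr)          # pole radius from bandwidth
    a1 = 2 * r * math.cos(2 * math.pi * freq / sr)   # feedback coefficients
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

source = impulse_train(f0=100, n_samples=800)        # 0.1 s of excitation
speech = resonator(source, freq=500, bandwidth=80)   # one 500 Hz formant
print(len(speech))  # 800
```

Changing the source (f0, pulse shape) alters pitch and voice quality while the same filter settings keep the "vowel" constant, which is exactly the independence assumption described above.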


-- Stephen Isard


Allen, J., S. Hunnicutt, and D. H. Klatt. (1987). From Text to Speech: The MITalk System. Cambridge: Cambridge University Press.

Bailly, G., and C. Benoit, Eds. (1992). Talking Machines: Theories, Models and Designs. Amsterdam: Elsevier Science/North-Holland.

Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis. Dordrecht: Kluwer.

Holmes, J. N., I. G. Mattingly, and J. N. Shearme. (1964). Speech synthesis by rule. Language and Speech 7:127-143.

Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America 67:971-995.

Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82:737-793.

Linggard, R. (1985). Electronic Synthesis of Speech. Cambridge: Cambridge University Press.

O'Shaughnessy, D. (1987). Speech Communication. Reading, MA: Addison-Wesley.

Sproat, R., Ed. (1998). Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Dordrecht: Kluwer.

van Santen, J. P. H., R. W. Sproat, J. P. Olive, and J. Hirschberg. (1996). Progress in Speech Synthesis. New York: Springer.

Witten, I. H. (1982). Principles of Computer Speech. New York: Academic Press.