Many speaking toys have appeared, under the impulse of the innovative 'Speak & Spell' from Texas Instruments. The poor quality of the synthesis available inevitably restrains the educational ambition of such products; high-quality synthesis at affordable prices might well change this. In some cases, oral information is more efficient than written messages: its appeal is stronger, while the listener's attention may still focus on other visual sources of information.
Hence the idea of incorporating speech synthesizers into measurement or control systems. In the long run, the development of high-quality TTS systems is a necessary step, as is the enhancement of speech recognizers, towards more complete means of communication between humans and computers. Multimedia is a first but promising move in this direction. Fundamental and applied research: TTS synthesizers possess a very peculiar feature which makes them wonderful laboratory tools for linguists: they are completely under control, so that repeated experiments provide identical results. Consequently, they make it possible to investigate the efficiency of intonative and rhythmic models.
A particular type of TTS system, based on a description of the vocal tract through its resonant frequencies (its formants) and hence known as a formant synthesizer, has also been extensively used by phoneticians to study speech in terms of acoustical rules. In this manner, for instance, articulatory constraints have been elucidated and formally described. It should be clear from the outset that a reading machine would hardly adopt a processing scheme like the one naturally taken up by humans, whether for language analysis or for speech production itself.
Vocal sounds are inherently governed by the partial differential equations of fluid mechanics, applied in a dynamic case, since our lung pressure, glottal tension, and vocal and nasal tract configurations evolve with time. These are controlled by our cortex, which takes advantage of the power of its parallel structure to extract the essence of the text read. Even though, in the current state of the engineering art, building a text-to-speech synthesizer on such intricate models is almost scientifically conceivable (intensive research on articulatory synthesis, neural networks, and semantic analysis gives evidence of it), it would in any case result in a machine with a very high degree of possibly avoidable complexity, which is not always compatible with economic criteria.
After all, planes do not flap their wings! Figure 1 introduces the functional diagram of a very general TTS synthesizer. As with human reading, it comprises a Natural Language Processing module (NLP), capable of producing a phonetic transcription of the text read, together with the desired intonation and rhythm (often termed prosody), and a Digital Signal Processing module (DSP), which transforms the symbolic information it receives into speech.
But the formalisms and algorithms applied often manage, thanks to a judicious use of the mathematical and linguistic knowledge of their developers, to short-circuit certain processing steps. This is occasionally achieved at the expense of some restrictions on the text to pronounce, or results in some reduction of the "emotional dynamics" of the synthetic voice (at least in comparison with human performance), but it generally makes it possible to solve the problem in real time with limited memory requirements.
One immediately notices that, in addition to the expected letter-to-sound and prosody generation blocks, it comprises a morpho-syntactic analyser, underlining the need for some syntactic processing in a high-quality text-to-speech system. Indeed, being able to reduce a given sentence to something like the sequence of its parts of speech, and to further describe it in the form of a syntax tree which unveils its internal structure, is required for at least two reasons. The NLP module first comprises a pre-processing module, which organizes the input sentences into manageable lists of words.
It identifies numbers, abbreviations, acronyms, and idiomatic expressions and transforms them into full text when needed.
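As a rough illustration, such normalization can be sketched with regular expressions; the tables below are tiny, hypothetical stand-ins for the large, language-specific resources a real pre-processor would use:

```python
import re

# Hypothetical, minimal expansion tables; a real pre-processor relies on
# much larger, language-specific resources.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera"}
UNITS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine"]

def expand_number(match):
    # Spell out a small integer digit by digit (a toy strategy;
    # real systems handle ordinals, dates, currency, etc.).
    return " ".join(UNITS[int(d)] for d in match.group())

def preprocess(text):
    # Expand known abbreviations, then spell out digit sequences.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", expand_number, text)

print(preprocess("Dr. Smith owns 2 cats."))
# → "Doctor Smith owns two cats."
```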
An important problem is encountered as early as the character level: that of punctuation ambiguity (including the critical case of sentence-end detection). It can be solved, to some extent, with elementary regular grammars. Next comes a morphological analysis module, the task of which is to propose all possible part-of-speech categories for each word taken individually, on the basis of its spelling. Inflected, derived, and compound words are decomposed into their elementary graphemic units (their morphs) by simple regular grammars exploiting lexicons of stems and affixes (see the CNET TTS conversion program for French [Larreur et al.]).
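The morph decomposition step can be sketched as follows; the stem and affix lexicons are toy stand-ins for the full lexica that a real morphological analyser compiles into regular grammars:

```python
# Toy morph decomposition: split an inflected word into stem + suffix
# using small illustrative lexicons.
STEMS = {"work", "play", "walk"}
SUFFIXES = {"", "s", "ed", "ing"}

def decompose(word):
    # Return every (stem, suffix) split licensed by the lexicons.
    return [(word[:i], word[i:])
            for i in range(len(word) + 1)
            if word[:i] in STEMS and word[i:] in SUFFIXES]

print(decompose("worked"))   # → [('work', 'ed')]
print(decompose("plays"))    # → [('play', 's')]
```

Ambiguous words simply yield several splits, which is exactly the kind of hypothesis set the subsequent contextual analysis module is meant to prune.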
A contextual analysis module then considers words in their context, which allows it to reduce the list of their possible part-of-speech categories to a very restricted number of highly probable hypotheses, given the corresponding possible parts of speech of neighbouring words. Finally, a syntactic-prosodic parser examines the remaining search space and finds the text structure (i.e. its organization into clause- and phrase-like constituents) which more closely relates to its expected prosodic realization. A poem by the Dutch high-school teacher and linguist G. Nolst Trenité surveys this problem in an amusing way.
It desperately ends with:

    Finally, which rhymes with "enough",
    Though, through, plough, cough, hough, or tough?
    Hiccough has the sound of "cup",
    My advice is ...

The letter-to-sound (LTS) module is responsible for the automatic determination of the phonetic transcription of the incoming text. It thus seems, at first sight, that its task is as simple as performing the equivalent of a dictionary look-up! On deeper examination, however, one quickly realizes that most words appear in genuine speech with several phonetic transcriptions, many of which are not even mentioned in pronunciation dictionaries.
Clearly, points 1 and 2 heavily rely on a preliminary morphosyntactic (and possibly semantic) analysis of the sentences to read. To a lesser extent, this is also the case for point 3, since reduction processes are not only a matter of context-sensitive phonation: they also rely on morphological structure and on word grouping, that is, on morphosyntax. The task of the LTS module can then be organized in many ways, roughly classified into dictionary-based and rule-based strategies.
Dictionary-based solutions consist of storing a maximum of phonological knowledge into a lexicon. In order to keep its size reasonably small, entries are generally restricted to morphemes, and the pronunciation of surface forms is accounted for by inflectional, derivational, and compounding morphophonemic rules which describe how the phonetic transcriptions of their morphemic constituents are modified when they are combined into words.
Morphemes that cannot be found in the lexicon are transcribed by rule.
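A minimal sketch of this dictionary-based organization, with an invented three-entry lexicon and a single morphophonemic rule (voicing of the plural /s/ after a voiced phone):

```python
# Toy dictionary-based LTS: morpheme pronunciations come from a lexicon,
# and a morphophonemic rule adjusts the plural suffix when morphs combine.
# Lexicon entries and the phone notation are illustrative.
LEXICON = {"dog": ["d", "o", "g"], "cat": ["k", "a", "t"], "s": ["s"]}
VOICED = {"b", "d", "g", "o", "a", "z", "v"}

def transcribe(morphs):
    phones = []
    for m in morphs:
        seg = list(LEXICON[m])
        # Morphophonemic rule: plural /s/ voices to /z/ after a voiced phone.
        if m == "s" and phones and phones[-1] in VOICED:
            seg = ["z"]
        phones += seg
    return phones

print(transcribe(["dog", "s"]))  # → ['d', 'o', 'g', 'z']
print(transcribe(["cat", "s"]))  # → ['k', 'a', 't', 's']
```

Real systems apply many such inflectional, derivational, and compounding rules, but the division of labour (lexicon for morphemes, rules for their combination) is the same.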
After a first phonemic transcription of each word has been obtained, some phonetic post-processing is generally applied, so as to account for coarticulatory smoothing phenomena. A rather different strategy is adopted in rule-based transcription systems, which transfer most of the phonological competence of dictionaries into a set of letter-to-sound or grapheme-to-phoneme rules. This time, only those words that are pronounced in such a particular way that they constitute a rule on their own are stored in an exceptions dictionary.
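A sketch of the rule-based organization just described: an exceptions dictionary is consulted first, and otherwise ordered letter-to-sound rules apply, longest grapheme first (all entries and the phone notation are illustrative):

```python
# Toy rule-based transcriber. The exceptions dictionary captures words
# whose pronunciation "constitutes a rule on its own"; everything else
# goes through ordered grapheme-to-phoneme rules.
EXCEPTIONS = {"of": ["@", "v"]}
RULES = [("ph", ["f"]), ("th", ["T"]), ("sh", ["S"]),
         ("a", ["a"]), ("e", ["e"]), ("i", ["i"]), ("o", ["o"]),
         ("b", ["b"]), ("f", ["f"]), ("n", ["n"]), ("t", ["t"]), ("s", ["s"])]

def to_phones(word):
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phones, i = [], 0
    while i < len(word):
        for graph, ph in RULES:       # rules are tried in order
            if word.startswith(graph, i):
                phones += ph
                i += len(graph)
                break
        else:
            i += 1                    # letters with no rule are skipped
    return phones

print(to_phones("phone"))  # → ['f', 'o', 'n', 'e']
print(to_phones("of"))     # → ['@', 'v']
```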
Notice that, since many exceptions are found among the most frequent words, a reasonably small exceptions dictionary can account for a large fraction of the words in a running text. In the early days it was argued that dictionary-based methods were inherently capable of achieving higher accuracy than letter-to-sound rules [Coker et al. 90], given the availability of very large phonetic dictionaries on computers. Clearly, some trade-off is inescapable. Besides, the compromise is language-dependent, given the obvious differences in the reliability of letter-to-sound correspondences across languages.
The term prosody refers to certain properties of the speech signal related to audible changes in pitch, loudness, and syllable length. Prosodic features have specific functions in speech communication (see Fig.). The most apparent effect of prosody is that of focus: there are certain pitch events which make a syllable stand out within the utterance, and indirectly the word or syntactic group it belongs to will be highlighted as an important or new component in the meaning of that utterance.
The presence of a focus marking may have various effects, such as contrast, depending on the place where it occurs, or the semantic context of the utterance. [Figure: different kinds of information provided by intonation (lines indicate pitch movements; solid lines indicate stress): a. focus; b. relationships between words (saw-yesterday; I-yesterday; I-him); c. finality (top) or continuation (bottom), as it appears on the last syllable; d. segmentation of the sentence into groups of syllables.] Although maybe less obvious, there are other, more systematic or general functions.
Prosodic features create a segmentation of the speech chain into groups of syllables, or, put the other way round, they give rise to the grouping of syllables and words into larger chunks. Moreover, there are prosodic features which indicate relationships between such groups, indicating that two or more groups of syllables are linked in some way. This grouping effect is hierarchical, although not necessarily identical to the syntactic structuring of the utterance.
Does this mean that TTS systems are doomed to a mere robot-like intonation until a brilliant computational linguist announces a working semantic-pragmatic analyzer for unrestricted text? There are various reasons to think not, provided one accepts an important restriction on the naturalness of the synthetic voice: the delivery of a so-called neutral intonation. Neutral intonation does not express unusual emphasis, contrastive stress, or stylistic effects. This approach removes the necessity for reference to context or world knowledge while retaining ambitious linguistic goals.
The key idea is that the "correct" syntactic structure, the one that precisely requires some semantic and pragmatic insight, is not essential for producing such a prosody [see also O'Shaughnessy 90]. With these considerations in mind, it is not surprising that commercially developed TTS systems have emphasized coverage rather than linguistic sophistication, concentrating their efforts on text analysis strategies aimed at segmenting the surface structure of incoming sentences, as opposed to their syntactically, semantically, and pragmatically related deep structure.
The resulting syntactic-prosodic descriptions organize sentences in terms of prosodic groups strongly related to phrases (and therefore also termed minor or intermediate phrases), but with a very limited amount of embedding: typically a single level for these minor phrases as parts of higher-order prosodic phrases (also termed major or intonational phrases, which can be seen as a prosodic-syntactic equivalent of clauses), and a second one for these major phrases as parts of sentences, to the extent that the related major phrase boundaries can be safely obtained from relatively simple text analysis methods.
In other words, they focus on obtaining an acceptable segmentation and translate it into the continuation or finality marks shown in the figure above.
Liberman and Church [], for instance, have reported on such a very crude algorithm, termed the chinks 'n chunks algorithm, in which prosodic phrases (which they call f-groups) are accounted for by a simple regular rule: an f-group is a sequence of chinks (function words and similar items) followed by a sequence of chunks (content words). They show that this approach produces efficient grouping in most cases, slightly better in fact than the simpler decomposition into sequences of function and content words.
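Under this reading of the chinks 'n chunks rule, the grouping can be sketched as follows; the chink word list is a small illustrative stand-in for a real function-word lexicon:

```python
# Sketch of chinks 'n chunks grouping: an f-group is a run of chinks
# (function words and similar items) followed by a run of chunks
# (content words). A new group starts when a chink follows a chunk.
CHINKS = {"i", "the", "a", "of", "to", "him", "that", "in", "it", "was"}

def f_groups(words):
    groups, current, in_chunks = [], [], False
    for w in words:
        is_chink = w.lower() in CHINKS
        if is_chink and in_chunks:        # chink after chunks: new group
            groups.append(current)
            current, in_chunks = [], False
        current.append(w)
        if not is_chink:
            in_chunks = True
    if current:
        groups.append(current)
    return groups

print(f_groups("I saw the man in the park".split()))
# → [['I', 'saw'], ['the', 'man'], ['in', 'the', 'park']]
```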
Once the syntactic-prosodic structure of a sentence has been derived, it is used to obtain the precise duration of each phoneme (and of silences), as well as the intonation to apply to them. This last step, however, is not straightforward either: it requires formalizing a good deal of phonetic or phonological knowledge, either obtained from experts or automatically acquired from data with statistical methods. More information on this can be found in [Dutoit 96]. Intuitively, the operations involved in the DSP module are the computer analogue of dynamically controlling the articulatory muscles and the vibratory frequency of the vocal folds so that the output signal matches the input requirements.
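Before turning to the DSP module, the duration computation just mentioned can be illustrated with a Klatt-style rule formulation, in which each phoneme has an inherent and a minimum duration and contextual rules scale the stretchable part (all numeric values here are illustrative, not taken from any published rule set):

```python
# Klatt-style duration rule (assumed simplification):
#   duration = minimum + (inherent - minimum) * product of rule percentages
INHERENT = {"a": 230, "t": 90}   # illustrative inherent durations, in ms
MINIMUM  = {"a": 80,  "t": 40}   # illustrative minimum durations, in ms

def duration(ph, rule_factors):
    # rule_factors: multiplicative effects of contextual rules,
    # e.g. phrase-final lengthening (>1) or unstressed shortening (<1).
    pct = 1.0
    for f in rule_factors:
        pct *= f
    return MINIMUM[ph] + (INHERENT[ph] - MINIMUM[ph]) * pct

print(duration("a", [1.4]))        # phrase-final lengthening → 290.0
print(duration("a", [0.6, 0.85]))  # unstressed + cluster shortening
```

The minimum duration acts as an incompressible floor, so that however many shortening rules fire, the phoneme never collapses below it.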
In order to do this properly, the DSP module should obviously, in some way, take articulatory constraints into account, since it has long been known that phonetic transitions are more important than stable states for the understanding of speech [Liberman 59].
This, in turn, can basically be achieved in two ways: explicitly, in the form of a series of rules which formally describe the influence of phonemes on one another; or implicitly, by storing examples of phonetic transitions and co-articulations in a speech segment database and using them just as they are, as ultimate acoustic units. Two main classes of TTS systems have emerged from this alternative, which quickly turned into synthesis philosophies given the divergences they present in their means and objectives: rule-based and concatenation-based synthesizers. Rule-based synthesizers are mostly favoured by phoneticians and phonologists, as they constitute a cognitive, generative approach to the phonation mechanism. The broad spread of the Klatt synthesizer [Klatt 80], for instance, is principally due to its invaluable assistance in the study of the characteristics of natural speech, through analytic listening to rule-synthesized speech.
What is more, the existence of relationships between articulatory parameters and the inputs of the Klatt model makes it a practical tool for investigating physiological constraints [Stevens 90].
For historical and practical reasons (mainly the need for a physical interpretability of the model), rule synthesizers always appear in the form of formant synthesizers. These describe speech as the dynamic evolution of up to 60 parameters [Stevens 90], mostly related to formant and anti-formant frequencies and bandwidths, together with glottal waveforms. Clearly, the large number of coupled parameters complicates the analysis stage and tends to produce analysis errors. What is more, formant frequencies and bandwidths are inherently difficult to estimate from speech data.
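As an illustration, a single formant can be modelled with a standard second-order digital resonator, one of the building blocks such synthesizers chain together (the constants follow the classical difference-equation form y[n] = A·x[n] + B·y[n-1] + C·y[n-2]; parameter values are illustrative):

```python
import math

# One formant as a second-order digital resonator with centre
# frequency f and bandwidth bw (both in Hz), at sample rate fs.
def resonator(signal, f, bw, fs=16000):
    T = 1.0 / fs
    C = -math.exp(-2 * math.pi * bw * T)
    B = 2 * math.exp(-math.pi * bw * T) * math.cos(2 * math.pi * f * T)
    A = 1.0 - B - C          # unity gain at DC
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = A * x + B * y1 + C * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Excite a 500 Hz formant with a unit impulse: the output is a
# damped oscillation ringing at roughly 500 Hz.
ringing = resonator([1.0] + [0.0] * 99, f=500, bw=60)
```

A full formant synthesizer drives a cascade or parallel bank of such resonators with a glottal source, updating f and bw frame by frame, which is where the dozens of coupled parameters come from.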
The need for intensive trial and error to cope with analysis errors makes these systems time-consuming to develop (several years are commonplace). Yet the synthesis quality achieved so far reveals typical buzziness problems, which originate from the rules themselves. Rule-based synthesizers remain, however, a potentially powerful approach to speech synthesis. They make it possible, for instance, to study speaker-dependent voice features, so that switching from one synthetic voice to another can be achieved with the help of specialized rules in the rule database. Following the same idea, synthesis-by-rule seems to be a natural way of handling the articulatory aspects of changes in speaking styles (as opposed to their prosodic counterpart, which can be accounted for by concatenation-based synthesizers as well).
S system [O'Shaughnessy 84] for French. As opposed to rule-based ones, concatenative synthesizers possess a very limited knowledge of the data they handle: most of it is embedded in the segments to be chained up. This clearly appears in figure 6, where all the operations that could indifferently be used in the context of a music synthesizer (i.e. with no reference to the linguistic nature of the sounds processed) are grouped together. A series of preliminary stages have to be fulfilled before the synthesizer can produce its first utterance. First, segments are chosen so as to minimize future concatenation problems.
A combination of diphones (i.e. units that begin in the middle of the stable state of a phone and end in the middle of the following one) is a common choice. When a complete list of segments has emerged, a corresponding list of words is carefully compiled, in such a way that each segment appears at least once (twice is better, for security). Unfavourable positions, like inside stressed syllables or in strongly reduced (i.e. over-coarticulated) contexts, are excluded. A corpus is then digitally recorded and stored, and the selected segments are spotted, either manually with the help of signal visualization tools, or automatically thanks to segmentation algorithms, the decisions of which are checked and corrected interactively.
A segment database finally centralizes the results, in the form of the segment names, waveforms, durations, and internal sub-splittings. In the case of diphones, for example, the position of the border between phones should be stored, so as to be able to modify the duration of one half-phone without affecting the length of the other one.
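A sketch of such a database entry for diphones, with the internal phone boundary stored so that each half-phone can be manipulated independently (field names and values are illustrative):

```python
from dataclasses import dataclass

# One diphone database entry: waveform plus the position of the border
# between its two half-phones, so the duration of one half can be
# modified without affecting the other.
@dataclass
class Diphone:
    name: str          # e.g. "a-t"
    samples: list      # waveform samples
    boundary: int      # index of the border between the half-phones

    def halves(self):
        # Split the waveform at the stored phone boundary.
        return self.samples[:self.boundary], self.samples[self.boundary:]

db = {"a-t": Diphone("a-t", [0.1, 0.3, -0.2, 0.05], boundary=2)}
left, right = db["a-t"].halves()
print(len(left), len(right))   # → 2 2
```

At synthesis time, the prosody module's target durations are then applied half-phone by half-phone before the segments are concatenated and smoothed.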