Is it hard for software speech synthesisers to handle IPA? If so, why?

24

9

Yesterday on ELU, the IPA sequence ˌoʊkeɪˈhiːɹjəˌgoʊ was posted in a comment. I'm not very familiar with IPA, so I thought the easiest way to "decode" that would be through a software speech synthesiser.

I already use text-to-speech (TTS) in my own plugin for the Foobar2000 music player. But I couldn't quickly see how to tell the underlying software that the input text was IPA, rather than standard English, so I went looking for an online version.

After several minutes of Googling, all I could find was this demo from AT&T, which, as it turned out, wasn't accurate enough for me to figure out what was being "said" by my target string above.

I'd always assumed that if IPA is supposed to be a "generic" way of representing phonemes, it should be the ideal input format for TTS software. But if that's so, why was it so difficult to find even something that produced what seems to me an inadequate rendering?

It seems unlikely to me that very few software developers are interested in TTS, so I'm guessing there's some reason why it's difficult for TTS software to handle IPA. Can anyone tell me why this might be so? Or is there maybe some other explanation as to why such software is at the very least "uncommon"?

FumbleFingers

Posted 2013-01-05T14:56:27.227

Reputation: 522

2 – Interesting question. For what it's worth, I think he wrote "OK, here you go". The "you" appears as jə because it's a weak form. – Alenanno 2013-01-05T15:27:37.303

8 – IPA is specifically not a "generic" way of representing phonemes. It's a specific way of representing phones. I.e., it's phonetic, not phonemic. Phonemes inside slashes are "localized" sets of speech sounds (phones), often picked more for tradition than accuracy. The IPA is the ISO of phonetic transcription; there's a precise description behind every one. For convenience, IPA symbols are often used in phonemic transcriptions, but just as often they are confusing (like IPA [j], which is the sound of the English phoneme /y/). – jlawler 2013-01-05T15:43:32.417

My suggestion -- get and go through the exercises in J.C. Catford's A Practical Introduction to Phonetics, which is full of little experiments you can do to teach yourself phonetics. – jlawler 2013-01-05T15:45:31.657

@Alenanno: Your interpretation is in fact correct, but the mere fact that you can say I think there implies that even you can't be certain (whether you got there by looking at the IPA characters, by passing them through AT&T's online facility, or both). I heard it as Oh-key. Hi, Jay-go, but that didn't make sense so I decided it must be Okay. Hi J-Lo. – FumbleFingers 2013-01-05T15:47:35.673

1 – As to the software question, I'd imagine it wouldn't be hard to program, but there'd be no US market because nobody here ever learns it. Hence a modest proposal, which I slip to schoolchildren when their teachers aren't looking. – jlawler 2013-01-05T15:49:03.533

@FumbleFingers: Would you like me to take it apart in detail? – jlawler 2013-01-05T15:50:08.773

@jlawler: Perhaps that wasn't a very good way of putting it. I meant that for any given "accent", I'd supposed that the vast majority of IPA representations would be at least consistent. Meaning that in principle the software could just apply different "accent templates" to a single IPA text string in order to come out with accurate renditions in different accents. In answer to your offer to dissect this one in detail - I would very much like the detail, please! – FumbleFingers 2013-01-05T15:52:15.477

@FumbleFingers No, I'm sure that he wrote that. I didn't mean "I think" to be like "I guess". – Alenanno 2013-01-05T15:52:42.247

1 – Here's a chat room before the site software starts nagging me to create it anyway. – FumbleFingers 2013-01-05T15:57:26.200

Speech synthesis doesn't deal in text input; it deals in generating and outputting sound. The internal representation could be anything and, in my limited experience, is usually nothing like IPA. What the question should be asking about is "text-to-speech". Even then, most such systems only deal in one or a small number of languages. Also, even if they did accept IPA input, there is not, contrary to popular opinion, only one correct way to use IPA. Linguists choose and tweak a subset of it to fit their language or task. – hippietrail 2013-03-04T22:53:20.897

Answers

14

At the phoneme level, a text-to-speech system is usually tied to the phoneme set of the language a voice was built from. This is usually because modern speech synthesizers (MBROLA, Cepstral, AT&T, etc.) use recorded voice samples in a diphone database (sampling phoneme pairs from the recorded data). This allows them to sound more natural.

As a result of this, the phoneme set is restricted to the voice's language. Thus, handling IPA phonemes not in the target voice is one problem you need to solve in interpreting IPA phoneme strings.

This is further complicated because the underlying phoneme model is often a simplistic one that does not model phoneme features (e.g. voiceless bilabial fricative), which would allow phonemes to be mapped from one phoneme scheme to another.
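
As a rough illustration of what a feature-aware mapping could enable (a sketch under assumed feature labels, not any engine's actual code), a voice missing an IPA phoneme could fall back to the supported phoneme that shares the most features with it:

# Sketch only: phonemes as feature sets, and a fallback that picks the
# supported phoneme sharing the most features with an unsupported one.
# The feature labels (vls, lab, blb, frc, ...) are illustrative shorthand.
IPA_FEATURES = {
    "ɸ": {"vls", "lab", "blb", "frc", "cnt"},  # voiceless bilabial fricative
    "f": {"vls", "lab", "lbd", "frc", "cnt"},  # voiceless labiodental fricative
    "p": {"vls", "lab", "blb", "stp"},         # voiceless bilabial stop
    "s": {"vls", "alv", "frc", "cnt"},         # voiceless alveolar fricative
}

def nearest_supported(ipa_symbol, voice_inventory):
    """Pick the voice phoneme that shares the most features with ipa_symbol."""
    target = IPA_FEATURES[ipa_symbol]
    return max(voice_inventory, key=lambda p: len(IPA_FEATURES[p] & target))

# An English voice has no /ɸ/; the closest supported phoneme by features is /f/.
print(nearest_supported("ɸ", ["f", "p", "s"]))  # -> f

A real engine would need a much richer feature model, but without features of some kind there is nothing to base such a fallback on.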

The eSpeak text-to-speech engine comes close to this in that its model is based around phoneme features and it supports a large number of languages, but it has chosen to use one language per voice.

I am in the process of building the Cainteoir text-to-speech engine, which will use phoneme features to model phonemes, providing transcription schemes to map to/from that model and separating language from voice. That is, each language and voice specifies the phonemes it supports, allowing a voice to speak different languages and each language to be used by multiple voices.

The other problem comes in identifying what text is IPA and what is not. This is complex as /jes/ could be IPA or could be jes in italics. Also, gin could be IPA or could be a word depending on context. This can be seen in the way different text-to-speech systems handle roman numerals (I, II, III, IV, V, ...): consider "Chapter I was the Chapter I was reading." and "Henry IV was placed on an IV drip."

Text-to-speech programs will often support SSML (Speech Synthesis Markup Language), which allows you to write <phoneme alphabet="ipa" ph="jes">yes</phoneme> to specify IPA phonemes. Different text-to-speech programs handle this differently; for example, Cepstral only supports its Arpabet-like phonemes.
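
For what that markup looks like in a complete document, here is a small sketch that wraps a phrase and its IPA transcription in the SSML phoneme element; how (and whether) a particular engine accepts the resulting SSML, and whether it honours the ipa alphabet, is engine-specific, as noted above:

# Sketch: build an SSML document that gives the pronunciation of a phrase in IPA.
from xml.sax.saxutils import escape, quoteattr

def ssml_ipa(text, ipa):
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(text)}</phoneme>'
        '</speak>'
    )

print(ssml_ipa("okay, here you go", "ˌoʊkeɪˈhiːɹjəˌgoʊ"))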

UPDATE:

The text-to-speech engine will express phonemes in their own phoneme transcription scheme that expresses the phonemes the voice supports. For Cepstral this is based on Arpabet (the US voice uses the CMU pronunciation dictionary with lower case phonemes); for MBROLA this is based on SAMPA (with each voice supporting the corresponding language-based SAMPA phonemes and some additional phonemes); for eSpeak this is based on Kirshenbaum.

The problem then becomes (a) reading the IPA text and (b) mapping the phonemes to the phonemes supported by the voice. A lot of text-to-speech engines don't bother as they do not have the need to support this (dictionary entries are specified in the voice's phoneme transcription scheme).

The approach I am taking is to express a phoneme transcription scheme as a collection of text => features mappings such as /s/ {vls,alv,frc} (ipa: https://raw.github.com/rhdunn/cainteoir-engine/master/data/phonemes/ipa.phon, ascii-ipa: https://raw.github.com/rhdunn/cainteoir-engine/master/data/phonemes/ascii-ipa.phon).

The idea here is to store the phonemes internally as a sequence of feature groups, allowing transcription => features => transcription. This then supports, for example, MBROLA voices reading Unicode-based IPA transcriptions. Here, each voice will have its own features => transcription mapping that maps several feature groups to the same transcription. For example, an English voice could pronounce both /r/ and /ɹ/ as /ɹ/, since {vcd,alv,trl} and {vcd,alv,apr} could map to the same voice phoneme.
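
A minimal sketch of that round trip (the tables below are illustrative, not the actual ipa.phon data or a real voice definition): one table reads a transcription into feature bundles, another writes feature bundles out in the voice's own symbols, and distinct IPA phonemes can legitimately collapse onto one voice phoneme on the way through:

# Sketch of transcription => features => transcription with made-up tables.
# Feature bundles use the same shorthand as above: vcd/vls (voicing),
# alv (place), trl/apr/frc (manner).
IPA_TO_FEATURES = {
    "r": ("vcd", "alv", "trl"),   # alveolar trill
    "ɹ": ("vcd", "alv", "apr"),   # alveolar approximant
    "s": ("vls", "alv", "frc"),
    "z": ("vcd", "alv", "frc"),
}

# A hypothetical English voice: several feature bundles map to one symbol.
FEATURES_TO_VOICE = {
    ("vcd", "alv", "trl"): "r",
    ("vcd", "alv", "apr"): "r",
    ("vls", "alv", "frc"): "s",
    ("vcd", "alv", "frc"): "z",
}

def to_voice(ipa_phonemes):
    """IPA symbols -> feature bundles -> the voice's own phoneme symbols."""
    return [FEATURES_TO_VOICE[IPA_TO_FEATURES[p]] for p in ipa_phonemes]

print(to_voice(["ɹ", "r", "s"]))  # ['r', 'r', 's'] -- /ɹ/ and /r/ collapse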

UPDATE 2:

Using ˌoʊkeɪˈhiːɹjəˌgoʊ as an example, the transcription for eSpeak is:

$ espeak -v en --ipa "[[,oUkeIh'i@j@g,oU]]"
ˌəʊkeɪhˈiəjəɡˌəʊ

for British English, and:

$ espeak -v en-US --ipa "[[,oUkeIh'i:rj@g,oU]]"
ˌoʊkeɪhˈiːrjəɡˌoʊ

for American English. You can also get it to output the MBROLA phonemes, e.g.:

$ espeak -v mb-de5-en --pho "[[,oUkeIh'i@j@g,oU]]"
@ w k E j h i: @ j @ g @ w

The MBROLA phonemes are actually listed one per line with duration and pitch contour information, but I have excluded that for brevity.

Therefore, eSpeak would need to map ˌoʊkeɪˈhiːɹjəˌgoʊ to ,oUkeIh'i:rj@g,oU for the en-US voice, ,oUkeIh'i@j@g,oU for the en voice and @ w k E j h i: @ j @ g @ w for the MBROLA de5 voice.
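
As a rough sketch of the en-US direction of that mapping (a hand-made symbol table covering only this example, not eSpeak's real conversion code), note that the stress marks also have to move: IPA writes them at the start of the syllable, while eSpeak's mnemonics put them just before the stressed vowel:

# Sketch: convert the example IPA string to eSpeak-style mnemonics for the
# en-US voice. The symbol table is hand-made for this one example.
IPA_TO_ESPEAK = {
    "ˌ": ",", "ˈ": "'",            # secondary / primary stress
    "oʊ": "oU", "eɪ": "eI", "iː": "i:", "ə": "@",
    "k": "k", "h": "h", "ɹ": "r", "j": "j", "g": "g", "ɡ": "g",
}
VOWELS = {"oU", "eI", "i:", "@"}
STRESS = {",", "'"}

def ipa_to_espeak(ipa):
    # Greedy longest-match tokenisation of the IPA string.
    tokens, i = [], 0
    while i < len(ipa):
        for length in (2, 1):
            symbol = ipa[i:i + length]
            if symbol in IPA_TO_ESPEAK:
                tokens.append(IPA_TO_ESPEAK[symbol])
                i += length
                break
        else:
            raise ValueError(f"no mapping for {ipa[i]!r}")
    # IPA marks stress at the start of the syllable; eSpeak's mnemonics put it
    # immediately before the stressed vowel, so delay each stress mark until
    # the next vowel token.
    out, pending = [], ""
    for tok in tokens:
        if tok in STRESS:
            pending = tok
        elif tok in VOWELS:
            out.append(pending + tok)
            pending = ""
        else:
            out.append(tok)
    return "".join(out)

print(ipa_to_espeak("ˌoʊkeɪˈhiːɹjəˌgoʊ"))  # ,oUkeIh'i:rj@g,oU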

reece

Posted 2013-01-05T14:56:27.227

Reputation: 816

That's very interesting. But note that I never expected to find an "IPA rendering" utility that would faithfully reproduce the accent of the (American) guy who posted the original transcript. My understanding is that for the most part, Brits and Americans have the same set of phonemes, but they may be articulated slightly differently. If you know that the input sequence is definitely IPA, I can't see why it should be a problem to "vocalise" it, if the user is already identified as "someone who expects UK English articulations". – FumbleFingers 2013-03-03T01:22:28.077

2 – @FumbleFingers The problem is that the voices do not have the same phoneme sets -- e.g. American English voices do not have /ɒ/, so they need to map it to /ɑ/. They also need to map IPA to their own representation, e.g. to /0/ and /A:/ for eSpeak or /Q/ and /A:/ for MBROLA English voices that use English SAMPA phonemes. Another example is that the MBROLA de5 voice has /R/ as the uvular fricative /ʁ/ and /r/ as the alveolar flap /ɾ/. How does it speak the alveolar approximant /ɹ/ and the alveolar trill /r/ when it does not have the corresponding phonemes? You need to experiment to map IPA to voice phonemes. – reece 2013-03-04T10:52:14.673

3

The following remarks might be interesting when reflecting on why such text-to-speech software is not presently available:

The phonetic component of a computer system may use an orthographic text, but it does not really require one. Chinese characters would do just as well, as long as there was a character for each possible English syllable. The way most speech synthesis systems work is to search through a vast store of several hours of recorded speech and find the longest possible sequences that match the desired output. In practice this is done with reference to something like a segmental transcription of words, but there is nothing in the theory of concatenative speech synthesis, as this process is called, that requires traditional phonetic segments. In fact it virtually never uses them. The smallest lengths of sounds that are ever pulled out of the store are diphones — the last half of one segment and the first half of another — and even these are used only very occasionally, when there is no matching word or syllable in store. Concatenative synthesis nearly always manages to join whole syllables or even whole long words together. Listeners to high quality synthetic speech agree that the phonetic component of a language generator works quite well by simply finding appropriate bits of stored sounds and then joining them together. (Ladefoged 2005)

Also tangentially relevant, and a good and quick read regardless of one's familiarity with phonetics, is a 2005 interview with Gunnar Fant, who was very influential in the early development of digital speech transmission and speech synthesis.

jlovegren

Posted 2013-01-05T14:56:27.227

Reputation: 7 665

Sorry, this is not convincing. As my example below shows, the system can read any combination of sounds, not necessarily those it can find in a text corpus. – Anixx 2013-01-06T05:30:37.043

@Anixx Can your system read any combination of sounds, not all of which are found in Russian? Have it say a sentence in Zulu or Cantonese. – jlovegren 2013-01-06T07:04:24.930

It can read only sounds found in Russian, in any combination. I think this system would need only a minimal change to be able to read any sounds. – Anixx 2013-01-06T11:08:19.420

0

For my studies of Proto-Indo-European I always use a Russian speech synthesizer. It is not perfect: Russian lacks some phones that existed in PIE, it pronounces unstressed o as a (following the Moscow dialect), and it sometimes places stress wrongly. Yet in most cases I can make out what I actually wrote:

cleu̯os ndhghu̯itom

qu̯oteros

pa̯tēr

Anixx

Posted 2013-01-05T14:56:27.227

Reputation: 3 628

"You are not allowed to download or view attachments in this section." :)Alenanno 2013-01-05T22:54:28.407

@Alenanno Fixed it, thanks. – Anixx 2013-01-05T23:21:20.507

2 – I am not convinced this answers the question, however. I mean this one: "Is it hard for software speech synthesisers to handle IPA? If so, why?" – Alenanno 2013-01-06T11:05:14.263