Can someone who has experience with TTS explain to me why the Azure AI Speech thingie has such a hard time using correct pauses and emphasis with prepared or generated text? I’m mostly interested in career mode “actors” and ATC.
We’ve all heard this: “One two three decimal four [pause] Five-Cessna Alpha Bravo Charlie.”
Sometimes it feels like the AI model goes out of its way to put the emphasis on the wrong word in every sentence it says. And in ATC messages, it seems to actively detach the last digit of a number, or the last character of a spelled-out aircraft registration or procedure ID, from the rest of the designator and glue it onto the next word, whatever that may be.
Isn’t there a way for Asobo to provide hints to the TTS system, the way punctuation in written text helps a reader parse the author’s intent? Couldn’t they add tags to the text they feed to TTS, telling it that a frequency is a number with two or three decimal digits, or that all the letters of an identifier belong together when it spells them out in the phonetic alphabet? Something like the SSML sketch below.
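For what it’s worth, the hooks for this do exist at the service level: Azure TTS accepts SSML, which provides exactly these kinds of hints (`<say-as>`, `<break>`, and so on). I have no idea what Asobo’s pipeline actually sends to the service, so treat this as a sketch of what the markup could look like, not of what they do; the key, region, voice name, and the phrasing itself are placeholders of mine:

```python
# A minimal sketch, assuming the Azure Speech SDK for Python.
# Key, region, and voice are placeholders; the markup is my guess at
# what "hints" for ATC phrases could look like, not what Asobo does.
import azure.cognitiveservices.speech as speechsdk

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    Contact tower on
    <say-as interpret-as="digits">123</say-as> decimal
    <say-as interpret-as="digits">45</say-as>.
    <break time="400ms"/>
    Cessna Alpha Bravo Charlie, cleared to land.
    <!-- Phonetic-alphabet words are written out literally, since
         interpret-as="characters" would read letter names ("A B C")
         rather than "Alpha Bravo Charlie". -->
  </voice>
</speak>
"""

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_KEY",   # placeholder
    region="YOUR_REGION",      # placeholder
)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# speak_ssml_async() makes the engine follow the markup instead of
# guessing number grouping and pauses from raw text.
result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("spoken with the digits grouped where the markup says so")
```

Whether any of that survives the trip through the sim’s pipeline is a different question, of course.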
This is really, really bothering me more than it should.
I believe it is the acoustic equivalent of the “uncanny valley”.
The speech synthesis itself is so convincing that it’s really jarring when the immersion is broken by the wrong rhythm, cadence, semantics, not sure what to call it. (I throw balls far. You want good words? Date a languager.)
Is it just me?