Under the Hood: Text-to-Speech

Submitted By: Rasmus Dall

We’ve previously written about one of our core technologies at Cobalt Speech & Language – automatic speech recognition (ASR). When you speak, the ASR system converts your spoken words into text. Another core technology at Cobalt is text-to-speech (TTS), or speech synthesis, which converts written words into spoken audio.

One place where TTS voices are currently reaching large audiences is virtual assistants. These assistants can perform a multitude of tasks, and it would be impossible to pre-record all of their answers. Instead, they use TTS technology to generate responses automatically, allowing them to talk to you about a wide range of topics.

Beyond virtual assistants, TTS is used in many applications that might not be immediately obvious. For example, TTS provides a voice to those who cannot otherwise speak. Stephen Hawking’s voice is a famous example, but many other people benefit from synthetic voices to help them remain independent. In other scenarios, you might hear TTS voices in call-centre interactive voice response (IVR) systems, in public announcements, in computer games that generate speech without pre-recording every line of dialogue, or when calling self-service helpdesks.

Voice is a natural way to interact, and so TTS can be found anywhere you might talk to your customers. But how does TTS work? How good is it now? And where is the technology heading? In this blog post, we’ll give an overview of Cobalt’s answer to each of those questions.

WHAT’S IN A VOICE?

TTS has a long history. A famous early example is the Voder, which was exhibited at the 1939 New York World’s Fair. This device mimicked the physical way that speech is produced, and was controlled by an operator with keys and pedals. For some time afterwards, the best TTS approaches relied on digital signal processing techniques. Stephen Hawking lost his ability to speak in the mid-1980s, and his famously robotic voice was created with expertly tuned digital signal processing methods. TTS technology has moved on a lot since then, although Hawking continued to favour his robotic voice.

In the 1990s, those signal processing methods were replaced by unit selection – a method that cuts segments of sound from a large database of recorded speech and pastes them together to create new audio. More recently, the field has moved towards statistical techniques that model speech, and the rise of neural networks has led to the very natural-sounding synthetic speech we hear today.

At Cobalt, we employ the latest statistical techniques for the greatest flexibility, the highest quality and the smallest footprint, allowing deployment even on small embedded devices. Our TTS system contains three components: a text-processing frontend, an acoustic model and a vocoder. These three components work together to create our natural-sounding voices.
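
To make the division of labour concrete, here is a deliberately tiny sketch of how the three components fit together. Every function here is a toy stand-in, not Cobalt’s actual API; it only illustrates the flow of data from text to waveform.

```python
# A toy sketch of the three-stage TTS pipeline. Every function is a
# stand-in for the real component it is named after.

def frontend_process(text: str) -> list[str]:
    """Frontend: normalise the text and expand it into spoken units."""
    return text.lower().split()  # real frontends do far more than this

def acoustic_model_predict(units: list[str]) -> list[list[float]]:
    """Acoustic model: map linguistic units to acoustic feature frames."""
    return [[0.0] * 80 for _ in units]  # e.g. one 80-dim mel frame per unit

def vocoder_generate(frames: list[list[float]]) -> list[float]:
    """Vocoder: turn acoustic feature frames into waveform samples."""
    return [sample for frame in frames for sample in frame]

def synthesise(text: str) -> list[float]:
    return vocoder_generate(acoustic_model_predict(frontend_process(text)))

print(len(synthesise("Hello world")))  # -> 160 placeholder samples
```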


THE TEXT-PROCESSING FRONTEND

When a user types in text to be spoken, the system first has to work out how that text should sound. This component knows that the “St.” in “St. Patrick’s Day” is not the same as the one in “High St.”, that “1984” is pronounced “nineteen eighty four”, and that “bow” is pronounced differently in “she took a bow” and “he tied a bow”. The text-processing frontend converts the written text into a speakable form via a series of text transformations.
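
As an illustration, here is a minimal rule-based normaliser covering just the examples above. A production frontend uses far larger rule sets and statistical models; these few regular expressions and the tiny number lookup are purely illustrative.

```python
import re

# Toy text normalisation: expand "St." and four-digit years. The rules
# and lookup tables below cover only the examples in this post.

TENS = {"19": "nineteen", "20": "twenty"}
PAIRS = {"84": "eighty four", "39": "thirty nine"}  # tiny demo lookup

def expand_year(m: re.Match) -> str:
    """Read a four-digit year as two pairs: 1984 -> 'nineteen eighty four'."""
    first, second = m.group(0)[:2], m.group(0)[2:]
    return f"{TENS.get(first, first)} {PAIRS.get(second, second)}"

def normalise(text: str) -> str:
    # "St." before a capitalised word is usually "Saint"...
    text = re.sub(r"\bSt\.(?=\s[A-Z])", "Saint", text)
    # ...while "St." after another word is usually "Street".
    text = re.sub(r"(?<=[a-z] )St\.", "Street", text)
    # Expand four-digit years into their spoken form.
    return re.sub(r"\b(19|20)\d\d\b", expand_year, text)

print(normalise("St. Patrick's Day on High St. in 1984"))
# -> Saint Patrick's Day on High Street in nineteen eighty four
```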

Speech is not just the words that are spoken, but how they are spoken, including pronunciation and prosody. Words are converted to their corresponding speech sounds through a combination of a pronunciation lexicon and a machine learning model that predicts the pronunciation of unknown words. Useful extra information is also extracted, such as part-of-speech tags, syllabification and sentence structure. Some of that information is language-specific, like tone.
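
A sketch of that lookup-with-fallback logic is below. The lexicon entries and the letter-to-sound fallback are toy examples (the phoneme symbols are loosely ARPAbet-like); real systems combine lexicons with hundreds of thousands of entries and a trained grapheme-to-phoneme model.

```python
# Toy pronunciation lookup: a lexicon keyed on (word, part of speech),
# with a naive fallback for words the lexicon does not know.

LEXICON = {
    # Homographs need context, such as part of speech, to disambiguate:
    ("bow", "VERB"): ["b", "aw"],  # "she took a bow"
    ("bow", "NOUN"): ["b", "ow"],  # "he tied a bow"
    ("hello", None): ["hh", "ax", "l", "ow"],
}

def letter_to_sound(word: str) -> list[str]:
    """Naive stand-in for a trained grapheme-to-phoneme model."""
    return list(word)  # one "sound" per letter -- illustration only

def pronounce(word: str, pos: str | None = None) -> list[str]:
    word = word.lower()
    if (word, pos) in LEXICON:
        return LEXICON[(word, pos)]
    if (word, None) in LEXICON:
        return LEXICON[(word, None)]
    return letter_to_sound(word)  # unknown word: predict a pronunciation

print(pronounce("bow", "VERB"))  # -> ['b', 'aw']
print(pronounce("cobalt"))       # -> fallback: ['c', 'o', 'b', 'a', 'l', 't']
```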

Because the frontend deals directly with the written language, it is the most language-dependent element of the TTS system, and it is where much of the customisation happens.

THE ACOUSTIC MODEL

The role of the acoustic model is to capture the specific characteristics of the voice. Aspects like personality, speaking style, accent, gender and age are all incorporated into the acoustic model.

Cobalt’s approach learns a statistical model from tens of hours of audio data and uses this model to generate speech. Today, this modelling is generally handled by a neural network, which gives the best quality.
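
For a feel for what such a model looks like, here is a deliberately small neural acoustic model in PyTorch. It maps a sequence of phoneme IDs to 80-dimensional mel-spectrogram frames, one frame per phoneme, which is unrealistic; real architectures are far larger and also model durations, pitch and alignment. All of the sizes here are illustrative assumptions, not those of Cobalt’s models.

```python
import torch
import torch.nn as nn

# A toy neural acoustic model: phoneme IDs in, mel-spectrogram frames out.
class ToyAcousticModel(nn.Module):
    def __init__(self, n_phonemes: int = 50, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, 64)      # phoneme IDs -> vectors
        self.rnn = nn.LSTM(64, 128, batch_first=True)  # context over time
        self.to_mels = nn.Linear(128, n_mels)          # project to mel frames

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(phoneme_ids)  # (batch, time, 64)
        x, _ = self.rnn(x)           # (batch, time, 128)
        return self.to_mels(x)       # (batch, time, n_mels)

model = ToyAcousticModel()
phonemes = torch.randint(0, 50, (1, 12))  # one 12-phoneme utterance
mels = model(phonemes)
print(mels.shape)  # torch.Size([1, 12, 80])
```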

The quality of the input data is of the utmost importance to the final result. If there’s noise in the recordings, that noise will also be present in the model. If the speaker is hoarse, the result will be a hoarse voice, and if the recordings sound like a news reader, the voice will sound like a news reader. At Cobalt, we specialise in voices tailored to your specific needs, and we work hard to find the right voice talent who speaks just the way you wish your product to sound.

THE VOCODER

You might never have heard of a vocoder before, yet you probably use one every day. Vocoders have been used in telephony ever since they were invented. The job of a vocoder in a telephone is to compress the audio, reducing the amount of data sent over the line, and to decompress it at the other end without degrading the audio quality. In a TTS system, using a vocoder simplifies the job of the acoustic model and allows us to create a better-quality voice.

Throughout the 2010s, improved vocoder technology has been one of the major factors in the rising popularity of purely statistical approaches to TTS. Modern vocoders reproduce speech with very little loss of quality, allow a variety of manipulations (such as raising the pitch), and keep the acoustic model’s footprint small.
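
As a simple illustration of the vocoder’s job, the sketch below uses a classical, non-neural method: Griffin-Lim phase reconstruction, as implemented in librosa. It is not Cobalt’s vocoder; it just demonstrates the round trip from a waveform to compact acoustic features and back.

```python
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)  # one second of a 220 Hz tone

# Analysis: waveform -> mel spectrogram, the kind of compact acoustic
# features an acoustic model would output.
mels = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# Synthesis: mel spectrogram -> waveform. Griffin-Lim iteratively
# estimates the phase information that the mel spectrogram discards.
y_hat = librosa.feature.inverse.mel_to_audio(mels, sr=sr)

print(y.shape, y_hat.shape)  # y_hat approximates y, with some loss
```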

THE FUTURE

TTS has been improving consistently over the years, and we’ve now reached a point where synthetic voices can at times be indistinguishable from a human. Even with these improvements in quality, there are still several directions in which we see the industry moving:

  1. The frontend is very language-dependent and often largely hand-crafted. Recent trends show great promise in making this process more automatic and less language-dependent, making it faster to build voices in new languages.
  2. New vocoders are continually being developed, with the most recent neural network-based vocoders producing speech that is almost indistinguishable from human speech.
  3. Emotions, characters and speaking styles are a big research focus. Current systems are quite inflexible, generally speaking in a single voice and a single style. Enhancing the variety of synthetic speech is an interesting problem with many applications, allowing systems to portray real characters rather than neutral assistants.

About the Author

Rasmus Dall leads Cobalt’s speech synthesis efforts. He holds a PhD in speech synthesis from the Centre for Speech Technology Research (CSTR) in Edinburgh and previously worked to bring life to the voice of Jibo, Time Magazine’s invention of the year 2017.
He’s based in Denmark, and when not pushing the boundaries of embedded speech synthesis, he tends to his garden and pretends to be Scottish.
