Automatic Speech Recognition (ASR) is a key component of a virtual assistant - it converts audio into text. As well as being crucial for conversational AI, ASR has applications as a standalone technology in places like automated subtitling, call centre transcription and analytics, meeting transcription, and more. This post takes a deeper look at what's under the hood of Cobalt's Cubic speech recognition technology.
An automatic speech recognition system has three models: the acoustic model, the language model and the lexicon. They're used together in an engine that 'decodes' the audio signal into a best-guess transcription of the words that were spoken.
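At a very high level, the engine searches for the word sequence whose combined acoustic and language model scores are highest, with the lexicon bridging between words and the sounds the acoustic model works with. The sketch below is only an illustration of that idea - the helper functions are hypothetical stand-ins, not Cubic's actual decoder.

```python
# Illustrative only: combining the three models' scores for one candidate transcription.
# `acoustic_score_fn`, `lexicon` and `language_model` are hypothetical stand-ins.

def score_hypothesis(words, acoustic_score_fn, lexicon, language_model):
    """Return log P(audio | words) + log P(words) for one candidate word sequence."""
    # The lexicon maps each word to its pronunciations; take the first one here.
    phones = [p for w in words for p in lexicon[w][0]]
    am_log_prob = acoustic_score_fn(phones)   # acoustic model: log P(audio | phones)
    lm_log_prob = language_model(words)       # language model: log P(words)
    return am_log_prob + lm_log_prob

# A real decoder searches over an enormous space of candidate word sequences
# (for example with beam search over a decoding graph) rather than scoring
# each hypothesis independently like this.
```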
The lexicon describes how words are pronounced phonetically. It's usually handcrafted by expert phoneticians, using a phone set that's specific to each language. One phone set for English is Arpabet, which describes word pronunciations using a set of 50 phonemes. Some words have multiple pronunciations - 'read', 'bow' and 'either' are good examples - and the lexicon contains each of these pronunciations. Some example Arpabet pronunciations are shown below.
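As a concrete (if simplified) illustration, here is a tiny lexicon for those three words with common Arpabet pronunciations, written as a Python dictionary for readability; real lexicons are typically plain text files with one pronunciation per line, and stress markers are omitted here.

```python
# A toy lexicon: words with more than one pronunciation map to multiple phone sequences.
lexicon = {
    "read":   [["R", "IY", "D"], ["R", "EH", "D"]],   # present tense vs past tense
    "bow":    [["B", "OW"], ["B", "AW"]],             # 'bow and arrow' vs 'take a bow'
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],
}
```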
In their October 2019 update, the OED added more than 650 new words, senses and subentries to their dictionary. New words appear in language all the time, and they would be a problem for a recogniser if we didn't know how to pronounce them. Starting from a good-quality lexicon, statistical techniques can be used to guess the pronunciations of new and unknown words.
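As a toy illustration of the fallback idea - emphatically not the statistical or neural grapheme-to-phoneme models used in practice - one could borrow the pronunciation of the most similarly spelled word already in the lexicon:

```python
# Toy out-of-vocabulary fallback: reuse the pronunciation of the closest-spelled
# known word. Real systems train grapheme-to-phoneme models on the lexicon instead.
import difflib

def guess_pronunciation(word, lexicon):
    """Return the pronunciations of the in-vocabulary word whose spelling is closest."""
    matches = difflib.get_close_matches(word.lower(), lexicon.keys(), n=1, cutoff=0.0)
    return lexicon[matches[0]] if matches else None

# e.g. guess_pronunciation("rereads", lexicon) borrows the entry for "read".
```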
The acoustic model (AM) models the acoustics of speech. The audio signal is split into small segments, or frames, typically 25ms in length. At a high level, the job of the acoustic model is to predict which sound, or phoneme, from the phone set is being spoken in each frame of audio.
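A minimal sketch of that framing step is below, assuming a 16 kHz waveform and a 10 ms hop between successive 25 ms frames (the hop length is a common choice, but an assumption here rather than something stated above).

```python
# Split a 1-D waveform into overlapping frames of shape (num_frames, frame_len).
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
    return np.stack([samples[i * hop_len : i * hop_len + frame_len]
                     for i in range(num_frames)])

frames = frame_signal(np.zeros(16000))   # one second of silence -> shape (98, 400)
# The acoustic model then turns each frame (usually after converting it to
# spectral features) into a probability distribution over the phone set.
```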
Deep neural networks trained on thousands of hours of transcribed audio data are a popular choice for acoustic models. Having the right data for training and testing the AM is key to ensuring that the model works well across different acoustic characteristics. Factors like accent, gender, age, microphone and background noise are all modeled by the acoustic model, and the data used for training has to reflect these characteristics.
Another, less obvious, factor that impacts the acoustic model is someone's speaking style. If a person knows they are talking to a machine, they tend to enunciate their words more clearly than they would in an informal conversation with a friend. This variation in enunciation is another aspect that's modeled by the acoustic model.
If you read many different passages of text, then you’ll get an idea of which words, and which sequences of words, are more likely than others. This knowledge is captured by the language model (LM) component of a speech recognition system. It learns which sequences of words are most likely to be spoken, and its job is to predict which words will follow on from the current words and with what probability.
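A toy bigram model shows the idea: count which word follows which in some example text, then turn the counts into probabilities. Production systems use far larger n-gram or neural language models trained on vast amounts of text, but the principle is the same.

```python
# Toy bigram language model: estimate P(next word | current word) from example sentences.
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    # Normalise the counts into conditional probabilities.
    return {w: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
            for w, nexts in counts.items()}

lm = train_bigram_lm(["please play some music", "please play the radio"])
print(lm["play"])   # {'some': 0.5, 'the': 0.5}
```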
Catherine Breslin is a machine learning scientist with experience in a wide range of voice and language technology. Her work spans from academic research improving speech recognition algorithms through to scaling Amazon’s Alexa. She’s based in Cambridge UK and coordinates Cobalt’s UK efforts.