Under the Hood: Automatic Speech Recognition

Submitted by Catherine Breslin on Wed, 10/23/2019 - 10:25

Automatic Speech Recognition (ASR) is a key component of a virtual assistant - it converts audio into text. As well as being crucial for conversational AI, ASR has applications as a standalone technology in places like automated subtitling, call centre transcription and analytics, meeting transcription, and more. This post takes a deeper look at what's under the hood of Cobalt's Cubic speech recognition technology.


An automatic speech recognition system has three models: the acoustic model, the language model, and the lexicon. They're used together in an engine that 'decodes' the audio signal into a best-guess transcription of the words that were spoken. 


The lexicon describes how words are pronounced phonetically. It's usually handcrafted by expert phoneticians, using a phone set that's specific to each language. One phone set for English is Arpabet, which describes word pronunciations using a set of 50 phonemes. Some words have multiple pronunciations - 'read', 'bow' and 'either' are good examples - and the lexicon contains all of them. Some example Arpabet pronunciations (from the CMU Pronouncing Dictionary):

  'read'    R IY D  or  R EH D
  'bow'     B OW    or  B AW
  'either'  IY DH ER  or  AY DH ER
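In code, a lexicon can be thought of as a simple mapping from words to their candidate pronunciations. The sketch below is illustrative only - real lexicons hold hundreds of thousands of entries with stress markers - but it shows how multiple pronunciations per word are kept as alternatives for the decoder:

```python
# A minimal sketch of a pronunciation lexicon: each word maps to one or
# more Arpabet pronunciations (phoneme lists, as in the CMU dictionary).
lexicon = {
    "read":   [["R", "IY", "D"], ["R", "EH", "D"]],
    "bow":    [["B", "OW"], ["B", "AW"]],
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],
}

def pronunciations(word):
    """Return every known pronunciation for a word (empty list if unknown)."""
    return lexicon.get(word.lower(), [])

# 'read' keeps both pronunciations as candidates; the decoder picks
# whichever matches the audio best in context.
print(pronunciations("read"))
```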


In their October 2019 update, the OED added more than 650 new words, senses and subentries to their dictionary. New words appear in language all the time, and they would be problematic for speech recognition if we didn't know how to pronounce them. Given a good-quality lexicon, statistical techniques can be used to guess the pronunciations of new and unknown words.


The acoustic model (AM) models the acoustics of speech. The audio signal is split into small segments, or frames, typically 25ms in length. At a high level, the job of the acoustic model is to predict which sound, or phoneme, from the phone set is being spoken in each frame of audio.


The acoustic model predicts the probability of each phoneme being spoken in a short frame of audio.

Deep neural networks trained on thousands of hours of transcribed audio data are a popular choice for acoustic models. Having the right data for training and testing the AM is key to ensuring that the model works well across different acoustic characteristics. Factors like accent, gender, age, microphone, and background noise are all modeled by the acoustic model, and the data used for training has to reflect these characteristics. 

Another, less obvious, factor that impacts the acoustic model is someone’s speaking style. If a person knows they are talking to a machine, they’re more likely to enunciate their words more clearly than if they are having an informal conversation with a friend. This variation in enunciation is another aspect that’s modeled by the acoustic model.
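The framing step described above can be sketched in a few lines. This assumes a 16 kHz sample rate and a 10 ms shift between successive frames - both common defaults, though not stated in this post:

```python
import numpy as np

SAMPLE_RATE = 16_000                     # assumed sample rate (samples/sec)
FRAME_LEN = int(0.025 * SAMPLE_RATE)     # 25 ms frame -> 400 samples
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)   # 10 ms hop between frames (a common choice)

def split_into_frames(signal):
    """Slice a 1-D audio signal into overlapping 25 ms frames."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    return np.stack([signal[i * FRAME_SHIFT : i * FRAME_SHIFT + FRAME_LEN]
                     for i in range(n_frames)])

one_second = np.zeros(SAMPLE_RATE)       # one second of (silent) audio
frames = split_into_frames(one_second)
print(frames.shape)                      # (98, 400): 98 frames of 400 samples
```

The acoustic model then scores each of these frames against the phone set, producing a probability distribution over phonemes per frame.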


If you read many different passages of text, then you’ll get an idea of which words, and which sequences of words, are more likely than others. This knowledge is captured by the language model (LM) component of a speech recognition system. It learns which sequences of words are most likely to be spoken, and its job is to predict which words will follow on from the current words and with what probability.


The language model predicts which words are likely to come next, and with what probability.

The language model is usually an N-gram model or a neural network trained on millions of words of text data. People’s word and phrase choice is largely influenced by the topic they’re talking about, but it can also be influenced by age, gender, formality, speaking style, and other factors. The language model training data should reflect the kinds of words and phrases users will say to the final system. 
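To make the N-gram idea concrete, here is a toy bigram model - counts of which word follows which - estimated from a nine-word corpus. A real language model would be trained on millions of words and would use smoothing for unseen word pairs, which this sketch omits:

```python
from collections import Counter, defaultdict

# Toy training text; a real LM would be trained on millions of words.
corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams: how often each word follows each preceding word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(word):
    """P(next | word) estimated from raw bigram counts (no smoothing)."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# After 'the', 'cat' appeared twice and 'mat' once, so 'cat' is twice as likely.
print(next_word_probs("the"))
```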


The AM models sounds of the language, the lexicon describes how those sounds combine to make words, and the LM models how those words are constructed into sequences of words. Used together in a speech recognition engine, these allow you to automatically transcribe speech. 

As with all machine learning systems, these speech recognition models are heavily dependent on the data used to train them. This means we must take care to make the right choices of data to model the complete variety of speech. Yet, it also allows us to customise a system to a particular application or scenario by careful selection of data. For example, we can:

  • Build an acoustic model using audio data from a particularly noisy environment such as a factory floor or a cockpit. This lets us build an ASR system that’s resilient to that specific kind of noise in the background.

  • Construct a language model for a specific scenario, such as sales calls or technical meetings, so that the speech recognition accuracy is optimised for the application.

  • Adapt an existing acoustic model in one language to be used in a different language, e.g. English to German, using a technique called transfer learning. This transfers some of the knowledge of English acoustics to the German model, allowing us to build the latter with far less data than we’d otherwise need.

  • Run a speech recognition system with limited hardware resources, by compressing the language model and acoustic model to fit within the system constraints, while still maintaining performance.
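One common way to realise the language-model customisation above is linear interpolation: mix a small in-domain model with a large general one. The unigram probabilities and mixing weight below are invented for illustration; in practice the weight is tuned on held-out data:

```python
# Sketch of LM customisation by linear interpolation of two models.
# Probabilities and weight are illustrative, not from a real system.
general = {"meeting": 0.010, "agenda": 0.001, "banana": 0.0020}
domain  = {"meeting": 0.050, "agenda": 0.040, "banana": 0.0001}

LAMBDA = 0.7  # weight on the in-domain model, tuned on held-out data

def interpolated(word):
    """Mix in-domain and general probability estimates for a word."""
    return LAMBDA * domain.get(word, 0.0) + (1 - LAMBDA) * general.get(word, 0.0)

# Domain words like 'agenda' get boosted relative to the general model alone.
print(interpolated("agenda"))
```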

Often, organisations already have data which can be used for customisation of a speech model. For example, internal collections of text documents can be used to build a language model to be combined with Cobalt’s general acoustic model and lexicon. The custom system can then be further refined after it’s deployed, using data captured in the field, if data handling and privacy restrictions allow. 


In the past 10 years, speech recognition accuracy has dramatically improved due to increased computing power, data availability, and more powerful modeling methods like deep neural networks. While we can’t predict the future, some interesting directions we see things evolving are: 

  1. Customisation - organisations are looking for AI solutions, including voice and language technology, that are tailored for their use cases.

  2. Data efficiency - currently it takes thousands of hours of audio data to build a system, but reducing the data requirements of ASR systems promises to speed up and lower the cost of development.

  3. End-to-end ASR - using a single neural network to model the lexicon, language model and acoustic model together is a popular research area that's making huge strides. The results aren't yet as accurate as systems built from separate models, but the research moves fast!


At Cobalt Speech, we specialise in building and scaling custom voice and language technology for our clients. Cobalt’s Cubic engine for automatic speech recognition is designed for use in a range of different custom, on-prem and embedded scenarios. Get in touch to see how we can help you.

About the Author

Catherine Breslin is a machine learning scientist with experience in a wide range of voice and language technology. Her work spans from academic research improving speech recognition algorithms through to scaling Amazon’s Alexa. She’s based in Cambridge UK and coordinates Cobalt’s UK efforts.
