Building Blocks of Voice Technology

Submitted By: Anonymous

With a set of simple building blocks, what you can build is limited only by your imagination. The same is true of technology, when we think of complex applications as being built from simpler building blocks.

At Cobalt, we have created a proprietary suite of engines, each of which handles a different voice or language task. Each of those engines can be customized to work well for specific languages or domains, and together they can be combined in a multitude of ways to create many different products – anything from call center analytics to tools for learning foreign languages and searching your video archives. 


The building blocks – or engines – we have built at Cobalt are based around core voice and language technology. Automatic speech recognition (ASR), also known as speech-to-text (STT), is the technology that converts audio to text to give a transcription of what is spoken. This is the technology you need if you want to know the words that were spoken in a segment of audio. Text-to-speech (TTS) is the reverse of STT, creating natural and engaging speech from text. TTS is what you’re looking for if you want a convenient and natural way to convey information to your customers.

Along with the words spoken, it’s possible to tease out other aspects of speech. Speaker identification and voice activity detection (VAD) combine to solve the task of diarisation, or “who spoke when?”. Applications can use this information to attribute each part of a recording to the right speaker. Audio classification can identify further nuances of the audio such as sentiment or accent. Similar to classification, pronunciation assessment identifies how close your pronunciation of a word is to a native speaker’s pronunciation. This is especially useful in educational settings, where those learning foreign languages can get targeted feedback for improvement.
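As a rough illustration of how VAD and speaker identification combine into diarisation, the Python sketch below merges speech segments (as a VAD engine might produce them) with one speaker label per segment (as a speaker identification engine might produce them) into "who spoke when" turns. The function name, the `max_gap` parameter and the example data are all hypothetical and made up for this example; this is not Cobalt's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def diarise(segments: List[Tuple[float, float]],
            speakers: List[str],
            max_gap: float = 0.5) -> List[Turn]:
    """Merge VAD segments and speaker labels into speaker turns.

    segments: (start, end) times of detected speech, sorted by start time
              (hypothetical VAD output).
    speakers: one label per segment (hypothetical speaker identification output).
    Consecutive segments from the same speaker that are separated by at most
    max_gap seconds are merged into a single turn.
    """
    if len(segments) != len(speakers):
        raise ValueError("need exactly one speaker label per segment")
    turns: List[Turn] = []
    for (start, end), speaker in zip(segments, speakers):
        last = turns[-1] if turns else None
        if last is not None and last.speaker == speaker and start - last.end <= max_gap:
            last.end = end  # extend the current turn
        else:
            turns.append(Turn(speaker, start, end))
    return turns


# Made-up example data.
vad_segments = [(0.0, 1.8), (2.0, 3.5), (4.2, 6.0), (6.1, 7.0)]
speaker_labels = ["alice", "alice", "bob", "bob"]
for turn in diarise(vad_segments, speaker_labels):
    print(f"{turn.start:.1f}-{turn.end:.1f}s: {turn.speaker}")
```

In a real system the segment boundaries and speaker labels would come from trained models, and overlapping speech would need extra handling that this sketch ignores.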

Another audio-related problem is how to efficiently search a large archive of video or audio. One way would be to do ASR on the entire archive and then search the automatically generated transcription. However, where the ASR makes mistakes, it’s not easy to recover, because the misrecognised words never appear in the transcription. A better approach is to phonetically index the audio – that is, understand the sounds spoken in the audio rather than the words – and search that phonetic index. This lets you be more flexible and overcome speech recognition mistakes.
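To make the difference concrete, here is a toy Python sketch comparing a word search over transcripts with a search over phoneme sequences. Everything in it is an assumption made for illustration: the tiny pronunciation lexicon, the ARPAbet-style phoneme symbols, the archive contents and the function names. It also uses exact matching of phoneme runs, whereas a production phonetic search would typically allow approximate matches.

```python
from typing import Dict, List

# Hypothetical pronunciation lexicon (ARPAbet-style symbols, made up for this example).
LEXICON: Dict[str, List[str]] = {
    "keira": ["K", "IH", "R", "AH"],
    "knightley": ["N", "AY", "T", "L", "IY"],
}

# Hypothetical archive: for each clip, the ASR transcript and the phoneme
# sequence a phonetic recogniser might have produced from the same audio.
# In clip_001 the ASR has misrecognised a name as similar-sounding words.
ARCHIVE = {
    "clip_001": {
        "transcript": "an interview with kira nightly",
        "phonemes": ["AE", "N", "IH", "N", "T", "ER", "V", "Y", "UW", "W", "IH", "DH",
                     "K", "IH", "R", "AH", "N", "AY", "T", "L", "IY"],
    },
    "clip_002": {
        "transcript": "the weather today",
        "phonemes": ["DH", "AH", "W", "EH", "DH", "ER", "T", "AH", "D", "EY"],
    },
}


def to_phonemes(query: str) -> List[str]:
    """Convert a text query to phonemes using the lexicon."""
    phonemes: List[str] = []
    for word in query.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def contains(sequence: List[str], pattern: List[str]) -> bool:
    """True if pattern occurs as a contiguous run inside sequence."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern for i in range(len(sequence) - n + 1))


def search_transcripts(query: str) -> List[str]:
    """Word search: look for the query text in each ASR transcript."""
    return [cid for cid, clip in ARCHIVE.items() if query.lower() in clip["transcript"]]


def search_phonetic(query: str) -> List[str]:
    """Phonetic search: look for the query's phonemes in each phoneme sequence."""
    pattern = to_phonemes(query)
    return [cid for cid, clip in ARCHIVE.items() if contains(clip["phonemes"], pattern)]


print(search_transcripts("Keira Knightley"))  # word search misses the misrecognised name
print(search_phonetic("Keira Knightley"))     # phoneme search still finds the clip
```

The word search fails because the name was never transcribed correctly, while the phoneme search succeeds because the sounds are the same either way.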

Moving on from audio and towards language, there are the tasks of natural language understanding (NLU) and dialogue management (DM). The first, NLU, is concerned with understanding more than just the words that were written or spoken. That is, being able to say something about the intention of the user, about the sentiment of the words, or about the topic that they’re talking about. The second, DM, concerns the ability of a device to keep track of a conversation and the state of the world, in order to have a functional conversation with a person using natural language. 

These separate building blocks of voice technology are based on machine learning techniques – they learn their behaviour from data. Hence they can be customized, or adapted, to specific scenarios. For example, with some additional audio data, the speech recognition system can be tuned to work better in a particular noisy environment. With some domain-specific text, it can be made to better recognise the words of a specific topic like science lectures, where the vocabulary is specialised. This customization can boost performance even further for your customers.


Each of the engines described above is useful in its own right, but put them together and there’s the chance to create a myriad of different applications to delight your customers. On top of that, being able to customize each engine to work specifically for your domain adds an extra dimension of usefulness.

A simple video subtitling application might just use speech recognition to subtitle the audio, but combine it with speaker identification and you have the chance to improve intelligibility by, for example, color coding the subtitles to each speaker. Add in some language understanding capability, and you have the basis of a meeting assistant that can take intelligent notes and allow you to look back later when the exact content of the meeting has been forgotten. 

A video search platform can use phonetic indexing and search as its basis, but could be combined with audio classification technology to allow for more specific searches. Now you can find examples of someone talking excitedly about your favourite football team, or when someone is talking in a more sombre tone about your favourite celebrity.

Call center analytics might combine speech recognition and natural language understanding technology. With these, you can understand what your customers are calling about on any day and create a high level overview to help manage your call center more efficiently. For example, you can quickly see whether customers have particular queries for the time of year, and train your representatives to answer them, or spot and address issues in the production of your goods which have led to increased call volume. 
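A minimal Python sketch of that kind of overview is shown below. It assumes the calls have already been transcribed by speech recognition, and it stands in for language understanding with simple keyword matching; the topic names, keywords and example calls are all hypothetical, and a real NLU engine would be a trained model rather than a keyword list.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical keyword lists standing in for a trained NLU topic classifier.
TOPIC_KEYWORDS: Dict[str, List[str]] = {
    "billing": ["invoice", "charge", "refund"],
    "delivery": ["delivery", "shipping", "late"],
    "faulty_product": ["broken", "faulty", "stopped working"],
}


def classify_topic(transcript: str) -> str:
    """Return the first topic whose keywords appear in the transcript."""
    lowered = transcript.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return topic
    return "other"


def summarise_calls(transcripts: List[str]) -> Counter:
    """Count how many calls fall under each topic."""
    return Counter(classify_topic(t) for t in transcripts)


# Made-up transcripts for one day of calls.
calls = [
    "My kettle stopped working after a week",
    "I was charged twice on my invoice",
    "The handle arrived broken",
    "Can I change my email address?",
]
for topic, count in summarise_calls(calls).most_common():
    print(topic, count)
```

Comparing these counts from day to day is one simple way to spot a sudden rise in calls about a particular topic.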

Combining many of the building blocks – speech recognition, language understanding, dialogue management and text-to-speech – allows creation of a voice interface. That is, a fully interactive system that users can talk to, ask questions of, and receive a verbal response from. Such an application can be served to users in different ways, for example in an automated call center or as a virtual assistant on a device or smart speaker. Voice interfaces can interact with your backend services to provide the information your customers need – such as their bank account balance, the wait time for a particular item you manufacture, or their closest store. Voice assistants can also be internally facing, allowing your employees to access key metrics and information by voice, wherever they are in the world.
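The way these four building blocks chain together can be sketched in a few lines of Python. Every function and class below is a hypothetical stand-in written for this example, not a real engine or Cobalt's API: the "ASR" and "TTS" stubs simply decode and encode text, the "NLU" stub matches keywords, and the backend is a plain dictionary.

```python
from typing import Dict, List


def recognise_speech(audio: bytes) -> str:
    """ASR stub: pretend the audio decodes to its UTF-8 text."""
    return audio.decode("utf-8")


def understand(text: str) -> Dict[str, str]:
    """NLU stub: keyword-based intent detection."""
    lowered = text.lower()
    if "balance" in lowered:
        return {"intent": "get_balance"}
    if "store" in lowered:
        return {"intent": "find_store"}
    return {"intent": "unknown"}


class DialogueManager:
    """DM stub: tracks the conversation so far and chooses a reply."""

    def __init__(self, backend: Dict[str, str]):
        self.backend = backend          # stands in for your backend services
        self.history: List[str] = []    # intents seen so far in the conversation

    def respond(self, meaning: Dict[str, str]) -> str:
        intent = meaning["intent"]
        self.history.append(intent)
        if intent == "get_balance":
            return f"Your balance is {self.backend['balance']}."
        if intent == "find_store":
            return f"Your closest store is {self.backend['closest_store']}."
        return "Sorry, I didn't catch that. Could you rephrase?"


def synthesise_speech(text: str) -> bytes:
    """TTS stub: pretend the text encodes to audio."""
    return text.encode("utf-8")


def handle_turn(audio: bytes, dm: DialogueManager) -> bytes:
    """One turn of the voice interface: ASR -> NLU -> DM -> TTS."""
    text = recognise_speech(audio)
    meaning = understand(text)
    reply = dm.respond(meaning)
    return synthesise_speech(reply)


dm = DialogueManager({"balance": "120 dollars", "closest_store": "on Main Street"})
print(handle_turn(b"What is my account balance?", dm).decode("utf-8"))
print(handle_turn(b"Where is my closest store?", dm).decode("utf-8"))
```

Because each stage only depends on the output of the one before it, any stub could in principle be swapped for a real engine without changing the others.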

These are just a few of the many applications we’ve talked about with our customers. When it comes to voice and language applications, the world is your oyster! Please get in touch with us to find out how Cobalt can help you build the voice and language technology that will delight your customers.