About the technology

What are Virtual Assistants?

“Hey Computer, tell me the latest”

With the rise of virtual assistants like Amazon’s Alexa, Apple’s Siri and Google Assistant, we’re all getting used to talking to our devices. In contrast to computers that have a keyboard and mouse, or tablets and phones with a touchscreen, virtual assistants let us interact using natural spoken language. Voice interfaces drastically simplify our interaction with technology.

To fulfil requests, virtual assistants are built on a complex pipeline of AI technology:

  • A Wakeword (WW) detector runs on the device, listening for the user to say a particular word or phrase to activate the assistant. It’s also possible to activate the assistant in other ways, like a push-to-talk button.

  • Automatic Speech Recognition (ASR) converts spoken audio from the user into a text transcription.

  • Natural Language Understanding (NLU) takes the transcription of what the user said and predicts their intention in a way that’s actionable. This component understands that users can make the same request in a multitude of different ways that should all have the same outcome.

  • The Dialogue Manager (DM) decides what to say back to the user and whether to take any action, and keeps track of the conversation across multiple turns.

  • Text to Speech (TTS) converts the assistant’s text response into spoken audio, giving the assistant its output voice.
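As a rough illustration of how these stages hand data to one another, here is a minimal Python sketch. Everything in it is a hypothetical stand-in rather than a real implementation: "audio" is represented as a plain string, the NLU is a handful of keyword rules instead of a trained model, and the function names, intent names and responses are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional

WAKEWORD = "hey computer"  # assumed wake phrase, taken from the example above


@dataclass
class Intent:
    """The actionable meaning that NLU extracts from a transcript."""
    name: str


def wakeword_detected(audio: str) -> bool:
    """WW stand-in: real detectors work on audio, here we inspect a string."""
    return audio.lower().startswith(WAKEWORD)


def recognise_speech(audio: str) -> str:
    """ASR stand-in: return the words spoken after the wake word."""
    return audio[len(WAKEWORD):].strip(" ,.!?").lower()


def understand(transcript: str) -> Intent:
    """NLU stand-in: different phrasings map onto the same intent."""
    if any(word in transcript for word in ("latest", "news", "headlines")):
        return Intent("get_news")
    if any(word in transcript for word in ("weather", "forecast")):
        return Intent("get_weather")
    return Intent("unknown")


def decide_response(intent: Intent) -> str:
    """DM stand-in: choose what to say back for each intent."""
    responses = {
        "get_news": "Here are the latest headlines.",
        "get_weather": "Here is the forecast for today.",
    }
    return responses.get(intent.name, "Sorry, I did not catch that.")


def synthesise(text: str) -> bytes:
    """TTS stand-in: return bytes where a real system would return audio."""
    return text.encode("utf-8")


def run_assistant(audio: str) -> Optional[bytes]:
    """Chain the stages: WW -> ASR -> NLU -> DM -> TTS."""
    if not wakeword_detected(audio):
        return None  # the assistant stays idle until it hears the wake word
    transcript = recognise_speech(audio)
    intent = understand(transcript)
    reply = decide_response(intent)
    return synthesise(reply)


print(run_assistant("Hey Computer, tell me the latest"))
```

In a production system each function would be replaced by a trained model or a dedicated service, but the order of the stages and the data passed between them would follow the same shape.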


The technology in this pipeline needs to cope with the breadth and ambiguity of natural language. Hence, alongside manually defined rules, it’s based on machine learning - a group of AI algorithms that learn their behaviour from data instead of being explicitly programmed. This allows assistants to learn how people speak and be able to generalise to new speakers or requests. 

Types of virtual assistant

AI assistants can be deployed in many ways - e.g. in a smartphone app, over a phone call, or on a dedicated device like a smart speaker. There are many places where virtual assistants are proving useful, and new applications are continually being built.

The simplest setup is a command and control system. Here the user has just a few spoken commands available to control a device, with little or no dialogue. Simple assistants are often used in environments where hands-free control improves efficiency, for example giving machine operators additional voice control on the factory floor.
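A command and control system can be pictured as a fixed table from spoken phrases to device actions. The sketch below assumes a hypothetical conveyor belt with three commands; the class, the phrases and the speed limits are invented for illustration, and the transcript is assumed to have come from speech recognition already.

```python
from typing import Callable, Dict


class Conveyor:
    """Hypothetical factory device controlled by voice."""

    def __init__(self) -> None:
        self.running = False
        self.speed = 1

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def speed_up(self) -> None:
        self.speed = min(self.speed + 1, 5)  # cap at an assumed maximum of 5


def build_commands(device: Conveyor) -> Dict[str, Callable[[], None]]:
    """The whole vocabulary of the system: a few phrases, one action each."""
    return {
        "start conveyor": device.start,
        "stop conveyor": device.stop,
        "speed up": device.speed_up,
    }


def handle_command(transcript: str, commands: Dict[str, Callable[[], None]]) -> bool:
    """Run the matching action; return False if the phrase is not a command."""
    action = commands.get(transcript.strip().lower())
    if action is None:
        return False
    action()
    return True


conveyor = Conveyor()
commands = build_commands(conveyor)
handle_command("Start conveyor", commands)
handle_command("speed up", commands)
print(conveyor.running, conveyor.speed)
```

Because the vocabulary is small and fixed, anything outside the table is simply rejected rather than leading to a conversation.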

In a step up from command and control systems, many of today’s assistants are task-oriented. The user and computer work together to achieve well-defined tasks like making a bank transfer or finding a mortgage recommendation. These assistants typically work in narrow domains like finance or customer service and require some dialogue back and forth with the user to complete the task. 
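The back and forth in a task-oriented assistant is often described as filling in the pieces of information a task needs before it can be carried out. The following sketch assumes a hypothetical bank transfer that needs a recipient and an amount. The slot names and prompts are invented for this example, and each user turn is assumed to have already been reduced by the NLU to a dictionary of values.

```python
from typing import Dict, List

# Information the hypothetical transfer task needs, in the order we ask for it.
REQUIRED_SLOTS = ("recipient", "amount")

PROMPTS = {
    "recipient": "Who would you like to send money to?",
    "amount": "How much would you like to send?",
}


def next_action(slots: Dict[str, str]) -> str:
    """Ask for the first missing slot, or complete the task if none is missing."""
    for slot in REQUIRED_SLOTS:
        if slot not in slots:
            return PROMPTS[slot]
    return f"Transferring {slots['amount']} to {slots['recipient']}."


def run_dialogue(turns: List[Dict[str, str]]) -> List[str]:
    """Accumulate slot values turn by turn and collect the assistant's replies."""
    slots: Dict[str, str] = {}
    replies = []
    for new_slots in turns:
        slots.update(new_slots)
        replies.append(next_action(slots))
    return replies


# The user opens with no details, then answers each question in turn.
for reply in run_dialogue([{}, {"recipient": "Alice"}, {"amount": "50 pounds"}]):
    print(reply)
```

A user who gives everything in one request skips the questions entirely, which is why the same task can take one turn or several.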


More general virtual personal assistants like Amazon’s Alexa or Apple’s Siri handle many different enquiries across a number of domains. They allow you to play music, ask for the weather, control your smart home devices, ask for jokes and much more. Their interactions remain task-oriented, though they typically have some chatty responses to general enquiries. 


Academic research is moving beyond task-oriented dialogue towards new forms of conversational interaction. Fully conversational agents are some way from being built and deployed at scale, but current research is looking towards social forms of human-computer interaction. Competitions like the Alexa Prize - a university competition to build assistants that converse coherently and engagingly with humans - are showcasing some of these results.


Looking to the future

Despite their widespread adoption, AI assistants at scale are still in their infancy. Apple’s Siri launched on the iPhone relatively recently in 2011, and Amazon’s Alexa in 2014. The underlying technology is continually improving. In the next few years, we expect to see AI assistants become:

  • Customisable - organisations will more easily be able to build custom interactions. We are already starting to see the first tools to allow easy customisation of voice assistants.

  • Contextual - assistants will incorporate context from different sources. Relevant context can come from real-world knowledge, from personalised information about the user, or from the history of the current conversation.

  • Conversational - while human levels of conversation are still a long way in the future, AI assistants will incorporate more rudimentary conversational capability in the near future.


At Cobalt Speech, we specialise in building custom voice and language technology for our clients, whether one part of the pipeline or a complete virtual assistant. Are you interested in building a secure and private virtual assistant for your enterprise? Get in touch to see how we can help you.


About the Author

Catherine Breslin is a machine learning scientist with experience in a wide range of voice and language technology. Her work ranges from academic research on improving speech recognition algorithms to scaling Amazon’s Alexa. She’s based in Cambridge, UK, and coordinates Cobalt’s UK efforts.