Scaling Virtual Assistants

Virtual assistants allow us to interact with technology by voice. They are built on a complex pipeline of AI technology that understands the breadth and complexity of spoken language. This pipeline includes automatic speech recognition, natural language understanding, dialogue management and text-to-speech components. The technology in the pipeline is based on machine learning - a subset of AI algorithms that learn their behaviour from data instead of being explicitly programmed.
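The pipeline can be pictured as a chain of composable stages. Here's a minimal sketch in Python, where each stage is stubbed out with invented behaviour - the function names and canned responses are illustrative only, not a real API:

```python
# A toy sketch of the assistant pipeline: ASR -> NLU -> dialogue -> TTS.
# Each stage is stubbed; a real system would call trained models here.

def recognise_speech(audio: bytes) -> str:
    """Automatic speech recognition: audio in, transcript out (stubbed)."""
    return "play some jazz"          # a real ASR model would decode the audio

def understand(transcript: str) -> dict:
    """Natural language understanding: transcript -> intent and slots (stubbed)."""
    return {"intent": "PlayMusic", "slots": {"genre": "jazz"}}

def manage_dialogue(interpretation: dict) -> str:
    """Dialogue management: decide what the assistant should say or do."""
    genre = interpretation["slots"].get("genre", "something")
    return f"Playing {genre} for you."

def synthesise(text: str) -> bytes:
    """Text-to-speech: response text in, audio out (stubbed)."""
    return text.encode("utf-8")      # placeholder for synthesised audio

def handle_request(audio: bytes) -> bytes:
    """Run one request through the full pipeline, stage by stage."""
    transcript = recognise_speech(audio)
    interpretation = understand(transcript)
    response_text = manage_dialogue(interpretation)
    return synthesise(response_text)
```

Keeping the stages separate like this is what lets each component be trained, tested and swapped out independently.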

Building and releasing a virtual assistant is a huge milestone for any organisation. Yet, in our experience, this milestone is only the beginning of a much larger effort to grow and scale. There are different dimensions in which to scale, and each direction brings both technical and organisational challenges. Three directions where we’ve seen virtual assistants successfully scale are users, functionality and languages.


Attracting more users is often the first thought when it comes to scaling. First and foremost, the production infrastructure that serves the assistant to users has to be scaled up to serve multiple users in parallel. Effective ways of scaling software services have been written about widely, and so the rest of this post will discuss more specific ways of scaling machine learning based voice technology.


More users will make requests in more varied ways, and so we must make the assistant robust to variation in its input. It’s particularly hard with natural language to predict in advance all the ways people will phrase a request. Machine learning is crucial for scalability as the underlying models can learn patterns from data and interpret phrases that haven’t been heard before. At the same time, we have to ensure that the system is designed for a wide range of people. Variation in spoken language comes from many sources, including accent, age and gender. Carefully selecting the data used for training and testing the underlying machine learning models helps ensure we build technology that performs well for different segments of the population.

Each machine learning component in the pipeline can be built and tested in isolation, but it’s also important to develop end-to-end testing to ensure that the whole system works as intended. As an assistant scales to more users, there are more opportunities for things to go wrong: users will report errors, and answers to questions can be wrong. End-to-end testing makes it simpler to track and fix these bugs across the complex pipeline of technology.
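One common pattern is to turn each reported bug into an end-to-end test case pairing a user request with the expected response. A minimal sketch, where `assistant_response` is a toy stand-in for the real pipeline and the canned answers are invented:

```python
# End-to-end regression testing sketch: each case pairs a user request
# with the expected final response from the whole pipeline.

def assistant_response(utterance: str) -> str:
    """Stand-in for the full ASR + NLU + dialogue pipeline (toy lookup)."""
    canned = {
        "what's the time": "It's three o'clock.",
        "play some jazz": "Playing jazz for you.",
    }
    return canned.get(utterance.lower(), "Sorry, I didn't catch that.")

# Cases typically grow out of real user bug reports.
END_TO_END_CASES = [
    ("play some jazz", "Playing jazz for you."),
    ("what's the time", "It's three o'clock."),
]

def run_end_to_end_suite() -> list:
    """Return the cases where the whole pipeline gives the wrong answer."""
    failures = []
    for utterance, expected in END_TO_END_CASES:
        actual = assistant_response(utterance)
        if actual != expected:
            failures.append((utterance, expected, actual))
    return failures
```

An empty failure list means every previously reported bug still behaves correctly after the latest change to any component.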


A second dimension of scale is in the types of request an assistant can respond to. AI assistants model language as intents and slots. Intents are broad categories of request, such as asking to play music or to make a payment. Slots are the specific entities being talked about, like a song name or a payment amount. Adding new features usually means adding new types of intent and slot. When adding intents and slots, it’s a good idea to begin by placing yourself in the shoes of the end user. Imagine the kinds of conversations they will have with your assistant. From these imagined conversations, it can be easier to identify the new intents and slots that are needed.
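In code, an interpreted request is often just an intent label plus a dictionary of slot values. A small sketch, with intent and slot names invented for illustration:

```python
# Sketch of the intent + slots representation of a request.
# The intent and slot names below are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Interpretation:
    intent: str                                  # broad category of request
    slots: dict = field(default_factory=dict)    # specific entities mentioned

# "Play Bohemian Rhapsody" -> a PlayMusic intent with a song slot.
play = Interpretation(intent="PlayMusic",
                      slots={"song": "Bohemian Rhapsody"})

# "Send ten pounds to Alex" -> a MakePayment intent with amount and payee.
pay = Interpretation(intent="MakePayment",
                     slots={"amount": "ten pounds", "payee": "Alex"})
```

Adding a new feature then amounts to defining its intents and slots, and collecting data that teaches the models to map phrases onto them.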


Inevitably, as new features are added, there’ll be clashes between existing and new functionality that have to be resolved on a case-by-case basis. A request to ‘Turn it up’ almost certainly refers to the volume when your assistant is only a music player, but becomes far more ambiguous once you enable smart home control too. Resolving these ambiguities usually needs discussion between both technical and business teams.
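Once the business decision is made, the resolution often ends up as explicit, case-by-case logic. A sketch of how ‘Turn it up’ might be disambiguated - the feature names, intent labels and fallback rules here are all invented:

```python
# Sketch of case-by-case disambiguation: the same utterance maps to
# different intents depending on which features are enabled and on
# recent context. All names and rules are illustrative.

def resolve_turn_it_up(enabled_features, active_device=None):
    """Decide what 'Turn it up' refers to, given the assistant's features."""
    candidates = []
    if "music" in enabled_features:
        candidates.append("IncreaseVolume")
    if "smart_home" in enabled_features:
        candidates.append("IncreaseHeating")
    if len(candidates) == 1:
        return candidates[0]          # unambiguous: only one feature applies
    # Ambiguous: fall back to recent context, e.g. the device last used.
    if active_device == "thermostat":
        return "IncreaseHeating"
    if active_device == "speaker":
        return "IncreaseVolume"
    return "AskForClarification"      # no context: ask the user
```

A music-only assistant resolves the request immediately, while one with smart home control enabled has to lean on context or ask the user to clarify.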

Building new functionality also requires data on which to train and test the underlying models. The best source of data is realistic data collected from real users. Synthetic data, even data elicited in trials, doesn’t exactly match real usage. Thus, there’s a tension between users’ privacy and collecting data. It’s necessary to be transparent with users about their rights, clearly explain if and how their data is used, and to ensure compliance with data handling laws when collecting data for training machine learning algorithms.

Machine learning models are evaluated by computing their accuracy on a set of test data. It’s a characteristic of machine learning models that you cannot guarantee their behaviour. Changing the data they’re trained on, even when only adding new data for new features, will change the model’s behaviour. Thus, it’s important to have good accuracy testing to ensure that performance on existing features is maintained as new features are built.
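A simple way to guard against this is to track accuracy per intent and flag any intent whose score drops after retraining. A sketch, with thresholds and scores invented for illustration:

```python
# Sketch of per-feature accuracy regression testing: after retraining,
# check that accuracy on existing intents has not dropped.

def accuracy_by_intent(model, test_set):
    """test_set: list of (utterance, gold_intent). Returns intent -> accuracy."""
    totals, correct = {}, {}
    for utterance, gold in test_set:
        totals[gold] = totals.get(gold, 0) + 1
        if model(utterance) == gold:
            correct[gold] = correct.get(gold, 0) + 1
    return {intent: correct.get(intent, 0) / totals[intent]
            for intent in totals}

def regressions(old_scores, new_scores, tolerance=0.01):
    """Intents whose accuracy dropped by more than the tolerance."""
    return [intent for intent in old_scores
            if new_scores.get(intent, 0.0) < old_scores[intent] - tolerance]
```

For example, if `MakePayment` accuracy fell from 0.92 to 0.85 after adding a new feature’s training data, `regressions` would flag it before the model ships.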


A third dimension is scaling to new languages. An assistant trained for English users obviously won’t work if you speak to it in German. Yet you don’t need to start entirely from scratch to build a fully working German assistant.


The infrastructure for building, testing and deploying your models can be shared between languages. Additionally, the intent and slot ontology can be shared. Where things diverge is the need to collect language-specific data on which to train the underlying models. Techniques like transfer learning can minimise the amount of new data that’s needed. Transfer learning takes knowledge learned in one domain or language and adapts it to another using only a small amount of new data.
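The split between what’s shared and what’s language-specific can be made concrete in the data layout. A sketch, with the ontology, phrases and validation check all invented for illustration:

```python
# Sketch: one intent/slot ontology shared by every language, with
# training data collected per language. All names are illustrative.

# Shared ontology: intent -> allowed slot names.
ONTOLOGY = {
    "PlayMusic": ["song", "genre"],
    "SetTimer": ["duration"],
}

# Training examples are language-specific but label into the same ontology.
TRAINING_DATA = {
    "en": [("play some jazz", "PlayMusic", {"genre": "jazz"})],
    "de": [("spiel etwas Jazz", "PlayMusic", {"genre": "Jazz"})],
}

def validate(training_data, ontology):
    """Check every example uses only intents and slots from the shared ontology."""
    for language, examples in training_data.items():
        for utterance, intent, slots in examples:
            assert intent in ontology, f"{language}: unknown intent {intent}"
            for slot in slots:
                assert slot in ontology[intent], \
                    f"{language}: unknown slot {slot} for {intent}"
    return True
```

Because both languages label into the same ontology, the build and test infrastructure downstream doesn’t need to know which language it’s handling.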

New languages may have characteristics that need language-specific solutions. For example:

  • Compound words can encode multiple entities

    • E.g. The German word “Rechtsschutzversicherungsgesellschaften” translates as ‘insurance companies providing legal protection’. Which part of the word matters for a query depends on context, so in such languages it helps to split words into sub-words.

  • Formality and politeness affect word choice

    • E.g. The Korean verb ‘nupta’, meaning ‘to lie down’ has different forms depending on politeness and formality, including nuweo, numneunda, nuweoyo and nupseumnida. These distinctions are important when designing the character of an assistant and determining how it should respond to users. 

  • Tonal languages use pitch to change the meaning of a word

    • E.g. “mā mā mà mǎ” in Mandarin means “Mum scolded the horse”. The words differ only in the pitch contour with which they’re spoken. Speech recognition for tonal languages uses pitch information to distinguish words.

  • Lack of spaces in some written text means it has to be automatically segmented into ‘words’ - necessitating another component in the pipeline

    • E.g. Japanese “おはようございます” (ohayō gozaimasu) means “good morning”
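The segmentation component for unspaced text can be as simple as dictionary-based longest-match splitting, at least as a starting point. A sketch, with a tiny invented dictionary:

```python
# Sketch of dictionary-based longest-match segmentation for unspaced
# text - the kind of extra pipeline component Japanese needs.

def segment(text, dictionary):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        match = text[i]                      # fall back to one character
        for j in range(len(text), i, -1):    # try the longest span first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

# "Good morning" written without spaces, split into its two words.
DICTIONARY = {"おはよう", "ございます"}
```

Production systems use statistical segmenters rather than a fixed dictionary, but the pipeline change is the same: an extra tokenisation step before natural language understanding.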

After the models for each language have been successfully built and deployed, we need a convenient way to update those models using new data. Language changes over time, so models ought to be retrained and deployed regularly to keep up with changing usage. This is especially true in fast-moving domains like news, where events happen suddenly and are talked about widely.

Whether for financial advice, customer assistance, or another application, the goal in building any virtual assistant is to have it scale effectively so more people use it. Scaling virtual assistants brings a unique set of challenges. This post has discussed ways of overcoming those challenges when scaling to more users, features or languages. Yet, each organisation is different and there are plenty of tricky problems that we’re still figuring out. 

At Cobalt Speech, we specialise in building and scaling custom voice and language technology for our clients. Are you interested in building a secure and private virtual assistant for your enterprise? Get in touch to see how we can help you.

About the Author


Catherine Breslin is a machine learning scientist with experience in a wide range of voice and language technology. Her work spans from academic research improving speech recognition algorithms through to scaling Amazon’s Alexa. She’s based in Cambridge UK and coordinates Cobalt’s UK efforts.

This was originally given as a talk at Re-Work’s AI Assistant Summit in London, on Sept 19th 2019.