Foreign Language Learning

Using voice technology to help students learning foreign languages


Whether it’s to improve their employment prospects, to make traveling easier, or simply to exercise their brain, many people are learning foreign languages.

Grammar and vocabulary are important aspects of a new language, but often the first stumbling block to making yourself understood is pronunciation. Each language has its own core set of sounds, called phones, from which words are constructed. As young babies we already show a preference for the phones of our native language. By the time we are adults, those preferences are ingrained and it can be difficult even to hear sounds that aren’t present in our native language. This makes learning pronunciation hard because you cannot easily tell when you are wrong. For example, it is hard for native Spanish speakers to distinguish the long and short vowels in the English words ‘sheep’ and ‘ship’, while native English speakers have trouble hearing and reproducing subtle vowel differences in French.

Technology companies are changing the way that people learn new languages by bringing teaching online. When learning a foreign language, students who receive frequent and meaningful feedback improve faster. The best way to get this feedback is talking with a teacher who can correct you in the moment. One of the key challenges for providers of online foreign language instruction is to find a way of providing timely feedback on such aspects as pronunciation, sentence construction or word selection.

Here’s where voice technology can help. By analysing speech patterns, we can provide instant feedback and so help people learn faster. We use automatic speech recognition technology to recognise what students have said, including the phonemes, stress and intonation patterns they use. By comparing these to one or more reference pronunciations, we can compute scores for entire phrases, words, syllables or phones. We provide detailed feedback to students about which sounds and words they successfully pronounced, and where they should put in more practice. Furthermore, we can use automated text-to-speech to speak out examples of phrases, words or sounds that the student can listen to and compare to their own pronunciations as they learn.
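As a rough illustration of comparing a student's phones to a reference pronunciation, here is a minimal Python sketch. It is not Cobalt's scoring method: it assumes the recogniser has already produced a list of phone symbols (ARPAbet-style labels are used here as an assumption), and it simply aligns the two sequences with the standard library's difflib. The function name score_phones is hypothetical.

```python
from difflib import SequenceMatcher


def score_phones(reference, spoken):
    """Align spoken phones against reference phones.

    Returns (score, feedback), where score is the fraction of reference
    phones that were matched and feedback lists the differences found.
    """
    matcher = SequenceMatcher(a=reference, b=spoken, autojunk=False)
    feedback = []
    correct = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            correct += i2 - i1
        elif tag == "replace":
            feedback.append(("substituted", reference[i1:i2], spoken[j1:j2]))
        elif tag == "delete":
            feedback.append(("missing", reference[i1:i2], []))
        elif tag == "insert":
            feedback.append(("extra", [], spoken[j1:j2]))
    score = correct / len(reference) if reference else 0.0
    return score, feedback


# The student meant "sheep" but produced the short vowel of "ship".
reference = ["SH", "IY", "P"]
spoken = ["SH", "IH", "P"]
score, feedback = score_phones(reference, spoken)
print(round(score, 2), feedback)
```

Here two of the three reference phones match, and the feedback entry points at the vowel as the sound to practise. A real system would score the audio itself rather than discrete symbols, so this only shows the shape of the comparison.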

Cobalt has worked with each of our partner foreign language learning companies to create custom technology that reflects their teaching philosophy and style. By automating language learning feedback, Cobalt has been able to deploy a scalable, online foreign language learning platform which is in use by our partners in their online education products.

Do you want to chat with us about using voice and language technology in your products? Get in touch by emailing

Cobalt at Interspeech

posted by Andrew Fletcher

The speech processing research community is a dynamic place these days.  The commercial prominence of highly popular speech interfaces has taken an already-thriving research community and provided a transforming jolt of attention.  The evidence is quite clear at the annual Interspeech conference, the 2019 version of which was held in Graz, Austria from September 15-19.  Cobalt had five employees in attendance this year -- Jeff Adams, Ryan Lish, Stan Salvador, Rasmus Dall, and myself -- and we found it invigorating both technically and socially. Plus, we experienced the joys of European tourism! (As an aside, if you’re looking to join a dynamic team in speech and language, reach out to us at

First, the locale.  Jeff, Ryan, and I chose to stay in a quaint cottage in a family-owned Austrian winery.  While this meant a commute into Graz, we got to experience the Austrian countryside in its early autumn grandeur.  Stan and Rasmus both stayed in the city and thus had easy walking access to the lovely Graz city center. I think it’s fair to say we were all more than satisfied with the experience.

With thousands of speech scientists gathered in one place, the conference was a great chance to reconnect with friends and colleagues. Whether it be former Cobalt colleagues, friends from our past employments, or friends met at previous Interspeeches, we all had multiple reunions each day. Jeff takes the prize in this department: he has spent over 25 years deep within the speech science community and is an extremely extroverted guy. Throughout the week he reconnected with friends in academia and industry, giving him the chance to metaphorically take the pulse of the broader speech community.

Jeff reconnecting with former Cobalt colleagues Jangwon Kim & Kevin Yang

A valuable and enjoyable aspect of Interspeech is the chance to broaden our vision of both technical challenges and innovative solutions.  We get to see the highlights and accomplishments of a talented group of men and women who have insights into both what problems are worth solving and how to go about finding the best solutions.  As many researchers can attest, it requires focus and attention to detail to tackle hard problems. Time spent thinking about various techniques at Interspeech can help to avoid “over-fitting” (to borrow a term from machine learning).  We certainly came away with many ideas to improve technology for our customers -- the challenge now will be prioritizing and implementing!

One particular trend we’re watching closely is the extensive research effort going into “end-to-end” speech recognition. Deep neural networks (DNNs) have revolutionized speech processing (and the greater machine learning community) over the last decade and researchers continue to probe means to leverage them to ever greater effect. Many traditional approaches (such as Kaldi) have trained separate acoustic and language models for speech recognition, and DNNs have proven useful for both models. Such methods are the foundation for production ASR systems and are now referred to as “hybrid.” This contrasts with efforts to do ASR with combined DNNs that solve the problem “end-to-end.” End-to-end ASR has been a topic for several years now and is the clear focus of the research community. It is fascinating to see the innovative methods used to overcome the acknowledged challenges, particularly with regard to the quantity of data required and the design of DNN topologies that exploit data in the most efficient ways.

While end-to-end ASR has not yet overtaken hybrid systems in achieving state-of-the-art performance, the achievements are both noteworthy and exciting. The very first survey talk of Interspeech 2019, by Ralf Schlüter of RWTH Aachen University, compared results to date and explored the place for each style of ASR. In particular, he presented results on the LibriSpeech corpus showing that the current state of the art is a hybrid system with an LSTM acoustic model and a transformer-based language model. The two talks immediately following presented end-to-end approaches that nearly matched the hybrid system. It was a welcome reminder right as the conference began that ASR continues to be an active competition of ideas to help push the community toward ever more impactful technology. It’s a fun time to be in the speech technology field!

We engaged with researchers discussing speech synthesis, natural language understanding and dialogue systems, voice biometrics, speech analytics, emotion detection, and more. I had some great discussions about techniques like generative adversarial networks, attention networks, and neural vocoders.  I won’t presume to summarize the tremendous volume of research presented at the conference (the archive is available here).  It was invigorating to be a part of it!

Cobalt remains a committed participant in this dynamic and innovative community. We’re staying current in the latest techniques and are always looking for opportunities to transition cutting-edge research into deployed technology. If you’re looking to make these breakthroughs, come see how we can help you.

(L to R) Jeff, Stan, Andrew, and Ryan enjoying a delicious lunch in the hills above Graz

About the Author

This week’s blog is contributed by Dr. Andrew Fletcher, Cobalt’s Vice President of Research

Dr. Andrew Fletcher is an experienced research engineer with broad expertise in system architecture, machine learning, information theory, communication systems, signal processing, and quantum computing. He has led projects at Cobalt in voice user interface design and implementation, speech recognition, natural language understanding, speaker biometrics, text to speech, and keyword spotting. Prior to joining Cobalt, Andrew spent 15 years at MIT Lincoln Laboratory. He led research projects on undersea laser communication and quantum cryptography and was a leading system engineer for free-space laser communication systems for NASA and the US Department of Defense. Andrew earned a Ph.D. from the Massachusetts Institute of Technology in Electrical Engineering. At MIT, he was in the Laboratory for Information and Decision Systems (LIDS), and was advised by Peter Shor, a leading theorist in quantum computing and information. He earned B.S. and M.S. degrees in Electrical and Computer Engineering at Brigham Young University.

A CoW in the Mountains

posted by Jeff Adams

Many of our readers will know that Cobalt is a virtual company; all our employees work from home.  People often ask us how we manage to maintain such great team cohesion and loyalty when we don’t see each other daily.  There are many answers to that question, but one of them is certainly our tradition of CoWs.

The Cobalt team poses for a team picture at Bryce Canyon National Park

Three times a year we gather the Cobalt team for a week of intensive work. As a distributed company without an office, these gatherings are an important way of staying connected. They also give us opportunities for strategic planning, deep dives on thorny technical issues, team building, and a lot of fun. We call this periodic meeting a CoW: Co is the chemical symbol for the element cobalt, and W stands for workshop.

Coby, our company mascot, is a regular fixture at our CoWs.

In June the CoW took us to Brian Head, Utah, into a world of fantastical red rock formations and a touch of altitude sickness. Brian Head has the distinction of being the municipality at the highest elevation anywhere in 49 of the 50 states (Colorado has higher towns still). Those of us living at or near sea level in various parts of the world found that the place literally took our breath away, at least for the first few days.

Throughout the week we spent time hiking, eating interesting food, stargazing, enjoying the mountain air, entertaining one another with evening games, and, it turned out, scrambling to bring in bottled water after receiving the news that the town’s water supply had been contaminated.

The pictures here might give the impression that we spent all our time hiking and having fun. In fact, the great majority of our time was spent working together (but that doesn’t make for great photos).

The team hard at work

One group made some advances in our domain-specific speech recognition work. Though there are other speech products that produce reasonable accuracy generally, Cobalt continues to improve our ability to dramatically minimize errors in situations where the topics are known beforehand. For example, in handling university-level chemistry lectures, we reduced the generic word error rate from 21.4% to 10.6%, where a leading competitor’s lowest error rate was 18.6%. We also experimented with data sets involving regional dialects & slang, business meetings, sports interviews, and courtroom recordings.
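For readers unfamiliar with the metric, word error rate (WER) is commonly computed as the word-level edit distance between a reference transcript and the recogniser's output, divided by the number of reference words. The Python sketch below shows that standard calculation, plus the relative reduction implied by the figures quoted above. The example sentences are invented for illustration and the function names are hypothetical; this is not Cobalt's evaluation tooling.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Invented example: two of six reference words are misrecognised.
wer = word_error_rate("add two moles of sodium chloride",
                      "add to moles of sodium fluoride")
print(f"WER: {wer:.1%}")

# Figures quoted in the post: 21.4% down to 10.6%.
print(f"Relative reduction: {relative_reduction(21.4, 10.6):.1%}")
```

Going from 21.4% to 10.6% removes roughly half of the errors in relative terms.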

Several of us worked on the new generation of the synthesized text-to-speech voice we’ve created, getting it running on ARM-64 devices (phones, tablets, and device-embedded chips), as well as optimizing its usage of CPU and memory.

Though not very flashy, we did a lot of work on improving our internal processes, such as making the tests we run to measure speed and accuracy more reliable and informative, and making some improvements to the servers we use for building new models.

We also submitted proposals for two government grants, one of which we’ve already been awarded. Finally, the management team regrouped on some topics for company strategy going forward.

Admiring the view from the rim of Bryce Canyon

In the end, this CoW in the Mountains was a lot of fun, but more importantly, was a productive week full of collaboration among colleagues and friends.

If you would like to be part of Cobalt, send us an email at

If you have a project you’d like us to take on, send us an email at

We would love to hear from you!

A magnificent view from Cedar Breaks National Monument, just 10 km from the location of our June CoW.

Scaling Virtual Assistants

Virtual assistants allow us to interact with technology by voice. They are built on a complex pipeline of AI technology that understands the breadth and complexity of spoken language. This pipeline includes automatic speech recognition, natural language understanding, dialogue management and text-to-speech components. The technology in the pipeline is based on machine learning - a subset of AI algorithms that learn their behaviour from data instead of being explicitly programmed.

Building and releasing a virtual assistant is a huge milestone for any organisation. Yet, in our experience, this milestone is only the beginning of a much larger effort to grow and scale. There are different dimensions in which to scale, and each direction brings both technical and organisational challenges. Three directions where we’ve seen virtual assistants successfully scale are users, functionality and languages.


Attracting more users is often the first thought when it comes to scaling. First and foremost, the production infrastructure that serves the assistant to users has to be scaled up to serve multiple users in parallel. Effective ways of scaling software services have been written about widely, and so the rest of this post will discuss more specific ways of scaling machine learning based voice technology.


More users will make requests in more varied ways, and so we must make the assistant robust to variation in its input. It’s particularly hard with natural language to predict in advance all the ways people will phrase a request. Machine learning is crucial for scalability as the underlying models can learn patterns from data and interpret phrases that haven’t been heard before. At the same time, we have to ensure that the system is designed for a wide range of people. Variation in spoken language comes from many sources, including accent, age and gender. Carefully selecting the data used for training and testing the underlying machine learning models helps ensure we build technology that performs well for different segments of the population.

Each machine learning component in the pipeline can be built and tested in isolation. It’s also important to develop end-to-end testing to ensure that the whole system works as intended. As an assistant scales to more users, it has more chances to go wrong. Users will report errors. Answers to questions can be wrong. End-to-end testing makes it simpler to track and fix these bugs across the complex pipeline of technology.


A second dimension of scale is in the types of request an assistant can respond to. AI assistants model language as intents and slots. Intents are broad categories of request such as asking to play music or to make a payment. Slots are the specific entities being talked about, like a song name or a payment amount. Adding new features usually means adding new types of intent and slot. When adding intents and slots, it’s a good idea to begin by placing yourself in the shoes of the end user. Imagine the kinds of conversations they will have with your assistant. From these imagined conversations, it can be easier to identify the new intents and slots that are needed.
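To make the intent and slot representation concrete, here is a small Python sketch. The intent names, slot names and regular expressions are all assumptions made up for illustration. A production assistant would predict intents and slots with machine learning models, not hand-written patterns, so treat this only as a picture of the output format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    intent: str                                # broad category of request
    slots: dict = field(default_factory=dict)  # entities mentioned in it


# Hypothetical patterns, one per intent; named groups become slots.
PATTERNS = [
    ("PlayMusic",
     re.compile(r"^(?:play|put on) (?P<song_name>.+)$")),
    ("MakePayment",
     re.compile(r"^(?:pay|send) (?P<payee>\w+) "
                r"(?P<amount>\d+(?:\.\d{2})?) (?:pounds|dollars)$")),
]


def interpret(utterance):
    """Return the first matching intent and its slots, else 'Unknown'."""
    text = utterance.lower().strip()
    for intent, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return Interpretation(intent, match.groupdict())
    return Interpretation("Unknown")


print(interpret("Play Yellow Submarine"))
print(interpret("Pay Alice 20 pounds"))
print(interpret("What's the weather like?"))
```

Adding a feature in this toy setup means adding a new entry to PATTERNS, which mirrors how new functionality usually introduces new intent and slot types.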


Inevitably, as new features are added there’ll be clashes between existing and new functionality that have to be resolved on a case-by-case basis. A request to ‘Turn it up’ almost certainly refers to the volume when the assistant is only a music player, but is far more ambiguous once it also offers smart home control. The resolution of these ambiguities usually needs discussion between both technical and business teams.

Building new functionality also requires data on which to train and test the underlying models. The best source of data is realistic data collected from real users. Synthetic data, even data elicited in trials, doesn’t exactly match real usage. Thus, there’s a tension between users’ privacy and collecting data. It’s necessary to be transparent with users about their rights, clearly explain if and how their data is used, and to ensure compliance with data handling laws when collecting data for training machine learning algorithms.

Machine learning models are evaluated by computing their accuracy on a set of test data. It’s a characteristic of machine learning models that you cannot guarantee their behaviour. Changing the data they’re trained on, even when only adding new data for new features, will change the model’s behaviour. Thus, it’s important to have good accuracy testing to ensure that performance on existing features is maintained as new features are built.
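One simple way to check that existing features are maintained is to compare per-intent accuracy before and after a model update on a fixed test set. The Python sketch below illustrates the idea with two toy stand-in "models" (plain functions). The intent names, test utterances and the 0.01 tolerance are assumptions chosen for the example, not recommended values.

```python
from collections import defaultdict


def accuracy_by_intent(model, test_set):
    """Fraction of test utterances classified correctly, per intent."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for utterance, expected in test_set:
        total[expected] += 1
        if model(utterance) == expected:
            correct[expected] += 1
    return {intent: correct[intent] / total[intent] for intent in total}


def find_regressions(old_scores, new_scores, tolerance=0.01):
    """Intents whose accuracy dropped by more than the tolerance."""
    return sorted(intent for intent, old in old_scores.items()
                  if new_scores.get(intent, 0.0) < old - tolerance)


# Toy stand-ins for a model before and after adding smart home control.
def old_model(text):
    return "PlayMusic" if text.startswith("play") else "SetVolume"


def new_model(text):
    if text.startswith("play"):
        return "PlayMusic"
    if "turn it up" in text:
        return "SetThermostat"  # the new feature now captures this phrase
    return "SetVolume"


test_set = [
    ("play some jazz", "PlayMusic"),
    ("play the beatles", "PlayMusic"),
    ("turn it up", "SetVolume"),
    ("volume down", "SetVolume"),
]

old_scores = accuracy_by_intent(old_model, test_set)
new_scores = accuracy_by_intent(new_model, test_set)
print(find_regressions(old_scores, new_scores))
```

In this toy case the check flags SetVolume, because "turn it up" is now routed to the new feature. Running such a comparison on every model update makes this kind of silent change visible before release.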


A third dimension is scaling to new languages. An assistant trained for English users obviously won’t work if you speak to it in German. Yet you don’t need to start entirely from scratch to build a fully working German assistant.


The infrastructure for building, testing and deploying your models can be shared between languages. Additionally, the intent and slot ontology can be shared. Where things diverge is the need to collect language-specific data on which to train the underlying models. Techniques like transfer learning can minimise the amount of new data that’s needed. Transfer learning is a way to take knowledge from one domain or language, and transfer it to another using a small amount of data. 

New languages may have characteristics that need language-specific solutions. For example:

  • Compound words can encode multiple entities

    • E.g. The German word “Rechtsschutzversicherungsgesellschaften” translates as ‘insurance companies providing legal protection’. The exact part of this word which is important to a query depends on context, so in such languages we can break compounds into sub-words.

  • Formality and politeness affect word choice

    • E.g. The Korean verb ‘nupta’, meaning ‘to lie down’ has different forms depending on politeness and formality, including nuweo, numneunda, nuweoyo and nupseumnida. These distinctions are important when designing the character of an assistant and determining how it should respond to users. 

  • Tonal languages use pitch to change the meaning of a word

    • E.g. “mā mā mà mǎ” in Mandarin means “Mum scolded the horse”. The words differ only in the pitch contour with which they’re spoken. Speech recognition for tonal languages uses pitch information to distinguish words.

  • Lack of spaces in some written text means it has to be automatically segmented into ‘words’ - necessitating another component in the pipeline

    • E.g. Japanese “おはようございます” (ohayō gozaimasu) means “good morning”
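The segmentation step mentioned in the last bullet can be illustrated with a greedy longest-match segmenter. This Python sketch is a deliberate simplification: practical systems typically rely on statistical or dictionary-based morphological analysers, and the two-entry lexicon here is a toy assumption chosen only to split the example phrase.

```python
def segment(text, lexicon):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon (an assumption for this example only).
lexicon = {"おはよう", "ございます"}
print(segment("おはようございます", lexicon))
```

With this lexicon the phrase is split into two units, おはよう and ございます. Greedy matching can choose wrong boundaries when entries overlap, which is one reason real segmenters use context.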

After the models for each language have been successfully built and deployed, we need a convenient way to update those models using new data. Language changes over time. Models ought to be updated and deployed regularly to keep up with the changing usage. This is especially true with fast moving domains like the news, where events happen suddenly and are talked about widely. 

Whether for financial advice, customer assistance, or another application, the goal in building any virtual assistant is to have it scale effectively so more people use it. Scaling virtual assistants brings a unique set of challenges. This post has discussed ways of overcoming those challenges when scaling to more users, features or languages. Yet, each organisation is different and there are plenty of tricky problems that we’re still figuring out. 

At Cobalt Speech, we specialise in building and scaling custom voice and language technology for our clients. Are you interested in building a secure and private virtual assistant for your enterprise? Get in touch to see how we can help you.

About the Author


Catherine Breslin is a machine learning scientist with experience in a wide range of voice and language technology. Her work spans from academic research improving speech recognition algorithms through to scaling Amazon’s Alexa. She’s based in Cambridge UK and coordinates Cobalt’s UK efforts.

This was originally given as a talk at Re-Work’s AI Assistant Summit in London, on Sept 19th 2019.

What are Virtual Assistants?

“Hey Computer, tell me the latest”

With the rise of virtual assistants like Amazon’s Alexa, Apple’s Siri and Google Assistant, we’re all beginning to get used to talking to our devices. In contrast to computers that have a keyboard and mouse, or tablets and phones with a touchscreen, virtual assistants let us interact using natural spoken language. Voice interfaces drastically simplify our interaction with technology.

To fulfil requests, virtual assistants are built on a complex pipeline of AI technology:

  • A Wakeword (WW) detector runs on the device, listening for the user to say a particular word or phrase to activate the assistant. It’s also possible to activate the assistant in other ways, like a push-to-talk button.

  • Automatic Speech Recognition (ASR) converts spoken audio from the user into a text transcription.

  • Natural Language Understanding (NLU) takes the transcription of what the user said and predicts their intention in a way that’s actionable. This component understands that users can make the same request in a multitude of different ways that should all have the same outcome.

  • The Dialogue Manager (DM) decides what to say back to the user, whether to take any action, and handles any conversation.

  • Text to Speech (TTS) is the output voice of the assistant.
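The way these components hand off to one another can be sketched in a few lines of Python. Everything below is a stand-in: the stage functions are trivial stubs that treat the "audio" as UTF-8 text so the example can run, and the class and function names are hypothetical. Only the order of the stages comes from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Assistant:
    detect_wakeword: Callable[[bytes], bool]  # WW
    recognise: Callable[[bytes], str]         # ASR
    understand: Callable[[str], dict]         # NLU
    decide: Callable[[dict], str]             # DM
    synthesise: Callable[[str], bytes]        # TTS

    def handle(self, audio: bytes) -> Optional[bytes]:
        """Run one request through the pipeline, or ignore it."""
        if not self.detect_wakeword(audio):
            return None
        transcript = self.recognise(audio)
        interpretation = self.understand(transcript)
        reply = self.decide(interpretation)
        return self.synthesise(reply)


WAKEWORD = "hey computer"

# Stub stages: the "audio" is just UTF-8 text in this sketch.
assistant = Assistant(
    detect_wakeword=lambda audio: audio.lower().startswith(WAKEWORD.encode()),
    recognise=lambda audio: audio.decode("utf-8")[len(WAKEWORD):].strip(" ,"),
    understand=lambda text: {"intent": "GetNews" if "latest" in text
                             else "Unknown"},
    decide=lambda interp: ("Here are today's headlines."
                           if interp["intent"] == "GetNews"
                           else "Sorry, I didn't catch that."),
    synthesise=lambda text: text.encode("utf-8"),
)

print(assistant.handle(b"Hey Computer, tell me the latest"))
print(assistant.handle(b"tell me the latest"))  # no wakeword, so None
```

Keeping each stage behind a simple function boundary is one way to allow a component to be replaced or tested on its own.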

The technology in this pipeline needs to cope with the breadth and ambiguity of natural language. Hence, alongside manually defined rules, it’s based on machine learning - a group of AI algorithms that learn their behaviour from data instead of being explicitly programmed. This allows assistants to learn how people speak and be able to generalise to new speakers or requests. 

Types of virtual assistant

AI assistants can be deployed in many ways - e.g. on a smartphone app, over a phone call, or on a dedicated device like a smart speaker. There are many places where virtual assistants are proving useful, and new applications are continually being built. The simplest setup is a command and control system. Here the user has just a few commands available to speak to control a device, with not much in the way of dialogue. Simple assistants are often used in environments where hands-free control improves efficiency, for example giving machine operators additional voice control on the factory floor. 

In a step up from command and control systems, many of today’s assistants are task-oriented. The user and computer work together to achieve well-defined tasks like making a bank transfer or finding a mortgage recommendation. These assistants typically work in narrow domains like finance or customer service and require some dialogue back and forth with the user to complete the task. 

More general virtual personal assistants like Amazon’s Alexa or Apple’s Siri handle many different enquiries across a number of domains. They allow you to play music, ask for the weather, control your smart home devices, ask for jokes and much more. Their interactions remain task-oriented, though they typically have some chatty responses to general enquiries. 

Academic research is moving beyond task-oriented dialogue towards new forms of conversational interaction. Fully conversational agents are some way from being built and deployed at scale, but current research is looking towards social forms of human-computer interaction. Competitions like the Alexa Prize - a university competition to build assistants that converse coherently and engagingly with humans - are showcasing some of these results.

Looking to the future

Despite their widespread adoption, AI assistants at scale are still in their infancy. Apple’s Siri launched on the iPhone relatively recently in 2011, and Amazon’s Alexa in 2014. The underlying technology is continually improving. In the next few years, we expect to see AI assistants become:

  • Customisable - organisations will more easily be able to build custom interactions. We are already starting to see the first tools to allow easy customisation of voice assistants.

  • Contextual - assistants will incorporate context from different sources. Relevant context can come from real-world knowledge, from personalised information about the user, or from the history of the current conversation.

  • Conversational - while human levels of conversation are still a long way in the future, AI assistants will incorporate more rudimentary conversational capability in the near future.

At Cobalt Speech, we specialise in building custom voice and language technology for our clients, whether one part of the pipeline or a complete virtual assistant. Are you interested in building a secure and private virtual assistant for your enterprise? Get in touch to see how we can help you.

About the Author


Catherine Breslin is a machine learning scientist with experience in a wide range of voice and language technology. Her work spans from academic research improving speech recognition algorithms through to scaling Amazon’s Alexa. She’s based in Cambridge UK and coordinates Cobalt’s UK efforts.

Introducing the CoBlog: Cobalt Blog

On any given day, my colleagues and I at Cobalt Speech and Language are tackling a variety of diverse problems in the world of speech processing, machine learning, and natural language processing. In order to invite others to have a look into our virtual workshop, we’re starting this blog. About once a week, you’ll find a short feature here that will help acquaint you with the people and projects that put Cobalt Speech at the forefront of progress in the world of speech technology.

To kick off this “CoBlog”, I thought it appropriate to share a little bit about why we started Cobalt in the first place.

Five years ago, I sat in my office at Amazon, getting ready for the upcoming Echo & Alexa launch. I thought back on the incredible three-year adventure that had prepared us for that moment. It had been an amazing experience growing the speech & NLU team at Amazon, defining and discovering what Alexa could do, and working with the hardware, applications, data, and other teams at Amazon. One thing that stood out to me was how much effort had gone into developing this novel product. I guessed that Amazon had invested hundreds of millions of dollars into preparing for a successful launch of Alexa and the Echo products.

The idea for Alexa was simple – a voice-based home assistant that could do simple things like play music and tell you the time or weather, but only a giant like Amazon could have pulled it off.  I wondered how many other entrepreneurs had similarly innovative ideas, but lacked the technical resources to execute their vision. In particular, having built speech & language research teams at Nuance, Yap, and now Amazon, I knew how hard it was to recruit top speech & language scientists and engineers and convince them to join a new company. 

That’s when the idea for Cobalt came to me.  Amazon, Microsoft, Google, and a few others have their own world-class speech & language teams.  Cobalt would be the speech team for everyone else. I would hire world-class speech & language scientists and engineers, we would develop our own, independent core software and tools, and we would use those tools to help build the dreams of everyone who needed that technology.  We would be the speech & language development partner for everyone who needed one.

Over the past five years, we have worked with about 100 different companies, developing all manner of applications with our technology, and it has been an incredible adventure. One of the interesting aspects is that while we have focused on speech & language technology, our customers focus on their respective specialties. Our partnerships have taught us things about those diverse domains, and we have enjoyed learning about speech applications that were new and different. We have learned about language pedagogy as we have developed tools to help language learners improve their pronunciation. We have learned about agriculture as we developed tools to help farmers and agronomists. We have learned about the worlds of finance, education, entertainment, call centers, assistive technologies, child development, government, and so many other areas. It has been very rewarding.

Over the coming months and years, we will share more in this blog about some of the projects we have worked on and some details about Cobalt’s technological suite, and we will introduce you to our team and how we work together.

We hope you’ll enjoy getting to know us.