How Can I Integrate Cutting Edge ASR Into My Company's Software?

Submitted by Julie Sheffield on Wed, 12/11/2019 - 09:52

Now that you've learned how ASR systems work and understand the business value of keeping control of your data and using models tuned for your use case, you might be wondering: "How can I integrate cutting-edge ASR into my company's software?"  The answer is to use Cobalt's Cubic engine, which can run on your own server, laptop, mobile device, or embedded device.  Cobalt's experts are always experimenting with the latest techniques from published speech science research: comparing different neural network architectures and applying various algorithms for data augmentation, word embeddings, and language model rescoring.  They distill the panoply of fascinating academic results to find what has the most practical value for our customers.  Cubic is where the rubber meets the road: the engine that makes those cutting-edge breakthroughs available to those of us without PhDs.

Cobalt's Cubic Engine

Cubic is the runtime recognizer that decodes audio streams and is called by nearly all of Cobalt's products, from keyword spotting to pronunciation scoring to automatic transcription.  This post focuses on Cubic Server, a powerful interface to get transcription results from Cubic.

Some key benefits of Cubic Server:

  • Privacy by design. Protect your data and your competitive advantage by hosting your own speech recognition instead of sending your audio to a cloud service. Cubic does not require network connectivity, so it can run on an offline mobile device or embedded controller. This advantage has been essential to many of our projects, such as a smart assistant for agricultural workers who don't have reliable internet access and a transcription app for therapists, where the privacy of the conversation is paramount.

  • Straightforward SDKs in multiple languages. Cubic Server implements a gRPC API, which makes it easy to build a flexible connected system. Whether Cubic is running on the same device or on a remote server, your client code calls the same simple methods; our software development kit takes care of the details.  We currently support C++, C#, Go, Java and Python, and can build an SDK in whatever gRPC-supported language you need. Documentation is available here, along with sample code that allows you to connect to our demo server and try it yourself.

  • Fast decoding speed. Cobalt has implemented cutting-edge techniques to speed up processing without compromising accuracy. Our large general conversational models transcribe audio seven times faster than real time, meaning Cubic can produce the transcript for a full hour of audio in approximately 8.5 minutes.  Focused vocabulary and keyword-spotting models are even faster, reducing the amount of processing power needed to keep up with a real-time stream.


  • Streaming support, including partial results. Unlike some recognizers, which delay returning results until an entire file has been processed, Cubic decodes an audio stream and returns results as it goes.  Because context is important for determining the results accurately, Cubic returns the final result for an utterance whenever it reaches an endpoint: a pause, such as the end of a sentence.  To decrease latency, it can also be configured to return partial results before it reaches an endpoint.  A partial result consists of the most likely output up to that point and may be revised in the final output once the full context has been received.  For example, Cubic might first return "Force core…" as a partial result, but the full result would be "four score and seven years ago".

  • Scalability. Cubic has run on systems ranging from chips embedded in toys to server deployments supporting hundreds of simultaneous audio streams. The machine learning models are loaded once and can then be used by multiple threads per CPU.

  • Support for different outputs. Some use cases may require more than just the most likely transcript for each utterance. The Cubic API optionally provides much more detail if required, such as the n-best list of alternatives when a phrase is ambiguous, the start time, duration, and confidence of each word, or even the full confusion network. 

  • Configurable text formatting. A post-processing pipeline can apply capitalization, punctuation, numerals in place of number words, and your own custom formatting rules.  For example, the raw output "YOUR NEW IPHONE MUST HAVE COST AT LEAST SIX FIFTY RIGHT" would be formatted "Your new iPhone must have cost at least 650, right?"
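
To make the difference between partial and final results concrete, here is a minimal Python sketch of how client code might consume a stream of results. It does not use the Cubic SDK: the Result class, its is_partial field, the final_transcripts function, and the simulated stream (including the intermediate hypothesis) are all assumptions made up for illustration, so treat it as a picture of the pattern rather than the actual API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Result:
    """Hypothetical result message; NOT the Cubic SDK's actual type."""
    text: str
    is_partial: bool


def final_transcripts(results: Iterable[Result]) -> Iterator[str]:
    """Show partial hypotheses as they arrive, but yield only final text."""
    for result in results:
        if result.is_partial:
            # A partial result is provisional: display it for low latency,
            # knowing a later result may revise it.
            print(f"(partial) {result.text}")
        else:
            # A final result arrives at an endpoint (a pause) and
            # supersedes every partial result shown for that utterance.
            print(f"(final)   {result.text}")
            yield result.text


# Simulated stream standing in for responses from a recognizer (made-up data).
simulated_stream: List[Result] = [
    Result("Force core", is_partial=True),
    Result("four score and seven", is_partial=True),
    Result("four score and seven years ago", is_partial=False),
]

transcript = list(final_transcripts(simulated_stream))
```

A user interface would typically overwrite the displayed partial text each time a new result arrives and commit the text only when a final result comes in, which is why the sketch keeps nothing but the final strings.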

Your company's speech and language needs are unique. Cobalt's Cubic engine provides the power and flexibility you need to make your ideas a reality. Get in touch to see how we can help you.

About the Author

Julie Sheffield has 20 years of experience in the software industry, building enterprise platforms and applications, leading teams, and mentoring talented engineers. At Cobalt, she leads the engineering team to ensure the scalability and robustness of the various speech engines Cobalt deploys, and to support customers' integration of those engines into their own applications. Julie loves finding solutions, making the complex seem simple, and creating software that empowers people to accomplish more than they ever thought possible.