Speech Recognition & Transcription

Transcribe™

Cobalt's Latest End-to-End Speech Recognition Engines

Cobalt Transcribe is a state-of-the-art speech recognition system that uses Deep Neural Networks (DNNs) for fast, accurate transcription.

Cobalt Transcribe supports two different DNN architectures:

  1. Hybrid models combine separately tunable Acoustic Models, Lexicons, and Language Models, making them highly customizable for specific use cases. Hybrid models support extremely low-latency partial results.
  2. End-to-end models go straight from sounds to words in the same DNN. They tend to be more accurate for general use cases, particularly for systems in which sub-second response time is not required.

Cobalt Transcribe is a highly flexible system that can run on-premises, in your private cloud, or fully embedded on your device. Your data, both the audio and the transcripts, never leaves your control.

  • Both end-to-end and hybrid deep learning models
  • Compatible with pre-built Kaldi models
  • Can be run on Intel™, AMD™, and ARM-64 processors

Cobalt ASR has been used in a variety of applications, including:

  • Contact center call transcription
  • Transcription of courtroom proceedings & legal depositions
  • Conversational interfaces for controlling apps and appliances
  • Smart home applications
  • Warehouse & inventory control
  • Inspection & Reporting (e.g. in agriculture, hospitality, machinery)


A large call center analytics company could not transcribe its clients’ audio with its usual solution because regulatory and security requirements prohibited sharing data with a third-party cloud vendor in a multi-tenant environment.


The company licensed Cobalt’s best-in-breed technology and deployed it in its clients’ data centers, where the audio is processed securely.


The OEM partner has significantly grown market share in regulated verticals such as financial services, healthcare and insurance.


Speech Recognition

Transcribe Tuner™

Use your own data to improve your speech recognition models.

Run the Tuner with audio files and human-corrected transcripts to make the model more robust to your unique acoustic environment (noisy background, specific accents, etc.) or business-specific needs.

Add in-domain text documents such as user manuals, customer service scripts, or other written work similar to the kind of dialogue you expect to transcribe. This biases your transcription model toward recognizing ambiguous phrases in the way most appropriate for your usage. For example, “low number” and “loan number” sound nearly identical; a bank could customize the model to be biased toward recognizing “loan number”.
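The effect of in-domain text can be illustrated with a toy example. The sketch below is not Cobalt's implementation or API; it assumes a simple add-one smoothed bigram model, and the function names and sample sentences are hypothetical. It shows how counts from bank-specific text make “loan number” score higher than the similar-sounding “low number”.

```python
from collections import Counter
import math

def train_bigram_counts(sentences):
    """Count unigrams and bigrams in lowercased, whitespace-tokenized text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_score(hypothesis, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of one hypothesis."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + hypothesis.lower().split()
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )

# Hypothetical in-domain text supplied by a bank.
bank_text = [
    "please give me your loan number",
    "your loan number is on the statement",
    "what is the loan number for this account",
]
unigrams, bigrams = train_bigram_counts(bank_text)

# Two hypotheses that sound nearly identical.
candidates = ["what is your low number", "what is your loan number"]
best = max(candidates, key=lambda h: log_score(h, unigrams, bigrams))
print(best)
```

A production system would combine a score like this with the acoustic evidence rather than use the text model alone.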

Add lists of custom vocabulary words, with optional pronunciations and comparable common words. For example, if your new word is “Hooli”, you can provide “Google”, “Amazon”, or “Facebook” as comparables, and the Tuner will generate sentences for the new word modeled on the contexts in which those common company names appear.
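One way to picture the comparables idea is simple substitution. The sketch below is an illustration only, not the Tuner's actual method; the function name and the sample corpus are hypothetical. It copies every sentence that mentions a comparable word and swaps in the new word.

```python
import re

def sentences_for_new_word(new_word, comparables, corpus):
    """Copy each corpus sentence that mentions a comparable, swapping in new_word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in comparables) + r")\b"
    )
    return [pattern.sub(new_word, s) for s in corpus if pattern.search(s)]

# Hypothetical general-purpose text containing the comparable company names.
corpus = [
    "I saw the news about Google yesterday",
    "She works at Amazon in Seattle",
    "The weather is nice today",
]
generated = sentences_for_new_word("Hooli", ["Google", "Amazon", "Facebook"], corpus)
for sentence in generated:
    print(sentence)
```

Sentences produced this way give a language model examples of the new word in realistic contexts, even though the word never appeared in the original text.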

Voice Channel™

Multi-speaker separator

Cobalt’s Voice Channel engine differentiates between speakers in a conversation based on distinct characteristics of their voices. Our engine greatly improves the utility of automatic speech recognition when multiple speakers are recorded on a single channel.

By incorporating neural networks, Cobalt’s Voice Channel diarization system can detect each speaker and segment their speech into a separate channel.

As multi-speaker audio from broadcasts, Google Meet™, Zoom™, and Webex™ meetings, and many other sources has increased, so has the need to separate speakers for analysis or editing.

Cobalt Detect™

Word and Phrase Detection

Cobalt’s Detect engine can spot words or phrases in real time or in a collection of recorded audio. Because it operates phonetically, it can recognize search terms that are not in a dictionary.
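To show why phonetic matching does not depend on a dictionary, here is a minimal sketch. It is not the Detect engine's algorithm or API: it assumes the audio has already been decoded into a phoneme sequence and that the search term has been converted to phonemes by letter-to-sound rules. The function names, phoneme symbols, and sample data are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # drop a phoneme from a
                            curr[j - 1] + 1,            # add a phoneme from b
                            prev[j - 1] + (pa != pb)))  # match or substitute
        prev = curr
    return prev[-1]

def spot(query, stream, max_errors=1):
    """Return start indices of same-length windows within max_errors of query."""
    n = len(query)
    return [
        i for i in range(len(stream) - n + 1)
        if edit_distance(query, stream[i:i + n]) <= max_errors
    ]

# Hypothetical phoneme output for the spoken phrase "call Hooli today".
stream = ["K", "AO", "L", "HH", "UW", "L", "IY", "T", "AH", "D", "EY"]
# "Hooli" as phonemes, assumed to come from letter-to-sound rules.
query = ["HH", "UW", "L", "IY"]

hits = spot(query, stream, max_errors=0)
print(hits)
```

Allowing a small `max_errors` lets the search tolerate an occasional misrecognized phoneme, at the cost of more false alarms on short search terms.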

Customer experience improves when you can detect specific words and phrases as they are spoken during an interaction.

Automatically detecting words, phrases, combinations, and gaps can support contact center escalations and risk management.

Need Something Custom?

Cobalt’s products are very flexible and can be adapted to many scenarios.

For example, we can train models to handle:

  • domain-specific language (e.g. medical, legal, agricultural, industrial, specific applications)
  • noisy or unusual acoustic conditions
  • alternate microphone configurations
  • accented speech (e.g. non-native speakers)
  • children’s speech
  • dysarthric speech