Who Spoke When?

Submitted By: Arif Haque


Many people have used automatic speech recognition systems to transcribe audio to text, but there is a host of other information that is useful to extract from a stream of audio. One task in particular is called diarization: who spoke when? Knowing this information can help with a range of downstream applications. For example, in meeting summarization, knowing who said something means you can accurately make notes and allocate action items. In video subtitling, the speech from different speakers can be color coded to better assist those who are hard of hearing. In a virtual assistant, background speech can be ignored to improve the performance of the assistant.

Cobalt teamed up with a client who provides clinical support to therapists in assessing patient risk. They use a regular smartphone app to assist the therapist, rather than a dedicated microphone array, so that patients do not feel uneasy about the presence of unfamiliar technology during a session. This setup presents some challenges for diarization. The smartphone microphone picks up a lot of ambient and reverberant noise, and the patient’s voice is often quieter than the therapist’s because of the placement of the phone. In practice, sessions often consist of the therapist speaking with some short interjections by the patient, and many short turns are a challenge for a diarization system. Yet, to provide accurate assistance, the app needs to separate the therapist’s voice from the patient’s with high accuracy.


A typical diarization system works as follows: it removes the non-speech audio using a voice activity detection (VAD) algorithm, splits the speech audio into 1-2 sec segments via a sliding window, estimates the speaker embedding of each segment, and then clusters those segments using a pairwise similarity score between embedding vectors.
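
The windowing and clustering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in this post: it assumes the VAD has already produced speech regions, that some speaker-embedding model (not shown) has turned each window into a vector, and that there are two speakers. The function names, the 1.5 second window, the 0.75 second hop, and the use of average-link agglomerative clustering on cosine similarity are all choices made for the example.

```python
import numpy as np

def sliding_windows(regions, win=1.5, hop=0.75):
    """Split (start, end) speech regions, in seconds, into overlapping windows.
    Regions shorter than one window are kept whole; a tail that does not fill
    a complete window is dropped in this simple version."""
    segments = []
    for start, end in regions:
        if end - start <= win:
            segments.append((start, end))
            continue
        t = start
        while t + win <= end:
            segments.append((t, t + win))
            t += hop
    return segments

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

def agglomerative_cluster(similarity, n_speakers=2):
    """Repeatedly merge the two clusters with the highest average pairwise
    similarity until `n_speakers` clusters remain; return a label per segment."""
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > n_speakers:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = similarity[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        a, b = best_pair
        merged = clusters.pop(b)   # b > a, so index a is unaffected
        clusters[a].extend(merged)
    labels = np.empty(len(similarity), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

Given an array `embeddings` with one row per window, `agglomerative_cluster(cosine_similarity_matrix(embeddings))` returns one speaker label per window, which can then be mapped back to the time spans from `sliding_windows`.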

Diarization error rate (DER) is a widely used metric to evaluate how good a diarization system is at knowing who spoke when. The metric is defined as the percentage of input speech that is wrongly labeled by the diarization system due to missed detection, false alarms and speaker confusion. For this particular task, however, DER is not suitable because 1) the downstream application is more concerned with the patient’s speech than the therapist’s, and 2) the fraction of patient speech is very low compared to the therapist’s speech. For these reasons, we focus on classifying the patient’s speech and use precision, recall and F1 score as our evaluation metrics.
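
To make the definition concrete, here is a small sketch of the DER calculation. The function name and the durations in the example are hypothetical, chosen only to illustrate why a low DER can hide poor performance on the minority speaker.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER as a percentage. All arguments are durations in seconds:
    missed speech, false-alarm speech, speech attributed to the wrong
    speaker, and the total amount of reference speech."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech

# Hypothetical session: 1000 s of speech, of which 50 s is the patient.
# A system that labels everything as the therapist makes 50 s of
# speaker-confusion errors and nothing else.
der = diarization_error_rate(missed=0, false_alarm=0, confusion=50,
                             total_speech=1000)
print(f"DER = {der:.1f}%")
```

In this made-up case the DER looks respectable even though none of the patient's speech was found, which is the situation the two reasons above describe.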

Precision measures what percentage of the speech that the system attributes to the patient actually comes from the patient.

Recall measures what percentage of the patient’s speech the system correctly identifies as being from the patient.

As with all classification tasks, there is a trade-off between precision and recall. We can maximize one, but usually at the expense of the other. To find a good balance, we combine the two metrics using the F1 score.

F1 score is the harmonic mean of precision and recall:

F1 = 2 × Precision × Recall / (Precision + Recall)
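
As a minimal sketch, the three metrics could be computed from frame-level labels as follows. The representation is an assumption made for the example: each audio frame is marked True if it contains patient speech, once in the reference annotation and once in the system output. The function name is ours.

```python
import numpy as np

def patient_scores(reference, hypothesis):
    """Frame-level precision, recall and F1 for the patient class.
    `reference` and `hypothesis` are equal-length sequences with one
    truthy/falsy entry per frame (True = patient speech)."""
    ref = np.asarray(reference, dtype=bool)
    hyp = np.asarray(hypothesis, dtype=bool)
    true_pos = int(np.sum(ref & hyp))    # patient frames found
    false_pos = int(np.sum(~ref & hyp))  # frames wrongly given to the patient
    false_neg = int(np.sum(ref & ~hyp))  # patient frames missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A system that never labels any frame as patient speech gets a recall of zero here, however well it handles the therapist's speech.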



To improve on the baseline for this task, a third-party cloud service, Cobalt made a number of updates and modifications.

(i) Integration with the ASR system: Diarization is usually run as a pre-processing step before the speech recognition system is run. However, the ASR system also predicts some speaker-specific characteristics that should help the diarization. By integrating the diarizer and the ASR system, both can benefit. The ASR can return transcripts with speaker labels, and the diarizer gets an ASR-enabled voice activity detector (VAD) that is much more accurate than an energy-based VAD. This setup is shown in Figure 1.

[Figure 1: ASR and speaker diarization (SD) integration]

(ii) Streaming diarization: A conventional diarization system requires the entire audio file to be available before starting the clustering process. We implemented a streaming mode which enables us to start diarization as soon as the audio begins streaming. The diarization is performed in two stages. In the first stage, diarization is performed on local chunks, and some representative segments from each chunk are passed to the second stage. In the second stage, these accumulated segments are clustered and the diarization result is generated for the entire file. The result is available 10-30 seconds after the end of streaming, and so reaches downstream processes sooner than with a non-streaming diarization algorithm.
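
The two-stage idea can be illustrated with a toy sketch. This is not Cobalt's implementation: the greedy "leader" clustering, the similarity threshold, the choice of representatives (the segments closest to each local centroid), and all function names are assumptions made for the example. It also assumes each chunk arrives as a non-empty array of segment embeddings.

```python
import numpy as np

def unit(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def leader_cluster(embeddings, threshold=0.5):
    """Greedy clustering: each embedding joins the most similar existing
    cluster if the cosine similarity to its centroid reaches `threshold`,
    otherwise it starts a new cluster. Returns labels and unit centroids."""
    embeddings = unit(np.asarray(embeddings, dtype=float))
    sums, labels = [], []
    for e in embeddings:
        sims = [float(e @ unit(s)) for s in sums]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            sums[k] = sums[k] + e
        else:
            k = len(sums)
            sums.append(e.copy())
        labels.append(k)
    return np.array(labels), unit(np.stack(sums))

def first_pass(chunk_embeddings, per_cluster=2):
    """Stage 1: cluster one chunk locally and keep, for each local cluster,
    the `per_cluster` segments closest to its centroid as representatives."""
    chunk_embeddings = np.asarray(chunk_embeddings, dtype=float)
    labels, centroids = leader_cluster(chunk_embeddings)
    representatives = []
    for k, centroid in enumerate(centroids):
        idx = np.flatnonzero(labels == k)
        order = np.argsort(-(unit(chunk_embeddings[idx]) @ centroid))
        representatives.extend(chunk_embeddings[idx[order[:per_cluster]]])
    return np.stack(representatives)

def second_pass(all_embeddings, representatives):
    """Stage 2: cluster the accumulated representatives, then label every
    segment in the file by its nearest global centroid."""
    _, centroids = leader_cluster(representatives)
    all_embeddings = unit(np.asarray(all_embeddings, dtype=float))
    return np.argmax(all_embeddings @ centroids.T, axis=1)
```

In use, `first_pass` would run on each chunk while audio is still arriving, so that only the small set of representatives needs clustering once the stream ends: `second_pass(np.concatenate(chunks), np.concatenate([first_pass(c) for c in chunks]))`.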

(iii) Training-dataset augmentation: The dataset for training was augmented with volume perturbation, noise, and reverberation. Data augmentation is a common technique in machine learning to increase the training data variability, and has been used for many years by the speech community. 
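
For readers unfamiliar with these augmentations, the sketch below shows one simple way each could be applied to a mono waveform held in a NumPy array. It is illustrative only: white Gaussian noise and a synthetic, exponentially decaying impulse response stand in for the recorded noise and measured room responses that are more commonly used, and the function names and parameters are ours.

```python
import numpy as np

def perturb_volume(wave, gain_db):
    """Scale the waveform by a gain given in decibels."""
    return wave * 10.0 ** (gain_db / 20.0)

def add_noise(wave, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio.
    (Stand-in for mixing in recorded background noise.)"""
    noise = rng.normal(size=wave.shape)
    signal_power = np.mean(wave ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return wave + scale * noise

def add_reverb(wave, sample_rate, rt60, rng):
    """Convolve with a synthetic impulse response: noise whose amplitude
    decays by 60 dB over `rt60` seconds. (Stand-in for a measured room
    impulse response.) Output is trimmed to the input length."""
    t = np.arange(int(rt60 * sample_rate)) / sample_rate
    impulse = rng.normal(size=t.shape) * 10.0 ** (-3.0 * t / rt60)
    impulse[0] = 1.0                          # direct path
    impulse /= np.sqrt(np.sum(impulse ** 2))  # unit energy
    return np.convolve(wave, impulse)[: len(wave)]

# Example: one randomly augmented copy of a training utterance.
rng = np.random.default_rng(0)
sample_rate = 8000
utterance = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
augmented = add_noise(
    add_reverb(perturb_volume(utterance, rng.uniform(-6, 6)),
               sample_rate, rt60=0.3, rng=rng),
    snr_db=rng.uniform(5, 20), rng=rng)
```

The gain, reverberation time and SNR ranges in the example are arbitrary; in practice they would be chosen to resemble the recording conditions of the target application.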

(iv) Variational-Bayesian refinement: With a view to capturing the patient’s short responses and interjections, we added a post-processing step based on Variational-Bayesian (VB) resegmentation. This step generally increases recall at the expense of precision.

Combining these improvements and tuning for our client’s data gives a marked improvement in performance against a baseline system from a third-party cloud provider, as can be seen in Table 1.

[Table 1: Diarization results]

Customized diarization proved to be a substantial improvement for our client. Diarization accuracy improved markedly by leveraging the best techniques and relevant data, giving our client’s application a much stronger foundation for success. Additionally, our attention to their specific use case allowed a dramatic reduction in the delay between the end of the interview and the availability of the assessment results. With the streaming implementation, complete results for an hour-long interview were available within 30 seconds of its completion. In comparison, non-streaming processing could take 10-15 minutes. This project represents Cobalt’s ideal of customizing speech applications: better performance and an efficient solution for a high-value application.

If diarization is a functionality that you’re interested in using, please get in touch and we’d be happy to chat about your requirements. 

About the Author

Dr Arif Haque is a professor at the Bangladesh University of Engineering and Technology. He was a postdoctoral fellow at Concordia University, Montreal, Canada. He developed several multi-microphone speech dereverberation techniques during his PhD work. Arif has coached his students, who have been consistent winners or runners-up in the annual Signal Processing Cup organized by the IEEE Signal Processing Society. His specialties include digital signal processing, speech processing, and machine learning.
