IMPROVING SPEAKER DIARIZATION
Many people have used automatic speech recognition (ASR) systems to transcribe audio to text, but there is a host of other information that is useful to extract from a stream of audio. One such task is diarization: who spoke when? Knowing this can help with a range of downstream applications. For example, in meeting summarization, knowing who said what means you can make accurate notes and allocate action items. In video subtitling, the speech from different speakers can be color coded to better assist those who are hard of hearing. In a virtual assistant, background speech can be ignored to improve the performance of the assistant.
Cobalt teamed up with a client who provides clinical support to therapists in assessing patient risk. They use a regular smartphone app to assist the therapist, rather than a dedicated microphone array, so that patients are not made uneasy during a session by the presence of unfamiliar technology. This setup presents some challenges for diarization. The smartphone microphone picks up a lot of ambient noise and reverberation, and the patient’s voice is often quieter than the therapist’s because of where the phone is placed. In practice, sessions often consist mostly of the therapist speaking, with short interjections by the patient, and many short turns are a challenge for a diarization system. Yet, to provide accurate assistance, the app needs to separate the therapist’s voice from the patient’s with high accuracy.
A typical diarization system works as follows: it removes the non-speech audio using a voice activity detection (VAD) algorithm, splits the speech audio into 1-2 sec segments via a sliding window, estimates the speaker embedding of each segment, and then clusters those segments using a pairwise similarity score between embedding vectors.
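The four stages above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Cobalt's implementation: the function names, thresholds and window sizes are hypothetical, the VAD is a simple energy threshold, the embeddings are hand-written two-dimensional vectors standing in for the output of a speaker embedding model, and the clustering is a greedy centroid scheme rather than the pairwise agglomerative or spectral clustering a production system would more likely use.

```python
import math

def energy_vad(frame_energies, frame_dur=0.5, threshold=0.1):
    """Toy VAD: return (start, end) times of runs of frames above threshold."""
    regions, start = [], None
    for i, energy in enumerate(frame_energies):
        if energy >= threshold and start is None:
            start = i * frame_dur
        elif energy < threshold and start is not None:
            regions.append((start, i * frame_dur))
            start = None
    if start is not None:
        regions.append((start, len(frame_energies) * frame_dur))
    return regions

def sliding_windows(speech_regions, window=1.0, hop=0.5):
    """Split each speech region into overlapping fixed-length segments."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t < end:
            segments.append((t, min(t + window, end)))
            if t + window >= end:
                break
            t += hop
    return segments

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(embeddings, threshold=0.7):
    """Greedy clustering (a simplification): join the most similar
    existing centroid if it is similar enough, else open a new cluster."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
            labels.append(best)
        else:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels

# Hypothetical frame energies (0.5 s frames) and stand-in embeddings.
regions = energy_vad([0.0, 0.5, 0.5, 0.5, 0.5, 0.0, 0.5, 0.5])
segments = sliding_windows(regions)
labels = cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```

In a real system the embedding step would be a trained speaker embedding model applied to each segment's audio; it is replaced here by fixed vectors so that the clustering step can be followed by hand.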
Diarization error rate (DER) is a widely used metric for evaluating how well a diarization system knows who spoke when. It is defined as the percentage of input speech that is wrongly labeled by the diarization system due to missed detections, false alarms and speaker confusion. For this particular task, however, DER is not suitable because 1) the downstream application is more concerned with the patient’s speech than the therapist’s, and 2) the patient’s speech makes up a very small fraction of the session compared to the therapist’s, so errors on it barely move the overall DER. For these reasons, we focus on classifying the patient’s speech and use precision, recall and F1 score as our evaluation metrics.
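A small example illustrates the second point. The sketch below computes a simplified, frame-level DER. It assumes that the hypothesis labels have already been mapped to the reference speakers and that every frame has equal duration; standard DER scoring works on time durations and finds the best speaker mapping itself. The 90/10 split between therapist and patient frames is purely illustrative and is not taken from the client's data.

```python
def frame_der(reference, hypothesis):
    """Simplified frame-level DER. None marks a non-speech frame.

    Assumes hypothesis speaker labels are already mapped to reference labels.
    """
    missed = false_alarm = confusion = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None and hyp is None:
            missed += 1          # speech labeled as non-speech
        elif ref is None and hyp is not None:
            false_alarm += 1     # non-speech labeled as speech
        elif ref is not None and ref != hyp:
            confusion += 1       # speech attributed to the wrong speaker
    total_speech = sum(ref is not None for ref in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_speech

# Illustrative session: 90 therapist frames, 10 patient frames.
reference = ["therapist"] * 90 + ["patient"] * 10
# A degenerate system that attributes everything to the therapist.
hypothesis = ["therapist"] * 100

der = frame_der(reference, hypothesis)
```

Here `der` is 0.1, a figure that looks respectable, even though the system has not found a single frame of patient speech.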
Precision measures what percentage of the speech that the system attributes to the patient actually comes from the patient.
Recall measures what percentage of the patient’s speech the system correctly identifies as being from the patient.
As with all classification tasks, there is a trade-off between precision and recall. We can maximise one, but usually at the expense of the other. To find a good balance, we combine the two metrics using the F1 score.
F1 score is the harmonic mean of precision and recall:
F1 = 2 × Precision × Recall / (Precision + Recall)
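The three metrics can be computed for the patient class as in the following minimal sketch. It assumes frame-level labels of equal duration; the function name `patient_prf` and the label strings are hypothetical, and the example data is made up for illustration.

```python
def patient_prf(reference, hypothesis, target="patient"):
    """Precision, recall and F1 for one speaker, from frame-level labels."""
    pairs = list(zip(reference, hypothesis))
    tp = sum(r == target and h == target for r, h in pairs)  # correctly found
    fp = sum(r != target and h == target for r, h in pairs)  # wrongly claimed
    fn = sum(r == target and h != target for r, h in pairs)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up example: 6 therapist frames followed by 4 patient frames.
reference = ["therapist"] * 6 + ["patient"] * 4
hypothesis = ["therapist"] * 5 + ["patient"] * 4 + ["therapist"]

precision, recall, f1 = patient_prf(reference, hypothesis)
```

In this example the system labels four frames as patient speech, three of them correctly, and misses one of the four true patient frames, so precision, recall and F1 all come out at 0.75.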
To improve on our baseline system for this task, a third-party cloud service, Cobalt made a number of updates and modifications.
(i) Integration with the ASR system: Diarization is usually run as a pre-processing step before the speech recognition system is run. However, the ASR system also predicts some speaker-specific characteristics which ought to help the diarization. By integrating the diarizer and the ASR system, both can benefit. The ASR can return transcripts with speaker labels and the diarizer gets an ASR-enabled voice activity detector (VAD) which is much more accurate than an energy-based VAD. This setup is shown in figure 1.
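One way to picture the two directions of this integration is the sketch below. It is an assumption-laden illustration, not a description of Cobalt's system: it supposes that the ASR returns word timings as (text, start, end) tuples and that the diarizer returns (speaker, start, end) segments, and the function names and the 0.3 second gap are hypothetical. The first function derives speech regions from the recognized words, standing in for an ASR-enabled VAD; the second attaches a speaker label to each word.

```python
def speech_regions_from_words(words, max_gap=0.3):
    """Merge ASR word timings into speech regions, bridging short gaps."""
    regions = []
    for _, start, end in sorted(words, key=lambda w: w[1]):
        if regions and start - regions[-1][1] <= max_gap:
            regions[-1][1] = max(regions[-1][1], end)  # extend current region
        else:
            regions.append([start, end])               # open a new region
    return [tuple(r) for r in regions]

def label_words(words, speaker_segments):
    """Give each word the speaker whose segment overlaps it the most."""
    labeled = []
    for text, start, end in words:
        best, best_overlap = None, 0.0
        for speaker, seg_start, seg_end in speaker_segments:
            overlap = min(end, seg_end) - max(start, seg_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

# Hypothetical ASR output and diarizer output.
words = [("how", 0.0, 0.3), ("are", 0.35, 0.5),
         ("you", 0.55, 0.8), ("fine", 1.6, 2.0)]
speaker_segments = [("therapist", 0.0, 1.0), ("patient", 1.5, 2.2)]

regions = speech_regions_from_words(words)
transcript = label_words(words, speaker_segments)
```

Words that overlap no speaker segment are left with a label of `None`, which makes gaps in the diarizer output easy to spot.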