Speaker
diarization is often performed as a first step in speaker or speech recognition
systems, which perform better when the input signal is segmented by speaker.
When performing speaker diarization, it is common to use an agglomerative
clustering approach in which the acoustic data is first split into small
segments and cluster pairs are then iteratively merged until a stopping
criterion is reached. The speaker clusters often contain non-speech frames that
jeopardize discrimination between speakers, making it harder to decide which
two clusters to merge and when to stop the clustering. In this paper, we
present an algorithm that aims to
purify the clusters by eliminating non-discriminant frames, selected using a
likelihood-based metric, when comparing two clusters. We show relative
improvements of over 15.5% on three datasets from the most recent Rich
Transcription (RT) evaluations.
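
As a rough, non-authoritative sketch of the purification idea summarized above,
the Python snippet below drops the lowest-likelihood frames of each cluster
under its own model before computing a merge score between two clusters. The
GMM cluster models, the keep_fraction parameter, and the cross-likelihood merge
score are illustrative assumptions; the exact likelihood-based metric and model
configuration used in the paper are not specified in this abstract.

# Illustrative sketch only: likelihood-based frame purification before
# comparing two speaker clusters (assumed GMM models and cross-likelihood).
import numpy as np
from sklearn.mixture import GaussianMixture


def purify(frames, n_components=4, keep_fraction=0.8):
    """Keep only the frames with the highest likelihood under the cluster's own model."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0).fit(frames)
    ll = gmm.score_samples(frames)                    # per-frame log-likelihood
    keep = ll >= np.quantile(ll, 1.0 - keep_fraction) # drop lowest-likelihood frames
    return frames[keep]


def merge_score(frames_a, frames_b, n_components=4):
    """Cross-likelihood score between two purified clusters (higher = more similar)."""
    a, b = purify(frames_a), purify(frames_b)
    gmm_a = GaussianMixture(n_components=n_components, covariance_type="diag",
                            random_state=0).fit(a)
    gmm_b = GaussianMixture(n_components=n_components, covariance_type="diag",
                            random_state=0).fit(b)
    return gmm_a.score(b) + gmm_b.score(a)            # average cross log-likelihoods


# Toy usage with MFCC-like feature frames for two clusters.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 1.0, size=(200, 13))
cluster_b = rng.normal(0.5, 1.0, size=(200, 13))
print(merge_score(cluster_a, cluster_b))

In an agglomerative loop, such a score would be computed for every cluster
pair, the best-scoring pair merged, and the process stopped once no pair
exceeds a threshold; the purification step aims to keep non-speech frames from
biasing that comparison.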