A comparison of distance measures for clustering in speaker diarization

Efficient combination of parametric spaces, models and metrics for speaker diarization

2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007

In this paper we present a method of combining several acoustic parametric spaces, statistical models and distance metrics in the speaker diarization task. Focusing our interest on the post-segmentation part of the problem, we adopt an incremental feature selection and fusion algorithm, based on the Maximum Entropy Principle and the Iterative Scaling Algorithm, that combines several statistical distance measures on speech-chunk pairs. With this approach, we place the merging-of-chunks clustering process into a probabilistic framework. We also propose a decomposition of the input space according to gender, recording conditions and chunk lengths. The algorithm produced highly competitive results compared with state-of-the-art GMM-UBM methods.

T-test distance and clustering criterion for speaker diarization

2008

In this paper, we present an application of Student's t-test to measure the similarity between two speaker models. The measure is evaluated against other distance metrics (the Generalized Likelihood Ratio, the Cross Likelihood Ratio and the Normalized Cross Likelihood Ratio) in a speaker detection task. We also propose an objective criterion for speaker clustering. The criterion deduces the number of speakers automatically by maximizing the separation between intra-speaker distances and inter-speaker distances. It requires no development data and works well with various distance metrics. We then report the performance of our proposed similarity measure and objective criterion in a speaker diarization task. The system produces competitive results: a low speaker diarization error rate and high accuracy in detecting the number of speakers.
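
To make the baseline concrete, the following is a minimal sketch of the Generalized Likelihood Ratio named above, in its common single-Gaussian form: it compares modeling two segments with one full-covariance Gaussian against modeling each with its own. This is not the t-test measure proposed in the paper, whose exact formulation is not given here. The function names and the single-Gaussian assumption are illustrative choices.

```python
import numpy as np


def n_log_det_cov(frames):
    """Return n * log|Sigma| for the maximum-likelihood Gaussian fit to frames (n x d)."""
    n = frames.shape[0]
    # bias=True gives the maximum-likelihood covariance (divide by n, not n - 1).
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return n * logdet


def glr_distance(x, y):
    """Single-Gaussian GLR distance between two segments of feature frames.

    Assumption: each segment is an (n_frames, n_dims) array with enough frames
    for a non-singular covariance estimate. Larger values suggest that the two
    segments are better explained by separate models.
    """
    merged = np.vstack([x, y])
    return 0.5 * (n_log_det_cov(merged) - n_log_det_cov(x) - n_log_det_cov(y))
```

With maximum-likelihood estimates the value is non-negative, and it grows as the two segments' distributions move apart.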

Constrained speaker diarization of TV series based on visual patterns

2014 IEEE Spoken Language Technology Workshop (SLT), 2014

Speaker diarization, usually denoted as the "who spoke when" task, turns out to be particularly challenging when applied to fictional films, where many characters talk in various acoustic conditions (background music, sound effects, etc.). Despite this acoustic variability, such movies exhibit specific visual patterns in the dialogue scenes. In this paper, we introduce a two-step method to achieve speaker diarization in TV series: a speaker diarization is first performed locally in the scenes detected as dialogues; then, the hypothesized local speakers are merged in a second agglomerative clustering process, with the constraint that speakers locally hypothesized to be distinct must not be assigned to the same cluster. The performance of our approach is compared to that obtained by standard speaker diarization tools applied to the same data.
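
The second step described above can be sketched as agglomerative clustering with cannot-link constraints. The code below is an illustrative sketch only: average linkage, the distance threshold used as a stopping rule, and all names are assumptions, since the abstract does not specify them.

```python
import numpy as np


def constrained_agglomerative(dist, cannot_link, threshold):
    """Greedy bottom-up clustering that never merges a forbidden pair.

    dist        : (n, n) symmetric matrix of distances between local speakers.
    cannot_link : iterable of (i, j) pairs hypothesized to be distinct speakers.
    threshold   : stop when the closest allowed pair is farther than this
                  (hypothetical stopping rule).
    """
    clusters = [{i} for i in range(dist.shape[0])]
    forbidden = {frozenset(pair) for pair in cannot_link}

    def allowed(a, b):
        # A merge is allowed only if no member pair is marked cannot-link.
        return not any(frozenset((i, j)) in forbidden for i in a for j in b)

    def average_linkage(a, b):
        return float(np.mean([dist[i, j] for i in a for j in b]))

    while True:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                if not allowed(clusters[p], clusters[q]):
                    continue
                d = average_linkage(clusters[p], clusters[q])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, p, q)
        if best is None:
            return clusters
        _, p, q = best
        clusters[p] |= clusters[q]
        del clusters[q]
```

Because constraints are checked on every member pair, a cluster that absorbs one speaker inherits all of that speaker's cannot-link restrictions.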

Generalized Viterbi-based models for time-series segmentation and clustering applied to speaker diarization

Computer Speech & Language, 2017

Speaker diarization is the problem of partitioning a conversation among unknown speakers into speaker-homogeneous segments. State-of-the-art diarization systems are based on i-vector methodologies. However, these approaches require large quantities of training data, which must be obtained from an environment that is similar to that of the conversation being diarized. In this paper we present a diarization system that does not require such training data and needs only some development data for parameter tuning. This system is a generalization of the well-known hidden Markov model (HMM), a popular clustering algorithm trained by Viterbi statistics. Our proposed model, referred to as a hidden distortion model (HDM), is based on state distortion models and transition costs, for which probabilistic calculations are not mandatory, in contrast to the case of HMM. We provide a mathematical basis for our approach, and we demonstrate that Viterbi-based HMM can be seen as a special case of HDM. This proximity allows us to apply similar approaches for state-model training when the new paradigm is used to learn sequence dependencies. We carry out diarizations of two-speaker telephone conversations in order to evaluate the performance of HDM. When applied to conversations from the LDC CALLHOME database, HDM improves on the performance of a baseline HMM system by about 26% (relative improvement). Moreover, when applied to the NIST 2005 database, it yields a small improvement over the HMM system.
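
The idea of decoding with state distortions and transition costs instead of probabilities can be illustrated with a generic minimum-cost Viterbi pass. This is a sketch of the decoding step only, not the HDM training procedure from the paper; the function name and the cost matrices are hypothetical. Setting the costs to negative log probabilities recovers ordinary HMM Viterbi decoding.

```python
import numpy as np


def min_cost_path(distortion, transition):
    """Return the state sequence with the lowest total cost.

    distortion : (T, S) array, cost of assigning frame t to state s.
    transition : (S, S) array, cost of moving from state i to state j.
    """
    T, S = distortion.shape
    cost = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    cost[0] = distortion[0]
    for t in range(1, T):
        # total[i, j]: best cost of being in state i at t-1, then moving to j.
        total = cost[t - 1][:, None] + transition
        back[t] = np.argmin(total, axis=0)
        cost[t] = total[back[t], np.arange(S)] + distortion[t]
    # Trace back from the cheapest final state.
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A non-zero cost for changing state discourages rapid speaker switches, so a single frame that weakly favors the other speaker does not split a segment.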

A novel method for selecting the number of clusters in a speaker diarization system

This paper introduces the cluster score (C-score) as a measure for determining a suitable number of clusters when performing speaker clustering in a speaker diarization system. The C-score finds a trade-off between intra-cluster and extra-cluster similarities, selecting a number of clusters whose elements are similar to one another but different from the elements in other clusters. Speech utterances are represented by Gaussian mixture model mean supervectors; the projection of the supervectors into a low-dimensional discriminative subspace by linear discriminant analysis is also assessed. This technique shows robustness to segmentation errors and, compared with the widely used Bayesian information criterion (BIC)-based stopping criterion, results in a lower speaker clustering error and dramatically reduces computation time. Experiments were run using the broadcast news database used for the Albayzin 2010 Speaker Diarization Evaluation.

A novel DTW-based distance measure for speaker segmentation

2006

We present a novel distance measure for comparing two speech segments that is based on the segmental DTW algorithm. Our approach is based on the idea of finding word-level speech patterns that are repeated by the same speaker. Using this distance measure, we develop a speaker segmentation procedure and apply it to the task of segmenting multi-speaker lectures. We demonstrate that our approach is able to generate segmentations that correlate well with independently generated human segmentations.
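
For readers unfamiliar with dynamic time warping, the sketch below shows the classic whole-sequence DTW recursion. It is not the segmental variant used in the paper, which searches for matching sub-sequences; the Euclidean local cost and the function name are assumptions made for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between two sequences.

    a, b : sequences of feature frames, shape (n, d) and (m, d).
    """
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean frame distance
            # Extend the cheapest of the three allowed predecessor alignments.
            acc[i, j] = local + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])
```

Two renditions of the same pattern spoken at different speeds align with low cost, which is the property that makes DTW useful for spotting repeated word-level patterns.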

Speaker diarization using data-driven audio sequencing

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

In this paper, a speaker diarization system based on data-driven segmentation is proposed. In addition to the usual segmentation and clustering steps, a new module is added that detects segments repeated between episodes of the same show broadcast on different dates. This is achieved by using the ALISP-based audio identification system, which segments audio data into pseudo-phonetic units. The ALISP segmentation is then used to identify similar audio segments in TV and radio shows. The system was evaluated during the ETAPE 2011 evaluation campaign and obtained a Diarization Error Rate (DER) of 16.23%, which was the best result among seven participants.
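
As a reminder of what the reported figure measures, DER is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The helper below is a minimal sketch of that ratio; its name and arguments are hypothetical, and it does not reproduce the scoring rules of the ETAPE campaign (for example, any forgiveness collar or treatment of overlapping speech).

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction, given durations in the same unit (e.g. seconds).

    missed       : reference speech not attributed to any speaker.
    false_alarm  : non-speech labeled as speech.
    confusion    : speech attributed to the wrong speaker.
    total_speech : total duration of reference speech.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech
```

With made-up durations of 30 s missed, 20 s false alarm and 50 s confusion over 1000 s of reference speech, the function returns a DER of 10%.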