Speaker Diarization Exploiting the Eigengap Criterion and Cluster Ensembles

A novel method for selecting the number of clusters in a speaker diarization system

This paper introduces the cluster score (C-score) as a measure for determining a suitable number of clusters when performing speaker clustering in a speaker diarization system. The C-score finds a trade-off between intra-cluster and extra-cluster similarities, selecting a number of clusters whose elements are similar to one another but different from the elements of other clusters. Speech utterances are represented by Gaussian mixture model mean supervectors; the projection of these supervectors into a low-dimensional discriminative subspace by linear discriminant analysis is also assessed. This technique shows robustness to segmentation errors and, compared with the widely used Bayesian information criterion (BIC)-based stopping criterion, results in a lower speaker clustering error and dramatically reduces computation time. Experiments were run using the broadcast news database used for the Albayzin 2010 Speaker Diarization Evaluation.

An adaptive initialization method for speaker diarization based on prosodic features

2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010

This article presents a novel, adaptive initialization scheme that can be applied to most state-of-the-art speaker diarization algorithms, i.e. algorithms that use agglomerative hierarchical clustering with the Bayesian Information Criterion (BIC) and Gaussian Mixture Models (GMMs) of frame-based cepstral features (MFCCs). The initialization method combines the recently proposed "adaptive seconds per Gaussian" (ASPG) method with a new method, based on prosodic features, for pre-clustering and for estimating the number of initial clusters. The presented initialization method has two important advantages. First, the method requires no manual tuning and is robust against file length and speaker count variations. Second, the method outperforms our previously used initialization methods on all benchmark files that were presented in the 2006, 2007, and 2009 NIST Rich Transcription (RT) evaluations and results in a Diarization Error Rate (DER) improvement of up to 67% (relative).

A review on speaker diarization systems and approaches

Speech Communication, 2012

Speaker indexing, or diarization, is an important task in audio processing and retrieval. Speaker diarization is the process of labeling a speech signal with labels corresponding to the identity of speakers. This paper gives a comprehensive review of the evolution of the technology and of the different approaches to speaker indexing, with a detailed discussion of these approaches and their contributions. It reviews the most common features for speaker diarization as well as the most important approaches to speech activity detection (SAD) in diarization frameworks. The two main tasks of speaker indexing are speaker segmentation and speaker clustering, and the approaches proposed for each of these subtasks are reviewed separately. Speaker diarization systems that combine the two tasks in a unified framework are also introduced. A further discussion concerns approaches to online speaker indexing, which differs fundamentally from traditional offline approaches. Other parts of this paper introduce the most common performance measures and evaluation datasets. To conclude, a complete framework for speaker indexing is proposed, which aims to be domain independent, parameter free, and applicable to both online and offline applications.

A comparison of distance measures for clustering in speaker diarization

2014

Matching video segments in order to detect their similarity is a necessary task in retrieval and summarization applications. In order to determine nearly identical content, such as repeated takes of the same scene, very precise matching of sequences of features extracted from the video segments needs to be performed. In this paper we compare the performance of three distance measures for the task of clustering multiple takes of the same scene: Dynamic Time Warping (DTW) and two variants of Longest Common Subsequence (LCSS). We also evaluate the influence of the quality of the input segmentation on the performance of the algorithms.
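As an illustration of the two families of distance measures named in this abstract, the following is a minimal Python sketch of textbook Dynamic Time Warping and of the basic Longest Common Subsequence similarity for sequences of feature vectors. It is not the implementation evaluated in the paper: the abstract does not say which two LCSS variants were used, so the function names `dtw_distance` and `lcss_length`, the Euclidean local cost, and the matching tolerance `eps` are all assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment of sequences a and b.

    a and b are lists of equal-dimension feature vectors. The local cost is
    the Euclidean distance (an assumption; other local costs are possible).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def lcss_length(a, b, eps):
    """Length of the longest common subsequence of a and b.

    Two feature vectors count as a match when their Euclidean distance is
    at most eps (hypothetical tolerance parameter).
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if math.dist(a[i - 1], b[j - 1]) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


if __name__ == "__main__":
    take_1 = [[0.0], [1.0], [2.0]]
    take_2 = [[0.0], [1.0], [1.0], [2.0]]  # same content, one frame held longer
    print(dtw_distance(take_1, take_2))
    print(lcss_length(take_1, [[0.0], [5.0], [2.0]], eps=0.1))
```

DTW returns a distance (lower means more similar) and tolerates local stretching, while LCSS returns a similarity count (higher means more similar) and simply skips frames that do not match, which makes it less sensitive to outliers.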

Adaptive speaker diarization of broadcast news based on factor analysis

Computer Speech & Language, 2017

The introduction of factor analysis techniques in a speaker diarization system enhances its performance by facilitating the use of speaker specific information, by improving the suppression of nuisance factors such as phonetic content, and by facilitating various forms of adaptation. This paper describes a state-of-the-art iVector-based diarization system which employs factor analysis and adaptation on all levels. The diarization modules relevant for this work are: the speaker segmentation which searches for speaker boundaries and the speaker clustering which aims at grouping speech segments of the same speaker. The speaker segmentation relies on speaker factors which are extracted on a frame-by-frame basis using eigenvoices. We incorporate soft voice activity detection in this extraction process as the speaker change detection should be based on speaker information only and we want it to disregard the non-speech frames by applying speech posteriors. Potential speaker boundaries are inserted at positions where rapid changes in speaker factors are witnessed. By employing Mahalanobis distances, the effect of the phonetic content can be further reduced, which results in more accurate speaker boundaries. This iVector-based segmentation significantly outperforms more common segmentation methods based on the Bayesian Information Criterion (BIC) or speech activity marks. The speaker clustering employs two-step Agglomerative Hierarchical Clustering (AHC): after initial BIC clustering, the second cluster stage is realized by either an iVector Probabilistic Linear Discriminant Analysis (PLDA) system or Cosine Distance Scoring (CDS) of extracted speaker factors. The segmentation system is made adaptive on a file-by-file basis by iterating the diarization process using eigenvoice matrices adapted (unsupervised) on the output of the previous iteration. 
Assuming that for most use cases material similar to the recording in question is readily available, unsupervised domain adaptation of the speaker clustering is possible as well. We obtain this by expanding the eigenvoice matrix used during speaker factor extraction for the CDS clustering stage with a small set of new eigenvoices that, in combination with the initial generic eigenvoices, models the recurring speakers and acoustic conditions more accurately. Experiments on the COST278 multilingual broadcast news database show the generation of significantly more accurate speaker boundaries by using adaptive speaker segmentation which also results in more accurate clustering. The obtained speaker error rate (SER) can be further reduced by another 13% relative to 7.4% via domain adaptation of the CDS clustering.

GTTS System for the Albayzin 2010 Speaker Diarization Evaluation

2010

This paper briefly describes the diarization system developed by the Software Technology Working Group (http://gtts.ehu.es) at the University of the Basque Country (EHU) for the Albayzin 2010 Speaker Diarization Evaluation. The system consists of three decoupled elements: (1) speech/non-speech segmentation; (2) acoustic change detection; and (3) clustering of speech segments. Speech/non-speech segmentation is performed by one of the systems submitted to the Albayzin 2010 Audio Segmentation Evaluation. To detect speaker changes, speech segments are further segmented by a naive metric-based approach that locates the most likely spectral change points. The third element is based on a dot-scoring speaker verification system: speech segments are represented by MAP-adapted GMM zero and first order statistics, dot scoring is applied to compute a similarity measure between segments (or clusters), and finally an agglomerative clustering algorithm is applied until no pair of clusters exceeds a similarity threshold.

Towards robust speaker segmentation: The ICSI-SRI fall 2004 diarization system

2004

We describe the ICSI-SRI entry in the Fall 2004 DARPA EARS Metadata Evaluation. The current system was derived from ICSI's Fall 2003 Speaker-attributed STT system. Our system is an agglomerative clustering system that uses a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. The main advantage of this approach is that it does not require pre-trained acoustic models, providing robustness and portability. Changes for this year's system include: different front-end features, the addition of SRI's Broadcast News speech/non-speech detector, and modifications to the segmentation routine. In post-evaluation work, we found further improvement by changing the stopping criterion from the BIC-like measure to a Viterbi measure. Additionally, we have explored issues related to pruning and improved initialization.
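The abstract does not give the formula of its "BIC-like" measure, so the sketch below shows only the textbook ΔBIC test on which such measures are commonly based, and should not be read as the ICSI-SRI criterion. It assumes each cluster is modeled by a single diagonal-covariance Gaussian; the function names, the penalty weight `lam`, and the variance floor are hypothetical choices made for the example.

```python
import math


def log_det_diag(frames, floor=1e-6):
    """Log-determinant of the ML diagonal covariance of a list of frames."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for k in range(dim):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, floor))  # floor avoids log(0)
    return total


def delta_bic(x, y, lam=1.0):
    """BIC of modeling x and y separately minus BIC of one shared model.

    A positive value favors keeping the clusters apart; a negative value
    favors merging them (sign conventions differ between papers).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    dim = len(x[0])
    # Log-likelihood gain of two Gaussians over one pooled Gaussian.
    gain = 0.5 * (
        n * log_det_diag(x + y) - n1 * log_det_diag(x) - n2 * log_det_diag(y)
    )
    # The extra model costs one more mean and one more diagonal variance.
    penalty = 0.5 * lam * (2 * dim) * math.log(n)
    return gain - penalty


if __name__ == "__main__":
    a = [[0.0], [1.0], [0.0], [1.0]]
    b = [[0.0], [1.0], [0.0], [1.0]]      # same statistics as a
    c = [[10.0], [11.0], [10.0], [11.0]]  # clearly different mean
    print(delta_bic(a, b) < 0)  # merging is preferred
    print(delta_bic(a, c) > 0)  # keeping them apart is preferred
```

In an agglomerative loop, the pair with the lowest ΔBIC would be merged at each step, and merging would stop once every remaining pair has a positive ΔBIC.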

On the use of dot scoring for speaker diarization

2011

In this paper, an alternative dot-scoring-based agglomerative hierarchical clustering approach for speaker diarization is presented. Dot scoring is a simple and fast technique used in speaker verification that makes use of a linearized procedure to score test segments against target models. In our speaker diarization approach, speech segments are represented by MAP-adapted GMM zero and first order statistics, dot scoring is applied to compute a similarity measure between segments (or clusters), and finally an agglomerative clustering algorithm is applied until no pair of clusters exceeds a similarity threshold. This diarization system was developed for the Albayzin 2010 Speaker Diarization Evaluation on broadcast news. Results show that the lowest error rate that the clustering algorithm could attain for the evaluation set was around 20% and that over-segmentation was the main source of degradation, due to the lack of robustness in the estimation of statistics for short segments.
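The stopping rule described in this abstract (merge until no pair of clusters exceeds a similarity threshold) can be sketched in a few lines of Python. This is a simplified illustration, not the authors' system: each segment is reduced to a plain vector standing in for its normalized statistics, a cluster is represented by the mean of its member vectors, and the name `agglomerate` and the threshold value are assumptions made for the example.

```python
def dot(u, v):
    """Plain dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(u, v))


def agglomerate(segments, threshold):
    """Greedy agglomerative clustering with a similarity-threshold stop.

    segments: list of equal-length vectors, one per speech segment.
    Returns a list of clusters, each a sorted list of segment indices.
    """
    dim = len(segments[0])
    # Each cluster holds its member indices and a representative vector.
    clusters = [([i], list(v)) for i, v in enumerate(segments)]
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = dot(clusters[i][1], clusters[j][1])
                if best_score is None or score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score <= threshold:
            break  # no pair of clusters exceeds the threshold
        i, j = best_pair
        members = clusters[i][0] + clusters[j][0]
        # Assumption: the merged cluster is the mean of its member vectors.
        mean = [
            sum(segments[k][d] for k in members) / len(members)
            for d in range(dim)
        ]
        clusters[i] = (members, mean)
        del clusters[j]
    return [sorted(members) for members, _ in clusters]


if __name__ == "__main__":
    segs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(agglomerate(segs, threshold=0.5))  # [[0, 1], [2, 3]]
```

The threshold controls the final number of speakers directly: a higher value stops merging earlier and yields more clusters, which is one way over-segmentation of the kind reported above can arise when segment statistics are unreliable.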

The Approach of Speaker Diarization by Gaussian Mixture Model (GMM)

Speaker identification is an important activity in the process of speaker diarization. For speaker identification, each speaker is modeled by a Gaussian mixture model (GMM). A large GMM, called a Universal Background Model (UBM), is adapted into each speaker model. This paper focuses on speech clustering for speaker diarization. Speaker diarization comprises speech segmentation followed by speech clustering. In speech segmentation, features are extracted from each speech segment in the form of Mel-Frequency Cepstral Coefficients (MFCCs). Each speech segment is then modeled by UBM adaptation, and related speech segments are grouped into speech clusters. This paper describes the speech segmentation, UBM adaptation, and speech clustering techniques.
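The abstract does not state which adaptation rule is used, so the following Python sketch assumes the common mean-only relevance MAP adaptation of a diagonal-covariance UBM. It is meant only to show what "modeling a segment by UBM adaptation" can look like; the function names, the toy two-component UBM, and the relevance factor of 16 are assumptions made for the example.

```python
import math


def posteriors(x, weights, means, variances):
    """Posterior probability of each UBM component given one frame x."""
    log_p = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xk, mk, vk in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vk) + (xk - mk) ** 2 / vk)
        log_p.append(ll)
    top = max(log_p)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in log_p]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one segment."""
    n_comp, dim = len(weights), len(means[0])
    zero = [0.0] * n_comp                          # zero order statistics
    first = [[0.0] * dim for _ in range(n_comp)]   # first order statistics
    for x in frames:
        gamma = posteriors(x, weights, means, variances)
        for c in range(n_comp):
            zero[c] += gamma[c]
            for k in range(dim):
                first[c][k] += gamma[c] * x[k]
    # Equivalent to alpha * E[x] + (1 - alpha) * mu, alpha = n / (n + r).
    return [
        [
            (first[c][k] + relevance * means[c][k]) / (zero[c] + relevance)
            for k in range(dim)
        ]
        for c in range(n_comp)
    ]


if __name__ == "__main__":
    ubm_weights = [0.5, 0.5]
    ubm_means = [[-5.0], [5.0]]
    ubm_vars = [[1.0], [1.0]]
    segment = [[6.0]] * 16  # toy segment: 16 identical one-dimensional frames
    adapted = map_adapt_means(segment, ubm_weights, ubm_means, ubm_vars)
    # Concatenating the adapted means gives a fixed-length segment vector.
    supervector = [v for mu in adapted for v in mu]
    print(supervector)
```

Components that see little data stay close to the UBM means, while well-observed components move toward the segment's own statistics, so short segments yield representations that remain near the UBM.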