New Advances in Speaker Diarization
Related papers
Unsupervised deep feature embeddings for speaker diarization
TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES, 2019
Speaker diarization aims to determine "who spoke when?" in multispeaker recordings. In this paper, we propose to learn a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. These embeddings are learned by a deep autoencoder trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames. The learned embeddings are then used in Gaussian mixture model based hierarchical clustering for diarization. The results show that these unsupervised embeddings reduce the diarization error rate more effectively than MFCCs. Experiments conducted on a popular subset of the AMI meeting corpus, consisting of 5.4 h of recordings, show that the new embeddings decrease the average diarization error rate by 2.96%, while for individual recordings a maximum improvement of 8.05% is achieved.
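The pipeline described above can be pictured in a few lines. The sketch below is an illustration rather than the paper's implementation: `mfcc_frames` and `segment_ids` are assumed inputs, the autoencoder layer sizes are arbitrary, and plain Ward agglomerative clustering stands in for the paper's GMM-based hierarchical clustering.

```python
# Sketch: unsupervised autoencoder embeddings from MFCC frames, then
# segment-level agglomerative clustering (Ward linkage stands in for the
# paper's GMM-based hierarchical clustering).
import numpy as np
import torch
import torch.nn as nn
from scipy.cluster.hierarchy import linkage, fcluster

class MFCCAutoencoder(nn.Module):
    def __init__(self, n_mfcc=19, emb_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mfcc, 64), nn.ReLU(),
                                     nn.Linear(64, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_mfcc))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train_and_embed(mfcc_frames, epochs=20):
    """mfcc_frames: (n_frames, n_mfcc) array; returns (n_frames, emb_dim) embeddings."""
    model = MFCCAutoencoder(n_mfcc=mfcc_frames.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.from_numpy(mfcc_frames.astype(np.float32))
    for _ in range(epochs):
        recon, _ = model(x)
        loss = nn.functional.mse_loss(recon, x)  # unsupervised reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, z = model(x)
    return z.numpy()

def diarize(embeddings, segment_ids, n_speakers=4):
    """Average frame embeddings per segment, then cluster segments into speakers."""
    segs = np.array([embeddings[segment_ids == s].mean(axis=0)
                     for s in np.unique(segment_ids)])
    return fcluster(linkage(segs, method="ward"), t=n_speakers, criterion="maxclust")
```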
Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to traditional approaches that build their speaker embeddings from hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly to magnitude spectrograms. To compare our approach with the state of the art, we collect and publicly release an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces the diarization error rate by a large margin of over 30% with respect to the baseline.
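A minimal sketch of such a recurrent convolutional embedder, assuming PyTorch, is shown below; the layer sizes, spectrogram dimensions, and pooling scheme are illustrative assumptions, not the architecture from the paper.

```python
# Sketch: a recurrent convolutional network that maps a magnitude spectrogram
# to a fixed-size, L2-normalised speaker embedding.
import torch
import torch.nn as nn

class RecurrentConvEmbedder(nn.Module):
    def __init__(self, n_freq_bins=257, emb_dim=128):
        super().__init__()
        # 2-D convolutions over (time, frequency); each pooling halves the frequency axis
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)))
        # recurrent layer aggregates the convolutional features over time
        self.gru = nn.GRU(64 * (n_freq_bins // 4), 128, batch_first=True)
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, spec):
        # spec: (batch, time, freq) magnitude spectrogram
        x = self.conv(spec.unsqueeze(1))       # (batch, 64, time, freq // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 64 * (freq // 4))
        _, h = self.gru(x)                     # final hidden state summarises the segment
        return nn.functional.normalize(self.proj(h[-1]), dim=-1)

# usage: embeddings = RecurrentConvEmbedder()(torch.randn(8, 200, 257))  # -> (8, 128)
```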
Speaker Clustering With Neural Networks And Audio Processing
Speaker clustering is the task of differentiating speakers in a recording; in other words, the aim is to answer "who spoke when" in audio recordings. A common method used in industry is to extract MFCC features directly from the recording and to apply well-known techniques such as Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM). In this paper, we study neural networks (especially CNNs) followed by clustering and audio processing, aiming to reach accuracy comparable to state-of-the-art methods.
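For reference, the traditional baseline the abstract mentions might look like the following sketch, assuming librosa and scikit-learn; the file name, number of MFCCs, and number of GMM components are hypothetical.

```python
# Sketch of the traditional MFCC + GMM baseline mentioned above; the file name
# and parameter choices are hypothetical.
import librosa
from sklearn.mixture import GaussianMixture

y, sr = librosa.load("meeting.wav", sr=16000)          # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T   # (n_frames, 20)

# One GMM component per assumed speaker; assigning each frame to the component
# with the highest posterior gives a crude frame-level speaker clustering.
gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(mfcc)
frame_labels = gmm.predict(mfcc)
```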
Large-Scale Speaker Diarization for Long Recordings and Small Collections
IEEE Transactions on Audio, Speech, and Language Processing, 2000
Performing speaker diarization of very long recordings is a problem for most diarization systems based on agglomerative clustering with an HMM topology. Performing collection-wide speaker diarization, where each speaker is identified uniquely across the entire collection, is an even more challenging task. In this paper we propose a method with which it is possible to efficiently perform diarization of long recordings. We have also applied this method successfully to a collection with a total duration of approximately 15 hours. The method consists of first segmenting long recordings into smaller chunks on which diarization is performed. Next, a speaker detection system is used to link the speech clusters from each chunk and to assign a unique label to each speaker in the long recording or in the small collection. We show for three different audio collections that it is possible to perform high-quality diarization with this approach. The long meetings from the ICSI corpus are processed 5.5 times faster than originally required, and by uniquely labeling each speaker across the entire collection it becomes possible to perform speaker-based information retrieval with high accuracy (mean average precision of 0.57).
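The chunk-then-link strategy can be illustrated with a short sketch. Here `chunk_clusters` is an assumed list of per-chunk speaker clusters with embeddings, and a simple cosine-distance threshold stands in for the paper's speaker detection system.

```python
# Sketch: link per-chunk speaker clusters into collection-wide speakers using a
# cosine-distance threshold (a stand-in for the paper's speaker detection system).
import numpy as np
from scipy.spatial.distance import cosine

def link_clusters(chunk_clusters, threshold=0.3):
    """chunk_clusters: iterable of (chunk_id, local_label, embedding) tuples.
    Returns a dict mapping (chunk_id, local_label) -> global speaker id."""
    global_ids = {}
    centroids = []  # one running embedding per global speaker found so far
    for chunk_id, local_label, emb in chunk_clusters:
        emb = np.asarray(emb, dtype=float)
        dists = [cosine(emb, c) for c in centroids]        # compare with known speakers
        if dists and min(dists) < threshold:
            gid = int(np.argmin(dists))
            centroids[gid] = (centroids[gid] + emb) / 2.0  # update running centroid
        else:
            gid = len(centroids)                           # new collection-wide speaker
            centroids.append(emb)
        global_ids[(chunk_id, local_label)] = gid
    return global_ids
```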
A novel method for selecting the number of clusters in a speaker diarization system
This paper introduces the cluster score (C-score) as a measure for determining a suitable number of clusters when performing speaker clustering in a speaker diarization system. The C-score finds a trade-off between intra-cluster and extra-cluster similarities, selecting a number of clusters whose elements are similar to one another but different from the elements of other clusters. Speech utterances are represented by Gaussian mixture model mean supervectors; the projection of the supervectors into a low-dimensional discriminative subspace by linear discriminant analysis is also assessed. The technique shows robustness to segmentation errors and, compared with the widely used Bayesian information criterion (BIC)-based stopping criterion, results in a lower speaker clustering error while dramatically reducing computation time. Experiments were run on the broadcast news database used for the Albayzin 2010 Speaker Diarization Evaluation.
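One plausible instantiation of such a criterion is sketched below. The exact formula (mean intra-cluster minus mean inter-cluster cosine similarity) is an assumption used for illustration, not the paper's definition of the C-score, and the sketch assumes more supervectors than the largest candidate cluster count.

```python
# Sketch of a C-score-style criterion: pick the number of clusters that maximises
# mean intra-cluster similarity minus mean inter-cluster similarity.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

def select_num_clusters(supervectors, k_min=2, k_max=10):
    """supervectors: (n_utterances, dim) GMM mean supervectors (or LDA projections)."""
    sims = cosine_similarity(supervectors)
    off_diag = ~np.eye(len(supervectors), dtype=bool)
    best_k, best_score = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(supervectors)
        same = labels[:, None] == labels[None, :]
        intra = sims[same & off_diag].mean()   # similarity within clusters
        extra = sims[~same].mean()             # similarity across clusters
        score = intra - extra                  # higher = tight and well separated
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```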
Speaker Embeddings for Diarization of Broadcast Data In The Allies Challenge
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
Diarization consists of the segmentation of speech signals and the clustering of homogeneous speaker segments. State-of-the-art systems typically operate upon speaker embeddings, such as i-vectors or neural x-vectors, extracted from mel cepstral coefficients (MFCCs) or spectrograms. The recent SincNet architecture extracts x-vectors directly from raw speech signals. The work reported in this paper compares the performance of different embeddings extracted from MFCCs or from the raw signal for speaker diarization of broadcast media treated with compression and sub-sampling, operations which typically degrade performance. Experiments are performed with the new ALLIES database, designed to complement existing, publicly available French corpora of broadcast radio and TV shows. Results show that, in adverse conditions with compression and sampling mismatch, SincNet x-vectors outperform i-vectors and x-vectors by relative DERs of 43% and 73%, respectively. Additionally, SincNet x-vectors are not the absolute best embeddings overall, but they are more robust to data mismatch than the others.
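The core idea behind SincNet, referred to above, is a convolutional front end whose kernels are band-pass filters parameterised only by low and high cut-off frequencies, applied directly to the raw waveform. The sketch below builds such a filterbank in NumPy; the kernel length, windowing, and cut-off values are illustrative assumptions.

```python
# Sketch of SincNet-style filters: band-pass FIR kernels defined only by their
# low/high cut-off frequencies, to be convolved with the raw waveform.
import numpy as np

def sinc_bandpass_kernels(low_hz, high_hz, kernel_len=251, sr=16000):
    """Build one band-pass kernel per (low, high) cut-off pair."""
    t = (np.arange(kernel_len) - kernel_len // 2) / sr   # time axis in seconds
    window = np.hamming(kernel_len)                      # smooth the truncation
    kernels = []
    for f1, f2 in zip(low_hz, high_hz):
        # difference of two sinc low-pass filters = band-pass between f1 and f2
        band = 2 * f2 * np.sinc(2 * f2 * t) - 2 * f1 * np.sinc(2 * f1 * t)
        kernels.append(band * window)
    return np.stack(kernels)

# usage: four filters spanning 100-4000 Hz; in SincNet the cut-offs are learned
lows = np.array([100.0, 500.0, 1000.0, 2000.0])
highs = np.array([500.0, 1000.0, 2000.0, 4000.0])
bank = sinc_bandpass_kernels(lows, highs)
# features = np.stack([np.convolve(wave, k, mode="same") for k in bank])
```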
A review on speaker diarization systems and approaches
Speech Communication, 2012
Speaker indexing or diarization is an important task in audio processing and retrieval. Speaker diarization is the process of labeling a speech signal with labels corresponding to the identity of speakers. This paper presents a comprehensive review of the evolution of the technology and of the different approaches to speaker indexing, and offers a detailed discussion of these approaches and their contributions. It reviews the most common features for speaker diarization as well as the most important approaches to speech activity detection (SAD) in diarization frameworks. The two main subtasks of speaker indexing are speaker segmentation and speaker clustering, and the approaches proposed for each are reviewed separately; speaker diarization systems that combine the two tasks in a unified framework are also introduced. Another discussion concerns approaches to online speaker indexing, which differs fundamentally from traditional offline approaches. The paper also introduces the most common performance measures and evaluation datasets. To conclude, a complete framework for speaker indexing is proposed, which aims to be domain-independent, parameter-free, and applicable to both online and offline applications.
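Since the review discusses performance measures, a frame-based sketch of the most common one, the diarization error rate (DER), is given below. It is a simplified illustration that ignores overlapping speech and forgiveness collars; the label convention (-1 for non-speech) is an assumption.

```python
# Sketch of a frame-based diarization error rate: missed speech, false alarms
# and speaker confusion, divided by total reference speech, with the optimal
# reference-to-hypothesis speaker mapping found by the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def der(ref, hyp):
    """ref, hyp: frame-level speaker labels, with -1 marking non-speech frames."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != -1
    missed = np.sum(speech & (hyp == -1))        # reference speech with no hypothesis
    false_alarm = np.sum(~speech & (hyp != -1))  # hypothesis speech with no reference
    both = speech & (hyp != -1)
    r_ids, h_ids = np.unique(ref[both]), np.unique(hyp[both])
    overlap = np.array([[np.sum((ref[both] == r) & (hyp[both] == h)) for h in h_ids]
                        for r in r_ids])
    rows, cols = linear_sum_assignment(-overlap)  # maximise correctly matched frames
    confusion = both.sum() - overlap[rows, cols].sum()
    return (missed + false_alarm + confusion) / speech.sum()
```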
End-To-End Speaker Diarization as Post-Processing
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
This paper investigates the use of an end-to-end diarization model as post-processing for conventional clustering-based diarization. Clustering-based diarization methods partition frames into as many clusters as there are speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to a single speaker. On the other hand, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some of these methods can handle a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for each other's weaknesses, we propose using a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update their labels to improve the overlapped regions. Experimental results show that the proposed algorithm consistently improves the performance of state-of-the-art methods on the CALLHOME, AMI, and DIHARD II datasets.
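A sketch of this post-processing loop is given below, assuming a hypothetical callable `eend_two_speaker` that returns per-frame activity posteriors for two selected speakers; the pair-selection order and thresholds are illustrative, not the paper's exact procedure.

```python
# Sketch: refine a clustering-based result by letting a two-speaker end-to-end
# model re-decide the frames of each speaker pair, which allows overlap.
import itertools
import numpy as np

def refine_with_eend(features, frame_labels, eend_two_speaker):
    """features: (n_frames, dim) array; frame_labels: (n_frames,) speaker ids.
    Returns an (n_frames, n_speakers) binary activity matrix that allows overlap."""
    speakers = np.unique(frame_labels)
    activity = np.stack([(frame_labels == s).astype(int) for s in speakers], axis=1)
    for i, j in itertools.combinations(range(len(speakers)), 2):
        mask = (activity[:, i] | activity[:, j]).astype(bool)  # frames involving the pair
        probs = eend_two_speaker(features[mask])                # (n_masked, 2) posteriors
        activity[mask, i] = (probs[:, 0] > 0.5).astype(int)
        activity[mask, j] = (probs[:, 1] > 0.5).astype(int)
    return activity
```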
Combination of agglomerative and sequential clustering for speaker diarization
Acoustics, Speech and …, 2008
This paper investigates the use of sequential clustering for speaker diarization. Conventional diarization systems are based on parametric models and agglomerative clustering. In our previous work we proposed a non-parametric method based on the agglomerative Information Bottleneck for very fast diarization. Here we consider combining sequential and agglomerative clustering to avoid local maxima of the objective function and to purify clusters. Experiments are run on the RT06 evaluation data. Sequential clustering with oracle model selection can reduce the speaker error by 10% w.r.t. agglomerative clustering. When model selection is based on the Normalized Mutual Information criterion, a relative improvement of 5% is obtained by combining agglomerative and sequential clustering.
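One way to picture the combination is sketched below: an agglomerative initialisation followed by a sequential reassignment pass. The distance measure, stopping rule, and the assumption that no cluster empties out are all illustrative choices, not the paper's Information Bottleneck formulation.

```python
# Sketch: agglomerative initialisation followed by a sequential reassignment
# pass that moves each segment to its nearest cluster centroid.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def agglomerative_then_sequential(segments, n_clusters, n_passes=5):
    """segments: (n_segments, dim) per-segment feature vectors.
    Assumes no cluster empties out during reassignment."""
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(segments)
    for _ in range(n_passes):
        centroids = np.stack([segments[labels == k].mean(axis=0)
                              for k in range(n_clusters)])
        dists = np.linalg.norm(segments[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)  # sequential pass: nearest centroid wins
        if np.array_equal(new_labels, labels):
            break                              # converged; no segment moved
        labels = new_labels
    return labels
```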