Robust acoustic domain identification with its application to speaker diarization

New Advances in Speaker Diarization

Interspeech 2020

Recently, speaker diarization based on speaker embeddings has shown excellent results in many works. In this paper, we propose several enhancements throughout the diarization pipeline. This work addresses two clustering frameworks: agglomerative hierarchical clustering (AHC) and spectral clustering (SC). First, we use multiple speaker embeddings. We show that fusion of x-vectors and d-vectors boosts accuracy significantly. Second, we train neural networks to leverage both acoustic and duration information for scoring the similarity of segments or clusters. Third, we introduce a novel method to guide the AHC clustering mechanism using a neural network. Fourth, we handle short-duration segments in SC by deemphasizing their effect on setting the number of speakers. Finally, we propose a novel method for estimating the number of clusters in the SC framework. The method takes each eigenvalue and analyzes the projections of the SC similarity matrix on the corresponding eigenvector. We evaluated our system on NIST SRE 2000 CALLHOME and, using cross-validation, achieved an error rate of 5.1%, surpassing state-of-the-art speaker diarization results.
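The cluster-count estimation step can be illustrated with the standard eigengap heuristic, a simpler relative of the eigenvector-projection analysis the abstract describes; the function name, the affinity-matrix input, and the cap on the speaker count below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def estimate_num_speakers(affinity: np.ndarray, max_speakers: int = 8) -> int:
    """Estimate the number of clusters from a spectral-clustering affinity matrix.

    Classic eigengap heuristic: the speaker count is taken as the index of
    the largest gap between consecutive (ascending) eigenvalues of the
    symmetric normalized graph Laplacian.
    """
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-10))
    lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(lap))      # ascending
    gaps = np.diff(eigvals[: max_speakers + 1])     # gaps among the smallest eigenvalues
    return int(np.argmax(gaps)) + 1
```

With a well-separated affinity matrix (near block-diagonal), the Laplacian has one near-zero eigenvalue per cluster, so the first large gap marks the speaker count.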

Investigating Deep Neural Networks for Speaker Diarization in the DIHARD Challenge

IEEE Workshop on Spoken Language Technology, 2018

We investigate the use of deep neural networks (DNNs) for the speaker diarization task to improve performance under domain-mismatched conditions. Three unsupervised domain adaptation techniques, namely inter-dataset variability compensation (IDVC), domain-invariant covariance normalization (DICN), and domain mismatch modeling (DMM), are applied on DNN-based speaker embeddings to compensate for the mismatch in the embedding subspace. We present results of experiments conducted on the DIHARD data, which was released for the 2018 diarization challenge. Collected from a diverse set of domains, this data provides very challenging domain-mismatched conditions for the diarization task. Our results provide insights into how the performance of our proposed system could be further improved.
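Of the three adaptation techniques, IDVC is the simplest to sketch: estimate the subspace spanned by per-dataset mean embeddings and project it out, so that directions encoding domain rather than speaker are removed. This is a minimal numpy sketch under that standard formulation; the function name and arguments are illustrative:

```python
import numpy as np

def idvc_compensate(embeddings, dataset_ids, num_dirs=None):
    """Inter-dataset variability compensation (IDVC), minimal sketch.

    Builds the subspace spanned by (centered) per-dataset mean embeddings
    and projects it out of every embedding.
    """
    X = np.asarray(embeddings, dtype=float)
    ids = np.asarray(dataset_ids)
    means = np.stack([X[ids == d].mean(axis=0) for d in sorted(set(dataset_ids))])
    centered = means - means.mean(axis=0)
    # Orthonormal basis of the inter-dataset subspace via SVD
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = num_dirs or int(np.sum(s > 1e-8))
    basis = vt[:k]                        # (k, dim)
    return X - (X @ basis.T) @ basis      # remove the projection onto the subspace
```

After compensation, the per-dataset means coincide along the removed directions, which is the intended effect for embeddings drawn from mismatched domains.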

Unsupervised deep feature embeddings for speaker diarization

TURKISH JOURNAL OF ELECTRICAL ENGINEERING & COMPUTER SCIENCES, 2019

Speaker diarization aims to determine "who spoke when?" in multispeaker recording environments. In this paper, we propose to learn a set of high-level feature representations, referred to as feature embeddings, from an unsupervised deep architecture for speaker diarization. These embeddings are learned by a deep autoencoder trained on mel-frequency cepstral coefficients (MFCCs) of input speech frames. The learned embeddings are then used in Gaussian mixture model based hierarchical clustering for diarization. The results show that these unsupervised embeddings are better than MFCCs at reducing the diarization error rate. Experiments conducted on the popular subset of the AMI meeting corpus, consisting of 5.4 h of recordings, show that the new embeddings decrease the average diarization error rate by 2.96%. For individual recordings, a maximum improvement of 8.05% is achieved.
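As a toy stand-in for the deep autoencoder, the idea can be sketched with a tiny linear autoencoder trained by gradient descent on MFCC-like frame vectors, with the bottleneck activations serving as the unsupervised embeddings; the architecture, learning rate, and function name here are illustrative assumptions, far simpler than the paper's model:

```python
import numpy as np

def train_autoencoder(frames, emb_dim=4, lr=0.01, epochs=200, seed=0):
    """Tiny linear autoencoder; bottleneck activations are the embeddings."""
    rng = np.random.default_rng(seed)
    X = np.asarray(frames, dtype=float)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, emb_dim))
    W_dec = rng.normal(scale=0.1, size=(emb_dim, d))
    for _ in range(epochs):
        Z = X @ W_enc                  # encode
        R = Z @ W_dec                  # decode / reconstruct
        err = R - X
        # gradients of the mean squared reconstruction error
        g_dec = Z.T @ err / n
        g_enc = X.T @ (err @ W_dec.T) / n
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc
    return X @ W_enc                   # per-frame embeddings
```

In the paper's pipeline, these embeddings would then replace raw MFCCs as the input to GMM-based hierarchical clustering.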

End-To-End Speaker Diarization as Post-Processing

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into as many clusters as there are speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some methods can handle a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for each other's weaknesses, we propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update the results of the two speakers to improve the overlapped region. Experimental results show that the proposed algorithm consistently improved the performance of state-of-the-art methods across the CALLHOME, AMI, and DIHARD II datasets.
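The iterative pair-selection loop can be sketched as follows. Here `refine_pair` is a hypothetical placeholder for the two-speaker end-to-end model (the paper's actual model and its pair-selection criterion are not reproduced); the loop simply visits speaker pairs and rewrites only those two activity columns, which may then overlap:

```python
import itertools
import numpy as np

def pairwise_postprocess(labels, refine_pair, num_iters=1):
    """Sketch of two-speaker post-processing over a clustering result.

    `labels` is a (frames, speakers) 0/1 activity matrix from the clustering
    stage (one active speaker per frame, no overlap). `refine_pair` stands in
    for the two-speaker end-to-end model: given two activity columns, it
    returns refined columns that may contain overlap.
    """
    out = np.asarray(labels, dtype=int).copy()
    n_spk = out.shape[1]
    for _ in range(num_iters):
        for i, j in itertools.combinations(range(n_spk), 2):
            # pass copies so the refiner cannot alias the working matrix
            ci, cj = refine_pair(out[:, i].copy(), out[:, j].copy())
            out[:, i], out[:, j] = ci, cj
    return out
```

Visiting all pairs exhaustively is one possible selection strategy; the paper selects pairs from the clustering result, so treat this loop as a structural sketch only.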

The Intelligent Voice System for the IberSPEECH-RTVE 2018 Speaker Diarization Challenge

IberSPEECH 2018

This paper describes the Intelligent Voice (IV) speaker diarization system for the IberSPEECH-RTVE 2018 speaker diarization challenge. We developed a new speaker diarization system built on the success of deep neural network based speaker embeddings in speaker verification systems. In contrast to acoustic features such as MFCCs, deep neural network embeddings are much better at discerning speaker identities, especially for speech acquired without constraints on recording equipment and environment. We perform spectral clustering on our proposed CNN-LSTM-based speaker embeddings to find homogeneous segments and generate speaker log-likelihoods for each frame. An HMM is then used to refine the speaker posterior probabilities by limiting the probability of switching between speakers across frames. We present results obtained on the development set (dev2) as well as the evaluation set provided by the challenge.
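The HMM refinement step amounts to Viterbi decoding with a penalty on speaker switches between consecutive frames. A minimal numpy sketch of that idea (the penalty value and function name are illustrative; the paper's HMM parameters are not given in the abstract):

```python
import numpy as np

def hmm_smooth(log_likes, switch_penalty=5.0):
    """Viterbi smoothing of per-frame speaker log-likelihoods.

    A fixed penalty on switching speakers between consecutive frames
    removes spurious single-frame speaker changes.
    """
    ll = np.asarray(log_likes, dtype=float)   # (frames, speakers)
    n, k = ll.shape
    delta = ll[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # trans[j, i]: score of being in state i at t-1 then state j at t
        trans = delta[None, :] - switch_penalty * (1 - np.eye(k))
        back[t] = np.argmax(trans, axis=1)
        delta = ll[t] + np.max(trans, axis=1)
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

With the penalty set to zero, this degenerates to a per-frame argmax; a positive penalty keeps the decoder on one speaker through brief noisy frames.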

Large-Scale Speaker Diarization for Long Recordings and Small Collections

IEEE Transactions on Audio, Speech, and Language Processing, 2000

Performing speaker diarization of very long recordings is a problem for most diarization systems that are based on agglomerative clustering with an HMM topology. Performing collection-wide speaker diarization, where each speaker is identified uniquely across the entire collection, is an even more challenging task. In this paper we propose a method with which it is possible to efficiently perform diarization of long recordings. We have also applied this method successfully to a collection with a total duration of approximately 15 hours. The method consists of first segmenting long recordings into smaller chunks on which diarization is performed. Next, a speaker detection system is used to link the speech clusters from each chunk and to assign a unique label to each speaker in the long recording or in the small collection. We show for three different audio collections that it is possible to perform high-quality diarization with this approach. The long meetings from the ICSI corpus are processed 5.5 times faster than originally required, and by uniquely labeling each speaker across the entire collection it becomes possible to perform speaker-based information retrieval with high accuracy (mean average precision of 0.57).
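The chunk-linking step can be sketched as follows, using cosine similarity between cluster centroids as a simple stand-in for the paper's speaker detection system; the threshold, function name, and the decision to match against the first-seen centroid (rather than update a model per speaker) are illustrative assumptions:

```python
import numpy as np

def link_clusters(chunk_centroids, threshold=0.8):
    """Link per-chunk speaker clusters into global speaker labels.

    `chunk_centroids` is a list (one entry per chunk) of arrays holding one
    embedding centroid per local cluster. Each local cluster is linked to
    the most similar existing global speaker, or opens a new one.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    global_centroids = []          # one centroid per global speaker (first seen)
    labels = []                    # per chunk: list of global speaker labels
    for centroids in chunk_centroids:
        chunk_labels = []
        for c in centroids:
            sims = [cos(c, g) for g in global_centroids]
            if sims and max(sims) >= threshold:
                chunk_labels.append(int(np.argmax(sims)))
            else:
                global_centroids.append(np.asarray(c, dtype=float))
                chunk_labels.append(len(global_centroids) - 1)
        labels.append(chunk_labels)
    return labels
```

A fuller version would update each global speaker model as new matching clusters arrive, which is closer to what a real speaker detection back-end does.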

Speaker Embeddings for Diarization of Broadcast Data In The Allies Challenge

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

Diarization consists of the segmentation of speech signals and the clustering of homogeneous speaker segments. State-of-the-art systems typically operate upon speaker embeddings, such as i-vectors or neural x-vectors, extracted from mel cepstral coefficients (MFCCs) or spectrograms. The recent SincNet architecture extracts x-vectors directly from raw speech signals. The work reported in this paper compares the performance of different embeddings extracted from MFCCs or the raw signal for speaker diarization of broadcast media treated with compression and sub-sampling, operations which typically degrade performance. Experiments are performed with the new ALLIES database, which was designed to complement existing, publicly available French corpora of broadcast radio and TV shows. Results show that, in adverse conditions with compression and sampling mismatch, SincNet x-vectors outperform i-vectors and x-vectors by relative DERs of 43% and 73%, respectively. Additionally, we found that while SincNet x-vectors are not the best embeddings in every condition, they are more robust to data mismatch than the others.

UWB-NTIS Speaker Diarization System for the DIHARD II 2019 Challenge

Interspeech 2019

In this paper, we present our system developed by the team from the New Technologies for the Information Society (NTIS) research center of the University of West Bohemia in Pilsen, for the Second DIHARD Speech Diarization Challenge. The base of our system follows the currently-standard approach of segmentation, i/x-vector extraction, clustering, and resegmentation. The hyperparameters for each of the subsystems were selected according to the domain classifier trained on the development set of DIHARD II. We compared our system with results from the Kaldi diarization (with i/x-vectors) and combined these systems. At the time of writing, our best submission achieved a DER of 23.47% and a JER of 48.99% on the evaluation set (in Track 1, using reference SAD).

Trainable speaker diarization

This paper presents a novel framework for speaker diarization. We explicitly model intra-speaker inter-segment variability using a speaker-labeled training corpus and use this modeling to assess the speaker similarity between speech segments. Modeling is done by embedding segments into a segment-space using kernel-PCA, followed by explicit modeling of speaker variability in the segment-space. Our framework leads to a significant improvement in diarization accuracy. Finally, we present a similar method for bandwidth classification.
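The segment-space embedding via kernel PCA can be sketched in a few lines: compute a kernel matrix over segment-level feature vectors, double-center it, and project onto its leading eigenvectors. The RBF kernel, the gamma value, and the segment representation below are illustrative assumptions; the paper does not specify them in the abstract:

```python
import numpy as np

def kernel_pca_embed(segment_feats, n_components=2, gamma=0.1):
    """Embed speech segments into a low-dimensional segment-space (kernel PCA)."""
    X = np.asarray(segment_feats, dtype=float)
    sq = np.sum(X**2, axis=1)
    # RBF kernel matrix over segments
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                     # double-centered kernel
    w, v = np.linalg.eigh(Kc)          # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_components]
    # scale eigenvectors by sqrt(eigenvalue), as in standard kernel PCA
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 1e-12))
```

Speaker variability would then be modeled explicitly in this segment-space, and segment similarity scored against that model.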

A review on speaker diarization systems and approaches

Speech Communication, 2012

Speaker indexing or diarization is an important task in audio processing and retrieval. Speaker diarization is the process of labeling a speech signal with labels corresponding to the identity of speakers. This paper includes a comprehensive review of the evolution of the technology and the different approaches in speaker indexing, and offers a detailed discussion of these approaches and their contributions. It reviews the most common features for speaker diarization, in addition to the most important approaches for speech activity detection (SAD) in diarization frameworks. The two main tasks of speaker indexing are speaker segmentation and speaker clustering; this paper includes a separate review of the approaches proposed for each subtask, as well as of speaker diarization systems which combine the two tasks in a unified framework. Another discussion concerns approaches for online speaker indexing, which differs fundamentally from traditional offline approaches. Other parts of this paper include an introduction to the most common performance measures and evaluation datasets. To conclude, a complete framework for speaker indexing is proposed, intended to be domain-independent, parameter-free, and applicable to both online and offline applications.