Speaker linking in large data sets

Speaker utterances tying among speaker segmented audio documents using hierarchical classification: towards speaker indexing of audio databases

2002

Speaker indexing of an audio database consists of organizing the audio data according to the speakers present in the database. It comprises three steps: (1) segmentation of each audio document by speaker; (2) speaker tying among the segmented portions of the audio documents; and (3) generation of a speaker-based index. This paper focuses on the second step, the speaker tying task, which has not been addressed in the literature. The result of this task is a classification of the segmented acoustic data into clusters, where each cluster should represent one speaker. This paper investigates hierarchical classification approaches for speaker tying. Two new discriminant dissimilarity measures and a new bottom-up algorithm are also proposed. The experiments are conducted on a subset of Switchboard, a conversational telephone speech database, and show that the proposed method ties speakers across audio documents satisfactorily, with a good level of cluster purity, but with a number of clusters significantly higher than the number of speakers.
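
The bottom-up (agglomerative) idea behind speaker tying can be sketched as follows. This is only an illustration, not the algorithm or the dissimilarity measures proposed in the paper: it assumes each segment is already summarized by a fixed-length vector, uses plain Euclidean distance with average linkage, and stops at a hypothetical `threshold`. The names `tie_segments` and `average_linkage` are made up for this example.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def tie_segments(points, threshold):
    """Merge the closest pair of clusters until none is closer than threshold.

    points: one vector per speech segment (an assumed representation).
    Returns a list of clusters, each a list of segment indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # combinations() yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        if average_linkage(clusters[a], clusters[b], points) > threshold:
            break  # remaining clusters are assumed to be different speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


segments = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = tie_segments(segments, threshold=1.0)
```

A threshold that is too strict leaves more clusters than speakers, which is the kind of over-clustering the abstract reports.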

Constrained speaker linking

In this paper we study speaker linking (a.k.a. partitioning) given constraints on the distribution of speaker identities over speech recordings. Specifically, we show that the intractable partitioning problem becomes tractable when the constraints pre-partition the data into smaller cliques with non-overlapping speakers. The surprisingly common case where the speakers in telephone conversations are known, but the assignment of channels to identities is unspecified, is treated in a fully Bayesian way. We show that for the Dutch CGN database, where this channel assignment task is at hand, a lightweight speaker recognition system can quite effectively solve the channel assignment problem, with 93% of the cliques solved. We further show that the posterior distribution over channel assignment configurations is well calibrated.
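
For a single two-channel conversation with two known speakers, the channel assignment problem has only two configurations, so a posterior over them can be computed directly. The sketch below is a simplified illustration rather than the paper's model: it assumes the recognizer provides a log-likelihood for every (channel, speaker) pair, that the channels are independent given the assignment, and that `prior_straight` is a hypothetical prior on the "straight" configuration.

```python
import math


def channel_assignment_posterior(llk, prior_straight=0.5):
    """Posterior over the two ways to assign two speakers to two channels.

    llk[c][s] is an assumed log-likelihood of channel c given speaker s.
    Returns (P(straight), P(swapped)), where "straight" means
    channel 0 = speaker 0 and channel 1 = speaker 1.
    """
    straight = llk[0][0] + llk[1][1] + math.log(prior_straight)
    swapped = llk[0][1] + llk[1][0] + math.log(1.0 - prior_straight)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(straight, swapped)
    z = math.exp(straight - m) + math.exp(swapped - m)
    return math.exp(straight - m) / z, math.exp(swapped - m) / z


scores = [[-1.0, -5.0], [-4.0, -0.5]]
p_straight, p_swapped = channel_assignment_posterior(scores)
```

Checking calibration would then amount to comparing posteriors of this kind with how often the most probable configuration is actually correct.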

Large-Scale Speaker Diarization for Long Recordings and Small Collections

IEEE Transactions on Audio, Speech, and Language Processing, 2000

Performing speaker diarization of very long recordings is a problem for most diarization systems that are based on agglomerative clustering with an HMM topology. Performing collection-wide speaker diarization, where each speaker is identified uniquely across the entire collection, is an even more challenging task. In this paper we propose a method that makes it possible to diarize long recordings efficiently. We have also applied this method successfully to a collection with a total duration of approximately 15 hours. The method consists of first segmenting long recordings into smaller chunks on which diarization is performed. Next, a speaker detection system is used to link the speech clusters from each chunk and to assign a unique label to each speaker in the long recording or in the small collection. We show for three different audio collections that it is possible to perform high-quality diarization with this approach. The long meetings from the ICSI corpus are processed 5.5 times faster than with the original system, and by uniquely labeling each speaker across the entire collection it becomes possible to perform speaker-based information retrieval with high accuracy (mean average precision of 0.57).
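
The linking step, in which per-chunk clusters receive collection-wide labels, can be illustrated with a small union-find routine. This is a sketch under assumptions, not the system described in the abstract: `same_speaker` is a hypothetical stand-in for the speaker detection decision, and the scalar "voiceprints" exist only to make the example runnable.

```python
def link_clusters(cluster_ids, same_speaker):
    """Give every per-chunk cluster a global speaker label.

    cluster_ids: list of hashable cluster identifiers.
    same_speaker: callable (a, b) -> bool; a placeholder for a real
                  speaker detection system (an assumption of this sketch).
    Returns a dict mapping each cluster id to an integer speaker label.
    """
    parent = {c: c for c in cluster_ids}

    def find(c):
        # Follow parent links to the root, halving the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i, a in enumerate(cluster_ids):
        for b in cluster_ids[i + 1:]:
            if same_speaker(a, b):
                parent[find(b)] = find(a)

    labels = {}
    return {c: labels.setdefault(find(c), len(labels)) for c in cluster_ids}


# Toy data: one scalar per cluster stands in for a speaker model.
voiceprint = {
    ("chunk0", "A"): 0.0,
    ("chunk0", "B"): 10.0,
    ("chunk1", "A"): 10.2,
    ("chunk1", "B"): 0.1,
}
global_labels = link_clusters(
    list(voiceprint),
    lambda a, b: abs(voiceprint[a] - voiceprint[b]) < 1.0,
)
```

Note that the local cluster names ("A", "B") carry no meaning across chunks; only the linking decision determines the global label.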

Comparison of clustering algorithms in speaker identification

2000

In speaker identification, we match a given (unknown) speaker to the set of known speakers in a database. The database is constructed from the speech samples of each known speaker. Feature vectors are extracted from the samples by short-term spectral analysis and processed further by vector quantization to locate the clusters in the feature space. We study the role of vector quantization in the speaker identification system. We compare the performance of different clustering algorithms and the influence of the codebook size. We want to find out which method provides the best clustering result, and whether the difference in quality contributes to an improvement in the recognition accuracy of the system.
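
The matching step of a vector-quantization-based identifier can be sketched as below: the unknown speaker's feature vectors are quantized against every known speaker's codebook, and the codebook with the lowest average distortion wins. This is a generic illustration, assuming the codebooks have already been trained by some clustering algorithm and using squared Euclidean distance; the speaker names and values are invented.

```python
def distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest code vector."""
    total = 0.0
    for v in vectors:
        total += min(
            sum((x - c) ** 2 for x, c in zip(v, code)) for code in codebook
        )
    return total / len(vectors)


def identify(vectors, codebooks):
    """Return the speaker whose codebook quantizes the vectors best.

    codebooks: dict mapping speaker name -> list of code vectors
               (assumed to be produced beforehand by a clustering step).
    """
    return min(codebooks, key=lambda spk: distortion(vectors, codebooks[spk]))


codebooks = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 6.0)],
}
unknown = [(0.1, 0.0), (0.9, 1.2)]
best_match = identify(unknown, codebooks)
```

The clustering algorithm and the codebook size, which the paper compares, only affect how the `codebooks` dictionary is built; the matching rule stays the same.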

Resolution Limitation in Speakers Clustering and Segmentation Problems

2001

In an unlabeled and unsegmented conversation, i.e. when no a priori knowledge about the speakers' identities or the segment boundaries is provided, it is very important to cluster the conversation (segment and label it) with the best possible resolution. In low-resolution cases, i.e. when the segments are long, a segment might contain data from several speakers. On the other hand, when short segments are used (high resolution), there are not enough statistics to allow a correct decision about the identity of the speaker. In this work, the performance of a system that employs different segment lengths is presented. We assumed that the number of speakers, R, is known, and high-quality conversations were used. Each speaker was modeled by a Self-Organizing Map (SOM). An iterative algorithm allows the data to move from one model to another and adjusts the SOMs. The restriction that the data can move only in small groups, rather than one feature vector at a time, forces the SOMs to adapt to speakers (instead of phonemes or other vocal events). We found that the optimal segment duration was half a second. The system has a clustering performance of about 90% for two-speaker conversations and over 80% for three-speaker conversations.
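
The restriction that data moves between models only in small groups can be illustrated with a single assignment pass. The sketch is heavily simplified and makes several assumptions: each speaker model is reduced to one centroid instead of a SOM, the model update step is omitted, and `frames_per_segment` is a hypothetical parameter (with a 10 ms frame shift, half a second would correspond to 50 frames).

```python
def split_into_segments(frames, frames_per_segment):
    """Cut the frame sequence into consecutive fixed-length groups."""
    return [
        frames[i:i + frames_per_segment]
        for i in range(0, len(frames), frames_per_segment)
    ]


def assign_segments(segments, centroids):
    """Assign each whole segment to the model with the lowest total cost.

    centroids: one vector per speaker, a crude stand-in for a SOM.
    Returns one model index per segment; frames never move individually.
    """
    labels = []
    for seg in segments:
        costs = [
            sum(sum((x - c) ** 2 for x, c in zip(frame, cen)) for frame in seg)
            for cen in centroids
        ]
        labels.append(costs.index(min(costs)))
    return labels


frames = [(0.0,), (0.2,), (0.1,), (5.0,), (4.9,), (0.3,)]
segments = split_into_segments(frames, frames_per_segment=3)
labels = assign_segments(segments, centroids=[(0.0,), (5.0,)])
```

In the toy data the last frame is closer to the first model, yet it follows its segment to the second one, which is the group constraint at work.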

An Information Theoretic Approach to Speaker Diarization of Meeting Data

IEEE Transactions on Audio, Speech, and Language Processing, 2000

A speaker diarization system based on an information theoretic framework is described. The problem is formulated according to the Information Bottleneck (IB) principle. Unlike other approaches, where the distance between speaker segments is introduced arbitrarily, the IB method seeks the partition that maximizes the mutual information between observations and variables relevant for the problem while minimizing the distortion between observations. This solves the problem of choosing the distance between speech segments, which becomes the Jensen-Shannon divergence as it arises from the optimization of the IB objective function. We discuss issues related to speaker diarization using this information theoretic framework, such as the criteria for inferring the number of speakers, the trade-off between quality and compression achieved by the diarization system, and the algorithms for optimizing the objective function. Furthermore, we benchmark the proposed system against a state-of-the-art system on the NIST RT06 (Rich Transcription) data set for speaker diarization of meetings. The IB-based system achieves a Diarization Error Rate of 23.2%, compared to 23.6% for the baseline system. Because this approach is mainly based on non-parametric clustering, it runs significantly faster than the baseline HMM/GMM-based system, resulting in faster-than-real-time diarization.
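
The Jensen-Shannon divergence mentioned above can be computed for two discrete distributions as in the sketch below. It shows only the divergence itself, not the IB clustering procedure; the weights `wp` and `wq` are assumed here to default to 0.5 each, whereas in a clustering setting they would typically reflect the relative sizes of the two segments.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, skipping zero terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence between distributions p and q.

    wp and wq are positive mixture weights that sum to one.
    """
    mixture = [wp * pi + wq * qi for pi, qi in zip(p, q)]
    return wp * kl_divergence(p, mixture) + wq * kl_divergence(q, mixture)


identical = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With equal weights the divergence is symmetric and bounded by log 2 (in nats), which is reached when the two distributions do not overlap.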

Analysis of the Impact of the Audio Database Characteristics in the Accuracy of a Speaker Clustering System

Odyssey 2016, 2016

In this paper, a traditional clustering algorithm based on speaker identification is presented. Several audio data sets were tested to determine how the accuracy of the clustering algorithm depends on the characteristics of the analyzed database. We show that issues such as the size of the database, the number of speakers, and how the audio recordings are balanced over the speakers in the database significantly affect the accuracy of the clustering task. These conclusions can be used to propose strategies for solving a clustering task or to predict in which situations a higher performance of the clustering algorithm is expected. We also focus on the stopping criterion, to avoid the degradation of results caused by a mismatch between training and testing data when using traditional stopping criteria based on maximum distance thresholds.