Speaker utterances tying among speaker segmented audio documents using hierarchical classification: towards speaker indexing of audio databases

Speaker linking in large data sets

This paper investigates the task of linking speakers across multiple recordings, which can be accomplished by speaker clustering. Various aspects are considered, such as computational complexity, on-line versus off-line approaches, evaluation measures, and speaker recognition approaches. It has not been the aim of this study to optimize clustering performance, but as an experimental exercise, we perform speaker linking on all '1conv-4w' conversation sides of the NIST-2006 evaluation data set. This set contains 704 speakers in 3835 conversation sides. Using both on-line and off-line algorithms, equal-purity figures of about 86% are obtained.
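The equal-purity figure balances two complementary measures: cluster purity (how homogeneous each hypothesized cluster is) and speaker purity (how little each true speaker is split across clusters). The Python sketch below shows one common way to compute both for a linking hypothesis; the exact purity definitions and the threshold sweep used in the paper are assumptions here.

```python
from collections import Counter

def purities(labels_true, labels_pred):
    """Cluster purity: fraction of each hypothesized cluster belonging to its
    dominant true speaker. Speaker purity: the same, with the roles swapped."""
    def purity(grouping, reference):
        groups = {}
        for g, r in zip(grouping, reference):
            groups.setdefault(g, []).append(r)
        correct = sum(Counter(members).most_common(1)[0][1]
                      for members in groups.values())
        return correct / len(grouping)
    return purity(labels_pred, labels_true), purity(labels_true, labels_pred)

# Example: 6 conversation sides, 3 true speakers, a 2-cluster hypothesis.
true_spk = ["A", "A", "B", "B", "C", "C"]
hyp      = [ 0,   0,   0,   1,   1,   1 ]
print(purities(true_spk, hyp))  # (cluster purity, speaker purity)
```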

Speaker-Based Segmentation for Audio Data Indexing

Speech Communication, 2000

In this paper, we address the problem of speaker-based segmentation, which is the first necessary step for several indexing tasks. It consists in recognizing from their voice the sequence of people engaged in a conversation. In our context, we make no assumptions about prior knowledge of the speaker characteristics (no speaker model, no speech model, no …

A Simple Approach to Unsupervised Speaker Indexing

2006 International Symposium on Intelligent Signal Processing and Communications, 2006

Unsupervised speaker indexing is a rapidly developing field in speech processing, which involves determining who is speaking when, without prior knowledge about the speakers being observed. In this research, a distance-based technique for indexing telephone conversations is presented. Submodels are formed (using data of approximately equal sizes) from the conversations, from which two reference models are judiciously chosen such that they represent the two different speakers in the conversation. Models are then matched to the reference speakers based on a technique referred to as the Restrained-Relative Minimum Distance (RRMD) approach. Some models, which fail to meet the RRMD criteria, are considered "undecided" and left unmatched with either of the reference speakers. An analysis is made to determine the appropriate size (or length of data to be used) for these models, which are formed using cepstral coefficients of the speech data.
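The abstract does not spell out the RRMD criterion itself, so the sketch below only illustrates the general idea as described: each submodel is assigned to the nearer of the two reference models unless its two distances are too close to call, in which case it is left "undecided". The margin-based rule and the distance function are assumptions.

```python
import numpy as np

def assign_rrmd(submodels, ref_a, ref_b, margin=0.1, dist=None):
    """Assign each submodel to the nearer of two reference speaker models,
    but only when the relative distance gap exceeds `margin`; otherwise
    mark it 'undecided'. (Illustrative reading of the RRMD idea.)"""
    if dist is None:
        dist = lambda x, y: np.linalg.norm(x - y)  # placeholder distance
    labels = []
    for m in submodels:
        da, db = dist(m, ref_a), dist(m, ref_b)
        if abs(da - db) / max(da, db, 1e-12) < margin:
            labels.append("undecided")
        else:
            labels.append("A" if da < db else "B")
    return labels
```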

Analysis of the Impact of the Audio Database Characteristics in the Accuracy of a Speaker Clustering System

Odyssey 2016, 2016

In this paper, a traditional clustering algorithm based on speaker identification is presented. Several audio data sets were tested to determine how accurate the clustering algorithm is depending on the characteristics of the analyzed database. We show that factors such as the size of the database, the number of speakers, and how the audio recordings are balanced across the speakers in the database significantly affect the accuracy of the clustering task. These conclusions can be used to propose strategies for solving a clustering task or to predict in which situations a higher performance of the clustering algorithm is expected. We also focus on the stopping criterion, to avoid the degradation of results caused by the mismatch between training and testing data when traditional stopping criteria based on maximum distance thresholds are used.
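As a point of reference for the stopping-criterion discussion, the following sketch shows a generic agglomerative clustering that halts merging once the closest pair of clusters exceeds a fixed distance threshold. It uses SciPy's hierarchical-clustering utilities and synthetic embeddings, not the speaker-identification-based system evaluated in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "speaker embeddings" for 8 utterances (in practice: supervectors, i-vectors, ...).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (4, 16)), rng.normal(1, 0.1, (4, 16))])

Z = linkage(X, method="average", metric="euclidean")

# Distance-threshold stopping criterion: merging halts once the closest pair
# of clusters is farther apart than `t`. A threshold tuned on mismatched data
# is exactly the failure mode the paper discusses.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)
```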

Audio data indexing: Use of second-order statistics for speaker-based segmentation

Proceedings IEEE International Conference on Multimedia Computing and Systems

The content-based indexing task considered in this paper consists in recognizing, from their voices, the speakers involved in a conversation. A new approach for speaker-based segmentation, which is the first necessary step for this indexing task, is described. Our study is done under the assumptions that no prior information on the speakers is available, that the number of speakers is unknown, and that people do not speak simultaneously. Audio data indexing is commonly divided into two parts: audio data is first segmented with respect to speaker utterances, and the resulting segments associated with a given speaker are then merged together. In this work, we focus on the first part and propose a new segmentation method based on second-order statistics. The practical significance of this study is illustrated by applying the new technique to real data to show its efficiency.
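The paper's exact statistic is not reproduced in the abstract; as a hedged illustration of segmentation driven by second-order statistics, the sketch below fits a full-covariance Gaussian to each of two adjacent sliding windows and scores their symmetric Kullback-Leibler divergence, taking local maxima of the resulting curve as speaker-change candidates.

```python
import numpy as np

def gaussian_divergence(x, y):
    """Symmetric KL divergence between full-covariance Gaussians fitted to
    feature windows x and y (frames x dims). Uses only first- and
    second-order statistics of each window."""
    mx, my = x.mean(0), y.mean(0)
    cx = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    cy = np.cov(y, rowvar=False) + 1e-6 * np.eye(y.shape[1])
    icx, icy = np.linalg.inv(cx), np.linalg.inv(cy)
    d = mx - my
    return 0.5 * (np.trace(icx @ cy) + np.trace(icy @ cx)
                  + d @ (icx + icy) @ d - 2 * x.shape[1])

def change_curve(features, win=100, hop=10):
    """Slide two adjacent windows over the feature stream and score their
    dissimilarity; local maxima are speaker-change candidates."""
    scores = []
    for t in range(win, len(features) - win, hop):
        scores.append(gaussian_divergence(features[t - win:t], features[t:t + win]))
    return np.array(scores)
```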

Resolution Limitation in Speakers Clustering and Segmentation Problems

2001

In an unlabeled and unsegmented conversation, i.e. where no a-priori knowledge about speakers' identities and segment boundaries is provided, it is very important to cluster the conversation (perform segmentation and labeling) with the best possible resolution. In low-resolution cases, i.e. when segment durations are long, a segment might contain data from several speakers. On the other hand, when short segments are used (high resolution), not enough statistics are available to allow a correct decision about the identity of the speakers. In this work, the performance of a system that employs different segment lengths is presented. We assumed that the number of speakers, R, is known, and high-quality conversations were used. Each speaker was modeled by a Self-Organizing Map (SOM). An iterative algorithm allows the data to move from one model to another and adjusts the SOMs. The restriction that the data can move only in small groups, rather than by moving each and every feature vector separately, forces the SOMs to adjust to speakers (instead of phonemes or other vocal events). We found that the optimal segment duration was half a second. The system has a clustering performance of about 90% for two-speaker conversations and over 80% for three-speaker conversations.
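The key algorithmic point is that data is reassigned in whole segments rather than frame by frame, which keeps the models tied to speakers instead of short phonetic events. The sketch below illustrates that segment-level reassignment loop; simple mean vectors stand in for the per-speaker SOMs, so it illustrates the constraint rather than the paper's SOM training.

```python
import numpy as np

def segment_clustering(segments, R, iters=10, seed=0):
    """Iteratively reassign whole segments (frames x dims arrays) to R speaker
    models. Moving data only in segment-sized groups, as in the paper, keeps
    the models tied to speakers rather than to short phonetic events. Simple
    mean vectors stand in here for the per-speaker SOMs."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, R, size=len(segments))
    models = [np.zeros(segments[0].shape[1]) for _ in range(R)]
    for _ in range(iters):
        for r in range(R):
            member_segs = [s for s, l in zip(segments, labels) if l == r]
            if member_segs:  # keep the previous model if the cluster went empty
                models[r] = np.vstack(member_segs).mean(axis=0)
        labels = np.array([
            np.argmin([np.linalg.norm(seg - m, axis=1).mean() for m in models])
            for seg in segments
        ])
    return labels
```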

A comparison of distance measures for clustering in speaker diarization

2014

Matching video segments in order to detect their similarity is a necessary task in retrieval and summarization applications. In order to determine nearly identical content, such as repeated takes of the same scene, very precise matching of sequences of features extracted from the video segments needs to be performed. In this paper we compare the performance of three distance measures for the task of clustering multiple takes of the same scene: Dynamic Time Warping (DTW) and two variants of Longest Common Subsequence (LCSS). We also evaluate the influence of the quality of the input segmentation on the performance of the algorithms.
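For readers unfamiliar with the first of the compared measures, here is a minimal dynamic time warping implementation over two feature sequences; the LCSS variants and the take-clustering pipeline from the paper are not reproduced.

```python
import numpy as np

def dtw(a, b, dist=lambda x, y: np.linalg.norm(x - y)):
    """Classic dynamic time warping between two feature sequences a and b
    (each: length x dims). Returns the accumulated alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```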

A novel method for selecting the number of clusters in a speaker diarization system

This paper introduces the cluster score (C-score) as a measure for determining a suitable number of clusters when performing speaker clustering in a speaker diarization system. The C-score finds a trade-off between intra-cluster and extra-cluster similarities, selecting a number of clusters whose elements are similar to each other but different from the elements in other clusters. Speech utterances are represented by Gaussian mixture model mean supervectors, and the projection of these supervectors into a low-dimensional discriminative subspace by linear discriminant analysis is also assessed. This technique shows robustness to segmentation errors and, compared with the widely used Bayesian information criterion (BIC)-based stopping criterion, results in a lower speaker clustering error and dramatically reduces computation time. Experiments were run using the broadcast news database used for the Albayzin 2010 Speaker Diarization Evaluation.
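The exact C-score formula is not given in the abstract, so the sketch below uses a generic stand-in: for each candidate number of clusters, it computes the mean within-cluster cosine similarity minus the mean between-cluster cosine similarity over the supervectors and keeps the count that maximizes this trade-off. The similarity measure, linkage method, and candidate range are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def intra_extra_tradeoff(X, labels):
    """Mean within-cluster similarity minus mean between-cluster similarity,
    using cosine similarity on (super)vectors. An illustrative stand-in for
    the C-score trade-off described in the paper."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    intra, extra = S[same & off_diag], S[~same]
    if intra.size == 0 or extra.size == 0:
        return -np.inf
    return intra.mean() - extra.mean()

def pick_num_clusters(X, max_k=10):
    """Score each candidate cluster count and keep the best one."""
    Z = linkage(X, method="average", metric="cosine")
    scores = {k: intra_extra_tradeoff(X, fcluster(Z, t=k, criterion="maxclust"))
              for k in range(2, max_k + 1)}
    return max(scores, key=scores.get), scores
```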