Speaker-Based Segmentation for Audio Data Indexing
Related papers
Audio data indexing: Use of second-order statistics for speaker-based segmentation
Proceedings IEEE International Conference on Multimedia Computing and Systems
The content-based indexing task considered in this paper consists in recognizing, from their voices, the speakers involved in a conversation. A new approach to speaker-based segmentation, the first necessary step for this indexing task, is described. Our study is done under the assumptions that no prior information on the speakers is available, that the number of speakers is unknown, and that people do not speak simultaneously. Audio data indexing is commonly divided into two parts: the audio data is first segmented with respect to speaker utterances, and the resulting segments associated with a given speaker are then merged together. In this work, we focus on the first part and propose a new segmentation method based on second-order statistics. The practical significance of this study is illustrated by applying the new technique to real data to show its efficiency.
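The abstract does not spell out which second-order statistic is used, but a common way to realize this idea is to slide two adjacent windows over the feature stream and measure a covariance-based divergence between them; peaks in the resulting curve are candidate speaker turns. A minimal sketch (symmetric KL divergence between full-covariance Gaussians; window sizes and hop are illustrative assumptions, not the paper's values):

```python
import numpy as np

def gaussian_div(x, y):
    """Symmetric KL divergence between two full-covariance Gaussians
    estimated from feature windows x and y (frames x dims)."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    cx = np.cov(x, rowvar=False)
    cy = np.cov(y, rowvar=False)
    icx, icy = np.linalg.inv(cx), np.linalg.inv(cy)
    d = mx - my
    dim = x.shape[1]
    # log-determinant terms cancel in the symmetrized divergence
    return 0.5 * (np.trace(icx @ cy) + np.trace(icy @ cx)
                  + d @ (icx + icy) @ d - 2 * dim)

def distance_curve(feats, win=100, hop=10):
    """Distance between adjacent sliding windows; local maxima of the
    curve are candidate speaker turns."""
    dists = []
    for start in range(0, len(feats) - 2 * win, hop):
        a = feats[start:start + win]
        b = feats[start + win:start + 2 * win]
        dists.append(gaussian_div(a, b))
    return np.array(dists)
```

On synthetic features drawn from two different Gaussians, the curve peaks where the two windows straddle the change point.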
DISTBIC: A speaker-based segmentation for audio data indexing
Speech Communication, 2000
In this paper, we address the problem of speaker-based segmentation, which is the first necessary step for several indexing tasks. It aims to extract homogeneous segments containing the longest possible utterances produced by a single speaker. In our context, no assumption is made about prior knowledge of the speaker or speech signal characteristics (neither speaker model nor speech model). However, we assume that people do not speak simultaneously and that we have no real-time constraints. We review existing techniques and propose a new segmentation method, which combines two different segmentation techniques. This method, called DISTBIC, is organized into two passes: first the most likely speaker turns are detected, and then they are validated or discarded. The advantage of our algorithm is its efficiency in detecting speaker turns even close to one another (i.e., separated by a few seconds). © 2000 Elsevier Science B.V. All rights reserved.
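The validation pass of DISTBIC rests on the Bayesian Information Criterion. The standard ΔBIC statistic (which the paper builds on; the penalty weight `lam` below is the usual tunable λ, not a value from the paper) compares modelling a window with one full-covariance Gaussian against two Gaussians split at a candidate change point, and a positive value favours a speaker change:

```python
import numpy as np

def delta_bic(window, i, lam=1.0):
    """Delta-BIC at split frame i of a feature window (frames x dims).
    Positive values favour the two-speaker (change) hypothesis."""
    n, d = window.shape
    x, y = window[:i], window[i:]
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # model-complexity penalty: mean vector + symmetric covariance
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(window)
                  - len(x) * logdet(x)
                  - len(y) * logdet(y)) - lam * penalty
```

A window actually containing two speakers yields a clearly positive ΔBIC at the true boundary, while a homogeneous window stays negative, which is what makes the second-pass validation discriminative.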
Benefits of prior acoustic segmentation for automatic speaker segmentation
2004
The paper investigates the interest of segmentation into acoustic macro-classes (such as gender or bandwidth) as front-end processing for the segmentation/diarization task. The impact of this prior acoustic segmentation is evaluated in terms of speaker diarization performance in the particular context of the NIST RT'03 evaluation (conducted on the HUB4 broadcast news corpora). It is rarely discussed in the literature, but our work shows that applying prior acoustic segmentation, much as in the automatic speech recognition task, can be very useful for the speaker segmentation task. Experiments were conducted using two different kinds of speaker segmentation systems developed individually by the LIA and CLIPS laboratories in the framework of the ELISA consortium. For both systems, improvement was observed when combined with prior acoustic segmentation. However, a larger impact on performance is observed for the LIA system, based on an ascending/HMM approach, than for the CLIPS system, based on speaker turn detection.
A novel method for two-speaker segmentation
Proc. of ICSLP, Jeju, …, 2004
This paper addresses the problem of speaker-based audio data segmentation. A novel method that has the advantages of both model- and metric-based techniques is proposed, which creates a model for each speaker from the available data on the fly. This can be viewed as building a Hidden Markov Model (HMM) for the data with speakers abstracted as the hidden states. Each speaker/state is modeled with a Gaussian Mixture Model (GMM). To prevent a large number of spurious change points from being detected, the use of the Generalized Likelihood Ratio (GLR) metric for grouping feature vectors is proposed. A clustering technique is described through which a good initialization of each GMM is achieved, such that each state corresponds to a single speaker rather than noise, silence, or word classes, something that may happen in conventional unlabelled clustering. Finally, a refinement method along the lines of Viterbi training of HMMs is presented. The proposed method does not require prior knowledge of any speaker characteristics. It also does not require any tuning of threshold parameters, so it can be used with confidence on new data sets. The method assumes that the number of speakers is known a priori to be two. The method results in a decrease in the error rate of 84.75% on the files reported in the baseline system. It performs just as well even when the speaker segments are as short as 1 s each, which is a large improvement over some previous methods, which require longer segments for accurate detection of speaker change points.
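The refinement idea, alternating between refitting each speaker's model on its current frames and relabelling all frames by likelihood, can be sketched compactly. The sketch below substitutes a single full-covariance Gaussian per speaker where the paper uses GMMs, and hard relabelling where the paper uses Viterbi decoding, so it is a simplified stand-in rather than the authors' algorithm:

```python
import numpy as np

def loglik(x, mean, cov):
    """Per-frame log-likelihoods under a full-covariance Gaussian."""
    d = x.shape[1]
    ic = np.linalg.inv(cov)
    _, ld = np.linalg.slogdet(cov)
    diff = x - mean
    return -0.5 * (d * np.log(2 * np.pi) + ld
                   + np.einsum('ij,jk,ik->i', diff, ic, diff))

def refine_labels(feats, labels, n_iter=5):
    """Viterbi-training-style refinement: refit each speaker model on
    its currently assigned frames, then relabel every frame by the
    higher likelihood; iterate to convergence."""
    labels = labels.copy()
    for _ in range(n_iter):
        scores = []
        for spk in (0, 1):
            x = feats[labels == spk]
            scores.append(loglik(feats, x.mean(axis=0),
                                 np.cov(x, rowvar=False)))
        labels = np.argmax(scores, axis=0)
    return labels
```

Even a noisy initial labelling (as produced by the GLR-based grouping step) is quickly cleaned up when the two speakers' feature distributions are separable.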
Unsupervised speaker segmentation of telephone conversations
… Conference on Spoken …, 2002
A process for segmenting 2-speaker telephone conversations by speaker with no prior speaker models is described and evaluated. The process consists of an initial segmentation using acoustic change and pause detection, segment clustering, and iterative modeling of ...
A Simple Approach to Unsupervised Speaker Indexing
2006 International Symposium on Intelligent Signal Processing and Communications, 2006
Unsupervised speaker indexing is a rapidly developing field in speech processing, which involves determining who is speaking when, without prior knowledge about the speakers being observed. In this research, a distance-based technique for indexing telephone conversations is presented. Submodels are formed (using data of approximately equal sizes) from the conversations, from which two reference models are judiciously chosen such that they represent the two different speakers in the conversation. Models are then matched to the reference speakers based on a technique referred to as the Restrained-Relative Minimum Distance (RRMD) approach. Some models, which fail to meet the RRMD criteria, are considered "undecided" and left unmatched with either of the reference speakers. An analysis is made to determine the appropriate size (or length of data to be used) for these models, which are formed using cepstral coefficients of the speech data.
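The abstract does not define the RRMD criterion in detail, but the general shape of distance-based matching with an "undecided" band can be sketched as follows. Everything here, Euclidean distance on model mean vectors and the relative `margin` threshold, is an illustrative assumption, not the paper's formulation:

```python
import numpy as np

def match_models(models, ref_a, ref_b, margin=0.2):
    """Assign each submodel (here reduced to a mean cepstral vector)
    to the nearer of two reference models, but only when its relative
    distance advantage exceeds `margin`; otherwise leave it
    undecided (-1), in the spirit of the RRMD idea."""
    out = []
    for m in models:
        da = np.linalg.norm(m - ref_a)
        db = np.linalg.norm(m - ref_b)
        if abs(da - db) / max(da, db) < margin:
            out.append(-1)          # undecided: too close to call
        else:
            out.append(0 if da < db else 1)
    return out
```

The point of the undecided band is to trade coverage for purity: ambiguous submodels are excluded rather than risked as misassignments.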
BINSEG: an efficient speaker-based segmentation technique
Interspeech 2006, 2006
In this paper we present a new efficient approach to speaker-based audio stream segmentation. It employs binary segmentation, a technique well known from mathematical statistics. Because an integral part of this technique is hypothesis testing, we compare two well-founded approaches (Maximum Likelihood, Informational) and one commonly used approach (BIC difference) for deriving speaker-change test statistics. Based on the results of this comparison, we propose both off-line and on-line speaker change detection algorithms (including a way of effective training) that have the merits of high accuracy and low computational cost. In simulated tests with artificially mixed data, the on-line algorithm identified 95.7% of all speaker changes with a precision of 96.9%. In tests done with 30 hours of real broadcast news (in 9 languages), the average recall was 74.4% and the precision 70.3%.
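Binary segmentation is recursive: find the single strongest change point in a segment, and if its test statistic exceeds a threshold, accept the split and repeat on both halves. A minimal off-line sketch, using a single-Gaussian likelihood-ratio statistic as a generic stand-in for the paper's three test statistics (the threshold and minimum segment length below are illustrative):

```python
import numpy as np

def glr_stat(x, i):
    """Log-likelihood gain of modelling x with two full-covariance
    Gaussians split at i, versus one Gaussian."""
    n = len(x)
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    return 0.5 * (n * logdet(x)
                  - i * logdet(x[:i])
                  - (n - i) * logdet(x[i:]))

def binseg(feats, lo=0, hi=None, thresh=50.0, min_len=30):
    """Binary segmentation over feats[lo:hi]: locate the strongest
    candidate change; if it exceeds the threshold, accept it and
    recurse on both halves. Returns sorted change-point frames."""
    if hi is None:
        hi = len(feats)
    if hi - lo < 2 * min_len:
        return []
    seg = feats[lo:hi]
    stats = [glr_stat(seg, i) for i in range(min_len, len(seg) - min_len)]
    best = int(np.argmax(stats))
    if stats[best] < thresh:
        return []
    cut = lo + min_len + best
    return (binseg(feats, lo, cut, thresh, min_len) + [cut]
            + binseg(feats, cut, hi, thresh, min_len))
```

Because each accepted split only triggers searches inside the two halves, the recursion naturally handles an unknown number of speaker changes.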
2006
In this paper, we describe an automatic speaker-based audio segmentation and identification system for broadcast news indexation purposes. We specifically focus on speaker identification and audio scene detection. Speaker identification (SI) is based on state-of-the-art Gaussian mixture models, whereas the scene change detection process uses the classical Bayesian Information Criterion (BIC) and the recently proposed DISTBIC algorithm. In this work, the effectiveness of Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and Log Area Ratio (LAR) coefficients is compared for the purpose of text-independent speaker identification and speaker-based audio segmentation. Both the Fisher Discrimination Ratio feature analysis and the performance evaluation in terms of correct identification rate on the TIMIT database showed that the LPCC outperform the other features, especially for low-order coefficients. Our experiments on the audio segmentation module showed that the DISTBIC segmentation technique is more accurate than the BIC procedure, especially in the presence of short segments.
2002
Speaker indexing of an audio database consists in organizing the audio data according to the speakers present in the database. It is composed of three steps: (1) segmentation by speaker of each audio document; (2) speaker tying among the various segmented portions of the audio documents; and (3) generation of a speaker-based index. This paper focuses on the second step, the speaker tying task, which has not been addressed in the literature. The result of this task is a classification of the segmented acoustic data into clusters; each cluster should represent one speaker. This paper investigates hierarchical classification approaches for speaker tying. Two new discriminant dissimilarity measures and a new bottom-up algorithm are also proposed. The experiments are conducted on a subset of the Switchboard database, a conversational telephone database, and show that the proposed method allows very satisfying speaker tying across audio documents, with a good level of purity for the clusters, but with a number of clusters significantly higher than the number of speakers.
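The bottom-up (agglomerative) strategy used for speaker tying can be sketched generically: repeatedly merge the closest pair of clusters until the smallest inter-cluster dissimilarity exceeds a stopping threshold. The sketch uses plain Euclidean distance with average linkage on per-segment mean vectors as a stand-in for the paper's discriminant dissimilarity measures:

```python
import numpy as np

def bottom_up_cluster(segments, stop_dist=2.0):
    """Bottom-up clustering of segment representatives (here, mean
    feature vectors) with average linkage: merge the closest pair of
    clusters until the minimum inter-cluster distance exceeds
    stop_dist. Returns a list of clusters (lists of vectors)."""
    clusters = [[s] for s in segments]

    def dist(ca, cb):
        # average linkage: mean pairwise distance between clusters
        return np.mean([np.linalg.norm(a - b) for a in ca for b in cb])

    while len(clusters) > 1:
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > stop_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The stopping threshold is what determines the final number of clusters; the paper's observation that it exceeds the true number of speakers corresponds to stopping too early in this merge sequence.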
Segmentation of speech for speaker and language recognition
2003
Current Automatic Speech Recognition systems convert the speech signal into a sequence of discrete units, such as phonemes, and then apply statistical methods to the units to produce the linguistic message. A similar methodology has also been applied to recognize speakers and languages, except that the output of the system is the speaker or language information. We therefore propose the use of temporal trajectories of fundamental frequency and short-term energy to segment and label the speech signal into a small set of discrete units that can be used to characterize the speaker and/or language. The proposed approach is evaluated on the NIST Extended Data Speaker Detection task and the NIST Language Identification task.