BINSEG: an efficient speaker-based segmentation technique
Related papers
Speaker Change Detection via Binary Segmentation Technique and Informational Approach
Proc. of SPECOM, 2006
This paper deals with the problem of speaker change detection in acoustic data. The aim is to identify the optimal number and position of the change-points that split the signal into shorter sections belonging to individual speakers. In particular, we focus on the so-called binary segmentation technique, which is well known in mathematical statistics but has never been used in the speaker change detection task. We prove its applicability to this task in simulated tests with artificially mixed utterances and also in tests done with 30 hours of real broadcast news (in 9 languages). Further, we review the commonly used approach to speaker change detection via the Bayesian Information Criterion and suggest a theoretically more tenable solution.
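The binary segmentation idea in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each segment is modelled by a single diagonal-covariance Gaussian, it uses the conventional ΔBIC form rather than the alternative criterion the paper proposes, and the names `delta_bic`, `binary_segmentation`, `min_len`, and `penalty_weight` are hypothetical.

```python
import math


def log_det_diag(frames):
    """Log-determinant of a diagonal covariance: the sum of log variances."""
    n = len(frames)
    dim = len(frames[0])
    total = 0.0
    for d in range(dim):
        mean = sum(f[d] for f in frames) / n
        var = sum((f[d] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-10))  # floor avoids log(0)
    return total


def delta_bic(frames, split, penalty_weight=1.0):
    """Positive values favour two models (a change at `split`) over one.

    Sign convention and penalty are assumptions: the penalty counts
    2 * dim extra parameters (one mean and one variance per dimension).
    """
    n = len(frames)
    dim = len(frames[0])
    left, right = frames[:split], frames[split:]
    gain = 0.5 * (n * log_det_diag(frames)
                  - len(left) * log_det_diag(left)
                  - len(right) * log_det_diag(right))
    penalty = 0.5 * penalty_weight * (2 * dim) * math.log(n)
    return gain - penalty


def binary_segmentation(frames, offset=0, min_len=20, penalty_weight=1.0):
    """Split at the best candidate, then recurse into both halves."""
    n = len(frames)
    if n < 2 * min_len:
        return []
    candidates = range(min_len, n - min_len + 1)
    best = max(candidates, key=lambda s: delta_bic(frames, s, penalty_weight))
    if delta_bic(frames, best, penalty_weight) <= 0:
        return []  # no split is justified, so stop recursing
    left = binary_segmentation(frames[:best], offset, min_len, penalty_weight)
    right = binary_segmentation(frames[best:], offset + best, min_len,
                                penalty_weight)
    return left + [offset + best] + right
```

Here `frames` is a list of equal-length feature vectors (for example cepstral coefficients), and the returned indices are frame positions of the detected change-points in ascending order. The sketch recomputes the statistics for every candidate split, which keeps it short but is slower than an incremental update would be.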
Speaker change detection using BIC: a comparison on two datasets
2006
This paper addresses the problem of unsupervised speaker change detection. We assume that there is no prior knowledge of the number of speakers or their identities. Two methods are tested. The first method uses the Bayesian Information Criterion (BIC), investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, and implements a dynamic thresholding followed by a fusion scheme. The second method is a real-time one that uses a metric-based approach employing line spectral pairs (LSP) and the BIC criterion to validate a potential speaker change point. The methods are tested on two different datasets. The first set was created by concatenating speakers from the TIMIT database and is referred to as the TIMIT data set. The second set was created by using recordings from the MPEG-7 test set CD1 and broadcast news and is referred to as the INESC dataset.
A novel method for two-speaker segmentation
Proc. of ICSLP, Jeju, …, 2004
This paper addresses the problem of speaker-based audio data segmentation. A novel method that has the advantages of both model-based and metric-based techniques is proposed, which creates a model for each speaker from the available data on the fly. This can be viewed as building a Hidden Markov Model (HMM) for the data with speakers abstracted as the hidden states. Each speaker/state is modeled with a Gaussian Mixture Model (GMM). To prevent a large number of spurious change points being detected, the use of the Generalized Likelihood Ratio (GLR) metric for grouping feature vectors is proposed. A clustering technique is described, through which a good initialization of each GMM is achieved, such that each state corresponds to a single speaker and not noise, silence or word classes, something that may happen in conventional unlabelled clustering. Finally, a refinement method, along the lines of Viterbi Training of HMMs, is presented. The proposed method does not require prior knowledge of any speaker characteristics. It also does not require any tuning of threshold parameters, so it can be used with confidence over new data sets. The method assumes that the number of speakers is known a priori to be two. The method results in a decrease in the error rate by 84.75% on the files reported in the baseline system. It performs just as well even when the speaker segments are as short as 1s each, which is a large improvement over some previous methods, which require larger segments for accurate detection of speaker change points.
A new speaker change detection method for two-speaker segmentation
IEEE International Conference on Acoustics Speech and Signal Processing, 2002
In the absence of prior information about speakers, an important step in speaker segmentation is to obtain initial estimates for training speaker models. In this paper, we present a new method for obtaining these estimates. The method assumes that a conversation must be initiated by one of the speakers. Thus one speaker model is estimated from the small segment at the beginning of the conversation, and the segment that has the largest distance from the initial segment is used to train the second speaker model. We describe a system based on this method and evaluate it on two different tasks: a controlled task with variations in the duration of the initial speaker segment and the amount of overlapped speech, and the 2001 NIST Speaker Recognition Evaluation task, which contains natural conversations.
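The initialization described in the abstract above (take the opening segment as the first speaker, then the segment farthest from it as the second) might look roughly like this. The abstract does not name the distance measure, so the symmetric Kullback-Leibler divergence between diagonal-covariance Gaussians used here is an assumption, as are the fixed-length segmentation and the names `diag_gaussian`, `symmetric_kl`, and `pick_seed_segments`.

```python
def diag_gaussian(frames):
    """Per-dimension mean and variance of a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [
        max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-10)
        for d in range(dim)
    ]
    return means, variances


def symmetric_kl(model_a, model_b):
    """Symmetric KL divergence between two diagonal Gaussians (assumed metric)."""
    (means_a, vars_a), (means_b, vars_b) = model_a, model_b
    total = 0.0
    for m1, v1, m2, v2 in zip(means_a, vars_a, means_b, vars_b):
        total += 0.5 * (v1 / v2 + v2 / v1 - 2.0)
        total += 0.5 * (m1 - m2) ** 2 * (1.0 / v1 + 1.0 / v2)
    return total


def pick_seed_segments(frames, seg_len):
    """Return indices of the opening segment and the segment farthest from it."""
    segments = [frames[i:i + seg_len]
                for i in range(0, len(frames) - seg_len + 1, seg_len)]
    if len(segments) < 2:
        raise ValueError("need at least two full segments")
    first = diag_gaussian(segments[0])
    distances = [symmetric_kl(first, diag_gaussian(s)) for s in segments[1:]]
    farthest = 1 + max(range(len(distances)), key=distances.__getitem__)
    return 0, farthest
```

The two returned segment indices would then seed the two speaker models; training and refining those models is outside this sketch.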
Speaker Segmentation of Interviews Using Integrated Video and Audio Change Detectors
2007
In this paper, we study the use of audio and visual cues to perform speaker segmentation of audiovisual recordings of formal meetings such as interviews, lectures, or courtroom sessions. The sole use of audio cues for such recordings can be ineffective due to low recording quality and a high level of background noise. We propose to use additional cues from the video stream by exploiting the relatively static locations of speakers in the scene.
MultiBIC: an improved speaker segmentation technique for TV shows
Interspeech 2010, 2010
Speaker segmentation systems usually have problems detecting short segments, which makes the number of deletions high and therefore harms the performance of the system. This is a complication when it comes to segmenting multimedia information such as movies and TV shows, where dialogs among characters are very common. In this paper, a modification of the BIC algorithm is presented that remarkably reduces the number of deletions without increasing the number of false alarms. This modification, referred to as MultiBIC, assumes that two change-points are present in a window of data, while the conventional BIC approach supposes that there is just one. This allows the system to notice when there is more than one change-point in a window, finding shorter segments than traditional BIC.
Novel strategies for reducing the false alarm rate in a speaker segmentation system
2010
Reliable speaker segmentation is critical in many applications in the speech processing domain. In this paper, we extend our earlier formulation for false alarm reduction in a typical state-of-the-art speaker segmentation system. Specifically, we present two novel strategies for reducing the false alarm rate with a minimal impact on the true speaker change detection rate. One of the new strategies rejects, given a discard probability, those changes that are suspected of being false alarms because of their low ΔBIC value; the other assumes that the occurrence of changes constitutes a Poisson process, so changes will be discarded with a probability that follows a Poisson cumulative density function. Our experiments show the improvements obtained with each false alarm reduction approach using the Spanish Parliament Sessions defined for the 2006 TC-STAR Automatic Speech Recognition evaluation campaign.
IEEE Transactions on Audio, Speech, and Language Processing, 2000
In this paper, we propose three divide-and-conquer approaches for BIC-based speaker segmentation. The approaches detect speaker changes by recursively partitioning a large analysis window into two sub-windows and recursively verifying the merging of two adjacent audio segments using ΔBIC, a widely adopted distance measure of two audio segments. We compare our approaches to three popular distance-based approaches, namely, Chen and Gopalakrishnan's window-growing-based approach, Siegler et al.'s fixed-size sliding window approach, and Delacourt and Wellekens's DISTBIC approach, by performing computational cost analysis and conducting speaker change detection experiments on two broadcast news data sets. The results show that the proposed approaches are more efficient and achieve higher segmentation accuracy than the compared distance-based approaches. In addition, we apply the segmentation approaches discussed in this paper to the speaker diarization task. The experimental results show that a more effective segmentation approach leads to better diarization accuracy.
2005
This paper describes a large-scale experiment in which eight research institutions have tested their audio partitioning and labeling algorithms on the same data, a multi-lingual database of news broadcasts, using the same evaluation tools and protocols. The experiments have provided more insight into the cross-lingual robustness of the methods, and they have demonstrated that by further collaborating in the domains of speaker change detection and speaker clustering it should be possible to achieve further technological progress in the near future.
Speaker change detection using features through a neural network speaker classifier
2017 Intelligent Systems Conference (IntelliSys), 2017
The mechanism proposed here is for real-time speaker change detection in conversations. It first trains a neural network text-independent speaker classifier using in-domain speaker data. Through the network, features of conversational speech from out-of-domain speakers are then converted into likelihood vectors, i.e. similarity scores compared to the in-domain speakers. These transformed features demonstrate very distinctive patterns, which facilitate differentiating speakers and enable speaker change detection with some straightforward distance metrics. The speaker classifier and the speaker change detector are trained/tested using speech of the first 200 (in-domain) and the remaining 126 (out-of-domain) male speakers in TIMIT, respectively. For the speaker classification, 100% accuracy at a 200-speaker size is achieved on any testing file, given that the speech duration is at least 0.97 seconds. For the speaker change detection using speaker classification outputs, performance based on inspection intervals of 0.5, 1, and 2 seconds was evaluated in terms of error rate and F1 score, using data synthesized by concatenating speech from various speakers. It captures close to 97% of the changes by comparing the current second of speech with the previous second, which is very competitive with results in the literature obtained using other methods.
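The comparison of adjacent intervals described in the abstract above can be illustrated with a small sketch. It assumes the per-frame likelihood vectors have already been produced by some speaker classifier, and it picks cosine distance and a fixed threshold purely for illustration; the abstract mentions only "straightforward distance metrics", so the metric, the threshold value, and the names `detect_changes`, `window`, and `threshold` are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length score vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def detect_changes(score_vectors, window, threshold=0.5):
    """Flag interval boundaries where adjacent averaged score vectors differ.

    `score_vectors` holds one classifier output per frame, and `window` is
    the number of frames per inspection interval.
    """
    changes = []
    for start in range(window, len(score_vectors) - window + 1, window):
        previous = mean_vector(score_vectors[start - window:start])
        current = mean_vector(score_vectors[start:start + window])
        if cosine_distance(previous, current) > threshold:
            changes.append(start)
    return changes
```

For example, with four frames scored `[0.9, 0.1, 0.0]` followed by four frames scored `[0.1, 0.1, 0.8]` and `window=2`, the only boundary flagged is frame 4, where the dominant in-domain speaker switches.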