Novel strategies for reducing the false alarm rate in a speaker segmentation system

BINSEG: an efficient speaker-based segmentation technique

Interspeech 2006, 2006

In this paper we present a new, efficient approach to speaker-based audio stream segmentation. It employs the binary segmentation technique, which is well known in mathematical statistics. Because hypothesis testing is an integral part of this technique, we compare two well-founded approaches (Maximum Likelihood, Informational) and one commonly used approach (BIC difference) for deriving speaker-change test statistics. Based on the results of this comparison, we propose both off-line and on-line speaker change detection algorithms (including an effective training procedure) that offer high accuracy and low computational cost. In simulated tests with artificially mixed data, the on-line algorithm identified 95.7% of all speaker changes with a precision of 96.9%. In tests on 30 hours of real broadcast news (in 9 languages), the average recall was 74.4% and the precision 70.3%.
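The recursive idea behind binary segmentation can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the test statistic, so the code below assumes one-dimensional features, a single Gaussian per segment, and a BIC-difference-style score. The names `segment_cost`, `best_split`, `binary_segmentation`, `make_demo_signal`, and the parameters `min_len` and `penalty_weight` are all hypothetical.

```python
import math


def segment_cost(values):
    """Return n * log(ML variance) for a 1-D sample (variance floored)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return n * math.log(max(var, 1e-12))


def best_split(values, min_len, penalty_weight):
    """Return (index, score) of the best split, or None if too short."""
    n = len(values)
    if n < 2 * min_len:
        return None
    whole = segment_cost(values)
    # Assumption: a 1-D Gaussian has 2 free parameters (mean, variance),
    # so a split adds 2 parameters to the model.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    best = None
    for t in range(min_len, n - min_len + 1):
        parts = segment_cost(values[:t]) + segment_cost(values[t:])
        score = 0.5 * (whole - parts) - penalty
        if best is None or score > best[1]:
            best = (t, score)
    return best


def binary_segmentation(values, min_len=20, penalty_weight=1.0, offset=0):
    """Recursively split while the best split has a positive score."""
    split = best_split(values, min_len, penalty_weight)
    if split is None or split[1] <= 0:
        return []  # no change point accepted in this section
    t = split[0]
    left = binary_segmentation(values[:t], min_len, penalty_weight, offset)
    right = binary_segmentation(values[t:], min_len, penalty_weight,
                                offset + t)
    return left + [offset + t] + right


def make_demo_signal():
    """Synthetic data: three level shifts (0, 5, -3), 100 samples each,
    with alternating +1/-1 'noise' so the example is deterministic."""
    signal = []
    for mean in (0.0, 5.0, -3.0):
        signal.extend(mean + (1.0 if i % 2 == 0 else -1.0)
                      for i in range(100))
    return signal
```

Calling `binary_segmentation(make_demo_signal())` should report boundaries at samples 100 and 200. Real systems would use multi-dimensional acoustic features and a tuned penalty; this sketch only shows how one split test, applied recursively, yields both the number and the positions of change points.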

A new speaker change detection method for two-speaker segmentation

IEEE International Conference on Acoustics Speech and Signal Processing, 2002

In the absence of prior information about speakers, an important step in speaker segmentation is to obtain initial estimates for training speaker models. In this paper, we present a new method for obtaining these estimates. The method assumes that a conversation must be initiated by one of the speakers. Thus, one speaker model is estimated from the small segment at the beginning of the conversation, and the segment that has the largest distance from the initial segment is used to train the second speaker model. We describe a system based on this method and evaluate it on two different tasks: a controlled task with variations in the duration of the initial speaker segment and the amount of overlapped speech, and the 2001 NIST Speaker Recognition Evaluation task, which contains natural conversations.

Speaker Change Detection via Binary Segmentation Technique and Informational Approach

Proc. of SPECOM, 2006

This paper deals with the problem of speaker change detection in acoustic data. The aim is to identify the optimal number and positions of the change-points that split the signal into shorter sections belonging to individual speakers. In particular, we focus on the so-called binary segmentation technique, which is well known in mathematical statistics but has never been used for the speaker change detection task. We demonstrate its applicability to this task in simulated tests with artificially mixed utterances and also in tests on 30 hours of real broadcast news (in 9 languages). Further, we review the commonly used approach to speaker change detection via the Bayesian Information Criterion and suggest a theoretically more tenable solution.

A novel method for two-speaker segmentation

Proc. of ICSLP, Jeju, …, 2004

This paper addresses the problem of speaker-based audio data segmentation. A novel method is proposed that has the advantages of both model-based and metric-based techniques; it creates a model for each speaker from the available data on the fly. This can be viewed as building a Hidden Markov Model (HMM) for the data with speakers abstracted as the hidden states. Each speaker/state is modeled with a Gaussian Mixture Model (GMM). To prevent a large number of spurious change points being detected, the use of the Generalized Likelihood Ratio (GLR) metric for grouping feature vectors is proposed. A clustering technique is described through which a good initialization of each GMM is achieved, such that each state corresponds to a single speaker and not to noise, silence or word classes, something that may happen in conventional unlabelled clustering. Finally, a refinement method along the lines of Viterbi training of HMMs is presented. The proposed method does not require prior knowledge of any speaker characteristics. It also does not require any tuning of threshold parameters, so it can be used with confidence on new data sets. The method assumes that the number of speakers is known a priori to be two. The method results in a decrease in the error rate of 84.75% on the files reported in the baseline system. It performs just as well even when the speaker segments are as short as 1 s each, which is a large improvement over some previous methods that require larger segments for accurate detection of speaker change points.

Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches

2006

This paper addresses the problem of unsupervised speaker change detection. Three systems based on the Bayesian Information Criterion (BIC) are tested. The first system investigates the AudioSpectrumCentroid and the AudioWaveformEnvelope features, implements dynamic thresholding followed by a fusion scheme, and finally applies BIC. The second method is a real-time one that uses a metric-based approach employing the line spectral pairs and the BIC to validate a potential speaker change point. The third method consists of three modules. In the first module, a measure based on second-order statistics is used; in the second module, the Euclidean distance and the T² Hotelling statistic are applied; and in the third module, the BIC is utilized. The experiments are carried out on a dataset created by concatenating speakers from the TIMIT database, which is referred to as the TIMIT data set. A comparison of the performance of the three systems is made based on t-statistics.

An adaptive threshold computation for unsupervised speaker segmentation

… Annual Conference of …, 2009

Reliable speaker segmentation is critical in many applications in the speech processing domain. In this paper, we compare the performance of two speaker segmentation systems: the first one is inspired by a typical state-of-the-art speaker segmentation system, and the other ...

Systematic comparison of BIC-based speaker segmentation systems

2007

Unsupervised speaker change detection is addressed in this paper. Three speaker segmentation systems are examined. The first system investigates the AudioSpectrumCentroid and the AudioWaveformEnvelope features, implements a dynamic fusion scheme, and applies the Bayesian Information Criterion (BIC). The second system consists of three modules. In the first module, a second-order statistic measure is extracted; the Euclidean distance and the T² Hotelling statistic are applied sequentially in the second module; and BIC is utilized in the third module. The third system first uses a metric-based approach to detect potential speaker change points, and then applies BIC to validate the previously detected change points. Experiments are carried out on a dataset created by concatenating speakers from the TIMIT database. A systematic performance comparison among the three systems is carried out by means of the one-way ANOVA method and the post hoc Tukey's method.

Speaker segmentation and clustering for simultaneously presented speech

Interspeech 2009

This paper proposes a new scheme for segmenting and clustering speech segments on an unsupervised basis in cases where multiple speakers are presented simultaneously at different SNRs. The new elements in our work are the development of a new feature for segmenting and clustering simultaneously presented speech, the procedure for identifying a candidate set of possible speaker-change points, and the use of pair-wise cross-segment distance distributions to cluster segments by speaker. The proposed system is evaluated in terms of the F measure that is obtained. The system is compared to a baseline system that uses MFCC for acoustic features, the Bayesian Information Criterion (BIC) for detecting speaker-change points, and the Kullback-Leibler distance for clustering the segments. Experimental results indicate that the new system consistently provides better performance than the baseline system with very small computational cost.
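As an illustration of how an F measure for change-point detection is typically computed, here is a minimal sketch. The abstract does not state how detected points are matched to reference points, so the greedy one-to-one matching and the `tolerance` parameter (in the same time units as the change points) are assumptions, and all function names are hypothetical.

```python
def count_hits(reference, detected, tolerance):
    """Greedily match each reference point to at most one unused
    detected point lying within +/- tolerance of it."""
    unused = sorted(detected)
    hits = 0
    for ref in sorted(reference):
        for i, det in enumerate(unused):
            if abs(det - ref) <= tolerance:
                hits += 1
                del unused[i]  # each detection may be used only once
                break
    return hits


def precision_recall_f(reference, detected, tolerance=1.0):
    """Return (precision, recall, F) for detected change points."""
    hits = count_hits(reference, detected, tolerance)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, with reference points at 10, 20, 30 and 40 s and detections at 10.4, 19.2 and 33 s, a 1 s tolerance gives two hits, so precision is 2/3, recall is 1/2, and F is 4/7.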

BIC-Based Speaker Segmentation Using Divide-and-Conquer Strategies With Application to Speaker Diarization

IEEE Transactions on Audio, Speech, and Language Processing, 2000

In this paper, we propose three divide-and-conquer approaches for BIC-based speaker segmentation. The approaches detect speaker changes by recursively partitioning a large analysis window into two sub-windows and recursively verifying the merging of two adjacent audio segments using ∆BIC, a widely adopted distance measure of two audio segments. We compare our approaches to three popular distance-based approaches, namely, Chen and Gopalakrishnan's window-growing-based approach, Siegler et al.'s fixed-size sliding window approach, and Delacourt and Wellekens's DISTBIC approach, by performing computational cost analysis and conducting speaker change detection experiments on two broadcast news data sets. The results show that the proposed approaches are more efficient and achieve higher segmentation accuracy than the compared distance-based approaches. In addition, we apply the segmentation approaches discussed in this paper to the speaker diarization task. The experimental results show that a more effective segmentation approach leads to better diarization accuracy.
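For reference, the ∆BIC measure mentioned above is commonly written as follows when each segment is modeled by a single full-covariance Gaussian. This is the widely used textbook form, not necessarily the exact variant in the paper. The symbols are assumptions for illustration: N_1 and N_2 are the frame counts of the two adjacent segments, N = N_1 + N_2, Σ, Σ_1 and Σ_2 are the sample covariance matrices of the merged segment and of each segment, d is the feature dimension, and λ is a tunable penalty weight.

```latex
\Delta\mathrm{BIC}
  = \frac{N}{2}\log\lvert\Sigma\rvert
  - \frac{N_1}{2}\log\lvert\Sigma_1\rvert
  - \frac{N_2}{2}\log\lvert\Sigma_2\rvert
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

Under this sign convention, a positive value favors modeling the two segments separately, that is, a speaker change at their boundary. Some authors use the opposite sign.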

Computationally Efficient and Robust BIC-Based Speaker Segmentation

IEEE Transactions on Audio, Speech & Language Processing, 2008

is presented. BIC tests are not performed for every window shift (e.g. every milliseconds), as previously, but when a speaker change is most probable to occur. This is done by estimating the next probable change point by means of a model of utterance durations. It is found that the inverse Gaussian distribution best fits the distribution of utterance durations. As a result, fewer BIC tests are needed, making the proposed system less computationally demanding in time and memory, and considerably more efficient with respect to missed speaker change points. A feature selection algorithm based on a branch-and-bound search strategy is applied in order to identify the most efficient features for speaker segmentation. Furthermore, a new theoretical formulation of BIC is derived by applying centering and simultaneous diagonalization. This formulation is considerably more computationally efficient than the standard BIC when the covariance matrices are estimated by estimators other than the usual maximum likelihood ones. Two commonly used pairs of figures of merit are employed and their relationship is established. Computational efficiency is achieved through the speaker utterance modeling, whereas robustness is achieved by feature selection and application of BIC tests at appropriately selected time instants. Experimental results indicate that the proposed modifications yield a superior performance compared to existing approaches.
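To make the utterance-duration model concrete, the sketch below fits an inverse Gaussian distribution to a list of durations using the standard closed-form maximum likelihood estimates and evaluates its density. How the paper uses the fitted model to schedule BIC tests is not described in the abstract, so that step is not shown. The function names and the sample durations are hypothetical.

```python
import math


def fit_inverse_gaussian(durations):
    """Closed-form ML estimates (mu, lam) of an inverse Gaussian.

    mu is the sample mean; 1/lam is the mean of (1/x - 1/mu).
    All durations must be positive and not all identical.
    """
    if not durations or any(x <= 0 for x in durations):
        raise ValueError("durations must be a non-empty list of positives")
    n = len(durations)
    mu = sum(durations) / n
    inv_lam = sum(1.0 / x - 1.0 / mu for x in durations) / n
    if inv_lam <= 0:
        raise ValueError("durations must not all be identical")
    return mu, 1.0 / inv_lam


def inverse_gaussian_pdf(x, mu, lam):
    """Density of the inverse Gaussian with mean mu and shape lam at x > 0."""
    return (math.sqrt(lam / (2.0 * math.pi * x ** 3))
            * math.exp(-lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x)))
```

For instance, `fit_inverse_gaussian([1.0, 2.0, 4.0])` gives a mean of 7/3 s and a shape parameter of 84/13. The density can then be evaluated at candidate durations to see which utterance lengths the model considers likely.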