Extraction of Audio Features Specific to Speech Production for Multimodal Speaker Detection

A multimodal approach to extract optimized audio features for speaker detection

2005

We present a method that exploits the information-theoretic framework described in [1] to extract audio features that are optimal with respect to the video features. A simple measure of mutual information between the resulting audio features and the video features makes it possible to detect the active speaker among several candidates. The results show that our method is able to exploit the shared speech information contained in the audio and video signals to recover their common source.
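The detection step described above (score each candidate by the mutual information between the audio feature and that candidate's video feature, then pick the highest score) can be sketched as follows. This is a minimal illustration, not the authors' method: the histogram (plug-in) estimator, the bin count, the 1-D features, and the names `quantize`, `mutual_information`, and `detect_speaker` are all assumptions made for this example, and the data are synthetic.

```python
import math
import random
from collections import Counter


def quantize(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=8):
    """Plug-in histogram estimate of I(X;Y) in bits (assumed estimator)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    qx, qy = quantize(x, n_bins), quantize(y, n_bins)
    joint = Counter(zip(qx, qy))
    px, py = Counter(qx), Counter(qy)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i,j) * log2( p(i,j) / (p(i) p(j)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi


def detect_speaker(audio_feature, video_features):
    """Return (index of the candidate with the highest MI, all MI scores)."""
    scores = [mutual_information(audio_feature, v) for v in video_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Synthetic demo: candidate 1 moves with the audio, candidate 0 does not.
rng = random.Random(0)
audio = [rng.gauss(0.0, 1.0) for _ in range(2000)]
silent_face = [rng.gauss(0.0, 1.0) for _ in range(2000)]
talking_face = [a + 0.5 * rng.gauss(0.0, 1.0) for a in audio]

best, scores = detect_speaker(audio, [silent_face, talking_face])
print("detected speaker:", best, "scores (bits):", scores)
```

The plug-in estimator is biased upward for small samples, so scores are only meaningful relative to one another, which is all the arg-max decision needs.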

Hypothesis testing as a performance evaluation method for multimodal speaker detection

This work addresses the problem of detecting the speaker in audio-visual sequences by evaluating the synchrony between the audio and video signals. Prior to classification, an information-theoretic framework is applied to extract optimized audio features using video information. The classification step is then defined through a hypothesis testing framework so as to obtain confidence levels associated with the classifier outputs. Such an approach makes it possible to evaluate the efficiency of the whole classification process and, in particular, the advantage of performing the feature extraction or not. As a result, it is shown that introducing a feature extraction step prior to the classification increases the ability of the classifier to produce good relative instance scores.

Multimodal speaker identification with audio-video processing

Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429)

In this paper we present a multimodal audiovisual speaker identification system. The objective is to improve the recognition performance over conventional unimodal schemes. The proposed system decomposes the information existing in a video stream into three components: speech, face texture and lip motion. Lip motion between successive frames is first computed in terms of optical flow vectors and then encoded as a feature vector in a magnitude-direction histogram domain. The feature vectors obtained along the whole stream are then interpolated to match the rate of the speech signal and fused with mel-frequency cepstral coefficients (MFCC) of the corresponding speech signal. The resulting joint feature vectors are used to train and test a Hidden Markov Model (HMM) based identification system. Face texture images are treated separately in the eigenface domain and integrated into the system through decision fusion. Experimental results are also included to demonstrate the system performance.
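The rate-matching and feature-fusion step mentioned in this abstract can be illustrated with a short sketch. Everything specific here is an assumption for the example: the abstract does not say which interpolation is used (linear is assumed), the frame rates (25 Hz video, 100 Hz audio frames) and feature sizes are hypothetical, and the helper names `interpolate_to_rate` and `fuse` are invented.

```python
import math


def interpolate_to_rate(frames, src_rate, dst_rate):
    """Linearly resample a list of feature vectors from src_rate to dst_rate (Hz)."""
    if not frames:
        return []
    last = len(frames) - 1
    # Number of output frames that fit between the first and last source frame.
    n_out = math.floor(last * dst_rate / src_rate + 1e-9) + 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate      # fractional index into the source
        i = min(int(pos), last)
        j = min(i + 1, last)
        w = pos - i                        # weight of the later source frame
        out.append([(1 - w) * a + w * b for a, b in zip(frames[i], frames[j])])
    return out


def fuse(audio_frames, visual_frames):
    """Concatenate per-frame audio and visual vectors (feature-level fusion)."""
    n = min(len(audio_frames), len(visual_frames))
    return [list(audio_frames[t]) + list(visual_frames[t]) for t in range(n)]


# Hypothetical data: 3 lip-motion vectors at 25 Hz, 9 MFCC-like vectors at 100 Hz.
lip = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
mfcc = [[float(t), 0.0, 1.0] for t in range(9)]

lip_100hz = interpolate_to_rate(lip, src_rate=25, dst_rate=100)
joint = fuse(mfcc, lip_100hz)
print(len(joint), "joint vectors of dimension", len(joint[0]))
```

The joint vectors produced this way would then serve as observation vectors for a sequence model such as the HMM the abstract describes; that stage is not shown here.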

Application of the mutual information minimization to speaker recognition/identification improvement

Neurocomputing, 2004

In this paper we propose the inversion of nonlinear distortions in order to improve the recognition rates of a speaker recognition system. We study the effect of saturation on the test signals, to account for real situations in which the training material has been recorded under controlled conditions but the test signals present some mismatch in input signal level (saturation). The experimental results for speaker recognition show that a combination of several strategies can improve the recognition rates with saturated test sentences from 80% to 89.39%, while the result with clean speech (without saturation) is 87.76% for one microphone. For speaker identification, the combination can reduce the minimum detection cost function with saturated test sentences from 6.42% to 4.15%, while the results with clean speech (without saturation) are 5.74% for one microphone and 7.02% for the other one.

Information optimization for speaker recognition using correlation functions

In this article, a method that optimizes the information of an analyzed signal is described. The method is applicable to various kinds of analysis and is in fact an additional process applied after the analysis. We presented the idea in a previous article. In this method, the signal is first analyzed, after which the first frame and some other frames are placed in a matrix in which each row is a frame. The similarity of this matrix with itself is then computed. This process is subsequently repeated for the second frame and the frames following it, and so on for all frames.

Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion

Interspeech 2021

It is now well established from a variety of studies that there is a significant benefit from combining video and audio data in detecting active speakers. However, either of the modalities can potentially mislead audiovisual fusion by inducing unreliable or deceptive information. This paper outlines active speaker detection as a multi-objective learning problem to leverage the best of each modality using a novel self-attention, uncertainty-based multimodal fusion scheme. Results obtained show that the proposed multi-objective learning architecture outperforms traditional approaches in improving both mAP and AUC scores. We further demonstrate that our fusion strategy surpasses, in active speaker detection, other modality fusion methods reported in various disciplines. We finally show that the proposed method significantly improves the state-of-the-art on the AVA-ActiveSpeaker dataset.