Extraction of Audio Features Specific to Speech Production for Multimodal Speaker Detection

A multimodal approach to extract optimized audio features for speaker detection

2005

We present a method that exploits the information theoretic framework described in [1] to extract audio features that are optimal with respect to the video features. A simple measure of mutual information between the resulting audio features and the video features makes it possible to detect the active speaker among several candidates. The results show that our method is able to exploit the speech information shared by the audio and video signals to recover their common source.
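As a loose illustration of the selection step described in this abstract (not the paper's optimized feature extraction), the sketch below scores each candidate's visual feature track against the audio feature track with a histogram-based mutual information estimate and picks the candidate with the highest score. Function names and the binning choice are hypothetical.

```python
# Hypothetical illustration: pick the active speaker as the candidate whose
# visual feature track shares the most mutual information with the audio track.
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of I(X;Y) in bits for two 1-D feature tracks."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def detect_speaker(audio_feat, video_feats):
    """Return the index of the candidate with maximal audio-video MI, plus all scores."""
    scores = [mutual_information(audio_feat, v) for v in video_feats]
    return int(np.argmax(scores)), scores
```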

Hypothesis testing as a performance evaluation method for multimodal speaker detection

This work addresses the problem of detecting the speaker in audio-visual sequences by evaluating the synchrony between the audio and video signals. Prior to classification, an information theoretic framework is applied to extract optimized audio features using video information. The classification step is then defined through a hypothesis testing framework so as to obtain confidence levels associated with the classifier outputs. Such an approach makes it possible to evaluate the efficiency of the whole classification process and, in particular, the advantage of performing the feature extraction or not. As a result, it is shown that introducing a feature extraction step prior to classification increases the ability of the classifier to produce good relative instance scores.
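A minimal sketch of the hypothesis-testing idea, assuming a simple correlation-based synchrony score rather than the paper's actual statistic: an empirical null distribution is built by circularly shifting the audio track, and the returned confidence is one minus the empirical p-value. Names and parameters are illustrative only.

```python
# Hypothetical sketch: score a candidate's audio-video synchrony, build an
# empirical null by shifting the audio in time, and report a confidence level.
import numpy as np

def synchrony(audio_feat, video_feat):
    """Simple synchrony score: absolute Pearson correlation of the two tracks."""
    return abs(np.corrcoef(audio_feat, video_feat)[0, 1])

def confidence(audio_feat, video_feat, n_shifts=200, rng=None):
    """Return (observed score, confidence = 1 - empirical p-value under H0)."""
    rng = rng or np.random.default_rng(0)
    observed = synchrony(audio_feat, video_feat)
    null = [synchrony(np.roll(audio_feat, rng.integers(1, len(audio_feat))), video_feat)
            for _ in range(n_shifts)]
    return observed, 1.0 - np.mean(np.array(null) >= observed)
```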

Multimodal speaker identification with audio-video processing

Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429)

In this paper we present a multimodal audiovisual speaker identification system. The objective is to improve the recognition performance over conventional unimodal schemes. The proposed system decomposes the information existing in a video stream into three components: speech, face texture and lip motion. Lip motion between successive frames is first computed in terms of optical flow vectors and then encoded as a feature vector in a magnitude-direction histogram domain. The feature vectors obtained along the whole stream are then interpolated to match the rate of the speech signal and fused with mel-frequency cepstral coefficients (MFCC) of the corresponding speech signal. The resulting joint feature vectors are used to train and test a Hidden Markov Model (HMM) based identification system. Face texture images are treated separately in the eigenface domain and integrated into the system through decision fusion. Experimental results are also included to demonstrate the system performance.
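The fusion step lends itself to a small sketch. Assuming dense lip-region optical flow fields and MFCC frames are already available (neither is computed here), the code builds a magnitude-direction histogram per video frame, interpolates the video-rate features to the audio frame rate, and concatenates the two streams; bin counts and names are hypothetical.

```python
# Hypothetical sketch of the feature-fusion step: encode lip optical flow as a
# magnitude-direction histogram per video frame, upsample to the audio frame
# rate, and concatenate with MFCCs.  Flow fields and MFCCs are assumed given.
import numpy as np

def flow_histogram(flow_x, flow_y, mag_bins=4, dir_bins=8):
    """Joint magnitude/direction histogram of a dense flow field (one frame)."""
    mag = np.hypot(flow_x, flow_y).ravel()
    ang = np.arctan2(flow_y, flow_x).ravel()  # angles in (-pi, pi]
    hist, _, _ = np.histogram2d(mag, ang, bins=[mag_bins, dir_bins],
                                range=[[0, mag.max() + 1e-9], [-np.pi, np.pi]])
    return (hist / max(hist.sum(), 1)).ravel()

def fuse(lip_hists, mfccs):
    """lip_hists: (n_video_frames, d) array; mfccs: (n_audio_frames, n_mfcc) array.
    Linearly interpolate video-rate lip features to the MFCC frame count and stack."""
    t_video = np.linspace(0, 1, len(lip_hists))
    t_audio = np.linspace(0, 1, len(mfccs))
    lip_interp = np.stack([np.interp(t_audio, t_video, lip_hists[:, d])
                           for d in range(lip_hists.shape[1])], axis=1)
    return np.hstack([mfccs, lip_interp])
```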

Audio/Video Fusion: a Preprocessing Step for Multimodal Person Identification

In the audiovisual indexing context, we propose a system for automatic association of voices and images. This association can be used as a preprocessing step for existing applications like person identification systems. We use a fusion of audio and video indexes (without any prior knowledge) in order to make the information brought by each of them more robust. If both audio and video indexes are correctly segmented, this automatic association yields excellent results. In order to deal with oversegmentation, we propose an approach which uses one index to improve the segmentation of the other. We show that the use of the audio index improves an oversegmented video index on a corpus composed of French TV broadcasts.
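A rough sketch of the "use one index to improve the other" idea under simplifying assumptions: a video segment cut is kept only when an audio boundary confirms it within a tolerance, otherwise adjacent video segments are merged. The segment representation and tolerance are illustrative, not the paper's actual procedure.

```python
# Hypothetical sketch: repair an oversegmented video index using the audio index.
# video_segs: list of (start, end) pairs in seconds; audio_boundaries: list of times.
def merge_with_audio(video_segs, audio_boundaries, tol=0.25):
    merged = [list(video_segs[0])]
    for start, end in video_segs[1:]:
        cut_supported = any(abs(b - start) <= tol for b in audio_boundaries)
        if cut_supported:
            merged.append([start, end])   # the audio index confirms this boundary
        else:
            merged[-1][1] = end           # likely oversegmentation: merge segments
    return [tuple(s) for s in merged]
```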

Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Sensors, 2019

Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique which finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network which matches audio frames with their respective visual input segments. On the basis of high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high pro...
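A hedged sketch of the clustering stage, assuming the per-segment confidences from the pre-trained synchronization model are already available: high-confidence audio frames seed one GMM per visible speaker (scikit-learn), and remaining frames are assigned by maximum likelihood. This is an illustration, not the authors' implementation.

```python
# Hypothetical sketch: audio frames aligned with high-confidence face segments
# seed one GMM per speaker; remaining frames go to the best-scoring GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(frames_per_speaker, n_components=4):
    """frames_per_speaker: {speaker_id: (n_frames, n_features) array}."""
    return {spk: GaussianMixture(n_components=n_components, covariance_type='diag').fit(x)
            for spk, x in frames_per_speaker.items()}

def assign(frames, gmms):
    """Label each audio frame with the speaker whose GMM gives the highest likelihood."""
    scores = np.stack([gmm.score_samples(frames) for gmm in gmms.values()], axis=1)
    speakers = list(gmms.keys())
    return [speakers[i] for i in np.argmax(scores, axis=1)]
```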

Speaker Detection and Applications to Cross-Modal Analysis of Planning Meetings

2009 11th IEEE International Symposium on Multimedia, 2009

Detection of meeting events is one of the most important tasks in multimodal analysis of planning meetings. Speaker detection is a key step for the extraction of the most meaningful meeting events. In this paper, we present an approach to speaker localization that combines visual and audio information in multimodal meeting analysis. When talking, people produce speech accompanied by mouth movements and hand gestures. By computing the correlation of audio signals, mouth movements, and hand motion, we detect a talking person both spatially and temporally. Three kinds of features are extracted for speaker localization: hand movements are expressed by hand motion efforts; audio features are expressed by computing 12 mel-frequency cepstral coefficients from the audio signals; and mouth movements are expressed by normalized cross-correlation coefficients of the mouth area between two successive frames. A time delay neural network is trained to learn the correlation relationships and is then applied to perform speaker localization. Experiments and applications in planning meeting environments are provided.
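Two of the cues above map naturally onto a short sketch. Assuming mouth crops and MFCC frames are precomputed, the code measures mouth movement as the drop in normalized cross-correlation between successive crops and correlates it with audio activity; the paper's time-delay neural network is replaced here by a plain Pearson correlation purely for illustration.

```python
# Hypothetical sketch of two per-frame cues: mouth movement as the normalized
# cross-correlation (NCC) between successive mouth crops, and audio activity
# from MFCC frames.  A plain correlation stands in for the paper's TDNN.
import numpy as np

def mouth_ncc(prev_crop, crop):
    """NCC of two equally sized grayscale mouth regions (high = little movement)."""
    a = prev_crop - prev_crop.mean()
    b = crop - crop.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def localization_score(mfcc_track, mouth_ncc_track):
    """Correlation between audio activity and mouth movement over time."""
    audio_activity = np.linalg.norm(mfcc_track, axis=1)       # one value per frame
    mouth_movement = 1.0 - np.asarray(mouth_ncc_track)        # invert NCC: high = moving
    return float(np.corrcoef(audio_activity, mouth_movement)[0, 1])
```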

Using information theory to detect voice activity

2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009

Voice Activity Detection systems attempt to discriminate between voice and other ambient sounds. Most systems use a single-microphone approach and rely on training prior to deployment. The performance of these systems depends heavily on reverberation and noise levels. In this paper we present an unsupervised Voice Activity Detection system that uses pairs of microphones to discern between a coherent acoustic source and spatially diffuse noise of low coherence. Coherence is measured using an information theoretic metric that incorporates means to filter out the effects of reverberation and noise more effectively. The performance of the system is investigated through extensive experiments. Under the conditions imposed by the experimental environments, it is shown that the proposed system remains more robust than its counterparts in all cases.
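A small sketch of the two-microphone idea, with the standard magnitude-squared coherence from SciPy standing in for the paper's information-theoretic metric: frames whose average inter-channel coherence exceeds a threshold are flagged as speech. Frame length and threshold are arbitrary illustrative values.

```python
# Hypothetical sketch: a coherent source (speech) raises the average coherence
# between two microphone channels well above that of spatially diffuse noise.
import numpy as np
from scipy.signal import coherence

def is_voice_active(mic1, mic2, fs, frame_len=0.5, threshold=0.6):
    """Flag each frame as speech if mean inter-channel coherence exceeds the threshold."""
    hop = int(frame_len * fs)
    decisions = []
    for start in range(0, len(mic1) - hop + 1, hop):
        x, y = mic1[start:start + hop], mic2[start:start + hop]
        _, cxy = coherence(x, y, fs=fs, nperseg=min(256, hop))
        decisions.append(bool(np.mean(cxy) > threshold))
    return decisions
```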