Robust Speaker Localization Using A Microphone Array
Related papers
Speaker localization with moving microphone arrays
2016 24th European Signal Processing Conference (EUSIPCO), 2016
Speaker localization algorithms often assume a static location for all sensors. This assumption simplifies the models used, since all acoustic transfer functions are then linear time invariant. In many applications this assumption is not valid. In this paper we address the localization challenge with moving microphone arrays. We propose two algorithms to find the speaker position. The first is a batch algorithm based on the maximum likelihood criterion, optimized via expectation-maximization iterations. The second is a particle filter for sequential Bayesian estimation. The performance of both approaches is evaluated and compared on simulated reverberant audio data from a microphone array with two sensors. * The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 609465.
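The sequential Bayesian approach described above can be illustrated with a minimal bootstrap particle filter tracking a 2-D source from a single mic-pair TDOA observation. The random-walk motion model, noise levels, and function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, tdoa_obs, mics, c=343.0,
                         motion_std=0.05, obs_std=1e-4):
    """One bootstrap-filter step: propagate, weight by the TDOA
    likelihood, resample. Noise levels here are hypothetical."""
    # Propagate particles with a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Predicted TDOA for each particle (distance difference over c).
    d0 = np.linalg.norm(particles - mics[0], axis=1)
    d1 = np.linalg.norm(particles - mics[1], axis=1)
    pred = (d0 - d1) / c
    # Reweight with a Gaussian observation likelihood and normalize.
    weights = weights * np.exp(-0.5 * ((tdoa_obs - pred) / obs_std) ** 2)
    weights /= weights.sum()
    # Multinomial resampling back to uniform weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

With a single microphone pair the position is only constrained to a hyperbola, which is why the paper's two-sensor setting makes sequential estimation (accumulating information as the array moves) attractive.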
Microphone Array Speaker Localizers Using Spatial-Temporal Information
EURASIP Journal on Advances in Signal Processing, 2006
A dual-step approach for speaker localization based on a microphone array is addressed in this paper. In the first stage, which is not the main concern of this paper, the time difference between arrivals of the speech signal at each pair of microphones is estimated. These readings are combined in the second stage to obtain the source location. In this paper, we focus on the second stage of the localization task. In this contribution, we propose to exploit the speaker's smooth trajectory for improving the current position estimate. Three localization schemes, which use the temporal information, are presented. The first is a recursive form of the Gauss method. The other two are extensions of the Kalman filter to the nonlinear problem at hand, namely, the extended Kalman filter and the unscented Kalman filter. These methods are compared with other algorithms, which do not make use of the temporal information. An extensive experimental study demonstrates the advantage of using the spatial-temporal methods. To gain some insight into the obtainable performance of the localization algorithm, an approximate analytical evaluation, verified by an experimental study, is conducted. This study shows that in common TDOA-based localization scenarios, where the microphone array has a small inter-element spread relative to the source position, the elevation and azimuth angles can be accurately estimated, whereas the Cartesian coordinates, as well as the range, are poorly estimated.
Speaker localization for microphone array-based ASR
Proceedings of the 8th international conference on Multimodal interfaces - ICMI '06, 2006
Accurate speaker location is essential for optimal performance of distant speech acquisition systems using microphone array techniques. However, to the best of our knowledge, no comprehensive studies on the degradation of automatic speech recognition (ASR) as a function of speaker location accuracy in a multi-party scenario exist. In this paper, we describe a framework for evaluation of the effects of speaker location errors on a microphone array-based ASR system, in the context of meetings in multi-sensor rooms comprising multiple cameras and microphones. Speakers are manually annotated in videos in different camera views, and triangulation is used to determine an accurate speaker location. Errors in the speaker location are then induced in a systematic manner to observe their influence on speech recognition performance. The system is evaluated on real overlapping speech data collected with simultaneous speakers in a meeting room. The results are compared with those obtained from close-talking headset microphones, lapel microphones, and speaker location based on audio-only and audiovisual information approaches.
Robust speech recognition with speaker localization by a microphone array
1996
This paper proposes robust speech recognition with Speaker Localization by an Arrayed Microphone (SLAM) to realize a hands-free speech interface in noisy environments. In order to localize the speaker direction accurately in low-SNR conditions, a speaker localization algorithm based on extracting pitch harmonics is introduced. To evaluate the performance of the proposed system, speech recognition experiments are carried out both in computer simulation and in real environments. These results show that the proposed system attains much higher speech recognition performance than a single microphone, not only in computer simulation but also in real environments.
Microphone Array Driven Speech Recognition: Influence of Localization on the Word Error Rate
Lecture Notes in Computer Science, 2006
Interest within the automatic speech recognition research community has recently focused on the recognition of speech where the microphone is located in the medium field, rather than being mounted on a headset and positioned next to the speaker's mouth, to realize the long-term goal of ubiquitous computing. This is a natural application for beamforming techniques using a microphone array. A crucial ingredient for optimal performance of beamforming techniques is the speaker location. Hence, to apply such techniques, a source localization algorithm is required. In prior work, we proposed using an extended Kalman filter to directly update position estimates in a speaker localization system based on time delays of arrival. We have also enhanced our audio localizer with video information. In this work, we investigate the influence of the speaker position on the word error rate of an automatic speech recognition system operating on the output of a beamformer, and compare this error rate with that obtained with a close-talking microphone. Moreover, we compare the effectiveness of different localization algorithms. We tested our algorithm on a data set consisting of seminars held by actual speakers. Our experiments revealed that accurate speaker tracking is crucial for minimizing the errors of a far-field speech recognition system.
Sound Source Localization by a Single Linear Moving Microphone
2014
This paper discusses the implementation of a single microphone moving along a linear track, together with a static reference microphone, to substitute for several channels of a linear microphone array in a beamforming approach to sound source localization. The single microphone moves at constant velocity from a reference point. All recorded data from the moving and the reference microphone are split into segments, each representing a discrete microphone position. By this method, the single microphone is turned into an artificial linear microphone array. The time delay for each artificial microphone position is obtained by the cross-correlation function between the signal from the moving microphone and the signal from the reference microphone. Time-domain beamforming is then performed with the delay-and-sum algorithm for a stationary microphone array. It is found that the method can predict the direction of the sound source. A shorter track and a higher microphone speed can reduce the possibility of ali...
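The time-domain delay-and-sum step described above can be sketched in a few lines. Integer sample delays and the function name are assumptions for illustration; the paper's artificial-array construction would supply one channel per discrete microphone position:

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Time-domain delay-and-sum: advance each channel by its
    integer steering delay (in samples) and average, so a source
    in the steered direction adds coherently."""
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays_samples):
        out += np.roll(sig, -d)  # advance channel by its delay
    return out / signals.shape[0]
```

Channels steered with the correct delays sum coherently to the source waveform, while signals from other directions partially cancel.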
Discriminability Measure for Microphone Array Source Localization
2012
The performance of sound source localization systems based on microphone arrays is dictated by a combination of factors that range from array, source, and environmental characteristics to the nature of the localization algorithm itself. Array geometry is one critical factor in source localizability. This paper proposes a numerical measure of the capability of a microphone array with a specific geometry to distinguish a given point in space from its neighbors. This measure, herein called the discriminability index (D), has the interesting feature of taking into account only the effects of array geometry on spatial resolution, thus providing a way of connecting a microphone array geometry to the region of interest. The proposed measure can be particularly useful in choosing an appropriate array geometry when a sound source is confined to a predefined region. Simulation results using the classic SRP-PHAT method are presented to highlight the correlation between D and the accuracy of the source location estimates.
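The abstract does not define D, so the sketch below is only a hypothetical proxy for the same idea: measure how far, in the space of pairwise TDOAs, a candidate point is from its immediate neighbors. Both function names and all parameter values are invented for illustration and are not the paper's index:

```python
import numpy as np

def tdoa_signature(point, mics, c=343.0):
    """Vector of pairwise TDOAs (seconds) a source at `point`
    would produce for the given 2-D mic positions."""
    d = np.linalg.norm(mics - point, axis=1)
    i, j = np.triu_indices(len(mics), k=1)
    return (d[i] - d[j]) / c

def discriminability_proxy(point, mics, radius=0.05, n_neighbors=64):
    """Hypothetical proxy: minimum TDOA-space distance between
    `point` and points on a small ring around it. Larger values
    mean the geometry separates the point more sharply from its
    neighborhood (purely geometric, algorithm-independent)."""
    ang = 2 * np.pi * np.arange(n_neighbors) / n_neighbors
    ring = point + radius * np.column_stack((np.cos(ang), np.sin(ang)))
    ref = tdoa_signature(point, mics)
    return min(np.linalg.norm(tdoa_signature(q, mics) - ref)
               for q in ring)
```

Such a proxy reproduces the qualitative behavior the related abstracts describe: points far from a small-aperture array are barely distinguishable from their radial neighbors, while nearby points are well separated.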
EURASIP Journal on Advances in Signal Processing, 2007
Speaker localization with microphone arrays has received significant attention in the past decade as a means for automated speaker tracking of individuals in a closed space for videoconferencing systems, directed speech capture systems, and surveillance systems. Traditional techniques are based on estimating the relative time difference of arrival (TDOA) between different channels by utilizing the cross-correlation function. As we show in the context of speaker localization, these estimates yield poor results, due to the joint effect of reverberation and the directivity of sound sources. In this paper, we present a novel method that utilizes a priori acoustic information of the monitored region, which makes it possible to localize directional sound sources by taking the effect of reverberation into account. The proposed method shows significant improvement of performance compared with traditional methods in "noise-free" conditions. Further work is required to extend its capabilities to noisy environments.
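The traditional cross-correlation TDOA estimate that the abstract argues against is commonly computed as a generalized cross-correlation; the sketch below uses the PHAT weighting, a widely used variant (not this paper's method) that whitens the spectrum to reduce, but not eliminate, reverberation-induced bias:

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """TDOA estimate via PHAT-weighted generalized cross-correlation.
    Returns t_x - t_y in seconds (negative when x arrives first)."""
    n = len(x) + len(y)            # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    r = X * np.conj(Y)
    # PHAT: keep only phase; the small constant avoids divide-by-zero.
    cc = np.fft.irfft(r / (np.abs(r) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

In strong reverberation, early reflections can still produce a correlation peak at the wrong lag, which is the failure mode motivating the room-prior approach proposed above.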
Group delay based methods for speech source localization over circular arrays
Conventional subspace-based approaches for source localization use the spectral magnitude of MUSIC. In this paper, a group delay based method for source localization of spatially close speech sources over circular arrays, with a minimal number of sensors, is proposed. This approach is based on the MUSIC-Group delay spectrum and can be used to accurately estimate both azimuth and elevation angles of spatially close sources. Both simulated and real speech signal measurements are acquired over a circular array and the DOA estimation is carried out for several trials. The accuracy of the proposed approach is illustrated by using two-dimensional scatter plots for a single source, and average error distribution plots for multiple sources. The high-resolution property of this method is explained using the additive property of the MUSIC-Group delay spectrum. The proposed method is also evaluated under sensor perturbation errors. Experiments on distant speech recognition are conducted using th...
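For reference, the conventional magnitude-spectrum MUSIC baseline that this paper improves upon can be sketched as below. This is the generic narrowband textbook form for an arbitrary planar array, not the proposed MUSIC-Group delay method, and the function name and parameters are assumptions:

```python
import numpy as np

def music_spectrum(R, array_pos, freq, thetas, n_src, c=343.0):
    """Narrowband MUSIC pseudo-spectrum over candidate azimuths.
    R: (M, M) spatial covariance; array_pos: (M, 2) mic positions;
    thetas: candidate azimuths in radians; n_src: source count."""
    # Noise subspace: eigenvectors beyond the n_src largest
    # eigenvalues (eigh returns eigenvalues in ascending order).
    _, v = np.linalg.eigh(R)
    En = v[:, : R.shape[0] - n_src]
    k = 2 * np.pi * freq / c
    p = np.empty(len(thetas))
    for i, th in enumerate(thetas):
        u = np.array([np.cos(th), np.sin(th)])
        a = np.exp(-1j * k * array_pos @ u)      # steering vector
        denom = np.linalg.norm(En.conj().T @ a) ** 2
        p[i] = 1.0 / max(denom, 1e-12)           # peak at true DOA
    return p
```

Peaks appear where the steering vector is orthogonal to the noise subspace; the group-delay variant proposed above instead exploits the phase of this spectrum to sharpen closely spaced peaks.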