Data-Driven Multi-Microphone Speaker Localization on Manifolds

Semi-Supervised Source Localization on Multiple Manifolds With Distributed Microphones

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017

The problem of source localization with ad hoc microphone networks in noisy and reverberant enclosures, given a training set of prerecorded measurements, is addressed in this paper. The training set is assumed to consist of a limited number of labelled measurements, each annotated with its corresponding position, and a larger number of unlabelled measurements from unknown locations. Microphone calibration, however, is not required. We use a Bayesian inference approach for estimating a function that maps measurement-based feature vectors to the corresponding positions. The central issue is how to combine the information provided by the different microphones in a unified statistical framework. To address this challenge, we model this function using a Gaussian process with a covariance function that encapsulates both the connections between pairs of microphones and the relations among the samples in the training set. The parameters of the process are estimated by optimizing a maximum likelihood (ML) criterion. In addition, a recursive adaptation mechanism is derived in which new streaming measurements are used to update the model. Performance is demonstrated for 2-D localization of both simulated data and real-life recordings under a variety of reverberation and noise levels.
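
To make the regression step concrete, here is a minimal Gaussian-process sketch that maps feature vectors to 2-D positions. It is an illustration only: the RBF kernel, the toy feature dimension, and all parameter values are assumptions of the sketch, not the paper's microphone-pair covariance construction.

```python
import numpy as np

def rbf_kernel(X, Y, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of feature vectors."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise=1e-2):
    """GP posterior mean for each coordinate of the source position."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)
    alpha = np.linalg.solve(K, y_train)   # (n_train, 2)
    return K_star @ alpha                 # predicted 2-D positions

# Toy usage: 50 labelled feature vectors (stand-ins for measurement features)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))          # assumed 8-dim features
y_train = rng.uniform(0, 4, size=(50, 2))   # known 2-D positions (labels)
X_test = rng.normal(size=(5, 8))
print(gp_predict(X_train, y_train, X_test))
```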

Speaker localization with moving microphone arrays

2016 24th European Signal Processing Conference (EUSIPCO), 2016

Speaker localization algorithms often assume static locations for all sensors. This assumption simplifies the models used, since all acoustic transfer functions are then linear time-invariant. In many applications this assumption is not valid. In this paper we address the localization challenge with moving microphone arrays. We propose two algorithms to find the speaker position. The first is a batch algorithm based on the maximum likelihood criterion, optimized via expectation-maximization iterations. The second is a particle filter for sequential Bayesian estimation. The performance of both approaches is evaluated and compared on simulated reverberant audio data from a microphone array with two sensors. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 609465.
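
As a rough illustration of the sequential Bayesian approach, below is a minimal bootstrap particle filter step for 2-D tracking from a single TDOA reading. The random-walk motion model, the Gaussian measurement likelihood, and all noise values are assumptions of this sketch, not the paper's models.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def tdoa(pos, mic1, mic2):
    """TDOA between two microphones for source positions pos, shape (N, 2)."""
    return (np.linalg.norm(pos - mic1, axis=-1)
            - np.linalg.norm(pos - mic2, axis=-1)) / C

def particle_filter_step(particles, weights, z, mic1, mic2,
                         motion_std=0.05, meas_std=1e-4):
    """One predict/update/resample cycle of a bootstrap particle filter."""
    # Predict: random-walk motion model (an assumption of this sketch).
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Update: Gaussian likelihood of the observed TDOA z, computed in the
    # log domain for numerical safety.
    resid = z - tdoa(particles, mic1, mic2)
    log_w = np.log(weights) - 0.5 * (resid / meas_std) ** 2
    log_w -= log_w.max()
    weights = np.exp(log_w)
    weights /= weights.sum()
    # Resample when the effective sample size drops below half.
    if 1.0 / np.sum(weights ** 2) < len(weights) / 2:
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights

# Position estimate at each step: (weights[:, None] * particles).sum(axis=0)
```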

MAVL: Multiresolution Analysis of Voice Localization

2021

The ability of a smart speaker to localize a user by his/her voice opens the door to many new applications. In this paper, we present a novel system, MAVL, to localize human voice. It consists of three major components: (i) We first develop a novel multi-resolution analysis to estimate the AoA of time-varying low-frequency coherent voice signals coming from multiple propagation paths; (ii) We then automatically estimate the room structure by emitting acoustic signals and developing an improved 3D MUSIC algorithm; (iii) We finally re-trace the paths using the estimated AoA and room structure to localize the voice. We implement a prototype system using a single speaker and a uniform circular microphone array. Our results show that it achieves median errors of 1.49° and 3.33° for the top two AoA estimates, and median localization errors of 0.31 m in line-of-sight (LoS) cases and 0.47 m in non-line-of-sight (NLoS) cases.
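
For orientation, here is a textbook narrowband MUSIC pseudospectrum for a uniform linear array. It is far simpler than the paper's improved 3D MUSIC on a circular array; the array geometry and parameters below are assumptions of the sketch.

```python
import numpy as np

def music_spectrum(R, n_sources, mic_spacing, freq, angles_deg, c=343.0):
    """Narrowband MUSIC pseudospectrum for a uniform linear array.

    R : (n_mics, n_mics) spatial covariance of the microphone signals."""
    n_mics = R.shape[0]
    # Eigen-decomposition; eigh sorts eigenvalues in ascending order,
    # so the first columns span the noise subspace.
    _, eigvecs = np.linalg.eigh(R)
    En = eigvecs[:, : n_mics - n_sources]
    spectrum = []
    for theta in np.deg2rad(angles_deg):
        # Steering vector of a far-field plane wave arriving from theta.
        delays = np.arange(n_mics) * mic_spacing * np.sin(theta) / c
        a = np.exp(-2j * np.pi * freq * delays)
        spectrum.append(1.0 / np.abs(a.conj() @ En @ En.conj().T @ a))
    return np.array(spectrum)  # peaks indicate candidate AoAs
```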

Microphone Array Speaker Localizers Using Spatial-Temporal Information

EURASIP Journal on Advances in Signal Processing, 2006

A dual-step approach for speaker localization based on a microphone array is addressed in this paper. In the first stage, which is not the main concern of this paper, the time difference between arrivals of the speech signal at each pair of microphones is estimated. These readings are combined in the second stage to obtain the source location. In this paper, we focus on the second stage of the localization task. In this contribution, we propose to exploit the speaker's smooth trajectory to improve the current position estimate. Three localization schemes, which use the temporal information, are presented. The first is a recursive form of the Gauss method. The other two are extensions of the Kalman filter to the nonlinear problem at hand, namely, the extended Kalman filter and the unscented Kalman filter. These methods are compared with other algorithms that do not make use of the temporal information. An extensive experimental study demonstrates the advantage of using the spatial-temporal methods. To gain some insight into the obtainable performance of the localization algorithm, an approximate analytical evaluation, verified by an experimental study, is conducted. This study shows that in common TDOA-based localization scenarios, where the microphone array has a small interelement spread relative to the source position, the elevation and azimuth angles can be accurately estimated, whereas the Cartesian coordinates, as well as the range, are poorly estimated.
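
To illustrate the nonlinear filtering involved, here is a minimal extended Kalman filter measurement update for a single TDOA observation in 2-D. The constant-position state model and the noise variance are assumptions of the sketch, not the paper's exact formulation.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def ekf_tdoa_update(x, P, z, mic1, mic2, meas_var=1e-8):
    """One EKF measurement update with a single TDOA observation.

    x : (2,) current position estimate, P : (2, 2) its covariance,
    z : measured TDOA (seconds) between mic1 and mic2."""
    d1, d2 = x - mic1, x - mic2
    r1, r2 = np.linalg.norm(d1), np.linalg.norm(d2)
    h = (r1 - r2) / C                        # predicted TDOA
    H = ((d1 / r1 - d2 / r2) / C)[None, :]   # Jacobian of h w.r.t. position
    S = H @ P @ H.T + meas_var               # innovation covariance
    K = P @ H.T / S                          # Kalman gain, (2, 1)
    x = x + (K * (z - h)).ravel()            # corrected position
    P = (np.eye(2) - K @ H) @ P              # corrected covariance
    return x, P
```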

Speaker localization for microphone array-based ASR

Proceedings of the 8th international conference on Multimodal interfaces - ICMI '06, 2006

Accurate speaker location is essential for optimal performance of distant speech acquisition systems using microphone array techniques. However, to the best of our knowledge, no comprehensive studies on the degradation of automatic speech recognition (ASR) as a function of speaker location accuracy in a multi-party scenario exist. In this paper, we describe a framework for evaluation of the effects of speaker location errors on a microphone array-based ASR system, in the context of meetings in multi-sensor rooms comprising multiple cameras and microphones. Speakers are manually annotated in videos in different camera views, and triangulation is used to determine an accurate speaker location. Errors in the speaker location are then induced in a systematic manner to observe their influence on speech recognition performance. The system is evaluated on real overlapping speech data collected with simultaneous speakers in a meeting room. The results are compared with those obtained from close-talking headset microphones, lapel microphones, and speaker location based on audio-only and audiovisual information approaches.

Multiple Speaker Localization using Mixture of Gaussian Model with Manifold-based Centroids

2020 28th European Signal Processing Conference (EUSIPCO)

A data-driven approach for localizing multiple speakers in reverberant enclosures is presented. The approach combines semi-supervised learning on multiple manifolds with unsupervised maximum likelihood estimation. The relative transfer functions (RTFs), which are known to be related to source positions, are used in both stages of the proposed algorithm as feature vectors. The microphone positions are not known. In the training stage, a nonlinear, manifold-based mapping between RTFs and source locations is inferred using single-speaker utterances. The inference procedure utilizes two RTF datasets: a small set of RTFs with their associated position labels, and a large set of unlabelled RTFs. This mapping is used to generate a dense grid of localized sources that serve as the centroids of a Mixture of Gaussians (MoG) model, which is used in the test stage of the algorithm to cluster RTFs extracted from multiple-speaker utterances. Clustering is performed by applying the expectation-maximization (EM) procedure, relying on the sparsity and intermittency of the speech signals. A preliminary experimental study, with either two or three overlapping speakers at various reverberation levels, demonstrates that the proposed scheme achieves high localization accuracy compared to a baseline method using a simpler propagation model.
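
A compact sketch of the clustering idea appears below: EM over a Gaussian mixture whose means are held fixed on a grid of pre-localized centroids, so only the mixture weights are re-estimated. The isotropic covariance and the data shapes are assumptions of the sketch, not the paper's parameterization.

```python
import numpy as np

def em_fixed_centroids(X, centroids, sigma2=0.1, n_iter=50):
    """EM for a Mixture of Gaussians whose means (centroids) are fixed,
    as when the centroids come from a pre-localized grid of sources."""
    n, k = len(X), len(centroids)
    weights = np.full(k, 1.0 / k)
    # Squared distances between samples and centroids, shape (n, k).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    log_lik = -0.5 * d2 / sigma2
    for _ in range(n_iter):
        # E-step: posterior responsibility of each centroid per sample.
        log_r = np.log(weights) + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixture weights only (means stay on the grid).
        weights = r.mean(axis=0)
    return weights, r  # dominant weights indicate active source positions
```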

Keyword Based Speaker Localization: Localizing a Target Speaker in a Multi-speaker Environment

Interspeech 2018

Speaker localization is a hard task, especially in adverse environmental conditions involving reverberation and noise. In this work we introduce the new task of localizing the speaker who uttered a given keyword, e.g., the wake-up word of a distant-microphone voice command system, in the presence of overlapping speech. We employ a convolutional neural network based localization system and investigate multiple identifiers as additional inputs to the system in order to characterize this speaker. We conduct experiments using ground-truth identifiers, which are obtained assuming the availability of clean speech, and also under realistic conditions where the identifiers are computed from the corrupted speech. We find that the identifier consisting of the ground-truth time-frequency mask corresponding to the target speaker provides the best localization performance, and we propose methods to estimate such a mask in adverse reverberant and noisy conditions using the considered keyword.
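
To show what such an identifier can look like, here is a sketch that computes an ideal-ratio-style time-frequency mask for the target speaker from separately available clean signals. The mask formula and STFT settings are common conventions assumed for illustration, not necessarily the paper's exact definition.

```python
import numpy as np
from scipy.signal import stft

def target_tf_mask(target, interference, fs=16000, nperseg=512):
    """Ideal-ratio-style time-frequency mask for the target speaker.

    Both signals are assumed time-aligned and of equal length."""
    _, _, S_t = stft(target, fs=fs, nperseg=nperseg)
    _, _, S_i = stft(interference, fs=fs, nperseg=nperseg)
    p_t, p_i = np.abs(S_t) ** 2, np.abs(S_i) ** 2
    return p_t / (p_t + p_i + 1e-12)  # ~1 where the target dominates

# The mask can then be stacked with the multichannel input features as an
# extra channel identifying the keyword speaker.
```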

Source tracking using moving microphone arrays for robot audition

2017

Intuitive spoken dialogues are a prerequisite for human-robot interaction. In many practical situations, robots must be able to identify and focus on sources of interest in the presence of interfering speakers. Techniques such as spatial filtering and blind source separation are therefore often used, but they rely on accurate knowledge of the source location. In practice, sound emitted in enclosed environments is subject to reverberation and noise. Hence, sound source localization must be robust both to diffuse noise due to late reverberation and to spurious detections due to early reflections. For improved robustness against reverberation, this paper proposes a novel approach for sound source tracking that constructively exploits the spatial diversity of a microphone array installed in a moving robot. In previous work, we developed speaker localization approaches based on expectation-maximization (EM) and on Bayesian estimation. In this paper we propose to combine the EM and Bayesian approaches in a single framework for improved robustness against reverberation and noise.

Localization of multiple speakers based on a two step acoustic map analysis

2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008

An interface for distant-talking control of home devices requires the possibility of identifying the positions of multiple users. Acoustic maps, based either on the Global Coherence Field (GCF) or on the Oriented Global Coherence Field (OGCF), have already been exploited successfully to determine the position and head orientation of a single speaker. This paper proposes a new method using acoustic maps to deal with the case of two simultaneous speakers. The method is based on a two-step analysis of a coherence map: first the dominant speaker is localized; then the map is modified by compensating for the effects due to the first speaker, and the position of the second speaker is detected. Simulations were carried out to show how an appropriate analysis of OGCF and GCF maps allows one to localize both speakers. Experiments proved the effectiveness of the proposed solution in a linear microphone array setup.
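
The two-step idea can be sketched on a precomputed coherence map as follows. The simple circular suppression of the first peak is a crude stand-in, assumed for illustration, for the paper's model-based compensation of the dominant speaker's contribution.

```python
import numpy as np

def two_step_peaks(gcf_map, suppress_radius=5):
    """Locate two speakers on a 2-D coherence (GCF) map: find the dominant
    peak, null its neighborhood, then find the second peak."""
    first = np.unravel_index(np.argmax(gcf_map), gcf_map.shape)
    compensated = gcf_map.copy()
    yy, xx = np.ogrid[: gcf_map.shape[0], : gcf_map.shape[1]]
    near = (yy - first[0]) ** 2 + (xx - first[1]) ** 2 <= suppress_radius ** 2
    compensated[near] = compensated.min()  # crude stand-in for compensation
    second = np.unravel_index(np.argmax(compensated), gcf_map.shape)
    return first, second
```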

Sound Source Localization with Non-calibrated Microphones

Lecture Notes in Computer Science, 2008

We propose a new method for localizing a sound source in a known space with non-calibrated microphones. Our method does not need the accurate positions of the microphones that are required by traditional sound source localization. It can also accommodate a wide variety of microphone layouts in a large space, because no calibration step is needed when installing the microphones. After a number of sampling points have been stored in a database, our system can estimate the nearest sampling point of a sound by utilizing the set of time delays between microphone pairs. We conducted a simulation experiment to determine the microphone layout that maximizes localization accuracy. We also conducted a preliminary experiment in a real environment and obtained promising results.
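
The lookup itself reduces to nearest-neighbor matching of time-delay fingerprints, sketched below; the Euclidean metric and the array shapes are assumptions of the sketch.

```python
import numpy as np

def nearest_sampling_point(tdoas, db_tdoas, db_positions):
    """Return the stored sampling point whose time-delay fingerprint is
    closest to the observed vector of pairwise TDOAs.

    tdoas        : (P,) observed pairwise time delays
    db_tdoas     : (M, P) stored fingerprints, one row per sampling point
    db_positions : (M, d) coordinates of the stored sampling points"""
    dists = np.linalg.norm(db_tdoas - tdoas, axis=1)
    return db_positions[np.argmin(dists)]
```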