Jordi Sanchezriera - Academia.edu
Papers by Jordi Sanchezriera
ICMI Workshop on Multimodal Corpora for Machine Learning, 2011
The Ravel database. Abstract: In this paper, we introduce the publicly available Ravel data set. All scenarios were recorded using the audio-visual robotic head Popeye, equipped with two cameras and four microphones. The recording environment was a regular meeting room that combines all the challenges of a natural indoor scene. The acquisition setup is fully detailed, as is the design of the scenarios. Two examples of use of the data set are provided, demonstrating the usability of the Ravel data set. Since the current trend is to design robots able to interact with unconstrained environments, this data set provides several scenarios for testing algorithms and methods that aim to satisfy these design constraints. The data set is publicly available at: http://ravel.humavips.eu/ Keywords: human-machine interaction, database, audio-visual.
Proceedings of the 14th ACM international conference on Multimodal interaction - ICMI '12, 2012
This paper addresses the problem of audiovisual command recognition in the framework of the D-META Grand Challenge. Temporal and non-temporal learning models are trained on visual and auditory descriptors. In order to set a proper baseline, the methods are tested on the "Robot Gestures" scenario of the publicly available RAVEL data set, following the leave-one-out cross-validation strategy. The classification-level audiovisual fusion strategy compensates for the errors of the unimodal (audio or vision) classifiers. The obtained results (an average audiovisual recognition rate of almost 80%) encourage us to investigate how to further develop and improve the methodology described in this paper.
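The classification-level fusion mentioned in this abstract can be illustrated with a minimal sketch. It assumes each unimodal classifier outputs a per-command posterior and that the two are combined with a weighted sum; the abstract does not state the actual fusion rule, so the function name `fuse_posteriors`, the weight and the command labels are all hypothetical.

```python
from typing import Dict, Tuple


def fuse_posteriors(
    audio: Dict[str, float],
    vision: Dict[str, float],
    audio_weight: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Late (classification-level) fusion by weighted sum of class posteriors.

    Assumption: a weighted sum is used here for illustration only; it is
    not necessarily the rule used in the paper.
    """
    labels = audio.keys() & vision.keys()
    if not labels:
        raise ValueError("the two classifiers share no class label")
    fused = {
        c: audio_weight * audio[c] + (1.0 - audio_weight) * vision[c]
        for c in labels
    }
    # The decision is the class with the highest fused score.
    best = max(fused, key=fused.get)
    return best, fused


# Hypothetical scores: the audio classifier is wrong, vision is confident.
audio_scores = {"hello": 0.6, "stop": 0.4}
vision_scores = {"hello": 0.2, "stop": 0.8}
decision, scores = fuse_posteriors(audio_scores, vision_scores)
```

In this made-up example the audio classifier alone would pick "hello", while the fused scores favour "stop", which is the kind of compensation between modalities the abstract refers to.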
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
The problem of choosing a classifier for audiovisual command recognition is addressed. Because such commands are culture- and user-dependent, methods need to learn new commands from a few examples. We benchmark three state-of-the-art discriminative classifiers based on bag of words and SVM. The comparison is made on monocular and monaural recordings of a publicly available dataset. We seek the best trade-off between speed, robustness and size of the training set. In the light of over 150,000 experiments, we conclude that this is a promising direction of work towards a flexible methodology that must be easily adaptable to a large variety of users.
2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012), 2012
In this paper we address the problem of audiovisual speaker detection. We introduce an online system working on the humanoid robot NAO. The scene is perceived with two cameras and two microphones. A multimodal Gaussian mixture model (mGMM) fuses the information extracted from the auditory and visual sensors and detects the most probable audiovisual object, e.g., a person emitting a sound, in the 3D space. The system is implemented on top of a platform-independent middleware and is able to process the information online (17 Hz). A detailed description of the system and its implementation is provided, with special emphasis on the online processing issues and the proposed solutions. Experimental validation, performed with five different scenarios, shows that the proposed method opens the door to robust human-robot interaction scenarios.
Recent works have shown that the 3D shape of non-rigid surfaces can be accurately retrieved from a single image given a set of 3D-to-2D correspondences between that image and another one for which the shape is known. However, existing approaches assume that such correspondences can be readily established, which is not necessarily true when large deformations produce significant appearance changes between the input and the reference images. Furthermore, it is either assumed that the pose of the camera is known, or the estimated solution is pose-ambiguous. In this paper we relax all these assumptions and, given a set of 3D and 2D unmatched points, we present an approach to simultaneously solve their correspondences, compute the camera pose and retrieve the shape of the surface in the input image. This is achieved by introducing weak priors on the pose and shape that we model as Gaussian Mixtures. By combining them into a Kalman filter we can progressively reduce the number of 2D candidates that can be potentially matched to each 3D point, while pose and shape are refined. This lets us perform a complete and efficient exploration of the solution space and retain the best solution.
The introduction of active (pan-tilt-zoom or PTZ) cameras in Smart Rooms in addition to fixed static cameras makes it possible to improve resolution in volumetric reconstruction, adding the capability to track smaller objects with higher precision in actual 3D world coordinates. To accomplish this goal, precise camera calibration data should be available for any pan, tilt, and zoom settings of each PTZ camera. The PTZ calibration method proposed in this paper introduces a novel solution to the problem of computing extrinsic and intrinsic parameters for active cameras. We first determine the rotation center of the camera expressed under an arbitrary world coordinate origin. Then, we obtain an equation relating any rotation of the camera with the movement of the principal point to define extrinsic parameters for any value of pan and tilt. Once this position is determined, we compute how intrinsic parameters change as a function of zoom. We validate our method by evaluating the re-projection error and its stability for points inside and outside the calibration set.
Stereo matching is a challenging problem, especially in the presence of noise or of weakly textured objects. Using temporal information in a binocular video sequence to increase the discriminability for matching has been introduced in the recent past, but all the proposed methods assume either constant disparity over time, or small object motions, which is not always true. We introduce a novel stereo algorithm that exploits temporal information by robustly aggregating a similarity statistic over time, in order to improve the matching accuracy for weak data, while preserving regions undergoing large motions without introducing artifacts.
2013 13th IEEE-RAS International Conference on Humanoid Robots, Sep 12, 2013
In this paper we present a method for detecting and localizing an active speaker, i.e., a speaker that emits a sound, through the fusion between visual reconstruction with a stereoscopic camera pair and sound-source localization with several microphones. Both the cameras and the microphones are embedded into the head of a humanoid robot. The proposed statistical fusion model associates 3D faces of potential speakers with 2D sound directions. The paper has two contributions: (i) a method that discretizes the two-dimensional space of all possible sound directions and that accumulates evidence for each direction by estimating the time difference of arrival (TDOA) over all the microphone pairs, such that all the microphones are used simultaneously and symmetrically, and (ii) an audio-visual alignment method that maps 3D visual features onto 2D sound directions and onto TDOAs between microphone pairs. This makes it possible to implicitly represent both sensing modalities in a common audiovisual coordinate frame. Using simulated as well as real data, we quantitatively assess the robustness of the method against noise and reverberations, and we compare it with several other methods. Finally, we describe a real-time implementation of the proposed technique on a humanoid head embedding four microphones and two cameras: this enables natural human-robot interactive behavior.
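To make the link between a 2D sound direction and a TDOA concrete, here is a minimal sketch under a far-field (plane-wave) assumption, which the abstract does not state and which may differ from the model used in the paper. The function names, microphone coordinates and the speed of sound value are hypothetical choices for illustration.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def direction(azimuth: float, elevation: float) -> Vec3:
    """Unit vector pointing from the microphone array towards the source."""
    return (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )


def tdoa(mic_1: Vec3, mic_2: Vec3, azimuth: float, elevation: float) -> float:
    """Arrival time at mic_1 minus arrival time at mic_2, in seconds.

    Far-field assumption: the wavefront is a plane, so a microphone that
    lies further along the source direction receives the sound earlier.
    """
    u = direction(azimuth, elevation)
    diff = tuple(b - a for a, b in zip(mic_1, mic_2))  # mic_2 - mic_1
    return sum(d * c for d, c in zip(diff, u)) / SPEED_OF_SOUND


# Hypothetical pair of microphones 20 cm apart along the x axis.
left, right = (0.1, 0.0, 0.0), (-0.1, 0.0, 0.0)
delay_front = tdoa(left, right, azimuth=0.0, elevation=0.0)
delay_side = tdoa(left, right, azimuth=math.pi / 2, elevation=0.0)
```

Evaluating such a function over a grid of (azimuth, elevation) values and over all microphone pairs gives the predicted TDOAs against which measured delays could be compared, in the spirit of contribution (i).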
CVPR 2011, 2011
A simple seed growing algorithm for estimating scene flow in a stereo setup is presented. Two calibrated and synchronized cameras observe a scene and output a sequence of image pairs. The algorithm simultaneously computes a disparity map between the image pairs and optical flow maps between consecutive images. This, together with calibration data, is an equivalent representation of the 3D scene flow, i.e. a 3D velocity vector is associated with each reconstructed point. The proposed method starts from correspondence seeds and propagates these correspondences to their neighborhood. It is accurate for complex scenes with large motions and produces temporally coherent stereo disparity and optical flow results. The algorithm is fast due to inherent search space reduction. An explicit comparison with recent methods of spatiotemporal stereo and variational optical and scene flow is provided.
Proceedings of the 5th ACM Multimedia Systems Conference - MMSys '14, 2014
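The equivalence between disparity plus optical flow and 3D scene flow can be sketched as follows. This is an illustration only, assuming a rectified stereo pair with a pinhole model; the parameter names (`focal_px`, `baseline_m`, `cx`, `cy`) and all numeric values are hypothetical and do not come from the paper.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, disparity: float,
                focal_px: float, baseline_m: float,
                cx: float, cy: float) -> Point3D:
    """3D point for pixel (u, v) with the given disparity (rectified stereo)."""
    if disparity <= 0:
        raise ValueError("disparity must be positive")
    z = focal_px * baseline_m / disparity
    return ((u - cx) * z / focal_px, (v - cy) * z / focal_px, z)


def scene_flow(u: float, v: float, disp_t: float,
               flow_u: float, flow_v: float, disp_t1: float,
               focal_px: float, baseline_m: float,
               cx: float, cy: float, dt: float) -> Point3D:
    """3D velocity from disparity at t, optical flow, and disparity at t+1."""
    p0 = backproject(u, v, disp_t, focal_px, baseline_m, cx, cy)
    # The optical flow tells where the same surface point appears at t+1.
    p1 = backproject(u + flow_u, v + flow_v, disp_t1,
                     focal_px, baseline_m, cx, cy)
    return tuple((b - a) / dt for a, b in zip(p0, p1))


# Hypothetical values: a point 5 m away moving sideways between two frames.
velocity = scene_flow(u=320, v=240, disp_t=10.0,
                      flow_u=10.0, flow_v=0.0, disp_t1=10.0,
                      focal_px=500.0, baseline_m=0.1,
                      cx=320.0, cy=240.0, dt=0.1)
```

With these made-up numbers the point stays at constant depth and moves along the x axis, so only the first velocity component is non-zero.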
We present LaRED, a Large RGB-D Extensible hand gesture Dataset, recorded with Intel's newly developed short-range depth camera. This dataset is unique and differs from the existing ones in several aspects. Firstly, the large volume of data recorded: 243,000 tuples, where each tuple is composed of a color image, a depth image, and a mask of the hand region. Secondly, the number of different classes provided: a total of 81 classes (27 gestures in 3 different rotations). Thirdly, the extensibility of the dataset: the software used to record and inspect the dataset is also available, giving future users the possibility to increase the amount of data as well as the number of gestures. Finally, in this paper, some experiments are presented to characterize the dataset and establish a baseline as the starting point to develop more complex recognition algorithms. The LaRED dataset is publicly available at: http://mclab.citi.sinica.edu.tw/dataset/lared/lared.html.
Pattern Recognition Letters, 2010
This paper shows that Hidden Markov Models (HMMs) can be effectively applied to 3D face data. The examined HMM techniques are shown to be superior to a previously examined Gaussian Mixture Model (GMM) technique. Experiments conducted on the Face Recognition Grand Challenge database show that the Equal Error Rate can be reduced from 0.88% for the GMM technique to 0.36% for the best HMM approach.
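For readers unfamiliar with the Equal Error Rate quoted above, the following minimal sketch shows one common way to estimate it from verification scores: sweep a decision threshold and take the point where the false acceptance rate and the false rejection rate are closest. The function name and the score lists are hypothetical, and this is not the evaluation code of the paper.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float],
                     impostor: Sequence[float]) -> float:
    """Approximate EER, assuming a higher score means a more likely match."""
    best_gap = None
    eer = 0.0
    for threshold in sorted(set(genuine) | set(impostor)):
        # Impostors accepted at this threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        # Genuine users rejected at this threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


# Hypothetical scores with one overlapping genuine/impostor pair.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

On this toy data one genuine score falls below one impostor score, so both error rates meet at 25%; the 0.88% and 0.36% figures in the abstract correspond to far better separated score distributions.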