David Demirdjian - Profile on Academia.edu
Papers by David Demirdjian
We present a new approach to multi-signal gesture recognition that attends to simultaneous body and hand movements. The system examines temporal sequences of dual-channel input signals, obtained via statistical inference, that indicate 3D body pose and hand pose. Learning gesture patterns from these signals can be challenging due to long-range temporal dependencies and a low signal-to-noise ratio (SNR). We incorporate a Gaussian temporal-smoothing kernel into the inference framework, capturing long-range temporal dependencies and increasing the SNR efficiently. An extensive set of experiments allows us to (1) show that combining body and hand signals significantly improves recognition accuracy; (2) report which features of body and hands are most informative; and (3) show that Gaussian temporal smoothing significantly improves gesture recognition accuracy.
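The Gaussian temporal smoothing described above amounts to convolving each feature channel with a Gaussian kernel along the time axis. A minimal sketch, with illustrative `radius` and `sigma` values rather than the paper's settings:

```python
import math

def gaussian_kernel(radius, sigma):
    """Discrete Gaussian weights over [-radius, radius], normalized to sum to 1."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(w)
    return [x / s for x in w]

def smooth(signal, radius=2, sigma=1.0):
    """Convolve a 1-D feature channel with a Gaussian kernel.
    Near the edges, only in-range samples contribute and the weights
    are renormalized over that partial window."""
    kernel = gaussian_kernel(radius, sigma)
    out = []
    n = len(signal)
    for t in range(n):
        acc, norm = 0.0, 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = t + k
            if 0 <= j < n:
                acc += w * signal[j]
                norm += w
        out.append(acc / norm)
    return out
```

Applied per channel to the pose feature streams, this suppresses per-frame estimation noise while letting each output frame aggregate evidence from its temporal neighborhood.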
Recognizing events with temporal random forests
In this paper, we present a novel technique for classifying multimodal temporal events. Our main contribution is the introduction of temporal random forests (TRFs), an extension of random forests (and decision trees in general) to the time domain. The approach is relatively simple and able to discriminatively learn event classes while performing feature selection implicitly. We describe our ongoing research and present experiments on gesture and audio-visual speech recognition datasets, comparing our method against state-of-the-art algorithms.
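One common way to extend tree ensembles to the time domain is to let each candidate split look at a randomly placed, randomly sized temporal window of the sequence. The sketch below shows only that windowing idea (the probe geometry and summary statistic are assumptions, not the paper's exact construction); a standard random forest would then split on the resulting fixed-length vector:

```python
import random
import statistics

def temporal_features(sequence, n_probes=8, seed=0):
    """Sample random (start, width) windows from a sequence and summarize
    each with its mean, producing a fixed-length feature vector from a
    variable-length signal. The window geometry is the 'temporal' part;
    the forest consuming these features is an ordinary random forest."""
    rng = random.Random(seed)
    n = len(sequence)
    feats = []
    for _ in range(n_probes):
        width = rng.randint(1, max(1, n // 2))
        start = rng.randint(0, n - width)
        feats.append(statistics.fmean(sequence[start:start + width]))
    return feats
```

Because each probe sees a different temporal extent, splits on these features implicitly select which parts of the event timeline are discriminative.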
Virtual Reality, Jul 12, 2005
Humans use a combination of gesture and speech to interact with objects, and usually do so more naturally without holding a device or pointer. We present a system that incorporates user body-pose estimation, gesture recognition and speech recognition for interaction in virtual-reality environments. We describe a vision-based method for tracking the pose of a user in real time and introduce a technique that provides parameterized gesture recognition. More precisely, we train a support vector classifier to model the boundary of the space of possible gestures, and train hidden Markov models (HMMs) on specific gestures. Given a sequence, we find the start and end of each gesture with the support vector classifier, and find gesture likelihoods and parameters with an HMM. A multimodal recognition process uses rank-order fusion to merge speech and vision hypotheses. Finally, we describe the use of our multimodal framework in a virtual-world application that allows users to interact using gestures and speech.
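Rank-order fusion can be sketched as summing each hypothesis's rank across the two modality-specific ranked lists and re-sorting by the total. This is one common reading of the scheme, not necessarily the paper's exact formulation:

```python
def rank_order_fusion(speech_ranked, vision_ranked):
    """Merge two ranked hypothesis lists by summing each hypothesis's rank
    in both lists; a hypothesis absent from one list receives a worst-case
    rank there. The lowest combined rank wins."""
    worst = max(len(speech_ranked), len(vision_ranked))

    def rank(lst, h):
        return lst.index(h) if h in lst else worst

    candidates = set(speech_ranked) | set(vision_ranked)
    return sorted(candidates,
                  key=lambda h: rank(speech_ranked, h) + rank(vision_ranked, h))
```

Working on ranks rather than raw scores sidesteps the problem that HMM likelihoods and speech-recognizer confidences live on incomparable scales.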
We present a unified framework for body and hand tracking whose output can be used for understanding simultaneously performed body-and-hand gestures. The framework uses a stereo camera to collect 3D images and tracks body and hands together, combining various existing techniques to make tracking efficient. In addition, we introduce a multi-signal gesture database: the NATOPS aircraft handling signals. Unlike previous gesture databases, this data requires knowledge of both body and hand to distinguish gestures. It is also focused on a clearly defined gesture vocabulary from a real-world scenario, refined over many years. The database includes 24 body-and-hand gestures and provides both gesture video clips and the body and hand features we extracted.
Markerless motion capture techniques to facilitate rehabilitation intervention
Automatic detection of communication errors in conversational systems has been explored extensively in the speech community. However, most previous studies have used only acoustic cues. Visual information has also been used to improve speech recognition in dialogue systems, but only while the speaker is communicating vocally. A recent perceptual study indicated that human observers can detect communication problems when they see video footage of the speaker during the system's reply. In this paper, we present work in progress toward a communication-error detector that exploits this visual cue. In datasets we collected or acquired, facial motion features and head poses were estimated while users listened to the system response, then passed to a classifier that detects communication errors. Preliminary experiments demonstrate that the speaker's visual information during the system's reply is potentially useful, with automatic detection accuracy close to human performance.
Navigating virtual environments usually requires a wired interface, game console, or keyboard. The advent of perceptual interface techniques allows a new option: the passive and untethered sensing of users' pose and gesture, allowing them to maneuver through and manipulate virtual worlds. We describe new algorithms for interacting with 3-D environments using real-time articulated body tracking with standard cameras and personal computers. Our method is based on rigid stereo-motion estimation algorithms and uses a linear technique for enforcing articulation constraints. With our tracking system, users can navigate virtual environments using 3-D gestures and body poses. We analyze the space of possible perceptual interface abstractions for full-body navigation and present a prototype system based on these results. Finally, we describe an initial evaluation of our prototype with users guiding avatars through a series of 3-D virtual game worlds.
Lecture Notes in Computer Science, 2004
In this paper we propose an efficient real-time approach that combines vision-based tracking and a view-based model to estimate the pose of a person. We introduce an appearance model that contains views of a person under various articulated poses; it is built and updated online. The main contribution is modeling, in each frame, the pose change as a linear transformation of the view change. This linear model allows us (i) to predict the pose in a new image, and (ii) to obtain a better estimate of the pose corresponding to a key frame. Articulated pose is computed by merging the estimate provided by the tracking-based algorithm with the linear prediction given by the view-based model.
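The "pose change as a linear transformation of the view change" idea reduces, in the scalar case, to fitting a gain by least squares and using it for prediction. The paper's model is matrix-valued; this 1-D version is only an illustration:

```python
def fit_linear_map(view_deltas, pose_deltas):
    """Least-squares fit of a scalar gain a such that
    pose_delta ~= a * view_delta (1-D stand-in for the linear model)."""
    num = sum(v * p for v, p in zip(view_deltas, pose_deltas))
    den = sum(v * v for v in view_deltas)
    return num / den

def predict_pose(pose_key, view_delta, gain):
    """Predict the pose in a new image from the key-frame pose
    plus the linear term."""
    return pose_key + gain * view_delta
```

In the full system, the same fitted map is also run in reverse to refine the pose estimate attached to a key frame.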
A problem faced by groups that are not co-located but need to collaborate on a common task is reduced access to the rich multimodal communicative context they would have face-to-face. Collaboration support tools aim to reduce the adverse effects of this restricted access to the fluid intermixing of speech, gesturing, writing and sketching by providing mechanisms that enhance distributed participants' awareness of each other's actions. In this work we explore novel ways to leverage multimodal context-aware systems to bridge co-located and distributed collaboration contexts. We describe a system that lets participants at remote sites collaborate on building a project schedule by sketching on multiple distributed whiteboards, and show how participants can be made aware of naturally occurring pointing gestures that reference diagram constituents as they are performed by remote participants. The system explores the multimodal fusion of pen, speech and 3D gestures, coupled with the dynamic construction of a semantic representation of the interaction, anchored on the sketched diagram, to provide feedback that overcomes some of the intrinsic ambiguities of pointing gestures.
A novel approach for tracking 3D articulated human bodies in stereo images is presented. We present a projection-based method for enforcing articulated constraints. We define the articulated motion space as the space to which the motions of a body's limbs belong. We show that, around the origin, the articulated motion space can be approximated by a linear space estimated directly from the previous body pose. Articulated constraints are enforced by optimally projecting unconstrained motions onto the linearized articulated motion space. Our paper also addresses the problem of accounting for other constraints on body pose and dynamics (e.g. joint-angle bounds, maximum speed), and presents an approach that guarantees these constraints while tracking people.
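The core operation, projecting an unconstrained motion onto a linear subspace, can be shown in miniature with a one-dimensional subspace (a single basis vector); the real articulated motion space would have one basis vector per degree of freedom:

```python
def project_onto_line(motion, basis):
    """Orthogonally project a motion vector onto the span of one basis
    vector: a toy version of projecting onto the linearized articulated
    motion space."""
    dot = sum(m * b for m, b in zip(motion, basis))
    norm2 = sum(b * b for b in basis)
    scale = dot / norm2
    return [scale * b for b in basis]
```

The projected motion is the closest point in the (linearized) feasible set to the unconstrained estimate, which is what makes the enforcement optimal in the least-squares sense.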
We describe a state-space tracking approach based on a Conditional Random Field (CRF) model, where the observation potentials are learned from data. We find functions that embed both state and observation into a space where similarity corresponds to L1 distance, and define an observation potential based on distance in this space. This potential is extremely fast to compute and, in conjunction with a grid-filtering framework, can be used to reduce a continuous state estimation problem to a discrete one. We show how a state temporal prior in the grid filter can be computed in a manner similar to a sparse HMM, resulting in real-time system performance. The resulting system is used for human pose tracking in video sequences.
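An L1-distance-based observation potential can be sketched as an exponential decay in the embedded space; the `beta` sharpness parameter here is a hypothetical addition, and the learned embedding functions themselves are assumed given:

```python
import math

def l1_distance(a, b):
    """L1 (Manhattan) distance between two embedded vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def observation_potential(state_embed, obs_embed, beta=1.0):
    """Potential that decays exponentially with the L1 distance between the
    embedded state and the embedded observation."""
    return math.exp(-beta * l1_distance(state_embed, obs_embed))
```

Evaluating this at every cell of a state grid turns the continuous estimation problem into weighting a discrete set of hypotheses, which is what the grid filter exploits.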
Patch-Based Pose Inference with a Mixture of Density Estimators
Springer eBooks, Nov 3, 2007
This paper presents a patch-based approach for pose estimation from single images using a kernelized density voting scheme. We introduce a boosting-like algorithm that models the density using a mixture of weighted 'weak' estimators. The 'weak' density estimators and corresponding weights are learned iteratively from a training set, providing an efficient method for feature selection. Given a query image, voting …
Recognition of temporal events using multiscale bags of features
This paper presents a novel method for learning classes of temporal sequences using a bag-of-features approach. We define a temporal sequence as a bag of temporal features and show how this representation can be used for the recognition and segmentation of temporal events. A codebook of temporal descriptors, representing the local temporal texture, is automatically constructed from a set of …
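The bag-of-features step, once a codebook exists, reduces to nearest-codeword assignment followed by a histogram. A minimal sketch, assuming the codebook has already been built elsewhere (e.g. by k-means over training descriptors):

```python
def assign_codeword(descriptor, codebook):
    """Index of the nearest codeword (squared Euclidean distance)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(descriptor, codebook[i]))

def bag_of_features(descriptors, codebook):
    """Histogram of codeword counts: the fixed-length representation of a
    temporal sequence used for recognition."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[assign_codeword(d, codebook)] += 1
    return hist
```

Discarding the ordering of descriptors is what makes the representation robust to temporal misalignment, at the cost of some sequential information.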
By providing different input channels, multimodal interfaces allow more natural and efficient interaction between user and machine. Recent years have seen the emergence of many systems using speech and gestures, where users interact with an application by talking to it, pointing (or looking) at icons, and/or performing gestures. Research in multimodal interfaces and ubiquitous computing aims at building the tools to implement these abilities in as natural and unobtrusive a manner as possible.
Gesture + play
Navigating virtual environments usually requires a wired interface, game console, or keyboard. The advent of perceptual interface techniques allows a new option: the passive and untethered sensing of users' pose and gesture, allowing them to maneuver through virtual worlds. We show new algorithms for passive, real-time articulated tracking with standard cameras and personal computers. Several different interaction styles are compared …
A new method for 3D rigid motion estimation from stereo is proposed in this paper. Its appealing feature is that it directly uses the disparity images obtained from stereo matching. We assume that the stereo rig has parallel cameras and show, in that case, the geometric and topological properties of the disparity images. We then introduce a rigid transformation (called d-motion) that maps two disparity images of a rigidly moving object, show how it is related to the Euclidean rigid motion, and derive a motion estimation algorithm. Experiments show that our approach is simple and more accurate than standard approaches.
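For a rectified rig with parallel cameras, the disparity image the method operates on obeys the standard relation d = f·B / Z (focal length in pixels, baseline, depth). The sketch below shows only this underlying relation, not the d-motion transformation itself:

```python
def disparity(focal_px, baseline_m, depth_m):
    """Disparity (pixels) for a rectified stereo rig: d = f * B / Z."""
    return focal_px * baseline_m / depth_m

def depth_from_disparity(focal_px, baseline_m, d_px):
    """Inverse relation: Z = f * B / d."""
    return focal_px * baseline_m / d_px
```

Because disparity is inversely proportional to depth, a rigid Euclidean motion of a scene induces a well-defined (non-Euclidean) transformation of its disparity image, which is the structure d-motion captures.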
First of all, I would like to thank Radu Horaud for letting me take my "first steps" in the field of vision, now quite a few years ago, and then for his trust throughout my time in the MOVI team. I thank, just as enthusiastically, Patrick Gros for his encouragement, his availability, and his pertinent, constructive feedback over these years, as well as for his friendship and the constant interest he took in my work. I warmly thank those who did me the honor of serving on my jury: my reviewers Patrick Gros and Thierry Viéville for their constructive comments on the manuscript, Marie-Paule Cani for chairing it, and Andrew Zisserman and Michael Lindenbaum for serving as examiners. Roger Mohr was also a good leader and colleague, and I thank him for the trust he placed in me, scientifically on one hand and technically and administratively on the other, when he allowed me to take on real responsibilities in his team. Thanks also to the staff of INRIA Rhône-Alpes for ensuring our exceptional working conditions, especially our assistant Véronique Roux, for her efficiency and unfailing good humor. It is hard to name everyone I worked alongside who helped me. I thank the whole MOVI team for the warm atmosphere that reigns within it, and in particular for forgiving my mood swings when my administrative duties weighed heavily, and for the constant, enriching scientific exchange that characterizes it. Special thanks to my friend and across-the-wing officemate (that is, occupant of the office just opposite mine) Bart Lamiroy for our many fits of laughter; he will not fail to notice the wink I am giving him in these acknowledgments. I hope the friendship built over these years will endure now that our paths have parted. Thanks also to Frédérick Martin for the fruitful collaboration and the scientific and epistemological exchanges that contributed greatly to this work, and to Yves Dufournaud who, with his technical questions and remarks, managed to surprise me and pushed me ever further in scientific reflection. To my mother.
Session details: Multimodal devices and sensors (Oral)
Navigating in virtual environments using a vision-based interface
... Science and Artificial Intelligence, MIT, 200 Technology Square, Cambridge, MA 02139, USA, konrad@csail.mit.edu, demirdji@csail.mit.edu, trevor@csail.mit ... First we performed a Wizard of Oz (WOz) study with the major aim of gathering qualitative data, such as: what gestures are most ...