Jeroen Lichtenauer - Academia.edu
Papers by Jeroen Lichtenauer
Since the introduction of particle filtering for object tracking, a lot of improvements have been suggested. However, the definition of the observation likelihood function, needed for determining the particle weights, has received little attention. Because particle weights determine how the particles are re-sampled, the likelihood function has a strong influence on the tracking performance. We show experimental results for three different tracking tasks for different parameter values of the assumed observation model. The results show a large influence of the model parameters on the tracking performance. Optimizing the likelihood function can give significant tracking improvement. Different optimal parameter settings are observed for the three different tracking tasks. Consequently, when performing multiple tasks a tradeoff must be made for the parameter setting. In practical situations where robust tracking must be achieved with a limited amount of particles, the true observation probability is not always the optimal likelihood function.
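To make the role of the observation likelihood concrete, the following is a minimal sketch of a 1-D bootstrap particle filter. It is not the authors' tracker: the random-walk motion model, the Gaussian likelihood, and the names `particle_filter_step`, `track`, `sigma_obs` and `motion_std` are all assumptions chosen for illustration. The parameter `sigma_obs` plays the part of the tunable observation-model parameter discussed in the abstract.

```python
import numpy as np

def particle_filter_step(particles, observation, sigma_obs, motion_std, rng):
    """One predict/weight/resample step of a 1-D bootstrap particle filter."""
    # Predict: diffuse particles with a random-walk motion model (assumed).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Weight: Gaussian observation likelihood (assumed); sigma_obs is the
    # tunable parameter of the observation model.
    log_w = -0.5 * ((observation - particles) / sigma_obs) ** 2
    weights = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    weights /= weights.sum()
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    # The returned weights belong to the particles before resampling.
    return particles[idx], weights

def track(observations, sigma_obs, n_particles=50, motion_std=1.0, seed=0):
    """Track a scalar state; returns the particle mean after each step."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(observations[0], 5.0, size=n_particles)
    estimates = []
    for z in observations:
        particles, _ = particle_filter_step(particles, z, sigma_obs, motion_std, rng)
        estimates.append(particles.mean())
    return np.array(estimates)
```

A smaller `sigma_obs` concentrates the weight on fewer particles, so fewer distinct particles survive resampling; a larger value spreads the weight more evenly. With a small particle budget, this is one way the choice of likelihood parameter can affect robustness independently of the true noise level.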
We have developed a prototype for a learning environment for deaf and hard of hearing children. This demonstration consists of hands-on experience with the prototype. In total, there are three exercises: 1) an introduction of all pictures and corresponding signs, 2) multiple choice sign-to-picture and 3) performing the sign that corresponds to the picture shown on the screen. The live recognition from a wide-angle stereo camera provides immediate feedback for the third exercise where the sign must be performed. Figure 1. Interaction with the learning application by touch screen.
A 3D visual hand gesture recognition method is proposed that detects correctly performed signs from stereo camera input. Hand tracking is based on skin detection with an adaptive chrominance model to get high accuracy. Informative high level motion properties are extracted to simplify the classification task. Each example is mapped onto a fixed reference sign by Dynamic Time Warping, to get precise time correspondences. The classification is done by combining weak classifiers based on robust statistics. Each base classifier assumes a uniform distribution of a single feature, determined by winsorization on the noisy training set. The operating point of the classifier is determined by stretching the uniform distributions of the base classifiers instead of changing the threshold on the total posterior likelihood. In a cross validation with 120 signs performed by 70 different persons, 95% of the test signs were correctly detected at a false positive rate of 5%.
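The mapping of an example onto a fixed reference by Dynamic Time Warping can be sketched as follows. This is a textbook DTW on 1-D sequences with an absolute-difference cost, not the paper's implementation; the function names `dtw_align` and `warp_to_reference`, and the choice to average example samples that map to the same reference frame, are assumptions made for illustration.

```python
import numpy as np

def dtw_align(example, reference):
    """Return (total_cost, path) aligning two 1-D sequences with DTW."""
    n, m = len(example), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(example[i - 1] - reference[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n, m], path

def warp_to_reference(example, reference):
    """Resample `example` onto the time axis of `reference`."""
    _, path = dtw_align(example, reference)
    warped = np.zeros(len(reference))
    counts = np.zeros(len(reference))
    for i, j in path:
        warped[j] += example[i]  # average all example samples mapped to frame j
        counts[j] += 1
    return warped / counts
```

After warping, every example has the same length as the reference, so a feature measured at a given reference frame can be compared across examples, which is the kind of time correspondence the abstract refers to.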
We present a method to automatically construct a sign language classifier for a previously unseen sign. The only required input of a new sign is one example, performed by a sign language tutor. The method works by comparing the measurements of the new sign to signs that have been trained on a large number of persons. The parameters of the respective trained classifier models are used to construct a classification model for the new sign. We show that the performance of a classifier constructed from an instructed sign is significantly better than that of Dynamic Time Warping (DTW) with the same sign. Using only a single example, the proposed method has a performance comparable to a regular training with five examples, while being more stable because of the larger source of information.
Usually, object detection is performed directly on (normalized) gray values or gray primitives like gradients or Haar-like features. In that case the learning of relationships between gray primitives, that describe the structure of the object, is the complete responsibility of the classifier. We propose to apply more knowledge about the image structure in the preprocessing step, by computing local isophote directions and curvatures, in order to supply the classifier with much more informative image structure features. However, a periodic feature space, like orientation, is unsuited for common classification methods. Therefore, we split orientation into two more suitable components. Experiments show that the isophote features result in better detection performance than intensities, gradients or Haar-like features.
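The abstract does not say which two components the orientation is split into. One common choice, assumed here purely for illustration, is the cosine and sine of the angle. The sketch below (with the hypothetical helper `split_orientation`) shows why this helps: two angles on either side of the wrap-around point are far apart as raw numbers but close in the two-component representation.

```python
import numpy as np

def split_orientation(theta):
    """Encode an angle (radians, period 2*pi) as a (cos, sin) pair.

    This encoding is an assumption; the abstract only states that
    orientation is split into two components.
    """
    theta = np.asarray(theta, dtype=float)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

# Two nearly identical orientations on either side of the wrap-around.
a_raw, b_raw = 0.01, 2 * np.pi - 0.01
raw_distance = abs(a_raw - b_raw)  # close to 2*pi
split_distance = np.linalg.norm(split_orientation(a_raw) - split_orientation(b_raw))
```

Here `raw_distance` is close to 2π while `split_distance` is about 0.02, so a classifier that relies on distances or thresholds in feature space sees the two orientations as neighbours.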
Many real-time image processing applications are confronted with performance limitations when implemented in software. The skin segmentation algorithm utilized in hand gesture recognition as developed by the ICT department of Delft University of Technology presents an ...
Sigir Forum, 2011
In this paper we introduce a multi-modal database for the analysis of human interaction, in particular mimicry, and elaborate on the theoretical hypotheses of the relationship between the occurrence of mimicry and human affect. The recorded experiments are designed to explore this relationship. The corpus is recorded with 18 synchronised audio and video sensors, and is annotated for many different phenomena, including dialogue acts, turn-taking, affect, head gestures, hand gestures, body movement and facial expression. ...
IEEE Transactions on Affective Computing, 2012
MAHNOB-HCI is a multimodal database recorded in response to affective stimuli with the goal of emotion recognition and implicit tagging research. A multimodal setup was arranged for synchronized recording of face videos, audio signals, eye gaze data, and peripheral/central nervous system physiological signals. Twenty-seven participants from both genders and different cultural backgrounds participated in two experiments. In the first experiment, they watched 20 emotional videos and self-reported their felt emotions using arousal, valence, dominance, and predictability as well as emotional keywords. In the second experiment, short videos and images were shown once without any tag and then with correct or incorrect tags. Agreement or disagreement with the displayed tags was assessed by the participants. The recorded videos and bodily responses were segmented and stored in a database. The database is made available to the academic community via a web-based system. The collected data were analyzed and single modality and modality fusion results for both emotion recognition and implicit tagging experiments are reported. These results show the potential uses of the recorded modalities and the significance of the emotion elicitation protocol.
IEEE Transactions on Image Processing, 2011
Conventional marker-based optical motion capture methods rely on scene attenuation (e.g. by infrared-pass filtering). This renders the images useless for development and testing of machine vision methods under natural conditions. Unfortunately, combining, calibrating and synchronising a system for motion capture with a separate camera is a costly and cumbersome task. To overcome this problem, we present a framework for efficient, omnidirectional head-pose initialisation and tracking in the presence of missing and false positive marker detections. As such, it enables easy, accurate and synchronous head-motion capture, either as ground truth for or as input to other machine vision algorithms.
Image and Vision Computing, 2011
Applications such as surveillance and human behaviour analysis require high-bandwidth recording from multiple cameras, as well as from other sensors. In turn, sensor fusion has increased the required accuracy of synchronisation between sensors. Using commercial off-the-shelf components may compromise quality and accuracy due to several challenges, such as dealing with the combined data rate from multiple sensors; unknown offset and rate discrepancies between independent hardware clocks; the absence of trigger inputs or outputs in the hardware; as well as the different methods for time-stamping the recorded data. To achieve accurate synchronisation, we centralise the synchronisation task by recording all trigger or timestamp signals with a multi-channel audio interface. For sensors that don't have an external trigger signal, we let the computer that captures the sensor data periodically generate timestamp signals from its serial port output. These signals can also be used as a common time base to synchronise multiple asynchronous audio interfaces. Furthermore, we show that a consumer PC can currently capture 8-bit video data with 1024×1024 spatial and 59.1 Hz temporal resolution, from at least 14 cameras, together with 8 channels of 24-bit audio at 96 kHz. We thus improve the quality/cost ratio of multi-sensor data capture systems. Please cite this article as: J. Lichtenauer, et al., Cost-effective solution to synchronised audio-visual data capture using multiple sensors, Image Vis.
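The offset and rate discrepancy between two independent clocks, mentioned in the abstract above, can be modelled as a straight line relating one clock's timestamps to the other's. The sketch below is an illustration under that assumption, not the paper's procedure: it takes pairs of times for the same timestamp events (one time from the sensor computer's clock, one from the position of the recorded signal in the audio stream) and fits the line by least squares. The function names and the synthetic numbers are hypothetical.

```python
import numpy as np

def fit_clock_mapping(sensor_times, audio_times):
    """Least-squares fit of audio_time = rate * sensor_time + offset."""
    rate, offset = np.polyfit(sensor_times, audio_times, deg=1)
    return rate, offset

def to_audio_time(sensor_time, rate, offset):
    """Convert a sensor-clock time to the common audio time base."""
    return rate * sensor_time + offset

# Synthetic example: the sensor clock runs 100 ppm slow relative to the
# audio clock and started 0.25 s later (made-up values for illustration).
sensor_times = np.arange(0.0, 100.0, 1.0)
audio_times = 1.0001 * sensor_times + 0.25
rate, offset = fit_clock_mapping(sensor_times, audio_times)
```

Once `rate` and `offset` are known, any sample stamped by the sensor computer can be placed on the audio time base with `to_audio_time`, which is what allows modalities captured by separate hardware to be compared on one timeline.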
Applications such as surveillance and human motion capture require high-bandwidth recording from multiple cameras. Furthermore, the recent increase in research on sensor fusion has raised the demand on synchronization accuracy between video, audio and other sensor modalities. Previously, capturing synchronized, high resolution video from multiple cameras required complex, inflexible and expensive solutions. Our experiments show that a single PC, built from contemporary low-cost computer hardware, could currently handle up to 470 MB/s of input data. This allows capturing from 18 cameras of 780×580 pixels at 60 fps each, or 36 cameras at 30 fps. Furthermore, we achieve accurate synchronization between audio, video and additional sensors, by recording audio together with sensor trigger or timestamp signals, using a multi-channel audio input. In this way, each sensor modality can be captured with separate software and hardware, allowing maximal flexibility with minimal cost.
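The camera counts quoted above can be checked against the 470 MB/s figure with simple arithmetic. The sketch below assumes uncompressed video at 1 byte per pixel and interprets MB as 2^20 bytes; neither assumption is stated in the abstract, and `video_data_rate` is a hypothetical helper.

```python
def video_data_rate(cameras, width, height, fps, bytes_per_pixel=1):
    """Raw video data rate in bytes per second (uncompressed, assumed)."""
    return cameras * width * height * fps * bytes_per_pixel

MIB = 1024 ** 2  # bytes per mebibyte

rate_60fps = video_data_rate(18, 780, 580, 60)  # 18 cameras at 60 fps
rate_30fps = video_data_rate(36, 780, 580, 30)  # 36 cameras at 30 fps

print(f"{rate_60fps / MIB:.1f} MiB/s")
print(f"{rate_30fps / MIB:.1f} MiB/s")
```

Both configurations give 488,592,000 bytes per second, which is about 466 MiB/s and therefore fits under 470 when MB is read as 2^20 bytes. Read as decimal megabytes the same rate is about 489 MB/s, so the interpretation of the unit matters for this check.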
One way of recovering watermarks in geometrically distorted images is by performing a geometrical search. In addition to the computational cost required for this method, this paper considers the more important problem of false positives. The maximal number of detections that can be performed in a geometrical search is bounded by the maximum false positive detection probability required by the watermark application. We show that image and key dependency in the watermark detector leads to different false positive detection probabilities for geometrical searches for different images and keys. Furthermore, the image and key dependency of the tested watermark detector increases the random-image-random-key false positive detection probability, compared to the Bernoulli experiment that was used as a model.
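The bound on the number of detections in a search follows directly from the Bernoulli model mentioned in the abstract. The sketch below assumes that every detection attempt is an independent trial with the same single-detection false positive probability; the function names and the example probabilities are hypothetical. The abstract reports that the tested detector's actual false positive probability is higher than this model predicts, so the result should be read as an optimistic limit.

```python
import math

def search_false_positive_probability(p_single, n_detections):
    """P(at least one false positive) over n independent detections."""
    return 1.0 - (1.0 - p_single) ** n_detections

def max_detections(p_single, p_max):
    """Largest n for which the search false positive probability <= p_max."""
    return math.floor(math.log(1.0 - p_max) / math.log(1.0 - p_single))

# Made-up example values: one-in-a-million per detection, and an
# application that tolerates a one-in-a-thousand rate per search.
n = max_detections(p_single=1e-6, p_max=1e-3)
```

With these example values `n` is 1000: a search may try at most 1000 geometrical transformations before the overall false positive probability exceeds the application's limit.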