Jan Neumann | University of Maryland, College Park
Papers by Jan Neumann
DAGM Symposium Symposium for Pattern Recognition, 2000
A comparison is made of global and local methods for the shape analysis of logos in an image database. The qualities of the methods are judged by using the shape signatures to define a similarity metric on the logos. As representatives for the two classes of methods, we use the negative shape method, which is based on local shape information, and a wavelet-based method, which makes use of global information.
Deformable Avatars, 2000
We demonstrate a method to compute three-dimensional (3D) motion fields on a face. Twelve synchronized and calibrated cameras are positioned around a talking person and observe the head in motion. We represent the head as a deformable mesh, which is ...
Lecture Notes in Computer Science, 2001
Pattern Recognition Letters, 2002
A comparison is made of global and local methods for the shape analysis of logos in an image database. The qualities of the methods are judged by using the shape signatures to define a similarity metric on the logos. As representatives for the two classes of methods, we use the negative shape method, which is based on local shape information, and a wavelet-based method, which makes use of global information. We apply both methods to images with different kinds of degradations and examine how a given degradation highlights the strengths and shortcomings of each method. Finally, we use these results to combine information from both methods and develop a new method based on the relative performances of the two.
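The combination step described above can be sketched as a performance-weighted blend of the two similarity scores. This is a minimal illustration, not the paper's actual method: the function names and the choice of a simple weighted average are assumptions.

```python
# Hedged sketch: blending a local (negative-shape) and a global
# (wavelet) similarity score using weights derived from each method's
# measured reliability under the current degradation. Names and the
# weighting scheme are illustrative, not the paper's exact formulation.

def combined_similarity(sim_local, sim_global, w_local, w_global):
    """Weighted average of two similarity scores in [0, 1]."""
    total = w_local + w_global
    return (w_local * sim_local + w_global * sim_global) / total

# If the local method is more reliable on this degradation
# (weight 0.7 vs 0.3), its score dominates the combined result.
score = combined_similarity(0.9, 0.4, 0.7, 0.3)  # -> 0.75
```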
Lecture Notes in Computer Science, 2002
We examine the influence of camera design on the estimation of the motion and structure of a scene from video data. Every camera captures a subset of the light rays passing through some volume in space. By relating the differential structure of the time-varying space of light rays to different known and new camera designs, we can establish a hierarchy of cameras based upon the stability and complexity of the computations necessary to estimate structure and motion. At the low end of this hierarchy is the standard planar pinhole camera, for which the structure-from-motion problem is non-linear and ill-posed. At the high end is a camera, which we call the full-field-of-view polydioptric camera, for which the problem is linear and stable. We develop design suggestions for the polydioptric camera and, based upon this new design, propose a linear algorithm for structure-from-motion estimation that combines differential motion estimation with differential stereo.
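The key claim above is that the polydioptric design turns ego-motion estimation into a linear problem: schematically, each sampled light ray contributes one linear constraint on the six-vector of rigid motion, and the motion falls out of a least-squares solve with no scene-depth unknowns. The sketch below uses random stand-in data for the constraint matrix (real constraints would come from plenoptic derivatives), so it illustrates only the linear-algebra shape of the problem.

```python
import numpy as np

# Hedged sketch: linear ego-motion estimation. Each ray contributes a
# constraint a_i^T q = b_i on q = (v, omega), the translational and
# rotational velocity. A and b are synthetic stand-ins here.
rng = np.random.default_rng(0)
q_true = np.array([0.1, -0.2, 0.05, 0.01, 0.02, -0.03])  # (v, omega)

A = rng.standard_normal((500, 6))   # one row per light-ray constraint
b = A @ q_true                      # noise-free measurements

# Ordinary least squares recovers the motion; no depth estimation needed.
q_est, *_ = np.linalg.lstsq(A, b, rcond=None)
```

With noisy measurements the same solve gives the maximum-likelihood estimate under Gaussian noise, which is one reason a linear formulation is attractive for stability.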
Deformable Avatars, 2001
We demonstrate a method to compute three-dimensional (3D) motion fields on a face. Twelve synchronized and calibrated cameras are positioned around a talking person and observe the head in motion. We represent the head as a deformable mesh, which is ...
Computer Vision and Pattern Recognition, 2007
Combinations of microphones and cameras allow the joint audio-visual sensing of a scene. Such arrangements of sensors are common in biological organisms and in applications such as meeting recording and surveillance, where both modalities are necessary to provide scene understanding. Microphone arrays provide geometrical information on the source location, and allow the sound sources in the scene ...
2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), 2004
We describe a compound-eye vision sensor for 3D ego-motion computation. Inspired by the eyes of insects, we show that the compound-eye sampling geometry is optimal for 3D camera motion estimation. This optimality allows us to estimate the 3D camera motion in a scene-independent and robust manner using linear equations. The mathematical model of the new sensor can be implemented in analog networks, resulting in a compact computational sensor for instantaneous 3D ego-motion measurements in full six degrees of freedom.
2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011
We present an approach for automatic annotation of commercial videos from an arts-and-crafts domain with the aid of textual descriptions. The main focus is on recognizing both manipulation actions (e.g. cut, draw, glue) and the tools that are used to perform these actions (e.g. markers, brushes, glue bottle). We demonstrate how multiple visual cues such as motion descriptors, object presence, and hand poses can be combined with the help of contextual priors that are automatically extracted from associated transcripts or online instructions. Using these diverse features and linguistic information, we propose several increasingly complex computational models for recognizing elementary manipulation actions and composite activities, as well as their temporal order. The approach is evaluated on a novel dataset comprised of 27 episodes of PBS Sprout TV, each containing on average 8 manipulation actions.
2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007
The capacity to robustly detect humans in video is a critical component of automated visual surveillance systems. This paper describes a bilattice-based logical reasoning approach that exploits contextual information and knowledge about interactions between humans, and augments it with the output of different low-level detectors for human detection. Detections from low-level parts-based detectors are treated as logical facts and used to reason explicitly about the presence or absence of humans in the scene. Positive and negative information from different sources, as well as uncertainties from detections and logical rules, are integrated within the bilattice framework. The approach also generates proofs or justifications for each hypothesis it proposes. These justifications (or lack thereof) are further employed by the system to explain and validate, or reject, potential hypotheses. This allows the system to explicitly reason about complex interactions between humans and to handle occlusions. The proofs are also available to the end user as an explanation of why the system considers a particular hypothesis to be a human. We employ a boosted cascade of gradient-histogram-based detectors to detect individual body parts, and have applied the framework to analyze the presence of humans in static images from different datasets.
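The distinguishing feature of the bilattice framework above is that positive and negative evidence are tracked separately rather than collapsed into one probability. A minimal sketch of that idea: each hypothesis carries a pair <evidence-for, evidence-against> in [0, 1]^2, and independent sources are combined componentwise along the knowledge ordering. The specific combination operator below (probabilistic sum) is a common choice in such frameworks, not necessarily the paper's exact one.

```python
# Hedged sketch of bilattice-style evidence combination. Truth values
# are pairs (e_for, e_against); combining two sources accumulates each
# component independently, so a hypothesis can end up both supported
# and contradicted, and the system can surface both as justification.

def knowledge_join(a, b):
    """Accumulate evidence from two independent sources."""
    (fa, ga), (fb, gb) = a, b
    p_sum = lambda x, y: x + y - x * y   # probabilistic sum on [0, 1]
    return (p_sum(fa, fb), p_sum(ga, gb))

# A head detector supports "human present" (0.8 for, 0.0 against);
# a scene-geometry rule argues against it (0.0 for, 0.6 against).
human = knowledge_join((0.8, 0.0), (0.0, 0.6))
# Both components survive, unlike in a single scalar confidence.
```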
Advances in Video, Audio, and Imagery Analysis for Search, Data Mining, Surveillance, and Authoring, 2012
This chapter contains sections titled: Introduction and Motivation; Related Research; Semantic Multimedia Extraction Using Audio or Closed-Captions; Semantic Multimedia Extraction Using Video; Conclusion and Future Work.
The Visual Computer International Journal of Computer Graphics, 2003
More and more processing of visual information is nowadays done by computers, but the images captured by conventional cameras are still based on the pinhole principle inspired by our own eyes. This principle, though, is not necessarily the optimal image-formation principle for the automated processing of visual information. Each camera samples the space of light rays according to some pattern. If we understand the structure of the space formed by the light rays passing through a volume of space, we can determine the camera, or in other words the sampling pattern of light rays, that is optimal with regard to a given task. In this work we analyze the differential structure of the space of time-varying light rays described by the plenoptic function, and use this analysis to relate the rigid motion of an imaging device to the derivatives of the plenoptic function. The results can be used to define a hierarchy of camera models with respect to the structure-from-motion problem and to formulate a linear, scene-independent estimation problem for the rigid motion of the sensor purely in terms of the captured images.
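The relation between rigid sensor motion and plenoptic derivatives can be written schematically as a brightness-constancy equation in ray space. The form below is an illustrative reconstruction (plenoptic function L at position x, ray direction d, time t; sensor translation v and rotation omega), not the paper's exact notation:

```latex
% Schematic plenoptic motion constraint: a ray (x, d) carried along by
% the rigid sensor motion keeps its radiance, giving a constraint that
% is linear in the motion parameters (v, \omega).
\[
\frac{\partial L}{\partial t}
  + \left( v + \omega \times x \right) \cdot \nabla_{x} L
  + \left( \omega \times d \right) \cdot \nabla_{d} L = 0
\]
```

Because v and omega enter only linearly, stacking this constraint over many sampled rays yields the scene-independent linear estimation problem the abstract refers to.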
International Journal of Computer Vision, 2011
Predicate-logic-based reasoning approaches provide a means of formally specifying domain knowledge and manipulating symbolic information to explicitly reason about different concepts of interest. Extending traditional binary predicate logics with the bilattice formalism permits the handling of uncertainty in reasoning, thereby facilitating their application to computer vision problems. In this paper, we propose using first-order predicate logics, extended with a bilattice-based uncertainty-handling formalism, as a means of formally encoding pattern grammars, to parse a set of image features and detect the presence of different patterns of interest. Detections from low-level feature detectors are treated as logical facts and, in conjunction with logical rules, used to drive the reasoning. Positive and negative information from different sources, as well as uncertainties from detections, are integrated within the bilattice framework. We show that this approach can also generate proofs or justifications, in the form of parse trees, for each hypothesis it proposes, thus permitting direct analysis of the final solution in linguistic form. Automated logical rule weight learning is an important aspect of the application of such ...
Computer Vision and Image Understanding, 2004
The view-independent visualization of 3D scenes is most often based on rendering accurate three-dimensional models, or utilizes image-based rendering techniques. To compute the 3D structure of a scene from a moving vision sensor, or to use image-based rendering approaches, we need to be able to estimate the motion of the sensor from the recorded image information with high accuracy, a problem that has been well studied. In this work, we investigate the relationship between camera design and our ability to perform accurate 3D photography by examining the influence of camera design on the estimation of the motion and structure of a scene from video data. By relating the differential structure of the time-varying plenoptic function to different known and new camera designs, we can establish a hierarchy of cameras based upon the stability and complexity of the computations necessary to estimate structure and motion. At the low end of this hierarchy is the standard planar pinhole camera, for which the structure-from-motion problem is non-linear and ill-posed. At the high end is a camera, which we call the full-field-of-view polydioptric camera, for which the motion estimation problem can be solved independently of the depth of the scene, leading to fast and robust algorithms for 3D photography. In between are multiple-view cameras with a large field of view, which we have built, as well as omni-directional sensors.
Natural eye designs are optimized with regard to the tasks the eye-carrying organism has to perform for survival. This optimization has been performed by the process of natural evolution over many millions of years. Every eye captures a subset of the space of light rays. The information contained in this subset, and the accuracy to which the eye can extract the necessary information, determines an upper limit on how well an organism can perform a given task. In this work we propose a new methodology for camera design. By interpreting eyes as sampling patterns in light-ray space, we can phrase the problem of eye design in a signal-processing framework. This allows us to develop mathematical criteria for optimal eye design, which in turn enables us to build the best eye for a given task without the trial-and-error phase of natural evolution. The principle is evaluated on the task of 3D ego-motion estimation.