Faisal Qureshi - Academia.edu
Articles by Faisal Qureshi
For robot manipulators, an offline programming (OLP) system provides a detailed 3D simulation test-bed for visualization and analysis of various what-if scenarios. This paper discusses the prototype OLP system developed at the Informatics Complex for a six-degrees-of-freedom cylindrical serial manipulator. Paradigms from the classical robotics literature have been used to develop kinematic models and to plan manipulator trajectories. 3D graphics techniques have been used in an object-oriented programming environment; Microsoft Windows was chosen as the platform for the OLP system. The simulator provides a complete and flexible view of the manipulator, with menu-driven task simulation capability and tools for visual and mathematical analysis of the manipulator's performance. Mathematical models for kinematics and trajectory planning have been effectively integrated with the developed 3D simulation software.
We propose a new method for remote sensing image matching. The proposed method uses the encoder subnetwork of an autoencoder pre-trained on GTCrossView data to construct image features. A discriminator network trained on the University of California Merced Land Use/Land Cover dataset (LandUse) and the High-resolution Satellite Scene dataset (SatScene) computes a match score between a pair of computed image features. We also propose a new network unit, called residual-dyad, and empirically demonstrate that networks that use residual-dyad units outperform those that do not. We compare our approach with both traditional and more recent learning-based schemes on the LandUse and SatScene datasets, and the proposed method achieves state-of-the-art results in terms of mean average precision and ANMRR metrics. Specifically, our method achieves an overall improvement in performance of 11.26% and 22.41% for the LandUse and SatScene benchmark datasets, respectively.
This paper presents a new method for efficiently computing large-displacement optical flow. The method uses dominant motion patterns to identify a sparse set of sub-volumes within the cost volume and restricts subsequent Edge-Aware Filtering (EAF) to these sub-volumes. The method uses an extension of PatchMatch to filter these sub-volumes. The fact that our method only applies EAF to a small fraction of the entire cost volume boosts runtime performance. We also show that computational complexity is linear in the size of the images and does not depend upon the size of the label space. We evaluate the proposed technique on the MPI Sintel, Middlebury and KITTI benchmarks and show that our method achieves accuracy comparable to that of several recent state-of-the-art methods, while posting significantly faster runtimes.
Optimal scale selection for image segmentation is an essential component of Object-Based Image Analysis (OBIA) and interpretation. An optimal segmentation scale is a scale at which image objects, overall, best represent real-world ground objects and features across the entire image. At this scale, the intra-object variance is ideally lowest and the inter-object spatial autocorrelation is ideally highest, and a change in the scale could cause an abrupt change in these measures. Unsupervised parameter optimization methods typically use global measures of spatial and spectral properties calculated from all image objects in all bands as the target criteria to determine the optimal segmentation scale. However, no studies consider the effect of noise in image spectral bands on segmentation assessment and scale selection. Furthermore, these global measures could be affected by outliers or extreme values from a small number of objects. These issues may lead to incorrect assessment and selection of optimal scales and cause uncertainties in subsequent segmentation and classification results. They become more pronounced when segmenting hyperspectral data with large spectral variability across the spectrum. In this study, we propose an enhanced method that 1) incorporates inverse-noise band weighting in the segmentation and 2) detects and removes outliers before determining segmentation scale parameters. The proposed method is evaluated on three well-established segmentation approaches: k-means, mean-shift, and watershed. The generated segments are validated by comparing them with reference polygons using normalized over-segmentation (OS), under-segmentation (US), and Euclidean Distance (ED) indices. The results demonstrate that the proposed scale selection method produces more accurate and reliable segmentation results. The approach can be applied to other segmentation selection criteria and is useful for automatic multi-parameter tuning and optimal scale parameter selection in OBIA methods in remote sensing.
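As a rough illustration of the two pre-processing ideas described above, the sketch below estimates per-band noise, converts it to inverse-noise band weights, and drops outlier objects before combining the global variance and autocorrelation measures. The noise estimator, the IQR-based outlier rule, and the combined criterion are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def band_weights(image):
    """Inverse-noise weights per spectral band (illustrative noise estimate:
    standard deviation of first differences along the raster scan)."""
    pixels = image.reshape(-1, image.shape[-1])            # (pixels, bands)
    noise = np.array([np.std(np.diff(b)) for b in pixels.T]) + 1e-9
    w = 1.0 / noise
    return w / w.sum()

def drop_outlier_objects(values, k=1.5):
    """Remove objects whose per-object statistic lies outside k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    keep = (values >= lo) & (values <= hi)
    return values[keep]

def scale_criterion(obj_variance, obj_autocorr, weights):
    """Combine noise-weighted intra-object variance (lower is better) with
    inter-object spatial autocorrelation (higher is better); higher score wins."""
    v = drop_outlier_objects(obj_variance @ weights)       # obj_variance: (objects, bands)
    a = drop_outlier_objects(obj_autocorr)                 # obj_autocorr: (objects,)
    return a.mean() - v.mean()
```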
We present a new framework for capturing videos using sensor-rich mobile devices, such as smartphones and tablets. Many of today's mobile devices are equipped with a variety of sensors, including accelerometers, magnetometers and gyroscopes, which are rarely used during video capture for anything more than video stabilization. We demonstrate that these sensors, together with the information that can be extracted from the recorded video via computer vision techniques, provide a rich source of data that can be leveraged to automatically edit and "clean up" the captured video. Sensor data, for example, can be used to identify undesirable video segments that are then hidden from view. We showcase an Android video recording app that captures sensor data during video recording and is capable of automatically constructing final cuts from the recorded video. The app uses the captured sensor data plus computer vision algorithms, such as focus analysis and face detection, to filter out undesirable segments and keep visually appealing portions of the captured video to create a final cut. We also show how information from various sensors and computer vision routines can be combined to create different final cuts with little or no user input.
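A minimal sketch of the segment-filtering idea, assuming per-segment gyroscope magnitude, face counts, and a focus measure are already available; the thresholds and input names are illustrative, not those used by the app.

```python
import numpy as np

def segment_keep_mask(gyro_mag, face_counts, focus_vals,
                      shake_thresh=1.5, focus_thresh=100.0):
    """Decide which fixed-length video segments to keep from sensor/vision cues.

    gyro_mag    -- per-segment mean gyroscope magnitude (rad/s)
    face_counts -- per-segment average number of detected faces
    focus_vals  -- per-segment focus measure (e.g. variance of the Laplacian)
    Returns a boolean keep-mask; all thresholds are illustrative.
    """
    gyro_mag = np.asarray(gyro_mag)
    face_counts = np.asarray(face_counts)
    focus_vals = np.asarray(focus_vals)
    steady = gyro_mag < shake_thresh          # drop shaky segments
    sharp = focus_vals > focus_thresh         # drop out-of-focus segments
    return steady & sharp & (face_counts >= 0)  # face count kept as an optional cue

keep = segment_keep_mask([0.2, 2.3, 0.4], [1, 0, 2], [250.0, 80.0, 300.0])
# keep == [True, False, True]: the shaky, blurry middle segment is cut
```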
Virtual Vision advocates developing visually and behaviorally realistic 3D synthetic environments to serve the needs of computer vision research. Virtual vision is especially well-suited for studying large-scale camera networks. A virtual vision simulator capable of generating "realistic" synthetic imagery from real-life scenes, involving pedestrians and other objects, is the sine qua non of carrying out virtual vision research. Here we develop a distributed, customizable virtual vision simulator capable of simulating pedestrian traffic in a variety of 3D environments. Virtual cameras deployed in this synthetic environment generate imagery using state-of-the-art computer graphics techniques, boasting realistic lighting effects, shadows, etc. The synthetic imagery is fed into a visual analysis pipeline that currently supports pedestrian detection and tracking. The results of this analysis can then be used for subsequent processing, such as camera control, coordination, and handoff. It is important to bear in mind that our visual analysis pipeline is designed to handle real-world imagery without any modifications. Consequently, it closely mimics the performance of visual analysis routines that one might deploy on physical cameras. Our virtual vision simulator is realized as a collection of modules that communicate with each other over the network. Consequently, we can deploy our simulator over a network of computers, allowing us to simulate much larger camera networks and much more complex scenes than is otherwise possible.
In recent years, many deep learning techniques have been applied to the image inpainting problem: the task of filling incomplete regions of an image. However, these models struggle to recover and/or preserve image structure, especially when significant portions of the image are missing. We propose a two-stage model that separates the inpainting problem into structure prediction and image completion. Similar to sketch art, our model first predicts the image structure of the missing region in the form of edge maps. Predicted edge maps are passed to the second stage to guide the inpainting process. We evaluate our model end-to-end over the publicly available datasets CelebA, CelebHQ, Places2, and Paris StreetView on images up to a resolution of 512 × 512. We demonstrate that this approach outperforms current state-of-the-art techniques quantitatively and qualitatively.
We present a method for extracting three-dimensional flight trajectories of liquid droplets from video data. A high-speed stereo camera pair records videos of experimental reconstructions of projectile impacts and the ensuing droplet scattering. After background removal and segmentation of individual droplets in each video frame, we introduce a model-based matching technique to accumulate image paths for individual droplets. Our motion detection algorithm is designed to deal gracefully with the lack of feature points, with the similarity of droplets in shape, size, and color, and with incomplete droplet paths due to noise, occlusions, etc. The final reconstruction algorithm pairs two-dimensional paths accumulated from each of the two cameras' videos to reconstruct trajectories in three dimensions. The reconstructed droplet trajectories constitute a starting point for a physically accurate model of blood droplet flight for forensic bloodstain pattern analysis.
Figure 1: (a) BB pellet impacting ballistic gel containing transfer blood. (b) Tracking individual blood droplets in high-speed video (1300 frames per second). (c) Reconstructed blood droplet trajectories. Notice the effects of gravity and viscous drag forces even for short trajectories.
In this paper, we propose a novel approach that learns to sequentially attend to different Convolutional Neural Network (CNN) layers (i.e., "what" feature abstraction to attend to) and different spatial locations of the selected feature map (i.e., "where") to perform the task at hand. Specifically, at each Recurrent Neural Network step, both a CNN layer and a localized spatial region within it are selected for further processing. We demonstrate the effectiveness of this approach on two computer vision tasks: (i) image-based six-degrees-of-freedom camera pose regression and (ii) indoor scene classification. Empirically, we show that combining the "what" and "where" aspects of attention improves network performance on both tasks. We evaluate our method on standard benchmarks for camera localization (Cambridge, 7-Scenes, and TUM-LSI) and for scene classification (MIT-67 Indoor Scenes). For camera localization, our approach reduces the median error by 18.8% for position and 8.2% for orientation (averaged over all scenes), and for scene classification, it improves the mean accuracy by 3.4% over previous methods.
This paper explores the exciting possibility of using Google Earth as a software laboratory for studying wide-area scene analysis using near-ground aerial imagery. To this end we present a new image mosaicing algorithm capable of generating large mosaics from imagery captured by a near-ground aerial vehicle. Our algorithm eschews camera calibration and can handle strong parallax effects visible in the captured imagery. The imagery is generated by simulating an aerial vehicle flying over New York City within the Google Earth environment. We also evaluate the proposed approach on a real dataset captured by a physical aerial vehicle, demonstrating that the algorithm, which was initially developed using synthetic imagery, does indeed work on real data.
We present a new scheme for partitioning a geo-tagged reference image database in an effort to speed up (query) image localization while maintaining acceptable localization accuracy. Our method learns a topic model over the reference database, which in turn is used to divide the reference database into scene groups. Each scene group consists of "visually similar" images as determined by the topic model. Next, raw Scale-Invariant Feature Transform (SIFT) features are collected from every image in a scene group and a Fast Library for Approximate Nearest Neighbours (FLANN) index is constructed. Given a query image, first its scene group is determined using the topic model and then its SIFT features are matched against the corresponding FLANN index. The query image is localized using the location information from the visually similar images in the reference database. We evaluate our approach on the Google Maps Street View dataset and demonstrate that our method outperforms a competing technique.
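A sketch of the matching stage using OpenCV's SIFT and FLANN; the topic-model step is abstracted behind a hypothetical predict_scene_group helper, and the ratio-test threshold is an assumption rather than the paper's setting.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def build_group_index(group_images):
    """Collect SIFT descriptors from every image in one scene group
    and train a FLANN matcher over them."""
    matcher = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    for img in group_images:
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            matcher.add([desc.astype(np.float32)])
    matcher.train()
    return matcher

def localize(query_img, group_matchers, predict_scene_group):
    """predict_scene_group is a stand-in for the learned topic model."""
    gid = predict_scene_group(query_img)
    _, qdesc = sift.detectAndCompute(query_img, None)
    matches = group_matchers[gid].knnMatch(qdesc.astype(np.float32), k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]  # ratio test
    return gid, good
```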
We developed a new method for extracting 3D flight trajectories of droplets using high-speed stereo capture. We noticed that traditional multi-camera tracking techniques fare poorly on our problem, in part due to the fact that all droplets have very similar shapes, sizes and appearances. Our method uses local motion models to track individual droplets in each frame. 2D tracks are used to learn a global, non-linear motion model, which in turn can be used to estimate the 3D locations of individual droplets even when these are not visible in any camera. We have evaluated the proposed method on both synthetic and real data and our method is able to reconstruct 3D flight trajectories of hundreds of droplets. The proposed technique solves for both the 3D trajectory of a droplet and its motion model concomitantly, and we have found it to be superior to 3D reconstruction via triangulation. Furthermore, the learned global motion model allows us to relax the simultaneity assumptions of stereo camera systems. Our results suggest that, even when full stereo information is available, our unsynchronized reconstruction using the global motion model can significantly improve the 3D estimation accuracy.
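The abstract does not specify the form of the global motion model; one plausible choice, sketched below under that assumption, is a ballistic model with linear viscous drag fitted to a droplet's observed 3D samples with SciPy, which can then predict positions at unobserved (or unsynchronized) time instants.

```python
import numpy as np
from scipy.optimize import least_squares

G = np.array([0.0, 0.0, -9.81])   # gravity (m/s^2)

def positions(params, t):
    """Ballistic flight with linear viscous drag, solved analytically.
    params = [x0, y0, z0, vx, vy, vz, k], with drag rate k (1/s)."""
    p0, v0, k = params[:3], params[3:6], params[6]
    t = np.asarray(t, dtype=float)[:, None]
    decay = (1.0 - np.exp(-k * t)) / k
    return p0 + v0 * decay + G * (t - decay) / k

def fit_trajectory(t_obs, p_obs):
    """Fit one droplet's motion-model parameters to its observed 3D samples."""
    v_guess = (p_obs[-1] - p_obs[0]) / (t_obs[-1] - t_obs[0])
    guess = np.concatenate([p_obs[0], v_guess, [1.0]])
    bounds = ([-np.inf] * 6 + [1e-3], [np.inf] * 7)       # keep drag rate positive
    res = least_squares(lambda q: (positions(q, t_obs) - p_obs).ravel(),
                        guess, bounds=bounds)
    return res.x

# once fitted, positions(params, t) predicts the droplet location at any time,
# including instants where the droplet was occluded or unsynchronized across cameras
```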
We explore the use of neural networks to solve the Laplace equation in a two-dimensional geometry. Specifically, we study a PDE problem that models the electric potential inside a slit-well nanofluidic device. Such devices are typically used to separate polymer mixtures by molecular size. Processes like these are commonly studied using GPU-accelerated coarse-grained particle simulations, for which GPU memory is a bottleneck. We compare the memory required to represent the field using neural networks to that needed to store solutions obtained using the finite element method. We find that even simple fully-connected neural networks can achieve accuracy-to-memory-consumption ratios comparable to those of good finite element solutions. These preliminary results demonstrate an industrial application that would benefit greatly from compact neural network representation techniques.
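A minimal PINN-style sketch of the idea in PyTorch: a small fully-connected network is trained so that its Laplacian vanishes at interior collocation points while matching toy Dirichlet boundary data. The unit-square geometry and boundary conditions here are simplifications, not the slit-well setup studied in the paper.

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def laplacian(xy):
    """Sum of second derivatives of net(x, y), computed via autograd."""
    xy = xy.requires_grad_(True)
    u = net(xy)
    grad = torch.autograd.grad(u.sum(), xy, create_graph=True)[0]
    lap = 0.0
    for i in range(2):
        lap = lap + torch.autograd.grad(grad[:, i].sum(), xy, create_graph=True)[0][:, i]
    return lap

for step in range(2000):
    interior = torch.rand(256, 2)                       # collocation points in the unit square
    boundary = torch.rand(64, 2)
    boundary[:, 0] = torch.randint(0, 2, (64,)).float() # points on the x=0 and x=1 walls
    target = boundary[:, 0:1]                           # toy Dirichlet data: u=0 on x=0, u=1 on x=1
    loss = laplacian(interior).pow(2).mean() + (net(boundary) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```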
Here we introduce a framework for preserving privacy in video surveillance. Raw video footage is decomposed into a background and one or more object-video streams. Such object-centric decomposition of the incoming video footage opens up new possibilities to provide visual surveillance of an area without compromising the privacy of the individuals present in that area. Object-video streams allow us to render the scene in a variety of ways: 1) individuals in the scene can be represented as blobs, obscuring their identities; 2) foreground objects can be color coded to convey subtle scene information to the operator, again without revealing the identities of the individuals present in the scene; 3) the scene can be partially rendered, i.e., revealing the identities of some individuals, while preserving the anonymity of others, etc. We evaluate our approach in a virtual train station environment populated by autonomous, lifelike virtual pedestrians. We also demonstrate our approach on real video footage. Lastly, we show that the Microsoft Kinect sensor can be used to decompose the incoming video footage into object-video streams.
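A sketch of the object-centric decomposition using OpenCV's MOG2 background subtractor, with each connected foreground component treated as one object-video layer and rendered as an anonymizing flat-colored blob; this is an assumed stand-in for the paper's decomposition pipeline, not its implementation.

```python
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

def decompose(frame):
    """Return the learned background plus one masked 'object-video' layer per blob."""
    fg = subtractor.apply(frame)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    background = subtractor.getBackgroundImage()
    n, labels = cv2.connectedComponents(fg)
    objects = [(labels == i).astype(np.uint8) * 255 for i in range(1, n)]
    return background, objects

def render_private(background, objects, color=(0, 255, 0)):
    """Privacy-preserving rendering: each foreground object becomes a flat colored blob."""
    out = background.copy()
    for mask in objects:
        out[mask > 0] = color
    return out
```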
There has been large growth in hardware and software systems capable of producing vast amounts of image and video data. These systems are rich sources of continuous image and video streams. This motivates researchers to build scalable computer vision systems that utilize data-streaming concepts for processing visual data streams. However, several challenges exist in building large-scale computer vision systems. For example, computer vision algorithms have different accuracy and speed profiles depending on the content, type, and speed of incoming data. Also, it is not clear how to adaptively tune these algorithms in large-scale systems. These challenges exist because we lack formal frameworks for building and optimizing large-scale visual processing systems. This paper presents formal methods and algorithms that aim to overcome these challenges and aid in building and optimizing large-scale computer vision systems. We describe a formal algebra framework for the mathematical description of computer vision pipelines for processing image and video streams. The algebra naturally describes feedback control and provides a formal and abstract method for optimizing computer vision pipelines. We then show that a general optimizer can be used with the feedback-control mechanisms of our stream algebra to provide a common online parameter optimization method for computer vision pipelines.
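A toy sketch of the flavor of such a stream algebra: pipeline stages as composable generators, with one feedback channel that retunes an upstream parameter from a downstream measurement. The operator names and the toy detector are illustrative assumptions, not the paper's algebra.

```python
def source(frames):
    yield from frames

def feedback_op(process, measure, tune, state):
    """Apply process(item, state), then let tune adjust state from measure(output):
    a minimal feedback-control loop wrapped around one pipeline stage."""
    def stage(stream):
        for item in stream:
            out = process(item, state)
            tune(state, measure(out))
            yield out
    return stage

def pipeline(src, *stages):
    stream = src
    for s in stages:
        stream = s(stream)
    return stream

# toy example: a detector whose threshold is lowered whenever it reports zero detections
state = {"thresh": 0.8}
detect = lambda scores, s: [v for v in scores if v > s["thresh"]]
tune = lambda s, n: s.update(thresh=max(0.1, s["thresh"] - 0.1)) if n == 0 else None
for out in pipeline(source([[0.5], [0.9], [0.3]]), feedback_op(detect, len, tune, state)):
    print(out, state["thresh"])
```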
We present a novel technique for image-driven shot retrieval in video data. Specifically, given a query image, our method can efficiently pick the video segment containing that image. Video is first divided into shots. Each shot is described using an embedded hidden Markov model (EHMM). The EHMM is trained on GIST-like descriptors of frames in that shot. The trained EHMM computes the likelihood that a query image belongs to the shot. A Support Vector Machine classifier is trained for each EHMM. The classifier provides a yes/no decision given the likelihood value produced by its EHMM. Given a collection of shot models from one or more videos, the proposed technique can efficiently decide whether or not an image belongs to a video by identifying the shot most likely to contain that image. The proposed technique is evaluated on a realistic dataset.
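Embedded HMMs are not available in common Python libraries, so the sketch below abstracts each shot model as a log-likelihood function paired with its SVM and shows only the retrieval decision loop described above; all names here are illustrative stand-ins.

```python
class ShotModel:
    """Stand-in for one shot: an EHMM likelihood plus its yes/no classifier."""
    def __init__(self, loglik_fn, svm):
        self.loglik_fn = loglik_fn   # maps a GIST-like descriptor to a log-likelihood
        self.svm = svm               # binary classifier over that scalar (has .predict)

    def score(self, descriptor):
        ll = self.loglik_fn(descriptor)
        accept = self.svm.predict([[ll]])[0] == 1
        return ll, accept

def locate_shot(query_descriptor, shot_models):
    """Return the index of the most likely accepting shot, or None if all reject."""
    best, best_ll = None, float("-inf")
    for i, model in enumerate(shot_models):
        ll, accept = model.score(query_descriptor)
        if accept and ll > best_ll:
            best, best_ll = i, ll
    return best
```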
This paper addresses the problem of indexing and querying very large databases of binary vectors. Such databases of binary vectors are a common occurrence in domains such as information retrieval and computer vision. We propose an indexing structure consisting of a compressed bitwise trie and a hash table for supporting range queries in Hamming space. The index structure, which can be updated incrementally, is able to solve range queries for any radius. Our approach significantly outperforms state-of-the-art approaches.
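A minimal sketch of Hamming range search over a binary trie with a distance budget; the compression and the companion hash table described in the paper are omitted.

```python
class TrieNode:
    __slots__ = ("children", "ids")
    def __init__(self):
        self.children = [None, None]   # child for bit 0 and bit 1
        self.ids = []                  # vector ids stored at leaves

def insert(root, bits, vid):
    node = root
    for b in bits:
        if node.children[b] is None:
            node.children[b] = TrieNode()
        node = node.children[b]
    node.ids.append(vid)

def range_query(node, bits, depth, budget, out):
    """Collect ids of all stored vectors within Hamming distance `budget` of `bits`."""
    if node is None or budget < 0:
        return
    if depth == len(bits):
        out.extend(node.ids)
        return
    b = bits[depth]
    range_query(node.children[b], bits, depth + 1, budget, out)          # matching bit: free
    range_query(node.children[1 - b], bits, depth + 1, budget - 1, out)  # mismatch: costs 1

root = TrieNode()
insert(root, [0, 1, 1, 0], "a"); insert(root, [0, 1, 0, 0], "b"); insert(root, [1, 1, 0, 1], "c")
hits = []; range_query(root, [0, 1, 1, 1], 0, 1, hits)   # radius-1 query
# hits == ["a"]  ("b" and "c" are both at Hamming distance 2)
```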
We present a cognitively-controlled vision system that combines low-level object recognition and tracking with high-level symbolic reasoning, with the practical purpose of solving difficult space robotics problems: satellite rendezvous and docking. The reasoning module, which encodes a model of the environment, performs deliberation to 1) guide the vision system in a task-directed manner, 2) activate vision modules depending on the progress of the task, 3) validate the performance of the vision system, and 4) suggest corrections to the vision system when the latter is performing poorly. Reasoning and related elements, among them intention, context, and memory, contribute to improved performance (i.e., robustness, reliability, and usability). We demonstrate the vision system controlling a robotic arm that autonomously captures a free-flying satellite. Currently such operations are performed either manually or by constructing detailed control scripts. The manual approach is costly and exposes the astronauts to danger, while the scripted approach is tedious and error-prone. Therefore, there is substantial interest in performing these operations autonomously, and the work presented here is a step in this direction. To the best of our knowledge, this is the only satellite-capturing system that relies exclusively on vision to estimate the pose of the satellite and can deal with an uncooperative satellite.
Computer vision and sensor networks researchers are increasingly motivated to investigate complex multi-camera sensing and control issues that arise in the automatic visual surveillance of extensive, highly populated public spaces such as airports and train stations. However, they often encounter serious impediments to deploying and experimenting with large-scale physical camera networks in such real-world environments. We propose an alternative approach called "Virtual Vision" that facilitates this type of research through the virtual reality simulation of populated urban spaces, camera sensor networks, and computer vision on commodity computers. We demonstrate the usefulness of our approach by developing two highly automated surveillance systems comprising passive and active pan/tilt/zoom cameras that are deployed in a virtual train station environment populated by autonomous, lifelike virtual pedestrians. The easily reconfigurable virtual cameras distributed in this environ...
ArXiv, 2018
We present a framework for video-driven crowd synthesis. Motion vectors extracted from the input crowd video are processed to compute global motion paths. These paths encode the dominant motions observed in the input video. The paths are then fed into a behavior-based crowd simulation framework, which is responsible for synthesizing crowd animations that respect the motion patterns observed in the video. Our system synthesizes 3D virtual crowds by animating virtual humans along the trajectories returned by the crowd simulation framework. We also propose a new metric for comparing the "visual similarity" between the synthesized crowd and the exemplar crowd. We demonstrate the proposed approach on crowd videos collected under different settings.
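A sketch of one plausible front-end step for this pipeline: computing dense optical flow with OpenCV and clustering the significant flow vectors into dominant motion directions. The actual global path extraction and behavior-based simulation in the paper go well beyond this snippet.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dominant_motions(prev_gray, next_gray, n_patterns=4, min_mag=1.0):
    """Cluster significant dense-flow vectors into dominant motion directions."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vecs = flow.reshape(-1, 2)
    vecs = vecs[np.linalg.norm(vecs, axis=1) > min_mag]   # keep moving pixels only
    if len(vecs) < n_patterns:
        return np.empty((0, 2))
    return KMeans(n_clusters=n_patterns, n_init=10).fit(vecs).cluster_centers_
```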
This paper presents our research towards smart camera networks capable of carrying out advanced surveillance tasks with little or no human supervision. A unique centerpiece of our work is the combination of computer graphics, artificial life, and computer vision simulation technologies to develop such networks and experiment with them. Specifically, we demonstrate a smart camera network comprising static and active simulated video surveillance cameras that provides extensive coverage of a large virtual public space, a train station populated by autonomously self-animating virtual pedestrians. The realistically simulated network of smart cameras performs persistent visual surveillance of individual pedestrians with minimal intervention. Our innovative camera control strategy naturally addresses camera aggregation and handoff, is robust against camera and communication failures, and requires no camera calibration, detailed wor...
Computer Vision and Image Understanding, 2015
Academic Press Library in Signal Processing, 2014
Multi-camera systems are rapidly evolving from highly specialized wired networks of stationary passive and active cameras that provide visual coverage of the scene to ad hoc wireless networks of smart camera nodes, capable of near-autonomous operation in a variety of applications, such as urban and participatory sensing, disaster response, plant and animal habitat monitoring, etc. Whereas traditional multi-camera systems focus primarily on wide-area scene analysis, smart camera networks are also concerned with camera coordination and control, in-network processing and storage, and resource-aware visual analysis. Pre-recorded video, while useful, is inadequate in the study of camera control and coordination strategies. Rather, one needs online access to the entire network in order to control and study its behavior under different sensing regimes. This observation, together with the fact that most researchers who are motivated to study camera networks do not have access to physical camera networks of suitable complexity, led us to propose the "Virtual Vision" paradigm for camera networks research (see Figure 21.1). Virtual vision advocates employing visually and behaviorally realistic 3D virtual environments, populated with lifelike, self-animating objects (pedestrians, automobiles, etc.), to carry out camera networks research. Camera networks are simulated in these environments by deploying virtual cameras that mimic the characteristics of physical cameras. Virtual vision offers several advantages over the use of physical camera networks during the ideation, prototyping, and evaluation phases of camera networks research, among them:
• The virtual vision simulator runs on (high-end) commodity PCs, obviating the need to grapple with special-purpose hardware.
• The virtual cameras are very easily instantiated, relocated, and reconfigured in the virtual environment.
• The virtual world provides readily accessible ground-truth data for the purposes of algorithm/system validation.
• Experiments are perfectly repeatable in the virtual world, so we can easily modify algorithms and/or their parameters and immediately determine the effect.
Proceedings of the Fourth ACM/IEEE International Conference on Distributed Smart Cameras - ICDSC '10, 2010
The paper develops an ad hoc network of active pan/tilt/zoom (PTZ) and passive wide field-of-view (FOV) cameras capable of carrying out observation tasks autonomously. The network is assumed to be uncalibrated, lacks a central controller, and relies upon local decision making at each node and inter-node negotiations for its overall behavior. To this end, we develop intelligent camera nodes (both active and passive) that can perform multiple observation tasks simultaneously. We also present a negotiation protocol that allows camera nodes to set up collaborative tasks in a purely distributed manner. Camera assignment conflicts that invariably arise in such networks are naturally and gracefully handled through at-node processing and inter-node negotiations. We expect the proposed camera network to be highly scalable due to the lack of any centralized control.
IEEE Sensors Journal, 2015
2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance
The goals of this paper are twofold: (i) to present our initial efforts towards the realization of a fully autonomous sensor network of dynamic video cameras capable of providing perceptive coverage of a large public space, and (ii) to further the cause of exploiting visually and behaviorally realistic virtual environments in the development and testing of machine vision systems. In particular, our proposed sensor network employs techniques that enable a collection of active (pan-tilt-zoom) cameras to collaborate in performing various visual surveillance tasks, such as keeping one or more pedestrians within view, with minimal reliance on a human operator. The network features local and global autonomy and lacks any central controller, which entails robustness and scalability. Its functionality is the result of local decision-making capabilities at each camera node and communication between the nodes. We demonstrate our surveillance system in a virtual train station environment populated by autonomous, lifelike virtual pedestrians. Our readily reconfigurable virtual cameras generate synthetic video feeds that emulate those generated by real surveillance cameras monitoring public spaces. This type of research would be difficult in the real world given the costs of deploying and experimenting with an appropriately complex camera network in a large public space the size of a train station.
IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2013
He conducted research for the Programmable Digital Camera project in the Information Systems Laboratory. He also consulted for industry in the areas of digital camera systems design and algorithms development. He is now an Associate Professor in Electrical and Electronic Engineering and the founding Director of the Imaging Systems Laboratory at The University of Hong Kong, with broad research interests around the theme of computational optics and imaging. During the 2010-2011 academic year, he taught at the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology as a Visiting Associate
2010 6th IEEE International Conference on Distributed Computing in Sensor Systems Workshops (DCOSSW), 2010
We introduce an ad hoc network of active pan/tilt/zoom and passive cameras capable of carrying out collaborative tasks through strictly local interactions. Camera interactions are modeled as negotiations between two or more cameras. Through negotiations, cameras agree upon how best to carry out an observation task. Our smart camera nodes are modeled as behavior-based agents and are capable of engaging in multiple negotiations and performing multiple tasks simultaneously. We expect the proposed camera network to be highly scalable due to the lack of any centralized control.
IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2013
We present a distributed virtual vision simulator capable of simulating large-scale camera networks. Our virtual vision simulator is capable of simulating pedestrian traffic in different 3D environments. Simulated cameras deployed in these virtual environments generate synthetic video feeds that are fed into a vision processing pipeline supporting pedestrian detection and tracking. The visual analysis results are then used for subsequent processing, such as camera control, coordination, and handoff. Our virtual vision simulator is realized as a collection of modules that communicate with each other over the network. Consequently, we can deploy our simulator over a network of computers, allowing us to simulate much larger camera networks and much more complex scenes than is otherwise possible. Specifically, we show that our proposed virtual vision simulator can model a camera network, comprising more than one hundred active pan/tilt/zoom and passive wide field-of-view cameras, deployed in an upper floor of an office tower in downtown Toronto.
2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007
This paper advocates a Virtual Vision paradigm and demonstrates its usefulness in camera sensor network research. Virtual vision prescribes the use of a visually and behaviorally realistic virtual environment simulator in the design and evaluation of surveillance systems. Impediments to deploying and experimenting with appropriately complex camera networks make virtual vision an attractive alternative for many vision researchers who are motivated to investigate high-level multi-camera control issues within such networks. In particular, we present two prototype surveillance systems comprising passive and active pan/tilt/zoom cameras. We deploy these systems in a virtual train station environment populated by autonomous, lifelike virtual pedestrians. The easily reconfigurable virtual cameras situated throughout this environment generate synthetic video feeds that emulate those acquired by real surveillance cameras monitoring extensive public spaces. Our novel multi-camera control strategies enable the cameras to collaborate in persistently observing pedestrians of interest that move across their fields of view and in capturing close-up videos of pedestrians as they travel through designated areas. The sensor networks support task-dependent camera node selection and aggregation through local decision-making and inter-node communication. Our approach to multi-camera control is robust to node failures and message loss.
Proceedings of the International Conference on Distributed Smart Cameras - ICDSC '14, 2014
This thesis explores the idea of conditional offers during camera handoff negotiations. In a departure from contract-net inspired negotiation models that have been proposed for camera handoffs, the current scheme assumes that each camera maintains the state of its neighbouring cameras. To this end, we develop a new short-term memory model for maintaining a camera's own state and the state of its neighbouring cameras. The fact that each camera is aware of its surrounding cameras is exploited to generate conditional offers during handoff negotiations. This can result in multiple rounds of negotiations during a single handoff, leading to successful handoffs in situations where a camera being asked to take on one more task is unable to do so without relinquishing an existing task. The results demonstrate the advantages of the proposed negotiation model over existing models for camera handoffs.
2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, 2009
This paper presents a framework for preserving privacy in video surveillance. Raw video is decomposed into a background and one or more object-video streams. Object-video streams can be combined to render the scene in a variety of ways: 1) the original video can be reconstructed from object-video streams without any data loss; 2) individuals in the scene can be represented as blobs, obscuring their identities; 3) foreground objects can be color coded to convey subtle scene information to the operator, again without revealing the identities of the individuals present in the scene; 4) the scene can be partially rendered, i.e., revealing the identities of some individuals, while preserving the anonymity of others. We evaluate our approach in a virtual train station environment populated by autonomous, lifelike virtual pedestrians.
2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2011
We demonstrate a video surveillance system, comprising passive and active pan/tilt/zoom (PTZ) cameras, that intelligently responds to scene complexity, automatically capturing higher resolution video when there are fewer people in the scene and capturing lower resolution video as the number of pedestrians present in the scene increases. To this end, we have developed behavior-based controllers for passive and active cameras, enabling these cameras to carry out multiple observation tasks simultaneously. The research presented herein is a step towards video surveillance systems, consisting of a heterogeneous set of sensors, that provide persistent coverage of large spaces, while optimizing surveillance data collection by tuning the sensing parameters of individual sensors (in a distributed manner) in response to scene activity.
Intelligent Multimedia Surveillance, 2013
Here we introduce a framework for preserving privacy in video surveillance. Raw video footage is decomposed into a background and one or more object-video streams. Such object-centric decomposition of the incoming video footage opens up new possibilities to provide visual surveillance of an area without compromising the privacy of the individuals present in that area. Object-video streams allow us to render the scene in a variety of ways: 1) individuals in the scene can be represented as blobs, obscuring their identities; 2) foreground objects can be color coded to convey subtle scene information to the operator, again without revealing the identities of the individuals present in the scene; 3) the scene can be partially rendered, i.e., revealing the identities of some individuals, while preserving the anonymity of others, etc. We evaluate our approach in a virtual train station environment populated by autonomous, lifelike virtual pedestrians. We also demonstrate our approach on real video footage. Lastly, we show that the Microsoft Kinect sensor can be used to decompose the incoming video footage into object-video streams.
Proceedings of the 11th Communications and Networking Simulation Symposium (CNS '08), 2008
We present our progress on the Virtual Vision paradigm, which prescribes visually and behaviorally realistic virtual environments, called "reality emulators", as a simulation tool to facilitate research on large-scale camera sensor networks. We have successfully exploited a prototype reality emulator, a virtual train station populated with numerous autonomous pedestrians, to rapidly develop novel solutions to challenging problems, such as multi-camera control and scheduling for persistent human surveillance by next-generation networks of smart cameras deployed in extensive public spaces.
2012 Ninth Conference on Computer and Robot Vision, 2012
Virtual Vision advocates developing visually and behaviorally realistic 3D synthetic environments to serve the needs of computer vision research. Virtual vision is especially well-suited for studying large-scale camera networks. A virtual vision simulator capable of generating "realistic" synthetic imagery from real-life scenes, involving pedestrians and other objects, is the sine qua non of carrying out virtual vision research. Here we develop a distributed, customizable virtual vision simulator capable of simulating pedestrian traffic in a variety of 3D environments. Virtual cameras deployed in this synthetic environment generate imagery using state-of-the-art computer graphics techniques, boasting realistic lighting effects, shadows, etc. The synthetic imagery is fed into a visual analysis pipeline that currently supports pedestrian detection and tracking. The results of this analysis can then be used for subsequent processing, such as camera control, coordination, and handoff. It is important to bear in mind that our visual analysis pipeline is designed to handle real-world imagery without any modifications. Consequently, it closely mimics the performance of visual analysis routines that one might deploy on physical cameras. Our virtual vision simulator is realized as a collection of modules that communicate with each other over the network. Consequently, we can deploy our simulator over a network of computers, allowing us to simulate much larger camera networks and much more complex scenes than is otherwise possible.
Simulated smart cameras track the movement of simulated pedestrians in a simulated train station, allowing development of improved control strategies for smart camera networks.