Mohsen Zand | Queen's University at Kingston

Papers by Mohsen Zand

ObjectBox: From Centers to Boxes for Anchor-Free Object Detection

Cornell University - arXiv, Jul 14, 2022

We present ObjectBox, a novel single-stage anchor-free and highly generalizable object detection approach. As opposed to both existing anchor-based and anchor-free detectors, which are more biased toward specific object scales in their label assignments, we use only object center locations as positive samples and treat all objects equally across feature levels regardless of the objects' sizes or shapes. Specifically, our label assignment strategy considers the object center locations as shape- and size-agnostic anchors in an anchor-free fashion, and allows learning to occur at all scales for every object. To support this, we define new regression targets as the distances from two corners of the center cell location to the four sides of the bounding box. Moreover, to handle scale-variant objects, we propose a tailored IoU loss to deal with boxes of different sizes. As a result, our proposed object detector does not need any dataset-dependent hyperparameters to be tuned across datasets. We evaluate our method on the MS-COCO 2017 and PASCAL VOC 2012 datasets, and compare our results to state-of-the-art methods. We observe that ObjectBox performs favorably in comparison to prior works. Furthermore, we perform rigorous ablation experiments to evaluate the different components of our method.
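The regression targets described in the abstract can be sketched as follows. This is a minimal illustration under assumed conventions (the function `objectbox_targets` and its grid indexing are hypothetical, not the authors' implementation): the distances run from the top-left and bottom-right corners of the grid cell containing the object center to the four sides of the box.

```python
def objectbox_targets(box, stride):
    """Sketch of ObjectBox-style regression targets (hypothetical helper).

    box:    (x1, y1, x2, y2) in image coordinates.
    stride: size of a grid cell at the current feature level.
    Returns the distances from the center cell's top-left corner to the
    left/top box sides, and from its bottom-right corner to the
    right/bottom box sides.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Grid cell (at the given stride) that contains the object center.
    cell_x, cell_y = int(cx // stride), int(cy // stride)
    # Top-left and bottom-right corners of that cell, in image coordinates.
    tlx, tly = cell_x * stride, cell_y * stride
    brx, bry = tlx + stride, tly + stride
    # Left/top are measured from the top-left cell corner,
    # right/bottom from the bottom-right cell corner.
    return (tlx - x1, tly - y1, x2 - brx, y2 - bry)
```

Because the targets are defined relative to the cell corners rather than to anchor shapes, the same encoding applies at every feature level regardless of object size.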

Keypoint Cascade Voting for Point Cloud Based 6DoF Pose Estimation

Cornell University - arXiv, Oct 14, 2022

Figure 1: Samples of RCVPose3D results on Occlusion LINEMOD. The meshes are rendered with the ground-truth (GT) pose, the green points are rendered with the estimated poses, whereas the blue dots are projected GT poses. The colors in the point cloud and RGB images are for illustration only, as RGB data is not used for training or inference.

Flow-based Spatio-Temporal Structured Prediction of Dynamics

Conditional Normalizing Flows (CNFs) are flexible generative models capable of representing complicated distributions with high dimensionality and large interdimensional correlations, making them appealing for structured output learning. Their effectiveness in modelling multivariate spatio-temporal structured data has yet to be completely investigated. We propose MotionFlow, a novel normalizing flows approach that autoregressively conditions the output distributions on the spatio-temporal input features. It combines deterministic and stochastic representations with CNFs to create a probabilistic neural generative approach that can model the variability seen in high-dimensional structured spatio-temporal data. We specifically propose to use conditional priors to factorize the latent space for time-dependent modeling. We also exploit masked convolutions as autoregressive conditionals in CNFs. As a result, our method is able to define arbitrarily expressive output probability distributions under temporal dynamics in multivariate prediction tasks. We apply our method to different tasks, including trajectory prediction, motion prediction, time series forecasting, and binary segmentation, and demonstrate that our model is able to leverage normalizing flows to learn complicated time-dependent conditional distributions.

Semantic-based image retrieval for multi-word text queries

Catalyzed by the development of digital technologies, the amount of digital images being produced, archived, and transmitted is reaching enormous proportions. It is hence imperative to develop techniques able to index and retrieve relevant images according to a user's information need. Image retrieval based on semantic learning of the image content has recently become a promising strategy to deal with these aspects. With semantic-based image retrieval (SBIR), the real semantic meanings of images are discovered and used to retrieve images relevant to the user query. Thus, digital images are automatically labeled with a set of semantic keywords describing the image content. Similar to text document retrieval, these keywords are then collectively used to index, organize, and locate images of interest in a database. Nevertheless, understanding and discovering the semantics of a visual scene are high-level cognitive tasks that are hard to automate, which provides challenging research oppor...

A graph-based approach for moving objects detection from UAV videos

The work in this paper deals with moving object detection (MOD) for single or multiple moving objects from unmanned aerial vehicle (UAV) videos. The proposed technique aims to overcome limitations of traditional pairwise image registration-based MOD approaches. The first limitation relates to how potential objects are detected by discovering corresponding regions between two consecutive frames. The commonly used gray-level distance-based similarity measures might not cater well for the dynamic spatio-temporal differences of the camera and moving objects. The second limitation relates to object occlusion. Traditionally, when only frame pairs are considered, some objects might disappear between two frames. However, such objects were actually occluded and reappear in a later frame, and are not detected. This work attempts to address both issues by first converting each frame into a graph representation whose nodes are segmented superpixel regions. Through this, object detection can be treated ...

Flow-based Autoregressive Structured Prediction of Human Motion

A new method is proposed for human motion prediction by learning temporal and spatial dependencies in an end-to-end deep neural network. The joint connectivity is explicitly modeled using a novel autoregressive structured prediction representation based on flow-based generative models. We learn a latent space of complex body poses in consecutive frames, conditioned on the high-dimensional structured input sequence. To construct each latent variable, the general and local smoothness of the joint positions are considered in a generative process using conditional normalizing flows. As a result, all frame-level and joint-level continuities in the sequence are preserved in the model. This enables us to parameterize the inter-frame and intra-frame relationships and joint connectivity for robust long-term as well as short-term predictions. Our experiments on two challenging benchmark datasets, Human3.6M and AMASS, demonstrate that our proposed method is able to effectivel...

One Shot Radial Distortion Correction by Direct Linear Transformation

A novel method is proposed to estimate and correct image radial distortion. The solution is based upon an algebraic expansion of the homographic relationship between a planar pattern and its distorted projection into a single image, which is solved using the Direct Linear Transformation. The method requires ten point correspondences, is fully automatic, and estimates both the first two parameters of the division model and the center of distortion. Experimental results show the method to be more accurate than other recent one- and two-parameter point-based and plumb-line-based approaches.
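The two-parameter division model referred to above can be illustrated as follows. The functional form is the standard division model with a center of distortion; the helper `undistort_point` is a hypothetical sketch, not the paper's DLT-based solver.

```python
def undistort_point(pt, k1, k2, center):
    """Two-parameter division model (assumed standard form):
    p_u = c + (p_d - c) / (1 + k1*r^2 + k2*r^4), where r = |p_d - c|
    is the radius of the distorted point about the distortion center c.
    """
    x, y = pt
    cx, cy = center
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy
    # For typical barrel distortion (k1 < 0), points are pushed outward;
    # here we map a distorted point back toward its undistorted position.
    scale = 1.0 / (1.0 + k1 * r2 + k2 * r2 * r2)
    return (cx + dx * scale, cy + dy * scale)
```

With `k1 = k2 = 0` the mapping is the identity; the paper's contribution is estimating `k1`, `k2`, and `center` jointly from point correspondences in a single image.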

A Framework for Multiple Moving Objects Detection in Aerial Videos

Spatial Modeling in GIS and R for Earth and Environmental Sciences, 2019

Aerial videos captured using dynamic cameras commonly require background remodeling at every frame. In addition, camera motion and the movement of multiple objects present an unstable imaging environment with varying motion patterns. This makes detecting multiple moving objects a difficult task. In this chapter, a two-step framework, termed motion differences of matched region-based features (MDMRBF), is presented. First, each frame goes through superpixel segmentation to produce regions, and each frame is then represented as a region adjacency graph of visual appearance and geometric properties. This representation is important for correspondence discovery between consecutive frames based on multigraph matching. Ultimately, each region is labeled as either background or foreground (object) using a proposed graph-coloring algorithm. Two datasets, namely (1) the DARPA-VIVID dataset and (2) self-captured videos from an unmanned aerial vehicle-mounted camera, have been used to validate the feasibility of MDMRBF. Comparison is also done with three existing detection algorithms, where experiments show promising results with precision at 94% and recall at 89%.

Multiple Moving Object Detection From UAV Videos Using Trajectories of Matched Regional Adjacency Graphs

IEEE Transactions on Geoscience and Remote Sensing, 2017

Image registration has long been used as a basis for the detection of moving objects. Registration techniques attempt to discover correspondences between consecutive frame pairs based on image appearances under rigid and affine transformations. However, spatial information is often ignored, and different motions from multiple moving objects cannot be efficiently modeled. Moreover, image registration is not well suited to handle occlusion, which can result in potential object misses. This paper proposes a novel approach to address these problems. First, segmented video frames from unmanned aerial vehicle captured video sequences are represented using region adjacency graphs of visual appearance and geometric properties. Correspondence matching (for visible and occluded regions) is then performed between graph sequences by using multigraph matching. After matching, region labeling is achieved by a proposed graph coloring algorithm which assigns a background or foreground label to the respective region. The intuition of the algorithm is that the background scene and foreground moving objects exhibit different motion characteristics in a sequence, and hence their spatial distances are expected to vary with time. Experiments conducted on several DARPA VIVID video sequences as well as self-captured videos show that the proposed method is robust to unknown transformations, with significant improvements in overall precision and recall compared to existing works.
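The motion-characteristics intuition behind the labeling step can be illustrated with a toy sketch. This is not the paper's graph-coloring algorithm; `label_regions`, the variance threshold, and the choice of a known background reference region are all assumptions made for illustration.

```python
def label_regions(trajectories, ref, threshold):
    """Toy illustration of the labeling intuition: a region whose distance
    to a reference background region varies strongly over time is likely a
    moving (foreground) object, since background regions keep roughly
    constant spatial relations as the camera moves.

    trajectories: region id -> list of (x, y) centroids, one per frame.
    ref:          id of a region assumed to belong to the background.
    """
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    labels = {}
    for rid, track in trajectories.items():
        # Per-frame distance of this region to the reference region.
        d = [dist(p, q) for p, q in zip(track, trajectories[ref])]
        mean = sum(d) / len(d)
        var = sum((x - mean) ** 2 for x in d) / len(d)
        # High variance over time -> moving relative to the background.
        labels[rid] = "foreground" if var > threshold else "background"
    return labels
```

In the actual method this pairwise reasoning is embedded in a graph coloring over matched region adjacency graphs rather than computed against a single fixed reference.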

Texture classification and discrimination for region-based image retrieval

Journal of Visual Communication and Image Representation, 2015

In RBIR, texture features are crucial in determining the class a region belongs to, since they can overcome the limitations of color and shape features. Two robust approaches to modeling texture features are Gabor and curvelet features. Although both features are close to human visual perception, sufficient information needs to be extracted from their sub-bands for effective texture classification. Moreover, shape irregularity can be a problem, since the Gabor and curvelet transforms can only be applied to regular shapes. In this paper, we propose an approach that applies both the Gabor wavelet and the curvelet transforms to the transferred regular shapes of the image regions. We also apply a fitting method to encode the sub-bands' information in polynomial coefficients, creating a texture feature vector with maximum discriminative power. Experiments on the texture classification task with the ImageCLEF and Outex databases demonstrate the effectiveness of the proposed approach.
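The coefficient-fitting idea can be sketched as follows. This is an assumed form using an ordinary least-squares polynomial fit; `texture_descriptor` is a hypothetical helper, and the real method's choice of fitting model and sub-band statistics may differ.

```python
import numpy as np

def texture_descriptor(subband_energies, degree=3):
    """Fit a low-degree polynomial to the sequence of sub-band energies
    (e.g. Gabor or curvelet sub-band responses) and keep its coefficients
    as a compact, fixed-length texture feature vector."""
    ys = np.asarray(subband_energies, dtype=float)
    xs = np.linspace(0.0, 1.0, len(ys))  # normalized sub-band index
    # Least-squares fit; returns degree+1 coefficients, highest power first.
    return np.polyfit(xs, ys, degree)
```

The appeal of such an encoding is that sub-band counts can vary between transforms while the descriptor length stays fixed at `degree + 1`.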

Scatter correction improvement based on the Convolution Subtraction Technique in SPECT imaging

ijmer.com

SPECT is a tomographic technique that can reveal information about metabolic activity in the body and improve clinical diagnosis. In this paper, a convolution subtraction technique is proposed for scatter compensation in SPECT imaging. In ...

Vote from the Center: 6 DoF Pose Estimation in RGB-D Images by Radial Keypoint Voting

ArXiv, 2021

We propose a novel keypoint voting scheme based on intersecting spheres that is more accurate than existing schemes and allows for a smaller set of more disperse keypoints. The scheme forms the basis of the proposed RCVPose method for 6 DoF pose estimation of 3D objects in RGB-D data, which is particularly effective at handling occlusions. A CNN is trained to estimate the distance between the 3D point corresponding to the depth mode of each RGB pixel and a set of 3 disperse keypoints defined in the object frame. At inference, a sphere of radius equal to this estimated distance is generated, centered at each 3D point. The surface of these spheres votes to increment a 3D accumulator space, the peaks of which indicate keypoint locations. The proposed radial voting scheme is more accurate than previous vector or offset schemes, and robust to disperse keypoints. Experiments demonstrate RCVPose to be highly accurate and competitive, achieving state-of-the-art results on LINEMOD (99.7%), ...
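The radial voting scheme can be illustrated with a toy dense accumulator. This is a sketch under an assumed voxelization, not RCVPose's implementation; `radial_vote` and its parameters are hypothetical, and a real system would vote only on sphere surfaces rather than scanning every voxel.

```python
def radial_vote(points_radii, grid_size, voxel, tol):
    """Toy dense version of radial keypoint voting: every voxel whose
    center lies (within tol) at a scene point's estimated radius receives
    one vote from that point; the accumulator peak marks the keypoint.

    points_radii: list of ((x, y, z), radius) pairs in scene coordinates.
    """
    acc = [[[0] * grid_size for _ in range(grid_size)]
           for _ in range(grid_size)]
    for (px, py, pz), r in points_radii:
        for i in range(grid_size):
            for j in range(grid_size):
                for k in range(grid_size):
                    # Center of voxel (i, j, k).
                    cx = (i + 0.5) * voxel
                    cy = (j + 0.5) * voxel
                    cz = (k + 0.5) * voxel
                    d = ((cx - px) ** 2 + (cy - py) ** 2
                         + (cz - pz) ** 2) ** 0.5
                    if abs(d - r) < tol:
                        acc[i][j][k] += 1
    # Find the accumulator peak.
    best, best_v = None, -1
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                if acc[i][j][k] > best_v:
                    best_v, best = acc[i][j][k], (i, j, k)
    return best, best_v
```

Because each point constrains the keypoint only to a sphere, a single scalar radius per point suffices, and three well-separated points already pin down the peak where their spheres intersect.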

Multiscale Crowd Counting and Localization By Multitask Point Supervision

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose a multitask approach for crowd counting and person localization in a unified framework. As the detection and localization tasks are well-correlated and can be jointly tackled, our model benefits from a multitask solution by learning multiscale representations of encoded crowd images, and subsequently fusing them. In contrast to the relatively more popular density-based methods, our model uses point supervision to allow for crowd locations to be accurately identified. We test our model on two popular crowd counting datasets, ShanghaiTech A and B, and demonstrate that our method achieves strong results on both counting and localization tasks, with MSE measures of 110.7 and 15.0 for crowd counting and AP measures of 0.71 and 0.75 for localization, on ShanghaiTech A and B respectively. Our detailed ablation experiments show the impact of our multiscale approach as well as the effectiveness of the fusion

Oriented Bounding Boxes for Small and Freely Rotated Objects

ArXiv, 2021

A novel object detection method is presented that handles freely rotated objects of arbitrary sizes, including tiny objects as small as 2 x 2 pixels. Such tiny objects appear frequently in remotely sensed images, and present a challenge to recent object detection algorithms. More importantly, current object detection methods were originally designed to accommodate axis-aligned bounding box detection, and therefore fail to accurately localize oriented boxes that best describe freely rotated objects. In contrast, the proposed convolutional neural network (CNN)-based approach uses potential pixel information at multiple scale levels without the need for any external resources, such as anchor boxes. The method encodes the precise location and orientation of features of the target objects at grid cell locations. Unlike existing methods that regress the bounding box location and dimension, the proposed method learns all the required information by classification, which has the added...

Multistream ValidNet: Improving 6D Object Pose Estimation by Automatic Multistream Validation

ArXiv, 2021

This work presents a novel approach to improve the results of pose estimation by detecting and distinguishing between the occurrence of True and False Positive results. It achieves this by training a binary classifier on the output of an arbitrary pose estimation algorithm, and returns a binary label indicating the validity of the result. We demonstrate that our approach improves upon a state-of-the-art pose estimation result on the Sileane dataset, outperforming a variation of the alternative CullNet method by 4.15% in average class accuracy and 0.73% in overall accuracy at validation. Applying our method can also improve the pose estimation average precision results of Op-Net by 6.06% on average.

ValidNet: A Deep Learning Network for Validation of Surface Registration

Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2020

Recognition-based Segmentation in Persian Character Recognition

... Off-line recognition, on the other hand, is performed after the writing has been performed. ... In both online and offline OCR systems, there are three main approaches for automatic ... the script consists of separated words which are aligned by a horizontal virtual line called "baseline". ...

Recognition-based segmentation in Persian character recognition

Proceedings of World Academy of …, 2008

Optical character recognition of cursive scripts presents a number of challenging problems in both the segmentation and recognition processes in different languages, including Persian. In order to overcome these problems, we use a newly developed Persian word ...

Ontology-Based Semantic Image Segmentation Using Mixture Models and Multiple CRFs

IEEE Transactions on Image Processing, 2016

Semantic image segmentation is a fundamental yet challenging problem, which can be viewed as an extension of conventional object detection with close relation to image segmentation and classification. It aims to partition images into non-overlapping regions that are assigned predefined semantic labels. Most existing approaches utilize and integrate low-level local features and high-level contextual cues, which are fed into an inference framework such as the Conditional Random Field (CRF). However, the lack of meaning in the primitives (i.e., pixels or superpixels) and cues provides low discriminatory capability, since they are rarely object-consistent. Moreover, blind combinations of heterogeneous features and contextual cues exploited through limited neighborhood relations in the CRFs tend to degrade labelling performance. This paper proposes an ontology-based semantic image segmentation approach called OBSIS that jointly models image segmentation and object detection. Specifically, a Dirichlet process mixture model transforms the low-level visual space into an intermediate semantic space, which drastically reduces feature dimensionality. These features are then individually weighted and independently learned within context, using multiple CRFs. The segmentation of images into object parts is hence reduced to a classification task, where object inference is passed to an ontology model. This model resembles the way humans understand images through the combination of different cues, context models, and rule-based learning of the ontologies. Experimental evaluations using the MSRC-21 and PASCAL VOC 2010 datasets show promising results.

Research paper thumbnail of ObjectBox: From Centers to Boxes for Anchor-Free Object Detection

Cornell University - arXiv, Jul 14, 2022

We present ObjectBox, a novel single-stage anchor-free and highly generalizable object detection ... more We present ObjectBox, a novel single-stage anchor-free and highly generalizable object detection approach. As opposed to both existing anchor-based and anchor-free detectors, which are more biased toward specific object scales in their label assignments, we use only object center locations as positive samples and treat all objects equally in different feature levels regardless of the objects' sizes or shapes. Specifically, our label assignment strategy considers the object center locations as shape-and size-agnostic anchors in an anchor-free fashion, and allows learning to occur at all scales for every object. To support this, we define new regression targets as the distances from two corners of the center cell location to the four sides of the bounding box. Moreover, to handle scale-variant objects, we propose a tailored IoU loss to deal with boxes with different sizes. As a result, our proposed object detector does not need any dataset-dependent hyperparameters to be tuned across datasets. We evaluate our method on MS-COCO 2017 and PAS-CAL VOC 2012 datasets, and compare our results to state-of-the-art methods. We observe that ObjectBox performs favorably in comparison to prior works. Furthermore, we perform rigorous ablation experiments to evaluate different components of our method.

Research paper thumbnail of Keypoint Cascade Voting for Point Cloud Based 6DoF Pose Estimation

Cornell University - arXiv, Oct 14, 2022

Figure 1: Samples of RCVPose3D results on Occlusion LINEMOD: The meshes are applied with groundtr... more Figure 1: Samples of RCVPose3D results on Occlusion LINEMOD: The meshes are applied with groundtruth (GT) pose, the green points are applied with estimated poses, whereas the blue dots are projected GT poses. The color in Point Cloud and RGB images are for illustration only, as RGB data is not used for training or inference

Research paper thumbnail of Flow-based Spatio-Temporal Structured Prediction of Dynamics

Conditional Normalizing Flows (CNFs) are flexible generative models capable of representing compl... more Conditional Normalizing Flows (CNFs) are flexible generative models capable of representing complicated distributions with high dimensionality and large interdimensional correlations, making them appealing for structured output learning. Their effectiveness in modelling multivariates spatio-temporal structured data has yet to be completely investigated. We propose MotionFlow as a novel normalizing flows approach that autoregressively conditions the output distributions on the spatio-temporal input features. It combines deterministic and stochastic representations with CNFs to create a probabilistic neural generative approach that can model the variability seen in high-dimensional structured spatio-temporal data. We specifically propose to use conditional priors to factorize the latent space for the time dependent modeling. We also exploit the use of masked convolutions as autoregressive conditionals in CNFs. As a result, our method is able to define arbitrarily expressive output probability distributions under temporal dynamics in multivariate prediction tasks. We apply our method to different tasks, including trajectory prediction, motion prediction, time series forecasting, and binary segmentation, and demonstrate that our model is able to leverage normalizing flows to learn complicated time dependent conditional distributions.

Research paper thumbnail of Semantic-based image retrieval for multi-word text queries

Catalyzed by the development of digital technologies, the amounts of digital images being produce... more Catalyzed by the development of digital technologies, the amounts of digital images being produced, archived and transmitted are reaching enormous proportions. It is hence imperative to develop techniques that are able to index,and retrieve relevant images through user‘s information need. Image retrieval based on semantic learning of the image content has become a promising strategy to deal with these aspects recently. With semantic-based image retrieval (SBIR), the real semantic meanings of images are discovered and used to retrieve relevant images to the user query. Thus, digital images are automatically labeled by a set of semantic keywords describing the image content. Similar to the text document retrieval, these keywords are then collectively used to index,organize and locate images of interest from a database. Nevertheless,understanding and discovering the semantics of a visual scene are high-level cognitive tasks and hard to automate, which provide challenging researchop por...

Research paper thumbnail of A graph-based approach for moving objects detection from UAV videos

The work in this paper deals with moving object detection (MOD) for single/multiple moving object... more The work in this paper deals with moving object detection (MOD) for single/multiple moving objects from unmanned aerial vehicles (UAV). The proposed technique aims to overcome limitations of traditional pairwise image registrationbased MOD approaches. The first limitation relates to how potential objects are detected by discovering corresponding regions between two consecutive frames. The commonly used gray level distance-based similarity measures might not cater well for the dynamic spatio-temporal differences of the camera and moving objects. The second limitation relates to object occlusion. Traditionally, when only frame-pairs are considered, some objects might disappear between two frames. However, such objects were actually occluded and reappear in a later frame and are not detected. This work attempts to address both issues by firstly converting each frame into a graph representation with nodes being segmented superpixel regions. Through this, object detection can be treated ...

Research paper thumbnail of Flow-based Autoregressive Structured Prediction of Human Motion

A new method is proposed for human motion predition by learning temporal and spatial dependencies... more A new method is proposed for human motion predition by learning temporal and spatial dependencies in an end-to-end deep neural network. The joint connectivity is explicitly modeled using a novel autoregressive structured prediction representation based on flow-based generative models. We learn a latent space of complex body poses in consecutive frames which is conditioned on the high-dimensional structure input sequence. To construct each latent variable, the general and local smoothness of the joint positions are considered in a generative process using conditional normalizing flows. As a result, all frame-level and joint-level continuities in the sequence are preserved in the model. This enables us to parameterize the inter-frame and intra-frame relationships and joint connectivity for robust long-term predictions as well as short-term prediction. Our experiments on two challenging benchmark datasets of Human3.6M and AMASS demonstrate that our proposed method is able to effectivel...

Research paper thumbnail of One Shot Radial Distortion Correction by Direct Linear Transformation

A novel method is proposed to estimate and correct image radial distortion. The solution is based... more A novel method is proposed to estimate and correct image radial distortion. The solution is based upon an algebraic expansion of the homographic relationship between a planar pattern and its distorted projection into a single image, which is solved using the Direct Linear Transformation. The method requires ten point correspondences, is fully automatic and estimates both the first two parameters of the division model, and the center of distortion. Experimental results show the method to be more accurate than other recent one- and two-parameter point-based and plumb-line-based approaches.

Research paper thumbnail of A Framework for Multiple Moving Objects Detection in Aerial Videos

Spatial Modeling in GIS and R for Earth and Environmental Sciences, 2019

Aerial videos captured using dynamic cameras commonly require background remodeling at every fram... more Aerial videos captured using dynamic cameras commonly require background remodeling at every frame. In addition, camera motion and the movement of multiple objects present an unstable imaging environment with varying motion patterns. This makes detecting multiple moving objects a difficult task. In this chapter, a two-step framework, termed the motion differences of matched region-based features (MDMRBF), is presented. Firstly, each frame goes through super-pixel segmentation to produce regions where each frame is then represented as a region adjacency graph structure of visual appearance and geometric properties. This representation is important for correspondence discovery between consecutive frames based on multigraph matching. Ultimately, each region is labeled as either a background or foreground (object) using a proposed graph-coloring algorithm. Two datasets, namely (1) the DARPA-VIVID dataset and (2) self-captured videos using an unmanned aerial vehiclemounted camera, have been used to validate the feasibility of MDMRBF. Comparison is also done with three existing detection algorithms where experiments show promising results with precision at 94%, and recall at 89%.

Research paper thumbnail of Multiple Moving Object Detection From UAV Videos Using Trajectories of Matched Regional Adjacency Graphs

IEEE Transactions on Geoscience and Remote Sensing, 2017

Image registration has been long used as a basis for the detection of moving objects. Registratio... more Image registration has been long used as a basis for the detection of moving objects. Registration techniques attempt to discover correspondences between consecutive frame pairs based on image appearances under rigid and affine transformations. However, spatial information is often ignored, and different motions from multiple moving objects cannot be efficiently modeled. Moreover, image registration is not well suited to handle occlusion that can result in potential object misses. This paper proposes a novel approach to address these problems. First, segmented video frames from unmanned aerial vehicle captured video sequences are represented using region adjacency graphs of visual appearance and geometric properties. Correspondence matching (for visible and occluded regions) is then performed between graph sequences by using multigraph matching. After matching, region labeling is achieved by a proposed graph coloring algorithm which assigns a background or foreground label to the respective region. The intuition of the algorithm is that background scene and foreground moving objects exhibit different motion characteristics in a sequence, and hence, their spatial distances are expected to be varying with time. Experiments conducted on several DARPA VIVID video sequences as well as self-captured videos show that the proposed method is robust to unknown transformations, with significant improvements in overall precision and recall compared to existing works.

Research paper thumbnail of Texture classification and discrimination for region-based image retrieval

Journal of Visual Communication and Image Representation, 2015

In RBIR, texture features are crucial in determining the class a region belongs to, since they can overcome the limitations of color and shape features. Two robust approaches to modeling texture features are Gabor and curvelet features. Although both features are close to human visual perception, sufficient information needs to be extracted from their sub-bands for effective texture classification. Moreover, shape irregularity can be a problem, since Gabor and curvelet transforms can only be applied to regular shapes. In this paper, we propose an approach that uses both the Gabor wavelet and the curvelet transforms on the transferred regular shapes of the image regions. We also apply a fitting method to encode the sub-band information in polynomial coefficients, creating a texture feature vector with the maximum power of discrimination. Experiments on the texture classification task with the ImageCLEF and Outex databases demonstrate the effectiveness of the proposed approach.
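The sub-band-to-polynomial encoding described above might look roughly like the following sketch. The Gabor parameterisation, the 4-orientation by 3-scale filter bank, and the polynomial degree are all assumed for illustration, not taken from the paper:

```python
import numpy as np

def gabor_kernel(size, theta, lam, sigma):
    """Real part of a Gabor filter (hypothetical parameterisation)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def texture_descriptor(patch, degree=3):
    """Fit a polynomial to the Gabor sub-band energies of a patch."""
    energies = []
    for theta in np.linspace(0, np.pi, 4, endpoint=False):  # 4 orientations
        for lam in (2.0, 4.0, 8.0):                          # 3 scales
            k = gabor_kernel(7, theta, lam, sigma=2.0)
            # Valid-mode 2D correlation via sliding windows.
            resp = np.lib.stride_tricks.sliding_window_view(patch, k.shape)
            energies.append(np.mean((resp * k).sum(axis=(-2, -1)) ** 2))
    # Compress the 12 sub-band energies into a few polynomial coefficients.
    return np.polyfit(np.arange(len(energies)), np.array(energies), degree)

rng = np.random.default_rng(0)
patch = rng.standard_normal((16, 16))
desc = texture_descriptor(patch)
print(desc.shape)  # → (4,)
```

The point of the fit is dimensionality: twelve raw sub-band energies collapse into four coefficients that still capture the energy trend across scales and orientations.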

Research paper thumbnail of Scatter correction improvement based on the Convolution Subtraction Technique in SPECT imaging

ijmer.com

SPECT is a tomography technique that can reveal information about metabolic activity in the body and improve clinical diagnosis. In this paper, a convolution subtraction technique is proposed for scatter compensation in SPECT imaging. In ...
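Convolution subtraction in its textbook form estimates scatter as a scaled, blurred copy of the acquired image and subtracts it. A minimal sketch follows; the box-blur kernel and the scatter fraction k are assumptions for illustration, not the paper's calibrated values:

```python
import numpy as np

def convolution_subtraction(image, kernel, k=0.5):
    """Classic convolution-subtraction scatter compensation:
    estimate scatter as k * (image convolved with a blur kernel)
    and subtract it from the photopeak image."""
    ph, pw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel.shape)
    scatter = k * (windows * kernel).sum(axis=(-2, -1))
    # Counts cannot go negative after subtraction.
    return np.clip(image - scatter, 0.0, None)

# Example: 5x5 box blur as the (assumed) scatter kernel.
img = np.ones((8, 8)) * 100.0
kern = np.ones((5, 5)) / 25.0
out = convolution_subtraction(img, kern, k=0.3)
print(out[4, 4])  # → 70.0
```

On a uniform image the blur leaves values unchanged, so subtracting 30% of it yields 70 everywhere, which makes the scatter-fraction role of k easy to see.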

Research paper thumbnail of Vote from the Center: 6 DoF Pose Estimation in RGB-D Images by Radial Keypoint Voting

ArXiv, 2021

We propose a novel keypoint voting scheme based on intersecting spheres that is more accurate than existing schemes and allows for a smaller set of more disperse keypoints. The scheme forms the basis of the proposed RCVPose method for 6 DoF pose estimation of 3D objects in RGB-D data, which is particularly effective at handling occlusions. A CNN is trained to estimate the distance between the 3D point corresponding to the depth mode of each RGB pixel and a set of 3 disperse keypoints defined in the object frame. At inference, a sphere of radius equal to this estimated distance is generated, centered at each 3D point. The surface of these spheres votes to increment a 3D accumulator space, the peaks of which indicate keypoint locations. The proposed radial voting scheme is more accurate than previous vector or offset schemes, and robust to disperse keypoints. Experiments demonstrate RCVPose to be highly accurate and competitive, achieving state-of-the-art results on LINEMOD (99.7%),...
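The radial voting described here (each 3D point generates a sphere whose surface votes into an accumulator, and accumulator peaks recover keypoints) can be illustrated on a toy voxel grid. The grid size, shell tolerance, and ideal radius estimates below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def radial_vote(points, radii, grid_shape, tol=0.5):
    """Accumulate one spherical shell per (point, radius) pair and
    return the accumulator peak as the estimated keypoint voxel."""
    acc = np.zeros(grid_shape)
    zz, yy, xx = np.indices(grid_shape)
    for (pz, py, px), r in zip(points, radii):
        d = np.sqrt((zz - pz) ** 2 + (yy - py) ** 2 + (xx - px) ** 2)
        acc[np.abs(d - r) < tol] += 1  # vote on the shell of radius r
    return tuple(int(i) for i in np.unravel_index(np.argmax(acc), grid_shape))

keypoint = np.array([5.0, 6.0, 4.0])
rng = np.random.default_rng(1)
points = rng.uniform(0, 10, size=(40, 3))
radii = np.linalg.norm(points - keypoint, axis=1)  # ideal radius estimates
print(radial_vote(points, radii, (11, 11, 11)))  # → (5, 6, 4)
```

Because every shell passes exactly through the true keypoint, that voxel collects a vote from all 40 points, while other voxels accumulate far fewer; in the real method the radii come from the CNN rather than being exact.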

Research paper thumbnail of Multiscale Crowd Counting and Localization By Multitask Point Supervision

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We propose a multitask approach for crowd counting and person localization in a unified framework. As the detection and localization tasks are well-correlated and can be jointly tackled, our model benefits from a multitask solution by learning multiscale representations of encoded crowd images, and subsequently fusing them. In contrast to the relatively more popular density-based methods, our model uses point supervision to allow for crowd locations to be accurately identified. We test our model on two popular crowd counting datasets, ShanghaiTech A and B, and demonstrate that our method achieves strong results on both counting and localization tasks, with MSE measures of 110.7 and 15.0 for crowd counting and AP measures of 0.71 and 0.75 for localization, on ShanghaiTech A and B respectively. Our detailed ablation experiments show the impact of our multiscale approach as well as the effectiveness of the fusion.
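Point-supervised localization is typically scored by matching predicted points to ground-truth points before computing precision-style metrics. A minimal greedy matcher is sketched below; the 4-pixel distance threshold and greedy strategy are illustrative, not necessarily the paper's evaluation protocol:

```python
import numpy as np

def match_points(pred, gt, max_dist=4.0):
    """Greedily match predicted head points to ground-truth points.

    Returns (true_positives, false_positives, false_negatives).
    """
    pred, gt = list(map(tuple, pred)), list(map(tuple, gt))
    unmatched = set(range(len(gt)))
    tp = 0
    for p in pred:
        if not unmatched:
            break
        # Nearest still-unmatched ground-truth point.
        j = min(unmatched, key=lambda i: np.hypot(p[0] - gt[i][0], p[1] - gt[i][1]))
        if np.hypot(p[0] - gt[j][0], p[1] - gt[j][1]) <= max_dist:
            unmatched.remove(j)
            tp += 1
    return tp, len(pred) - tp, len(unmatched)

gt = [(10, 10), (30, 30), (50, 50)]
pred = [(11, 9), (29, 31), (80, 80)]
print(match_points(pred, gt))  # → (2, 1, 1)
```

From these counts, precision and recall (and hence AP over confidence thresholds) follow in the usual way.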

Research paper thumbnail of Oriented Bounding Boxes for Small and Freely Rotated Objects

ArXiv, 2021

A novel object detection method is presented that handles freely rotated objects of arbitrary sizes, including tiny objects as small as 2 × 2 pixels. Such tiny objects appear frequently in remotely sensed images, and present a challenge to recent object detection algorithms. More importantly, current object detection methods have been designed originally to accommodate axis-aligned bounding box detection, and therefore fail to accurately localize oriented boxes that best describe freely rotated objects. In contrast, the proposed convolutional neural network (CNN)-based approach uses potential pixel information at multiple scale levels without the need for any external resources, such as anchor boxes. The method encodes the precise location and orientation of features of the target objects at grid cell locations. Unlike existing methods that regress the bounding box location and dimension, the proposed method learns all the required information by classification, which has the added...
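Learning orientation by classification rather than regression amounts to discretising the angle into bins and predicting a bin index. A minimal encode/decode sketch follows; the 36-bin resolution over [0, π) is an assumption, not the paper's configuration:

```python
import numpy as np

def angle_to_bin(theta, n_bins=36):
    """Discretise an orientation in [0, pi) into a class index."""
    return int(np.floor((theta % np.pi) / np.pi * n_bins)) % n_bins

def bin_to_angle(idx, n_bins=36):
    """Decode a class index back to its bin-centre orientation."""
    return (idx + 0.5) * np.pi / n_bins

theta = 0.8  # radians
idx = angle_to_bin(theta)
print(idx, round(bin_to_angle(idx), 3))  # → 9 0.829
```

The decode error is bounded by half a bin width (π/72 radians here, i.e. 2.5 degrees), so bin count trades off angular precision against the number of classes the network must separate.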

Research paper thumbnail of Multistream ValidNet: Improving 6D Object Pose Estimation by Automatic Multistream Validation

ArXiv, 2021

This work presents a novel approach to improve the results of pose estimation by detecting and distinguishing between the occurrence of True and False Positive results. It achieves this by training a binary classifier on the output of an arbitrary pose estimation algorithm, and returns a binary label indicating the validity of the result. We demonstrate that our approach improves upon a state-of-the-art pose estimation result on the Sileane dataset, outperforming a variation of the alternative CullNet method by 4.15% in average class accuracy and 0.73% in overall accuracy at validation. Applying our method can also improve the pose estimation average precision results of Op-Net by 6.06% on average.

Research paper thumbnail of ValidNet: A Deep Learning Network for Validation of Surface Registration

Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2020

Research paper thumbnail of Recognition-based Segmentation in Persian Character Recognition

... Off-line recognition, on the other hand, is performed after the writing has been completed. ... In both online and offline OCR systems, there are three main approaches for automatic ... the script consists of separated words which are aligned by a horizontal virtual line called the "baseline". ...

Research paper thumbnail of Recognition-based segmentation in Persian character recognition

Proceedings of World Academy of …, 2008

Optical character recognition of cursive scripts presents a number of challenging problems in both segmentation and recognition processes in different languages, including Persian. In order to overcome these problems, we use a newly developed Persian word ...

Research paper thumbnail of Ontology-Based Semantic Image Segmentation Using Mixture Models and Multiple CRFs

IEEE Transactions on Image Processing, 2016

Semantic image segmentation is a fundamental yet challenging problem, which can be viewed as an extension of conventional object detection with close relation to image segmentation and classification. It aims to partition images into non-overlapping regions that are assigned predefined semantic labels. Most of the existing approaches utilize and integrate low-level local features and high-level contextual cues, which are fed into an inference framework such as the Conditional Random Field (CRF). However, the lack of meaning in the primitives (i.e., pixels or superpixels) and cues provides low discriminatory capabilities, since they are rarely object-consistent. Moreover, blind combinations of heterogeneous features and contextual cue exploitation through limited neighborhood relations in the CRFs tend to degrade labelling performance. This paper proposes an ontology-based semantic image segmentation approach called OBSIS that jointly models image segmentation and object detection. Specifically, a Dirichlet process mixture model transforms the low-level visual space into an intermediate semantic space, which drastically reduces feature dimensionality. These features are then individually weighed and independently learned within context, using multiple CRFs. The segmentation of images into object parts is hence reduced to a classification task where object inference is passed to an ontology model. This model resembles the way humans understand images, through the combination of different cues, context models and rule-based learning of the ontologies. Experimental evaluations using the MSRC-21 and PASCAL VOC 2010 datasets show promising results.