Shujon Naha - Academia.edu
Papers by Shujon Naha
arXiv (Cornell University), Nov 10, 2023
arXiv (Cornell University), Mar 31, 2020
Hand-object pose estimation (HOPE) aims to jointly detect the poses of both a hand and a held object. In this paper, we propose a lightweight model called HOPE-Net which jointly estimates hand and object pose in 2D and 3D in real time. Our network uses a cascade of two adaptive graph convolutional neural networks, one to estimate 2D coordinates of the hand joints and object corners, followed by another to convert 2D coordinates to 3D. Our experiments show that through end-to-end training of the full network, we achieve better accuracy for both the 2D and 3D coordinate estimation problems. The proposed 2D to 3D graph convolution-based model could be applied to other 3D landmark detection problems, where it is possible to first predict the 2D keypoints and then transform them to 3D.
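As a rough illustration of the kind of operation a graph convolutional layer performs on keypoints, the sketch below mixes per-node features along graph edges and then applies a shared linear map. It is a minimal sketch and not HOPE-Net's implementation: the function names, the three-node chain graph, the row normalization, and the weight values are all assumptions made for this example, and the adjacency matrix is fixed here rather than learned.

```python
# Hypothetical sketch of one graph-convolution step, Y = A_norm @ X @ W.
# Names, sizes, and the normalization are assumptions, not HOPE-Net's code.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_normalize(adj):
    """Scale each row to sum to 1 so a node averages over its neighbours."""
    return [[v / sum(row) for v in row] for row in adj]

def graph_conv(features, adj, weight):
    """Mix features along graph edges, then apply a shared linear map."""
    return matmul(matmul(row_normalize(adj), features), weight)

# Three keypoints in a chain (0-1-2) with self-loops; each has a 2D coordinate.
adjacency = [[1.0, 1.0, 0.0],
             [1.0, 1.0, 1.0],
             [0.0, 1.0, 1.0]]
keypoints_2d = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]

# A 2x3 weight lifts each 2D feature to 3 output channels (placeholder values).
weight = [[1.0, 0.0, 0.5],
          [0.0, 1.0, 0.5]]

keypoints_3d = graph_conv(keypoints_2d, adjacency, weight)
print(keypoints_3d)
```

In a trained network the weight matrix (and, in an adaptive variant, the adjacency matrix) would be learned parameters, and several such layers would be stacked with nonlinearities between them.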
2016 23rd International Conference on Pattern Recognition (ICPR), 2016
We consider the problem of joint modeling of videos and their corresponding textual descriptions (e.g. sentences or phrases). Our approach consists of three components: the video representation, the textual representation, and a joint model that links videos and text. Our video representation uses a state-of-the-art deep 3D ConvNet to capture the semantic information in the video. Our textual representation uses recent advances in learning word and sentence vectors from large text corpora. The joint model is learned to score the correct (video, text) pairs higher than the incorrect ones. We demonstrate our approach in several applications: 1) retrieving sentences given a video; 2) retrieving videos given a sentence; 3) zero-shot action recognition in videos.
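One common way to train a model so that correct (video, text) pairs score higher than incorrect ones is a hinge-style ranking loss. The sketch below illustrates that idea only. The dot-product scoring function, the margin of 1.0, and the toy two-dimensional vectors are assumptions for this example; the abstract does not specify the paper's actual objective.

```python
# Hypothetical sketch: hinge-style ranking loss over (video, text) pairs.
# Dot-product scoring and the margin value are assumptions for illustration.

def score(video_vec, text_vec):
    """Compatibility score between a video embedding and a text embedding."""
    return sum(v * t for v, t in zip(video_vec, text_vec))

def ranking_loss(videos, texts, margin=1.0):
    """videos[i] matches texts[i]; every other text counts as a negative."""
    total = 0.0
    for i, video in enumerate(videos):
        positive = score(video, texts[i])
        for j, text in enumerate(texts):
            if j != i:
                # Penalize negatives that come within `margin` of the positive.
                total += max(0.0, margin - positive + score(video, text))
    return total

videos = [[1.0, 0.0], [0.0, 1.0]]
well_aligned = [[2.0, 0.0], [0.0, 2.0]]  # each text lines up with its video
misaligned = [[0.0, 2.0], [2.0, 0.0]]    # texts swapped between the videos

print(ranking_loss(videos, well_aligned))  # 0.0
print(ranking_loss(videos, misaligned))    # 6.0
```

The same score function can rank candidates at test time: sorting sentences by their score against a video gives sentence retrieval, and sorting videos against a sentence gives video retrieval.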
2015 12th Conference on Computer and Robot Vision, 2015
We consider the problem of zero-shot recognition of object categories from images. Given a set of object categories (called "known classes") with training images, our goal is to learn a system to recognize another non-overlapping set of object categories (called "unknown classes") for which there are no training images. Our proposed approach exploits the recent work in natural language processing which has produced vector representations of words. Using the vector representations of object classes, we develop a method for transferring the appearance models from known object classes to unknown object classes. Our experimental results on three benchmark datasets show that our proposed method outperforms other competing approaches.
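To make the idea of transferring appearance models through word vectors concrete, the sketch below builds a classifier for an unknown class as a similarity-weighted average of the known classes' classifier weights. This is one simple scheme chosen for illustration, not necessarily the method of the paper: the cosine weighting, the function names, and all vectors and class names are hypothetical.

```python
import math

# Hypothetical sketch: build a classifier for an unseen class by averaging
# known-class classifiers, weighted by word-vector similarity. The weighting
# scheme and all values are assumptions for illustration.

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def transfer_model(unknown_vec, known_vecs, known_models):
    """Return appearance-model weights for the unknown class."""
    sims = {name: max(0.0, cosine(unknown_vec, vec))
            for name, vec in known_vecs.items()}
    total = sum(sims.values())
    if total == 0.0:
        raise ValueError("unknown class is not similar to any known class")
    dim = len(next(iter(known_models.values())))
    return [sum(sims[name] * known_models[name][d] for name in known_models) / total
            for d in range(dim)]

# Toy word vectors and per-class linear classifier weights (made-up values).
known_vecs = {"horse": [1.0, 0.0], "car": [0.0, 1.0]}
known_models = {"horse": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}

zebra_model = transfer_model([3.0, 1.0], known_vecs, known_models)
print(zebra_model)
```

Because the toy "zebra" vector lies closer to "horse" than to "car", the resulting weights lean towards the horse classifier.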
2014 17th International Conference on Computer and Information Technology (ICCIT), 2014
2011 IEEE Symposium on Computers & Informatics, 2011
The rapid growth of e-learning technologies suggests that the future education system will depend largely on electronic devices and computer-aided technologies. Computer-aided teaching techniques have already been shown to be much more effective for children than traditional teaching in most cases. Many software systems have been designed to assist teachers in the classroom in teaching and evaluating students. Although those systems are good enough for a class of neurotypical children, they very often fail to address the special needs of autistic children. Hence, autistic children face various challenges in participating with neurotypical children in the same classroom. We address this problem by designing and implementing intelligent classroom software, named "A-Class", which accommodates the diversity of tastes among the autistic children in a classroom and helps the teacher teach a class attended by both autistic and neurotypical children. In this paper we discuss the idea, design and implementation of A-Class, based on our five months of intervention with autistic children at the Autism Welfare Foundation (AWF) in Dhaka.
htibd.org
Abstract: The rapid growth of e-learning technologies suggests that the future education system will depend largely on electronic devices and computer-aided technologies. Computer-aided teaching techniques have already been shown to be much more effective for children than traditional teaching in most cases. Many software systems have been designed to assist teachers in the classroom in teaching and evaluating students. Although those software systems are good ...
2016 23rd International Conference on Pattern Recognition (ICPR), 2016
We consider the problem of object figure-ground segmentation when the object categories are not available during training (i.e. zero-shot). During training, we learn standard segmentation models for a handful of object categories (called "source objects") using existing semantic segmentation datasets. During testing, we are given images of objects (called "target objects") that are unseen during training. Our goal is to segment the target objects from the background. Our method learns to transfer the knowledge from the source objects to the target objects. Our experimental results demonstrate the effectiveness of our approach.
Numerous studies in cognitive development have provided converging evidence that Joint Attention (JA) is crucial for children to learn about the world together with their parents. However, a closer look reveals that, in the literature, JA has been operationally defined in different ways. For example, some definitions require explicit signals of “awareness” of being in JA, such as gaze following, while others simply define JA as shared gaze to an object or activity. But what if “awareness” is possible without gaze following? The present study examines egocentric images collected via head-mounted eye-trackers during parent-child toy play. A Convolutional Neural Network model was used to process and learn to classify raw egocentric images as JA vs not JA. We demonstrate that individual child and parent egocentric views can be classified as being part of a JA bout at above-chance levels. This provides new evidence that an individual can be “aware” they are in JA based solely on the in-the-mome...
ArXiv, 2017
We consider the problem of semantic image segmentation using deep convolutional neural networks. We propose a novel network architecture called the label refinement network that predicts segmentation labels in a coarse-to-fine fashion at several resolutions. The segmentation labels at a coarse resolution are used together with convolutional features to obtain finer resolution segmentation labels. We define loss functions at several stages in the network to provide supervision at different stages. Our experimental results on several standard datasets demonstrate that the proposed model provides an effective way of producing pixel-wise dense image labeling.
People have foveated vision and thus are generally able to attend to just a single object within their field of view at a time. Our goal is to learn a model that can automatically identify which object is being attended, given a person’s field of view captured by a first-person camera. This problem is different from traditional salient object detection because our goal is not to identify all of the salient objects in the scene, but to identify the single object to which the camera wearer is attending. We present a model that learns based on very weak supervision, with just annotations of the label of the class that is attended in each frame, without bounding boxes or other spatial location information. We show that by learning disentangled representations for localization and classification, our model can effectively localize novel attended objects that were never seen during training. We propose a multi-stage knowledge distillation strategy to train our generalized localizer model....
In this thesis we discuss different aspects of zero-shot learning and propose solutions for three challenging visual recognition problems: 1) unknown object recognition from images, 2) novel action recognition from videos, and 3) unseen object segmentation. In all three problems, we have two different sets of classes: the “known classes”, which are used in the training phase, and the “unknown classes”, for which there are no training instances. Our proposed approach exploits the available semantic relationships between known and unknown object classes and uses them to transfer the appearance models from known object classes to unknown object classes in order to recognize unknown objects. We also propose an approach to recognize novel actions from videos by learning a joint model that links videos and text. Finally, we present a ranking-based approach for zero-shot object segmentation. We represent each unknown object class as a semantic ranking of all the known classes and use this semanti...
While object part segmentation is useful for many applications, typical approaches require a large amount of labeled data to train a model for good performance. To reduce the labeling effort, weak supervision cues such as object keypoints have been used to generate pseudo-part annotations which can subsequently be used to train larger models. However, previous weakly-supervised part segmentation methods require the same object classes during both training and testing. We propose a new model that uses keypoint guidance to segment parts of novel object classes, given that they have structures similar to those of seen objects (different types of four-legged animals, for example). We show that a non-parametric template matching approach is more effective than pixel classification for part segmentation, especially for small or less frequent parts. To evaluate the generalizability of our approach, we introduce two new datasets that contain 200 quadrupeds in total with both key-point and part segm...
ArXiv, 2018
Effective integration of local and global contextual information is crucial for semantic segmentation and dense image labeling. We develop two encoder-decoder based deep learning architectures to address this problem. We first propose a network architecture called Label Refinement Network (LRN) that predicts segmentation labels in a coarse-to-fine fashion at several spatial resolutions. In this network, we also define loss functions at several stages to provide supervision at different stages of training. However, there are limits to the quality of refinement possible if ambiguous information is passed forward. To address this limitation, we also propose the Gated Feedback Refinement Network (G-FRNet). Initially, G-FRNet makes a coarse-grained prediction which it progressively refines to recover details by effectively integrating local and global contextual information during the refinement stages. This is achieved by gate units proposed in this work, ...
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
2014 17th International Conference on Computer and Information Technology (ICCIT), 2014