Julius Wang - Academia.edu

Papers by Julius Wang

Post-Attention Modulator for Dense Video Captioning

2022 26th International Conference on Pattern Recognition (ICPR)

Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

arXiv (Cornell University), Oct 24, 2022

Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, achieve performance comparable to that of some VLP models trained with aligned pairs on various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by the limited semantics of object tags. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model and does not require images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities. Empirically, WFH consistently boosts prior W-VLP works, e.g. U-VisualBERT (U-VB), over a variety of V-L tasks, such as XMR and Visual Question Answering. Notably, benchmarked with recall@{1,5,10}, it consistently improves U-VB on image-to-text and text-to-image retrieval on two popular datasets, Flickr30K and MSCOCO. Meanwhile, it gains at least 14.5% in cross-dataset generalization tests on these XMR tasks. Moreover, in the other V-L downstream tasks considered, our WFH models are on par with models trained with paired V-L data, revealing the utility of unpaired data. These results demonstrate the greater generalization of the proposed W-VLP model with WFH.
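
The abstract leaves the hallucinator's internals unspecified. Below is a minimal, illustrative sketch of one way a visual-vocabulary-based feature hallucinator could be wired, assuming a learned codebook of visual prototypes and a simple attention read-out; the module name, dimensions, and attention scheme are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FeatureHallucinator(nn.Module):
    """Toy visual-vocabulary hallucinator: text tokens attend over a learned
    codebook of visual prototypes to compose pseudo region features."""
    def __init__(self, text_dim=768, visual_dim=2048, vocab_size=1024):
        super().__init__()
        self.visual_vocab = nn.Parameter(torch.randn(vocab_size, visual_dim))
        self.query_proj = nn.Linear(text_dim, visual_dim)

    def forward(self, text_tokens):                  # (batch, seq_len, text_dim)
        queries = self.query_proj(text_tokens)       # (batch, seq_len, visual_dim)
        attn = torch.softmax(queries @ self.visual_vocab.t(), dim=-1)
        return attn @ self.visual_vocab              # hallucinated visual features

# The hallucinated features stand in for region features of a paired image, so a
# V-L transformer could see (hallucination, caption) pairs during pre-training.
tokens = torch.randn(2, 16, 768)
print(FeatureHallucinator()(tokens).shape)           # torch.Size([2, 16, 2048])
```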

CLIP4IDC: CLIP for Image Difference Captioning

arXiv (Cornell University), Jun 1, 2022

Image Difference Captioning (IDC) aims at generating sentences that describe the differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor. Accordingly, two major issues may arise: (1) a large domain gap usually exists between the datasets used for pre-training such a visual encoder and the dataset of the downstream IDC task, and (2) the visual feature extractor, when encoding the two images separately, often fails to capture the visual changes between them. Motivated by the excellent zero-shot performance of the recently proposed CLIP, we propose CLIP4IDC, which transfers a CLIP model to the IDC task to address these issues. Rather than directly fine-tuning CLIP to generate sentences, we introduce an adaptation training process that adapts CLIP's visual encoder to capture and align differences in image pairs based on the textual descriptions. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC.
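
The adaptation objective is not detailed in the abstract. A plausible retrieval-style sketch follows, assuming the two CLIP image embeddings are fused and contrastively aligned with the change-caption embedding; the fusion layer and symmetric InfoNCE loss are illustrative assumptions, not the paper's exact training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairTextAdapter(nn.Module):
    """Fuses the (before, after) image embeddings and scores them against captions."""
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # illustrative fusion of the image pair

    def forward(self, feat_before, feat_after, feat_text, temperature=0.07):
        pair = F.normalize(self.fuse(torch.cat([feat_before, feat_after], -1)), dim=-1)
        text = F.normalize(feat_text, dim=-1)
        logits = pair @ text.t() / temperature          # (batch, batch) similarities
        labels = torch.arange(logits.size(0))
        # Symmetric InfoNCE: pair-to-text and text-to-pair retrieval.
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

adapter = PairTextAdapter(dim=512)
before, after, text = (torch.randn(8, 512) for _ in range(3))
print(adapter(before, after, text))                     # scalar adaptation loss
```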

Multiple Object Tracking with Correlation Filters and Deep Features

This thesis studies the on-line multiple object tracking (MOT) problem, which arises in numerous real-world applications, such as emerging self-driving car agents or estimating a target's trajectory over time to identify its movement pattern. The challenges an on-line MOT tracker always faces are: (1) consistently and smoothly tracking the same target over time in the presence of occlusions, (2) recovering from fragmented tracks, (3) handling identity switches of the same target, and (4) operating in real time. This work aims to provide an efficient detect-and-track framework to address these challenges. To narrow down the classes of objects to be studied, without losing the tracker's extensibility to generic objects, we pick pedestrians as the primary objects of interest. The proposed framework consists of four building blocks, i.e. object detection, object tracking, data association, and object re-identification. While most MOT frameworks assume the detector is available in every frame, the proposed tracker triggers the detector only periodically, e.g. every three frames, leading to improved efficiency. As for the building blocks, detection is performed by the Single Shot Detector (SSD), which has proven efficient and effective on generic object classes. When the detector is triggered and active tracks exist, the data association module identifies the correspondence between the objects detected by the detector and those tracked by the tracker. When newly detected objects cannot be matched to any current track, the re-identification module attempts to find a correspondence for them in the track history. The experiments show that the proposed framework is outperformed by recently published on-line MOT trackers based on different object detectors. However, the results suggest that the proposed framework's performance does not degrade when the detector is only partially available, and even improves in certain conditions thanks to better temporal consistency. Based on these experiments, we identify the major shortcomings of the current framework and provide possible ways to improve it as well as directions for future work.
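
A minimal sketch of the periodic detect-and-track loop described above: the detection interval, the stub correlation-filter track, and the detect/associate/reidentify interfaces are illustrative assumptions rather than the thesis implementation.

```python
DETECTION_INTERVAL = 3   # trigger the detector only every third frame (assumed)

class CorrelationFilterTrack:
    """Stub track: a real implementation would relocate the target each frame
    with a correlation filter over deep features."""
    def __init__(self, box):
        self.box = box

    def update(self, frame):
        return self.box   # placeholder: keep the previous location

def track_sequence(frames, detect, associate, reidentify):
    active_tracks, lost_tracks = [], []
    for i, frame in enumerate(frames):
        # Tracking runs on every frame ...
        for track in active_tracks:
            track.update(frame)
        # ... while detection, association and re-identification run periodically.
        if i % DETECTION_INTERVAL == 0:
            detections = detect(frame)                        # e.g. SSD boxes
            matches, unmatched = associate(active_tracks, detections)
            for track, det in matches:
                track.box = det                               # correct tracker drift
            for det in unmatched:
                revived = reidentify(lost_tracks, det, frame)  # revive or start new
                active_tracks.append(revived or CorrelationFilterTrack(det))
    return active_tracks
```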

Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models

Predicting a scene graph that captures visual entities and their interactions in an image has been considered a crucial step towards full scene comprehension. Recent scene graph generation (SGG) models have shown their capability of capturing the most frequent relations among visual entities. However, the state-of-the-art results are still far from satisfactory, e.g. models can obtain 31% in overall recall R@100, whereas the equally important mean class-wise recall mR@100 is only around 8% on Visual Genome (VG). The discrepancy between the R and mR results urges a shift in focus from pursuing a high R to pursuing a high mR while keeping R competitive. We suspect that the observed discrepancy stems from both the annotation bias and the sparse annotations in VG, in which many visual entity pairs are either not annotated at all or annotated with only a single relation when multiple ones could be valid. To address this particular issue, we propose a novel SGG training scheme that capitalizes on self-learned kno...
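
To make the R@K versus mR@K gap concrete, here is a small, self-contained illustration with made-up predicate counts (not Visual Genome statistics): a well-recalled head class dominates overall recall, while poorly recalled tail classes pull the class-wise mean down.

```python
from collections import defaultdict

def recall_and_mean_recall(hits):
    """hits: list of (predicate_class, correct) over ground-truth relations."""
    per_class = defaultdict(lambda: [0, 0])
    total_correct = total = 0
    for cls, correct in hits:
        per_class[cls][0] += int(correct)
        per_class[cls][1] += 1
        total_correct += int(correct)
        total += 1
    overall = total_correct / total                                  # R@K analogue
    mean_classwise = sum(c / n for c, n in per_class.values()) / len(per_class)  # mR@K
    return overall, mean_classwise

# A frequent predicate ("on") recalled well dominates R, while rare predicates
# ("riding", "eating") recalled poorly drag mR down.
hits = [("on", True)] * 90 + [("on", False)] * 10 \
     + [("riding", True)] * 1 + [("riding", False)] * 9 \
     + [("eating", False)] * 10
print(recall_and_mean_recall(hits))   # approximately (0.76, 0.33)
```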

Fixation Prediction in Videos Using Unsupervised Hierarchical Features

2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017

This paper presents a framework for saliency estimation and fixation prediction in videos. The proposed framework is based on a hierarchical feature representation obtained by stacking convolutional layers of independent subspace analysis (ISA) filters. The feature learning is thus unsupervised and independent of the task. To compute saliency, we employ a multiresolution saliency architecture that exploits both local and global saliency: for a given image, an image pyramid is first built; then, at each resolution, local and global saliency measures are computed to obtain a saliency map. Integrating the saliency maps over the image pyramid yields the final video saliency. We first show that combining local and global saliency improves the results. We then compare the proposed model with several video saliency models and demonstrate that the proposed framework predicts video saliency effectively, outperforming all the other models.
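
A minimal sketch of the multiresolution local/global saliency combination on a single 2-D response map: the particular contrast and rarity measures, the Gaussian surround, and the pyramid depth are illustrative assumptions (the paper computes saliency on unsupervised ISA feature responses rather than raw pixels).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def saliency_at_scale(img):
    local_s = np.abs(img - gaussian_filter(img, sigma=8))   # center-surround contrast
    global_s = np.abs(img - img.mean())                     # global rarity
    return local_s * global_s

def video_frame_saliency(frame, levels=3):
    h, w = frame.shape
    maps = []
    for lvl in range(levels):
        img = zoom(frame, 1 / 2 ** lvl)                     # image pyramid level
        s = saliency_at_scale(img)
        maps.append(zoom(s, (h / s.shape[0], w / s.shape[1])))  # back to full size
    fused = np.mean(maps, axis=0)                           # integrate over the pyramid
    return fused / (fused.max() + 1e-8)

frame = np.random.rand(128, 160)            # stand-in for an ISA response map
print(video_frame_saliency(frame).shape)    # (128, 160)
```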

Geometry-aware Relational Exemplar Attention for Dense Captioning

1st International Workshop on Multimodal Understanding and Learning for Embodied Applications - MULEA '19, 2019

Dense captioning (DC), which provides a comprehensive context understanding of images by describing all salient visual groundings in an image, facilitates multimodal understanding and learning. As an extension of image captioning, DC is developed to discover richer sets of visual content and to generate captions of wider diversity and increased detail. State-of-the-art DC models consist of three stages: (1) region proposals, (2) region classification, and (3) caption generation for each proposal. They are typically built upon the following ideas: (a) guiding caption generation with image-level features as context cues alongside regional features, and (b) refining the locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion exploited by the region classifier that further improves both region detection and caption accuracy, and (b) a Geometry-aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions through complementary relations, i.e. contextually dependent, visually supported, and geometric relations, to enrich the context information in regional representations. We conduct an extensive set of experiments and demonstrate that our proposed model improves the state of the art by at least +5.3% in mean average precision on the Visual Genome dataset.
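
The abstract does not spell out GREatt's exact form. Below is a minimal, hedged sketch of geometry-aware attention between region proposals, where a small MLP over pairwise box geometry (relative offsets and log size ratios) biases the attention logits; all layer sizes and the geometry encoding are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

def box_geometry(boxes):
    # boxes: (N, 4) as (x, y, w, h). Returns pairwise geometry features (N, N, 4).
    x, y, w, h = boxes.unbind(-1)
    dx = (x[:, None] - x[None, :]) / w[None, :]
    dy = (y[:, None] - y[None, :]) / h[None, :]
    dw = torch.log(w[:, None] / w[None, :])
    dh = torch.log(h[:, None] / h[None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)

class GeometryAwareAttention(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.geo = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, boxes):               # feats: (N, dim), boxes: (N, 4)
        logits = self.q(feats) @ self.k(feats).t() / feats.size(-1) ** 0.5
        logits = logits + self.geo(box_geometry(boxes)).squeeze(-1)  # geometry bias
        attn = torch.softmax(logits, dim=-1)
        return feats + attn @ self.v(feats)        # geometry-enriched region features

feats, boxes = torch.randn(5, 1024), torch.rand(5, 4) + 0.1   # (x, y, w, h) > 0
print(GeometryAwareAttention()(feats, boxes).shape)           # torch.Size([5, 1024])
```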

A Multi-Task Bayesian Deep Neural Net for Detecting Life-Threatening Infant Incidents From Head Images

2019 IEEE International Conference on Image Processing (ICIP), Sep 1, 2019

Sudden infant death syndrome (SIDS) can easily strike a newborn due to many environmental factors. To help prevent such tragic incidents, we propose a multi-task deep learning framework that detects different facial traits and two life-threatening indicators, i.e. which facial parts are occluded or covered, by analyzing infant head images. Furthermore, we extend and adapt recently developed models that capture data-dependent uncertainty from noisy observations for our application. The experimental results show significant improvements on the YunInfants dataset across most tasks over models that simply adopt regular cross-entropy losses without addressing the underlying uncertainties.
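
The abstract points to models that capture data-dependent (aleatoric) uncertainty. A simplified, hedged sketch of such a loss follows: the head predicts a per-sample log-variance and the cross-entropy is averaged over noise-corrupted logits. This is an illustrative variant of that family of losses, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def heteroscedastic_ce(logits, log_var, targets, n_samples=10):
    # logits: (batch, classes), log_var: (batch, 1), targets: (batch,)
    std = torch.exp(0.5 * log_var)
    losses = []
    for _ in range(n_samples):
        noisy = logits + std * torch.randn_like(logits)   # corrupt logits by learned noise
        losses.append(F.cross_entropy(noisy, targets))
    return torch.stack(losses).mean()

logits = torch.randn(4, 3, requires_grad=True)
log_var = torch.zeros(4, 1, requires_grad=True)
targets = torch.tensor([0, 2, 1, 0])
print(heteroscedastic_ce(logits, log_var, targets))        # scalar uncertainty-aware loss
```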
