Jingen Liu | University of Central Florida
Papers by Jingen Liu
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Virtual try-on methods aim to generate images of fashion models wearing arbitrary combinations of... more Virtual try-on methods aim to generate images of fashion models wearing arbitrary combinations of garments. This is a challenging task because the generated image must appear realistic and accurately display the interaction between garments. Prior works produce images that are filled with artifacts and fail to capture important visual details necessary for commercial applications. We propose Outfit Visualization Net (OVNet) to capture these important details (e.g. buttons, shading, textures, realistic hemlines, and interactions between garments) and produce high quality multiple-garment virtual try-on images. OVNet consists of 1) a semantic layout generator and 2) an image generation pipeline using multiple coordinated warps. We train the warper to output multiple warps using a cascade loss, which refines each successive warp to focus on poorly generated regions of a previous warp and yields consistent improvements in detail. In addition, we introduce a method for matching outfits with the most suitable model and produce significant improvements for both our and other previous try-on methods. Through quantitative and qualitative analysis, we demonstrate our method generates substantially higher-quality studio images compared to prior works for multi-garment outfits. An interactive interface powered by this method has been deployed on fashion e-commerce websites and received overwhelmingly positive feedback.
2011 International Conference on Computer Vision, 2011
We propose a novel statistical manifold modeling approach that is capable of classifying poses of... more We propose a novel statistical manifold modeling approach that is capable of classifying poses of object categories from video sequences by simultaneously minimizing the intra-class variability and maximizing inter-pose distance. Following the intuition that an object part based representation and a suitable part selection process may help achieve our purpose, we formulate the part selection problem from a statistical manifold modeling perspective and treat part selection as adjusting the manifold of the object (parameterized by pose) by means of the manifold "alignment" and "expansion" operations. We show that manifold alignment and expansion are equivalent to minimizing the intra-class distance given a pose while increasing the inter-pose distance given an object instance respectively. We formulate and solve this (otherwise intractable) part selection problem as a combinatorial optimization problem using graph analysis techniques. Quantitative and qualitative experimental analysis validates our theoretical claims.
In this paper, we describe our approaches and experiments in the content-based copy detection (CBCD) and surveillance event detection pilot (SEDP) tasks of TRECVID 2008. We participated in the video-only CBCD task and four of the SEDP events. The CBCD method relies on sequences of invariant global image features and on efficient matching and ranking of those sequences. The normalized Hu moments are proven to be invariant to many transformations, as well as to a certain level of noise, and are thus the basis of our system. The most crucial property of the proposed CBCD system is that it relies on sequence matching rather than on independent frame correspondences. The experiments have shown that this approach is quite useful for matching videos under extensive and strong transformations that make single-frame matching challenging. The methodology is proven to be fast and to produce high F1 detection scores in the TRECVID 2008 task evaluation. We also submitted four individual surveillance event detection systems, for the events "Person-Runs", "Object-Put", "Opposing-Flow", and "Take-Picture". These systems rely on low-level vision properties such as optical flow and image intensity, as well as heuristics based on the given event and context.
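The sketch below shows the gist of per-frame Hu-moment descriptors and naive sequence matching; the paper's exact normalization and ranking scheme are not reproduced, and the sliding-window L2 comparison is an illustrative stand-in.

```python
# Hedged sketch: log-scaled Hu moments per frame + naive sequence matching.
import cv2
import numpy as np

def frame_descriptor(gray_frame):
    """7 log-scaled Hu moments of a grayscale frame (uint8, HxW)."""
    hu = cv2.HuMoments(cv2.moments(gray_frame)).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)   # common log normalization

def sequence_distance(query_seq, ref_seq):
    """Best L2 distance of the query sequence against all windows of the reference."""
    q = np.stack(query_seq)                      # (Tq, 7)
    r = np.stack(ref_seq)                        # (Tr, 7), Tr >= Tq
    best = np.inf
    for start in range(len(r) - len(q) + 1):
        best = min(best, np.linalg.norm(q - r[start:start + len(q)]))
    return best

# Usage with random frames standing in for decoded video.
frames = [np.random.randint(0, 255, (120, 160), np.uint8) for _ in range(30)]
desc = [frame_descriptor(f) for f in frames]
print(sequence_distance(desc[5:15], desc))       # 0.0: the query is a sub-sequence
```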
2022 26th International Conference on Pattern Recognition (ICPR)
In this study, we aim to predict the plausible future action steps given an observation of the pa... more In this study, we aim to predict the plausible future action steps given an observation of the past and study the task of instructional activity anticipation. Unlike previous anticipation tasks that aim at action label prediction, our work targets at generating natural language outputs that provide interpretable and accurate descriptions of future action steps. It is a challenging task due to the lack of semantic information extracted from the instructional videos. To overcome this challenge, we propose a novel knowledge distillation framework to exploit the related external textual knowledge to assist the visual anticipation task. However, previous knowledge distillation techniques generally transfer information within the same modality. To bridge the gap between the visual and text modalities during the distillation process, we devise a novel cross-modal contrastive distillation (CCD) scheme, which facilitates knowledge distillation between teacher and student in heterogeneous modalities with the proposed crossmodal distillation loss. We evaluate our method on the Tasty Videos dataset. CCD improves the anticipation performance of the visual-alone student model by a large margin of 40.2% relatively in BLEU4. Our approach also outperforms the stateof-the-art approaches by a large margin.
Multimedia Tools and Applications
Unsupervised domain adaptive person re-identification has received significant attention due to i... more Unsupervised domain adaptive person re-identification has received significant attention due to its high practical value. In past years, by following the clustering and finetuning paradigm, researchers propose to utilize the teacher-student framework in their methods to decrease the domain gap between different person re-identification datasets. Inspired by recent teacher-student framework based methods, which try to mimic the human learning process either by making the student directly copy behavior from the teacher or selecting reliable learning materials, we propose to conduct further exploration to imitate the human learning process from different aspects, i.e., adaptively updating learning materials, selectively imitating teacher behaviors, and analyzing learning materials structures. The explored three components, collaborate together to constitute a new method for unsupervised domain adaptive person re-identification, which is called Human Learning Imitation framework. The experimental results on three benchmark datasets demonstrate the efficacy of our proposed method.
ArXiv, 2022
While action anticipation has garnered a lot of research interest recently, most of the works foc... more While action anticipation has garnered a lot of research interest recently, most of the works focus on anticipating future action directly through observed visual cues only. In this work, we take a step back to analyze how the human capability to anticipate the future can be transferred to machine learning algorithms. To incorporate this ability in intelligent systems a question worth pondering upon is how exactly do we anticipate? Is it by anticipating future actions from past experiences? Or is it by simulating possible scenarios based on cues from the present? A recent study on human psychology [1] explains that, in anticipating an occurrence, the human brain counts on both systems. In this work, we study the impact of each system for the task of action anticipation and introduce a paradigm to integrate them in a learning framework. We believe that intelligent systems designed by leveraging the psychological anticipation models will do a more nuanced job at the task of human acti...
Proceedings of the AAAI Conference on Artificial Intelligence
We propose a new zero-shot Event-Detection method by Multi-modal Distributional Semantic embeddin... more We propose a new zero-shot Event-Detection method by Multi-modal Distributional Semantic embedding of videos. Our model embeds object and action concepts as well as other available modalities from videos into a distributional semantic space. To our knowledge, this is the first Zero-Shot event detection model that is built on top of distributional semantics and extends it in the following directions: (a) semantic embedding of multimodal information in videos (with focus on the visual modalities), (b) semantic embedding of concepts definitions, and (c) retrieve videos by free text event query (e.g., "changing a vehicle tire") based on their content. We first embed the video into the multi-modal semantic space and then measure the similarity between videos with the event query in free text form. We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state-the-art that uses big...
2020 25th International Conference on Pattern Recognition (ICPR)
The current deep learning based visual tracking approaches have been very successful by learning ... more The current deep learning based visual tracking approaches have been very successful by learning the target classification and/or estimation model from a large amount of supervised training data in offline mode. However, most of them can still fail in tracking objects due to some more challenging issues such as dense distractor objects, confusing background, motion blurs, and so on. Inspired by the human "visual tracking" capability which leverages motion cues to distinguish the target from the background, we propose a Two-Stream Residual Convolutional Network (TS-RCN) for visual tracking, which successfully exploits both appearance and motion features for model update. Our TS-RCN can be integrated with existing deep learning based visual trackers. To further improve the tracking performance, we adopt a "wider" residual network ResNeXt as its feature extraction backbone. To the best of our knowledge, TS-RCN is the first end-to-end trainable two-stream visual tracking system, which makes full use of both appearance and motion features of the target. We have extensively evaluated the TS-RCN on most widely used benchmark datasets including VOT2018, VOT2019, and GOT-10K. The experiment results have successfully demonstrated that our two-stream model can greatly outperform the appearance based tracker, and it also achieves state-of-the-art performance. The tracking system can run at up to 38.1 FPS.
arXiv: Image and Video Processing, 2019
State-of-the-art methods for retinal vessel segmentation mainly rely on manually labeled vessels ... more State-of-the-art methods for retinal vessel segmentation mainly rely on manually labeled vessels as the ground truth for supervised training. The quality of manual labels plays an essential role in the segmentation accuracy, while in practice it could vary a lot and in turn could substantially mislead the training process and limit the segmentation accuracy. This paper aims to "purify" any comprehensive training set, which consists of data annotated by various observers, via refining low-quality manual labels in the dataset. To this end, we have developed a novel label refinement method based on an iterative generative adversarial network (GAN). Our iterative GAN is trained based on a set of high-quality patches (i.e. with consistent manual labels among different observers) and low-quality patches with noisy manual vessel labels. A simple yet effective method has been designed to simulate low-quality patches with noises which conform to the distribution of real noises from...
ArXiv, 2021
Some cognitive research has discovered that humans accomplish event segmentation as a side effect... more Some cognitive research has discovered that humans accomplish event segmentation as a side effect of event anticipation. Inspired by this discovery, we propose a simple yet effective end-to-end self-supervised learning framework for event segmentation/boundary detection. Unlike the mainstream clustering-based methods, our framework exploits a transformer-based feature reconstruction scheme to detect event boundary by reconstruction errors. This is consistent with the fact that humans spot new events by leveraging the deviation between their prediction and what is actually perceived. Thanks to their heterogeneity in semantics, the frames at boundaries are difficult to be reconstructed (generally with large reconstruction errors), which is favorable for event boundary detection. Additionally, since the reconstruction occurs on the semantic feature level instead of pixel level, we develop a temporal contrastive feature embedding module to learn the semantic visual representation for fr...
2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021
The Visual Object Tracking challenge VOT2021 is the ninth annual tracker benchmarking activity or... more The Visual Object Tracking challenge VOT2021 is the ninth annual tracker benchmarking activity organized by the VOT initiative. Results of 71 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2021 challenge was composed of four sub-challenges focusing on different tracking domains: (i) VOT-ST2021 challenge focused on short-term tracking in RGB, (ii) VOT-RT2021 challenge focused on "real-time" short-term tracking in RGB, (iii) VOT-LT2021 focused on long-term tracking, namely coping with target disappearance and reappearance and (iv) VOT-RGBD2021 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2021 dataset was refreshed, while VOT-RGBD2021 introduces a training dataset and sequestered dataset for winner identification. The source code for most of the trackers, the datasets, the evaluation kit and the results along with the source code for most trackers are publicly available at the challenge website 1 .
Proceedings of the 29th ACM International Conference on Multimedia, 2021
In this workshop, we are addressing the trustworthy AI issues for Multimedia Computing. We aim to... more In this workshop, we are addressing the trustworthy AI issues for Multimedia Computing. We aim to bring together researchers in the trustworthy aspects of Multimedia Computing and facilitate discussions in injecting trusts into multimedia to develop trustworthy AI techniques that are reliable and acceptable to multimedia researchers and practitioners. Our scope is at the conjunction of multimedia, computer vision and trustworthy AI, including Explainability, Robustness and Safety, Data Privacy, Accountability and Transparency, and Fairness. Related Workshop Proceedings are available in the ACM DL at: http://dl.acm.org/citation.cfm?id=3475731 CCS CONCEPTS • Computing methodologies → Artificial intelligence; Computer vision.
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Region sampling or weighting is significantly important to the success of modern region-based obj... more Region sampling or weighting is significantly important to the success of modern region-based object detectors. Unlike some previous works, which only focus on "hard" samples when optimizing the objective function, we argue that sample weighting should be data-dependent and taskdependent. The importance of a sample for the objective function optimization is determined by its uncertainties to both object classification and bounding box regression tasks. To this end, we devise a general loss function to cover most region-based object detectors with various sampling strategies, and then based on it we propose a unified sample weighting network to predict a sample's task weights. Our framework is simple yet effective. It leverages the samples' uncertainty distributions on classification loss, regression loss, IoU, and probability score, to predict sample weights. Our approach has several advantages: (i). It jointly learns sample weights for both classification and regression tasks, which differentiates it from most previous work. (ii). It is a data-driven process, so it avoids some manual parameter tuning. (iii). It can be effortlessly plugged into most object detectors and achieves noticeable performance improvements without affecting their inference time. Our approach has been thoroughly evaluated with recent object detection frameworks and it can consistently boost the detection accuracy. Code has been made available at https://github.com/ caiqi/sample-weighting-network.
Lecture Notes in Computer Science, 2019
Automated methods for detecting pulmonary embolisms (PEs) on CT pulmonary angiography (CTPA) imag... more Automated methods for detecting pulmonary embolisms (PEs) on CT pulmonary angiography (CTPA) images are of high demand. Existing methods typically employ separate steps for PE candidate detection and false positive removal, without considering the ability of the other step. As a result, most existing methods usually suffer from a high false positive rate in order to achieve an acceptable sensitivity. This study presents an end-to-end trainable convolutional neural network (CNN) where the two steps are optimized jointly. The proposed CNN consists of three concatenated subnets: 1) a novel 3D candidate proposal network for detecting cubes containing suspected PEs, 2) a 3D spatial transformation subnet for generating fixed-sized vessel-aligned image representation for candidates, and 3) a 2D classification network which takes the three cross-sections of the transformed cubes as input and eliminates false positives. We have evaluated our approach using the 20 CTPA test dataset from the PE challenge, achieving a sensitivity of 78.9%, 80.7% and 80.7% at 2 false positives per volume at 0mm, 2mm and 5mm localization error, which is superior to the state-of-the-art methods. We have further evaluated our system on our own dataset consisting of 129 CTPA data with a total of 269 emboli. Our system achieves a sensitivity of 63.2%, 78.9% and 86.8% at 2 false positives per volume at 0mm, 2mm and 5mm localization error.
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
In this paper, we propose a Customizable Architecture Search (CAS) approach to automatically gene... more In this paper, we propose a Customizable Architecture Search (CAS) approach to automatically generate a network architecture for semantic image segmentation. The generated network consists of a sequence of stacked computation cells. A computation cell is represented as a directed acyclic graph, in which each node is a hidden representation (i.e., feature map) and each edge is associated with an operation (e.g., convolution and pooling), which transforms data to a new layer. During the training, the CAS algorithm explores the search space for an optimized computation cell to build a network. The cells of the same type share one architecture but with different weights. In real applications, however, an optimization may need to be conducted under some constraints such as GPU time and model size. To this end, a cost corresponding to the constraint will be assigned to each operation. When an operation is selected during the search, its associated cost will be added to the objective. As a result, our CAS is able to search an optimized architecture with customized constraints. The approach has been thoroughly evaluated on Cityscapes and CamVid datasets, and demonstrates superior performance over several stateof-the-art techniques. More remarkably, our CAS achieves 72.3% mIoU on the Cityscapes dataset with speed of 108 FPS on an Nvidia TitanXp GPU.
IEEE Transactions on Medical Imaging, 2020
Vascular tree disentanglement and vessel type classification are two crucial steps of the graph-b... more Vascular tree disentanglement and vessel type classification are two crucial steps of the graph-based method for retinal artery-vein (A/V) separation. Existing approaches treat them as two independent tasks and mostly rely on ad hoc rules (e.g. change of vessel directions) and hand-crafted features (e.g. color, thickness) to handle them respectively. However, we argue that the two tasks are highly correlated and should be handled jointly since knowing the A/V type can unravel those highly entangled vascular trees, which in turn helps to infer the types of connected vessels that are hard to classify based on only appearance. Therefore, designing features and models isolatedly for the two tasks often leads to a suboptimal solution of A/V separation. In view of this, this paper proposes a multi-task siamese network which aims to learn the two tasks jointly and thus yields more robust deep features for accurate A/V separation. Specifically, we first introduce Convolution Along Vessel (CAV) to extract the visual features by convolving a fundus image along vessel segments, and the geometric features by tracking the directions of blood flow in vessels. The siamese network is then trained to learn multiple tasks: i) classifying A/V types of vessel segments using visual features only, and ii) estimating the similarity of every two connected segments by comparing their visual and geometric features in order to disentangle the vasculature into individual vessel trees. Finally, the results of two tasks mutually correct each other to accomplish final A/V separation. Experimental results demonstrate that our method can achieve accuracy values of 94.7%, 96.9%, and 94.5% on three major databases (DRIVE, INSPIRE, WIDE) respectively, which outperforms recent state-of-the-arts.
2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012
Low-level appearance as well as spatio-temporal features , appropriately quantized and aggregated... more Low-level appearance as well as spatio-temporal features , appropriately quantized and aggregated into Bagof-Words (BoW) descriptors, have been shown to be effective in many detection and recognition tasks. However, their efficacy for complex event recognition in unconstrained videos have not been systematically evaluated. In this paper, we use the NIST TRECVID Multimedia Event Detection (MED11 [1]) open source dataset, containing annotated data for 15 high-level events, as the standardized test bed for evaluating the low-level features. This dataset contains a large number of user-generated video clips. We consider 7 different low-level features, both static and dynamic, using BoW descriptors within an SVM approach for event detection. We present performance results on the 15 MED11 events for each of the features as well as their combinations using a number of early and late fusion strategies and discuss their strengths and limitations.
2006 IEEE International Conference on Multimedia and Expo, 2006
In this paper, we present an integrated system for news video retrieval. The proposed system inco... more In this paper, we present an integrated system for news video retrieval. The proposed system incorporates both speech and visual information in the search mechanisms. The initial search is based on the automatic speech recognition (ASR) transcript of video. Based on the relevant shots selected from the initial search round, keyword histograms are automatically generated for the refinement of the search query, such that the reformulated query fits better to the target topic. We have also developed an image-based refinement module, which uses the region analysis of the video key-frames. SR-tree like indexing structure is constructed for the region features, and the imageto-image similarity is computed using the Earth Mover's Distance. By performing a series of relevance feedback processes, the set of the true relevant shots is expanded significantly. The proposed system has been applied to a large open-benchmark news video dataset, and very satisfactory improvements have been obtained by applying the proposed automatic query expansion and the region-based refinement.
2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009
In this paper, we propose a novel approach for learning generic visual vocabulary. We use diffusi... more In this paper, we propose a novel approach for learning generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by the vector of pointwise mutual information (PMI). In this midlevel feature space, we believe the features produced by similar sources must lie on a certain manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using diffusion distance. The underlying idea is to embed the midlevel features into a semantic lower-dimensional space. Our goal is to construct a compact yet discriminative semantic visual vocabulary. Although the conventional approach using k-means is good for vocabulary construction, its performance is sensitive to the size of the visual vocabulary. In addition, the learnt visual words are not semantically meaningful since the clustering criterion is based on appearance similarity only. Our proposed approach can effectively overcome these problems by capturing the semantic and geometric relations of the feature space using diffusion maps. Unlike some of the supervised vocabulary construction approaches, and the unsupervised methods such as pLSA and LDA, diffusion maps can capture the local intrinsic geometric relations between the midlevel feature points on the manifold. We have tested our approach on the KTH action dataset, our own YouTube action dataset and the fifteen scene dataset, and have obtained very promising results. * In this online version of our CVPR 2009 paper we have updated section 2 in order to clarify the background material.
Intelligent Computing: Theory and Applications IV, 2006
Content-based video retrieval (CBVR) problems have gained significant importance in today's intel... more Content-based video retrieval (CBVR) problems have gained significant importance in today's intelligence world demanding further insight. Compared to the traditional video indexing systems, CBVR systems do not require the intensive human effort in the semantic annotation. In this paper, we propose the PEGASUS system. PEGASUS is an integrated news video search system containing two utilities: a fast multi-modality indexing system, and an interactive framework for the search on semantic topics. The indexing system is constructed based on the features from both the visual and speech portions of the videos. In the retrieval phase, the user submits a query generated from the desired semantic topic. The initial return by the system is based on the Automatic Speech Recognition information search.The results are then refined by performing a series of relevance feedback processes using other features, such as the optical character recognition (OCR) output, and global color statistics of the key-frames. The advantages of the PEGASUS system are that the queries are better formulated by key word histograms and the relevant result sets can be expanded using content analysis. We have participated in the TREC Video Retrieval Evaluation (TRECVID) forum, which has been organized by the U.S. National Institute of Standards and Technologies (NIST). Semantic topics have been tested on the PEGASUS system, and very satisfactory results were obtained.