Changxing Ding | University of Technology Sydney (original) (raw)

Papers by Changxing Ding

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Scene graph generation (SGG) aims to detect objects and predict their pairwise relationships with... more Scene graph generation (SGG) aims to detect objects and predict their pairwise relationships within an image. Current SGG methods typically utilize graph neural networks (GNNs) to acquire context information between objects/relationships. Despite their effectiveness, however, current SGG methods only assume scene graph homophily while ignoring heterophily. Accordingly, in this paper, we propose a novel Heterophily Learning Network (HL-Net) to comprehensively explore the homophily and heterophily between objects/relationships in scene graphs. More specifically, HL-Net comprises the following 1) an adaptive reweighting transformer module, which adaptively integrates the information from different layers to exploit both the heterophily and homophily in objects; 2) a relationship feature propagation module that efficiently explores the connections between relationships by considering heterophily in order to refine the relationship representation; 3) a heterophily-aware message-passing scheme to further distinguish the heterophily and homophily between objects/relationships, thereby facilitating improved message passing in graphs. We conducted extensive experiments on two public datasets: Visual Genome (VG) and Open Images (OI). The experimental results demonstrate the superiority of our proposed HL-Net over existing state-of-the-art approaches. In more detail, HL-Net outperforms the secondbest competitors by 2.1% on the VG dataset for scene graph classification and 1.2% on the IO dataset for the final score.

2022 IEEE International Conference on Multimedia and Expo (ICME)

Proceedings of the 30th ACM International Conference on Multimedia

Text-to-image person re-identification (ReID) aims to search for pedestrian images of an interest... more Text-to-image person re-identification (ReID) aims to search for pedestrian images of an interested identity via textual descriptions. It is challenging due to both rich intra-modal variations and significant intermodal gaps. Existing works usually ignore the difference in feature granularity between the two modalities, i.e., the visual features are usually fine-grained while textual features are coarse, which is mainly responsible for the large intermodal gaps. In this paper, we propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR. LGUR framework contains two modules: a Dictionary-based Granularity Alignment (DGA) module and a Prototype-based Granularity Unification (PGU) module. In DGA, in order to align the granularities of two modalities, we introduce a Multi-modality Shared Dictionary (MSD) to reconstruct both visual and textual features. Besides, DGA has two important factors, i.e., the cross-modality guidance and the foreground-centric reconstruction, to facilitate the optimization of MSD. In PGU, we adopt a set of shared and learnable prototypes as the queries to extract diverse and semantically aligned features for both modalities in the granularity-unified feature space, which further promotes the ReID performance. Comprehensive experiments show that our LGUR consistently outperforms state-of-the-arts by large margins on both CUHK-PEDES and ICFG-PEDES datasets. Code will be released at https://github.com/ZhiyinShao-H/LGUR. CCS CONCEPTS • Information systems → Top-k retrieval in databases.

Object point cloud classification has drawn great research attention since the release of benchma... more Object point cloud classification has drawn great research attention since the release of benchmarking datasets, such as the ModelNet and the ShapeNet. These benchmarks assume point clouds covering complete surfaces of object instances, for which plenty of high-performing methods have been developed. However, their settings deviate from those often met in practice, where, due to (self-)occlusion, a point cloud covering partial surface of an object is captured from an arbitrary view. We show in this paper that performance of existing point cloud classifiers drops drastically under the considered single-view, partial setting; the phenomenon is consistent with the observation that semantic category of a partial object surface is less ambiguous only when its distribution on the whole surface is clearly specified. To this end, we argue for a single-view, partial setting where supervised learning of object pose estimation should be accompanied with classification. Technically, we propose ...

Advanced face swapping methods have achieved appealing results. However, most of these methods ha... more Advanced face swapping methods have achieved appealing results. However, most of these methods have many parameters and computations, which makes it challenging to apply them in real-time applications or deploy them on edge devices like mobile phones. In this work, we propose a lightweight Identity-aware Dynamic Network (IDN) for subject-agnostic face swapping by dynamically adjusting the model parameters according to the identity information. In particular, we design an efficient Identity Injection Module (IIM) by introducing two dynamic neural network techniques, including the weights prediction and weights modulation. Once the IDN is updated, it can be applied to swap faces given any target image or video. The presented IDN contains only 0.50M parameters and needs 0.33G FLOPs per frame, making it capable for real-time video face swapping on mobile phones. In addition, we introduce a knowledge distillation-based method for stable training, and a loss reweighting module is employed...

IEEE Transactions on Multimedia, 2022

Unsupervised Domain Adaptive (UDA) object re-identification (Re-ID) aims at adapting a model trai... more Unsupervised Domain Adaptive (UDA) object re-identification (Re-ID) aims at adapting a model trained on a labeled source domain to an unlabeled target domain. State-of-the-art object Re-ID approaches adopt clustering algorithms to generate pseudo-labels for the unlabeled target domain. However, the inevitable label noise caused by the clustering procedure significantly degrades the discriminative power of Re-ID model. To address this problem, we propose an uncertainty-aware clustering framework (UCF) for UDA tasks. First, a novel hierarchical clustering scheme is proposed to promote clustering quality. Second, an uncertainty-aware collaborative instance selection method is introduced to select images with reliable labels for model training. Combining both techniques effectively reduces the impact of noisy labels. In addition, we introduce a strong baseline that features a compact contrastive loss. Our UCF method consistently achieves state-of-the-art performance in multiple UDA tasks for object Re-ID, and significantly reduces the gap between unsupervised and supervised Re-ID performance. In particular, the performance of our unsupervised UCF method in the MSMT17→Market1501 task is better than that of the fully supervised setting on Market1501. The code of UCF is available at https://github.com/Wang-pengfei/UCF.

Video surveillance-oriented biometrics is a very challenging task and has tremendous significance... more Video surveillance-oriented biometrics is a very challenging task and has tremendous significance to the security of public places. With the growing threat of crime and terrorism to public security, it is becoming more and more critical to develop and deploy reliable biometric techniques for video surveillance applications. Traditionally, it has been regarded as a very difficult problem. The low-quality of video frames and the rich intra-personal appearance variations impose significant challenge to previous biometric techniques, making them impractical to real-world video surveillance applications. Fortunately, recent advances in computer vision and machine learning algorithms as well as imaging hardware provide new inspirations and possibilities. In particular, the development of deep learning and the availability of big data open up great potential. Therefore, it is the time that this problem be re-evaluated. This special issue will provide a platform for researchers to exchange their innovative ideas and attractive improvements on video surveillance-oriented biometrics.

ArXiv, 2020

A point cloud is a popular shape representation adopted in 3D object classification, which covers... more A point cloud is a popular shape representation adopted in 3D object classification, which covers the whole surface of an object and is usually well aligned. However, such an assumption can be invalid in practice, as point clouds collected in real-world scenarios are typically scanned from visible object parts observed under arbitrary SO(3) viewpoint, which are thus incomplete due to self and inter-object occlusion. In light of this, this paper introduces a practical setting to classify partial point clouds of object instances under any poses. Compared to the classification of complete object point clouds, such a problem is made more challenging in view of geometric similarities of local shape across object classes and intra-class dissimilarities of geometries restricted by their observation view. We consider that specifying the location of partial point clouds on their object surface is essential to alleviate suffering from the aforementioned challenges, which can be solved via an ...

ArXiv, 2021

Text-to-image person re-identification (ReID) aims to search for images containing a person of in... more Text-to-image person re-identification (ReID) aims to search for images containing a person of interest using textual descriptions. However, due to the significant modality gap and the large intra-class variance in textual descriptions, text-to-image ReID remains a challenging problem. Accordingly, in this paper, we propose a Semantically Self-Aligned Network (SSAN) to handle the above problems. First, we propose a novel method that automatically extracts semantically aligned part-level features from the two modalities. Second, we design a multi-view nonlocal network that captures the relationships between body parts, thereby establishing better correspondences between body parts and noun phrases. Third, we introduce a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features. Finally, to expedite future research in text-to-image ReID,...

2021 IEEE International Joint Conference on Biometrics (IJCB), 2021

The past few years have witnessed great progress in the domain of face recognition thanks to adva... more The past few years have witnessed great progress in the domain of face recognition thanks to advances in deep learning. However, cross pose face recognition remains a significant challenge. It is difficult for many deep learning algorithms to narrow the performance gap caused by pose variations; the main reasons for this relate to the intra-class discrepancy between face images in different poses and the pose imbalances of training datasets. Learning pose-robust features by traversing to the feature space of frontal faces provides an effective and cheap way to alleviate this problem. In this paper, we present a method for progressively transforming profile face representations to the canonical pose with an attentive pair-wise loss. First, to reduce the difficulty of directly transforming the profile face features into a frontal one, we propose to learn the feature residual between the source pose and its nearby pose in a blockby-block fashion, and thus traversing to the feature space of a smaller pose by adding the learned residual. Second, we propose an attentive pair-wise loss to guide the feature transformation progressing in the most effective direction. Finally, our proposed progressive module and attentive pair-wise loss are lightweight and easy to implement, adding only about 7.5% extra parameters. Evaluations on the CFP and CPLFW datasets demonstrate the superiority of our proposed method. Code is available at https://github.com/hjy1312/AGPM.

Nucleus segmentation is a challenging task due to the crowded distribution and blurry boundaries ... more Nucleus segmentation is a challenging task due to the crowded distribution and blurry boundaries of nuclei. Recent approaches represent nuclei by means of polygons to differentiate between touching and overlapping nuclei and have accordingly achieved promising performance. Each polygon is represented by a set of centroid-toboundary distances, which are in turn predicted by features of the centroid pixel for a single nucleus. However, using the centroid pixel alone does not provide sufficient contextual information for robust prediction. To handle this problem, we propose a Context-aware Polygon Proposal Network (CPP-Net) for nucleus segmentation. First, we sample a point set rather than one single pixel within each cell for distance prediction. This strategy substantially enhances contextual information and thereby improves the robustness of the prediction. Second, we propose a Confidencebased Weighting Module, which adaptively fuses the predictions from the sampled point set. Third...

Face recognition is one of the most important and promising biometric techniques. In face recogni... more Face recognition is one of the most important and promising biometric techniques. In face recognition, a similarity score is automatically calculated between face images to further decide their identity. Due to its non-invasive characteristics and ease of use, it has shown great potential in many real-world applications, e.g., video surveillance, access control systems, forensics and security, and social networks. This thesis addresses key challenges inherent in real-world face recognition systems including pose and illumination variations, occlusion, and image blur. To tackle these challenges, a series of robust face recognition algorithms are proposed. These can be summarized as follows: In Chapter 2, we present a novel, manually designed face image descriptor named “Dual-Cross Patterns” (DCP). DCP efficiently encodes the seconder-order statistics of facial textures in the most informative directions within a face image. It proves to be more descriptive and discriminative than pre...

IEEE Transactions on Image Processing, 2021

Existing part-aware person re-identification methods typically employ two separate steps: namely,... more Existing part-aware person re-identification methods typically employ two separate steps: namely, body part detection and part-level feature extraction. However, part detection introduces an additional computational cost and is inherently challenging for low-quality images. Accordingly, in this work, we propose a simple framework named Batch Coherence-Driven Network (BCD-Net) that bypasses body part detection during both the training and testing phases while still learning semantically aligned part features. Our key observation is that the statistics in a batch of images are stable, and therefore that batch-level constraints are robust. First, we introduce a batch coherence-guided channel attention (BCCA) module that highlights the relevant channels for each respective part from the output of a deep backbone model. We investigate channelpart correspondence using a batch of training images, then impose a novel batch-level supervision signal that helps BCCA to identify part-relevant channels. Second, the mean position of a body part is robust and consequently coherent between batches throughout the training process. Accordingly, we introduce a pair of regularization terms based on the semantic consistency between batches. The first term regularizes the high responses of BCD-Net for each part on one batch in order to constrain it within a predefined area, while the second encourages the aggregate of BCD-Net's responses for all parts covering the entire human body. The above constraints guide BCD-Net to learn diverse, complementary, and semantically aligned part-level features. Extensive experimental results demonstrate that BCD-Net consistently achieves state-of-the-art performance on four large-scale ReID benchmarks.

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Modern human-object interaction (HOI) detection approaches can be divided into one-stage methods ... more Modern human-object interaction (HOI) detection approaches can be divided into one-stage methods and twostage ones. One-stage models are more efficient due to their straightforward architectures, but the two-stage models are still advantageous in accuracy. Existing one-stage models usually begin by detecting predefined interaction areas or points, and then attend to these areas only for interaction prediction; therefore, they lack reasoning steps that dynamically search for discriminative cues. In this paper, we propose a novel one-stage method, namely Glance and Gaze Network (GGNet), which adaptively models a set of actionaware points (ActPoints) via glance and gaze steps. The glance step quickly determines whether each pixel in the feature maps is an interaction point. The gaze step leverages feature maps produced by the glance step to adaptively infer ActPoints around each pixel in a progressive manner. Features of the refined ActPoints are aggregated for interaction prediction. Moreover, we design an actionaware approach that effectively matches each detected interaction with its associated human-object pair, along with a novel hard negative attentive loss to improve the optimization of GGNet. All the above operations are conducted simultaneously and efficiently for all pixels in the feature maps. Finally, GGNet outperforms state-of-the-art methods by significant margins on both V-COCO and HICO-DET benchmarks.

International Journal of Computer Vision, 2021

Human-Object Interaction (HOI) detection is important to human-centric scene understanding tasks.... more Human-Object Interaction (HOI) detection is important to human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics in different HOI categories, an approach that ignores the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we refine features for HOI detection to be polysemyaware through the use of two novel modules: namely, Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights important elements in human and object appearance features for each HOI category to be identified; moreover, LPFA augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module (PAMF), which guides PD-Net to make decisions based on feature types deemed more important according to the language priors. Third, we propose to relieve the verb polysemy problem through sharing verb classifiers for semantically similar HOI categories. Furthermore, to ex

IEEE Access, 2020

Automatic facial age estimation can be used in a wide range of real-world applications. However, ... more Automatic facial age estimation can be used in a wide range of real-world applications. However, this process is challenging due to the randomness and slowness of the aging process. Accordingly, in this paper, we propose a novel method aimed at overcoming the challenges associated with facial age estimation. First, we propose a novel age encoding method, referred to as 'Soft-ranking', which encodes two important properties of facial age, i.e., the ordinal property and the correlation between adjacent ages. Therefore, Soft-ranking provides a richer supervision signal for training deep models. Moreover, we carefully analyze existing evaluation protocols for age estimation, finding that the overlap in identity between the training and testing sets affects the relative performance of different age encoding methods. Moreover, we achieve state-of-the-art performance on four most popular age databases, i.e., Morph II, AgeDB, CLAP2015, and CLAP2016. INDEX TERMS Age estimation, age encoding, facial attribute analysis.

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

High-fidelity face completion is a challenging task due to the rich and subtle facial textures in... more High-fidelity face completion is a challenging task due to the rich and subtle facial textures involved. What makes it more complicated is the correlations between different facial components, for example, the symmetry in texture and structure between both eyes. While recent works adopted the attention mechanism to learn the contextual relations among elements of the face, they have largely overlooked the disastrous impacts of inaccurate attention scores; in addition, they fail to pay sufficient attention to key facial components, the completion results of which largely determine the authenticity of a face image. Accordingly, in this paper, we design a comprehensive framework for face completion based on the U-Net structure. Specifically, we propose a dual spatial attention module to efficiently learn the correlations between facial textures at multiple scales; moreover, we provide an oracle supervision signal to the attention module to ensure that the obtained attention scores are reasonable. Furthermore, we take the location of the facial components as prior knowledge and impose a multidiscriminator on these regions, with which the fidelity of facial components is significantly promoted. Extensive experiments on two high-resolution face datasets including CelebA-HQ and Flickr-Faces-HQ demonstrate that the proposed approach outperforms state-of-the-art methods by large margins.

Computer Vision – ECCV 2018, 2018

Triplet loss, popular for metric learning, has made a great success in many computer vision tasks... more Triplet loss, popular for metric learning, has made a great success in many computer vision tasks, such as fine-grained image classification, image retrieval, and face recognition. Considering that the number of triplets grows cubically with the size of training data, triplet selection is thus indispensable for efficiently training with triplet loss. However, in practice, the training is usually very sensitive to the selection of triplets, e.g., it almost does not converge with randomly selected triplets and selecting the hardest triplets also leads to bad local minima. We argue that the bias in the selection of triplets degrades the performance of learning with triplet loss. In this paper, we propose a new variant of triplet loss, which tries to reduce the bias in triplet selection by adaptively correcting the distribution shift on the selected triplets. We refer to this new triplet loss as adapted triplet loss. We conduct a number of experiments on MNIST and Fashion-MNIST for image classification, and on CARS196, CUB200-2011, and Stanford Online Products for image retrieval. The experimental results demonstrate the effectiveness of the proposed method.

Computer Vision – ECCV 2020, 2020

Human-Object Interaction (HOI) detection is important in human-centric scene understanding. Exist... more Human-Object Interaction (HOI) detection is important in human-centric scene understanding. Existing works typically assume that the same verb in different HOI categories has similar visual characteristics, while ignoring the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net), which decodes the visual polysemy of verbs for HOI detection in three ways. First, PD-Net augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce the intra-class variation of the same verb. Second, we introduce a novel Polysemy Attention Module (PAM) that guides PD-Net to make decisions based on more important feature types according to the language priors. Finally, the above two strategies are applied to two types of classifiers for verb recognition, i.e., object-shared and object-specific verb classifiers, whose combination further relieves the verb polysemy problem. By deciphering the visual polysemy of verbs, we achieve the best performance on both HICO-DET and V-COCO datasets. In particular, PD-Net outperforms state-of-the-art approaches by 3.81% mAP in the Known-Object evaluation mode of HICO-DET.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

Part-level representations are important for robust person re-identification (ReID), but in pract... more Part-level representations are important for robust person re-identification (ReID), but in practice feature quality suffers due to the body part misalignment problem. In this paper, we present a robust, compact, and easy-to-use method called the Multi-task Part-aware Network (MPN), which is designed to extract semantically aligned part-level features from pedestrian images. MPN solves the body part misalignment problem via multi-task learning (MTL) in the training stage. More specifically, it builds one main task (MT) and one auxiliary task (AT) for each body part on the top of the same backbone model. The ATs are equipped with a coarse prior of the body part locations for training images. ATs then transfer the concept of the body parts to the MTs via optimizing the MT parameters to identify part-relevant channels from the backbone model. Concept transfer is accomplished by means of two novel alignment strategies: namely, parameter space alignment via hard parameter sharing and feature space alignment in a class-wise manner. With the aid of the learned high-quality parameters, MTs can independently extract semantically aligned part-level features from relevant channels in the testing stage. MPN has three key advantages: 1) it does not need to conduct body part detection in the inference stage; 2) its model is very compact and efficient for both training and testing; 3) in the training stage, it requires only coarse priors of body part locations, which are easy to obtain. Systematic experiments on four large-scale ReID databases demonstrate that MPN consistently outperforms state-of-the-art approaches by significant margins.