Chaerin Kong | Seoul National University

Papers by Chaerin Kong

Research paper thumbnail of ConcatPlexer: Additional Dim1 Batching for Faster ViTs

arXiv (Cornell University), Aug 21, 2023

Transformers have demonstrated tremendous success not only in the natural language processing (NLP) domain but also in the field of computer vision, igniting various creative approaches and applications. Yet, the superior performance and modeling flexibility of transformers came with a severe increase in computation costs, and hence several works have proposed methods to reduce this burden. Inspired by a cost-cutting method originally proposed for language models, Data Multiplexing (DataMUX), we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation), greatly improving throughput with little compromise in accuracy. We first introduce a naive adaptation of DataMUX for vision models, the Image Multiplexer, and devise novel components to overcome its weaknesses, arriving at our final model, ConcatPlexer, which sits at the sweet spot between inference speed and accuracy. ConcatPlexer was trained on ImageNet1K and CIFAR100 and achieved 23.5% fewer GFLOPs than ViT-B/16 with 69.5% and 83.4% validation accuracy, respectively.
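
The dim1 batching idea itself is simple: patch tokens from several images are concatenated along the sequence dimension so that one transformer forward pass covers all of them, and the outputs are split back apart afterwards. Below is a minimal PyTorch sketch of that core mechanism; the module names (`Dim1Multiplexer`, `patch_embed`, `encoder`) and the mean-pooling demultiplexer are illustrative assumptions, not the components introduced in ConcatPlexer.

```python
import torch
import torch.nn as nn


class Dim1Multiplexer(nn.Module):
    """Concatenates the patch tokens of several images along the sequence
    dimension so that a single transformer forward pass covers all of them."""

    def __init__(self, patch_embed: nn.Module, encoder: nn.Module,
                 num_classes: int, embed_dim: int = 768, images_per_pass: int = 2):
        super().__init__()
        self.patch_embed = patch_embed        # maps (B, 3, H, W) -> (B, N, D)
        self.encoder = encoder                # transformer blocks, (B, L, D) -> (B, L, D)
        self.images_per_pass = images_per_pass
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B * images_per_pass, 3, H, W)
        tokens = self.patch_embed(images)                 # (B*M, N, D)
        bm, n, d = tokens.shape
        m = self.images_per_pass
        b = bm // m
        # dim1 batching: fold M images into one longer token sequence per sample
        tokens = tokens.reshape(b, m * n, d)              # (B, M*N, D)
        feats = self.encoder(tokens)                      # (B, M*N, D)
        # demultiplex: pool each image's own token span separately
        feats = feats.reshape(b, m, n, d).mean(dim=2)     # (B, M, D)
        return self.head(feats).reshape(bm, -1)           # (B*M, num_classes)
```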

Research paper thumbnail of Fashion Style Editing with Generative Human Prior

arXiv (Cornell University), Apr 2, 2024

Research paper thumbnail of AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion

arXiv (Cornell University), May 6, 2023

Research paper thumbnail of Unifying Vision-Language Representation Space with Single-Tower Transformer

Proceedings of the AAAI Conference on Artificial Intelligence

Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and its caption can be regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a one-tower model for vision-language pretraining (VLP), and propose One Representation (OneR) as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from previous works with modality-specific representation spaces, such as zero-shot localization, text-guided visual reasoning, and multi-modal retrieval, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP ...
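
As a rough illustration of the single-tower setup, the sketch below routes both modalities through one shared transformer and trains it with a standard symmetric image-text contrastive loss; the embedder modules and pooling are hypothetical stand-ins, not the OneR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over the matched image-caption pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class OneTowerEncoder(nn.Module):
    """One transformer processes both modalities; only the input embedders differ."""

    def __init__(self, shared_transformer: nn.Module, image_embed: nn.Module,
                 text_embed: nn.Module, embed_dim: int = 512):
        super().__init__()
        self.backbone = shared_transformer    # shared weights, (B, L, D) -> (B, L, D)
        self.image_embed = image_embed        # pixels -> token sequence
        self.text_embed = text_embed          # token ids -> token sequence
        self.proj = nn.Linear(embed_dim, embed_dim)

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(tokens).mean(dim=1))       # pooled embedding

    def forward(self, images, captions):
        img_emb = self.encode(self.image_embed(images))
        txt_emb = self.encode(self.text_embed(captions))
        return contrastive_loss(img_emb, txt_emb)
```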

Research paper thumbnail of Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance

arXiv (Cornell University), Feb 10, 2023

Research paper thumbnail of Unifying Vision-Language Representation Space with Single-tower Transformer

arXiv (Cornell University), Nov 20, 2022

Research paper thumbnail of Self-Distilled Self-supervised Representation Learning

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

State-of-the-art frameworks in self-supervised learning have recently shown that fully utilizing transformer-based models can lead to a performance boost compared to conventional CNN models. Striving to maximize the mutual information of two views of an image, existing works apply a contrastive loss to the final representations. Motivated by self-distillation in the supervised regime, we further exploit this by allowing the intermediate representations to learn from the final layer via the contrastive loss. Through self-distillation, the intermediate layers become better suited for instance discrimination, so the performance of an early-exited sub-network is not much degraded from that of the full network. This also renders the pretext task easier for the final layer, leading to better representations. Our method, Self-Distilled Self-Supervised Learning (SDSSL), outperforms competitive baselines (SimCLR, BYOL and MoCo v3) using ViT on various tasks and datasets. Under the linear evaluation and k-NN protocols, SDSSL leads to superior performance not only in the final layers but also in most of the lower layers. Furthermore, qualitative and quantitative analyses show how representations are formed more effectively along the transformer layers. Code is available at https://github.com/hagiss/SDSSL.
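
A minimal sketch of the self-distillation objective, assuming per-layer pooled features are already extracted from two augmented views: each intermediate layer of one view is pulled toward the final-layer representation of the other view with the same contrastive loss used at the top. The weighting and layer selection here are simplifications, not the exact SDSSL recipe.

```python
import torch
import torch.nn.functional as F


def info_nce(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Standard InfoNCE between matched rows of q and k (k acts as the target)."""
    q, k = F.normalize(q, dim=-1), F.normalize(k.detach(), dim=-1)
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)


def sdssl_loss(view1_layers: list, view2_layers: list, weight: float = 0.5) -> torch.Tensor:
    """view*_layers: per-layer pooled features [z_1, ..., z_L] from two augmented views."""
    final1, final2 = view1_layers[-1], view2_layers[-1]
    # standard contrastive loss at the final layer
    loss = info_nce(final1, final2) + info_nce(final2, final1)
    # self-distillation: each intermediate layer learns from the other view's final layer
    for z1, z2 in zip(view1_layers[:-1], view2_layers[:-1]):
        loss = loss + weight * (info_nce(z1, final2) + info_nce(z2, final1))
    return loss
```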

Research paper thumbnail of Towards Efficient Neural Scene Graphs by Learning Consistency Fields

arXiv (Cornell University), Oct 8, 2022

Neural Radiance Fields (NeRF) achieves photo-realistic image rendering from novel views, and Neural Scene Graphs (NSG) [16] extends it to dynamic scenes (video) with multiple objects. Nevertheless, computationally heavy ray marching for every image frame becomes a huge burden. In this paper, taking advantage of the significant redundancy across adjacent frames in videos, we propose a feature-reusing framework. From a first attempt at naively reusing NSG features, however, we learn that it is crucial to disentangle object-intrinsic properties that are consistent across frames from transient ones. Our proposed method, Consistency-Field-based NSG (CF-NSG), reformulates neural radiance fields to additionally consider consistency fields. With disentangled representations, CF-NSG takes full advantage of the feature-reusing scheme and performs an extended degree of scene manipulation in a more controllable manner. We empirically verify that CF-NSG greatly improves inference efficiency, using 85% fewer queries than NSG without notable degradation in rendering quality. Code will be available at https://github.com/ldynx/CF-NSG.
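
Conceptually, the feature-reusing scheme amounts to computing object-intrinsic (frame-consistent) features once, caching them, and recomputing only the transient, frame-dependent part at render time. The sketch below illustrates that split with hypothetical modules (`consistent_net`, `transient_net`, `render_head`) and a per-object cache; it is not CF-NSG's actual formulation, which operates on consistency fields inside the radiance-field pipeline.

```python
import torch
import torch.nn as nn


class ConsistencyFeatureCache(nn.Module):
    """Inference-time sketch: compute the frame-consistent feature of each object
    once, then reuse it while only the transient, per-frame part is recomputed."""

    def __init__(self, consistent_net: nn.Module, transient_net: nn.Module,
                 render_head: nn.Module):
        super().__init__()
        self.consistent_net = consistent_net   # object-intrinsic, frame-independent
        self.transient_net = transient_net     # depends on frame index / pose / time
        self.render_head = render_head         # combined features -> (rgb, sigma)
        self.cache = {}                        # object_id -> cached consistent feature

    def forward(self, object_id: int, local_coords: torch.Tensor,
                frame_embedding: torch.Tensor) -> torch.Tensor:
        if object_id not in self.cache:
            # expensive part, run once per object (a practical cache would key on
            # quantized object-local coordinates rather than the object id alone)
            self.cache[object_id] = self.consistent_net(local_coords)   # (S, Dc)
        consistent = self.cache[object_id]
        transient = self.transient_net(frame_embedding)                 # (1, Dt), cheap
        transient = transient.expand(consistent.size(0), -1)            # broadcast to samples
        return self.render_head(torch.cat([consistent, transient], dim=-1))
```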

Research paper thumbnail of Leveraging Off-the-shelf Diffusion Model for Multi-attribute Fashion Image Manipulation

arXiv (Cornell University), Oct 11, 2022

Fashion attribute editing is a task that aims to convert the semantic attributes of a given fashion image while preserving the irrelevant regions. Previous works typically employ conditional GANs where the generator explicitly learns the target attributes and directly executes the conversion. These approaches, however, are neither scalable nor generic, as they operate only with a few limited attributes and require a separate generator for each dataset or attribute set. Inspired by the recent advancement of diffusion models, we explore classifier-guided diffusion that leverages an off-the-shelf diffusion model pretrained on general visual semantics such as ImageNet. In order to achieve a generic editing pipeline, we pose this as a multi-attribute image manipulation task, where the attributes range from item category, fabric, and pattern to collar and neckline. We empirically show that conventional methods fail in this challenging setting, and study an efficient adaptation scheme that involves a recently introduced attention-pooling technique to obtain multi-attribute classifier guidance. Based on this, we present a mask-free fashion attribute editing framework that leverages the classifier logits and the cross-attention map for manipulation. We empirically demonstrate that our framework achieves convincing sample quality and attribute alignment. (Classifiers are generally easier and more straightforward to train or finetune than generative models under limited data.)
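
The classifier-guidance component can be summarized as adding the gradient of the summed attribute log-likelihoods to each denoising step. The sketch below shows that computation for a multi-attribute classifier; the classifier interface and the way targets are passed are hypothetical placeholders rather than the paper's pipeline.

```python
import torch


def multi_attribute_guidance(x_t: torch.Tensor, t: torch.Tensor, classifier,
                             target_attributes: dict,
                             guidance_scale: float = 2.0) -> torch.Tensor:
    """Gradient term added to the denoising mean at timestep t.

    `classifier(x_t, t)` is assumed to return a dict of per-attribute logits,
    e.g. {"category": (B, C1), "fabric": (B, C2), ...}; `target_attributes`
    maps each attribute name to a LongTensor of target class indices of shape (B,).
    """
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        logits = classifier(x_in, t)
        batch = torch.arange(x_in.size(0), device=x_in.device)
        log_prob = x_in.new_zeros(())
        for name, target in target_attributes.items():
            log_softmax = torch.log_softmax(logits[name], dim=-1)
            # sum the log-likelihoods of the desired class for every attribute
            log_prob = log_prob + log_softmax[batch, target].sum()
        grad = torch.autograd.grad(log_prob, x_in)[0]
    return guidance_scale * grad
```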

Research paper thumbnail of Few-shot Image Generation with Mixup-based Distance Learning

Producing diverse and realistic images with generative models such as GANs typically requires large-scale training with vast amounts of images. GANs trained with limited data can easily memorize the few training samples and display undesirable properties like a "stairlike" latent space, where interpolation in the latent space yields discontinuous transitions in the output space. In this work, we consider the challenging task of pretraining-free few-shot image synthesis, and seek to train existing generative models with minimal overfitting and mode collapse. We propose mixup-based distance regularization on the feature spaces of both the generator and the counterpart discriminator that encourages the two players to reason not only about the scarce observed data points but also about the relative distances in the feature space in which they reside. Qualitative and quantitative evaluation on diverse datasets demonstrates that our method is generally applicable to existing models to enhance both fidelity and diversity under the few-shot setting. Codes are available.
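
To make the regularizer concrete: an image generated from an interpolated latent code should keep similarity ratios to the anchor images that mirror the interpolation coefficients. The sketch below expresses that as a KL term between the coefficient distribution and a softmax over feature similarities; the feature extractor and exact loss form are simplified assumptions, not the paper's full objective.

```python
import torch
import torch.nn.functional as F


def mixup_distance_loss(anchor_feats: torch.Tensor, mixed_feat: torch.Tensor,
                        coeffs: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """anchor_feats: (N, D) features of images generated from N anchor latents.
    mixed_feat: (D,) feature of the image from the coeff-weighted latent mix.
    coeffs: (N,) mixup weights summing to 1 (e.g. sampled from a Dirichlet)."""
    sims = F.cosine_similarity(mixed_feat.unsqueeze(0), anchor_feats, dim=-1)  # (N,)
    sim_dist = F.softmax(sims / temperature, dim=0)
    # the similarity profile should follow the interpolation coefficients
    return F.kl_div(sim_dist.log(), coeffs, reduction="sum")


# usage sketch with a hypothetical generator G and feature extractor feat_net:
# z_mix = (coeffs.unsqueeze(1) * anchor_latents).sum(dim=0, keepdim=True)
# loss = mixup_distance_loss(feat_net(G(anchor_latents)),
#                            feat_net(G(z_mix))[0], coeffs)
```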

Research paper thumbnail of Conservative Generator, Progressive Discriminator: Coordination of Adversaries in Few-shot Incremental Image Synthesis

In this work, we study the underrepresented task of generative incremental few-shot learning. To effectively handle the inherent challenges of incremental learning and few-shot learning, we propose a novel framework named ConPro that leverages the two-player nature of GANs. Specifically, we design a conservative generator that preserves past knowledge in a parameter- and compute-efficient manner, and a progressive discriminator that learns to reason about semantic distances between past and present task samples, minimizing overfitting with few data points and pursuing good forward transfer. We present experiments to validate the effectiveness of the proposed framework.

Research paper thumbnail of Smoothing the Generative Latent Space with Mixup Based Distance Learning

arXiv: Computer Vision and Pattern Recognition, Nov 23, 2021

Producing diverse and realistic images with generative models such as GANs typically requires large-scale training with vast amounts of images. GANs trained with extremely limited data can easily overfit to the few training samples and display undesirable properties like a "stairlike" latent space, where transitions in the latent space suffer from discontinuity, occasionally yielding abrupt changes in outputs. In this work, we consider the situation where neither a large-scale dataset of interest nor a transferable source dataset is available, and seek to train existing generative models with minimal overfitting and mode collapse. We propose latent mixup-based distance regularization on the feature spaces of both the generator and the counterpart discriminator that encourages the two players to reason not only about the scarce observed data points but also about the relative distances in the feature space in which they reside. Qualitative and quantitative evaluation on diverse datasets demonstrates that our method is generally applicable to existing models to enhance both fidelity and diversity under the constraint of limited data. Code will be made public.
