Sanjana Jain - Academia.edu
Papers by Sanjana Jain
arXiv (Cornell University), Oct 7, 2022
Artistic style transfer is usually performed between two images, a style image and a content image. Recently, a model named CLIPstyler demonstrated that a natural language description of style could replace the need for a reference style image. However, their technique requires a lengthy optimisation procedure at run-time for each query, involving multiple forward and backward passes through a network as well as expensive loss computations. In this work, we create a generalised text-based style transfer network capable of stylising images in a single forward pass for an arbitrary text input, making the image stylisation process around 1000 times more efficient than CLIPstyler. We also demonstrate how our technique eliminates the leakage of unwanted artefacts that renders some of CLIPstyler's generated images unusable. We further propose an optional fine-tuning step to improve the quality of the generated image. We qualitatively evaluate the performance of our framework and show that it can generate…
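The abstract's central claim is that a single feed-forward network, conditioned on a text embedding, replaces the per-query optimisation loop. Below is a minimal sketch of that idea in PyTorch; the text embedding is assumed to come from a frozen CLIP text encoder, and the module names, FiLM-style conditioning, and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a single-forward-pass, text-conditioned styliser.
# The text embedding is assumed to come from a frozen CLIP text encoder;
# all module and dimension choices here are illustrative, not the paper's.
import torch
import torch.nn as nn

class TextConditionedStyliser(nn.Module):
    def __init__(self, text_dim: int = 512, channels: int = 64):
        super().__init__()
        # Encode the content image into a feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Map the text embedding to per-channel scale and shift (FiLM-style
        # conditioning), so one network can serve arbitrary style prompts.
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)
        # Decode the modulated features back to an RGB image.
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, content: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(content)                                   # (B, C, H, W)
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)   # (B, C) each
        feats = feats * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.decoder(feats)

# One forward pass per (image, prompt) pair -- no per-query optimisation loop.
stylised = TextConditionedStyliser()(torch.rand(1, 3, 256, 256), torch.rand(1, 512))
```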
arXiv (Cornell University), Sep 22, 2022
The core principle of Variational Inference (VI) is to convert the statistical inference problem of computing complex posterior probability densities into a tractable optimization problem. This property enables VI to be faster than several sampling-based techniques. However, the traditional VI algorithm does not scale to large data sets and cannot readily infer out-of-bounds data points without re-running the optimization process. Recent developments in the field, such as stochastic, black-box, and amortized VI, have helped address these issues. Generative modeling tasks now make wide use of amortized VI for its efficiency and scalability, as it uses a parameterized function to learn the approximate posterior density parameters. With this paper, we review the mathematical foundations of various VI techniques to form the basis for understanding amortized VI. Additionally, we provide an overview of recent trends that address several issues of amortized VI, such as the amortization gap, generalization issues, inconsistent representation learning, and posterior collapse. Finally, we analyze alternate divergence measures that improve VI optimization.
Footnote 1: Conjugacy occurs when the posterior density is in the same family of probability density functions as the prior, but with new parameter values updated to reflect the learning from the data.
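As a companion to the amortized-VI description above, here is a minimal PyTorch sketch of the idea: an inference network maps each data point to the parameters of a diagonal Gaussian approximate posterior, so new points are inferred in one forward pass, and the ELBO is maximised with the reparameterisation trick. The layer sizes and the Bernoulli likelihood are assumptions for illustration, not any specific model from the survey.

```python
# Minimal sketch of amortized VI: an inference network ("encoder") maps each
# data point to the parameters of an approximate Gaussian posterior, avoiding
# a fresh optimisation run for every new data point.
# Architecture sizes and the Bernoulli likelihood are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AmortizedGaussianPosterior(nn.Module):
    def __init__(self, x_dim: int = 784, z_dim: int = 16, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def elbo(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.decoder(z)
        log_lik = -F.binary_cross_entropy_with_logits(recon, x, reduction="none").sum(-1)
        # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
        kl = 0.5 * (torch.exp(logvar) + mu**2 - 1.0 - logvar).sum(-1)
        return (log_lik - kl).mean()

model = AmortizedGaussianPosterior()
loss = -model.elbo(torch.rand(32, 784))   # maximise the ELBO = minimise its negative
```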
IEEE Access, 2022
This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach that leverages a Transformer network to better utilize the spatial information in a feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces a loss function named smooth-Wing loss, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches. On the WFLW dataset, the proposed LOTR framework demonstrates promising results compared with several state-of-the-art methods. Additionally, we report an improvement in the performance of state-of-the-art face recognition systems when using our proposed LOTRs for face alignment.
INDEX TERMS: Artificial neural networks, computer vision, deep learning, face recognition, image processing, machine learning.
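The three-module structure described in the abstract (visual backbone, Transformer module, coordinate-regression head) can be sketched in PyTorch as below. The convolutional stand-in backbone, the token pooling, and all sizes are placeholder assumptions, not the published LOTR configuration, and the sketch does not include the smooth-Wing loss.

```python
# Minimal sketch of the three-module structure the abstract describes:
# a visual backbone, a Transformer over the feature-map tokens, and a head
# that regresses landmark coordinates directly. All choices below are
# placeholder assumptions, not the LOTR configuration.
import torch
import torch.nn as nn

class LandmarkTransformerSketch(nn.Module):
    def __init__(self, num_landmarks: int = 106, d_model: int = 128):
        super().__init__()
        # 1) Visual backbone: image -> feature map (stand-in for a real CNN).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # 2) Transformer module: refine the flattened feature-map tokens.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # 3) Landmark prediction head: pooled representation -> (x, y) per landmark.
        self.head = nn.Linear(d_model, num_landmarks * 2)
        self.num_landmarks = num_landmarks

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(img)                      # (B, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)       # (B, H'*W', C)
        tokens = self.transformer(tokens)
        coords = self.head(tokens.mean(dim=1))          # pool, then regress directly
        return coords.view(-1, self.num_landmarks, 2)   # (B, L, 2) coordinates

landmarks = LandmarkTransformerSketch()(torch.rand(2, 3, 192, 192))
```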
ArXiv, 2021
Accurate face landmark localization is an essential part of face recognition, reconstruction, and morphing. To accurately localize face landmarks, we present our heatmap regression approach. Each model consists of a MobileNetV2 backbone followed by several upscaling layers, with different tricks to optimize both performance and inference cost. We use five naïve face landmarks from a publicly available face detector to position and align the face, instead of using the bounding box as traditional methods do. Moreover, by adding random rotation, displacement, and scaling after alignment, we show that the model is more sensitive to face position than to orientation. We also show that it is possible to reduce the upscaling complexity by using a mixture of deconvolution and pixel-shuffle layers without impeding localization performance. We present our state-of-the-art face landmark localization model (ranking second on The 2nd Grand Challenge of 106-Point Facial Landmark Localization validation…
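To illustrate the upscaling idea mentioned in the abstract, the sketch below mixes a deconvolution layer with a pixel-shuffle layer to turn a small backbone feature map into per-landmark heatmaps. The channel counts, depth, and input size are illustrative assumptions, not the authors' exact head.

```python
# Minimal sketch of the upscaling idea described above: a lightweight backbone
# feature map is upscaled to per-landmark heatmaps using a mix of deconvolution
# and pixel-shuffle layers. Channel counts and depths are illustrative only.
import torch
import torch.nn as nn

class HeatmapHeadSketch(nn.Module):
    def __init__(self, in_channels: int = 320, num_landmarks: int = 106):
        super().__init__()
        self.upscale = nn.Sequential(
            # Deconvolution: learned 2x upsampling.
            nn.ConvTranspose2d(in_channels, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            # Pixel shuffle: cheaper 2x upsampling via channel-to-space rearrangement.
            nn.Conv2d(128, 128 * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(upscale_factor=2),
            nn.ReLU(),
            # One heatmap per landmark.
            nn.Conv2d(128, num_landmarks, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.upscale(feats)   # (B, L, 4*H, 4*W) heatmaps

# e.g. a 7x7 backbone feature map becomes 28x28 heatmaps, one per landmark.
heatmaps = HeatmapHeadSketch()(torch.rand(1, 320, 7, 7))
```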