Syamantak Kumar - Academia.edu

Papers by Syamantak Kumar

From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

arXiv (Cornell University), Jul 14, 2021

Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enables the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of code-switched (CS) text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation using a human evaluation study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained via crowd workers who are native Hindi speakers.
* Work done while the first two authors were students at IIT Bombay.

Streaming PCA for Markovian Data

arXiv (Cornell University), May 3, 2023

Since its inception in 1982, Oja's algorithm has become an established method for streaming principal component analysis (PCA). We study the problem of streaming PCA, where the data points are sampled from an irreducible, aperiodic, and reversible Markov chain. Our goal is to estimate the top eigenvector of the unknown covariance matrix of the stationary distribution. This setting has implications in scenarios where data can solely be sampled from a Markov Chain Monte Carlo (MCMC) type algorithm, and the objective is to perform inference on parameters of the stationary distribution. Most convergence guarantees for Oja's algorithm in the literature assume that the data points are sampled IID. For data streams with Markovian dependence, one typically downsamples the data to get a "nearly" independent data stream. In this paper, we obtain the first sharp rate for Oja's algorithm on the entire data, where we remove the logarithmic dependence on the sample size, n, resulting from throwing data away in downsampling strategies.
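
For readers unfamiliar with the method, the following is a minimal sketch of Oja's update applied to every sample of a Markovian stream, with no downsampling. It is an illustration only, not the paper's implementation: the constant step size, the function names `oja_top_eigenvector` and `markov_stream`, and the toy two-state chain (symmetric, hence irreducible, aperiodic, and reversible) are all assumptions made for this example.

```python
import math
import random


def oja_top_eigenvector(stream, dim, step_size, seed=0):
    """Estimate the top eigenvector of the stream's covariance with Oja's rule.

    Assumption: a constant step size is used here for simplicity.
    """
    rng = random.Random(seed)
    # Random unit-norm starting vector.
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    for x in stream:
        # Oja update: w <- w + eta * x * (x^T w), followed by renormalization.
        proj = sum(xi * wi for xi, wi in zip(x, w))
        w = [wi + step_size * proj * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def markov_stream(n, dim, stay_prob=0.9, signal=2.0, noise=0.5, seed=1):
    """Toy Markovian data (hypothetical): a sticky two-state chain on {+1, -1}.

    Each sample is Gaussian noise plus signal * state along the first axis,
    so the stationary covariance has the first axis as its top eigenvector.
    """
    rng = random.Random(seed)
    state = 1.0
    for _ in range(n):
        if rng.random() > stay_prob:
            state = -state  # switch state with probability 1 - stay_prob
        x = [rng.gauss(0.0, noise) for _ in range(dim)]
        x[0] += signal * state
        yield x
```

Calling `oja_top_eigenvector(markov_stream(5000, 3), dim=3, step_size=0.01)` processes all 5000 dependent samples in one pass; in this toy setup the estimate should align closely with the first coordinate axis, up to sign.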

CS-747 Project Report

Adversarial Examples for Keyword Spotting Systems using GANs

In this project, we implement a Generative Adversarial Network (GAN) that generates adversarial noise for a given Keyword Spotting (KWS) system, and then retrain the KWS system on the adversarial examples to make it more robust.

A comparison of open source libraries ready for 3D reconstruction of wounds

Quantitative assessment is essential to ensure correct diagnosis and effective treatment of chronic wounds. So far, devices with depth cameras and infrared sensors have been used for the computer-aided diagnosis of cutaneous wounds. However, these devices have limited accessibility and usage. On the other hand, smartphones are commonly available, and three-dimensional (3D) reconstruction using smartphones can be an important tool for wound assessment. In this paper, we analyze various open source libraries for smartphone-based 3D reconstruction of wounds. For this, point clouds are obtained from cutaneous wound regions using Google ARCore and Structure from Motion (SfM) libraries. These point clouds are subjected to de-noising filters to remove outliers and to improve the density of the point cloud. Subsequently, surface reconstruction is performed on the point cloud to generate a 3D model. Six mesh-reconstruction algorithms are considered, namely Delaunay triangulation, convex hull, point crust, Poisson surface reconstruction, alpha complex, and marching cubes. Performance is evaluated using quality metrics such as complexity, the density of the point clouds, the accuracy of depth information, and the efficacy of the reconstruction algorithm. The results show that 3D reconstruction of wounds from these point clouds is feasible using open source libraries. The point clouds obtained from SfM are found to have higher density and accuracy than those from ARCore. Among the algorithms compared, Poisson surface reconstruction is found to be the best for effective 3D reconstruction from the point clouds. However, research is still required on techniques to enhance the quality of point clouds obtained through smartphones and to reduce the computational cost associated with point-cloud-based 3D reconstruction.
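
The abstract does not say which de-noising filter was applied. As one common possibility, the sketch below shows a statistical outlier removal filter: a point is dropped when its mean distance to its k nearest neighbours is far above the average over the cloud. The function name `remove_outliers` and the parameters `k` and `std_ratio` are hypothetical choices for this illustration, and the brute-force neighbour search is only suitable for small clouds.

```python
import math


def remove_outliers(points, k=3, std_ratio=1.0):
    """Drop points whose mean k-nearest-neighbour distance is unusually large.

    points: list of (x, y, z) tuples. Assumes len(points) > k.
    A point is kept if its mean distance to its k nearest neighbours is at
    most (global mean + std_ratio * global standard deviation).
    """
    mean_dists = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point, O(n^2) overall.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dists.append(sum(dists[:k]) / k)
    mu = sum(mean_dists) / len(mean_dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in mean_dists) / len(mean_dists))
    threshold = mu + std_ratio * sigma
    return [p for p, d in zip(points, mean_dists) if d <= threshold]
```

For example, given a small flat grid of points plus one stray point far away from it, the filter keeps the grid and discards the stray point. Production pipelines would typically use a spatial index (such as a k-d tree) instead of the quadratic search shown here.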

From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021

