Shantipriya Parida | Utkal University
Papers by Shantipriya Parida
Advances in Computational Intelligence and Robotics book series, Feb 9, 2024
Smart Innovation, Systems and Technologies, Dec 31, 2022
arXiv (Cornell University), May 21, 2023
The use of subword embeddings has proved to be a major innovation in Neural Machine Translation (NMT). It helps NMT learn better context vectors for Low-Resource Languages (LRLs), predicting target words by better modelling the morphologies of the two languages and the morphosyntactic transfer between them. Even so, performance for Indian-language-to-Indian-language translation is still not as good as for resource-rich languages. One reason is the relative morphological richness of Indian languages; another is that most of them fall into the extremely low-resource or zero-shot categories. Since most major Indian languages use Indic or Brahmi-origin scripts, text written in them is highly phonetic in nature and phonetically similar in terms of abstract letters and their arrangements. We use these characteristics of Indian languages and their scripts to propose an approach based on a common multilingual Latin-based encoding (WX notation) that takes advantage of language similarity while addressing the morphological complexity issue in NMT. This multilingual Latin-based encoding, together with Byte Pair Encoding (BPE), allows us to better exploit the phonetic, orthographic, and lexical similarities of these languages and improve translation quality by projecting different but similar languages onto the same orthographic-phonetic character space. We verify the proposed approach with experiments on similar language pairs (Gujarati↔Hindi, Marathi↔Hindi, Nepali↔Hindi, Maithili↔Hindi, Punjabi↔Hindi, and Urdu↔Hindi) under low-resource conditions. The approach shows an improvement in a majority of cases, in one case by as much as ∼10 BLEU points over baseline techniques for similar language pairs. We also obtain up to ∼1 BLEU point improvement on distant and zero-shot language pairs.
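A minimal sketch of the core idea: transliterate Devanagari into a WX-style Latin encoding before subword segmentation. The character table below is deliberately tiny and ignores halant and vowel-sign interactions; a real WX converter covers the full Indic inventory, and BPE (e.g., via subword-nmt or SentencePiece) is then learned on the transliterated text:

```python
# Illustrative WX-style mapping: a handful of Devanagari characters only.
# A full implementation handles the complete script, conjuncts, and the
# inherent-vowel/halant logic that this naive character loop skips.
WX_MAP = {
    "क": "ka", "ख": "Ka", "ग": "ga", "घ": "Ga",   # velars
    "त": "wa", "थ": "Wa", "द": "xa", "ध": "Xa",   # dentals
    "न": "na", "म": "ma", "र": "ra", "ल": "la",
    "ा": "A", "ि": "i", "ी": "I", "ु": "u",       # vowel signs
}

def to_wx(text: str) -> str:
    """Map Devanagari characters to WX-style Latin symbols, one by one."""
    return "".join(WX_MAP.get(ch, ch) for ch in text)

print(to_wx("नमक"))  # -> "namaka"; text from other Indic scripts mapped the
                     # same way lands in the same Latin character space,
                     # which is what lets BPE share subwords across languages.
```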
IGI Global eBooks, Feb 7, 2017
Classification of brain states obtained through functional magnetic resonance imaging (fMRI) poses a serious challenge for the neuroimaging community: uncovering the discriminating patterns of brain-state activity that define independent thought processes. The challenge arises from the large number of voxels in a typical fMRI scan, which presents the classifier with a massive feature set coupled with relatively few training samples. One of the most popular research topics in recent years has been the application of machine learning algorithms to mental-state classification, decoding brain activation, and finding the variables of interest in fMRI data. In a classification setting, different algorithms have different biases; consequently, performance differs across datasets, and for a particular dataset the accuracy varies from classifier to classifier. To overcome the limitations of individual techniques, hybridization or fusion of these machine learning techniques has emerged in recent years, showing promising results and opening up new directions of research. This paper reviews the machine learning techniques used in cognitive classification, ranging from individual classifiers to ensemble and hybrid techniques, with a well-balanced treatment of their applications, performance, and limitations. It also discusses many open challenges for further research.
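As a rough illustration of the fusion idea the review surveys, the sketch below combines three standard classifiers by soft voting in scikit-learn; the synthetic data merely stands in for the many-features/few-samples regime of fMRI voxels, and none of the names come from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Few samples, many features: the regime typical of fMRI scans.
X, y = make_classification(n_samples=120, n_features=2000,
                           n_informative=50, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", probability=True)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average the classifiers' predicted probabilities
)
scores = cross_val_score(ensemble, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```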
Smart Innovation, Systems and Technologies, 2023
Smart Innovation, Systems and Technologies, 2022
arXiv (Cornell University), May 28, 2023
This paper presents HaVQA, the first multimodal dataset for visual question answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold-standard English-Hausa parallel sentences, translated in a fashion that guarantees their semantic match with the corresponding visual information. We conducted several baseline experiments on the dataset, including visual question answering, visual question elicitation, and text-only and multimodal machine translation.
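For concreteness, one record of such a dataset might look like the sketch below; the field names and the Hausa strings are illustrative assumptions, not HaVQA's actual schema:

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    image_id: int      # Visual Genome image the pair is grounded in
    question_en: str
    answer_en: str
    question_ha: str   # manual Hausa translation of the question
    answer_ha: str     # manual Hausa translation of the answer

example = QAPair(
    image_id=1,
    question_en="What color is the car?",
    answer_en="Red.",
    question_ha="Wane launi ne motar?",  # illustrative translation
    answer_ha="Ja.",
)
```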
Language Resources and Evaluation
arXiv (Cornell University), Aug 2, 2022
This paper provides the system description of "Silo NLP's" submission to the Workshop on Asian Translation (WAT2022). We participated in the Indic multimodal tasks (English→Hindi, English→Malayalam, and English→Bengali multimodal translation). For text-only translation, we trained Transformers from scratch and fine-tuned mBART-50 models. For multimodal translation, we used the same mBART architecture and extracted object tags from the images to use as visual features, concatenated with the text sequence. Our submission tops many tasks, including English→Hindi multimodal translation (evaluation test), English→Malayalam text-only and multimodal translation (evaluation test), English→Bengali multimodal translation (challenge test), and English→Bengali text-only translation (evaluation test).
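A hedged sketch of that multimodal input construction with the Hugging Face mBART-50 API: the object tags are simply prepended to the source sentence, with the separator and tag format being assumptions rather than the paper's exact specification:

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

object_tags = ["man", "bicycle", "street"]  # e.g., from an object detector
sentence = "A man rides a bicycle down the street."
source = " ".join(object_tags) + " | " + sentence  # assumed tag format

tokenizer.src_lang = "en_XX"
inputs = tokenizer(source, return_tensors="pt")
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"]
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```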
arXiv (Cornell University), May 2, 2022
Multi-modal Machine Translation (MMT) enables the use of visual information to enhance translation quality. The visual information can serve as valuable context that reduces the ambiguity of input sentences. Despite the increasing popularity of the technique, good and sizeable datasets are scarce, limiting its full potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. An estimated 100 to 150 million people speak the language, with more than 80 million indigenous speakers, more than any other Chadic language. Despite its large number of speakers, Hausa is considered low-resource in natural language processing (NLP), owing to the absence of sufficient resources for most NLP tasks. While some datasets exist, they are either scarce, machine-generated, or in the religious domain. There is therefore a need to create training and evaluation data for machine learning tasks and to bridge the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image, or a section within the image, in Hausa and its equivalent in English. To prepare the dataset, we started by automatically translating the English descriptions of the images in the Hindi Visual Genome (HVG) into Hausa. The synthetic Hausa data was then carefully post-edited with the respective images in view. The dataset comprises 32,923 images and their descriptions, divided into training, development, test, and challenge test sets. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, and image description, among various other natural language processing and generation tasks.
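The first stage of that pipeline, producing the synthetic Hausa drafts, could look like the sketch below; the checkpoint name is an assumption, and any reasonable English→Hausa MT model would fill the same role before the human post-editing step:

```python
from transformers import pipeline

# Assumed checkpoint; swap in whichever English->Hausa model is available.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ha")

captions = [
    "A brown dog sleeping on a wooden bench.",
    "Two children playing football in a park.",
]
drafts = [out["translation_text"] for out in translator(captions)]
# Each draft is then manually post-edited with the image in view.
print(drafts)
```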
Smart Innovation, Systems and Technologies, 2022
Smart Innovation, Systems and Technologies, 2022
Odia is among the 30 most spoken languages in the world. It is spoken in the Indian state of Odisha. Odia nonetheless lacks online content and resources for natural language processing (NLP) research, and there is a great need for a better language model for this low-resource language that can be used for many downstream NLP tasks. In this paper, we introduce a BERT-based language model pre-trained on 430,000 Odia sentences. We evaluate the model on the well-known Kaggle Odia news classification dataset (BertOdia: 96%, RoBERTaOdia: 92%, and ULMFiT: 91.9% classification accuracy) and perform a comparison with multilingual Bidirectional Encoder Representations from Transformers (BERT) models supporting Odia. The model will be released publicly for researchers to explore other NLP tasks.
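A minimal sketch of the downstream evaluation setup, classifying news text with a pretrained checkpoint via Hugging Face transformers; "odia-bert" is a placeholder name, not the released model's identifier:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "odia-bert"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # e.g., sports / business / entertainment
)

texts = ["<an Odia news headline>"]  # placeholder input
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted class index
```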
2022 OPJU International Technology Conference on Emerging Technologies for Sustainable Development (OTCON)
Language plays a crucial role in preserving culture in our society and acts as a repository of traditional knowledge, memories, values, practices, and unique worldviews that have been used, transformed, and practiced in the form of language for millennia. Many of the more than 6,000 languages spoken today are estimated to be in danger. UNESCO regularly publishes a list of endangered languages based on a five-level classification: vulnerable, definitely endangered, severely endangered, critically endangered, and extinct. Worryingly, most of Odisha's indigenous languages fall under these categories. Safeguarding and reinvigorating these languages has become crucial for maintaining and preserving the cultural diversity of society. On this account, the present paper discusses how Natural Language Processing (NLP), in collaboration with linguistics, can help revive endangered languages by developing a methodology for building a corpus for the lesser-known endangered indigenous languages of Odisha, some of which have no existing script. The purpose of this paper is to provide researchers and professionals working on low-resource languages with complete guidelines for classifying languages and collecting corpora, taking into account language diversity, style, and practical constraints.
Asian Federation of Natural Language Processing, 2018
This paper describes the CUNI submission to WAT 2018 for the English-Hindi translation task, using a transfer learning technique that has proven effective under low-resource conditions. We used the Transformer model and an English-Czech parallel corpus as an additional data source. Our simple transfer learning approach first trains a "parent" model for a high-resource language pair (English-Czech) and then continues training on the low-resource pair (English-Hindi) by replacing the training corpus. This setup improves performance over the baseline and, in combination with back-translation of Hindi monolingual data, allowed us to win the English-Hindi task. The automatic scoring by BLEU did not correlate well with human judgments.
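The parent→child recipe reduces to "train, checkpoint, swap the corpus, continue". The toy PyTorch sketch below captures that control flow under stated assumptions: a linear layer stands in for the Transformer, random tensors stand in for the EN-CS and EN-HI batches, and the shared vocabulary that makes the swap possible is taken for granted:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                      # stand-in for an NMT model
optimizer = torch.optim.Adam(model.parameters())

def train_step(batch: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss = ((model(batch) - batch) ** 2).mean()  # dummy objective
    loss.backward()
    optimizer.step()

# Phase 1: "parent" training on the high-resource pair (EN-CS stand-in).
for _ in range(100):
    train_step(torch.randn(32, 512))
torch.save(model.state_dict(), "parent.pt")

# Phase 2: "child" training resumes from the parent checkpoint with only
# the training corpus replaced (EN-HI stand-in); the weights carry over.
model.load_state_dict(torch.load("parent.pt"))
for _ in range(100):
    train_step(torch.randn(32, 512))
```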
This paper presents a case study in translating short image captions of the Visual Genome dataset from English into Hindi using out-of-domain datasets of varying size. We experiment with three NMT models: the shallow and deep sequence-to-sequence models and the Transformer model, as implemented in the Marian toolkit. Phrase-based Moses serves as the baseline. The results indicate that the Transformer model outperforms the others in the large-data setting on a number of automatic metrics and in manual evaluation, and it also produces the fewest truncated sentences. Transformer training is, however, very sensitive to hyperparameters, so it requires more experimentation. The deep sequence-to-sequence model produced more flawless outputs in the small-data setting and was generally more stable, at the cost of more training iterations.
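For the automatic side of such comparisons, corpus-level BLEU can be computed with sacreBLEU, as in the minimal sketch below; the Hindi strings are illustrative placeholders, not outputs from the paper's systems:

```python
import sacrebleu

hypotheses = ["एक आदमी साइकिल चला रहा है"]    # system outputs (illustrative)
references = [["एक आदमी साइकिल चला रहा है"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```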