Utsab Barman | Dublin City University

Papers by Utsab Barman

Automatic processing of code-mixed social media content

I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D. is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work.

Part-of-speech Tagging of Code-Mixed Social Media Text

Proceedings of the Second Workshop on Computational Approaches to Code Switching, 2016

Multilingual users of social media sometimes use multiple languages during conversation. Mixing multiple languages in content is known as code-mixing. We annotate a subset of a trilingual code-mixed corpus (Barman et al., 2014) with part-of-speech (POS) tags. We investigate two state-of-the-art POS tagging techniques for code-mixed content and combine the features of the two systems to build a better POS tagger. Furthermore, we investigate the use of a joint model which performs language identification (LID) and part-of-speech (POS) tagging simultaneously.

Automatic processing of code-mixed social media content

Code-mixing or language-mixing is a linguistic phenomenon in which multiple languages mix together during conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) taggers and parsers perform poorly on such content because they are generally trained on monolingual data. Thus there is a need for code-mixed NLP. This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a word-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised and code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learning techniques. We find that among a dictionary-based system, a character-n-gram based linear model, a character-n-gram based first order Conditional Random Fields (CRF) and a recurrent n...

DCU-UVT: Word-Level Language Classification with Code-Mixed Data

Proceedings of the First Workshop on Computational Approaches to Code Switching, 2014

This paper describes the DCU-UVT team's participation in the Language Identification in Code-Switched Data shared task in the Workshop on Computational Approaches to Code Switching. Word-level classification experiments were carried out using a simple dictionary-based method, linear kernel support vector machines (SVMs) with and without contextual clues, and a k-nearest neighbour approach. Based on these experiments, we select our SVM-based system with contextual clues as our final system and present results for the Nepali-English and Spanish-English datasets.

Code Mixing: A Challenge for Language Identification in the Language of Social Media

Proceedings of the First Workshop on Computational Approaches to Code Switching, 2014

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
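The unsupervised dictionary-based baseline mentioned above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: each token is labelled with the language whose wordlist contains it, with unknown or ambiguous tokens falling back to a default label. The tiny wordlists are invented for the example.

```python
# Toy wordlists (invented for illustration; real systems use large lexicons).
EN = {"the", "is", "good", "movie", "very"}
HI = {"bahut", "accha", "hai", "kya"}
BN = {"khub", "bhalo", "chilo", "ki"}

def label_token(token, default="en"):
    """Label a token by dictionary membership; fall back on ties/misses."""
    t = token.lower()
    hits = [lang for lang, vocab in (("en", EN), ("hi", HI), ("bn", BN))
            if t in vocab]
    # Trust the label only when exactly one dictionary matches.
    return hits[0] if len(hits) == 1 else default

def label_sentence(sentence):
    return [(tok, label_token(tok)) for tok in sentence.split()]

print(label_sentence("movie khub bhalo chilo"))
# → [('movie', 'en'), ('khub', 'bn'), ('bhalo', 'bn'), ('chilo', 'bn')]
```

One weakness this sketch makes visible is exactly what the abstract reports: a pure dictionary lookup ignores context, which is why supervised classifiers with contextual clues surpass it.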

DCU: Aspect-based Polarity Classification for SemEval Task 4

Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014

We describe the work carried out by DCU on the Aspect Based Sentiment Analysis task at SemEval 2014. Our team submitted one constrained run for the restaurant domain and one for the laptop domain for sub-task B (aspect term polarity prediction), ranking highest out of 36 systems on the restaurant test set and joint highest out of 32 systems on the laptop test set.

Sentiment Analysis Meets Information Retrieval

Ad-hoc Information Retrieval focused on Wikipedia based Query Expansion and Entropy Based Ranking

This paper presents the experiments carried out at Jadavpur University as part of the participation in the Forum for Information Retrieval Evaluation (FIRE) 2012 ad-hoc monolingual information retrieval task for the Bengali, Hindi and English languages. Our experiments for FIRE 2012 are based on query expansion and entropy-based ranking. The document collections for Bengali, Hindi and English contained 457,370, 331,599 and 392,577 documents respectively. Each query was specified using title, narration and description fields. 100 queries were used for training the system, while the system was tested with 50 queries in Bengali.
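The abstract does not spell out the entropy formula, so the following is only a minimal sketch of one plausible reading of "entropy based ranking": scoring a term by the Shannon entropy of its distribution over documents, so that terms concentrated in few documents (low entropy) look more discriminative. The corpus and scoring details are invented for illustration.

```python
import math

# Toy corpus (invented for illustration).
docs = [
    "cricket match in kolkata",
    "cricket score update",
    "election results in delhi",
]

def term_entropy(term, docs):
    """Shannon entropy of a term's occurrence distribution over documents."""
    counts = [doc.split().count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# "cricket" is spread over two documents, "election" sits in one:
print(term_entropy("cricket", docs))   # 1.0 (uniform over 2 docs)
print(term_entropy("election", docs))  # 0.0 (concentrated in 1 doc)
```

Under this reading, a ranking function would favour documents matched by low-entropy (highly document-specific) query terms; the paper's actual weighting scheme may differ.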

NextGen AML: Distributed Deep Learning based Language Technologies to Augment Anti Money Laundering Investigation

Proceedings of ACL 2018, System Demonstrations, 2018

NER from Tweets: SRI-JU System @MSM 2013

Twitter has become an interesting data source for NLP experiments such as entity extraction and user opinion analysis. Due to the noisy nature of user-generated content, it is hard for standard NLP tools to obtain good results, and named entity extraction from tweets is one such task: traditional NER approaches do not perform well on tweets. Tweets are usually informal in nature and short (up to 140 characters). They often contain grammatical errors, misspellings, and unreliable capitalization, and these unreliable linguistic features cause traditional methods to perform poorly. This article reports the authors' participation in the Concept Extraction Challenge at Making Sense of Microposts (#MSM2013). Three system runs were submitted: the first run is the baseline, the second run adds capitalization and syntactic features, and the last run adds dictionary features. The last run outperformed all others, achieving 79.57% precision, 71.00% recall and a 74.79% f-measure.
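To make the run descriptions concrete, here is a hypothetical sketch of the kinds of token-level features the abstract names (capitalization and dictionary lookup). The gazetteer below is invented, and the real system's full feature set and learner are not specified in the abstract.

```python
# Toy gazetteer standing in for the system's dictionary features.
GAZETTEER = {"london", "obama", "dublin"}

def token_features(token):
    """Surface features of the sort used for NER on noisy tweets."""
    return {
        "is_capitalized": token[:1].isupper(),  # unreliable in tweets
        "all_caps": token.isupper(),
        "in_gazetteer": token.lower() in GAZETTEER,
        "has_hashtag": token.startswith("#"),
        "has_mention": token.startswith("@"),
    }

print(token_features("Dublin"))
# → {'is_capitalized': True, 'all_caps': False, 'in_gazetteer': True,
#    'has_hashtag': False, 'has_mention': False}
```

Because tweet capitalization is unreliable, dictionary membership gives the classifier a signal that survives casing noise, which is consistent with the dictionary-feature run scoring best.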

Semantic Answer Validation using Universal Networking Language

International Journal of Computer Science and Information Technologies (IJCSIT)

We present a rule-based answer validation (AV) system based on a textual entailment (TE) recognition mechanism that uses semantic features expressed in the Universal Networking Language (UNL). We consider the question as the TE hypothesis (H) and the supporting text as the TE text (T). Our proposed TE system compares the UNL relations in both T and H in order to identify the entailment relation as either validated or rejected. For training and evaluation, we used the AVE 2008 development set. We obtained 58% precision and a 22% F-score for the decision "validated."
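The relation-comparison idea can be sketched minimally as follows, assuming UNL output is available as (relation, head, dependent) triples. The triples and the all-or-nothing decision rule here are illustrative, not the paper's exact rule set.

```python
def unl_entails(text_relations, hypothesis_relations):
    """Validate when every hypothesis relation also occurs in the text."""
    return set(hypothesis_relations) <= set(text_relations)

# Invented UNL-style triples: (relation, head, dependent).
T = {("agt", "write", "author"),   # agent of "write"
     ("obj", "write", "book"),     # object of "write"
     ("tim", "write", "2005")}     # time modifier
H = {("agt", "write", "author"),
     ("obj", "write", "book")}

print("validated" if unl_entails(T, H) else "rejected")  # → validated
```

The direction matters: every relation asserted by the hypothesis must be supported by the text, while the text may carry extra relations (here, the `tim` triple) without blocking validation.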

A statistics-based semantic textual entailment system

Advances in Artificial Intelligence, 2011

We present a Textual Entailment (TE) recognition system that uses semantic features based on the Universal Networking Language (UNL). The proposed TE system compares the UNL relations in both the text and the hypothesis to arrive at the two-way entailment decision. The system has been separately trained on each development corpus released as part of the Recognizing Textual Entailment (RTE) competitions RTE-1, RTE-2, RTE-3 and RTE-5 and tested on the respective RTE test sets.
