Bits_Pilani@INLI-FIRE-2017: Indian Native Language Identification using Deep Learning

Deep Learning-Based Language Identification in English-Hindi-Bengali Code-Mixed Social Media Corpora

Journal of Intelligent Systems

This article addresses language identification at the word level in Indian social media corpora taken from Facebook, Twitter and WhatsApp posts that exhibit code-mixing between English-Hindi, English-Bengali, as well as a blend of both language pairs. Code-mixing is a fusion of multiple languages previously associated mainly with spoken language, but which social media users also deploy in casual written communication. The noisy nature of code-mixed social media text makes language identification challenging. Here, the performance of deep learning on this task is compared to feature-based learning, with two Recurrent Neural Network techniques, Long Short-Term Memory (LSTM) and bidirectional LSTM, being contrasted with a Conditional Random Fields (CRF) classifier. The results show the deep learners outscoring the CRF, with the bidirectional LSTM demonstrating the best language identification performance.
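The paper's BiLSTM and CRF models are not reproduced here; as a minimal illustration of the word-level language identification task itself, the following toy sketch tags Roman-script words by character-bigram frequency profiles (the corpus, languages, and scoring rule below are invented for illustration, not the paper's method):

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Character bigrams of a word, padded with boundary markers."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_profiles(corpus):
    """Build a character n-gram frequency profile per language from labelled words."""
    profiles = {}
    for word, lang in corpus:
        profiles.setdefault(lang, Counter()).update(char_ngrams(word))
    return profiles

def identify(word, profiles):
    """Tag a word with the language whose profile overlaps its bigrams most."""
    grams = char_ngrams(word)
    scores = {lang: sum(prof[g] for g in grams) / (sum(prof.values()) or 1)
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

# Toy labelled data: Roman-script Hindi vs English words.
corpus = [("hai", "hi"), ("nahi", "hi"), ("kya", "hi"), ("bhai", "hi"),
          ("the", "en"), ("this", "en"), ("what", "en"), ("there", "en")]
profiles = train_profiles(corpus)
print([identify(w, profiles) for w in ["kyun", "that"]])  # → ['hi', 'en']
```

A real system would replace the frequency scoring with the BiLSTM tagger the abstract describes, which can also use the surrounding words as context.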

Native-Language Identification with Attention

2020

The paper explores how an attention-based approach can increase performance on the task of native-language identification (NLI), i.e., to identify an author’s first language given information expressed in a second language. Previously, Support Vector Machines have consistently outperformed deep learning-based methods on the TOEFL11 data set, the de facto standard for evaluating NLI systems. The attention-based system BERT (Bidirectional Encoder Representations from Transformers) was first tested in isolation on the TOEFL11 data set, then used in a metaclassifier stack in combination with traditional techniques to produce an accuracy of 0.853. However, more labelled NLI data is now available, so BERT was also trained on the much larger Reddit-L2 data set, containing 50 times as many examples as previously used for English NLI, giving an accuracy of 0.902 on the Reddit-L2 in-domain test scenario, improving the state-of-the-art by 21.2 percentage points.
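This is not the paper's actual metaclassifier; as a minimal sketch of the stacking idea, a meta-level rule combines class probabilities from base classifiers (the class labels, probabilities, and weights below are illustrative, not taken from the paper):

```python
def stack_predict(base_probs, meta_weights):
    """Metaclassifier: weighted combination of base classifiers' class probabilities."""
    classes = base_probs[0].keys()
    combined = {c: sum(w * p[c] for w, p in zip(meta_weights, base_probs))
                for c in classes}
    return max(combined, key=combined.get)

# Two hypothetical base classifiers (e.g. an SVM and BERT) disagreeing on an essay.
svm_out = {"HIN": 0.55, "TEL": 0.45}
bert_out = {"HIN": 0.30, "TEL": 0.70}
# Meta-level weights would be learned on held-out data; these values are invented.
print(stack_predict([svm_out, bert_out], [0.4, 0.6]))  # → TEL
```

In practice the meta-level is often itself a trained classifier over the base outputs rather than a fixed weighted vote; the weighted vote is the simplest instance of the same stacking structure.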

IIT (BHU) System for Indo-Aryan Language Identification (ILI) at VarDial 2018

2018

Text language identification is the Natural Language Processing task of recognizing the language of a piece of text from among many candidate languages. This paper describes our submission to the ILI 2018 shared task, which covers the identification of 5 closely related Indo-Aryan languages. We developed a word-level LSTM (Long Short-Term Memory) model, a specific type of Recurrent Neural Network, for this task. Given a sentence, our model converts each of its words into a trainable word embedding, feeds the embeddings into our LSTM network, and finally predicts the language. We obtained a macro F1 score of 0.836, ranking 5th in the task.
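A minimal sketch of the embedding-lookup step the abstract describes (the LSTM itself is omitted; the vocabulary, embedding dimension, and sentences are invented, and a real embedding table would be updated by training rather than left random):

```python
import random

def build_vocab(sentences):
    """Assign each distinct word an integer id, reserving 0 for unknown words."""
    vocab = {"<unk>": 0}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def embed(sentence, vocab, table):
    """Map each word to its index, then look up its trainable vector."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]
    return [table[i] for i in ids]

random.seed(0)
sentences = ["hamra ghar", "mera ghar"]
vocab = build_vocab(sentences)
dim = 4
table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
vectors = embed("mera ghar hai", vocab, table)
print(len(vectors), len(vectors[0]))  # one 4-dim vector per word → 3 4
```

The resulting sequence of vectors is what would be fed to the LSTM, whose final state feeds a softmax over the candidate languages.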

Indian Language Identification using Deep Learning

ITM Web of Conferences

Spoken language is the most common mode of communication today. Efforts to develop spoken language identification (SLID) systems for Indian languages have been quite limited due to problems of speaker availability and language readability. However, the need for SLID in civil and defence applications is growing day by day. Feature extraction is a basic and important step in LID. An audio sample is converted into a spectrogram, a visual representation that describes the spectrum of frequencies with respect to time. Three such spectrogram visuals were generated, namely the Log Spectrogram, Gammatonegram and IIR-CQT Spectrogram, for audio samples from the standardized IIIT-H Indic Speech Database. These visual representations capture language-specific details and the nature of each language. The spectrogram images were then used as input to a CNN. A classification accuracy of 98.86% was obtained using the proposed methodology.
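A simple log-spectrogram computation, assuming a plain Hanning-windowed short-time Fourier transform (the paper's Gammatonegram and IIR-CQT variants, and the CNN classifier, are not shown; frame length and hop size below are arbitrary choices):

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Short-time Fourier magnitudes on a log scale: a basic log spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum + eps)   # shape: (time frames, frequency bins)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)  # a 1-second 440 Hz test tone
spec = log_spectrogram(tone)
print(spec.shape)                   # → (61, 129)
```

For the 440 Hz tone, the energy concentrates near frequency bin 440 / (8000 / 256) ≈ 14, which is what makes such images discriminative inputs for a CNN.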

Character Level Convolutional Neural Network for Indo-Aryan Language Identification

This paper presents the systems submitted by the safina team to the Indo-Aryan Language Identification (ILI) shared task at the VarDial Evaluation Campaign 2018. The ILI shared task included 5 closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri and Magahi. The proposed approach uses a character-level convolutional neural network to distinguish the five languages. We submitted three models with the same architecture except for the first layer. The first system uses a one-hot character representation as input to the convolution layer. The second system uses an embedding layer before the convolution layer. The third system uses a recurrent layer before the convolution layer. The best results were obtained using the first model, achieving an 86.27% F1-score, ranked fourth among eight teams.
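A minimal sketch of the one-hot character representation used as input in the first submitted model (the alphabet and sample text here are illustrative; the paper's character set is not specified in this abstract):

```python
def one_hot_chars(text, alphabet):
    """One row per character: a 1 in that character's column, zeros elsewhere."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in text:
        row = [0] * len(alphabet)
        if ch in index:            # characters outside the alphabet become all-zero rows
            row[index[ch]] = 1
        rows.append(row)
    return rows

alphabet = "abcdefghijklmnopqrstuvwxyz "
matrix = one_hot_chars("kaise ho", alphabet)
print(len(matrix), len(matrix[0]))  # → 8 27
```

This character-by-alphabet matrix is what the convolution layer slides over; the second and third submitted systems replace it with a learned embedding or a recurrent layer's outputs, respectively.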

Towards developing tools for Indian Languages using Deep Learning

2019

Extensive research is being carried out in the field of Natural Language Processing (NLP) in the context of Indian languages. Spelling correction, word segmentation, and grammar checking are fundamental problems in NLP; the aim in each is to identify noise in the data and correct it. These tools are important for many NLP applications like web search engines, text summarization, sentiment analysis, machine translation, etc. Many methods have been developed for these problems for English, which usually exploit linguistic resources such as parsers and large amounts of real-world data, making them difficult to adapt to other languages. Deep learning models have also been implemented for these problems for English. These models use parallel data of noisy and correct mappings from different sources as training data for automatic correction tasks. Indian languages are resource-scarce and do not have such parallel data due to the low volume of queries and the non-existence of such p...

SeerNet@INLI-FIRE-2017: Hierarchical Ensemble for Indian Native Language Identification

2017

Native Language Identification has played an important role in forensics, primarily for author profiling and identification. In this work, we discuss our approach to the shared task of Indian Native Language Identification. The task is primarily to identify the native language of the writer from a given XML file which contains a set of Facebook comments in the English language. We propose a hierarchical ensemble approach which combines various machine learning techniques along with language-agnostic feature extraction to perform the final classification. Our hierarchical ensemble improves the TF-IDF based baseline accuracy by 3.9%. The proposed system stood 3rd across unique team submissions.
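A minimal sketch of the TF-IDF weighting the ensemble is compared against, assuming the plain tf × log(N/df) variant (many TF-IDF variants exist and the abstract does not specify which one; the documents below are invented):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights: term frequency scaled by inverse document frequency."""
    n = len(docs)
    df = Counter()                       # in how many documents each term appears
    for doc in docs:
        df.update(set(doc.split()))
    weights = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["i am going to market", "i am only happy", "market prices are high"]
w = tf_idf(docs)
print(round(w[0]["market"], 3))  # → 0.081
```

A TF-IDF baseline feeds these weights into a linear classifier; terms that occur in every document get weight log(N/N) = 0 and carry no signal.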

Sentences and Documents in Native Language Identification

Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018

In this paper we present work aimed at testing the most advanced, state-of-the-art syntactic parsers based on deep neural networks (DNN) on Italian. We ran a set of experiments using the Universal Dependencies benchmarks and propose a new solution based on ensemble systems, obtaining very good performance.

“ye word kis lang ka hai bhai?” Testing the Limits of Word level Language Identification

2014

Language identification is a necessary prerequisite for processing any user-generated text where the language is unknown. It becomes even more challenging when the text is code-mixed, i.e., two or more languages are used within the same text. Such data is commonly seen in social media, where further challenges might arise due to contractions and transliterations. The existing language identification systems are not designed to deal with code-mixed text and, as our experiments show, perform poorly on a synthetically created code-mixed dataset for 28 languages. We propose extensions to an existing approach for word-level language identification. Our technique not only outperforms the existing methods, but also makes no assumptions about the language pairs mixed in the text, a common requirement of existing word-level language identification systems. This study shows that word-level language identification is most likely to confuse languages which are linguistically related (e....
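A minimal sketch of how a synthetic code-mixed dataset might be assembled from parallel monolingual sentences while preserving word-level gold labels (the mixing rule below is invented for illustration; the paper's actual synthesis procedure is not described in this abstract):

```python
import random

def synthesize_code_mixed(sent_a, sent_b, lang_a, lang_b, p=0.5, seed=1):
    """Interleave words from two aligned sentences, keeping gold language labels."""
    random.seed(seed)                    # fixed seed for a reproducible sample
    mixed = []
    for wa, wb in zip(sent_a.split(), sent_b.split()):
        if random.random() < p:
            mixed.append((wa, lang_a))
        else:
            mixed.append((wb, lang_b))
    return mixed

pairs = synthesize_code_mixed("yeh kaam ho gaya", "this work is done", "hi", "en")
print(pairs)
```

Because every word keeps its source-language label, such synthetic data can evaluate a word-level identifier even when no manually annotated code-mixed corpus exists for a language pair.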

AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text

The growth of social media content, such as Twitter and Facebook messages and blog posts, has created many new opportunities for language technology. User-generated content such as tweets and blogs is, in most languages, written using Roman script due to distinct social, cultural and technological factors; some users write in their own language's script or in a mixed script. The primary challenge in processing short messages is identifying their languages, so language identification is not restricted to a single language but extends to multiple languages. The task is to label each word with one of the following categories: L1, L2, Named Entities, Mixed, Punctuation and Others. This paper presents the AmritaCEN_NLP team's participation in the FIRE 2015 Shared Task on Mixed Script Information Retrieval, Subtask 1: Query Word Labeling, which identifies the language of each word in a text along with Named Entities, Mixed, Punctuation and Others, using sequence-level query labelling with a Support Vector Machine.