AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text
Related papers
Query Labelling for Indic Languages using a hybrid approach
2015
With the boom in the internet, social media text has been increasing day by day. Much of the user-generated content on the internet is written in a very informal way. Usually people tend to write text on social media using indigenous script. To understand a script different from ours is a difficult task. Moreover, a large number of the queries received by search engines nowadays are transliterated text. Hence providing a common platform to deal with the problem of transliterated text becomes really important. This paper presents our approach to handle labeling of queries as part of the FIRE2015 shared task on Mixed-Script Information Retrieval. Tokens in the query are labeled on the basis of a hybrid approach which involves rule-based and machine learning techniques. Each annotation has been dealt with separately but sequentially.
Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text
Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context due to phenomena such as utterance internal code-switching, lexical borrowings, and phonetic typing; all implying that language identification in social media has to be carried out at the word level. The paper reports a study to detect language boundaries at the word level in chat message corpora in mixed English-Bengali and English-Hindi. We introduce a code-mixing index to evaluate the level of blending in the corpora and describe the performance of a system developed to separate multiple languages.
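The abstract above mentions a code-mixing index but does not give its formula. As a minimal sketch, the function below assumes one commonly cited form, 100 × (1 − max_i(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the number of tokens in language i. The function name, the tag strings, and the `neutral` parameter are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def code_mixing_index(tags, neutral=frozenset({"univ"})):
    """Return a code-mixing index in [0, 100) for one utterance.

    tags    -- one language tag per token, e.g. ["en", "bn", "univ"]
    neutral -- tags treated as language-independent (assumed label: "univ")
    """
    # Count only tokens that carry a language tag (n - u in the formula).
    lang_counts = Counter(t for t in tags if t not in neutral)
    lang_tokens = sum(lang_counts.values())
    if lang_tokens == 0:
        # No language-specific tokens: defined here as no mixing.
        return 0.0
    dominant = max(lang_counts.values())
    return 100.0 * (1 - dominant / lang_tokens)
```

Under this assumed definition, a monolingual utterance scores 0 and an utterance split evenly between two languages scores 50.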
Named Entity Recognition for Hindi-English Code-Mixed Social Media Text
Proceedings of the Seventh Named Entities Workshop, 2018
Named Entity Recognition (NER) is a major task in the field of Natural Language Processing (NLP), and is also a subtask of Information Extraction. The challenge of NER for tweets lies in the insufficient information available in a tweet. There has been a significant amount of work done related to entity extraction, but only for resource-rich languages and domains such as the newswire. Entity extraction is, in general, a challenging task for such informal text, and code-mixed text further complicates the process with its unstructured and incomplete information. We propose experiments with different machine learning classification algorithms with word, character and lexical features. The algorithms we experimented with are Decision Tree, Long Short-Term Memory (LSTM), and Conditional Random Field (CRF). In this paper, we present a corpus for NER in Hindi-English code-mixed text along with extensive experiments on our machine learning models, which achieved the best F1-score of 0.95 with both CRF and LSTM.
Vira@FIRE 2015: Entity Extraction from Social Media Text Indian Languages (ESM-IL)
2015
In this paper we have tried to identify and extract “Named Entities” from social media text using conditional random fields (CRF) [3]. The paper presents our working methodology and results on the Entity Extraction from Social Media Text Indian Languages task of FIRE-2015. We have extracted named entities from two languages, Hindi and English. The Named Entity Extraction system is implemented based on CRFSuite. CRFSuite [8] is a popular implementation of Conditional Random Fields (CRF). This is a sequential labelling task to achieve the desired tagging output. Conditional random fields (CRF) are a class of statistical modelling methods often applied in pattern recognition, machine learning and many natural language processing tasks. We get F1-scores of 19.82 and 3.72 for the Hindi and English text respectively.
HITS@FIRE task 2015: Twitter based Named Entity Recognizer for Indian Languages
Natural Language Processing (NLP), in its pure sense, is a platform that provides the ability to transform natural language text into useful information. Named Entity Recognition (NER) is a key task in NLP for classification of named entities in natural languages. Though there are several algorithms for named entity classification, identifying named entities in Twitter data is a demanding task. Loads of information are shared by people on Twitter on a daily basis. This information is unstructured and often contains important information about organizations, politics, disasters, promotional advertisements etc. In this paper, we provide an NER system that can effectively classify named entities in Twitter data for Indian languages such as English, Hindi and Tamil. POS, chunk, suffix and prefix information has been used for training a Conditional Random Fields (CRF) based NER model. CRF is a popular model for labeling and classification in text mining. Performance analysis was done using n-fold validation and F-measure. A maximum precision of 93.82 for English, 92.28 for Hindi and 86.94 for Tamil Twitter data was achieved through n-fold validation. Results provided by the ESM-IL shared task in terms of precision are 50.48 for English, 81.49 for Hindi and 70.42 for Tamil. The proposed algorithm has a higher classification accuracy, and it is achieved through n-fold validation.
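To illustrate the kind of per-token features such a CRF-based tagger consumes, here is a minimal sketch of prefix and suffix feature extraction. The feature names, the affix length of 3, and the extra surface cues (hashtag, mention, digit) are assumptions for illustration. POS and chunk features are omitted because they need an external tagger that the abstract does not name.

```python
def token_features(tokens, i, affix_len=3):
    """Build a feature dict for tokens[i]; all feature names are hypothetical."""
    word = tokens[i]
    feats = {
        "lower": word.lower(),
        "prefix": word[:affix_len],    # leading characters
        "suffix": word[-affix_len:],   # trailing characters
        "is_title": word.istitle(),
        "is_hashtag": word.startswith("#"),
        "is_mention": word.startswith("@"),
        "has_digit": any(c.isdigit() for c in word),
    }
    # Neighbouring words give the sequence model local context.
    feats["prev"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def sentence_features(tokens, affix_len=3):
    """Return one feature dict per token, in order."""
    return [token_features(tokens, i, affix_len) for i in range(len(tokens))]
```

A list of such dictionaries per sentence, paired with a label sequence of the same length, is the usual input shape for CRF toolkits that accept dictionary features.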
A New Methodology for Language Identification in Social Media Code-Mixed Text
2020
Nowadays, transliteration is one of the hot research areas in the field of Natural Language Processing. Transliteration means transferring a word from one language to another, and it is mostly used in cross-language platforms. Generally, people use code-mixed language for sharing their views on social media like Twitter, WhatsApp, etc. Code-mixed language means one language is written using another language's script, and it is very important to identify the language of each word in order to process this type of text. Therefore, a deep learning model is implemented using Bidirectional Long Short-Term Memory (BLSTM) for Indian social media texts in this paper. This model identifies the language of origin of each word in the sequence based on the specific words that have come before it in the sequence. The proposed model gives better accuracy with word embeddings than with character embeddings.
Automatic Word-level Identification of Language in Assamese -English -Hindi Code-mixed Data
2018
In this paper, we discuss the automatic identification of language in Assamese-English-Hindi code-mixed data at the word level. The data for this study was collected from public Facebook pages and was annotated using a minimal tagset for code-mixed data. A Support Vector Machine was trained using the total tagged dataset of approximately 20k tokens. The best performing classifier achieved a state-of-the-art accuracy of over 96%.
Language Identification of Hindi-English tweets using code-mixed BERT
ArXiv, 2021
Language identification of social media text has been an interesting problem of study in recent years. Social media messages are predominantly code-mixed in non-English speaking states. Prior knowledge from pre-training contextual embeddings has shown state-of-the-art results for a range of downstream tasks. Recently, models such as BERT have shown that, using a large amount of unlabeled data, pretrained language models are even more beneficial for learning common language representations. Extensive experiments exploiting transfer learning and fine-tuning BERT models to identify language on Twitter are presented in this paper. The work utilizes a data collection of Hindi-English-Urdu code-mixed text for language pre-training and Hindi-English code-mixed text for subsequent word-level language classification. The results show that the representations pre-trained over code-mixed data produce better results than their monolingual counterparts. Keywords: language identification, code-mixed text...
POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments
2015
We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and present a new feature set which addresses the transliteration problem inherent in social media. We achieve 84% accuracy with the new feature set. We show that the context and joint modelling of language detection and POS tag layers do not help in POS tagging.
Code-mix entity extraction for Hindi-English and Tamil-English tweets
Social media text holds information regarding various important aspects. Extraction of such information serves as the basis for the most preliminary task in Natural Language Processing, called entity extraction. The work is submitted as a part of the Shared Task on Code Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. Three different methodologies are proposed in this paper for the task of entity extraction for code-mix data. The proposed systems include approaches based on embedding models and a feature-based model. Creation of trigram embeddings and BIO tag formatting were done during feature extraction. Evaluation of the system is carried out using a machine learning based classifier, SVM-Light. Overall accuracy through cross validation has shown that the proposed system is also efficient in classifying unknown tokens.
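The BIO tag formatting mentioned above can be sketched as follows. This is an illustrative conversion from entity spans to per-token tags, not the authors' code. The span representation (token start, exclusive end, label) and the label names are assumptions.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    tokens -- list of token strings
    spans  -- list of (start, end, label) token ranges, end exclusive;
              spans are assumed not to overlap
    """
    tags = ["O"] * len(tokens)          # default: outside any entity
    for start, end, label in spans:
        tags[start] = "B-" + label      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # remaining tokens of the entity
    return tags


tokens = ["Sachin", "Tendulkar", "plays", "for", "Mumbai"]
spans = [(0, 2, "PERSON"), (4, 5, "LOCATION")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
```

Each token then carries exactly one tag, which lets a token-level classifier such as an SVM treat entity extraction as per-token classification.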