Query Labelling for Indic Languages using a hybrid approach
Related papers
IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search
This paper describes our submission for FIRE 2014 Shared Task on Transliterated Search. The shared task features two sub-tasks: Query word labeling and Mixed-script Ad hoc retrieval for Hindi Song Lyrics. Query Word Labeling is on token level language identification of query words in code-mixed queries and the transliteration of identified Indian language words into their native scripts. We have developed an SVM classifier for the token level language identification of query words and a decision tree classifier for transliteration. The second subtask for Mixed-script Ad hoc retrieval for Hindi Song Lyrics is to retrieve a ranked list of songs from a corpus of Hindi song lyrics given an input query in Devanagari or transliterated Roman script. We have used edit distance based query expansion and language modeling based pruning followed by relevance based re-ranking for the retrieval of relevant Hindi Song lyrics for a given query. We see that even though our approaches are not very sophisticated, they perform reasonably well. Our results show that these approaches may perform much better if more sophisticated features or ranking is used. Both of our systems are available for download and can be used for research purposes.
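As a rough illustration of edit distance based query expansion, the sketch below maps each query token to vocabulary words within a small Levenshtein distance. It is not the authors' system: the function names, the toy vocabulary and the `max_distance` threshold are assumptions made for the example, and the language-model pruning and re-ranking steps are not shown.

```python
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def expand_query(query: str, vocabulary: Set[str],
                 max_distance: int = 1) -> Dict[str, List[str]]:
    """Map each query token to vocabulary words within max_distance edits.

    max_distance is a hypothetical tuning parameter, not a value from the paper.
    """
    expansions = {}
    for token in query.lower().split():
        expansions[token] = sorted(
            word for word in vocabulary
            if edit_distance(token, word) <= max_distance
        )
    return expansions


# Toy vocabulary of Roman-script spellings (illustrative only).
toy_vocab = {"tum", "tumhi", "hi", "ho", "hai", "tu"}
expanded = expand_query("tum hi ho", toy_vocab, max_distance=1)
```

A linear scan over the vocabulary is adequate for a sketch; a large lyrics corpus would likely need an index over candidate spellings instead.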
Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval
2015
The Transliterated Search track has been organized for the third year in FIRE-2015. The track had three subtasks. Subtask I was on language labeling of words in code-mixed text fragments; it was conducted for 8 Indian languages: Bangla, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, mixed with English. Subtask II was on ad-hoc retrieval of Hindi film lyrics, movie reviews and astrology documents, where both the queries and documents were either in Hindi written in Devanagari or in Roman transliterated form. Subtask III was on transliterated question answering where the documents as well as questions were in Bangla script or Roman transliterated Bangla. A total of 24 runs were submitted by 10 teams, of which 14 runs were for subtask I and 10 runs for subtask II. There was no participation in Subtask III. The overview presents a comprehensive report of the subtasks, datasets, runs submitted and performances.
Labeling of Query Words using Conditional Random Field
This paper describes our approach on Query Word Labeling as an attempt in the shared task on Mixed Script Information Retrieval at Forum for Information Retrieval Evaluation (FIRE) 2015. The query is written in Roman script and the words were in English or transliterated from Indian regional languages. A total of eight Indian languages were present in addition to English. We also identified the Named Entities and special symbols as part of our task. A CRF based machine learning framework was used for labeling the individual words with their corresponding language labels. We used a dictionary based approach for language identification. We also took into account the context of the word while identifying the language. Our system demonstrated an overall accuracy of 75.5% for token level language identification. The strict F-measure scores for the identification of token level language labels for Bengali, English and Hindi are 0.7486, 0.892 and 0.7972 respectively. The overall weighted F-measure of our system was 0.7498.
AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text
The growth of social media content, such as Twitter and Facebook messages and blog posts, has created many new opportunities for language technology. User-generated content such as tweets and blogs in most languages is written in Roman script owing to distinct social culture and technology. Some of it uses the language's own script or a mixed script. The primary challenge in processing such short messages is identifying the languages. Language identification is therefore not restricted to a single language but extends to multiple languages. The task is to label the words with the following categories: L1, L2, Named Entities, Mixed, Punctuation and Others. This paper presents the AmritaCen_NLP team's participation in the FIRE2015 Shared Task on Mixed Script Information Retrieval, Subtask 1: Query Word Labeling, which covers language identification of each word in the text along with Named Entities, Mixed, Punctuation and Others, and uses sequence-level query labelling with a Support Vector Machine.
IIIT-H System Submission for FIRE 2014 Shared Task on Transliterated Search Conference
2015
This paper describes our submission for FIRE 2014 Shared Task on Transliterated Search. The shared task features two sub-tasks: Query word labeling and Mixed-script Ad hoc retrieval for Hindi Song Lyrics. Query Word Labeling is on token level language identification of query words in code-mixed queries and back-transliteration of identified Indian language words into their native scripts. We have developed letter based language models for the token level language identification of query words and a structured perceptron model for back-transliteration of Indic words. The second subtask for Mixed-script Ad hoc retrieval for Hindi Song Lyrics is to retrieve a ranked list of songs from a corpus of Hindi song lyrics given an input query in Devanagari or transliterated Roman script. We have used edit distance based query expansion and language modeling followed by relevance based reranking for the retrieval of relevant Hindi Song lyrics for a given query.
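A letter based language model for token level language identification can be sketched as one character n-gram model per language, with each token assigned to the language whose model scores it highest. The code below is a minimal illustration under stated assumptions, not the submitted system: the class name, the bigram order, add-one smoothing, the boundary markers and the toy training words are all hypothetical choices.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram language model with add-one smoothing (assumed design)."""

    def __init__(self, words, n=2):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of (n-1)-character prefixes
        self.alphabet = set()
        for word in words:
            padded = self._pad(word)
            self.alphabet.update(padded)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def _pad(self, word):
        # "^" and "$" mark the start and end of a word.
        return "^" * (self.n - 1) + word.lower() + "$"

    def log_prob(self, word, vocab_size):
        """Smoothed log-probability of the word's character sequence."""
        padded = self._pad(word)
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log((self.ngrams[gram] + 1) /
                              (self.contexts[gram[:-1]] + vocab_size))
        return total


def label_token(token, models):
    """Return the language whose model gives the token the highest score."""
    vocab_size = len(set().union(*(m.alphabet for m in models.values())))
    return max(models, key=lambda lang: models[lang].log_prob(token, vocab_size))


# Toy training words (illustrative only, not the shared-task data).
models = {
    "hi": CharNgramLM(["pyaar", "dil", "zindagi", "kya", "hai", "tum", "mera", "nahi"]),
    "en": CharNgramLM(["the", "love", "song", "what", "this", "you", "with", "night"]),
}
labels = [label_token(tok, models) for tok in "the dil hai".split()]
```

Because every model scores the same token, no length normalisation is needed when comparing the log-probabilities.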
Mixed script query labelling using supervised learning and Ad hoc retrieval using sub word indexing
Much of the user generated content on the internet is written in its transliterated form instead of in its indigenous script. As a result, search engines receive a large number of transliterated search queries. This paper presents our approach to handle labelling of queries and ad hoc retrieval of documents based on these queries, as part of the FIRE2014 shared task on transliterated search. The content of each document is written in either the native Devanagari script or its transliterated form in Roman script or a combination of both. The queries to retrieve these documents can also be in mixed script. The task is challenging primarily due to the spelling variations that occur in the transliterated form of search queries. This particular problem is addressed by using back transliteration to reduce spelling variations, and a set of hand-tailored rules for consonant mapping. Sub-word indexing is done to take care of breaking and joining of transliterated words. Implementation of query labelling of the mixed script content was done using a supervised learning approach where an SVM classifier was trained using character n-grams as features for language identification. A Naïve Bayes classifier was used for classifying transliterated words that can belong to both Hindi and English when looked at individually. The 2 runs submitted by our team BITS-Lipyantaran perform best across all metrics for Subtask 2 among all the teams that participated, with an MRR score of 0.8171 and a MAP score of 0.6421. Only the working notes have been added here. The full paper can be viewed at this link: http://dl.acm.org/citation.cfm?doid=2824864.2824873
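Character n-gram features of the kind mentioned in this abstract might be extracted as in the sketch below. Only the feature extraction is shown; the classifier training is omitted. The function name, the boundary markers `<` and `>`, and the n-gram range are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(word, n_min=2, n_max=3):
    """Count character n-grams of a word, with boundary markers added.

    The range n_min..n_max is a hypothetical default.
    """
    padded = "<" + word.lower() + ">"
    features = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            features[padded[i:i + n]] += 1
    return features


features = char_ngrams("dil")
```

Each word's counter can then be turned into a sparse feature vector for whatever classifier is used downstream.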
Transliterated Search using Syllabification Approach
2013
Machine transliteration refers to the process of automatic conversion of a word from one language to another without losing its phonological characteristics. In this work, we present our experiments performed in subtask-1 and subtask-2 as a part of the FIRE-2013 transliterated search task. In both the subtasks, the transliteration from Roman script to Devanagari script was performed using a syllabification approach that converted English into Hindi language. In the query labeling subtask, identification of English and Hindi words was performed using a hybrid approach that involved morphological analysis of English words and a corpus based approach to identify frequently occurring Hindi words. In the multi-script ad hoc retrieval of Hindi song lyrics subtask, the queries were formulated that contained both Roman and Devanagari script and Roman script for separate run submissions. The evaluation of our experiments achieved a higher recall value of query labeling in subtask-1 however the ...
Query expansion for mixed-script information retrieval
Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014
For many languages that use non-Roman based indigenous scripts (e.g., Arabic, Greek and Indic languages) one can often find a large amount of user generated transliterated content on the Web in the Roman script. Such content creates a monolingual or multilingual space with more than one script which we refer to as the Mixed-Script space. IR in the mixed-script space is challenging because queries written in either the native or the Roman script need to be matched to the documents written in both the scripts. Moreover, transliterated content features extensive spelling variations. In this paper, we formally introduce the concept of Mixed-Script IR, and through analysis of the query logs of the Bing search engine, estimate the prevalence and thereby establish the importance of this problem. We also give a principled solution to handle the mixed-script term matching and spelling variation where the terms across the scripts are modelled jointly in a deep-learning architecture and can be compared in a low-dimensional abstract space. We present an extensive empirical analysis of the proposed method along with the evaluation results in an ad-hoc retrieval setting of mixed-script IR where the proposed method achieves significantly better results (12% increase in MRR and 29% increase in MAP) compared to other state-of-the-art baselines.
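For readers unfamiliar with the two metrics quoted here, the sketch below computes Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) using their standard definitions. The function names and the toy ranked lists are assumptions for illustration; none of the numbers come from the paper.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Two toy queries with hypothetical document ids.
runs = [
    (["d1", "d2", "d3"], {"d2"}),
    (["d4", "d5", "d6"], {"d4", "d6"}),
]
mrr = mean_over_queries(reciprocal_rank, runs)
mean_ap = mean_over_queries(average_precision, runs)
```

In the toy data the first query finds its only relevant document at rank 2, and the second finds relevant documents at ranks 1 and 3.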
Machine Learning Approach for Language Identification & Transliteration
Proceedings of the Forum for Information Retrieval Evaluation on - FIRE '14, 2015
In this paper, we describe the system that we developed as part of our participation in the FIRE-2014 Shared Task on Transliterated Search. We participated only in Subtask 1, which focused on labeling the query words. The entire process consists of the following subtasks: language identification of each word in the text, named entity recognition and classification (NERC) and transliteration of the Indian language words written in non-native scripts to the corresponding native Indic scripts. The proposed methods of language identification and NERC are based on supervised approaches, where we use several machine learning algorithms. We develop a transliteration framework which is based on the modified joint source channel model. Experiments on the benchmark setup show that we achieve quite encouraging performance for both pairs of languages. It is also to be noted that we did not make use of any deep domain-specific resources and/or tools, and therefore this can be easily adapted to other domains and/or languages.