Anand Kumar M | Amrita Vishwa Vidyapeetham (original) (raw)

Papers by Anand Kumar M

Transliteration is the process of replacing the characters in one language with the corresponding... more Transliteration is the process of replacing the characters in one language with the corresponding phonetically equivalent characters of the other language. India is a language diversified country where people speak and understand many languages but does not know the script of some of these languages. Transliteration plays a major role in such cases. Transliteration has been a supporting tool in machine translation and cross language information retrieval systems as most of the proper nouns are out of vocabulary words. In this paper, a sequence learning method for transliterating named entities from Tamil to Hindi is proposed. Through this approach, accuracy obtained is encouraging. This transliteration system can be embedded with Tamil to Hindi machine translation system in future.

This paper proposes a morphology based Factored Statistical Machine Translation (SMT) system for ... more This paper proposes a morphology based Factored Statistical Machine Translation (SMT) system for translating English language sentences into Tamil language sentences. Automatic translation from English into morphologically rich languages like Tamil is a challenging task. Morphologically rich languages need extensive morphological pre-processing before the SMT training to make the source language structurally similar to target language. English and Tamil languages have disparate morphological and syntactical structure. Because of the highly rich morphological nature of the Tamil language, a simple lexical mapping alone does not help for retrieving and mapping all the morpho-syntactic information from the English language sentences. The main objective of this proposed work is to develop a machine translation system from English to Tamil using a novel pre-processing methodology. This pre-processing methodology is used to pre-process the English language sentences according to the Tamil language. These pre-processed sentences are given to the factored Statistical Machine Translation models for training. Finally, the Tamil morphological generator is used for generating a new surface word-form from the output factors of SMT. Experiments are conducted with nine different type of models, which are trained, tuned and tested with the help of general domain corpora and developed linguistic tools. These models are different combinations of developed pre-processing tools with baseline models and factored models and the accuracies are evaluated using the well known evaluation metric BLEU and METOR. In addition, accuracies are also compared with the existing online " Google-Translate " machine translation system. Results show that the proposed method significantly outperforms the other models and the existing system.

—This paper is based on morphological analyzer using machine learning approach for complex agglut... more —This paper is based on morphological analyzer using machine learning approach for complex agglutinative natural languages. Morphological analysis is concerned with retrieving the structure, the syntactic and morphological properties or the meaning of a morphologically complex word. The morphology structure of agglutinative language is unique and capturing its complexity in a machine analyzable and generatable format is a challenging job. Generally rule based approaches are used for building morphological analyzer system. In rule based approaches what works in the forward direction may not work in the backward direction. This new and state of the art machine learning approach based on sequence labeling and training by kernel methods captures the non-linear relationships in the different aspect of morphological features of natural languages in a better and simpler way. The overall accuracy obtained for the morphologically rich agglutinative language (Tamil) was really encouraging.

— This paper presents the chunker for Tamil using Machine learning techniques. Chunking is the ta... more — This paper presents the chunker for Tamil using Machine learning techniques. Chunking is the task of identifying and segmenting the text into syntactically correlated word groups. The chunking is done by the machine learning techniques, where the linguistical knowledge is automatically extracted from the annotated corpus. We have developed our own tagset for annotating the corpus, which is used for training and testing the POS tagger generator and the chunker. The present tagset consists of thirty tags for POS and nine tags for chunking. A corpus size of two hundred and twenty five thousand words was used for training and testing the accuracy of the Chunker. We found that CRF++ affords the most encouraging result for Tamil chunker.

—Grammar plays an important role in good communication. Learning grammar rules for Tamil language... more —Grammar plays an important role in good communication. Learning grammar rules for Tamil language is very difficult as they have a very rich morphological structure which is agglutinative. Students get annoyed with the language rules and the old teaching methodology. Computer assisted Grammar Teaching Tools makes students to learn faster and better. NLP applications are used to generate such tools for curriculum enhancement of the students. In this paper we present the Grammar teaching tools in the sentence and word analyzing level for Tamil Language. The tools like Parts of speech Tagger, Chunker and Dependency parser for the sentence level analysis and Morphological Analyzer and Generator for the word level analysis were developed using machine learning based technology. These tools were very useful for second language learners to understand the word and sentence construction in a non-conceptual way. An user interface is developed for the practical usage of the tool.

In this paper, we presented a morphological analyzer for the classical Dravidian language Telugu ... more In this paper, we presented a morphological analyzer for the classical Dravidian language Telugu using machine learning approach. Morphological analyzer is a computer program that analyses the words belonging to Natural Languages and produces its grammatical structure as output. Telugu language is highly inflection and suffixation oriented, therefore developing the morphological analyzer for Telugu is a significant task. The developed morphological ana-lyzer is based on sequence labeling and training by kernel methods, it captures the non-linear relationships and various morphological features of Telugu language in a better and simpler way. This approach is more efficient than other morphological analyzers which were based on rules. In rule based approach every rule is depends on the previous rule. So if one rule fails, it will affect the entire rule that follows. Regarding the accuracy our system significantly achieves a very competitive accuracy of 94% and 97% in case of Telugu Verbs and nouns. Morphological analyzer for Tamil and Malayalam was also developed by using this approach.

An efficient and reliable method for implementing Morphological Analyzer for Malayalam using Mach... more An efficient and reliable method for implementing Morphological Analyzer for Malayalam using Machine Learning approach has been presented here. A Morphological Analyzer segments words into morphemes and analyze word formation. Morphemes are smallest meaning bearing units in a language. Morphological Analysis is one of the techniques used in formal reading and writing. Rule based approaches are generally used for building Morphological Analyzer. The disadvantage of using rule based approaches are that if one rule fails it will affect the entire rule that follows, that is each rule works on the output of previous rule. The significance of using machine learning approach arises from the fact that rules are learned automatically from data, uses learning and classification algorithms to learn models and make predictions. The result shows that the system is very effective and after learning it predicts correct grammatical features even for words which are not in the training set.

Clause boundary identification is a very important task in natural language processing. Identifyi... more Clause boundary identification is a very important task in natural language processing. Identifying the clauses in the sentence becomes a tough task if the clauses are embedded inside other clauses in the sentence. In our approach, we use the dependency parser to identify the boundary for the clause. The dependency tag set, contains 11 tags, and is useful for identifying the boundary of the clause along with the identification of the subject and object information of the sentence. The MALT parser is used to get the required information about the sentence.

— In recent years, with the development of technology, life has become very easy. Computers have ... more — In recent years, with the development of technology, life has become very easy. Computers have become the life line of today's high-tech world. There is no work in our whole day without the use of computers. When we focus particularly in the field of education, people started preferring to e-books than carrying textbooks. In the phase of learning, visualization plays a major role. When the visualization tool and auditory learning comes together, it brings the in-depth understanding of data and their phoneme sequence through animation and with proper pronunciation of the words, which is far better than the people learning from the textbooks and imagining in their perspective and have their own pronunciation. Scratch with its visual, block-based programming platform is widely used among high school kids to learn programming basics. We investigated that in many schools around the world uses this scratch for students to learn programming basics. Literature review shows that students find it interesting and are very curious about it. This made us anxious towards natural language learning using scratch because of its interesting visual platform. This paper is based on the concept of visual and auditory learning. Here, we described how we make use of this scratch toolkit for learning the secondary language. We also claim that this visual learning will help people remember easily than to read as texts in books and the auditory learning helps in proper pronunciation of words rather than expecting someone's help. We have developed a scratch based tool for learning simple sentence construction of secondary language through primary language. In this paper, languages used are English (secondary language) and Tamil (primary language). This is an enterprise for language learning tool in scratch. This is applicable for other language specific exercises and can be adopted easily for other languages too.

Non-native English writers often make preposition errors in English language. The most commonly o... more Non-native English writers often make preposition errors in English language. The most commonly occurring preposition errors are preposition replacement, preposition missing and unwanted preposition. So, in this method, a system is developed for finding and handling the English preposition errors in preposition replacement case. The proposed method applies 2-Singular Value Decomposition (SVD 2) concept for data decomposition resulting in fast calculation and these features are given for classification using Support Vector Machines (SVM) classifier which obtains an overaU accuracy above 90%. Features are retrieved using novel SVD 2 based method applied on trigrams which is having a preposition in the middle of the context. A matrix with the left and right vectors of each word in the trigram is computed for applying SVD 2 concept and these features are used for supervised classification. Preliminary results show that this novel feature extraction and dimensionality reduction method is the appropriate method for handling preposition errors.

— Between the growth of Internet or World Wide Web (WWW) and the emersion of the social networkin... more — Between the growth of Internet or World Wide Web (WWW) and the emersion of the social networking site like Friendster, Myspace etc., information society started facing exhilarating challenges in language technology applications such as Machine Translation (MT) and Information Retrieval (IR). Nevertheless, there were researchers working in Machine Translation that deal with real time information for over 50 years since the first computer has come along. Merely, the need for translating data has become larger than before as the world was getting together through social media. Especially, translating proper nouns and technical terms has become openly challenging task in Machine Translation. The Machine transliteration was emerged as a part of information retrieval and machine translation projects to translate the Named Entities based on phoneme and grapheme, hence, those are not registered in the dictionary. Many researchers have used approaches such as conventional Graphical models and also adopted other machine translation techniques for Machine Transliteration. Machine Transliteration was always looked as a Machine Learning Problem. In this paper, we presented a new area of Machine Learning approach termed as a Deep Learning for improving the bilingual machine transliteration task for Tamil and English languages with limited corpus. This technique precedes Artificial Intelligence. The system is built on Deep Belief Network (DBN), a generative graphical model, which has been proved to work well with other Machine Learning problem. We have obtained 79.46% accuracy for English to Tamil transliteration task and 78.4 % for Tamil to English transliteration.

This paper explores the full-fledged supervised Machine Learning based approach for the automatic... more This paper explores the full-fledged supervised Machine Learning based approach for the automatic extraction of lexical chunks, commonly called as Multi-Word Expression (MWE). The concept of MWE concerns a variety of constructions in everyday language in the form of idioms, phrasal verbs and noun compounds. The pervasiveness of MWE in the NLP tasks that deals with real text, such as Machine Translation and Information retrieval should be provided with enough MWE treatment; if not, the system will fail to generate high-quality natural output. Here, we are extracting phrasal verbs from the English movie subtitle corpus based on their corresponding linguistic pattern and standard association scores. The extracted phrasal verbs have been used to train various machine learning algorithms for discriminating MWE. Two methods of linguistic pattern extraction are implemented, out of which one is proven to be effective. Here, we have demonstrated two major findings, 1) MWE extraction based on dependency information along with POS tag provides better accuracy than it had been extracted from the POS tag pattern alone. 2) The result obtained from extraction is used to train three different machine learning classifiers, out of which Random forest classifier is verified to be the suitable classifier for the application handled.

This paper describes Named Entity Recognition systems for English, Hindi, Tamil and Malayalam lan... more This paper describes Named Entity Recognition systems for English, Hindi, Tamil and Malayalam languages. This paper presents our working methodology and results on Named Entity Recognition (NER) Task of FIRE-2014. In this work, English NER system is implemented based on Conditional Random Fields (CRF) which makes use of the model learned from the training corpus and feature representation, for tagging. The other systems for Hindi, Tamil and Malayalam are based on Support Vector Machine (SVM). In addition to training corpus, Tamil and Malayalam system uses gazetteer information.

This paper presents the method of Morpheme Extraction and lemmatization for Tamil language in Mor... more This paper presents the method of Morpheme Extraction and lemmatization for Tamil language in Morpheme Extraction Task (MET) of FIRE-2014. Tamil is a morphologically rich and agglutinative language. Such a language needs deeper analysis at the word level to capture the meaning of the word from its morphemes and its categories. In this attempt, the methodology employed to extract Tamil morphemes and lemmas are based on a supervised machine learning algorithm for nouns and verbs and simple suffix stripping for pronouns and proper nouns. Morphemes are extracted for other Part-of-Speech categories using Tamil Part of Speech tagger. In supervised learning, Morphological analyzer problem is redefined as a classification problem. We decompose the problem of noun and verb morpheme extraction into two sub-problems: learning to perform morpheme identification of words in a text, and learning to perform morpheme tagging. In addition to the Morpheme extraction task results of FIRE-2014, we have carried out different experiments to show the effectiveness of the proposed method.

A short text gets updated every now and then. With the global upswing of such micro posts, the ne... more A short text gets updated every now and then. With the global upswing of such micro posts, the need to retrieve information from them also seems to be incumbent. This work focuses on the knowledge extraction from the micro posts by having entity as evidence. Here the extracted entities are then linked to their relevant DBpedia source by featurization, Part Of Speech (POS) tagging, Named Entity Recognition (NER) and Word Sense Disambiguation (WSD). This short paper encompasses its contribution to #Micropost2015-NEEL task by experimenting existing Machine Learning (ML) algorithms.

The present work is done as part of shared task in Sentiment Analysis in Indian Languages (SAIL 2... more The present work is done as part of shared task in Sentiment Analysis in Indian Languages (SAIL 2015), under constrained category. The task is to classify the twitter data into three polarity categories such as positive, negative and neutral. For training, twitter dataset under three languages were provided Hindi, Bengali and Tamil. In this shared task, ours is the only team who participated in all the three languages. Each dataset contained three separate categories of twitter data namely positive, negative and neutral. The proposed method used binary features, statistical features generated from SentiWordNet, and word presence (binary feature). Due to the sparse nature of the generated features, the input features were mapped to a random Fourier feature space to get a separation and performed a linear classification using regularized least square method. The proposed method identified more negative tweets in the test data provided Hindi and Bengali language. In test tweet for Tamil language, positive tweets were identified more than other two polarity categories. Due to the lack of language specific features and sentiment oriented features, the tweets under neutral were less identified and also caused misclassifications in all the three polarity categories. This motivates to take forward our research in this area with the proposed method.

Story books are copiously filled with image illustration in which the illustrations are essential... more Story books are copiously filled with image illustration in which the illustrations are essential to the enjoyment and understanding of the story. Often the photos themselves turn out to be more important than content. In such cases, our principle job is to locate the best pictures to show. Stories composed for kids must be improved with pictures to manage the enthusiasm of a tyke, for words usually can't do a picture justice. This system is built as a part of shared task of Forum of Information Retrieval and Evaluation (FIRE) 2015 workshop. In this system we provide a methodology for automatically illustrating a given Children's story using the Wikipedia ImageCLEF 2010 dataset, with appropriate images for better learning and understanding.

This contemporary work is done as a slice of the shared task on Entity Extraction from Social Med... more This contemporary work is done as a slice of the shared task on Entity Extraction from Social Media Text Indian Languages in Forum for Information Retrieval and Evaluation (FIRE2015). Nowadays people are extensively using social media platforms like Face book, Twitter, etc, to exchange their thoughts. The twitter messages are growing rapidly and their style and short nature present a new challenge in language technology field. This extensive amount of textual data is also increases the interest in Information Extraction (IE) on such textual data. Named entity extraction is one of the essential tasks in Information Extraction, aims to extract and classify entities from text. The performance of the present standard language processing tools is severely affected on Tweet messages. Hence, different improvised and non-improvised algorithms are necessary for extracting these entities from the informal text. This paper deals with the extracting the Named Entities from twitter messages of four Indian Languages. The extraction of the Named entity relies mainly on the domain specific features and conventional features. A well known supervised algorithm, Support Vector Machine (SVM) is used to extracting the entities.

We present a methodology for naturally grouping the estimation of Twitter messages. Miniaturized ... more We present a methodology for naturally grouping the estimation of Twitter messages. Miniaturized scale websites are a testing new wellspring of data for information mining methods. The aim of this paper is to focus the careful feeling of the information from the microblogging site Twitter. Tweets regularly likewise contain URLs to different sites. Tweets additionally contain a certain measure of OOV (Out-Of-Vocabulary) words, for example, Hash tags, a labeling framework for points permitting Tweets in a comparative vein of discussion to be found. Other OOV words incorporate notice which is a system to direct a Tweet to one or more users. The KH coder tool gives a conventional precision result where the content is POS labeled and MySQL is utilized for putting away points of interest as a part of the database. The R tool is utilized to view the factual examination of information. Further, machine learning calculation has likewise been performed. A preprocessing and highlight choice system in blend with a Maximum Entropy, Naive Bayes and Decision Tree classifiers has been exhibited and sensible results has been delivered. Accuracy of the machine adapting methods for sentiment has been thought about and statistical representation of the classes has been depicted through KH Coder.

The progression of social media contents, similar like Twitter and Facebook messages and blog pos... more The progression of social media contents, similar like Twitter and Facebook messages and blog post, has created, many new opportunities for language technology. The user generated contents such as tweets and blogs in most of the languages are written using Roman script due to distinct social culture and technology. Some of them using own language script and mixed script. The primary challenges in process the short message is identifying languages. Therefore, the language identification is not restricted to a language but also to multiple languages. The task is to label the words with the following categories L1, L2, Named Entities, Mixed, Punctuation and Others This paper presents the AmritaCen_NLP team participation in FIRE2015-Shared Task on Mixed Script Information Retrieval Subtask 1: Query Word Labeling on language identification of each word in text, Named Entities, Mixed, Punctuation and Others which uses sequence level query labelling with Support Vector Machine.