Two Run Morphological Analysis for POS Tagging of Untagged Words

International Journal of Computer Science Issues, 2014

In this paper, we present morphology-based automatic tagging for Telugu that requires no machine learning algorithm or training data. We believe that, in inflectional and agglutinating languages, the critical information required for tagging comes more from word-internal structure than from context, and we show how a well-designed morphological analyzer can assign correct tags and also resolve many cases of tag ambiguity. We have used a fine-grained, hierarchical tag set that carries not only morpho-syntactic information but also those aspects of lexical and semantic information that are necessary or useful for syntactic parsing. We give details of our experiments and the results obtained. We believe our approach can also be applied to other Dravidian languages.
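The idea of reading a tag off word-internal structure can be sketched as follows. This is a minimal illustration, not the authors' analyzer: the roots, suffixes and tag labels are hypothetical placeholders rather than real Telugu data, and the sketch imposes no constraints on suffix order, which a real analyzer would need.

```python
# Toy lexicon and suffix table. All forms and labels are invented
# placeholders for illustration; they are not real Telugu morphology.
ROOTS = {"mara": "N", "nada": "V"}
SUFFIXES = {
    "N": [("lu", "PL"), ("ki", "DAT")],
    "V": [("tu", "PROG"), ("nu", "1SG")],
}


def analyze(word):
    """Return every hierarchical tag (e.g. 'N-PL-DAT') the tables license."""
    results = []
    for root, category in ROOTS.items():
        if word.startswith(root):
            _strip(word[len(root):], category, [category], results)
    return results


def _strip(rest, category, tags, results):
    """Peel suffixes off `rest` one at a time, collecting feature labels."""
    if not rest:
        # The whole word has been consumed: record one complete analysis.
        results.append("-".join(tags))
        return
    for suffix, feature in SUFFIXES[category]:
        if rest.startswith(suffix):
            _strip(rest[len(suffix):], category, tags + [feature], results)


print(analyze("maraluki"))  # noun root + plural + dative
print(analyze("maratu"))    # verb suffix on a noun root: no analysis
```

Because only category-appropriate suffixes are tried, a form that would be ambiguous as a bare string can be rejected or resolved by its internal structure, with no training data involved.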

Hindi POS Tagger Using Naive Stemming : Harnessing Morphological Information Without Extensive Linguistic Knowledge

2008

Part-of-speech tagging for Indian languages in general, and Hindi in particular, is not a widely explored territory. There have been many attempts at developing a good POS tagger for Hindi, but the morphological complexity of the language makes it a hard nut to crack. Some of the best taggers available for Indian languages employ hybrids of machine learning or stochastic methods and linguistic knowledge. Though the results achieved using such methods are good, their practicability for other inflective Indian languages is reduced by their heavy dependence on linguistic knowledge. Even though taggers can achieve very good results when provided with good morphological information, the cost of creating these resources renders such methods impractical.

A Novel Approach to Morphological Analysis for Tamil Language

This paper presents a morphological analyzer for Tamil, a complex agglutinative language, built with a machine learning approach. Morphological analysis is concerned with retrieving the structure, syntactic rules, morphological properties and meaning of a morphologically complex word. The morphological structure of an agglutinative language is unique, and capturing its complexity in a machine-analyzable and generatable format is a challenging job. Morphological analyzers are generally built with a rule-based approach, in which what works in the forward direction may not work in the backward direction. The novel approach presented here is based on sequence labeling and training by kernel methods. It captures the non-linear relationships and various morphological features of Tamil in a better and simpler way. We compare our system with the existing morphological analyzers available online. Our system significantly outperforms the existing morphological analyzers and achieves a very competitive accuracy of 95.65% for Tamil.

1 Introduction

Morphological analysis is the process of segmenting words into morphemes and analyzing word formation. It is a primary step in many types of text analysis for any language. Morphological analyzers are used in search engines to retrieve documents from keywords (Daelemans Walter et al., 2004), where they increase recall. They are also used in speech synthesizers, speech recognizers, lemmatization, noun decompounding, spelling and grammar checkers, and machine translation. Tamil is morphologically rich and agglutinative. Each root word is affixed with several morphemes to generate word forms. Generally, Tamil inflections are attached after the root word. Computationally, each root word can take a few thousand inflected word forms, of which only a few hundred will occur in a typical corpus. To analyze such inflectionally rich languages, the root and the morphemes of each word have to be identified. Morphological analyzers are generally built with rule-based approaches (Rajendran.S et al., 2001). We have implemented a novel method for the morphological analysis of Tamil using a machine learning approach.

2 Challenges in Morphological Analyzer for Tamil

The morphological structure of Tamil is quite complex: verbs inflect for person, gender and number, and also combine with auxiliaries that indicate aspect, mood, causation, attitude, etc. Nouns inflect with plural, oblique, case, postposition and clitic suffixes. To analyze such inflectionally rich languages, the root and the morphemes of each word have to be identified. The structure of the verbal complex is unique, and capturing this complexity in a machine-analyzable and generatable format is a challenging job. The formation of the verbal complex involves the arrangement of the verbal units and the interpretation of their combinatory meaning. Phonology also plays its part in the formation of

A TELUGU MORPHOLOGICAL ANALYZER

A Morphological Analyzer (MA) is a program that compiles and analyses the words of a natural language into their roots and constituent morpho-syntactic elements, along with their attributes. The present paper demonstrates a computational implementation of a Morphological Analyzer for Telugu. The algorithm used to build this MA is theoretically justified and is practically executed for Telugu in the context of the Modern Standard Written variety. The present proposal demonstrates the optimal organization of a linguistic database and its performance in a computational environment, ensuring high precision and coverage in the parsing of wordforms. The current MA engine's coverage may range between 95-97% on a variety of corpora (a corpus of 3 million words).
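The abstract does not define how coverage is measured. One common reading, assumed here, is the percentage of corpus tokens for which the analyzer returns at least one analysis. The `coverage` function and the stand-in analyzer below are hypothetical names used only for this sketch.

```python
def coverage(tokens, analyze):
    """Percentage of tokens that receive at least one analysis.

    `analyze` is any callable that returns a (possibly empty) list of
    analyses for a word; it stands in for a real analyzer engine.
    """
    if not tokens:
        return 0.0
    analyzed = sum(1 for token in tokens if analyze(token))
    return 100.0 * analyzed / len(tokens)


# Stand-in analyzer for demonstration: it only "knows" two word forms.
known = {"alpha": ["N"], "beta": ["V"]}
corpus = ["alpha", "beta", "gamma", "alpha"]
print(coverage(corpus, lambda word: known.get(word, [])))
```

Measured this way, coverage says nothing about whether the returned analyses are correct, which is why precision is reported separately.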

Morphological Analyzer for Classical Tamil Texts - A Rule based approach

This paper describes work on building a Morphological Analyzer for Classical Tamil texts using a rule-based approach. The rule-based approach has been used successfully in developing many natural language processing systems. Systems that use rule-based transformations are built on a core of solid linguistic knowledge. The linguistic knowledge acquired for one natural language processing system may be reused to build the knowledge required for a similar task in another system. The advantage of the rule-based approach over the corpus-based approach is clear for: 1) less-resourced languages, for which large corpora, possibly parallel or bilingual, with representative structures and entities are neither available nor easily affordable, and 2) morphologically rich languages, which suffer from data sparseness even when corpora are available. Morphology is the study of the internal structure of words. Morphological analysis is the process of segmenting words into morphemes and analyzing word formation. A morphological analyzer is a basic tool for any type of Natural Language Processing work. It is a computer program that takes words as input and produces their grammatical structure as output. It identifies and segments the words and assigns grammatical information to them. Capturing the agglutinative structure of Tamil words in an automatic system is a challenging job. This paper presents a rule-based approach for classical Tamil texts

A New Approach to Tagging in Indian Languages

Research in Computing Science

In this paper, we present a new approach to automatic tagging that requires no machine learning algorithm or training data. We argue that the critical information required for tagging comes more from word-internal structure than from context, and we show how a well-designed morphological analyzer can assign correct tags and also resolve many cases of tag ambiguity. The crux of the approach is in the very definition of words. While others simply tokenize a given sentence on spaces and take these tokens to be words, we argue that words need to be motivated by semantic and syntactic considerations, not orthographic conventions. We have worked on Telugu and Kannada, and in this paper we take the example of Telugu and show how high-quality tagging can be achieved with a fine-grained, hierarchical tag set that carries not only morpho-syntactic information but also those aspects of lexical and semantic information that are necessary or useful for syntactic parsing. In fact, entire corpora can be tagged very fast and with a good degree of guarantee of quality. We give details of our experiments and the results obtained. We believe our approach can also be applied to other languages.

The Hindi Named Entity Recognizer Using Hybrid Morphological Analyzer Framework

Named Entity Recognition (NER) and morphological analysis have emerged as Natural Language Processing (NLP) technologies that are very effective and can therefore be used in various kinds of applications, such as Information Retrieval (IR), Question Answering (QA), Information Extraction (IE) and text clustering. NER is used to identify the proper nouns present in a text and to classify them into types of named entity such as people, locations and organizations. Morphology is the field of linguistics that studies the internal structure of words. Morphological analysis means taking a word as input and identifying its number, gender, word formation and POS tag. Our system takes a sentence, finds the named entities in it and identifies the morphemes of each word in the sentence.

Using morphological analyzer to statistical POS Tagging on Persian Text

Due to the growing number of textual resources available in digital form, the ability to understand and process them automatically has recently become critical. The first fundamental step in understanding these resources is identifying the part of speech of each token or word in a sentence in order to disambiguate it. Part-of-speech (POS) tagging is one of the tools for understanding and processing natural language, and it is an infrastructural stage in some speech and text processing applications. Several methods have been presented for POS tagging, each of which has been applied in taggers in order to achieve high performance and accuracy. Statistical methods have been among the primary techniques and have produced the most successful results in natural language processing in recent years. This success has carried over to other areas of natural language processing, where statistical methods are also very popular. One of the most important issues in POS tagging systems is handling unknown words. In this paper, we use a morphological analyzer to identify unknown words. Before tagging, the words are checked morphologically and an appropriate tag is assigned to each, which increases the overall accuracy. We have used 5-fold cross-validation to evaluate the proposed tagger. The experimental results indicate that text pre-processing and the morphological analyzer are very effective in the proposed POS tagger and improve the performance of the tagging system.
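The combination described above can be sketched under explicit assumptions. The abstract does not name the statistical model, so a most-frequent-tag (unigram) lookup stands in for it here. The suffix table passed to `guess_by_suffix` is a hypothetical stand-in for a real morphological analyzer, and the example words are English placeholders, not Persian data.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each known word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def guess_by_suffix(word, suffix_tags, default="N"):
    """Fallback for unknown words: the longest matching suffix decides."""
    for suffix in sorted(suffix_tags, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix_tags[suffix]
    return default


def tag(words, model, suffix_tags):
    """Use the statistical model when it knows the word, else the fallback."""
    return [model.get(w) or guess_by_suffix(w, suffix_tags) for w in words]


def k_fold(items, k=5):
    """Yield (train, test) splits; each item is in exactly one test fold."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


model = train_unigram([[("dog", "N"), ("runs", "V")]])
print(tag(["dog", "jumped"], model, {"ed": "V"}))
```

In an evaluation loop, each `(train, test)` pair from `k_fold` would train the model on four fifths of the tagged sentences and score it on the remaining fifth, with the five scores averaged.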

Morphological Analyzer for Marathi using NLP

Morphology is the part of linguistics that deals with the study of words, i.e. their internal structure and, in part, their meanings. A morphological analyzer is a program that analyzes the morphology of an input word; it detects the morphemes of any text. Current techniques provide only a dictionary that defines the meaning of a word, without giving any grammatical explanation of it. In the proposed system, we evaluate a morphological analyzer for Marathi, an inflectional language, and also produce a parse tree, i.e. a grammatical structure. We plug the morphological analyzer into a statistical POS tagger and chunker to see its impact on their performance, so as to confirm its usability as a foundation for NLP applications.

Morphological Analyzer for Classical Tamil Texts: A Rule-based approach for Case Marker

This paper describes work on building a Morphological Analyzer for Classical Tamil using a rule-based approach. Morphology is the study of the internal structure of words. Morphological analysis is the process of segmenting words into morphemes and analyzing word formation. A morphological analyzer is a basic tool for any type of Natural Language Processing work. It is a computer program that takes words as input and produces their grammatical structure as output. It identifies and segments the words and assigns grammatical information to them. Capturing the agglutinative structure of Tamil words in an automatic system is a challenging job. This paper presents a rule-based approach for case markers.

