Dependency Parsing in Indian Languages Research Papers
Syntactic parsing in NLP is the task of working out the grammatical structure of sentences. Purely formal approaches to parsing, such as phrase structure grammar and dependency grammar, have been successfully employed for a variety of languages. While phrase-structure-based constituent analysis is possible for fixed-word-order languages such as English, dependency analysis between grammatical units has proved suitable for many free-word-order languages, including Indian languages. All these parsing approaches rely on identifying linguistic units by their formal syntactic properties and establishing the relationships between such units in the form of a tree. Dravidian languages, which are spoken in Southern India, are morphologically rich, agglutinative languages. Characterizing them in purely structural terms such as adjectives, adverbs, conjunctions and postpositions, and applying traditional interpretations of tense and finiteness, poses problems for their syntactic analysis that are well discussed in the literature. We propose that the morpho-syntactic structures of Dravidian languages are better analysed from the theoretical perspectives of “Cognitive Grammar” or “Construction Grammar”, where every grammatical structure is treated as a symbol that maps directly to meaningful conceptualizations. In other words, natural language is treated not as a formal system but as a functional system that is entirely symbolic or semiotic, from lexicon to grammar. Using linguistic evidence, we show that morpho-syntactic structures in Dravidian languages have their basis in meaningful discourse conceptualizations. We then arrange these conceptualizations hierarchically into construction schemas that exhibit multiple-inheritance relationships, and we explain all concrete morpho-syntactic structures as instances of these schemas.
Based on this fresh theoretical grounding, we model parsing as the automatic identification of meaningful dependency relations between such meaningful construction units. We formulated an annotation scheme for labelling the construction units and the dependency relations that can exist between them. Our approach to full parser annotation achieves an average MALT LAS of 82.21% on a gold-annotated Tamil corpus of 935 sentences in a five-fold validation experiment. We conducted experiments varying the training data size, the annotation scheme, the sentence length in terms of number of chunks, and the granularity of tags, and we report the parser results for these scenarios. Finally, we build a pipeline with a splitter, a construction labeller and a grouper as intermediate layers before the MALT parser input, and we release the working full parser module.
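The five-fold validation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' setup: it assumes contiguous, unshuffled folds, and `evaluate` is a hypothetical callback standing in for training and scoring a parser on one split.

```python
from statistics import mean

def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n_items, evaluate, k=5):
    """Average a per-fold score. `evaluate(train_ids, test_ids)` is a hypothetical callback."""
    folds = k_fold_indices(n_items, k)
    scores = []
    for i, test_ids in enumerate(folds):
        # Every sentence outside the held-out fold is used for training.
        train_ids = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        scores.append(evaluate(train_ids, test_ids))
    return mean(scores)

folds = k_fold_indices(935, k=5)
fold_sizes = [len(f) for f in folds]  # 935 sentences split evenly into five folds
```

With 935 sentences, each fold holds 187 test sentences and the remaining 748 are used for training; the reported figure would then be the mean of the five per-fold LAS values.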
This paper presents our work on applying a non-linear neural network to parsing five resource-poor Indian languages belonging to two major language families: Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages, whereas Kannada, Telugu and Malayalam belong to the Dravidian family. While little work has been done previously on linear transition-based parsing of Bengali and Telugu, we present one of the first parsers for Marathi, Kannada and Malayalam. All these Indian languages have free word order and range from moderately to very rich in morphology. We therefore propose using linguistically motivated morphological features (suffix and postposition) in the non-linear framework to capture the intricacies of both language families. The model also captures chunk information and gender, number and person information. We put forward ways to represent these features cost-effectively using monolingual distributed embeddings. Instead of relying on expensive morphological analyzers to extract the information, these embeddings are used to increase parsing accuracy for resource-poor languages. Our experiments compare the two language families with respect to the importance of different morphological features. Part-of-speech taggers and chunkers for all five languages are also built in the process.
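One way to picture suffix features represented through embeddings is the sketch below. It is an illustration under stated assumptions, not the paper's architecture: the function names, the fixed suffix length, the zero-vector fallback for unseen items and the toy vectors are all hypothetical.

```python
def suffix_features(token, max_len=3):
    """Return the token's final 1..max_len characters as candidate suffix features."""
    return [token[-n:] for n in range(1, max_len + 1) if len(token) >= n]

def embed(feature, table, dim):
    """Look up a feature vector, falling back to zeros for unseen features (an assumption)."""
    return table.get(feature, [0.0] * dim)

def token_input(token, word_table, suffix_table, dim, suffix_len=3):
    """Concatenate the word vector with the vector of its final `suffix_len` characters."""
    suffix = token[-suffix_len:]
    return embed(token, word_table, dim) + embed(suffix, suffix_table, dim)

# Toy embedding tables with made-up values; real tables would be trained on monolingual text.
word_table = {"abcde": [1.0, 2.0]}
suffix_table = {"cde": [0.5, 0.5]}
vector = token_input("abcde", word_table, suffix_table, dim=2)
```

The concatenated vector would then be one part of the input to the network's hidden layer; no morphological analyzer is consulted at any point in this sketch.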
In this paper, we present experimental results for a Vietnamese sentence parser built using Chomsky’s subcategorization theory and PDCG (Probabilistic Definite Clause Grammar). The efficiency of this subcategorized PDCG parser has been demonstrated experimentally: we built by hand a treebank with 1000 syntactic structures of Vietnamese training sentences and used different testing datasets to evaluate the results. The precision, recall and F-measure in these experiments are all over 98%.
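For reference, precision, recall and F-measure can be computed as in the sketch below. It assumes each parse is represented as a set of labelled spans `(label, start, end)`; the abstract does not say how the structures were compared, so this representation and the toy spans are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of labelled spans and return (precision, recall, F1)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: three of four predicted spans match the gold analysis.
gold = {("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 3, 4)}
pred = {("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("PP", 3, 4)}
p, r, f = precision_recall_f1(gold, pred)
```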
In this research, we build an initial model for semantic parsing of simple Vietnamese sentences. With such a model, we can analyse simple Vietnamese sentences to determine their semantic structures, which are represented in a form that we define. We address two tasks: first, we build our own taxonomy of Vietnamese nouns and use it to define the feature structures of nouns and verbs; second, to build a unification-based Vietnamese grammar, we define syntactic and semantic unification rules for Vietnamese phrases, clauses and sentences based on Unification-Based Grammar. This Vietnamese grammar has been used to build a semantic parser for single Vietnamese sentences. The parser has been tested experimentally, and both precision and recall are over 84%.
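The core operation of a unification-based grammar can be sketched as below. This is a simplified illustration, not the authors' grammar: feature structures are plain nested dictionaries with atomic leaves, shared (re-entrant) values are not modelled, and the feature names `cat`, `sem`, `animate` and `class` are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures; return the merged structure, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values somewhere under this feature
                result[key] = merged
            else:
                result[key] = b_val  # feature present only in b is carried over
        return result
    # Atomic values unify only when they are equal.
    return a if a == b else None

# A verb's requirement on its subject, and two candidate nouns (hypothetical features).
subject_slot = {"cat": "N", "sem": {"animate": True}}
noun_ok = {"cat": "N", "sem": {"animate": True, "class": "person"}}
noun_bad = {"cat": "N", "sem": {"animate": False, "class": "object"}}

compatible = unify(subject_slot, noun_ok)
clash = unify(subject_slot, noun_bad)
```

Here `compatible` carries the merged semantic features of the slot and the noun, while `clash` is `None` because the `animate` values disagree; a rule in such a grammar would apply only when unification succeeds.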
In this paper, we present our work towards the effective expansion of treebanks by minimizing the human effort required during annotation. We combine the benefits of automatic and human annotation for manual post-processing. Our approach identifies probably incorrect edges and then suggests k-best alternatives for them in a typed-dependency framework. Minimizing human effort calls for automatic identification of ambiguous cases. We employ an entropy-based confusion measure to capture the uncertainty exhibited by the parser oracle and then flag the highly uncertain predictions. To further assist human decisions, k-best alternatives are supplied in order of their likelihood. Our experiments, conducted for Hindi, establish the effectiveness of the proposed approach. We use label accuracy as the metric and show that it increases with economically viable manual intervention. This work opens new directions in the expansion of treebanks by accelerating the annotation process.
- by Naman Jain
- Computer Science, Parsing, Hindi, Wordnet
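The entropy-based flagging described in the abstract above can be illustrated with the following sketch. It assumes the parser exposes a probability distribution over labels for each edge; the threshold, the value of k, the edge identifiers and the label names are hypothetical, and the paper's exact confusion measure may differ.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def review_queue(edge_label_probs, threshold=1.0, k=2):
    """Flag edges whose label distribution is highly uncertain.

    edge_label_probs maps an edge id to {label: probability}. For each flagged
    edge, the k most likely labels are returned in decreasing order of likelihood.
    """
    flagged = {}
    for edge, dist in edge_label_probs.items():
        if entropy(dist.values()) > threshold:
            ranked = sorted(dist, key=dist.get, reverse=True)
            flagged[edge] = ranked[:k]
    return flagged

# Toy distributions for two edges, keyed by (dependent, head) token positions.
predictions = {
    (2, 1): {"subj": 0.97, "obj": 0.02, "mod": 0.01},  # confident prediction
    (3, 1): {"subj": 0.40, "obj": 0.35, "mod": 0.25},  # uncertain prediction
}
flagged = review_queue(predictions, threshold=1.0, k=2)
```

Only the second edge exceeds the threshold, so an annotator would be shown that edge alone, together with its two most likely labels.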
In this paper, we present our approach to dependency parsing of Hindi as part of the Hindi Shared Task on Parsing, COLING 2012. We study the effect of the different settings available in MaltParser under a two-step parsing strategy, i.e. splitting the data into interChunks and intraChunks, to obtain the best possible LAS, UAS and LA accuracy. Our system achieved the best LAS of 90.99% on the Gold Standard track and the second-best LAS of 83.91% on automated data.
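The three metrics named in this abstract have standard definitions: UAS counts tokens with the correct head, LA counts tokens with the correct label, and LAS counts tokens with both correct. A minimal sketch follows, assuming each token is given as a `(head, label)` pair; the toy sentence and label names are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Compute LAS, UAS and LA from per-token (head, label) pairs in the same token order."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("at least one token is required")
    total = len(gold)
    pairs = list(zip(gold, predicted))
    uas = sum(g[0] == p[0] for g, p in pairs) / total  # correct head
    la = sum(g[1] == p[1] for g, p in pairs) / total   # correct label
    las = sum(g == p for g, p in pairs) / total        # correct head and label
    return {"LAS": las, "UAS": uas, "LA": la}

# Toy four-token sentence; head 0 marks the root.
gold = [(2, "subj"), (0, "root"), (2, "obj"), (3, "mod")]
pred = [(2, "subj"), (0, "root"), (2, "mod"), (2, "mod")]
scores = attachment_scores(gold, pred)
```

In this toy case the third token has the right head but the wrong label, and the fourth has the right label but the wrong head, so UAS and LA are both higher than LAS.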