Naman Jain | IIIT Hyderabad
Publications by Naman Jain
In this paper, we present our work towards the effective expansion of treebanks by minimizing the human effort required during annotation. We combine the benefits of automatic and human annotation through manual post-processing. Our approach identifies probable incorrect edges and then suggests k-best alternates for them in a typed-dependency framework. Minimizing human effort calls for automatic identification of ambiguous cases. We employ an entropy-based confusion measure to capture the uncertainty of the parser oracle and flag highly uncertain predictions. To further assist human decisions, the k-best alternatives are supplied in order of their likelihood. Our experiments, conducted for Hindi, establish the effectiveness of the proposed approach: label accuracy, used as the evaluation metric, improves with economically viable manual intervention. This work points to new directions in treebank expansion by accelerating the annotation process.
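A minimal sketch of the flagging step described above, assuming the parser oracle exposes a probability distribution over dependency labels for each edge; the data layout, function names, and threshold value are illustrative assumptions, not taken from the paper:

```python
import math

def label_entropy(label_probs):
    """Shannon entropy of the parser's label distribution for one edge.

    High entropy means the oracle's probability mass is spread over
    several labels, i.e. the prediction is uncertain and worth flagging
    for a human annotator.
    """
    return -sum(p * math.log(p, 2) for p in label_probs.values() if p > 0)

def flag_uncertain_edges(parsed_sentence, threshold=1.0):
    """Return the edges whose label entropy exceeds the threshold.

    `parsed_sentence` is assumed to be a list of dicts, one per token,
    each carrying the predicted head and a dict of label -> probability.
    The threshold here is illustrative; in practice it would be tuned on
    held-out data.
    """
    return [edge for edge in parsed_sentence
            if label_entropy(edge["label_probs"]) > threshold]
```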
In this paper, we present our efforts towards identifying probable incorrect edges and then suggesting k-best alternates for them in a typed-dependency framework. Such a setup is beneficial in human-aided NLP systems where decisions are largely automated with minimal human intervention. Minimizing the human intervention calls for automatic identification of ambiguous cases. We employ an entropy-based confusion measure to capture the uncertainty of the parser oracle and flag highly uncertain predictions. To further assist human decisions, the k-best alternatives are supplied in order of their likelihood. Our experiments, conducted for Hindi, establish the effectiveness of the proposed approach in increasing label accuracy with economically viable manual intervention. This work opens new directions for parser development and for human-aided NLP systems.
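To complement the flagging step, here is a hedged sketch of how the k-best alternatives could be surfaced to an annotator, again assuming a per-edge label distribution; the data layout is an assumption and the label names are illustrative Paninian-style dependency labels:

```python
def k_best_labels(label_probs, k=3):
    """Return the k most likely dependency labels, best first."""
    return sorted(label_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example: an uncertain edge where the oracle hesitates between two labels.
edge = {"token": "raam", "head": "khaata",
        "label_probs": {"k1": 0.42, "k2": 0.40, "k4": 0.10, "pof": 0.08}}

for label, prob in k_best_labels(edge["label_probs"], k=3):
    print(f"{label}\t{prob:.2f}")   # shown to the annotator in this order
```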
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012), pages 163–170, COLING 2012, Mumbai, December 2012
In this paper, we present our approach to dependency parsing of Hindi as part of the Hindi Shared Task on Parsing, COLING 2012. We study the effect of different settings available in MaltParser, following a two-step parsing strategy, i.e. splitting the data into interChunk and intraChunk parts to obtain the best possible LAS, UAS and LA. Our system achieved the best LAS of 90.99% on the Gold Standard track and the second-best LAS of 83.91% on automated data.
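A hedged sketch of the two-step split, assuming each token of a chunked dependency tree carries a chunk id and a chunk-head flag; the `chunk_id` and `is_chunk_head` keys are assumptions for illustration, not the shared-task file format itself:

```python
from collections import defaultdict

def split_two_step(sentence):
    """Split one chunked dependency tree for two-step parsing.

    `sentence` is assumed to be a list of token dicts with the keys
    `id`, `head`, `chunk_id` and `is_chunk_head`.

    Returns (intra_chunks, inter_chunk):
      - intra_chunks: one token list per multi-token chunk, used to
        train the intraChunk model
      - inter_chunk:  the chunk heads only, used to train the
        interChunk model
    """
    chunks = defaultdict(list)
    for tok in sentence:
        chunks[tok["chunk_id"]].append(tok)

    intra_chunks = [toks for toks in chunks.values() if len(toks) > 1]
    inter_chunk = [tok for tok in sentence if tok["is_chunk_head"]]
    return intra_chunks, inter_chunk
```

Each part would then be written back to CoNLL format and a separate parsing model trained on it; at parse time, the intra-chunk trees are attached under the inter-chunk tree to recover the full dependency structure.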
Papers by Naman Jain
In Computational Linguistics, Hindi and Urdu are not viewed as a monolithic entity and have received separate attention with respect to their text processing. From part-of-speech tagging to machine translation, models are trained separately for Hindi and Urdu despite the fact that they represent the same language. The reasons are mainly their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. In this article, we propose a simple but efficient approach to bridge the lexical and orthographic differences between Hindi and Urdu texts. With respect to text processing, addressing the differences between the Hindi and Urdu texts would be beneficial in the following ways: (a) instead of training separate models, their individual resources can be augmented to train single, unified models for better generalization, and (b) their individual text processing applications can be used interchangeably under varied resource conditions.
Proceedings of the First Workshop on Computational Approaches to Code Switching, 2014
In this paper, we present our approach to dependency parsing of Hindi as part of the Hindi Shared Task on Parsing, COLING 2012. We study the effect of different settings available in MaltParser, following a two-step parsing strategy, i.e. splitting the data into interChunk and intraChunk parts to obtain the best possible LAS, UAS and LA. Our system achieved the best LAS of 90.99% on the Gold Standard track and the second-best LAS of 83.91% on automated data. KEYWORDS: Hindi ...
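For reference, a small sketch of the three metrics reported above (LAS, UAS and LA), computed over aligned gold and system token lists; the (head, label) layout is an assumption for illustration:

```python
def attachment_scores(gold, system):
    """Compute UAS, LAS and LA (label accuracy) for aligned token lists.

    Each token is assumed to be a (head, label) pair; gold and system
    must have the same length and token order.
    """
    assert len(gold) == len(system)
    n = len(gold)
    uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / n  # head correct
    las = sum(g == s for g, s in zip(gold, system)) / n        # head and label correct
    la = sum(g[1] == s[1] for g, s in zip(gold, system)) / n   # label correct
    return uas, las, la

gold = [(2, "k1"), (0, "main"), (2, "k2")]
system = [(2, "k1"), (0, "main"), (1, "k2")]
print(attachment_scores(gold, system))  # (0.666..., 0.666..., 1.0)
```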
Proceedings of the EMNLP'2014 Workshop on Language Technology for Closely Related Languages and Language Variants, 2014
In Computational Linguistics, Hindi and Urdu are not viewed as a monolithic entity and have received separate attention with respect to their text processing. From part-of-speech tagging to machine translation, models are trained separately for Hindi and Urdu despite the fact that they represent the same language. The reasons are mainly their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. In this paper, we propose a simple but efficient approach to bridge the lexical and orthographic differences between Hindi and Urdu texts. With respect to text processing, addressing the differences between their texts would be beneficial in the following ways: (a) instead of training separate models, their individual resources can be augmented to train single, unified models for better generalization, and (b) their individual text processing applications can be used interchangeably under varied resource conditions. To remove the script barrier, we learn accurate statistical transliteration models that use sentence-level decoding to resolve word ambiguity. Similarly, we learn cross-register word embeddings from the harmonized Hindi and Urdu corpora to nullify their lexical divergences. As a proof of concept, we evaluate our approach on Hindi and Urdu dependency parsing under two scenarios: (a) resource sharing, and (b) resource augmentation. We demonstrate that a neural network-based dependency parser trained on augmented, harmonized Hindi and Urdu resources performs significantly better than parsing models trained separately on the individual resources. We also show that we can achieve near state-of-the-art results when the parsers are used interchangeably.
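A hedged sketch of the harmonization pipeline described above: Urdu text is brought into the same script as Hindi so that one corpus can be built, and word embeddings are then trained on the merged data. The `transliterate_urdu_to_devanagari` function is a purely hypothetical stand-in for the paper's sentence-level statistical transliteration model, and the embedding step names gensim's Word2Vec only as one possible choice, not necessarily the model used in the paper:

```python
def transliterate_urdu_to_devanagari(sentence):
    """Hypothetical stand-in for a sentence-level transliteration model.

    In the paper this is a learned statistical model that resolves
    word-level ambiguity using sentence context; here it is only a
    placeholder to show where it fits in the pipeline.
    """
    raise NotImplementedError("plug in a trained transliteration model")

def harmonized_corpus(hindi_sentences, urdu_sentences):
    """Yield tokenized sentences from both registers in a single script."""
    for sent in hindi_sentences:
        yield sent.split()
    for sent in urdu_sentences:
        yield transliterate_urdu_to_devanagari(sent).split()

# Cross-register embeddings could then be trained on the merged corpus,
# e.g. with gensim (parameter names follow gensim >= 4.0; values are
# illustrative):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=list(harmonized_corpus(hindi_sents, urdu_sents)),
#                    vector_size=100, window=5, min_count=2, sg=1)
```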
Lecture Notes in Computer Science, 2015