Word Segmentation, Unknown-word Resolution, and Morphological Agreement in a Hebrew Parsing System
Related papers
Using the Penn Treebank to evaluate non-treebank parsers
2004
This paper describes a method for conducting evaluations of Treebank and non-Treebank parsers alike against the English language U. Penn Treebank (Marcus et al., 1993) using a metric that focuses on the accuracy of relatively non-controversial aspects of parse structure. Our conjecture is that if we focus on maximal projections of heads (MPH), we are likely to find much broader agreement than if we try to evaluate based on order of attachment.
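As a rough illustration (not the authors' implementation), one can picture such a metric as precision and recall over MPH spans alone. The sketch below assumes the spans have already been extracted from each parse as (label, start, end) triples; the function name mph_scores and the toy spans are ours.

```python
# Illustrative sketch, not the authors' implementation: score a candidate
# parse against a gold parse using only the spans of maximal projections of
# heads (MPH), ignoring the order of attachment inside those projections.
# How MPH spans are extracted from a tree is assumed to happen elsewhere.

def mph_scores(gold_spans, test_spans):
    """gold_spans / test_spans: sets of (label, start, end) MPH triples."""
    correct = len(gold_spans & test_spans)
    precision = correct / len(test_spans) if test_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Toy example: the candidate recovers the NP and VP projections but proposes
# a spurious PP projection in place of the gold S projection.
gold = {("NP", 0, 2), ("VP", 2, 6), ("S", 0, 6)}
test = {("NP", 0, 2), ("VP", 2, 6), ("PP", 4, 6)}
print(mph_scores(gold, test))  # (0.666..., 0.666..., 0.666...)
```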
2016
Many Treebanks, i.e., text corpora annotated with parse trees, are available to researchers in Natural Language Processing (NLP). All of these Treebanks are limited in size, and each uses its own Context-Free Grammar (CFG) production rules (its own formalism), because constructing a Treebank is time-consuming and requires experts in linguistics. These Treebanks can be used for statistical parsing, machine translation experiments, and other NLP applications. In this paper, we propose to build a large Treebank from multiple Treebanks of the same language, and to use an annotated corpus as a lexical resource. Three English Treebanks are taken for our study: the Penn Treebank (PTB), the GENIA Treebank (GTB), and the British National Corpus (BNC). The Brown corpus, which contains approximately one million tokens annotated with part-of-speech tags, is used as the lexical resource. Our work starts with the unification of the POS tagsets of the three Treebanks...
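A minimal sketch of the tagset-unification step this abstract mentions; the mapping tables, tag choices, and sample sentences below are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch of unifying POS tagsets before merging Treebanks.
# The mapping entries and the sample sentences are invented examples,
# not the paper's actual tables or data.

PTB_TO_COMMON = {"NN": "NOUN", "NNS": "NOUN", "VB": "VERB", "VBD": "VERB", "JJ": "ADJ"}
GTB_TO_COMMON = {"NN": "NOUN", "VBZ": "VERB", "JJ": "ADJ"}
BNC_TO_COMMON = {"NN1": "NOUN", "NN2": "NOUN", "VVB": "VERB", "AJ0": "ADJ"}

def unify_tags(tagged_sentence, mapping):
    """Map (word, source_tag) pairs onto the shared tagset; unknown tags are
    kept unchanged so they can be reviewed and mapped later."""
    return [(word, mapping.get(tag, tag)) for word, tag in tagged_sentence]

ptb_sents = [[("parsers", "NNS"), ("improve", "VB")]]
gtb_sents = [[("protein", "NN"), ("binds", "VBZ")]]
bnc_sents = [[("small", "AJ0"), ("treebanks", "NN2")]]

merged = []
for corpus, mapping in [(ptb_sents, PTB_TO_COMMON),
                        (gtb_sents, GTB_TO_COMMON),
                        (bnc_sents, BNC_TO_COMMON)]:
    merged.extend(unify_tags(sent, mapping) for sent in corpus)

print(merged)  # all sentences now share one tagset: NOUN, VERB, ADJ, ...
```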
An alternative to head-driven approaches for parsing a (relatively) free word-order language
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 2 - EMNLP '09, 2009
Applying statistical parsers developed for English to languages with freer word order has turned out to be harder than expected. This paper investigates the adequacy of different statistical parsing models for dealing with a (relatively) free word-order language. We show that the recently proposed Relational-Realizational (RR) model consistently outperforms state-of-the-art Head-Driven (HD) models on the Hebrew Treebank. Our analysis reveals a weakness of HD models: their intrinsic focus on configurational information. We conclude that the form-function separation ingrained in RR models makes them better suited for parsing nonconfigurational phenomena.
Hebrew dependency parsing: Initial results
2009
We describe a newly available Hebrew Dependency Treebank, which is extracted from the Hebrew (constituency) Treebank. We establish some baseline unlabeled dependency parsing performance on Hebrew, based on two state-of-the-art parsers, MST-parser and MaltParser. The evaluation is performed both in an artificial setting, in which the data is assumed to be properly morphologically segmented and POS-tagged, and in a real-world setting, in which the parsing is performed on automatically segmented and POS-tagged text. We present an evaluation measure that takes into account the possibility of incompatible token segmentation between the gold standard and the parsed data. Results indicate that (a) MST-parser performs better on Hebrew data than MaltParser, and (b) neither parser makes good use of morphological information when parsing Hebrew.
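One way to picture an evaluation measure that tolerates incompatible token segmentation, sketched here under our own assumptions rather than taken from the paper: identify tokens by character spans and count a dependency only when both the dependent and its head match the gold segmentation.

```python
# Sketch of a segmentation-aware unlabeled attachment score: tokens are
# identified by (start, end) character spans, so a dependency can only be
# counted as correct if its dependent span exists in the gold segmentation
# and is attached to the same head span. This illustrates the general idea,
# not necessarily the paper's exact measure.

def segmentation_aware_uas(gold, pred):
    """gold / pred: dicts mapping dependent span -> head span
    (None represents the artificial root)."""
    correct = sum(1 for dep, head in pred.items()
                  if dep in gold and gold[dep] == head)
    return correct / len(gold) if gold else 0.0

# Toy example: the parser's segmenter splits the third gold token in two,
# so neither of the resulting tokens can ever be scored as correct.
gold = {(0, 5): None, (6, 9): (0, 5), (10, 14): (6, 9)}
pred = {(0, 5): None, (6, 9): (0, 5), (10, 12): (6, 9), (12, 14): (10, 12)}
print(segmentation_aware_uas(gold, pred))  # 2 of 3 gold tokens -> ~0.67
```

Under this scheme, tokens produced by an incorrect segmentation have no matching gold span, so segmentation errors are penalized rather than silently ignored.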
Accurate Unlexicalized Parsing for Modern Hebrew
Lecture Notes in Computer Science, 2007
Many state-of-the-art statistical parsers for English can be viewed as Probabilistic Context-Free Grammars (PCFGs) acquired from treebanks consisting of phrase-structure trees enriched with a variety of contextual, derivational (e.g., markovization) and lexical information. In this paper we empirically investigate the applicability and adequacy of the unlexicalized variety of such parsing models to Modern Hebrew, a Semitic language that differs in structure and characteristics from English. We show that contrary to experience with parsing the WSJ, the markovized, head-driven unlexicalized variety does not necessarily outperform plain PCFGs for Semitic languages. We demonstrate that enriching unlexicalized PCFGs with morphologically marked agreement features percolated up the parse tree (e.g., definiteness) outperforms plain PCFGs as well as a simple head-driven variation on the MH treebank. We further show that an (unlexicalized) head-driven variety enriched with the same features achieves even better performance. We conclude that morphologically rich languages introduce an additional dimension of parametrization that is orthogonal to the horizontal/vertical dimensions proposed before [11], and that its contribution is essential and complementary.
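A minimal sketch of the feature-percolation idea under simplifying assumptions (a tuple-encoded tree, a crude leftmost-head rule, and definiteness as the only feature); it is not the paper's treebank transformation.

```python
# Sketch of percolating a morphological agreement feature (definiteness)
# from head children up the tree by appending it to nonterminal labels,
# so that a PCFG read off the transformed trees conditions on it.
# The tuple-based tree encoding and the head rule are simplifying
# assumptions, not the paper's treebank transformation.

def percolate(node, feature="def"):
    """Phrases are (label, children); preterminals are (tag, word, feats)."""
    if len(node) == 3:                                  # preterminal leaf
        label, word, feats = node
        mark = "+def" if feature in feats else ""
        return (label + mark, word, feats)
    label, children = node
    new_children = [percolate(child, feature) for child in children]
    head = new_children[0]   # crude assumption: leftmost child is the head
    mark = "+def" if head[0].endswith("+def") else ""
    return (label + mark, new_children)

# Toy noun-initial NP with a definite noun: the "+def" mark reaches the NP,
# letting the grammar enforce definiteness agreement within it.
np = ("NP", [("NN", "ha-ir", {"def"}), ("ADJ", "ha-gdola", {"def"})])
print(percolate(np)[0])  # NP+def
```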