Jaroslava Hlaváčová - Academia.edu
Papers by Jaroslava Hlaváčová
Using the recent edition of the Frequency Dictionary of Czech as an example, we describe and explain some new general principles that should be followed to get better results in practical uses of frequency dictionaries. The main one is adopting average reduced frequency instead of absolute frequency for ordering items. The formula for calculating the average reduced frequency is presented in the contribution together with a brief explanation, including examples clarifying the difference between the two measures. Then, the Frequency Dictionary of Czech and its parts are described.
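The abstract above does not reproduce the formula, so the following is only a minimal sketch of average reduced frequency as it is usually stated: for a word with f occurrences in a corpus of N tokens, let v = N / f and let d_i be the cyclic distances between consecutive occurrences; then ARF = (1 / v) * sum(min(d_i, v)). The function name, the 0-based position list, and the `corpus_size` parameter are illustrative assumptions, not taken from the paper.

```python
def average_reduced_frequency(positions, corpus_size):
    """Sketch of ARF for one word.

    positions   -- 0-based token positions of the word (hypothetical input format)
    corpus_size -- total number of tokens N in the corpus
    """
    positions = sorted(positions)
    f = len(positions)
    if f == 0:
        return 0.0
    # v is the average gap the word would have if it were spread evenly.
    v = corpus_size / f
    # Distances between consecutive occurrences ...
    distances = [positions[i] - positions[i - 1] for i in range(1, f)]
    # ... plus the wrap-around distance from the last occurrence to the first.
    distances.append(positions[0] + corpus_size - positions[-1])
    # Each gap contributes at most v, so clustered occurrences count for less.
    return sum(min(d, v) for d in distances) / v


evenly_spread = average_reduced_frequency([0, 25, 50, 75], 100)
clustered = average_reduced_frequency([10, 11, 12, 13], 100)
print(evenly_spread, clustered)
```

Both words have absolute frequency 4, but the evenly spread one keeps an ARF equal to its frequency, while the clustered one drops close to 1. This is the difference between the two measures that the abstract refers to.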
Journal of Linguistics/Jazykovedný časopis
The paper presents a discussion of homonymy of Czech nouns with different or varying genders. The lemmas with this type of homonymy are treated as separate in the new release of the MorfFlex dictionary. We show that separating paradigms according to gender is not only superfluous but also clumsy, because it forces a choice where none is necessary. That is why we call this type of homonymy “artificial”.
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
The file contains all Czech verbs included in the Retrograde Morphemic Dictionary of Czech Language (Eleonora Slavíčková, Academia 1975). The data was obtained by scanning the portion of the dictionary that contains words ending in -ci and -ti. Among them were 18 non-verbs, which were removed. The scans were converted to plain text using OCR, and the result was checked by two independent readers. If you nevertheless encounter a remaining error, please report it.
Czech morphological dictionary, originally developed by Jan Hajič as a spelling checker and lemmatization dictionary. It currently contains full morphological information for each covered wordform, as well as some derivational, semantic and named entity information.
The authors present a proposal for a new morphological tagging of Czech corpora, under the working name NovaMorf. They first give the motivation for the new tagging system: (a) an effort to harmonize the existing morphological systems (mainly the Prague and Brno ones) and (b) making the tagging system clearer and ensuring its consistency. They introduce the new tagset, that is, the morphological categories and their values, some of which are designed differently from the existing approach. They focus in particular on the substantially new features of the new tagging. They also touch on lemmatization, above all the introduction of the concept of a multiple lemma.
We describe an annotation experiment combining topics from lexicography and Word Sense Disambiguation. It involves a lexicon (Pattern Dictionary of English Verbs, PDEV), an existing data set (VPS-GradeUp), and an unpublished data set (RTE in PDEV Implicatures). The aim of the experiment was twofold: on the one hand, a pilot annotation of Recognizing Textual Entailment (RTE) on PDEV implicatures (lexicon glosses), and, on the other hand, an analysis of the effect of Textual Entailment between lexicon glosses on annotators’ Word Sense Disambiguation decisions, compared to other predictors, such as finiteness of the target verb, the explicit presence of its relevant arguments, and the semantic distance between corresponding syntactic arguments in two different patterns (dictionary senses).
The aim of the contribution is to present a project that innovates the description of Czech morphology for tools of automatic morphological analysis, in particular the changes in the conception of the tagset. For more than twenty years, automatic morphological analysis has been part of many tools for natural language processing (NLP). Its results are used by the linguistic community especially when working with large language corpora. A grant project has been running since 2012 within which innovations of the automatic morphological analysis of Czech are being prepared. They aim primarily at removing the shortcomings that current practice struggles with, and they build on experience that could be gained only through practice.
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). This release contains the test data used in the CoNLL 2017 shared task on parsing Universal Dependencies. Because of the shared task, the test data was kept hidden and not released together with the training and development data of UD 2.0. This release therefore complements the UD 2.0 release (http://hdl.handle.net/11234/1-1983) to form a full release of UD treebanks. In addition, the present release contains 18 new parallel test sets and 4 test sets in surprise languages. The present r...
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, one of two tasks was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe data preparation, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.
We discuss two types of asymmetry between wordforms and their (morphological) characteristics, namely (morphological) variants and homographs. We introduce the concept of a multiple lemma, which allows for unique identification of wordform variants as well as ‘morphologically-based’ identification of homographic lexemes. A deeper insight into these concepts allows further refinement of morphological dictionaries and, subsequently, better performance in NLP tasks. We demonstrate our approach on the morphological dictionary of Czech.
Lecture Notes in Computer Science, 2011
The paper deals with automatic methods for prefix extraction and their comparison. We present experiments with Czech and English and compare the results with regard to the size and type (wordforms vs. lemmas) of input data.
Journal of Linguistics/Jazykovedný časopis
A detailed morphological description of word forms in any language is a necessary condition for successful automatic processing of linguistic data. The paper focuses on a new description of morphological categories, mainly on the subcategorization of parts of speech in Czech within the NovaMorf project. NovaMorf aims to describe the morphological properties of Czech word forms in a more compact and consistent way, and with higher explicative power, than the approaches used so far. It also aims at the unification of diverse approaches to morphological annotation of Czech. The NovaMorf approach will be reflected in a new morphological dictionary to be exploited for a new automatic morphological analysis (and disambiguation) of corpora of contemporary Czech.
Journal of Linguistics/Jazykovedný časopis
In many languages, some words can be written in several ways. We call them variants. The values of all their morphological categories are identical, which leads to an identical morphological tag. Since the lemma is identical as well, we have two or more wordforms with the same morphological description. This ambiguity may cause problems in various NLP applications. There are two types of variants – those affecting the whole paradigm (global variants) and those affecting only wordforms sharing some combinations of morphological values (inflectional variants). In the paper, we propose a means of tagging all wordforms, including their variants, unambiguously. We call this requirement the “Golden rule of morphology”. The paper deals mainly with Czech, but the ideas can be applied to other languages as well.
The paper presents a corpus-based method for obtaining ranked wordlists that can characterise lexical usage changes. The method is evaluated on two representatively balanced 100-million corpora of contemporary written Czech that cover two consecutive time periods. Despite the similar overall design of the corpora, lexical frequencies first have to be normalised in order to achieve comparability. Furthermore, dispersion information is used to reduce the number of domain-specific items, as their frequencies depend heavily on the inclusion of particular texts in the corpus. Finally, statistical significance measures are used to evaluate frequency differences between individual items in the two corpora. It is demonstrated that the method ranks the resulting wordlists appropriately, and several limitations of the approach are also discussed. The influence of corpus composition cannot be completely eliminated, and comparability of the corpora is shown to play a key role. Therefore, although highly...
Lecture Notes in Computer Science, 2008
In the paper, we present Affisix, a software tool for automatic recognition of prefixes. On the basis of an extensive list of words in a language, it determines the segments that are candidates for prefixes. Two methods are implemented for the recognition – the entropy method and the squares method. We briefly describe the methods, propose their improvements and present
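The abstract does not spell out how the entropy method works, so the following is only an illustrative sketch of the general idea behind entropy-based segmentation, not Affisix's actual implementation: for a candidate segment, measure the entropy of the letters that follow it across the word list. A high value suggests that many different continuations are possible, which is typical of a morpheme boundary. The function name and the toy word list are hypothetical.

```python
import math
from collections import Counter


def successor_entropy(words, segment):
    """Entropy (in bits) of the letter that follows `segment` at the start of words."""
    # Count the next letter for every word that starts with the segment
    # and is strictly longer than it.
    counts = Counter(
        w[len(segment)]
        for w in words
        if w.startswith(segment) and len(w) > len(segment)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy word list (hypothetical); a real run would use an extensive list of words.
words = ["redo", "rewrite", "replay", "read"]
boundary_score = successor_entropy(words, "re")
inside_score = successor_entropy(words, "rewr")
print(boundary_score, inside_score)
```

In this toy list the segment "re" is followed by four different letters, so its score is high, whereas "rewr" has a single continuation and scores zero. A practical tool would compare such scores against a threshold, and the choice of threshold is left open here.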