Peter Dirix | KU Leuven
Papers by Peter Dirix
Universal Dependencies Consortium, May 15, 2021
Universal Dependencies Consortium, Nov 15, 2019
Universal Dependencies Consortium, May 15, 2020
TN&A: Tydskrif vir Nederlands en Afrikaans, Feb 1, 2021
A corpus study on the alternation of complex and simplex initials in Afrikaans
In Afrikaans main clauses the lexical verb usually appears on the second sentence bracket when it is selected by an auxiliary on the first sentence bracket. However, an alternative construction with the main verb collocated with the auxiliary on the first sentence bracket is also possible. Those constructions are called 'simplex initials' and 'complex initials' respectively. We conduct a corpus study to investigate which factors determine the alternation between both constructions. The corpus data reveal that only a few combinations of the auxiliary and lexical verb appear in both constructions. We present the results of a distinctive collexeme analysis in which we investigate which verb combinations are significantly attracted to one of the constructions.
Universal Dependencies Consortium, May 15, 2019
Universal Dependencies Consortium, Nov 15, 2020
Over the past few decades, corpus linguistics has become the basis for both lexicographers compiling dictionaries and syntacticians compiling grammars. While a flat corpus without annotation is rather easy to collect and mostly sufficient for lexicographers, a syntactically annotated corpus or treebank is a must for syntax description. For smaller languages or non-standard variants, this type of resource is often lacking. Even Afrikaans, with 7 million native speakers, can be considered a low-resource language in this respect. We describe our efforts in creating resources (a manually verified morphosyntactic lexicon of 250,000 entries, a large PoS-tagged corpus of 50 million words, a small treebank of about 45,000 words) and tools (a simple search tool, the treebank-based GrETEL for Afrikaans search tool, a tokenizer and a lemmatizer) for Afrikaans, as well as the work we are starting on a gamification project in order to use crowdsourcing for syntactic annotation of dependency re...
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
German and Dutch in Contrast, 2020
In the West-Germanic languages we expect an auxiliary of the perfect to select a past participle. In a subset of these languages, however, some verbs select an infinitive instead, i.e. in constructions known as infinitivus pro participio (IPP). The phenomenon is well-studied with regard to Dutch and German, but for Afrikaans an extensive study based on empirical data is still lacking. In order to fill this void, the present paper uses a corpus study to identify the verbs which, obligatorily or optionally, take the IPP form in Afrikaans. Verb classes showing the IPP effect in Afrikaans, Dutch and German are compared, and cross-linguistic similarities and differences are identified. The result is a corpus-based typology of IPP verbs in the three languages in question.
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
The Universal Dependencies (UD) project aims to develop a consistent annotation framework for treebanks across many languages. In this paper we present the UD scheme for Afrikaans and describe the conversion of the AfriBooms treebank to this new format. We compare this conversion to UD with the conversion of related syntactic structures in typologically similar languages.
LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles Universit
Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022), 2022
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and the North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy-to-use online linguistic search engine for Afrikaans, intended for researchers and students in the humanities and social sciences. The search tool is based on a similar development for Dutch, i.e. GrETEL, a user-friendly search engine which allows users to query a treebank by means of a natural language example instead of a formal search instruction.
In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are, as expected, lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.
Lot Occasional Series, 2007
This volume contains a selection of the papers presented at the seventeenth installment of the Computational Linguistics in the Netherlands conference, held at the Katholieke Universiteit Leuven on Friday, January 12th, 2007. It was organized by the Centre for Computational Linguistics and featured an invited talk, 52 oral presentations and 12 poster presentations, given by participants from four continents. The fifteen papers in this volume present a state-of-the-art overview of research in different domains of the broad field of Computational Linguistics, including tagging, parsing, machine translation, eLearning, question-answering, computational semantics, text generation, and information retrieval.