Heylen Kris - Profile on Academia.edu
Papers by Heylen Kris
Zur Abfolge (pro)nominaler Satzglieder im Deutschen : eine korpusbasierte Analyse der relativen Abfolge von nominalem Subjekt und pronominalem Objekt im Mittelfeld
ABSTRACT
Over the last decade or so, distributional methods have become the mainstay of semantic modelling in Computational Linguistics. As such, they have also been applied to the automatic modelling of verb meaning. However, more than with other lexical categories, the research into verb semantics has taken its inspiration from the idea that a verb's meaning is strongly linked to its syntactic behaviour and, more specifically, to its selectional preferences. Depending on how they use these selectional preferences, distributional models of verb meaning come in two flavours. The first approach has its historical origins in the linguistic research tradition into verb valency and frame semantics and is in principle purely syntactic in nature. A verb's semantic category is said to be inferable from its distribution over subcategorization (subcat) frames, i.e. the possible combinations of syntactic verb arguments like subject, direct object, indirect object etc. Additionally, this purely syntactic information can be extended with some high-level semantic information like the animacy of the verb arguments (see for an overview). Whereas this first, syntax-oriented approach is specifically geared towards verbs, the second approach is more generally applicable to all lexical categories and is a direct implementation of the ideas of Harris (1954). These so-called word space models use other words as context features, with a specific implementation using only those context words that co-occur in a given dependency relation to the target word (see for an overview). In the first approach, one context feature is a possible combination of syntactic arguments that a verb can govern. In the second approach, one specific context feature corresponds to one lexeme plus its syntactic relation to the target verb. Whereas the first approach is mostly used to automatically induce Levin-style verb classes, the second approach is typically applied to retrieve semantic equivalents for specific verbs (but see for a comparison of the two methods on the task of inducing Levin-style classes).
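To make the two flavours concrete, here is a minimal sketch that derives both kinds of context features from the same data. The clause tuples are invented stand-ins for a dependency-parsed corpus, not anything used in the paper.

```python
# Minimal sketch (toy data): the two kinds of context features described
# above, built from the same "parsed" clauses.
from collections import Counter, defaultdict

# Each item: (verb, {dependency_relation: lexeme, ...}) as a stand-in for a
# parsed clause; a real model would take these from a dependency-parsed corpus.
clauses = [
    ("give", {"subj": "teacher", "obj": "book", "iobj": "pupil"}),
    ("give", {"subj": "bank",    "obj": "loan", "iobj": "client"}),
    ("read", {"subj": "pupil",   "obj": "book"}),
]

# Flavour 1: verb-by-subcat-frame counts (purely syntactic features).
frame_counts = defaultdict(Counter)
for verb, args in clauses:
    frame = "+".join(sorted(args))          # e.g. "iobj+obj+subj"
    frame_counts[verb][frame] += 1

# Flavour 2: verb-by-(relation, lexeme) counts (dependency-based word space).
dep_counts = defaultdict(Counter)
for verb, args in clauses:
    for rel, lexeme in args.items():
        dep_counts[verb][(rel, lexeme)] += 1

print(dict(frame_counts["give"]))   # {'iobj+obj+subj': 2}
print(dict(dep_counts["give"]))     # {('subj', 'teacher'): 1, ...}
```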
The prevalence of multiword term candidates in a legal corpus
ABSTRACT Many approaches to term extraction focus on the extraction of multiword units, assuming that multiword units comprise the majority of terms in most subject fields. However, this supposed prevalence of multiword terms has gone largely untested in the literature. In this paper, we perform a quantitative corpus-based analysis of the claim that multiword units are more technical than single word units, and that multiword units are more widespread in specialized domains. As a case study, we look at Dutch terminology from the Belgian legal domain. First, the relevant units are extracted using linguistic filters and an algorithm to identify Dutch compounds and multiword units. In a second step, we calculate for all units an association measure that captures the degree to which a linguistic unit belongs to the domain. Thirdly, we analyze the relationship between the units' technicality, frequency and their status as a simplex, compound or multiword unit.
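As an illustration of the second step, the sketch below scores candidate units with a log-likelihood keyness measure against a reference corpus. This is only a stand-in for the association measure, since the abstract does not specify which one the paper uses, and all counts and corpus sizes are invented.

```python
# Hedged sketch: log-likelihood keyness of a unit, comparing its frequency in
# the (hypothetical) legal corpus with its frequency in a reference corpus.
import math

def log_likelihood(freq_domain, freq_ref, size_domain, size_ref):
    """Keyness of a unit given its frequencies in two corpora."""
    total = freq_domain + freq_ref
    e_domain = size_domain * total / (size_domain + size_ref)
    e_ref = size_ref * total / (size_domain + size_ref)
    ll = 0.0
    if freq_domain > 0:
        ll += freq_domain * math.log(freq_domain / e_domain)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / e_ref)
    return 2 * ll

# Hypothetical candidate units with invented counts (domain, reference).
candidates = {
    "vonnis": (320, 12),                  # simplex ('judgment')
    "huurovereenkomst": (210, 8),         # compound ('rental agreement')
    "voorlopige hechtenis": (95, 1),      # multiword unit ('pre-trial detention')
}
for unit, (f_dom, f_ref) in candidates.items():
    print(unit, round(log_likelihood(f_dom, f_ref, 5_000_000, 50_000_000), 1))
```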
TKE 2010: Presenting …, 2010
This paper describes how a toolset developed in variational linguistics to identify regional variants can be used in the field of term extraction. The notion of stable lexical marker will be introduced as a method to quantify centrality and dispersion of terms ...
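A hedged sketch of the stable-lexical-marker idea as summarized here: a unit counts as a stable marker if it is overrepresented in the domain consistently across subcorpora, which captures both centrality and dispersion. The overrepresentation test, threshold and counts below are illustrative simplifications, not the toolset's own computation.

```python
# Hedged sketch of "stability": in how many domain subcorpora is the unit
# overrepresented relative to a reference corpus? All numbers are invented.

def is_marker(freq_sub, size_sub, freq_ref, size_ref, min_ratio=2.0):
    """Crude overrepresentation test: relative-frequency ratio vs. reference."""
    rel_sub = freq_sub / size_sub
    rel_ref = (freq_ref + 1) / size_ref          # +1 avoids division by zero
    return rel_sub / rel_ref >= min_ratio

# Hypothetical frequencies of one candidate term in 4 domain subcorpora,
# each compared against the same reference corpus (freq, size).
subcorpora = [(40, 100_000), (35, 90_000), (2, 110_000), (50, 120_000)]
reference = (900, 10_000_000)

stability = sum(
    is_marker(f, n, *reference) for f, n in subcorpora
) / len(subcorpora)
print(f"stable in {stability:.0%} of subcorpora")   # dispersion-aware score
```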
TermWise: Leveraging Big Data for Terminological Support in Legal Translation
A perennial problem in German syntax is the order of verb arguments in the Mittelfeld. The Mittelfeld is the section of the clause between the two parts of the discontinuous verbal group. In it, all verb arguments can be realized simultaneously, though not always in the same order. There is a longstanding debate about a number of factors that possibly govern this variation, yet their actual influence remains unclear. This study takes a quantitative corpus-based approach to the problem and looks specifically at a type of variation that has scarcely been dealt with up to now, viz. the relative order of a pronominal object and a nominal subject in the Mittelfeld. Clauses that show this kind of variation have been collected from the NEGRA corpus of German newspaper material and annotated for five factors frequently mentioned in the literature: grammatical function of the arguments, given/new status and animacy of the arguments' referents, difference in length between the arguments, and their occurrence in either a main or a subordinate clause. The effect of these factors has been statistically checked and modelled in a logistic regression model. The results of the statistical analysis show an effect of all factors except grammatical function. The effect of main versus subordinate clause is especially strong, contradicting earlier hypotheses that this factor is only an epiphenomenon of length difference.
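A minimal sketch of the kind of logistic regression model described in the abstract, assuming statsmodels and a fabricated data frame. The column names, codings and values are illustrative, not the study's annotation scheme, so the fitted coefficients mean nothing.

```python
# Sketch of a logistic regression over word-order variation; toy data only.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    # 1 = pronominal object precedes nominal subject, 0 = subject precedes object
    "obj_first":    [1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    "clause_type":  ["sub", "sub", "main", "sub", "main", "sub",
                     "main", "main", "sub", "sub", "main", "main"],
    "subj_given":   [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0],   # 1 = given referent
    "subj_animate": [1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1],   # 1 = animate referent
    "len_diff":     [3, 2, 0, 1, 3, 0, 2, 1, 2, 1, 3, 0],   # subject - object, in words
})

model = smf.logit(
    "obj_first ~ C(clause_type) + subj_given + subj_animate + len_diff",
    data=data,
).fit(disp=False)
print(model.summary())
```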
This article presents the results of a quantitative semantic analysis of domain-specific lexical units in a technical corpus from the field of machine tools for metalworking. The study aims to determine whether, and to what extent, the keywords of the technical corpus are monosemous. To this end, we carry out a simple statistical regression analysis, which makes it possible to study the correlation between the keywords' specificity rank and their monosemy rank, but which raises statistical and methodological problems, notably a frequency bias. To remedy this, we adopt an alternative approach to identifying domain-specific lexical units, namely Stable Lexical Marker Analysis (SLMA). We discuss the quantitative and statistical results of this approach with respect to the correlation between specificity rank and monosemy rank.
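As an illustration of the correlation analysis described above, the sketch below computes a Spearman rank correlation between specificity rank and monosemy rank for a handful of invented keywords; it is a close cousin of the rank-based regression the abstract mentions, and the paper's own set-up and data are not reproduced.

```python
# Hedged sketch: rank correlation between specificity and monosemy rankings.
from scipy.stats import spearmanr

# Hypothetical ranks for ten keywords (1 = most specific / most monosemous).
specificity_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
monosemy_rank    = [2, 1, 4, 3, 7, 5, 6, 10, 8, 9]

rho, p_value = spearmanr(specificity_rank, monosemy_rank)
print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```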
Corpus Studies in Contrastive Linguistics: Introduction
International Journal of Corpus Linguistics
Methodological issues in corpus-based cognitive linguistics
... Using multifactorial statistical techniques, both Gries and Grondelaers are able to examine the combined effect of explanatory variables on syntactic variation. Gries examines the relative strength of variables by pair-wise comparison. ...
The study of lexical collocations occupies a central position in corpus linguistic research. Lexical restrictions on a word's combinatorial possibilities are often an integral part of corpus linguistic analyses and are applied in various domains (e.g. lexicography, language teaching). However, if a corpus is considered a sample of spontaneously realized language use by a linguistic community in (a) given setting(s), it is rather surprising that the settings of actual language use have received little attention in traditional corpus linguistics. In this contribution, we will focus on the impact of the usage settings on the linguistic properties of the language use in a corpus. We will investigate whether lexical collocability is subject to extra-linguistic constraints. Based on a variational case study, viz. the inflectional variation of attributive adjectives in Dutch, it will be demonstrated that the collocational strength of the AN pair is significantly modified by register, region...
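A hedged sketch of how collocational strength could be compared across usage settings: pointwise mutual information for the same adjective-noun pair computed per register subcorpus. The counts, register labels and corpus sizes are invented; the paper's own measure and data are not reproduced.

```python
# Sketch: PMI of an AN bigram, computed separately for two register subcorpora.
import math

def pmi(pair_freq, adj_freq, noun_freq, corpus_size):
    """PMI of an AN bigram given unigram and bigram frequencies."""
    p_pair = pair_freq / corpus_size
    p_adj = adj_freq / corpus_size
    p_noun = noun_freq / corpus_size
    return math.log2(p_pair / (p_adj * p_noun))

# Hypothetical counts for the same Dutch AN pair in two registers.
registers = {
    "newspaper": dict(pair_freq=120, adj_freq=4_000, noun_freq=2_500,
                      corpus_size=10_000_000),
    "usenet":    dict(pair_freq=15,  adj_freq=3_200, noun_freq=2_100,
                      corpus_size=10_000_000),
}
for register, counts in registers.items():
    print(register, round(pmi(**counts), 2))
```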
Studies in Generative Grammar, 2005
Applying word space models to sociolinguistics. Religion names before and after 9/11
Advances in Cognitive Sociolinguistics, 2010
Applying word space models to sociolinguistics. Religion names before and after 9/11. Yves Peirsman, Kris Heylen and Dirk Geeraerts. Abstract: Researchers in disciplines like lexical semantics and critical discourse analysis are in need of a quantitative method that allows them to ...
Lingua, 2015
This paper demonstrates how token-level Word Space Models (a distributional semantic technique that was originally developed in statistical natural language processing) can be developed into a heuristic tool to support lexicological and lexicographical analyses of large amounts of corpus data. The paper provides a non-technical introduction to the statistical methods and illustrates with a case study analysis of the Dutch polysemous noun 'monitor' how token-level Word Space Models in combination with visualisation techniques allow human analysts to identify semantic patterns in an unstructured set of attestations. Additionally, we show how the interactive features of the visualisation make it possible to explore the effect of different contextual factors on the distributional model.
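As an illustration of what "token-level" means here, the sketch below builds a vector for each attestation of 'monitor' by summing the type vectors of its context words, so that the two readings end up with low mutual similarity. The context words and vectors are hypothetical; a real model derives type vectors from corpus co-occurrence counts.

```python
# Minimal sketch of a token-level word space with invented type vectors.
import numpy as np

# Hypothetical type vectors for a handful of Dutch context words.
type_vectors = {
    "scherm":   np.array([0.9, 0.1, 0.0]),   # 'screen'
    "computer": np.array([0.8, 0.2, 0.1]),
    "toezicht": np.array([0.1, 0.9, 0.2]),   # 'supervision'
    "leerling": np.array([0.0, 0.8, 0.3]),   # 'pupil'
}

# Two attestations (tokens) of 'monitor' with their context words.
tokens = {
    "monitor_1": ["scherm", "computer"],      # hardware reading
    "monitor_2": ["toezicht", "leerling"],    # person-who-supervises reading
}
token_vectors = {t: sum(type_vectors[w] for w in ctx) for t, ctx in tokens.items()}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(token_vectors["monitor_1"], token_vectors["monitor_2"]), 2))
```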
Degrees of semantic control in measuring aggregated lexical distances.
As in lexical semantics in general, distributional methods have also proven a successful technique for the automatic modeling of verb meaning. However, much more than with other lexical categories, the research into verb semantics has been based on the idea that a verb's meaning is strongly linked to its syntactic behavior and, more specifically, to its selectional preferences. This has led distributional methods of verb meaning to make use of two distinct types of syntactic contexts to automatically retrieve semantically similar verbs.
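To complement the feature-counting sketch given earlier, the following toy example shows the retrieval step: ranking verbs by cosine similarity over dependency-based count vectors. The Dutch verbs and counts are invented.

```python
# Hedged sketch: nearest-neighbour retrieval over sparse dependency vectors.
import math

verb_vectors = {
    "kopen":    {("obj", "huis"): 4, ("obj", "auto"): 3, ("subj", "klant"): 2},
    "verkopen": {("obj", "huis"): 3, ("obj", "auto"): 2, ("subj", "makelaar"): 2},
    "slapen":   {("subj", "kind"): 5, ("mod", "diep"): 2},
}

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[f] * v[f] for f in shared)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if dot else 0.0

target = "kopen"
neighbours = sorted(
    ((cosine(verb_vectors[target], vec), verb)
     for verb, vec in verb_vectors.items() if verb != target),
    reverse=True,
)
print(neighbours)   # 'verkopen' should come out closest to 'kopen'
```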
Proceedings of the EACL 2012 …, 2012
In statistical NLP, Semantic Vector Spaces (SVS) are the standard technique for the automatic modeling of lexical semantics. However, it is largely unclear how these black-box techniques exactly capture word meaning. To explore the way an SVS structures the individual occurrences of words, we use a non-parametric MDS solution of a token-by-token similarity matrix. The MDS solution is visualized in an interactive plot with the Google Chart Tools. As a case study, we look at the occurrences of 476 Dutch nouns grouped in 214 synsets.
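A minimal sketch of the projection step, assuming scikit-learn: non-metric (non-parametric) MDS applied to a precomputed token-by-token dissimilarity matrix, whose coordinates could then be passed to a plotting library. The similarity values are invented, and the original interactive Google Chart Tools plot is not reproduced here.

```python
# Sketch: non-metric MDS of a token-by-token dissimilarity matrix.
import numpy as np
from sklearn.manifold import MDS

# Hypothetical pairwise similarities between 4 tokens of one noun.
similarity = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
dissimilarity = 1.0 - similarity

mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
coords = mds.fit_transform(dissimilarity)
print(coords)   # 2-D coordinates, one row per token occurrence
```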
Usage-based approaches in Cognitive Linguistics: A technical state of the art
Corpus Linguistics and Linguistic Theory, 2000
This paper presents a technical state of the art in usage-based linguistics as defined in the context of Cognitive Linguistics. Starting from actual case studies rather than theoretical assumptions, methodological issues concerning the usage-based approach are ...