Benoît Sagot | Institut National de Recherche en Informatique et Automatique (INRIA)

Drafts by Benoît Sagot

What is Old is New again: PIE Secondary Roots with Fossilised Preverbs

Talk given at the University of Leiden, May 29th 2019

Prefixal productivity is attested in all Indo-European languages and is reconstructed in the proto-languages of all major Indo-European language families. It is particularly important for understanding the origin of many non-primary verbal roots in all these languages. Surprisingly, only a handful of etymons involving prefixation have been reconstructed at the PIE level. A systematic study of this word formation process in PIE remains to be carried out. It could result in a better understanding of the origin of a number of PIE roots, especially complex roots with limited attestation, and help explain attested words in daughter languages that need a convincing etymology.
In our talk, we will show how a better understanding of the role of prefixes in (secondary) verbal root formation can result in new etymological insights. Such analyses have already been proposed for several examples. For instance, with the compensatory lengthening *Ce=HC- > *CV̄C-, Weiss (1993) analyses Lat. pālārī ‘to wander’ as reflecting *pe=h2lh2-ó- > *pālH-āye/o-. Another classical example of the same prefix is Arm. p‘law < *p‘ulaw < *pōlH-to based on *pe=h3lh1-, as also P.-Germ. *fall-an- ‘to fall’ < *pŏlle/o- (with Osthoff’s shortening) < *pōlle/o- < *pōlH-é/ó- (Praust 2005; Neri 2007; Kroonen 2013: 125–6; Dunkel 2014 II: 82). Other examples include PIE *pro=h1ed- ‘to devour’ > P.-Germ. *fr(a)-et-an- (Scheungraber 2016: §4), PIE *kom=pro=h1ṇḱ- ‘to bring’ > P.-Germ. *breng-an- (Kroonen 2013: 77), PIE *°kom=h1ep- ‘to give’ > P.-Germ. *geb-an- (Kortlandt 1992) and *pe=h2ṛk- > Lat. parcō (Weiss 1993; cf. Hitt. pē=ḫark- ‘to hold off’).
We intend to show that this word formation process is far more widespread than usually thought. We shall discuss several examples, thus unifying sets of semantically similar roots and proposing novel etymologies for several difficult words. These views may shed some light on the possibility that PIE may once have been a satellite-framed language.

A new PIE root *h1er '(to be) dark red, dusk red': drawing the line between inherited and borrowed words for 'red(ish)', 'pea', 'ore', 'dusk' and 'love' in daughter languages (conference handout)

Papers by Benoît Sagot

Metathesis of Proto-Indo-European Sonorants

Münchener Studien zur Sprachwissenschaft 73/1, 2019, 29-53.

Proto-Indo-European roots as reconstructed in the literature regularly result in formally and semantically similar yet distinct roots. Amongst them, some of the doublets of the form CeH/Cei̯H and CeH/Ceu̯H have been explained as reflecting laryngeal metathesis based on nominal formations in -i- and -u-, thus creating secondary roots. In this paper, we advocate for a more general version of this law, whereby any glide or resonant can be involved in such a metathesis. It explains a number of doublets of the form CeH/CeRH as the result of such a metathesis based on nominal forms exhibiting specific stress patterns. It results in new etymological proposals as well as in a reanalysis of the facts generally explained as reflecting the so-called Saussure effect.

New results on a centum substratum in Greek: the Lydian connection

Loanwords and Substrata. Proceedings of the Colloquium held in Limoges (June 5th-7th 2018), Innsbruck: Innsbrucker Beiträge zur Sprachwissenschaft, Band 164, 2020, 169-200.

The present study aims at introducing a consistent set of new etymological proposals concerning a number of unetymologised Greek words, analysing them as borrowings from Lydian or a pre-stage thereof. Lydian being very scarcely attested, these proposals must be considered tentative. Yet they could reflect under-recognised linguistic and cultural contacts between Lydians (or Pre-Lydians) and Greeks. These proposals constitute a follow-up to our article on a centum substratum in Greek (GARNIER & SAGOT 2017). Some of the analyses proposed in that paper must be modified, perhaps even abandoned, but a number of them are fully compatible with an Anatolian source language, sometimes with typical Lydian features. We will therefore conclude with a brief critical reading of GARNIER & SAGOT (2017), showing that the centum substratum language we proposed in that paper could well be (a pre-form of) Lydian.
In the present paper, we will propose a survey of Lydian place-names and personal names in Greek transcription (1), then possible Greek loanwords from Lydian (2), before exploring other Anatolian origins: the Lycian connection (3) and finally obscure Greek words of unrecognized Greek origin (4).

Controllable Sentence Simplification

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however, multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge te...

Sentiment analysis of write-in comments related to organisational change

Purpose: Sentiment analysis is becoming increasingly popular in social media. In organisational research, however, this technique is underutilised. This paper aims to explore employee and manager sentiments related to a recently announced organisational change that were investigated as part of an employee opinion survey. Design/Method: The data studied here are from an employee opinion survey. Over 5600 employees participated in the survey and 2262 commented on the recent organisational transformation. Comments were coded into key themes and their tone was measured using sentiment analysis. The sentiment scores were linked to the themes identified in the comments. Results: Mostly negative sentiments were reported regarding the change. Employees were dissatisfied with the lack of information about redundancies, which related to the loss of trust employees expressed towards management and to general frustration. Differences in sentiments towards specific topics were found in manager and employee populat...

CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing

Following the development of the universal dependencies (UD) framework and the CoNLL 2017 Shared Task on end-to-end UD parsing, we address the need for a universal representation of morphological analysis which on the one hand can capture a range of different alternative morphological analyses of surface tokens, and on the other hand is compatible with the segmentation and morphological annotation guidelines prescribed for UD treebanks. We propose the CoNLL universal lattices (CoNLL-UL) format, a new annotation format for word lattices that represent morphological analyses, and provide resources that obey this format for a range of typologically different languages. The resources we provide are harmonized with the two-level representation and morphological annotation in their respective UD v2 treebanks, thus enabling research on universal models for morphological and syntactic parsing, in both pipeline and joint settings, and presenting new opportunities in the development of UD re...

A new PIE root *h1er ‘(to be) dark red, dusk red’: drawing the line between inherited and borrowed words for ‘red(ish)’, ‘pea’, ‘ore’, ‘dusk’ and ‘love’ in daughter languages

Improving neural tagging with lexical information

Neural part-of-speech tagging has achieved competitive results with the incorporation of character-based and pre-trained word embeddings. In this paper, we show that a state-of-the-art bi-LSTM tagger can benefit from using information from morphosyntactic lexicons as additional input. The tagger, trained on several dozen languages, shows a consistent, average improvement when using lexical information, even when also using character-based embeddings, thus showing the complementarity of the different sources of lexical information. The improvements are particularly important for the smaller datasets.

A multilingual collection of CoNLL-U-compatible morphological lexicons

We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.

Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2017

Crowdsourcing for Language Resource Development: Criticisms About Amazon Mechanical Turk Overpowering Use

Lecture Notes in Computer Science, 2014

Dictionary-Ontology Cross-Enrichment Using TLFi and WOLF to enrich one another

Les catégories prédicatives dans le Lefff

We present the Lefff, a freely available morphological and syntactic lexicon of French that is widely used in the NLP community. We focus on the representation of the syntactic properties of predicative elements. First, we describe the linguistic principles implemented in the Lefff (syntactic function, realisation, redistribution). Second, we give an overview of various automatic techniques that have helped extend or improve the Lefff, notably by interpreting other resources and comparing them with the Lefff. We nevertheless stress the importance of prior linguistic analysis and of a posteriori manual validation, without which the quality of the resource would not be sufficient. We give a few examples of the use of the Lefff in syntactic parsers, notably contrasting it with the use of valency lexicons automatically extracted from corpora and with the use of other lexical resources. We conclude with an overview of ongoing work, which notably concerns the coupling of the Lefff with semantic resources such as the WOLF wordnet and a FrameNet-style lexicon under development.

Sous-catégorisation en pour et syntaxe lexicale

A Language-Independent Approach to Extracting Derivational Relations from an Inflectional Lexicon

Producción eficiente de recursos lingüísticos: proyecto Victoria

Développement de ressources pour le persan: le nouveau lexique morphologique PerLex 2 et l'étiqueteur morphosyntaxique MElt-fa

Abstract: In this article we present a new version of PerLex, a morphological lexicon of Persian; a corrected and partially re-annotated version of the BijanKhan tagged corpus (BijanKhan, 2004); and MEltfa, a new, freely available part-of-speech tagger for Persian. Having developed a first version of PerLex (Sagot & Walther, 2010), we propose here an improved version. In addition to partial manual validation, PerLex 2 now relies on an inventory of categories ...

Développement de ressources pour le persan: PerLex2, nouveau lexique morphologique et MElt_fa, étiqueteur morphosyntaxique

TALN 2011

Abstract. We present a new version of PerLex, a morphological lexicon of Persian; a corrected and partially re-annotated version of the BijanKhan tagged corpus (BijanKhan, 2004); and MEltfa, a new, freely available part-of-speech tagger for Persian. Having developed a first version of PerLex (Sagot & Walther, 2010), we propose here an improved version. In addition to partial manual validation, PerLex 2 now relies on an inventory of categories that are linguistically ...

Développement de ressources pour le persan: lexique morphologique et chaîne de traitements de surface

Abstract: We present PerLex, a freely available, wide-coverage morphological lexicon of Persian, together with a surface processing chain for this language. We describe some characteristics of Persian morphology and how we represented it in the Alexina lexical formalism, on which PerLex is based. We emphasise the methodology we used to build the lexical entries from various sources, as well as the problems related to normalisation ...

TCOF-POS: un corpus libre de français parlé annoté en morphosyntaxe

Les Grammaires à Concaténation d’Intervalles (RCG) comme formalisme grammatical pour la linguistique

The aim of this article is to show why Range Concatenation Grammars (RCG) are a formalism particularly well suited to describing natural language. We first explain that the power needed to describe natural language is that of PTIME. Then, among the grammatical formalisms with this expressive power, we justify the choice of RCGs. Finally, after an overview of their definition and properties, we show how using them as linguistic grammars makes it possible to handle complex syntagmatic phenomena, to perform syntactic parsing and the checking of various constraints (morphosyntactic, lexical-semantic) simultaneously, and to dynamically build modular linguistic grammars.

Milk and the Indo-Europeans

Language Dispersal Beyond Farming, 2017

The Yamnaya culture, often regarded as the bearer of the Proto-Indo-European language, underwent a strong population expansion in the late 4th and early 3rd millennia BCE. It suggests that the underlying reason for that expansion might be the then unique capacity to digest animal milk in adulthood. We examine the early Indo-European milk-related vocabulary to confirm the special role of animal milk in Indo-European expansions. We show that Proto-Indo-European did not have a specialized root for ‘to milk’ and argue that the IE root *h2melg̑- ‘to milk’ is secondary and post-Anatolian. We take this innovation as an indication of the novelty of animal milking in early Indo-European society. Together with a detailed study of language-specific innovations in this semantic field, we conclude that the ability to digest milk played an important role in boosting Proto-Indo-European demography.