Nuria Gala | Aix-Marseille University
Papers by Nuria Gala
The interest in lexical resources is evolving continuously as a result of different needs and technologies. The lexicon is central to research in various domains: lexicography, vocabulary learning, reading tools, etc. By and large, it is also the basis for natural language processing tools and language technologies in general.
Building and enriching lexical resources remains a costly and time-consuming task that requires competencies in different disciplines. At present, language processing tools enable better coverage of lexicons, as well as more specific, explicit and detailed linguistic information within them. In addition, the methods used to build the resources have become diversified (automatic, collaborative) and the resulting lexicons tend to be increasingly dynamic, designed with a view towards large-scale linked data.
Lastly, lexical resources are a major issue for society, because they are essential in developing tools for learning languages, assistive technologies for reading and writing, etc.
With a view to obtaining the "Habilitation à Diriger des Recherches", this thesis focuses on the lexicon and on lexical resources in general. The issue is addressed through an interdisciplinary approach: as well as describing various resources that we have created and/or enriched, we also offer historical and methodological insight into a number of approaches and applications where the lexicon plays a central role.
The sociolinguistic situation in Arabic countries is characterized by diglossia (Ferguson, 1959): whereas one variant, Modern Standard Arabic (MSA), is highly codified and mainly used for written communication, other variants (the dialects) coexist in everyday situations. Similarly, while a number of resources and tools exist for MSA (lexica, annotated corpora, taggers, parsers, etc.), very few are available for the development of dialectal Natural Language Processing tools.
Taking advantage of the closeness of MSA and its dialects, one way to address the lack of resources for dialects is to exploit available MSA resources and NLP tools and adapt them to process the dialects. This paper adopts this general framework: we propose a method to build a lexicon of deverbal nouns for Tunisian (TUN) using MSA tools and resources as starting material.
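As a rough illustration of the pivot idea (not the paper's actual procedure, which relies on MSA morphological tools), here is a minimal Python sketch that proposes TUN deverbal-noun candidates from a hypothetical MSA lexicon and a hypothetical MSA-to-TUN verb mapping; all entries and transliterations are invented.

```python
# Minimal sketch: derive candidate Tunisian (TUN) deverbal nouns by
# pivoting through an MSA lexicon. All entries below are illustrative
# placeholders, not data from the paper.

# Hypothetical MSA lexicon: verb lemma -> deverbal noun
msa_deverbals = {
    "kataba": "kitAba",   # to write -> writing
    "darasa": "dirAsa",   # to study -> study
}

# Hypothetical MSA -> TUN verb correspondences (the real mapping is
# built with MSA morphological tools, not a hand-written table)
msa_to_tun_verbs = {
    "kataba": "ktib",
    "darasa": "qra",
}

def tun_deverbal_candidates(msa_lex, verb_map):
    """Propose TUN deverbal-noun entries by pivoting through MSA."""
    candidates = {}
    for msa_verb, noun in msa_lex.items():
        tun_verb = verb_map.get(msa_verb)
        if tun_verb is not None:
            # Keep the MSA noun as a candidate pending manual validation
            candidates[tun_verb] = noun
    return candidates

print(tun_deverbal_candidates(msa_deverbals, msa_to_tun_verbs))
```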
Analysing lexical complexity is a task that has mainly attracted the attention of psycholinguists and language teachers. More recently, this issue has seen growing interest in the field of Natural Language Processing (NLP) and, in particular, that of automatic text simplification. The aim of this task is to identify words and structures which may be difficult for a target audience to understand, and to provide automated tools to simplify these contents. This article focuses on the lexical issue by identifying a set of predictors of lexical complexity whose efficiency is assessed with a correlational analysis. The best of those variables are integrated into a model able to predict the difficulty of words for learners of French.
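For readers unfamiliar with this setup, the following minimal sketch illustrates the two steps named above: a correlational analysis of candidate predictors, followed by a simple classifier over the stronger ones. The features (frequency, length, polysemy) are common choices in the literature; the data and model here are toy placeholders, not the article's.

```python
# Minimal sketch: (1) rank candidate predictors of word difficulty by
# correlation, (2) feed the best ones to a classifier. Toy data only.
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression

# Per-word features and a difficulty label (0 = easy, 1 = hard)
frequency  = [6.2, 5.8, 2.1, 1.5]     # e.g. log frequency in a corpus
length     = [4, 6, 9, 9]             # number of letters
n_senses   = [3, 4, 2, 1]             # polysemy count
difficulty = [0, 0, 1, 1]

# Step 1: correlational analysis of each candidate predictor
for name, values in [("frequency", frequency), ("length", length),
                     ("n_senses", n_senses)]:
    rho, p = spearmanr(values, difficulty)
    print(f"{name}: rho={rho:.2f} (p={p:.2f})")

# Step 2: keep the stronger predictors and train a simple model
X = list(zip(frequency, length))
model = LogisticRegression().fit(X, difficulty)
print(model.predict([[2.0, 8]]))  # predicted class for an unseen word
```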
In this paper we present FLELex, the first graded lexicon for French as a foreign language (FFL) that reports word frequencies by difficulty level (according to the CEFR scale). It has been obtained from a tagged corpus of 777,000 words from available textbooks and simplified readers intended for FFL learners. Our goal is to freely provide this resource to the community, to be used for a variety of purposes ranging from assessing the lexical difficulty of a text to selecting simpler words within text simplification systems, or serving as a dictionary in assistive writing tools.
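A minimal sketch of what a graded frequency count of this kind looks like; the toy corpus lines and the "first level of occurrence" heuristic are illustrative assumptions, not FLELex's actual construction pipeline.

```python
# Minimal sketch of a graded frequency count in the spirit of FLELex:
# tally how often each lemma appears in texts of each CEFR level.
from collections import defaultdict

# (lemma, CEFR level of the text it occurs in) -- toy tagged corpus
tagged_corpus = [
    ("maison", "A1"), ("maison", "A2"), ("maison", "B1"),
    ("paradigme", "C1"), ("paradigme", "C2"),
]

levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
counts = defaultdict(lambda: {lvl: 0 for lvl in levels})
for lemma, level in tagged_corpus:
    counts[lemma][level] += 1

# A word's first level of occurrence is a rough difficulty index
for lemma, dist in counts.items():
    first = next(lvl for lvl in levels if dist[lvl] > 0)
    print(lemma, dist, "-> first seen at", first)
```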
The readability of a text depends on a number of linguistic factors, among which is its lexical complexity. In this paper, we specifically explore this issue: our aim is to characterize the criteria that make a word easy to understand independently of the context in which it appears. Yet such a concern must be addressed with particular groups of individuals in mind. In our case, we have focused on language production from patients with language disorders. The results obtained from corpus analysis enable us to define a number of variables which are compared to information from existing resources. Such measures are used in a classification model to predict the degree of difficulty of words and to build a lexical resource, called ReSyf, in which words and their synonyms are classified according to three levels of complexity.
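The sketch below illustrates the final step: ranking a synonym set into three complexity levels. The difficulty score is a toy proxy standing in for the paper's trained classification model, and the synonym data are invented.

```python
# Minimal sketch of the ReSyf idea: order the synonyms of a word by a
# difficulty score and bin them into three complexity levels.

def difficulty_score(word, log_freq):
    # Toy proxy: rarer and longer words score as harder
    return len(word) - log_freq

# Hypothetical synonym set with invented log frequencies
synonyms = {"voir": 6.5, "apercevoir": 3.2, "discerner": 2.4}

ranked = sorted(synonyms, key=lambda w: difficulty_score(w, synonyms[w]))
third = max(1, len(ranked) // 3)
levels = {
    "easy":   ranked[:third],
    "medium": ranked[third:2 * third],
    "hard":   ranked[2 * third:],
}
print(levels)
```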
Text readability depends on a variety of variables. While lexico-semantic and syntactic factors have been widely used in the literature, higher-level discursive and cognitive properties such as cohesion and coherence have received little attention. This paper assesses the efficiency of 41 measures of text cohesion and text coherence as predictors of text readability. We compare results obtained manually on two corpora of texts with different difficulty levels, and show that some cohesive features are indeed useful predictors.
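As an example of the kind of feature evaluated here, the sketch below computes one generic cohesion measure, lexical overlap between adjacent sentences; it is not necessarily one of the paper's 41 measures.

```python
# Minimal sketch of a cohesion feature: average lexical overlap
# between adjacent sentences (higher overlap = more cohesive text).

def sentence_overlap(text):
    """Mean Jaccard overlap of word types between adjacent sentences."""
    sents = [s.split() for s in text.lower().split(".") if s.strip()]
    if len(sents) < 2:
        return 0.0
    overlaps = []
    for a, b in zip(sents, sents[1:]):
        shared = set(a) & set(b)
        overlaps.append(len(shared) / max(len(set(a) | set(b)), 1))
    return sum(overlaps) / len(overlaps)

cohesive    = "The cat sleeps. The cat dreams. The cat wakes up."
disjointed  = "The cat sleeps. Economic policy diverged. Rain fell in Oslo."
print(sentence_overlap(cohesive), sentence_overlap(disjointed))
```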
Lexical resources have undergone significant changes with the generalized use of computers and the advent of the Internet. However, while such changes amount to a revolution when machine-readable dictionaries are compared to their paper 'ancestors', machine-readable dictionaries, compiled for human readers, still have serious limitations. Natural language processing lexicons, initially developed for NLP applications, have shed light on some of these shortcomings. In this presentation, we will attempt to bring new elements to NLP approaches aimed at developing the lexical resources of today and tomorrow, in particular using morphological and semantic information to better access lexical items. Special focus will be given to the semantic and multilingual sides. Our argument is that nowadays lexical resources 1) should be useful both for humans and machines, 2) can be constructed in ways alternative to classical lexicographic work, and 3) provide novel accesses and usages that are feasible only in the context of computer and user networks. These points will be illustrated by means of two resources under development: LexRom, as an example of morphological form-based multilingual access, and the lexical network of JeuxDeMots, as an illustration of associative and semantic access.
Electronic dictionaries offer many possibilities unavailable in paper dictionaries for viewing, displaying or accessing information. However, even these resources fall short when it comes to accessing words sharing semantic features and certain aspects of form: few applications offer the possibility to access a word via a morphologically or semantically related word. In this paper, we present such an application, POLYMOTS, a lexical database for contemporary French containing 20,000 words grouped into 2,000 families. The purpose of this resource is to group words into families on the basis of shared morpho-phonological and semantic information. Words with a common stem form a family; words in a family also share a set of common conceptual fragments (in some families there is a continuity of meaning, in others meaning is distributed). With this approach, we capitalize on the bidirectional link between semantics and morpho-phonology: the user can thus access words not only on the basis of ideas, but also on the basis of the formal characteristics of a word, i.e. its morphological features. The resulting lexical database should help people learn French vocabulary and assist them in finding the words they are looking for, thus going beyond other existing lexical resources.
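A minimal sketch of the formal side of this grouping: clustering words by a shared stem. The stems are hand-listed here, whereas the actual resource pairs this criterion with semantic analysis.

```python
# Minimal sketch of form-based family grouping in the spirit of
# POLYMOTS: cluster words that share a stem.
from collections import defaultdict

# Hypothetical (word, stem) pairs
lexicon = [
    ("lait", "lait"), ("laitier", "lait"), ("allaiter", "lait"),
    ("terre", "terr"), ("terrain", "terr"), ("atterrir", "terr"),
]

families = defaultdict(list)
for word, stem in lexicon:
    families[stem].append(word)

for stem, members in families.items():
    print(stem, "->", members)
```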
This paper describes an unsupervised method to lexicalise a robust parser grammar for French in order to improve prepositional phrase (PP) attachment. The ambiguous attachments produced by the parser after a first analysis of an input text are transformed into queries used to find and download documents from the Web in which the words involved occur. The collected corpus is parsed and, from the parsing results, we acquire statistical information on PP-attachment configurations, hence building a weighted subcategorisation lexicon. This automatically acquired subcategorisation information is used in a second analysis of the input text in order to improve the disambiguation of multiple PP-attachments.
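A minimal sketch of how such a weighted lexicon can serve the second analysis; the counts and the selection rule are illustrative assumptions, not the paper's figures.

```python
# Minimal sketch of frequency-weighted PP-attachment selection: given
# counts of (head, preposition) configurations harvested from a parsed
# corpus, prefer the better-attested attachment. Counts are invented.

# Hypothetical weighted subcategorisation entries: (head, prep) -> count
subcat = {
    ("manger", "avec"): 120,   # verb attachment: "manger avec une fourchette"
    ("salade", "avec"): 15,    # noun attachment
}

def choose_attachment(candidates, prep):
    """Pick the candidate head with the highest corpus weight."""
    return max(candidates, key=lambda head: subcat.get((head, prep), 0))

# Ambiguous sentence: "Il mange la salade avec une fourchette."
print(choose_attachment(["manger", "salade"], "avec"))  # -> "manger"
```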
This article presents a state of the art of existing robust parsers and proposes a more efficient automatic syntactic corpus annotation system, based on a diagnosis performed prior to the application of specialized grammars. After describing several parsers and showing their limitations in processing certain corpora, a two-stage parsing approach is proposed. The grammar modules first formalize sentences considered as core, then specific syntactic phenomena involving punctuation or producing structural ambiguities. The advantage of this approach is, for any type of corpus, the application of the same stable, optimized grammar, followed by the adaptation of the parser according to the presence of certain phenomena, which are treated specifically. This strategy guarantees high precision and recall rates whatever the typology of the corpus.
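A minimal sketch of the two-stage strategy described above; the diagnosis heuristics and module names are invented for illustration.

```python
# Minimal sketch of two-stage parsing: apply a stable core grammar to
# every sentence, then trigger specialized modules only when a prior
# diagnosis detects the phenomena they handle.

def diagnose(sentence):
    """Detect phenomena that call for a specialized grammar module."""
    phenomena = []
    if any(c in sentence for c in ";:()"):
        phenomena.append("punctuation")
    if " que " in sentence or " qui " in sentence:
        phenomena.append("subordination")
    return phenomena

def parse(sentence):
    analysis = f"core({sentence!r})"          # stage 1: core grammar
    for phenomenon in diagnose(sentence):     # stage 2: targeted modules
        analysis = f"{phenomenon}({analysis})"
    return analysis

print(parse("Il pense que Marie viendra (demain)."))
```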
In this paper, we propose a method combining unsupervised learning of lexical frequencies with semantic information, aiming at improving PP-attachment ambiguity resolution. Using the output of a robust parser, i.e. the set of all possible attachments for a given sentence, we query the Web and obtain statistical information about the frequency distributions of the attachments, as well as lexical signatures of the terms in the patterns. All this information is used to weight the dependencies yielded by the parser.
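A minimal sketch of combining the two evidence sources; the linear interpolation and all numbers are illustrative assumptions, not the paper's actual weighting scheme.

```python
# Minimal sketch: weight each candidate attachment by its Web-derived
# frequency and by the semantic similarity of its lexical signature.

# candidate dependency -> (relative Web frequency, signature similarity)
evidence = {
    ("manger", "avec", "fourchette"): (0.80, 0.70),  # instrument reading
    ("salade", "avec", "fourchette"): (0.20, 0.10),
}

def weight(freq, sim, alpha=0.6):
    """Linear interpolation of frequency and semantic evidence."""
    return alpha * freq + (1 - alpha) * sim

best = max(evidence, key=lambda dep: weight(*evidence[dep]))
print(best)  # -> ("manger", "avec", "fourchette")
```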
Books by Nuria Gala
Lexical resources store knowledge concerning words, their meanings and uses. While dictionaries were once confined to printed media, there is now a variety of tools available in electronic form for different purposes. The way we look at these resources (their creation and use) has changed dramatically over the last few decades. Indeed, there is hardly any task in Natural Language Processing which can be conducted without them. While they were built by hand in the past, lexical resources are nowadays built with the help of machines, more or less automatically. Also, rather than being conceived as static entities (the database view), lexical resources are often viewed as graphs, whose nodes and links (connection strengths) may change over time. Interestingly, properties concerning topology, clustering and evolution known from other disciplines also apply to lexical resources: everything is linked, hence accessible, and everything is evolving. While the field is still evolving, a snapshot may nevertheless be useful to reveal where we stand. This is precisely one of the goals of this volume.