Olivier Ferret - Academia.edu
Papers by Olivier Ferret
Int J Speech Technol. Abstract: No doubt, words play a major role in language production; hence finding them is of vital importance, be it for writing or for speaking (spontaneous discourse production, simultaneous translation). Words are stored in a dictionary, and the general belief holds that the more entries, the better. Yet, to be truly useful, the resource should contain not only many entries and a lot of information concerning each one of them, but also adequate navigational means to reveal the stored information. Information access depends crucially on the organization of the data (words) and the access keys (meaning/form), two factors largely overlooked. We present here some ideas on how an existing electronic dictionary could be enhanced to help a speaker/writer find the word s/he is looking for. To this end, we suggest adding to an existing electronic dictionary an index based on the notion of association, i.e. words co-occurring in a well-balanced corpus, the latter being supposed to represent the average citizen's knowledge of the world. Before describing our approach, we will briefly take a critical look at [...] that is, computer-generated language, simulation of the mental lexicon, or WordNet (WN), to see how adequate they are with regard to our goal.
… of TREC9, NIST, …, Jan 1, 2000
Lecture Notes in Computer Science, 2010
One of the early applications of Information Extraction, motivated by the need for intelligence tools, is the detection of events in news articles. This detection can be difficult when news articles mention several occurrences of events of the same kind, which is often done for comparison purposes. We propose in this article new approaches to segment the text of news articles into units that each relate to a single event, in order to help identify the relevant information associated with the main event of the news. We present two approaches that use statistical machine learning models (HMM and CRF) exploiting temporal information extracted from the texts as a basis for this segmentation. The evaluation of these approaches in the domain of seismic events shows that, with a robust and generic approach, we can achieve results at least as good as those obtained with a specialized heuristic approach.
Abstract: We present in this paper a method for achieving in an integrated way two tasks of topic analysis: segmentation and link detection. This method combines the lexical recurrence in texts and the relations from a collocation network to compensate for the respective weaknesses of the two approaches. We report its evaluation for segmentation on a corpus in French and another in English, and we propose an evaluation measure specifically suited to this type of system.
Text, Speech and Language Technology, 2014
Meeting of the Association for Computational Linguistics, 1998
This article outlines a quantitative method for segmenting texts into thematically coherent units. This method relies on a network of lexical collocations to compute the thematic coherence of the different parts of a text from the lexical cohesiveness of their words. We also present the results of an experiment on locating boundaries between a series of concatenated texts.
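As a rough illustration of the idea in this abstract, the sketch below scores the lexical cohesion between two adjacent word windows using a toy collocation network and places a boundary where that score reaches a low local minimum. It is not the paper's algorithm: the network weights, the window size, the threshold, and the names `cohesion`, `segment`, and `TOY_NETWORK` are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy collocation network: unordered word pair -> cohesion weight.
Network = Dict[FrozenSet[str], float]

TOY_NETWORK: Network = {
    frozenset(("bank", "loan")): 0.9,
    frozenset(("loan", "credit")): 0.8,
    frozenset(("bank", "credit")): 0.7,
    frozenset(("river", "water")): 0.9,
    frozenset(("water", "fish")): 0.8,
    frozenset(("river", "fish")): 0.7,
}


def cohesion(left: List[str], right: List[str], network: Network) -> float:
    """Average link weight over all word pairs spanning the two windows.

    Identical words count as fully cohesive (weight 1.0); pairs that are
    absent from the network count as 0.0.
    """
    pairs = [(a, b) for a in left for b in right]
    if not pairs:
        return 0.0
    total = sum(
        1.0 if a == b else network.get(frozenset((a, b)), 0.0)
        for a, b in pairs
    )
    return total / len(pairs)


def segment(words: List[str], network: Network,
            window: int = 2, threshold: float = 0.2) -> List[int]:
    """Return indices i such that a boundary is placed just before words[i]."""
    # Cohesion between the window ending at i and the window starting at i.
    scores = {
        i: cohesion(words[i - window:i], words[i:i + window], network)
        for i in range(window, len(words) - window + 1)
    }
    boundaries = []
    for i, score in scores.items():
        left = scores.get(i - 1, float("inf"))
        right = scores.get(i + 1, float("inf"))
        # A boundary is a local minimum that also falls below the threshold.
        if score < threshold and score < left and score <= right:
            boundaries.append(i)
    return boundaries


words = ["bank", "loan", "credit", "loan", "river", "water", "fish", "water"]
print(segment(words, TOY_NETWORK))  # [4]
```

Here the only boundary falls before "river", where no word on the left is linked to any word on the right in the toy network.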
Proceedings of the 20th international conference on Computational Linguistics - COLING '04, 2004
Lexico-semantic networks such as WordNet have been criticized for the nature of the senses they distinguish as well as for the way they define these senses. In this article, we present a possible solution to overcome these limits by defining the senses of words from the way they are used. More precisely, we propose to differentiate the senses of a word from a network of lexical co-occurrences built from a large corpus. This method was tested for both French and English and was evaluated for English by comparing its results with WordNet.
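One simple way to picture sense discrimination from a co-occurrence network is to group the neighbours of a target word according to how they are linked to one another. The sketch below does this with plain connected components, which is a deliberately simplified stand-in and not the clustering used in the paper; the edge list and the names `build_graph` and `discriminate_senses` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected co-occurrence graph from word pairs."""
    graph: Graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def discriminate_senses(target: str, graph: Graph) -> List[Set[str]]:
    """Group the neighbours of `target` into candidate senses.

    Assumption for this sketch: two neighbours belong to the same sense
    if they are connected through other neighbours of the target word,
    i.e. each sense is a connected component of the neighbour subgraph.
    """
    unseen = set(graph.get(target, set()))
    senses: List[Set[str]] = []
    while unseen:
        start = min(unseen)  # deterministic starting point
        unseen.remove(start)
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            # Only follow links that stay inside the target's neighbourhood.
            for nxt in graph[node] & unseen:
                unseen.remove(nxt)
                component.add(nxt)
                stack.append(nxt)
        senses.append(component)
    return senses


# Toy co-occurrence edges, invented for illustration.
toy_graph = build_graph([
    ("bar", "drink"), ("bar", "beer"), ("drink", "beer"),
    ("bar", "lawyer"), ("bar", "court"), ("lawyer", "court"),
])
print(discriminate_senses("bar", toy_graph))
```

With these toy edges, the neighbours of "bar" split into one group containing "beer" and "drink" and another containing "court" and "lawyer".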
Proceedings of the 36th annual meeting on Association for Computational Linguistics -, 1998
This article outlines a quantitative method for segmenting texts into thematically coherent units. This method relies on a network of lexical collocations to compute the thematic coherence of the different parts of a text from the lexical cohesiveness of their words. We also present the results of an experiment on locating boundaries between a series of concatenated texts.
Proceedings of the 19th international conference on Computational linguistics -, 2002
We present in this paper a method for achieving in an integrated way two tasks of topic analysis: segmentation and link detection. This method combines word repetition and the lexical cohesion stated by a collocation network to compensate for the respective weaknesses of the two approaches. We report an evaluation of our method for segmentation on two corpora, one in French and one in English, and we propose an evaluation measure that specifically suits this kind of system.
Topic segmentation and topic identification for a document are often treated as separate problems, even though both are part of topic analysis. In this article, we examine how topic identification can help improve document segmentation when the latter relies only on lexical recurrence. We first present an unsupervised method for discovering the topics of a document; we then detail how these topics are used in segmentation to help recognize topical similarities between document segments. Finally, through an evaluation carried out for both French and English, we show the practical value of the proposed method.
... [BFB+07] Florian Boudin, Benoît Favre, Frédéric Béchet, Marc El-Bèze, Laurent Gillard, and Juan-Manuel Torres-Moreno ... In SIGIR'98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335-336 ...