Philippe Langlais - Academia.edu
Papers by Philippe Langlais
Lecture Notes in Computer Science, 2002
The past decade has witnessed exciting work in the field of Statistical Machine Translation (SMT). However, accurate evaluation of its potential in real-life contexts is still an open issue. In this study, we investigate the behavior of an SMT engine faced with a corpus far different from the one it has been trained on. We show that terminological databases are obvious resources that should be used to boost the performance of a statistical engine. We propose and evaluate a way of integrating terminology into an SMT engine which yields a significant reduction in word error rate.
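The abstract does not detail the integration mechanism, so the following is only a hypothetical sketch: terminology entries injected as high-scoring phrase-table rows so that a phrase-based decoder prefers them. The data structures and the bonus score are illustrative, not the paper's method.

```python
# Hypothetical sketch: bias a phrase-based decoder toward terminology matches
# by injecting dictionary entries as extra phrase-table rows with a high score.

def inject_terminology(phrase_table, term_db, bonus=0.9):
    """phrase_table: dict mapping source phrase -> list of (target, score).
    term_db: dict mapping source term -> preferred target term."""
    for src, tgt in term_db.items():
        candidates = phrase_table.setdefault(src, [])
        # Prepend the terminological translation with a high score so the
        # decoder prefers it over corpus-derived alternatives.
        candidates.insert(0, (tgt, bonus))
    return phrase_table

table = {"pression artérielle": [("artery pressure", 0.4)]}
terms = {"pression artérielle": "blood pressure"}
inject_terminology(table, terms)
print(table["pression artérielle"][0])  # ('blood pressure', 0.9)
```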
Eleven groups participated in the event. This paper describes the goals, the task definition and resources, as well as results and some analysis.
Text prediction is a form of interactive machine translation that is well suited to skilled translators. In principle it can assist in the production of a target text with minimal disruption to a translator's normal routine. However, recent evaluations of a prototype prediction system showed that it significantly decreased the productivity of most translators who used it. In this paper, we analyze the reasons for this and propose a solution which consists in seeking predictions that maximize the expected benefit to the translator, rather than just trying to anticipate some amount of upcoming text. Using a model of a "typical translator" constructed from data collected in the evaluations of the prediction prototype, we show that this approach has the potential to turn text prediction into a help rather than a hindrance to a translator.
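A minimal sketch of the expected-benefit criterion, under an assumed user model in which an accepted prediction of length k saves k keystrokes and every proposal costs a fixed reading effort; the cost constant and acceptance probabilities below are placeholders, not the paper's translator model.

```python
# Illustrative sketch of benefit-maximizing prediction selection.
# READ_COST: keystroke-equivalent cost of reading any proposal (assumed).
READ_COST = 2.0

def expected_benefit(p_accept, k):
    # Benefit if accepted: k keystrokes saved; the reading cost is paid always.
    return p_accept * k - READ_COST

def best_prediction(candidates):
    """candidates: list of (text, estimated acceptance probability)."""
    scored = [(expected_benefit(p, len(text)), text) for text, p in candidates]
    benefit, text = max(scored)
    return text if benefit > 0 else None  # propose nothing if not helpful

print(best_prediction([("translation", 0.3), ("the", 0.9)]))  # 'translation'
```

Under this model, a long prediction with modest acceptance odds can beat a short near-certain one, and the system stays silent when no candidate is worth the reading effort.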
This paper describes the work achieved in the first half of a 4-year cooperative research project (ARCADE), financed by AUPELF-UREF. The project is devoted to the evaluation of parallel text alignment techniques. In its first period ARCADE ran a competition between six systems on a sentence-to-sentence alignment task which yielded two main types of results. First, a large reference bilingual corpus comprising texts of different genres was created, each presenting various degrees of difficulty with respect to the alignment task. Second, significant methodological progress was made both on the evaluation protocols and metrics, and on the algorithms used by the different systems. For the second phase, which is now underway, ARCADE has been opened to a larger number of teams who will tackle the problem of word-level alignment. In the last few years, there has been a growing interest in parallel text alignment techniques. These techniques attempt to map various textual units to th...
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law
In recent years, transformer [4] and BERT [1] models have been widely used in plain NLP tasks, with the assumption that models first pretrained on massive corpora and then fine-tuned on the dataset of a given task may suffice to achieve significant improvements. At the intersection of machine learning and law, legal judgment prediction (LJP) is a task that aims at predicting the outcome of a lawsuit based on a representation of the case. Such a task is usually formalized in NLP as text classification, with different classes or labels corresponding to the verdicts. One specificity of court rulings is that their decisions are based on the application of legal articles to the facts described by the two parties (applicant and defendant).
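A minimal sketch of the text-classification formalization, assuming the Hugging Face transformers library; the checkpoint and the three-verdict label set are placeholders rather than the paper's setup.

```python
# Sketch: legal judgment prediction as sequence classification.
# Assumes the Hugging Face transformers library; labels are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["dismissed", "partially upheld", "upheld"]  # hypothetical verdicts

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

facts = "The applicant alleges a violation of Article 6 ..."
inputs = tokenizer(facts, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is randomly initialized here; fine-tuning on
# labeled rulings would precede any real prediction.
print(LABELS[logits.argmax(dim=-1).item()])
```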
This article presents a detailed analysis of the factors that determine the performance of disambiguation approaches derived from the method of Lesk (1986). Our study covers a series of experiments on Lesk's original method and on variants that we adapted to the characteristics of WORDNET. The implemented variants were evaluated on the SENSEVAL2 English All Words test corpus, as well as on excerpts from the SEMCOR corpus. Our evaluation is based, on the one hand, on precision and recall computed according to the SENSEVAL model and, on the other hand, on a taxonomy of answers that measures the risk taken by a decision procedure relative to a reference system.
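For illustration, here is the textbook simplified-Lesk baseline that such variants build on, assuming NLTK's WordNet interface: each sense is scored by the overlap between its gloss and the context.

```python
# Simplified Lesk: pick the sense whose gloss overlaps most with the context.
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)  # count shared words
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sense = simplified_lesk("bank", "I deposited money at the bank".split())
print(sense, "->", sense.definition() if sense else None)
```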
We propose a way to acquire rules for Open Information Extraction, based on lemma sequence patterns (including potential typographical symbols) linking two named entities in a sentence. Rule acquisition is data-driven and requires little supervision. Given an arbitrary relation, we identify, in a large corpus, pairs of entities that are linked by the relation and then gather, score and rank other phrases that link the same entity pairs. We experimented with 81 relations and acquired 20 extraction rules for each by mining ClueWeb12. We devised a semi-automatic evaluation protocol to measure recall and precision and found them to be at most 79.9% and 62.4% respectively. Verbal patterns are of better quality than non-verbal ones, although the latter achieve a maximum recall of 76.5%. The strategy proposed requires neither expensive nor time-consuming handcrafted resources, but does require a large amount of text.
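A toy sketch of the acquisition loop, assuming the corpus has been pre-processed into (entity, linking pattern, entity) triples and a seed set of pairs is available for the target relation; ranking patterns by how often they link seed pairs is a simplification of the paper's scoring.

```python
from collections import Counter

def acquire_rules(seed_pairs, corpus_triples, top_k=20):
    """seed_pairs: set of (e1, e2) known to hold for the target relation.
    corpus_triples: iterable of (e1, pattern, e2) mined from text."""
    support = Counter()
    for e1, pattern, e2 in corpus_triples:
        if (e1, e2) in seed_pairs:
            support[pattern] += 1  # pattern links a known pair
    # Rank candidate extraction rules by their support among seed pairs.
    return [p for p, _ in support.most_common(top_k)]

seeds = {("Mozart", "Salzburg"), ("Joyce", "Dublin")}
triples = [("Mozart", "was born in", "Salzburg"),
           ("Joyce", "was born in", "Dublin"),
           ("Mozart", "visited", "Vienna")]
print(acquire_rules(seeds, triples))  # ['was born in']
```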
Neural network approaches to Named-Entity Recognition reduce the need for carefully hand-crafted features. While some features do remain in state-of-the-art systems, lexical features have been mostly discarded, with the exception of gazetteers. In this work, we show that this is unfair: lexical features are actually quite useful. We propose to embed words and entity types into a low-dimensional vector space that we train from annotated data produced by distant supervision thanks to Wikipedia. From this, we compute — offline — a feature vector representing each word. When used with a vanilla recurrent neural network model, this representation yields substantial improvements. We establish a new state-of-the-art F1 score of 87.95 on ONTONOTES 5.0, while matching state-of-the-art performance with an F1 score of 91.73 on the over-studied CONLL-2003 dataset.
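A hedged sketch of the offline feature computation, assuming word and entity-type embeddings live in the same vector space as numpy arrays; representing a word by its similarity to each entity type follows the abstract, but the exact construction here is an assumption.

```python
import numpy as np

def lexical_feature_vector(word, word_vecs, type_vecs):
    """word_vecs: dict word -> np.ndarray; type_vecs: dict type -> np.ndarray.
    Returns one cosine similarity per entity type: the offline feature
    vector for this word (assumed construction)."""
    w = word_vecs[word]
    feats = []
    for t in sorted(type_vecs):  # fixed type order -> fixed dimensions
        v = type_vecs[t]
        cos = float(w @ v / (np.linalg.norm(w) * np.linalg.norm(v)))
        feats.append(cos)
    return np.array(feats)

rng = np.random.default_rng(0)
word_vecs = {"Montreal": rng.normal(size=50)}
type_vecs = {"CITY": rng.normal(size=50), "PERSON": rng.normal(size=50)}
print(lexical_feature_vector("Montreal", word_vecs, type_vecs))
```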
Analogical learning is a lazy learning mechanism which maps input forms (e.g. strings) to output ones, thanks to analogies identified in the training material. It has proven effective in a number of Natural Language Processing (NLP) tasks such as machine translation. One challenge with this approach is the identification of analogies in the training material. In this study, we revisit an offline algorithm that has been proposed for enumerating analogies in a corpus. We extend it in order to scale to larger datasets, as well as to deal with new forms. On a task of translating infrequent words of Wikipedia, we observe that our approach is much more efficient at identifying analogies than previously published methods, and that the resulting engine competes with a state-of-the-art phrase-based statistical machine translation (SMT) system.
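One standard building block, shown here as a sketch: a cheap necessary condition (a Lepage-style character-count test) for a formal analogy [a : b :: c : d] on strings, useful for filtering candidates before more expensive verification; it is not sufficient on its own.

```python
from collections import Counter

def may_be_analogy(a, b, c, d):
    """Necessary (not sufficient) condition for the formal analogy
    a : b :: c : d on strings: a and d together must use exactly the
    same multiset of characters as b and c together."""
    return Counter(a) + Counter(d) == Counter(b) + Counter(c)

print(may_be_analogy("walk", "walked", "talk", "talked"))  # True
print(may_be_analogy("walk", "walked", "talk", "spoke"))   # False
```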
Although good sentence aligners are freely available, our laboratory regularly receives requests from researchers and industry for aligning parallel data. This motivated us to release yet another open-source sentence aligner, one we wrote nearly 20 years ago. This aligner is simple but performs surprisingly well, often better than more elaborate ones, and does so very fast, making it possible to align very large corpora. We analyze the robustness of our aligner across different text genres and levels of noise. We also revisit the alignment procedure with which the Europarl corpus was prepared and show that better SMT performance can be obtained by simply using our aligner.
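The aligner's internals are not described here, so the sketch below shows a classic length-based dynamic program in the spirit of Gale and Church (1993), restricted to 1-1, 1-0 and 0-1 beads for brevity; the cost function and skip penalty are illustrative.

```python
import math

def align(src_lens, tgt_lens, skip_cost=6.0):
    """Toy length-based sentence alignment (1-1, 1-0, 0-1 beads only).
    src_lens/tgt_lens: sentence lengths in characters. Returns beads as
    (src_start, tgt_start, n_src, n_tgt)."""
    def match_cost(a, b):
        return abs(math.log((a + 1) / (b + 1)))  # penalize length mismatch
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # 1-1 bead
                c = cost[i-1][j-1] + match_cost(src_lens[i-1], tgt_lens[j-1])
                if c < cost[i][j]: cost[i][j], back[i][j] = c, (1, 1)
            if i > 0:            # 1-0 bead (source sentence unmatched)
                c = cost[i-1][j] + skip_cost
                if c < cost[i][j]: cost[i][j], back[i][j] = c, (1, 0)
            if j > 0:            # 0-1 bead (target sentence unmatched)
                c = cost[i][j-1] + skip_cost
                if c < cost[i][j]: cost[i][j], back[i][j] = c, (0, 1)
    beads, i, j = [], n, m
    while (i, j) != (0, 0):      # recover the best path
        di, dj = back[i][j]
        beads.append((i - di, j - dj, di, dj))
        i, j = i - di, j - dj
    return beads[::-1]

print(align([20, 35], [22, 33]))  # [(0, 0, 1, 1), (1, 1, 1, 1)]
```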
In this study we compare two machine translation devices on twelve medical-domain-specific machine translation tasks and two transliteration tasks, altogether involving twelve language pairs, including English-Chinese and English-Russian, which do not share the same scripts. We implemented an analogical device and compared its performance to the state-of-the-art phrase-based machine translation engine Moses. On most translation tasks, the analogical device outperforms the phrase-based one, and several combinations of both systems significantly outperform each system individually. For the sake of reproducibility, we share the datasets used in this study.
Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose an end-to-end deep neural network approach to detect translational equivalence between sentences in two different languages. In contrast to previous approaches, which typically rely on multiple models and various word alignment features, we leverage continuous vector representations of sentences and thereby remove the need for any domain-specific feature engineering. Using a siamese bidirectional recurrent neural network, we obtain a significant improvement over a strong baseline built on a state-of-the-art parallel sentence extraction system, both in the quality of the extracted parallel sentences and in the translation performance of statistical machine translation systems. We believe this study is the first to investigate deep learning for the parallel sentence extraction task.
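A compact sketch of a siamese bidirectional encoder, assuming PyTorch; the GRU cell, hidden sizes and the classifier head are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SiameseBiRNN(nn.Module):
    """Two sentences -> shared BiGRU encoder -> probability of being parallel."""
    def __init__(self, vocab_size, emb_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.clf = nn.Linear(4 * hidden, 1)  # [u; v] with u, v of size 2*hidden

    def encode(self, ids):
        _, h = self.rnn(self.emb(ids))          # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2*hidden)

    def forward(self, src_ids, tgt_ids):
        # Both sides share the same encoder weights (siamese setup).
        u, v = self.encode(src_ids), self.encode(tgt_ids)
        return torch.sigmoid(self.clf(torch.cat([u, v], dim=-1))).squeeze(-1)

model = SiameseBiRNN(vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))  # batch of 2 source sentences
tgt = torch.randint(0, 1000, (2, 9))  # batch of 2 candidate translations
print(model(src, tgt))  # parallelism probabilities in [0, 1]
```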
Proceedings of the Third Workshop on Discourse in Machine Translation, 2017
Implicit discourse connectives and relations are more widespread in Chinese texts; when translating into English, such connectives usually have to be made explicit. With Chinese-English MT in mind, this paper describes the cross-lingual annotation and alignment of discourse connectives in a parallel corpus, along with related surveys and findings. We then conduct evaluation experiments to assess the translation of implicit connectives and to determine whether explicitly representing implicit connectives in the source language can significantly improve final translation performance. Preliminary results show little improvement from simply inserting explicit connectives for implicit relations.
Lecture Notes in Computer Science, 2017
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, Jun 24, 2011
While several recent works dealing with large bilingual collections of texts, e.g. (Smith et al., 2010), seek to extract parallel sentences from comparable corpora, we present PARADOCS, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of controlled tasks. We applied it to the French-English cross-language linked article pairs of Wikipedia in order to see whether parallel articles are available in this resource, and whether our system is able to locate them. According to a manual evaluation we conducted, a quarter of the article pairs in Wikipedia are indeed translations of each other, and PARADOCS identifies parallel or noisy-parallel article pairs with a precision of 80%.
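For reference, a sketch of the kind of baseline cited, following Enright and Kondrak's (2007) idea of scoring document pairs by shared hapax words (words occurring exactly once in a document); tokenization and the greedy pairing loop are simplified.

```python
from collections import Counter

def hapax(doc_tokens):
    """Words occurring exactly once in this document."""
    counts = Counter(doc_tokens)
    return {w for w, c in counts.items() if c == 1}

def best_matches(src_docs, tgt_docs):
    """Pair each source document with the target document sharing
    the most hapax words (Enright & Kondrak-style baseline)."""
    pairs = []
    for si, s in enumerate(src_docs):
        hs = hapax(s)
        scores = [(len(hs & hapax(t)), ti) for ti, t in enumerate(tgt_docs)]
        score, ti = max(scores)
        pairs.append((si, ti, score))
    return pairs

src = [["le", "chat", "ADN", "Curie"], ["la", "table", "Einstein"]]
tgt = [["the", "cat", "ADN", "Curie"], ["the", "table", "Einstein"]]
print(best_matches(src, tgt))  # [(0, 0, 2), (1, 1, 2)]
```

The appeal of this baseline is that hapax words are often names, numbers or rare technical terms that survive translation unchanged, so no bilingual lexicon is needed.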
We present our participation in the second evaluation campaign of CESTA, an EVALDA project of the Technolangue program. The goal of this campaign was to test the ability of translation systems to adapt quickly to a specific task. We analyze the fragility of a statistical translation system trained on an out-of-domain corpus and list the experiments we carried out to adapt our system to the medical domain.