Houda Bouamor | Paris Sud XI University
Papers by Houda Bouamor
We present a novel system combination of machine translation and text summarization which provides high-quality summary translations superior to the baseline translation of the entire document. We first use supervised learning to build a classifier that predicts whether the translation of a sentence has high or low translation quality. This is a reference-free estimation of MT quality which helps us to distinguish the subset of sentences which have better translation quality. We pair this classifier with a state-of-the-art summarization system to build an MT-aware summarization system. To evaluate summarization quality, we build a test set by summarizing a bilingual corpus. We evaluate the performance of our system with respect to both MT and summarization quality and demonstrate that we can strike a balance between improving MT quality and maintaining decent summarization quality.
Advances in Natural Language Processing, Jan 1, 2010
In this article, the task of acquiring subsentential paraphrases is discussed and several automatic techniques are presented. We describe an evaluation methodology to compare these techniques and some of their combinations. This methodology is applied to two corpora of sentential paraphrases obtained by multiple translations. The conclusions that are drawn can be used to guide future work for improving existing techniques.
… of the 2009 IEEE/WIC/ACM …, Jan 1, 2009
Geographical gazetteers are necessary in a wide variety of applications. In the past, the construction of such gazetteers has been a tedious, manual process, and only recently have the first attempts to automate gazetteer creation been made. Here we describe our approach for mining accurate, large-scale multilingual geographic information by successively filtering information found in heterogeneous data sources (Flickr, Wikipedia, Panoramio, Web pages indexed by search engines). By statistically cross-checking information found in each site, we are able to identify new geographic objects and to indicate, for each one, its name, its GPS coordinates, its encompassing regions (city, region, country), the language of the name, its popularity, and the type of the object (church, bridge, etc.). We evaluate our approach by comparing, wherever possible, our multilingual gazetteer to other known attempts at automatically building a geographic database and to Geonames, a manually built gazetteer.
… of ACL, Short Papers session, Portland, …, Jan 1, 2011
In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.
Actes de TALN, Jan 1, 2010
Actes de TALN, session de …, Jan 1, 2009
In this article, we present a web application for acquiring sentential and subsentential paraphrases in the form of a game. The application allows the acquisition of both paraphrases and multiple human judgments on these paraphrases, which constitute particularly useful data for NLP applications based on paraphrastic phenomena.
ACL HLT 2011, Jan 1, 2011
We selected seven long papers and four short papers. Together, they tackle a diverse range of research questions: reflecting upon the scope of what might be generated in a text-to-text process, examining new generation methods, and addressing the ever-challenging issue of evaluation.
Notes et documents du …, Jan 1, 2011
Abstract: In this article, we analyse the modifications available in the French Wikipedia revision history. We define a typology of modifications based on a detailed study of WiCoPaCo, a freely available resource built by automatically mining Wikipedia's revision history. Based on this typology, we detail a manual annotation study of a subpart of the corpus aimed at assessing the difficulty of automatic paraphrase identification in such a corpus. Finally, we assess a rule-based paraphrase identification tool. Keywords: Wikipedia, revisions, paraphrase identification
Actes de RÉCITAL, Jan 1, 2010
Large-scale paraphrase corpora are important in many NLP applications. In this article, we present a method for obtaining a parallel corpus of paraphrases of French utterances. It aims to collect multiple translations proposed by French-speaking volunteer contributors from several European languages. We hypothesize that two translations submitted independently by two participants generally preserve the meaning of the original sentence, regardless of the language from which the translation is made. Analysis of the results allows us to discuss this hypothesis.