Gael Lejeune | Université Paris-Sorbonne (Paris IV)
Papers by Gael Lejeune
IE systems nowadays work very well, but they are mostly monolingual and difficult to port to other languages. We may therefore need to stop thinking only in terms of traditional pattern-based approaches. Our project, PULS, performs epidemic surveillance through the analysis of online news, in collaboration with MedISys, developed at the European Commission's Joint Research Centre (EC-JRC). PULS had only an English pattern-based system, and we carried out a pilot study on French to prepare a multilingual extension. We present here why we chose to set aside classical approaches in favour of a mainly language-independent approach based only on discourse properties of press article structure. Our results show a precision of 87% and a recall of 93%, and we have good reasons to think that this approach will also be effective for other languages.
This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can be used not only for information extraction, but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (ProMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) of the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language (DAnIEL) ...
Artificial intelligence in medicine, Jan 17, 2015
This paper presents a multilingual news surveillance system applied to tele-epidemiology. It has been shown that multilingual approaches improve timeliness in the detection of epidemic events across the globe, eliminating the wait for local news to be translated into major languages. We present here a system to extract epidemic events in potentially any language, provided a Wikipedia seed for common disease names exists. The Daniel system presented herein relies on properties that are common to news writing (the journalistic genre), the most useful being repetition and saliency. Wikipedia is used to screen common disease names to be matched with repeated character strings. Language variations, such as declensions, are handled by processing text at the character level rather than at the word level. This additionally makes it possible to handle various writing systems in a similar fashion. As no multilingual ground truth existed to evaluate the Daniel system, we built a multilingual cor...
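The character-level matching mentioned in this abstract can be illustrated with a minimal sketch. It is not the Daniel implementation: the function names, the `head_chars` window and the `min_ratio` threshold are all hypothetical choices made for illustration. The idea shown is only that a seed disease name is accepted when a long enough piece of it occurs both at the start of an article and again later, which tolerates inflected forms without any word-level analysis.

```python
def longest_shared_substring(term: str, text: str) -> str:
    """Longest substring of `term` that also occurs in `text` (naive search)."""
    best = ""
    for i in range(len(term)):
        # Only try candidates strictly longer than the current best.
        for j in range(i + len(best) + 1, len(term) + 1):
            if term[i:j] in text:
                best = term[i:j]
            else:
                break  # a longer candidate from the same start cannot match
    return best


def detect_diseases(article: str, seed_terms: list[str],
                    head_chars: int = 200, min_ratio: float = 0.75) -> list[str]:
    """Return seed terms whose (possibly inflected) form is repeated.

    Assumption: a term counts as detected when a substring covering at least
    `min_ratio` of it appears in the first `head_chars` characters of the
    article and is repeated in the remainder.
    """
    head = article[:head_chars].lower()
    rest = article[head_chars:].lower()
    found = []
    for term in seed_terms:
        t = term.lower()
        shared = longest_shared_substring(t, head)
        if len(shared) / len(t) >= min_ratio and shared in rest:
            found.append(term)
    return found


# Toy Polish example: the seed "grypa" (flu) appears only as the inflected "grypy".
article = ("Nowe przypadki grypy w Polsce. "
           "Lekarze ostrzegają, że sezon grypy potrwa do marca.")
detected = detect_diseases(article, ["grypa", "cholera"], head_chars=30)
```

In this toy example the shared string "gryp" covers four of the five characters of the seed, so the term is kept even though the exact word form never appears in the text.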
Lecture Notes in Computer Science, 2013
The early detection of disease outbreaks is an important objective of epidemic surveillance. Web news is one of the information sources for detecting epidemic events as soon as possible, but analyzing the tens of thousands of articles published daily is costly. Recently, automatic systems have been devoted to epidemiological surveillance. The main issue for these systems is to process more languages at a limited cost. However, existing systems mainly process major languages (English, French, Russian, Spanish...). Thus, when the first news report of a disease is in a minor language, the timeliness of event detection suffers. In this paper, we test an automatic style-based method designed to fill the gaps of existing automatic systems. It is parsimonious in resources and specially designed for multilingual issues. The events detected by the human-moderated ProMED mail between November 2011 and January 2012 are used as a reference dataset and compared to events detected in 17 languages by the system DAnIEL2 from web articles of the same time window. We show how being able to process press articles in less widely spoken languages allows quicker detection of epidemic events in some regions of the world.
2013 IEEE International Conference on Healthcare Informatics, 2013
In this paper, we introduce a multilingual epidemiological news surveillance system. Its main contribution is its ability to extract epidemic events in any language, hence succeeding where state-of-the-art surveillance systems usually fail: the objective of reactivity. Most systems indeed focus on a selected list of languages deemed important. However, evidence shows that events are first described in the local language, and translated into other languages later, if and only if they contain important information. Hence, while systems handling only a sample of human languages may indeed succeed at extracting epidemic events, they will only do so after someone else has detected the importance of the news and made the decision to translate it. Thus, for events first described in other languages, such automated systems, which can only detect events already detected by humans, are essentially irrelevant for early detection. To overcome this weakness of the state of the art in terms of reactivity, we designed a system that can detect epidemiological events in any language, without requiring any translation, be it automated or human-written. The solution presented in this paper relies on properties that may be called language universals. First, we observe and exploit properties of the news genre that remain unchanged whatever the writing language. Second, we handle language variations, such as declensions, by processing text at the character level rather than at the word level. This additionally makes it possible to handle various writing systems in a similar fashion. We present experiments with 5 languages, stereotypical of different language families and writing systems: English, Chinese, Greek, Polish and Russian.
Our system, DAnIEL, achieves an average F-measure of around 85%, slightly below top-performing systems for the languages that such systems are able to handle. However, its performance is superior for morphologically rich languages. And it of course performs infinitely better for the languages that other systems are not able to handle: the richest state-of-the-art system handles around 10 languages, while there exist about 6,000 languages in the world, 300 of which are spoken by more than one million people. The DAnIEL system is able to process each of them.
Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2010
†GREYC, University of Caen (firstname.lastname@info.unicaen.fr); ‡Doremi, University of Helsinki (firstname.lastname@cs.helsinki.fi). Summary. In epidemic surveillance, monitoring numerous languages is a great issue. In this paper we will present a system designed to work on French, Spanish and English. The originality of our system is that we use only a few resources to do our information extraction tasks. Instead of using ontologies, we use structure patterns of newspaper articles. The results on these three languages are very good at this stage, and we will present a few examples of interesting experiments in other languages.
Advances in Natural Language Processing, 2012
This study aims at developing a news surveillance system able to address multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on epidemic surveillance. This task necessitates worldwide coverage of news in order to detect new events as quickly as possible, anywhere, whatever the language in which they are first reported. In this study, text genre is used rather than sentence analysis. The news-genre properties allow us to assess the thematic relevance of news, filtered with the help of a specialised lexicon that is automatically collected from Wikipedia. Afterwards, a more detailed analysis of text-specific properties is applied to relevant documents to better characterize the epidemic event (i.e., which disease spreads where?). Results from 400 documents in each language demonstrate the interest of this multilingual approach with light resources. DAnIEL achieves an F1-measure of around 85%. Two issues are addressed: the first is morphologically rich languages, e.g. Greek, Polish and Russian as compared to English. The second is event location detection as related to disease detection. This system provides a reliable alternative to the generic IE architecture, which is constrained by the lack of numerous components in many languages.
AMICT volume 11, May 10, 2010
IE systems nowadays work very well, but they are mostly monolingual and difficult to port to other languages. We may therefore need to stop thinking only in terms of traditional pattern-based approaches. Our project, PULS, performs epidemic surveillance through the analysis of online news, in collaboration with MedISys, developed at the European Commission's Joint Research Centre (EC-JRC). PULS had only an English pattern-based system, and we carried out a pilot study on French to prepare a multilingual extension. We ...
Actes du septième DÉfi Fouille de Textes, Jul 1, 2011
Abstract. We present here an experiment carried out for the second task of the 2011 text mining challenge (DEFT): pairing abstracts with scientific articles in French. We based our work on an approach relying on the distribution of character strings, so as to build a simple system in line with an endogenous and multilingual design. Our method obtained very good results on track 1, "full articles" (100%), but was less effective on track 2 ...
Actes de l’atelier de clôture du huitième défi fouille de texte (DEFT), Jun 8, 2012
Abstract. We present in this article the methods used by the HULTECH team for its participation in the Défi Fouille de Textes 2012 (DEFT 2012). The task in this edition of the challenge is to recover, from scientific articles, the keywords chosen by their authors. We rely on the detection of maximal repeated strings (rstrmax), at the character level and at the word level. The method developed is simple and unsupervised. It allowed our system to reach 3rd place (out of 10 teams) on the first track of the ...
Proceedings of the 4th …, Aug 28, 2010
Processing content for security is becoming more and more important, since every local danger can have global consequences. Being able to collect and analyse information in different languages is a great issue. This paper addresses multilingual solutions for the analysis of press articles for epidemiological surveillance. The system described here relies on pragmatics and stylistics, giving up the "bag of sentences" approach in favour of discourse repetition patterns. It needs only light resources (compared to existing systems) in order to process new languages easily. We present results in English, French and Chinese, three languages with quite different characteristics. These results show that simple rules allow the selection of relevant documents in a specialized database, improving the reliability of information extraction.
Proceedings of the 28th International Conference on Computational Linguistics, 2020
In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across different languages (low- or high-resource), more particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of different machine and deep learning text classification models using a dataset comprising news articles related to epidemic outbreaks from six languages, four low-resourced and two high-resourced, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of the models based on fine-tuned language models exceeds by more than 50% the chosen baseline models, which include a specialized epidemiological news surveillance system and several machine learning models. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained and/or fine-tuned but also by their size. Furthermore, we discover that the beginning and the end of documents provide the most salient features for this task and, as expected, the performance of the models was proportionate to the training data size.
DAnIEL is a multilingual epidemic surveillance system. DAnIEL makes it possible to process a large number of languages at low cost thanks to a resource-parsimonious approach.
This article tackles the Authorship Attribution task with regard to the language independence issue. We propose an alternative to variable-length character n-gram features in supervised methods: maximal repeats in strings. Whereas character n-grams are by essence redundant, maximal repeats are a condensed way to represent any substring of a corpus. Our experiments show that the redundant aspect of character n-grams contributes to the efficiency of character-based Authorship Attribution techniques. Therefore, we introduce a new way to weight features in vector-based classifiers by introducing n-th order maximal repeats (maximal repeats detected in a set of maximal repeats). The experimental results show higher performance with maximal repeats, with less data than the n-gram-based approach.
In this article, we tackle the problem of evaluating web page cleaning tools. This task is seldom studied in the literature, although it has consequences for the linguistic processing performed on web-based corpora. We propose two types of evaluation: (I) an intrinsic (content-based) evaluation with measures on words, tags and characters; (II) an extrinsic (task-based) evaluation on the same corpus, studying the effects of the cleaning step on the performance of an NLP pipeline. We show that the results of the two evaluations are not consistent. We also show that there are important differences in the results between the studied languages. We conclude that the choice of a web page cleaning tool should be made in view of the intended task rather than on the performance of the tools in an intrinsic evaluation. Keywords: web page cleaning, corpus collection, intrinsic evaluation, extrinsic evaluation, boilerplate removal.
Research in the field of Word Sense Disambiguation focuses on identifying the precise meaning of a lexical unit found in a text. This article tackles another kind of problem: assessing the ambiguity of a lexical unit. In other words, we try to identify whether a particular unit is ambiguous or not; we define this task as ambiguity diagnosis. Our evaluation dataset contains scientific articles in which ambiguous words have been tagged by experts. In order to give an ambiguity diagnosis for each term, we use two types of features: POS tags and positions in the text. We show that the position of an occurrence in the text is a strong hint for such a task. Keywords: ambiguity diagnosis, keyword extraction, terminology.
This archive contains the documents in HTML format as well as annotations in JSON format. The corpus contains 2089 documents in 5 languages (Chinese, English, Greek, Polish and Russian). Each document has been manually cleaned in order to keep the text and the paragraph marks. Each file is encoded in UTF-8. This corpus has been annotated by native speakers not involved in DAnIEL's development. The guidelines given to our annotators can be found here: https://daniel.greyc.fr/guidelines.pdf
This article tackles the Authorship Attribution task with regard to the language independence issue. We propose an alternative to variable-length character n-gram features in supervised methods: maximal repeats in strings. Whereas character n-grams are by essence redundant, maximal repeats are a condensed way to represent any substring of a corpus. Our experiments show that the redundant aspect of character n-grams contributes to the efficiency of character-based Authorship Attribution techniques. Therefore, we introduce a new way to weight features in vector-based classifiers by introducing n-th order maximal repeats (maximal repeats detected in a set of maximal repeats). The experimental results show higher performance with maximal repeats, with less data than the n-gram-based approach. Source code and an algorithm for detecting maximal repeats are provided as well.
Abstract. This article addresses a central question in automatic alignment: diagnosing the parallelism of the documents to be aligned. Research on the topic has so far concentrated on the analysis of documents that are parallel by nature: corpora of regulatory texts, technical documents, or isolated sentences. Inversions and deletions/additions that may exist between the different versions of a document are thus often ignored. We therefore propose a method for diagnosing, in context, parallel zones within documents. This method makes it possible to detect inversions or deletions between the documents to be aligned. It relies on freeing the analysis from the notions of word and sentence, and on taking into account the material layout of the text (Mise en Forme Matérielle, MFM). Its implementation is based on similarities in the distribution of repeated character strings across the different documents. These distributions ...
We present here our work on the second task of DEFT 2011: pairing scientific articles with their abstracts. Our approach is based on the distribution of character strings. Our aim was not only to be efficient on that particular task in French but also to build a system that can easily be used for other languages. Our method achieved very good results on track 1, "full articles" (100%), but had more problems with track 2, where the introduction and conclusion were removed (96%). Keywords: maximal repeated character strings, endogenous method, multilingual approach, differential linguistics, stringology
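To make the notion of a maximal repeat concrete, here is a minimal sketch. It is a naive quadratic enumeration written for illustration only, not the source code mentioned in the abstract, and the function name `maximal_repeats` is an assumption. A substring is treated as a maximal repeat when it occurs at least twice and its occurrences differ in both the preceding and the following character, so that it cannot be extended on either side without losing an occurrence. Text boundaries are treated as distinct context symbols.

```python
from collections import defaultdict


def maximal_repeats(text: str, min_len: int = 1) -> dict[str, int]:
    """Map each maximal repeat of `text` to its number of occurrences.

    Naive enumeration of all substrings; suitable for short strings only.
    Efficient implementations typically rely on suffix arrays or suffix trees.
    """
    n = len(text)
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            starts_of[text[i:j]].append(i)

    repeats = {}
    for sub, starts in starts_of.items():
        if len(starts) < 2:
            continue  # not repeated
        # Characters just before and just after each occurrence (None at a boundary).
        left = {text[s - 1] if s > 0 else None for s in starts}
        right = {text[s + len(sub)] if s + len(sub) < n else None for s in starts}
        # Maximal: the contexts differ on both sides, so no extension is possible.
        if len(left) > 1 and len(right) > 1:
            repeats[sub] = len(starts)
    return repeats


example = maximal_repeats("xabcyabcz")
```

For the string "xabcyabcz", the only maximal repeat is "abc": shorter repeats such as "ab" or "bc" always occur inside it and can be extended, which is the sense in which maximal repeats condense the redundancy of character n-grams.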
IE systems nowadays work very well, but they are mostly monolingual and difficult to convert to o... more IE systems nowadays work very well, but they are mostly monolingual and difficult to convert to other languages. We maybe have then to stop thinking only with traditional pattern-based approaches. Our project, PULS, makes epidemic surveillance through analysis of On-Line News in collaboration with MedISys, developed at the European Commission's Joint Research Centre (EC-JRC). PULS had only an English pattern-based system and we worked on a pilot study on French to prepare a multilingual extension. We will present here why we chose to ignore classical approaches and how we can use it with a mainly language-independent based only on discourse properties of press articlestructure. Our results show a precision of 87% and a recall of 93%. And we have good reasons to think that this approach will also be efficient for other languages.
This paper proposes a corpus for the development and evaluation of tools and techniques for ident... more This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can not only be used for information extraction, but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (ProMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) to the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language(DAnIEL) ...
Artificial intelligence in medicine, Jan 17, 2015
This paper presents a multilingual news surveillance system applied to tele-epidemiology. It has ... more This paper presents a multilingual news surveillance system applied to tele-epidemiology. It has been shown that multilingual approaches improve timeliness in detection of epidemic events across the globe, eliminating the wait for local news to be translated into major languages. We present here a system to extract epidemic events in potentially any language, provided a Wikipedia seed for common disease names exists. The Daniel system presented herein relies on properties that are common to news writing (the journalistic genre), the most useful being repetition and saliency. Wikipedia is used to screen common disease names to be matched with repeated characters strings. Language variations, such as declensions, are handled by processing text at the character-level, rather than at the word level. This additionally makes it possible to handle various writing systems in a similar fashion. As no multilingual ground truth existed to evaluate the Daniel system, we built a multilingual cor...
Lecture Notes in Computer Science, 2013
The early detection of disease outbursts is an important objective of epidemic surveillance. The ... more The early detection of disease outbursts is an important objective of epidemic surveillance. The web news are one of the information bases for detecting epidemic events as soon as possible, but to analyze tens of thousands articles published daily is costly. Recently, automatic systems have been devoted to epidemiological surveillance. The main issue for these systems is to process more languages at a limited cost. However, existing systems mainly process major languages (English, French, Russian, Spanish.. .). Thus, when the first news reporting a disease is in a minor language, the timeliness of event detection is worsened. In this paper, we test an automatic style-based method, designed to fill the gaps of existing automatic systems. It is parsimonious in resources and specially designed for multilingual issues. The events detected by the human-moderated ProMED mail between November 2011 and January 2012 are used as a reference dataset and compared to events detected in 17 languages by the system DAnIEL2 from web articles of this timewindow. We show how being able to process press articles in languages less-spoken allows quicker detection of epidemic events in some regions of the world.
2013 IEEE International Conference on Healthcare Informatics, 2013
In this paper, we introduce a multilingual epidemiological news surveillance system. Its main con... more In this paper, we introduce a multilingual epidemiological news surveillance system. Its main contribution is its ability to extract epidemic events in any language, hence succeeding where state-of-the-art in surveillance systems usually fails : the objective of reactivity. Most systems indeed focus on a selected list of languages, deemed important. However, evidence shows that events are first described in the local language, and translated to other languages later, if and only if they contained important information. Hence, while systems handling only a sample of human languages may indeed succeed at extracting epidemic events, they will only do so after someone else detected the importance of the news, and made the decision to translate it. Thus, with events first described in other languages, such automated systems, that may only detect events that were already detected by humans, are essentially irrelevant for early detection. To overcome this weakness of the state-of-the-art in terms of reactivity, we designed a system that can detect epidemiological events in any language, without requiring any translation, be it automated or human-written. The solution presented in this paper relies on properties that may be called language universals. First, we observe and exploit properties of the news genre that remain unchanged, whatever the writing language. Second, we handle language variations, such as declensions, by processing text at the character-level, rather than at the word level. This additionally allows to handle various writing systems in a similar fashion. We present experiments with 5 languages, steoreotypical of different language families and writing systems : English, Chinese, Greek, Polish and Russian. 
Our system, DAnIEL, achieves an average F-measure score around 85%, slightly below top-performing systems for the languages that such systems are able to handle. However, its performance is superior for morphologically-rich languages. And it performs of course infinitely better for the languages that other systems are not able to handle : The richest system in the state-of-the-art handles around 10 languages, while there exists about 6,000 languages in the world, 300 of which are spoken by more than one million people. The DAnIEL system is able to process each of them.
Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2010
†qigD niversity of gen (rstnmeFlstnmedinfoFunienFfr ‡horemiD niversity of relsinki (rstnmeFlstnme... more †qigD niversity of gen (rstnmeFlstnmedinfoFunienFfr ‡horemiD niversity of relsinki (rstnmeFlstnmedsFhelsinkiF(ummryF sn epidemi surveillneD monitoring numerous lnguges is gret issueF in this pper we will present system designed to work on prenhD pnish nd inglishF he originlity of our system is tht we use only few resoures to do our informtion extrtion tsksF snsted of using ontologiesD we use struture ptterns of newsppers rtilesF he results on these three lnguges re very good t this stge nd we will present few exmples of interesting experiments in other lngugesF
Advances in Natural Language Processing, 2012
This study aims at developing a news surveillance system able to address multilingual web corpora... more This study aims at developing a news surveillance system able to address multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on Epidemic Surveillance. This task necessitates worldwide coverage of news in order to detect new events as quickly as possible, anywhere, whatever the language it is rst reported in. In this study, text-genre is used rather than sentence analysis. The news-genre properties allow us to assess the thematic relevance of news, ltered with the help of a specialised lexicon that is automatically collected on Wikipedia. Afterwards, a more detailed analysis of text specic properties is applied to relevant documents to better characterize the epidemic event (i.e., which disease spreads where?). Results from 400 documents in each language demonstrate the interest of this multilingual approach with light resources. DAnIEL achieves an F1-measure score around 85%. Two issues are addressed: the rst is morphology rich languages, e.g. Greek, Polish and Russian as compared to English. The second is event location detection as related to disease detection. This system provides a reliable alternative to the generic IE architecture that is constrained by the lack of numerous components in many languages.
AMICT volume 11, May 10, 2010
IE systems nowadays work very well, but they are mostly monolingual and difficult to convert to o... more IE systems nowadays work very well, but they are mostly monolingual and difficult to convert to other languages. We maybe have then to stop thinking only with traditional pattern-based approaches. Our project, PULS, makes epidemic surveillance through analysis of On-Line News in collaboration with MedISys, developed at the European Commission's Joint Research Centre (EC-JRC). PULS had only an English pattern-based system and we worked on a pilot study on French to prepare a multilingual extension. We ...
Actes du septième DÉfi Fouille de Textes, Jul 1, 2011
Résumé Nous présentons ici une expérimentation dans le cadre de la seconde tâche du défi fouille ... more Résumé Nous présentons ici une expérimentation dans le cadre de la seconde tâche du défi fouille de textes (DEFT) 2011: appariement de résumés et d'articles scientifiques en français. Nous avons fondé nos travaux sur une approche à base de distribution de chaînes de caractères de manière à construire un système simple et correspondant à une conception endogène et multilingue des systèmes. Notre méthode a obtenu de très bons résultats pour la piste 1" articles complets"(100%) mais a été moins efficace sur la piste 2" ...
Actes de l’atelier de clôture du huitième défi fouille de texte (DEFT), Jun 8, 2012
RÉSUMÉ Nous présentons dans cet article les méthodes utilisées par l'équipe HULTECH pour sa ... more RÉSUMÉ Nous présentons dans cet article les méthodes utilisées par l'équipe HULTECH pour sa participation au Défi Fouille de Textes 2012 (Deft 2012). La tâche de cette édition du défi consiste à retrouver dans des articles scientifiques, les mots-clés choisis par les auteurs. Nous nous appuyons sur la détection de chaînes répétées maximales (rstrmax), au grain caractère et au grain mot. La méthode développée est simple et non supervisée. Elle a permis à notre système d'atteindre la 3e place (sur 10 équipes) sur la première piste du ...
Proceedings of the 4th …, Aug 28, 2010
Processing content for security becomes more and more important since every local danger can have... more Processing content for security becomes more and more important since every local danger can have global consequences. Being able to collect and analyse information in different languages is a great issue. This paper addresses multilingual solutions for analysis of press articles for epidemiological surveillance. The system described here relies on pragmatics and stylistics, giving up "bag of sentences" approach in favour of discourse repetition patterns. It only needs light resources (compared to existing systems) in order to process new languages easily. In this paper we present here results in English, French and Chinese, three languages with quite different characteristics. These results show that simple rules allow selection of relevant documents in a specialized database improving the reliability of information extraction.
Proceedings of the 28th International Conference on Computational Linguistics, 2020
In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across languages (low- or high-resource), particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of different machine learning and deep learning text classification models using a dataset of news articles related to epidemic outbreaks in six languages, four low-resource and two high-resource, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of the models based on fine-tuned language models exceeds by more than 50% that of the chosen baseline models, which include a specialized epidemiological news surveillance system and several machine learning models. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained and/or fine-tuned but also by their size. Furthermore, we find that the beginning and the end of documents provide the most salient features for this task and, as expected, that the performance of the models is proportionate to the training data size.
DAnIEL is a multilingual epidemic surveillance system. It can process a large number of languages at low cost thanks to a resource-light approach.
This article tackles the Authorship Attribution task from the standpoint of language independence. We propose an alternative to variable-length character n-gram features in supervised methods: maximal repeats in strings. Whereas character n-grams are inherently redundant, maximal repeats are a condensed way to represent any substring of a corpus. Our experiments show that the redundancy of character n-grams contributes to the efficiency of character-based Authorship Attribution techniques. Therefore, we introduce a new way to weight features in vector-based classifiers by introducing n-th order maximal repeats (maximal repeats detected in a set of maximal repeats). The experimental results show higher performance with maximal repeats, with less data than the n-gram-based approach.
In this article, we tackle the problem of evaluating web page cleaning tools. This task is seldom studied in the literature, although it has consequences for the linguistic processing performed on web-based corpora. We propose two types of evaluation: (I) an intrinsic (content-based) evaluation with measures on words, tags and characters; (II) an extrinsic (task-based) evaluation on the same corpus, studying the effects of the cleaning step on the performance of an NLP pipeline. We show that the results of the two evaluations are not consistent. We also show that there are important differences in the results between the studied languages. We conclude that the choice of a web page cleaning tool should be made in view of the target task rather than on the performance of the tools in an intrinsic evaluation. Keywords: web page cleaning, corpus collection, intrinsic evaluation, extrinsic evaluation, boilerplate removal.
Research in the field of Word Sense Disambiguation focuses on identifying the precise meaning of a lexical unit found in a text. This article tackles another kind of problem: assessing the ambiguity of a lexical unit. In other words, we try to identify whether a particular unit is ambiguous or not; we define this task as ambiguity diagnosis. Our evaluation dataset contains scientific articles in which ambiguous words have been tagged by experts. In order to give an ambiguity diagnosis for each term, we use two types of features: POS tags and positions in the text. We show that the position of an occurrence in the text is a strong hint for such a task. Keywords: ambiguity diagnosis, keyword extraction, terminology.
This archive contains the documents in HTML format as well as annotations in JSON format. The corpus contains 2089 documents in 5 languages (Chinese, English, Greek, Polish and Russian). Each document has been manually cleaned in order to keep the text and the paragraph marks. Each file is encoded in UTF-8. This corpus has been annotated by native speakers not involved in DAnIEL's development. The guidelines given to our annotators can be found here: https://daniel.greyc.fr/guidelines.pdf
This article tackles the Authorship Attribution task from the standpoint of language independence. We propose an alternative to variable-length character n-gram features in supervised methods: maximal repeats in strings. Whereas character n-grams are inherently redundant, maximal repeats are a condensed way to represent any substring of a corpus. Our experiments show that the redundancy of character n-grams contributes to the efficiency of character-based Authorship Attribution techniques. Therefore, we introduce a new way to weight features in vector-based classifiers by introducing n-th order maximal repeats (maximal repeats detected in a set of maximal repeats). The experimental results show higher performance with maximal repeats, with less data than the n-gram-based approach. Source code and an algorithm for detecting maximal repeats are provided as well.
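To make the notion of a maximal repeat concrete, here is a minimal Python sketch. It is not the source code mentioned in the abstract: it is a naive quadratic enumeration written for illustration, and the function name `maximal_repeats` and the `min_len` parameter are assumptions. A substring is kept when it occurs at least twice and cannot be extended by one character to the left or to the right without losing an occurrence.

```python
from collections import defaultdict


def maximal_repeats(text, min_len=1):
    """Return {substring: occurrence count} for each maximal repeat in text.

    Naive illustration: every substring is enumerated, so time and memory
    grow quadratically. Efficient detection would use a suffix array or
    suffix tree instead.
    """
    n = len(text)
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            starts_of[text[i:j]].append(i)

    result = {}
    for sub, starts in starts_of.items():
        if len(starts) < 2:
            continue  # not a repeat
        # Characters just before and after each occurrence. A text boundary
        # is treated as a context unique to that occurrence.
        lefts = {text[s - 1] if s > 0 else ("^", s) for s in starts}
        rights = {
            text[s + len(sub)] if s + len(sub) < n else ("$", s)
            for s in starts
        }
        # Maximal: occurrences differ in both their left and right contexts.
        if len(lefts) > 1 and len(rights) > 1:
            result[sub] = len(starts)
    return result


print(maximal_repeats("banana"))
```

On "banana" the only maximal repeats are "a" (three occurrences) and "ana" (two); "an" and "na" are repeated as well, but every occurrence of either can be extended into "ana" without losing a match.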
ABSTRACT: This article addresses a central question in automatic alignment: diagnosing the parallelism of the documents to be aligned. Research on the subject has so far concentrated on documents that are parallel by nature: corpora of regulatory texts, technical documents, or isolated sentences. Inversions and deletions/additions that may exist between the different versions of a document are therefore often ignored. We propose a method for diagnosing, in context, parallel zones inside documents. This method makes it possible to detect inversions or deletions between the documents to be aligned. It relies on dispensing with the notions of word and sentence, and on taking into account the physical layout of the text (Mise en Forme Matérielle, MFM). Its implementation is based on similarities in the distribution of repeated character strings across the different documents. These distributions...
We present here our work on the second task of 2011's Deft: pairing scientific articles with their abstracts. Our approach is based on the distribution of character strings. Our aim was not only to be efficient on that particular task in French but also to build a system that can easily be used for other languages. Our method achieved very good results on track 1, "full articles" (100%), but had more problems with track 2, where the introduction and conclusion were removed (96%). Keywords: Maximal repeated character strings, endogenous method, multilingual approach, differential linguistics, stringology
OBJECTIVE:
This paper presents a multilingual news surveillance system applied to tele-epidemiology. It has been shown that multilingual approaches improve timeliness in detection of epidemic events across the globe, eliminating the wait for local news to be translated into major languages. We present here a system to extract epidemic events in potentially any language, provided a Wikipedia seed for common disease names exists.
METHODS:
The Daniel system presented herein relies on properties that are common to news writing (the journalistic genre), the most useful being repetition and saliency. Wikipedia is used to screen common disease names to be matched with repeated character strings. Language variations, such as declensions, are handled by processing text at the character level rather than at the word level. This additionally makes it possible to handle various writing systems in a similar fashion.
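The character-level matching described above can be illustrated with a short Python sketch. This is not the Daniel implementation: the function names, the `min_ratio` threshold of 0.8 and the Polish example text are assumptions made for illustration, and the sketch ignores saliency (where in the article a string appears). It only shows how requiring a large part of a disease name to be repeated in the text tolerates declensions without any word-level analysis.

```python
def longest_repeated_match(name, text, min_count=2):
    """Longest substring of `name` found at least `min_count` times in `text`."""
    name, text = name.lower(), text.lower()
    # Try the longest candidates first so the first hit is the longest match.
    for length in range(len(name), 0, -1):
        for start in range(len(name) - length + 1):
            candidate = name[start:start + length]
            if text.count(candidate) >= min_count:
                return candidate
    return ""


def detect_diseases(text, disease_names, min_ratio=0.8):
    """Map each matched disease name to the repeated string that matched it.

    `min_ratio` (hypothetical) is the share of the name's characters that
    the repeated string must cover.
    """
    hits = {}
    for name in disease_names:
        if not name:
            continue  # skip empty entries to avoid dividing by zero
        match = longest_repeated_match(name, text)
        if len(match) / len(name) >= min_ratio:
            hits[name] = match
    return hits


# Polish example: "grypa" (flu) appears only in inflected forms.
article = (
    "Epidemia grypy w Warszawie. Lekarze ostrzegają przed grypą; "
    "na grypę choruje coraz więcej osób."
)
print(detect_diseases(article, ["grypa", "odra"]))
```

Here the dictionary form "grypa" never occurs in the text, yet the stem "gryp" is repeated three times and covers four of the name's five characters, so the name is matched; "odra" (measles) is not.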
MATERIAL:
As no multilingual ground truth existed to evaluate the Daniel system, we built a multilingual corpus from the Web, and collected annotations from native speakers of Chinese, English, Greek, Polish and Russian, with no connection to or interest in the Daniel system. This data set is freely available online, and can be used for the evaluation of other event extraction systems.
RESULTS:
Experiments for 5 languages out of 17 tested are detailed in this paper: Chinese, English, Greek, Polish and Russian. The Daniel system achieves an average F-measure of 82% in these 5 languages. It reaches 87% on BEcorpus, the state-of-the-art corpus in English, slightly below top-performing systems, which are tailored with numerous language-specific resources. The consistent performance of Daniel on multiple languages is an important contribution to the reactivity and the coverage of epidemiological event detection systems.
CONCLUSIONS:
Most event extraction systems rely on extensive resources that are language-specific. While their sophistication yields excellent results (over 90% precision and recall), it restricts their coverage in terms of languages and geographic areas. In contrast, in order to detect epidemic events in any language, the Daniel system only requires a list of a few hundred disease names and locations, which can be acquired automatically. The system can perform consistently well on any language, with precision and recall around 82% on average, according to this paper's evaluation. Daniel's character-based approach is especially interesting for morphologically rich and low-resource languages. Because it has few resources to exploit and uses state-of-the-art string matching algorithms, Daniel can process thousands of documents per minute on a simple laptop. In the context of epidemic surveillance, reactivity and geographic coverage are of primary importance, since no one knows where the next event will strike, and therefore in what vernacular language it will first be reported. By being able to process any language, the Daniel system offers unique coverage for poorly endowed languages, and can complement state-of-the-art techniques for major languages.