Mikaela Keller - Academia.edu
Papers by Mikaela Keller
HAL (Le Centre pour la Communication Scientifique Directe), 2021
Faced with the limits of representative democracy, public digital participatory consultations make it possible to solicit, at different levels of government, contributions from citizens in an effort to better involve individuals in political decisions. Their design and deployment raise well-known problems, such as bias in the questions or the representativeness of the participating population. In this article, we consider the new problems raised by the use of artificial intelligence methods for the automatic analysis of natural-language contributions. Such an analysis is a difficult problem for which many methods exist, relying on varied assumptions and models. Taking as a case study the contributions to the open-ended questions of the Grand Débat National, we show that it is impossible to reproduce the results of the official analysis commissioned by the government. Moreover, we identify undisclosed arbitrary choices in the official analysis that cast doubt on some of its results. We also show that different methods can lead to different conclusions. Our study thus highlights the need for greater transparency in the automatic analysis of open consultations, to ensure their reproducibility and public trust in their reporting. We conclude with avenues for improving participatory consultations and their analysis so that they can encourage participation and serve as useful tools for public debate.
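A minimal sketch of the method-dependence problem the article raises, on an invented toy corpus (an illustration, not the Grand Débat pipeline or the official analysis): two standard unsupervised methods, LDA over word counts and k-means over TF-IDF vectors, partition the same open-ended answers, and the adjusted Rand index quantifies how much their conclusions agree.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Invented stand-ins for open-ended consultation answers.
answers = [
    "lower taxes on fuel and housing",
    "more investment in public hospitals",
    "reduce taxes for low income households",
    "better access to doctors in rural areas",
    "cut the fuel tax increase",
    "fund emergency rooms and nursing staff",
]

# Method 1: LDA topics over raw counts; each answer gets its top topic.
counts = CountVectorizer().fit_transform(answers)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
lda_labels = lda.transform(counts).argmax(axis=1)

# Method 2: k-means clusters over TF-IDF vectors of the same answers.
tfidf = TfidfVectorizer().fit_transform(answers)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# Agreement between the two partitions: 1.0 = identical conclusions,
# near 0 = essentially unrelated groupings of the same contributions.
print("adjusted Rand index:", adjusted_rand_score(lda_labels, km_labels))
```

Running such a comparison on the real contributions is one concrete way to test whether a reported thematic breakdown is robust to the choice of method.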
BMC Bioinformatics, Nov 24, 2009
Background: Automated surveillance of the Internet provides a timely and sensitive method for alerting on global emerging infectious disease threats. HealthMap is part of a new generation of online systems designed to monitor and visualize, on a real-time basis, disease outbreak alerts as reported by online news media and public health sources. HealthMap is of specific interest for national and international public health organizations and international travelers. A particular task that makes such surveillance useful is the automated discovery of the geographic references contained in the retrieved outbreak alerts. This task is sometimes referred to as "geo-parsing". A typical approach to geo-parsing would demand an expensive training corpus of alerts manually tagged by a human. Results: Given that human readers perform this kind of task by using both their lexical and contextual knowledge, we developed an approach which relies on a relatively small expert-built gazetteer, thus limiting the need for human input, but which focuses on learning the context in which geographic references appear. We show, in a set of experiments, that this approach exhibits a substantial capacity to discover geographic locations outside of its initial lexicon. Conclusion: The results of this analysis provide a framework for future automated global surveillance efforts that reduce manual input and improve timeliness of reporting.
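A minimal sketch of the gazetteer-plus-context idea, with invented sentences and a hypothetical three-entry gazetteer (not HealthMap's actual system): weak labels come from the gazetteer, a classifier learns the surrounding-word context, and it can then score place names that were never in the lexicon.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

GAZETTEER = {"kenya", "vietnam", "peru"}  # hypothetical seed lexicon

train = [
    "an outbreak of cholera was reported in Kenya last week",
    "officials in Vietnam confirmed new avian influenza cases",
    "the ministry of health in Peru issued a dengue alert",
]
test = "authorities in Moldova are investigating measles cases"

def token_features(tokens, i):
    # Only the surrounding context; the token itself is excluded so the
    # model must generalize beyond the gazetteer entries.
    return {
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "capitalized": tokens[i][0].isupper(),
    }

X, y = [], []
for sent in train:
    toks = sent.split()
    for i in range(len(toks)):
        X.append(token_features(toks, i))
        y.append(toks[i].lower() in GAZETTEER)  # weak labels via gazetteer

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), y)

toks = test.split()
probs = clf.predict_proba(
    vec.transform([token_features(toks, i) for i in range(len(toks))]))[:, 1]
for tok, p in zip(toks, probs):
    print(f"{tok:15s} P(location) = {p:.2f}")  # "Moldova" is outside the lexicon
```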
HAL (Le Centre pour la Communication Scientifique Directe), May 2, 2016
The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that helps define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the Euclidean norm. We claim that in some cases the Euclidean norm on the initial vectorial space might not be the most appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.
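A minimal sketch of the underlying motivation, using scikit-learn's NCA as a stand-in metric learner rather than the algorithm the paper proposes: label propagation over a kNN graph built in the raw Euclidean space, versus the same graph built after learning a representation adapted to the labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           random_state=0)
labels = np.full(len(y), -1)                  # -1 marks unlabeled points
rng = np.random.RandomState(0)
seed = rng.choice(len(y), 30, replace=False)  # only 30 labeled instances
labels[seed] = y[seed]

# Graph from the raw vectors (Euclidean kNN).
raw = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, labels)

# Graph from a learned representation: fit NCA on the labeled subset,
# project everything, then build the same kNN graph in that space.
nca = NeighborhoodComponentsAnalysis(random_state=0).fit(X[seed], y[seed])
adapted = LabelSpreading(kernel="knn", n_neighbors=7).fit(nca.transform(X), labels)

print("raw-space accuracy:    ", (raw.transduction_ == y).mean())
print("learned-space accuracy:", (adapted.transduction_ == y).mean())
```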
In Automatic Text Processing tasks, documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other, and Themes gather documents with particular distributions over the topics. An experiment reports the performance of the different models in this family over a common task.
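One plausible reading of the two-level generative assumption sketched above, written out as a likelihood (an illustration; the paper's exact parameterization may differ): a document d is associated with a theme t, the theme fixes a distribution over topics z, and each topic is a distribution over words w.

```latex
% Sketch of a TTMM-style document likelihood under the assumptions above.
\[
  p(d) \;=\; \sum_{t} p(t) \prod_{w \in d} \sum_{z} p(z \mid t)\, p(w \mid z)
\]
```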
Findings of the Association for Computational Linguistics: EMNLP 2022
Encoded text representations often capture sensitive attributes about individuals (e.g., race or gender), which raise privacy concerns and can make downstream models unfair to certain groups. In this work, we propose FEDERATE, an approach that combines ideas from differential privacy and adversarial training to learn private text representations which also induces fairer models. We empirically evaluate the trade-off between the privacy of the representations and the fairness and accuracy of the downstream model on four NLP datasets. Our results show that FEDERATE consistently improves upon previous methods, and thus suggest that privacy and fairness can positively reinforce each other.

1 Introduction

Algorithmically-driven decision-making systems raise fairness concerns (Raghavan et al., 2020; van den Broek et al., 2019) as they can be discriminatory against specific groups of people. These systems have also been shown to leak sensitive information about the data of individuals used for training or inference, and thus pose privacy risks (Shokri et al., 2017). Societal pressure as well as recent regulations push for enforcing both privacy and fairness in real-world deployments, which is challenging as these notions are multi-faceted concepts that need to be tailored to the context.
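A minimal PyTorch sketch of the two ingredients the abstract combines, on random stand-in tensors (an illustration, not the authors' released FEDERATE code): Gaussian noise injected into the encoded representation as a privacy mechanism, and a gradient-reversed adversary that tries to recover the sensitive attribute, pushing the encoder toward attribute-invariant yet task-useful features.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward; reversed (scaled) gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

encoder = nn.Sequential(nn.Linear(300, 64), nn.ReLU())
task_head = nn.Linear(64, 2)   # downstream label (e.g. sentiment)
adversary = nn.Linear(64, 2)   # sensitive attribute (e.g. gender)
opt = torch.optim.Adam([*encoder.parameters(), *task_head.parameters(),
                        *adversary.parameters()], lr=1e-3)
xent = nn.CrossEntropyLoss()

x = torch.randn(32, 300)       # stand-in for encoded text features
y_task = torch.randint(0, 2, (32,))
y_sens = torch.randint(0, 2, (32,))

for _ in range(100):
    z = encoder(x)
    z = z + 0.1 * torch.randn_like(z)        # noise injection (privacy)
    loss_task = xent(task_head(z), y_task)   # keep z useful downstream
    # Reversal: the adversary learns to predict y_sens, but the encoder
    # receives the opposite gradient, so z is pushed to hide it.
    loss_adv = xent(adversary(GradReverse.apply(z, 1.0)), y_sens)
    (loss_task + loss_adv).backward()
    opt.step()
    opt.zero_grad()
```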
HAL (Le Centre pour la Communication Scientifique Directe), Jul 8, 2021
The discipline of Natural Language Processing (NLP) has made significant progress over the last ten years thanks to advances in artificial intelligence, in particular deep neural networks. These techniques are now widely used for knowledge extraction, sentiment analysis and machine translation of texts. The structural and conceptual similarities between natural language and music have motivated numerous research initiatives aiming to adapt NLP tools to the processing of symbolic or audio musical data. These efforts have yielded promising results, notably in the fields of automatic music analysis and generation. Beyond their performance, the present project aims to study the inner workings of two of these models, word embeddings and transformers, as well as their ability to adapt to musical rather than textual data. These experiments further our command of these tools as adapted to music and help clarify numerous parallels between natural language and musical language.
Since their conception for NLP tasks in 2017, Transformer neural networks have been increasingly used, with compelling results, for a variety of symbolic MIR tasks including music analysis, classification and generation. Although the concept of self-attention between words in text can intuitively be transposed as a relation between musical objects such as notes or chords in a score, it remains relatively unknown what kind of musical relations self-attention mechanisms tend to capture when applied to musical data. Moreover, the principle of self-attention was elaborated in NLP to help model the “meaning” of a sentence, while in the musical domain this concept appears to be more subjective. In this explorative work, we open the music transformer black box, looking to identify which aspects of music are actually learnt by the self-attention mechanism. We apply this approach to two MIR probing tasks: composer classification and cadence identification.
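A minimal sketch of the probing setup on toy data (not the paper's trained music transformers): embed a symbolic note sequence, read the attention matrix out of a self-attention layer to see which positions attend to which, and attach a linear probe to the pooled representation for a sequence-level label, the stand-in for composer classification. The vocabulary, dimensions and `notes` tensor are all invented.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 64, 32
embed = nn.Embedding(vocab_size, d_model)
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

notes = torch.randint(0, vocab_size, (1, 16))  # toy symbolic sequence
x = embed(notes)
out, weights = attn(x, x, x, need_weights=True, average_attn_weights=True)
print(weights.shape)  # (1, 16, 16): which positions attend to which

# The probing step: a linear classifier on the pooled representation.
# Fitting it on many labeled sequences (here a single forward pass) asks
# whether the attended representation encodes, e.g., the composer.
probe = nn.Linear(d_model, 2)
logits = probe(out.mean(dim=1))
print(logits.shape)  # (1, 2): one score per class
```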
Yearbook of the German Cognitive Linguistics Association, 2018
Fake is often considered the textbook example of a so-called ‘privative’ adjective, one which, in other words, allows the proposition that ‘(a) fake x is not (an) x’. This study tests the hypothesis that the contexts of an adjective-noun combination are more different from the contexts of the noun when the adjective is such a ‘privative’ one than when it is an ordinary (subsective) one. We here use ‘embeddings’, that is, dense vector representations based on word co-occurrences in a large corpus, which in our study is the entire English Wikipedia as it was in 2013. Comparing the cosine distance between the adjective-noun bigram and single-noun embeddings across two sets of adjectives, privative and ordinary ones, we fail to find a noticeable difference. However, we contest that fake is an across-the-board privative adjective, since a fake article, for instance, is most definitely still an article. We extend a recent proposal involving the noun’s qualia roles (how an entity is made, w...
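A minimal sketch of the measurement on an invented micro-corpus (the study used the 2013 English Wikipedia): adjective-noun bigrams are merged into single tokens before training, so each bigram receives its own embedding, and the cosine distance to the bare noun is compared for a privative adjective (fake) versus an ordinary one (red).

```python
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine

# Tiny invented corpus; bigrams are pre-merged into single tokens.
sentences = [
    "the police seized a fake_passport at the border".split(),
    "she renewed her passport before the trip".split(),
    "he waved a red_flag at the crowd".split(),
    "the flag flew over the stadium".split(),
] * 200  # repeat so the toy vocabulary gets enough training updates

model = Word2Vec(sentences, vector_size=50, min_count=1, window=3, seed=0)
wv = model.wv
print("d(fake_passport, passport):", cosine(wv["fake_passport"], wv["passport"]))
print("d(red_flag, flag):         ", cosine(wv["red_flag"], wv["flag"]))
```

On a corpus this small the distances are noise; the point is only the shape of the comparison the study runs at Wikipedia scale.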
Proceedings of the Tenth Conference on Computational Natural Language Learning - CoNLL-X '06, 2006
This paper investigates an isolated setting of the lexical substitution task of replacing words with their synonyms. In particular, we examine this problem in the setting of subtitle generation and evaluate state-of-the-art scoring methods that predict the validity of a given substitution. The paper evaluates two context-independent models and two contextual models. The major findings suggest that distributional similarity provides a useful complementary estimate for the likelihood that two WordNet synonyms are indeed substitutable, while proper modeling of contextual constraints is still a challenging task for future research.
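A minimal sketch of the two signals the paper weighs against each other (not the original scoring code; the GloVe vectors are a stand-in distributional model): WordNet synonymy acts as a context-independent filter, and distributional similarity gives the complementary graded estimate of substitutability.

```python
import gensim.downloader
from nltk.corpus import wordnet as wn  # needs: nltk.download("wordnet")

# Stand-in distributional model; downloaded once and cached.
vectors = gensim.downloader.load("glove-wiki-gigaword-50")

def substitution_score(target, candidate):
    # Context-independent filter: do the two words share a WordNet synset?
    shares_synset = any(
        candidate in {l.name() for l in s.lemmas()} for s in wn.synsets(target)
    )
    if not shares_synset:
        return 0.0  # not WordNet synonyms: reject outright
    # Among synonyms, rank by distributional similarity.
    return float(vectors.similarity(target, candidate))

print(substitution_score("movie", "film"))    # synonym, graded by similarity
print(substitution_score("movie", "banana"))  # filtered out: 0.0
```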
We address in this report the problem of formally representing textual data. First, this problem is situated in the context of automatic text processing. Then, the weaknesses of the basic document representation, i.e. the bag-of-words representation, are explained, and some state-of-the-art methods claiming to overcome these weaknesses are reviewed. Moreover, we propose a novel graphical model, the Theme Topic Mixture Model, which also claims to do so, in addition to providing a probabilistic framework in which documents are considered.
Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, under various conditions. More precisely, using a very large dataset to estimate the whole "population", we analyzed the behavior of several statistical tests, varying the class imbalance, the compared models, the performance measure, and the sample size. The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions.
Lecture Notes in Computer Science
Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA and LDA. In this paper, we propose the use of a neural network based, non-probabilistic, solution, which captures jointly a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC-8 and 9 sets of queries yielded a better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.
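A minimal sketch of the general idea of jointly learned word and document representations, using gensim's Doc2Vec as a later stand-in for the paper's network (the toy documents are invented): words and documents share one vector space, and retrieval ranks documents by similarity to an embedded query.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "stocks fell as the central bank raised interest rates",
    "the team won the championship after a late goal",
    "inflation and rates dominated the bank's policy meeting",
]
corpus = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
model = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=100, seed=0)

# Retrieval: embed the query into the same space, rank documents by it.
query_vec = model.infer_vector("interest rate policy".split())
print(model.dv.most_similar([query_vec], topn=3))  # (doc id, similarity)
```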
Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several well-known measures such as F1, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop non-parametric solutions for such non-standard measures. We furthermore assess the quality of these tests on a real-life large dataset.
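A minimal sketch of the non-parametric route described above, on synthetic predictions: because F1 does not decompose into a sum of per-example errors, classical paired tests do not apply directly, but a bootstrap over test examples still yields a confidence interval for the F1 difference between two models.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 2000)  # toy test labels
# Two synthetic models: ~85% and ~82% of predictions match the label.
pred_a = np.where(rng.random(2000) < 0.85, y_true, 1 - y_true)
pred_b = np.where(rng.random(2000) < 0.82, y_true, 1 - y_true)

observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
diffs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resample examples
    diffs.append(f1_score(y_true[idx], pred_a[idx])
                 - f1_score(y_true[idx], pred_b[idx]))

# Two-sided check: does the bootstrap interval for the F1 difference
# exclude zero?
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"dF1 = {observed:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```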
Text categorization is intrinsically a supervised learning task, which aims at relating a given text document to one or more predefined categories. Unfortunately, labeling such databases of documents is a painful task. We present in this paper a method that takes advantage of huge amounts of unlabeled text documents available in digital format to counterbalance the relatively smaller amount of available labeled text documents. A Siamese MLP is trained in a multi-task framework in order to solve two concurrent tasks: using the unlabeled data, we search for a mapping from the documents' bag-of-words representation to a new feature space emphasizing similarities and dissimilarities among documents; simultaneously, this mapping is constrained to also give good text categorization performance over the labeled dataset. Experimental results on Reuters RCV1 suggest that, as expected, performance over the labeled task increases as the amount of unlabeled data increases.
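A minimal PyTorch sketch of the multi-task setup described above, on random stand-in tensors rather than the Reuters RCV1 pipeline: one shared MLP maps bag-of-words vectors into a new space; a contrastive loss over pairs of unlabeled documents (with a similarity flag, e.g. derived from their bag-of-words proximity) shapes that space, while a cross-entropy loss on the few labeled documents keeps it useful for categorization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

mlp = nn.Sequential(nn.Linear(1000, 128), nn.ReLU(), nn.Linear(128, 32))
clf = nn.Linear(32, 4)  # 4 stand-in categories
opt = torch.optim.Adam([*mlp.parameters(), *clf.parameters()], lr=1e-3)

# Toy stand-ins: a small labeled set, plus unlabeled pairs with a
# similarity flag (1 = should be close, 0 = should be far apart).
x_lab = torch.rand(64, 1000)
y_lab = torch.randint(0, 4, (64,))
x1, x2 = torch.rand(256, 1000), torch.rand(256, 1000)
same = torch.randint(0, 2, (256,)).float()

margin = 1.0
for _ in range(50):
    # Siamese branch: the same MLP embeds both sides of each pair.
    z1, z2 = mlp(x1), mlp(x2)
    dist = F.pairwise_distance(z1, z2)
    contrastive = (same * dist.pow(2)
                   + (1 - same) * F.relu(margin - dist).pow(2)).mean()
    # Supervised branch: categorize the labeled documents in that space.
    supervised = F.cross_entropy(clf(mlp(x_lab)), y_lab)
    (contrastive + supervised).backward()
    opt.step()
    opt.zero_grad()
```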
… , ICML, Workshop on ROC Analysis in …, 2005
In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the so-called break-even point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC), which provide unbiased estimates of performance at various operating points. Furthermore, we show how to adequately use a non-parametric statistical test in order to produce EPCs with confidence intervals or to assess the statistically significant difference between two models under various settings.
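A minimal sketch of the EPC principle on toy scores (not the paper's data): rather than reporting a post-hoc optimal operating point measured on the test set, each threshold is selected on a separate development set for a given trade-off criterion, and only the test performance it yields is reported.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_scores(n):
    # Synthetic binary labels with scores shifted by the true class.
    y = rng.integers(0, 2, n)
    return y, rng.normal(y.astype(float), 1.0)

y_dev, s_dev = toy_scores(5000)
y_test, s_test = toy_scores(5000)

def error_rates(y, s, thr):
    fp = np.mean(s[y == 0] >= thr)  # false positive rate
    fn = np.mean(s[y == 1] < thr)   # false negative rate
    return fp, fn

# Sweep a trade-off parameter alpha; for each alpha, choose the threshold
# minimizing the weighted criterion on dev, then evaluate it on test.
thrs = np.linspace(s_dev.min(), s_dev.max(), 200)
for alpha in (0.3, 0.5, 0.7):
    crit = [alpha * error_rates(y_dev, s_dev, t)[0]
            + (1 - alpha) * error_rates(y_dev, s_dev, t)[1] for t in thrs]
    t_star = thrs[int(np.argmin(crit))]
    fp, fn = error_rates(y_test, s_test, t_star)
    print(f"alpha={alpha}: test FPR={fp:.3f}, FNR={fn:.3f}")
```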