Luisa Coheur - Academia.edu

Papers by Luisa Coheur

A Rewriting Approach for Gender Inclusivity in Portuguese

In recent years, there has been a notable rise in research interest regarding the integration of gender-inclusive and gender-neutral language in natural language processing models. A specific area of focus that has gained significant practical and academic interest is gender-neutral rewriting, which involves converting binary-gendered text to its gender-neutral counterpart. However, current approaches to gender-neutral rewriting for gendered languages tend to rely on large datasets, which may not be an option for languages with fewer resources, such as Portuguese. In this paper, we present a rule-based and a neural-based tool for gender-neutral rewriting for Portuguese, a heavily gendered Romance language whose morphology creates different challenges from the ones tackled by other gender-neutral rewriters. Our neural approach relies on fine-tuning large multilingual machine translation models on examples generated by the rule-based model. We evaluate both models on texts from different sources and contexts. We provide the first Portuguese dataset explicitly containing gender-neutral language and neopronouns, as well as a manually annotated golden collection of 500 sentences that allows for evaluation of future work.
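
As an illustration of a rule-based rewriting component like the one described above, the sketch below applies a tiny lexicon of gendered-to-neutral substitutions; the word list, the neopronoun choices and the absence of agreement handling are simplifying assumptions, not the paper's actual rule set.

```python
# Minimal sketch of a dictionary/regex rule-based gender-neutral rewriter for
# Portuguese. The real system handles agreement and morphology far beyond this;
# the lexicon and neopronoun forms below are illustrative assumptions only.
import re

NEUTRAL_FORMS = {          # hypothetical, highly simplified lexicon
    "ele": "elu", "ela": "elu",
    "todos": "todes", "todas": "todes",
    "obrigado": "obrigade", "obrigada": "obrigade",
}

def rewrite_neutral(sentence: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        neutral = NEUTRAL_FORMS.get(word.lower(), word)
        return neutral.capitalize() if word[0].isupper() else neutral
    return re.sub(r"\b\w+\b", repl, sentence)

print(rewrite_neutral("Obrigada a todos, ela chegou."))
# -> "Obrigade a todes, elu chegou."
```

In a setup like the paper's neural approach, the output of such rules applied to a large corpus could serve as synthetic training pairs for fine-tuning a multilingual machine translation model.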

Towards a Fully Unsupervised Framework for Intent Induction in Customer Support Dialogues

arXiv (Cornell University), Jul 28, 2023

State-of-the-art models in intent induction require annotated datasets. However, annotating dialogues is time-consuming, laborious and expensive. In this work, we propose a completely unsupervised framework for intent induction within a dialogue. In addition, we show how pre-processing the dialogue corpora can improve results. Finally, we show how to extract the dialogue flows of intentions by investigating the most common sequences. Although we test our work on the MultiWOZ dataset, the fact that this framework requires no prior knowledge makes it applicable to any possible use case, making it very relevant to real-world customer support applications across industry.
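
A minimal sketch of one way to realize the unsupervised induction step described above: embed utterances and cluster them, with each cluster acting as a candidate intent. The sentence-encoder name and the use of KMeans are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch of an unsupervised intent-induction step: embed customer utterances
# and cluster them so that each cluster acts as a candidate intent.
# The encoder name and KMeans with k=2 are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

utterances = [
    "I want to book a table for two tonight",
    "Can you reserve a restaurant for me?",
    "What time does my train leave?",
    "Is there a later train to Cambridge?",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder choice
embeddings = encoder.encode(utterances)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for text, label in zip(utterances, labels):
    print(label, text)   # utterances sharing a label share an induced intent
```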

PE2LGP: tradutor de português europeu para língua gestual portuguesa em glosas

Linguamática

Portuguese Sign Language, like the Portuguese language, evolved naturally, acquiring grammatical characteristics distinct from Portuguese. Thus, developing a translator between the two does not consist merely of mapping each word to a sign (signed Portuguese), but of guaranteeing that the resulting signs satisfy the grammar of Portuguese Sign Language and that the translations are semantically correct. Previous work relies exclusively on manual translation rules, is very limited in the range of grammatical phenomena covered, and produces little more than signed Portuguese. In this article, we present the first translation system from Portuguese to Portuguese Sign Language, PE2LGP, which, in addition to manual rules, relies on translation rules built automatically from a reference corpus. Given a Portuguese sentence, the system returns a sequence of glosses with markers that identify expressions...

SUMBot: Summarizing Context in Open-Domain Dialogue Systems

IberSPEECH 2022

In this paper, we investigate the problem of including relevant information as context in open-domain dialogue systems. Most models struggle to identify and incorporate important knowledge from dialogues and simply use the entire turns as context, which increases the size of the input fed to the model with unnecessary information. Additionally, due to the input size limitation of large pre-trained models (a few hundred tokens), regions of the history are not included and informative parts of the dialogue may be omitted. To overcome this problem, we introduce a simple method that replaces part of the context with a summary instead of using the whole history, which increases the ability of models to keep track of all the previous relevant information. We show that the inclusion of a summary may improve the answer generation task and discuss some examples to further understand the system's weaknesses.
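
The following sketch illustrates the core idea of replacing older history with a summary before building the model input; the summarization model and the turn budget are assumptions, not the paper's configuration.

```python
# Sketch of the context-building idea: keep the most recent turns verbatim and
# replace the older part of the history with a summary so the prompt fits the
# model's input limit. The summarizer model and turn budget are assumptions.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # assumed model

def build_context(history: list[str], keep_last: int = 4, max_summary_tokens: int = 60) -> str:
    """Return `summary + recent turns` instead of the full dialogue history."""
    older, recent = history[:-keep_last], history[-keep_last:]
    if not older:
        return "\n".join(recent)
    summary = summarizer(" ".join(older),
                         max_length=max_summary_tokens, min_length=10)[0]["summary_text"]
    return "Summary of earlier dialogue: " + summary + "\n" + "\n".join(recent)
```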

CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task

arXiv (Cornell University), Sep 13, 2022

We present the joint contribution of IST and Unbabel to the WMT 2022 Shared Task on Quality Estimation (QE). Our team participated in all three subtasks: (i) Sentence and Word-level Quality Prediction; (ii) Explainable QE; and (iii) Critical Error Detection. For all tasks we build on top of the COMET framework, connecting it with the predictor-estimator architecture of OpenKiwi, and equipping it with a word-level sequence tagger and an explanation extractor. Our results suggest that incorporating references during pretraining improves performance across several language pairs on downstream tasks, and that jointly training with sentence and word-level objectives yields a further boost. Furthermore, combining attention and gradient information proved to be the top strategy for extracting good explanations of sentence-level QE models. Overall, our submissions achieved the best results for all three tasks for almost all language pairs by a considerable margin.
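
As a rough illustration of the joint sentence- and word-level setup described above (a shared encoder with two heads), consider the sketch below; the encoder choice and head sizes are assumptions, and the real CometKiwi system adds reference-aware pretraining and an explanation extractor on top.

```python
# Minimal sketch of a joint sentence-level and word-level QE head on top of a
# multilingual encoder, in the spirit of the COMET/OpenKiwi combination.
# The encoder name and head sizes are illustrative assumptions.
import torch.nn as nn
from transformers import AutoModel

class JointQEHead(nn.Module):
    def __init__(self, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.sentence_head = nn.Linear(hidden, 1)   # sentence-level quality score
        self.word_head = nn.Linear(hidden, 2)       # per-token OK/BAD tags

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        sentence_score = self.sentence_head(states[:, 0])  # first-token pooling
        word_tags = self.word_head(states)                 # one tag per subword
        return sentence_score, word_tags
```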

Question rewriting? Assessing its importance for conversational question answering

In conversational question answering, systems must correctly interpret the interconnected interactions and generate knowledgeable answers, which may require the retrieval of relevant information from a background repository. Recent approaches to this problem leverage neural language models, although different alternatives can be considered in terms of modules for (a) representing user questions in context, (b) retrieving the relevant background information, and (c) generating the answer. This work presents a conversational question answering system designed specifically for the Search-Oriented Conversational AI (SCAI) shared task, and reports on a detailed analysis of its question rewriting module. In particular, we considered different variations of the question rewriting module to evaluate the influence on the subsequent components, and performed a careful analysis of the results obtained with the best system configuration. Our system achieved the best performance in the shared ta...
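
A minimal sketch of the three-module pipeline outlined above, with (a) question rewriting, (b) retrieval and (c) answer generation; the model names and the retriever interface are placeholder assumptions, not the SCAI submission itself.

```python
# Sketch of the (a) rewrite -> (b) retrieve -> (c) generate pipeline.
# Model names are assumed placeholders; `retrieve` is a caller-supplied function.
from transformers import pipeline

rewriter = pipeline("text2text-generation", model="castorini/t5-base-canard")  # assumed rewriter
generator = pipeline("text2text-generation", model="google/flan-t5-base")      # assumed generator

def answer(history: list[str], question: str, retrieve) -> str:
    turns = " ||| ".join(history + [question])
    rewritten = rewriter(turns)[0]["generated_text"]                  # (a) self-contained question
    passages = retrieve(rewritten)                                    # (b) background retrieval
    prompt = f"question: {rewritten} context: {' '.join(passages)}"
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]  # (c) answer generation
```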

Towards a Conversational Agent with “Character”

Lecture Notes in Computer Science, 2020

We present a simple approach to create a “persona” conversational agent. First, we take advantage of a large collection of subtitles to train a generative model based on neural networks. Second, we handcraft a small corpus of interactions that specify our character (from now on, the “persona corpus”). Third, we enrich a retrieval-based engine with this corpus. Finally, we combine both into a single agent. A preliminary evaluation shows that the generative model can hardly implement a coherent “persona”, but can successfully complement the retrieval model.

Online Learning for Conversational Agents

Progress in Artificial Intelligence, 2017

Agents relying on large collections of interactions face the challenge of choosing an appropriate answer from such collections. Several works address this challenge by using offline learning approaches, which do not take advantage of how user-agent conversations unfold. In this work, we propose an alternative approach: incorporating user feedback at each interaction with the agent, in order to enhance its ability to choose an answer. We focus on the case of adjusting the weights of the features used by the agent to choose an answer, using an online learning algorithm (the Exponentially Weighted Average Forecaster) for that purpose. We validate our hypothesis with an experiment featuring a specific agent and simulating user feedback using a reference corpus. The results of our experiment suggest that adjusting the agent's feature weights can improve its answers, provided that an appropriate reward function is designed, as this aspect is critical to the agent's performance.
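
A minimal sketch of the Exponentially Weighted Average Forecaster update, assuming a simple 0/1 loss per feature derived from whether the feature's preferred answer matched the reference; the loss design is an assumption, not the paper's exact reward function.

```python
# Minimal sketch of the Exponentially Weighted Average Forecaster used to
# adjust feature weights from per-interaction feedback. The 0/1 loss per
# feature and the learning rate eta are illustrative assumptions.
import math

def ewaf_update(weights: list[float], losses: list[float], eta: float = 0.5) -> list[float]:
    """Multiplicatively downweight features that incurred loss this turn."""
    updated = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(updated)
    return [w / total for w in updated]          # renormalize to a distribution

weights = [1 / 3] * 3                            # three answer-selection features
for losses in ([0, 1, 1], [0, 1, 0], [0, 0, 1]): # simulated per-turn feedback
    weights = ewaf_update(weights, losses)
print(weights)                                   # mass shifts toward feature 0
```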

A Conversational Agent Powered by Online Learning

Adaptive Agents and Multi-Agents Systems, May 8, 2017

In this work, we improve the performance of a dialogue engine, Say Something Smart, using online learning. Given a request by a user, this engine selects an answer from a corpus of movie subtitles, weighting the quality of each candidate answer according to several criteria and selecting the one that is chosen by the most representative criteria. We contribute with an online approach, using sequential learning, that adjusts the weights of the different criteria using a reference corpus of actual dialogues as input to simulate user feedback. This approach effectively allowed Say Something Smart to improve its performance at each interaction, as shown in an experiment performed on a test corpus.

HamNoSyS2SiGML: Translating HamNoSys Into SiGML

Sign Languages are visual languages and the main means of communication used by Deaf people. However, the majority of the information available online is presented in written form and is therefore not easily accessible to the Deaf community. Avatars that can animate sign languages have gained increased interest in this area due to their flexibility in the generation and editing process. Synthetic animation of conversational agents can be achieved through the use of notation systems. HamNoSys is one of these systems, which describes movements of the body through symbols. Its XML-compliant counterpart, SiGML, is a machine-readable form of HamNoSys able to animate avatars. Nevertheless, there are currently no freely available open-source libraries that allow the conversion from HamNoSys to SiGML. Our goal is to develop an open-access tool that can perform this conversion independently of other platforms. This system represents a crucial intermediate step in the bigger pipeline of anim...
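
A toy sketch of the conversion idea: map each HamNoSys symbol to a SiGML element and wrap the result in a sign entry. The symbol table, the private-use code points and the exact tag layout below are simplified assumptions; the real notation covers hundreds of symbols and nested structures.

```python
# Toy sketch of a HamNoSys -> SiGML conversion step. Symbol-to-tag mappings,
# code points and the surrounding XML structure are illustrative assumptions.
import xml.etree.ElementTree as ET

SYMBOL_TO_TAG = {          # hypothetical subset of the symbol table
    "\ue001": "hamfist",
    "\ue00c": "hamflathand",
    "\ue020": "hamextfingeru",
}

def hamnosys_to_sigml(hamnosys: str, gloss: str) -> str:
    sign = ET.Element("hamgestural_sign", {"gloss": gloss})
    manual = ET.SubElement(ET.SubElement(sign, "sign_manual"), "hamnosys_manual")
    for symbol in hamnosys:
        ET.SubElement(manual, SYMBOL_TO_TAG.get(symbol, "hamunknown"))
    return ET.tostring(sign, encoding="unicode")

print(hamnosys_to_sigml("\ue001\ue020", gloss="HELLO"))
```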

PE2LGP Animator: A Tool To Animate A Portuguese Sign Language Avatar

Software for the production of sign languages is much less common than for spoken languages. Such software usually relies on 3D humanoid avatars to produce signs, which, inevitably, necessitates the use of animation. One barrier to the use of popular animation tools is their complexity and steep learning curve, which can be hard to master for inexperienced users. Here, we present PE2LGP, an authoring system that features a 3D avatar that signs Portuguese Sign Language. Our Animator is designed specifically to craft sign language animations using a keyframe method, and is meant to be easy to use and learn for users without animation skills. We conducted a preliminary evaluation of the Animator, in which we animated seven Portuguese Sign Language sentences and asked four sign language users to evaluate their quality. This evaluation revealed that the system, in spite of its simplicity, is indeed capable of producing comprehensible messages.

JUST.ASK, a QA system that learns to answer new questions from previous interactions

We present JUST.ASK, a freely available Question Answering system. Its architecture is composed of the usual Question Processing, Passage Retrieval and Answer Extraction components. Several details on the information generated and manipulated by each of these components are also provided to the user when interacting with the demonstration. Since JUST.ASK also learns to answer new questions based on users’ feedback, the user is invited to identify the correct answers. These will then be used to retrieve answers to future questions.

Back to the Feature, in Entailment Detection and Similarity Measurement for Portuguese

This paper describes a system to identify entailment and quantify semantic similarity among pairs of Portuguese sentences. The system relies on a corpus to build a supervised model, and employs the same features regardless of the task. Our experiments cover two types of features, contextualized embeddings and lexical features, which we evaluate separately and in combination. The model is derived from a voting strategy over an ensemble of distinct regressors (for similarity measurement) or calibrated classifiers (for entailment detection). Applying such a system to other languages mainly depends on the availability of corpora, since all features are either multilingual or language-independent. We obtain competitive results on a recent Portuguese corpus, where our best result is obtained by joining embeddings with lexical features.
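
A minimal sketch of the voting strategy for the similarity task, combining embedding-based and lexical features and averaging distinct regressors; the feature stub and the regressor choices are assumptions, not the paper's exact configuration.

```python
# Sketch of the voting-ensemble idea: concatenate contextualized-embedding
# features with a lexical feature and average distinct regressors for the
# similarity score. The feature stub and regressor choices are assumptions.
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

def features(pair):
    # placeholder: sentence-pair embeddings plus a lexical overlap score
    emb_a, emb_b, lexical_overlap = pair
    return np.concatenate([emb_a, emb_b, [lexical_overlap]])

X = np.array([features((np.random.rand(8), np.random.rand(8), 0.4)) for _ in range(20)])
y = np.random.uniform(1, 5, size=20)          # gold similarity scores (1-5 scale)

ensemble = VotingRegressor([("ridge", Ridge()), ("svr", SVR())])
ensemble.fit(X, y)
print(ensemble.predict(X[:2]))                # averaged predictions of the two regressors
```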

An English-Portuguese parallel corpus of questions: translation guidelines and application in SMT

The task of Statistical Machine Translation depends on large amounts of training corpora. Despite the availability of several parallel corpora, these are typically composed of declarative sentences, which may not be appropriate when the goal is to translate other types of sentences, e.g., interrogatives. There have been efforts to create corpora of questions, especially in the context of the evaluation of Question-Answering systems. One of those corpora is the UIUC dataset, composed of nearly 6,000 questions and widely used in the task of Question Classification. In this work, we make available the Portuguese version of the UIUC dataset, which we manually translated, as well as the translation guidelines. We show the impact of this corpus on the performance of a state-of-the-art SMT system when translating questions. Finally, we present a taxonomy of translation errors, according to which we analyze the output of the automatic translation before and after using the corpus as training data.

Improving Question Generation with the Teacher’s Implicit Feedback

Lecture Notes in Computer Science, 2018

Although current Question Generation systems can be used to automatically generate questions for students’ assessments, these need validation and, often, manual corrections. However, this information is never used to improve the performance of QG systems, where it can play an important role. In this work, we present a system, GEN, that learns from such (implicit) feedback in an online learning setting. Following an example-based approach, it takes as input a small set of sentence/question pairs and creates patterns, which are then applied to learning materials. Each generated question, after being corrected by the teacher, is used as a new seed in the next iteration, so more patterns are created each time. We also take advantage of the corrections made by the teacher to score the patterns and therefore rank the generated questions. We measure the teacher’s required post-editing effort and show that GEN improves over time, reducing the average corrections needed per question from 70% to 30%.

To BERT or Not to BERT: Dealing with Possible BERT Failures in an Entailment Task

Information Processing and Management of Uncertainty in Knowledge-Based Systems, 2020

In this paper we focus on a Natural Language Inference task. Given two sentences, we classify their relation as NEUTRAL, ENTAILMENT or CONTRADICTION. Considering the achievements of BERT (Bidirectional Encoder Representations from Transformers) in many Natural Language Processing tasks, we use BERT features to create our base model for this task. However, several questions arise: can other features improve the performance obtained with BERT? If we are able to predict the situations in which BERT will fail, can we improve the performance by providing alternative models for these situations? We test several strategies and models as alternatives to the standalone BERT model in the possible failure situations, and we take advantage of semantic features extracted from Discourse Representation Structures.
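
The fallback idea can be sketched as a simple gate: a failure predictor decides, per sentence pair, whether to trust the BERT classifier or hand the example to an alternative model (e.g., one using DRS-derived features). All components below are placeholder stubs, not the paper's actual models.

```python
# Sketch of the failure-gating strategy: route predicted-hard cases away from
# the standalone BERT model. All models and the heuristic are placeholder stubs.
def classify_pair(pair, bert_model, fallback_model, failure_predictor):
    """Return NEUTRAL / ENTAILMENT / CONTRADICTION for a sentence pair."""
    if failure_predictor(pair):          # predicted to be a hard case for BERT
        return fallback_model(pair)      # e.g., a model using DRS-derived features
    return bert_model(pair)

# usage with trivial stubs
label = classify_pair(
    ("A man is playing a guitar.", "A person plays an instrument."),
    bert_model=lambda p: "ENTAILMENT",
    fallback_model=lambda p: "NEUTRAL",
    failure_predictor=lambda p: len(p[0].split()) > 30,   # assumed heuristic
)
print(label)
```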

One Arm to Rule Them All: Online Learning with Multi-armed Bandits for Low-Resource Conversational Agents

Progress in Artificial Intelligence, 2021

In a low-resource scenario, the lack of annotated data can be an obstacle not only to training a robust system, but also to evaluating and comparing different approaches before deploying the best one for a given setting. We propose to dynamically find the best approach for a given setting by taking advantage of feedback naturally present in the scenario at hand (when it exists). To this end, we present a novel application of online learning algorithms, where we frame the choice of the best approach as a multi-armed bandits problem. Our proof-of-concept is a retrieval-based conversational agent, in which the answer selection criteria available to the agent are the competing approaches (arms). In our experiment, an adversarial multi-armed bandits approach converges to the performance of the best criterion after just three interaction turns, which suggests the appropriateness of our approach in a low-resource conversational agent.
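
A minimal sketch of an adversarial bandit (Exp3-style) over answer-selection criteria, where each criterion is an arm and the reward is assumed to be 1 when the user accepts the chosen answer; the reward design and parameters are assumptions, not the paper's exact setup.

```python
# Minimal Exp3-style adversarial bandit over competing answer-selection
# criteria (arms). The binary reward signal is an illustrative assumption.
import math
import random

def exp3_step(weights: list[float], gamma: float = 0.1):
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    arm = random.choices(range(k), weights=probs)[0]
    return arm, probs

def exp3_update(weights, arm, reward, probs, gamma: float = 0.1):
    k = len(weights)
    estimated = reward / probs[arm]                  # importance-weighted reward
    weights[arm] *= math.exp(gamma * estimated / k)
    return weights

weights = [1.0, 1.0, 1.0]                            # three candidate criteria
for _ in range(3):                                   # three simulated interaction turns
    arm, probs = exp3_step(weights)
    reward = 1.0 if arm == 0 else 0.0                # pretend criterion 0 is best
    weights = exp3_update(weights, arm, reward, probs)
print(weights)                                       # weight mass moves toward arm 0
```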

Benchmarking Natural Language Inference and Semantic Textual Similarity for Portuguese

Information, 2020

Two sentences can be related in many different ways. Distinct tasks in natural language processing aim to identify different semantic relations between sentences. We developed several models for natural language inference and semantic textual similarity for the Portuguese language. We took advantage of pre-trained models (BERT); additionally, we studied the roles of lexical features. We tested our models on several datasets—ASSIN, SICK-BR and ASSIN2—and the best results were usually achieved with ptBERT-Large, trained on a Brazilian corpus and fine-tuned on these datasets. Besides obtaining state-of-the-art results, this is, to the best of our knowledge, the most comprehensive study of natural language inference and semantic textual similarity for the Portuguese language.

L2F/INESC-ID at SemEval-2019 Task 2: Unsupervised Lexical Semantic Frame Induction using Contextualized Word Representations

Proceedings of the 13th International Workshop on Semantic Evaluation, 2019

Building large datasets annotated with semantic information, such as FrameNet, is an expensive process. Consequently, such resources are unavailable for many languages and specific domains. This problem can be alleviated by using unsupervised approaches to induce the frames evoked by a collection of documents. That is the objective of the second task of SemEval 2019, which comprises three subtasks: clustering of verbs that evoke the same frame, and clustering of arguments into both frame-specific slots and semantic roles. We approach all the subtasks by applying a graph clustering algorithm to contextualized embedding representations of the verbs and arguments. Using such representations is appropriate in the context of this task, since they provide cues for word-sense disambiguation. Thus, they can be used to identify different frames evoked by the same words. Using this approach, we were able to outperform all of the baselines reported for the task on the test set in terms of Purity F1, as well as in terms of BCubed F1 in most cases.
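
A minimal sketch of the graph-clustering step: build a similarity graph over contextualized embeddings of verb (or argument) occurrences and extract communities as induced frames. The similarity threshold and the greedy-modularity algorithm are illustrative assumptions rather than the submitted system's exact choices.

```python
# Sketch of frame induction via graph clustering of contextualized embeddings.
# The similarity threshold and greedy modularity clustering are assumptions.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def induce_frames(embeddings: np.ndarray, threshold: float = 0.8):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T                     # cosine similarity matrix
    graph = nx.Graph()
    graph.add_nodes_from(range(len(embeddings)))
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if similarity[i, j] >= threshold:
                graph.add_edge(i, j, weight=float(similarity[i, j]))
    return list(greedy_modularity_communities(graph, weight="weight"))

rng = np.random.default_rng(0)
base_a, base_b = rng.normal(0, 1, 16), rng.normal(0, 1, 16)
occurrences = np.vstack([base_a + rng.normal(0, 0.05, (5, 16)),
                         base_b + rng.normal(0, 0.05, (5, 16))])
print(induce_frames(occurrences))   # two communities, i.e. two induced frames
```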

BeamSeg: A Joint Model for Multi-Document Segmentation and Topic Identification

Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 2019

We propose BeamSeg, a joint model for segmentation and topic identification of documents from the same domain. The model assumes that lexical cohesion can be observed across documents, meaning that segments describing the same topic use a similar lexical distribution over the vocabulary. The model implements lexical cohesion in an unsupervised Bayesian setting by drawing segments with the same topic from the same language model. Contrary to previous approaches, we assume that language models are not independent, since the vocabulary changes in consecutive segments are expected to be smooth and not abrupt. We achieve this by using a dynamic Dirichlet prior that takes into account data contributions from other topics. BeamSeg also models segment length properties of documents based on modality (textbooks, slides, etc.). The evaluation is carried out on three datasets. In two of them, improvements of up to 4.8% and 7.3% are obtained in the segmentation and topic identification tasks, indicating that both tasks should be jointly modeled.
