Emmanuele Chersoni | The Hong Kong Polytechnic University
Papers by Emmanuele Chersoni
Cornell University - arXiv, Oct 21, 2022
Medical term normalization consists of mapping a piece of text to a large number of output classes. Given the small size of the annotated datasets and the extremely long-tailed distribution of the concepts, it is of utmost importance to develop models that are capable of generalizing to scarce or unseen concepts. An important attribute of most target ontologies is their hierarchical structure. In this paper we introduce a simple and effective learning strategy that leverages such information to enhance the generalizability of both discriminative and generative models. The evaluation shows that the proposed strategy produces state-of-the-art performance on seen concepts and consistent improvements on unseen ones, while also allowing for efficient zero-shot knowledge transfer across text typologies and datasets.
Cornell University - arXiv, Sep 7, 2022
This paper describes the models developed by the AILAB-Udine team for the SMM4H'22 Shared Task. We explored the limits of Transformer-based models on text classification, entity extraction and entity normalization, tackling Tasks 1, 2, 5, 6 and 10. Our main takeaways from participating in the different tasks are the strongly positive effect of combining different architectures through ensemble learning, and the great potential of generative models for term normalization.
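As an illustration of the ensemble-learning takeaway, here is a minimal sketch of hard majority voting over the label predictions of several fine-tuned classifiers. The model outputs and label names are hypothetical, and the actual AILAB-Udine ensembles may combine architectures differently.

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble: each inner list holds one model's labels for
    the same sequence of examples; ties resolve to the first-seen label."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Hypothetical outputs of three fine-tuned classifiers on four tweets
model_a = ["ADE", "NoADE", "ADE", "NoADE"]
model_b = ["ADE", "ADE", "NoADE", "NoADE"]
model_c = ["ADE", "NoADE", "ADE", "ADE"]

print(majority_vote([model_a, model_b, model_c]))
# ['ADE', 'NoADE', 'ADE', 'NoADE']
```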
Proceedings of the Second Workshop on Understanding Implicit and Underspecified Language
An intelligent system is expected to perform reasonable inferences, accounting for both the literal meaning of a word and the meanings a word can acquire in different contexts. A specific kind of inference concerns the connective and, which in some cases gives rise to a temporal-succession or causal interpretation, in contrast with the logical, commutative one (Levinson, 2000). In this work, we investigate the phenomenon by creating a new dataset for evaluating the interpretation of and by NLI systems, which we use to test three Transformer-based models. Our results show that all systems generalize patterns that are consistent with both the logical and the pragmatic interpretation, perform inferences that are inconsistent with each other, and show clear divergences from both theoretical accounts and human behavior.
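The minimal-pair setup can be reproduced schematically with any off-the-shelf NLI model. The sketch below assumes the public roberta-large-mnli checkpoint (not necessarily one of the three models tested in the paper) and that checkpoint's contradiction/neutral/entailment label order. Under the purely logical reading of and, swapping the conjuncts should be labeled entailment; the temporal/pragmatic reading predicts otherwise.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"  # any MNLI-trained checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

def nli_probs(premise, hypothesis):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    # Label order is checkpoint-specific; this is roberta-large-mnli's.
    return dict(zip(["contradiction", "neutral", "entailment"], probs.tolist()))

premise = "She opened the door and walked in."
hypothesis = "She walked in and opened the door."
print(nli_probs(premise, hypothesis))
```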
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Distributional Semantic Models have been successfully used for modeling selectional preferences in a variety of scenarios, since distributional similarity naturally provides an estimate of the degree to which an argument satisfies the requirements of a given predicate. However, we argue that the performance of such models on rare verb-argument combinations has received relatively little attention: it is not clear whether they are able to distinguish combinations that are simply atypical, or implausible, from the semantically anomalous ones, and in particular, they have never been tested on the task of modeling the differences in processing complexity between them. In this paper, we compare two different models of thematic fit by testing their ability to identify violations of selectional restrictions in two datasets from experimental studies.
Cornell University - arXiv, Dec 2, 2022
People constantly use language to learn about the world. Computational linguists have capitalized on this fact to build large language models (LLMs) that acquire co-occurrence-based knowledge from language corpora. LLMs achieve impressive performance on many tasks, but the robustness of their world knowledge has been questioned. Here, we ask: do LLMs acquire generalized knowledge about real-world events? Using curated sets of minimal sentence pairs (n=1215), we tested whether LLMs are more likely to generate plausible event descriptions compared to their implausible counterparts. We found that LLMs systematically distinguish possible and impossible events (The teacher bought the laptop vs. The laptop bought the teacher) but fall short of human performance when distinguishing likely and unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLMs generalize well across syntactic sentence variants (active vs. passive) but less well across semantic sentence variants (synonymous sentences), (iii) some, but not all, LLM deviations from ground-truth labels align with crowdsourced human judgments, and (iv) explicit event plausibility information emerges in middle LLM layers and remains high thereafter. Overall, our analyses reveal a gap in LLMs' event knowledge, highlighting their limitations as generalized knowledge bases. We conclude by speculating that the differential performance on impossible vs. unlikely events is not a temporary setback but an inherent property of LLMs, reflecting a fundamental difference between linguistic knowledge and world knowledge in intelligent systems.
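The core comparison can be sketched with any autoregressive language model scored through the HuggingFace transformers library; GPT-2 below merely stands in for the LLMs evaluated in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence):
    """Total log probability of a sentence under the language model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over predicted tokens
    return -loss.item() * (ids.size(1) - 1)

# A minimal pair from the possible/impossible contrast
plausible = "The teacher bought the laptop."
implausible = "The laptop bought the teacher."
print(sentence_logprob(plausible) > sentence_logprob(implausible))  # expected: True
```

A model "passes" a minimal pair when it assigns the plausible member the higher probability; aggregating over all pairs yields the accuracies that are then compared against human judgments.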
Australian Journal of Linguistics
An iconic pattern across spoken languages is that words for 'this' and 'here' tend to have high front vowels, whereas words for 'that' and 'there' tend to have low and/or back vowels. In Italian, there are two synonymous words for 'here', namely qui and qua, and two synonymous words for 'there', lì and là. Qui 'here' and là 'there' are iconic because qui has the high front vowel /i/ and là has the low vowel /a/, whereas qua 'here' and lì 'there' are counter-iconic, since their vowels are the opposite. Based on corpus, survey and computational data, we demonstrate that (i) qui 'here' and là 'there' have consistently been used more frequently throughout history than qua 'here' and lì 'there', respectively; and (ii) in present-day Italian, qui 'here' tends to refer to a location that is closer to the speaker than qua 'here' does, whereas là 'there' tends to refer to a location that is further from the speaker than lì 'there' does. In summary, the iconic demonstrative pronouns (qui and là) are used more frequently and are closer to the prototypical meanings of 'here' and 'there'. We argue that their frequency and prototypicality are motivated by their iconic power. This case study shows how iconicity may act as a pressure on language use and language change.
Journal of Medical Internet Research
Background In the current phase of the COVID-19 pandemic, we are witnessing the most massive vaccine rollout in human history. Like any other drug, vaccines may cause unexpected side effects, which need to be investigated in a timely manner to minimize harm in the population. If not properly dealt with, side effects may also impact public trust in the vaccination campaigns carried out by national governments. Objective Monitoring social media for the early identification of side effects and understanding public opinion on the vaccines are of paramount importance to ensure a successful and harmless rollout. The objective of this study was to create a web portal to monitor the opinion of social media users on COVID-19 vaccines, offering a tool for journalists, scientists, and users alike to visualize how the general public is reacting to the vaccination campaign. Methods We developed a tool to analyze public opinion on COVID-19 vaccines from Twitter, exploiting, among ...
Eye-tracking psycholinguistic studies have suggested that context-word semantic coherence and predictability influence language processing during reading. In this study, we investigate the correlation between the cosine similarities computed with word embedding models (both static and contextualized) and eye-tracking data from two naturalistic reading corpora. We also study the correlations of surprisal scores computed with three state-of-the-art language models. Our results show strong correlations for the scores computed with BERT and GloVe, suggesting that similarity can play an important role in modeling reading times.
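The analysis reduces to correlating two per-word series. The sketch below uses random arrays as stand-ins for real embeddings and eye-tracking measures, purely to show the shape of the computation.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
# Stand-ins: one vector per target word and the mean vector of its context
target_vecs = rng.normal(size=(50, 300))
context_vecs = rng.normal(size=(50, 300))
reading_times = rng.normal(loc=250, scale=40, size=50)  # hypothetical ms values

similarities = [cosine(t, c) for t, c in zip(target_vecs, context_vecs)]
rho, p = spearmanr(similarities, reading_times)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```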
Adverse Drug Event (ADE) extraction models can rapidly examine large collections of social media texts, detecting mentions of drug-related adverse reactions and triggering medical investigations. However, despite recent advances in NLP, it is currently unknown whether such models are robust in the face of negation, which is pervasive across language varieties. In this paper we evaluate three state-of-the-art systems, showing their fragility against negation, and then introduce two possible strategies to increase the robustness of these models: a pipeline approach, relying on a specific component for negation detection, and an augmentation of an ADE extraction dataset to artificially create negated samples and further train the models. We show that both strategies bring significant increases in performance, lowering the number of spurious entities predicted by the models. Our dataset and code will be publicly released to encourage research on the topic.
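The augmentation strategy can be illustrated with a deliberately naive rule: rewrite a sample so the adverse reaction is negated and drop its gold entity span, since a negated mention should no longer be extracted. The function and data layout below are hypothetical; a usable implementation would need linguistically informed rewriting rather than string replacement.

```python
def negate_sample(sample):
    """Toy augmentation: negate each ADE mention and remove its gold span.
    Real augmentation needs proper morphological rewriting; this only
    illustrates the label bookkeeping."""
    text = sample["text"]
    for span in sample["ade_spans"]:
        text = text.replace(span, f"no {span}")
    return {"text": text, "ade_spans": []}  # negated mention -> no gold entity

sample = {"text": "This drug caused severe headaches", "ade_spans": ["severe headaches"]}
print(negate_sample(sample))
# {'text': 'This drug caused no severe headaches', 'ade_spans': []}
```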
Conference of the European Chapter of the Association for Computational Linguistics, 2021
Workshop on Computational Linguistics for Linguistic Complexity, 2016
In this paper, we introduce a new distributional method for modeling predicate-argument thematic fit judgments. We use a syntax-based DSM to build a prototypical representation of verb-specific roles: for every verb, we extract the most salient second-order contexts for each of its roles (i.e., the most salient dimensions of typical role fillers), and then compute thematic fit as a weighted overlap between the top features of candidate fillers and role prototypes. Our experiments show that our method consistently outperforms a baseline re-implementing a state-of-the-art system, and achieves results that are better than or comparable to those reported in the literature for other unsupervised systems. Moreover, it provides an explicit representation of the features characterizing verb-specific semantic roles.
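Schematically, the scoring step reduces to a weighted feature overlap. The feature names and weights below are invented for illustration; in the paper they come from a syntax-based DSM and salience-ranked second-order contexts, and the exact weighting scheme is the authors'.

```python
def weighted_overlap(candidate, prototype, k=3):
    """Thematic fit as the summed prototype weights of the candidate filler's
    top-k features that also appear in the role prototype."""
    top_candidate = sorted(candidate, key=candidate.get, reverse=True)[:k]
    return sum(prototype[f] for f in top_candidate if f in prototype)

# Hypothetical salience-weighted features for the object role of "eat"
eat_object = {"edible": 0.9, "food": 0.8, "sweet": 0.3}
apple = {"edible": 0.7, "food": 0.6, "red": 0.5}
stone = {"hard": 0.9, "grey": 0.6, "heavy": 0.5}

print(weighted_overlap(apple, eat_object))  # 1.7 -> plausible filler
print(weighted_overlap(stone, eat_object))  # 0   -> implausible filler
```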
Prior research has explored the ability of computational models to predict a word's semantic fit with a given predicate. While much work has been devoted to modeling the typicality relation between verbs and arguments in isolation, in this paper we take a broader perspective by assessing whether and to what extent computational approaches have access to information about the typicality of entire events and situations described in language (Generalized Event Knowledge). Given the recent success of Transformer Language Models (TLMs), we test them on a benchmark for the dynamic estimation of thematic fit. We evaluate these models in comparison with SDM, a framework specifically designed to integrate events in sentence meaning representations, and conduct a detailed error analysis to investigate which factors affect their behavior. Our results show that TLMs can reach performances that are comparable to those achieved by SDM. However, additional an...
In Distributional Semantic Models (DSMs), Vector Cosine is widely used to estimate similarity between word vectors, although this measure has been noted to suffer from several shortcomings. The recent literature has proposed other methods that attempt to mitigate such biases. In this paper, we investigate APSyn, a measure that computes the extent of the intersection between the most associated contexts of two target words, weighting it by context relevance. We evaluated this metric on a similarity estimation task over several popular test sets, and our results show that APSyn is in fact highly competitive, even with respect to the results reported in the literature for word embeddings. Moreover, APSyn addresses some of the weaknesses of Vector Cosine, performing well also on genuine similarity estimation.
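As described, APSyn sums over the shared top-N contexts of the two words, rewarding features that rank high in both association lists. The rank-based weighting below (inverse of the average 1-based rank) is one common formulation and should be read as an assumption about the exact scheme.

```python
def apsyn(contexts_a, contexts_b, n=100):
    """contexts_a/b: context features sorted by association strength,
    strongest first. Shared features contribute the inverse of their
    average 1-based rank, so strongly shared contexts count more."""
    top_a = {f: r + 1 for r, f in enumerate(contexts_a[:n])}
    top_b = {f: r + 1 for r, f in enumerate(contexts_b[:n])}
    return sum(1.0 / ((top_a[f] + top_b[f]) / 2) for f in top_a.keys() & top_b.keys())

# Hypothetical ranked context lists for two target words
dog = ["bark", "tail", "pet", "walk"]
cat = ["purr", "tail", "pet", "mouse"]
print(apsyn(dog, cat))  # tail: 1/2, pet: 1/3 -> 0.833...
```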
Proceedings of The 12th International Workshop on Semantic Evaluation, 2018
This paper describes BomJi, a supervised system for capturing discriminative attributes in word pairs (e.g. yellow as discriminative for banana over watermelon). The system relies on an XGB classifier trained on carefully engineered graph-, pattern- and word embedding-based features. It participated in the SemEval-2018 Task 10 on Capturing Discriminative Attributes, achieving an F1 score of 0.73 and ranking 2nd out of 26 participant systems.
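Given pre-computed features per (pivot word, comparison word, attribute) triple, the classifier itself is a standard XGBoost model. The random matrices below are placeholders for the engineered graph-, pattern- and embedding-based features, so this is a sketch of the setup, not the BomJi pipeline.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 12))    # placeholder feature vectors, one per triple
y = rng.integers(0, 2, size=200)  # 1 = attribute discriminates word1 from word2

clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X[:160], y[:160])
print("held-out accuracy:", clf.score(X[160:], y[160:]))
```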
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), 2017