Ariel Gera | The Hebrew University of Jerusalem

Papers by Ariel Gera

Label-Efficient Model Selection for Text Generation

arXiv (Cornell University), Feb 12, 2024

Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required number of annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model for selecting between models, prompts and configurations. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations, by up to 75%, while maintaining high evaluation reliability.
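
The selection step lends itself to a compact sketch: embed both models' outputs for each input, cluster the difference vectors, and send only cluster representatives for preference annotation. The sketch below is a minimal illustration of that idea, assuming sentence-transformers and scikit-learn; the encoder name and cluster count are illustrative choices, not the paper's configuration.

```python
# Minimal sketch of DiffUse-style instance selection (illustrative, not the
# authors' reference implementation). Assumes both candidate models' outputs
# have already been generated for the same inputs.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def select_instances(outputs_a, outputs_b, n_annotations=10):
    """Pick the instances whose output differences are most representative."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
    emb_a = encoder.encode(outputs_a)
    emb_b = encoder.encode(outputs_b)
    diff = emb_a - emb_b  # vectors capturing semantic differences

    # Cluster the difference vectors; one annotation per cluster.
    km = KMeans(n_clusters=n_annotations, n_init="auto").fit(diff)
    selected = []
    for c in range(n_annotations):
        members = np.where(km.labels_ == c)[0]
        # Representative = the member closest to the cluster centroid.
        dists = np.linalg.norm(diff[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[dists.argmin()]))
    return selected  # indices to send for preference annotation
```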

Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

arXiv (Cornell University), 2024

Efficient Benchmarking (of Language Models)

arXiv (Cornell University), Aug 21, 2023

The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers

arXiv (Cornell University), May 2, 2023

Applying language models to natural language processing tasks typically relies on the representations in the final model layer, as intermediate hidden layer representations are presumed to be less informative. In this work, we argue that due to the gradual improvement across model layers, additional information can be gleaned from the contrast between higher and lower layers during inference. Specifically, in choosing between the probable next token predictions of a generative model, the predictions of lower layers can be used to highlight which candidates are best avoided. We propose a novel approach that utilizes the contrast between layers to improve text generation outputs, and show that it mitigates degenerative behaviors of the model in open-ended generation, significantly improving the quality of generated texts. Furthermore, our results indicate that contrasting between model layers at inference time can yield substantial benefits to certain aspects of general language model capabilities, more effectively extracting knowledge during inference from a given set of model parameters.
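
In spirit, the decoding rule contrasts the next-token distributions read off a high layer and a low layer, penalizing candidates the low layer already finds likely. The sketch below applies GPT-2's LM head to an intermediate hidden state to obtain the "low" distribution; the contrast layer and the weighting alpha are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of layer-contrastive next-token scoring with GPT-2.
# The choice of contrast layer (6) and the weight alpha are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def contrastive_next_token_logits(text, low_layer=6, alpha=0.5):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # "High" distribution: the model's usual final-layer prediction.
    high = out.logits[0, -1]
    # "Low" distribution: the same LM head applied to an intermediate layer.
    hidden = model.transformer.ln_f(out.hidden_states[low_layer][0, -1])
    low = model.lm_head(hidden)
    # Prefer tokens the final layer likes and the early layer does not.
    return torch.log_softmax(high, -1) - alpha * torch.log_softmax(low, -1)
```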

Active Learning for Natural Language Generation

arXiv (Cornell University), May 24, 2023

The field of Natural Language Generation (NLG) suffers from a severe shortage of labeled data due to the extremely expensive and time-consuming process involved in manual annotation. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. However, while AL has been well-researched in the context of text classification, its application to NLG remains largely unexplored. In this paper, we present a first systematic study of active learning for NLG, considering a diverse set of tasks and multiple leading selection strategies, and harnessing a strong instruction-tuned model. Our results indicate that the performance of existing AL strategies is inconsistent, surpassing the baseline of random example selection in some cases but not in others. We highlight some notable differences between the classification and generation scenarios, and analyze the selection behaviors of existing AL strategies. Our findings motivate exploring novel approaches for applying AL to generation tasks.
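
Schematically, the setting studied is the standard pool-based AL loop, with the selection strategy swapped in and out. The sketch below is a generic version of that loop; `strategy`, `annotate` and `fine_tune` are hypothetical placeholders rather than anything from the paper.

```python
# Schematic pool-based active-learning loop for NLG.
# All callables here are hypothetical placeholders.
import random

def active_learning_loop(pool, annotate, fine_tune, strategy=None,
                         rounds=5, batch_size=50):
    """pool: unlabeled inputs; annotate: fn(x) -> reference output;
    fine_tune: fn(labeled pairs) -> model; strategy: fn(model, pool) -> scores."""
    labeled, model = [], None
    for _ in range(rounds):
        if strategy is None or model is None:
            batch = random.sample(pool, batch_size)  # random-selection baseline
        else:
            scores = strategy(model, pool)  # e.g., generation uncertainty
            ranked = sorted(zip(scores, pool), key=lambda t: t[0], reverse=True)
            batch = [x for _, x in ranked[:batch_size]]
        for x in batch:
            labeled.append((x, annotate(x)))
            pool.remove(x)
        model = fine_tune(labeled)
    return model
```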

The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Zero-Shot Text Classification with Self-Training

Cornell University - arXiv, Oct 31, 2022

Recent advances in large pretrained language models have increased attention to zero-shot text classification. In particular, models fine-tuned on natural language inference datasets have been widely adopted as zero-shot classifiers due to their promising results and off-the-shelf availability. However, the fact that such models are unfamiliar with the target task can lead to instability and performance issues. We propose a plug-and-play method to bridge this gap using a simple self-training approach, requiring only the class names along with an unlabeled dataset, and without the need for domain expertise or trial and error. We show that fine-tuning the zero-shot classifier on its most confident predictions leads to significant performance gains across a wide range of text classification tasks, presumably since self-training adapts the zero-shot model to the task at hand.
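
A minimal version of the recipe: run an off-the-shelf NLI-based zero-shot classifier over the unlabeled set, keep only its most confident predictions as pseudo-labels, and fine-tune the same model on them. The sketch below uses the Hugging Face zero-shot-classification pipeline; the model choice and the 0.9 confidence threshold are illustrative assumptions.

```python
# Sketch of self-training a zero-shot classifier on its own confident
# predictions. The confidence threshold is an illustrative choice.
from transformers import pipeline

def pseudo_label(texts, class_names, threshold=0.9):
    clf = pipeline("zero-shot-classification",
                   model="facebook/bart-large-mnli")
    pseudo = []
    for text in texts:
        result = clf(text, candidate_labels=class_names)
        top_label, top_score = result["labels"][0], result["scores"][0]
        if top_score >= threshold:  # keep only the most confident predictions
            pseudo.append((text, top_label))
    return pseudo  # fine-tune the same model on these (text, label) pairs

# pairs = pseudo_label(unlabeled_texts, ["sports", "politics", "tech"])
```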

Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

Cornell University - arXiv, Aug 2, 2022

Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) being a no-code system, making NLP accessible to non-experts, (b) guiding users through the entire labeling process until they obtain a custom classifier, making the process efficient: from cold start to classifier in a few hours, and (c) being open for configuration and extension by developers. By open sourcing Label Sleuth we hope to build a community of users and developers that will broaden the utilization of NLP models.

Cluster & Tune: Boost Cold Start Performance in Text Classification

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone to produce poor performance. We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task, between the pre-training and fine-tuning phases. As such an intermediate task, we perform clustering and train the pre-trained model on predicting the cluster labels. We test this hypothesis on various data sets, and show that this additional classification phase can significantly improve performance, mainly for topical classification tasks, when the number of labeled instances available for fine-tuning is only a couple of dozen to a few hundred.
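
The intermediate task is easy to reproduce in outline: cluster the unlabeled training texts and fine-tune the pre-trained model to predict the cluster IDs before fine-tuning on the real labels. In the sketch below, TF-IDF plus k-means stands in for the clustering step; treat these as illustrative substitutes rather than the paper's exact choices.

```python
# Sketch of the intermediate clustering task: pseudo-labels produced by
# unsupervised clustering, used for a fine-tuning round before the target
# task. The vectorizer and clusterer here are illustrative substitutes.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_pseudo_labels(unlabeled_texts, n_clusters=50):
    """Return (text, cluster_id) pairs to train on as an ordinary
    classification task, before fine-tuning on the real labels."""
    vecs = TfidfVectorizer(max_features=10000).fit_transform(unlabeled_texts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(vecs)
    return list(zip(unlabeled_texts, labels))
```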

Cluster & Tune: Enhance BERT Performance in Low Resource Text Classification

Controversy in Context

ArXiv, 2019

With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia's metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and, using simple and language-independent machine-learning tools, we leverage this observation to achieve state-of-the-art results in controversiality prediction. In addition, we analyze and make available a new dataset of concepts labeled for controversiality. It is significantly larger than existing datasets, and grades concepts on a 0-10 scale, rather than treating controversiality as a binary label.
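
As a purely hypothetical illustration of "simple and language-independent machine-learning tools" applied to a concept's immediate textual context, one could featurize the sentences surrounding each concept's mentions and fit a regressor to the 0-10 scores. Nothing below is taken from the paper; it is an assumed setup for intuition only.

```python
# Hypothetical sketch: predict concept controversiality from the words that
# surround its mentions (an assumed setup, not the paper's pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def fit_controversiality(contexts_per_concept, scores):
    """contexts_per_concept: one string per concept, the concatenated
    sentences mentioning it; scores: its 0-10 controversiality labels."""
    vec = TfidfVectorizer(max_features=20000)
    X = vec.fit_transform(contexts_per_concept)
    model = Ridge().fit(X, scores)
    return vec, model
```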

An autonomous debating system

Active Learning for BERT: An Empirical Study

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Corpus Wide Argument Mining—A Working Solution

Proceedings of the AAAI Conference on Artificial Intelligence

One of the main tasks in argument mining is the retrieval of argumentative content pertaining to a given topic. Most previous work addressed this task by retrieving a relatively small number of relevant documents as the initial source for such content. This line of research yielded moderate success, which is of limited use in a real-world system. Furthermore, for such a system to yield a comprehensive set of relevant arguments, over a wide range of topics, it requires leveraging a large and diverse corpus in an appropriate manner. Here we present a first end-to-end high-precision, corpus-wide argument mining system. This is made possible by combining sentence-level queries over an appropriate indexing of a very large corpus of newspaper articles, with an iterative annotation scheme. This scheme addresses the inherent label bias in the data and pinpoints the regions of the sample space whose manual labeling is required to obtain high precision among top-ranked candidates.
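
The retrieval side works at the sentence level rather than the document level: a sentence-indexed corpus is queried for sentences that both mention the topic and look argument-bearing, and the resulting candidates are then ranked. The toy filter below illustrates the flavor of such a sentence-level query; the markers are invented for illustration and are not the system's actual queries.

```python
# Toy sentence-level query: keep sentences that mention the topic and
# contain a claim-like marker (invented pattern, not the system's query).
import re

CLAIM_MARKERS = re.compile(r"\b(that|because|therefore|clearly)\b", re.I)

def candidate_sentences(sentences, topic):
    topic_re = re.compile(re.escape(topic), re.I)
    return [s for s in sentences
            if topic_re.search(s) and CLAIM_MARKERS.search(s)]
```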

Financial Event Extraction Using Wikipedia-Based Weak Supervision

Proceedings of the Second Workshop on Economics and Natural Language Processing

Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging relevant Wikipedia sections to extract weak labels for sentences describing economic events. Whereas previous weakly supervised approaches required a knowledge base of such events, or corresponding financial figures, our approach requires no such additional data, and can be employed to extract economic events related to companies which are not even mentioned in the training data.
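
The weak-labeling idea can be stated almost directly as code: sentences drawn from Wikipedia sections whose titles indicate an economic event serve as weak positives for that event type. The section-to-label mapping below is a hypothetical example of such a scheme, not the paper's actual mapping.

```python
# Sketch of Wikipedia-based weak supervision: section titles supply weak
# event labels for their sentences. The mapping itself is a hypothetical
# example, not the paper's actual mapping.
SECTION_TO_EVENT = {
    "Acquisitions": "acquisition",
    "Bankruptcy": "bankruptcy",
    "Initial public offering": "ipo",
}

def weak_labels(sections):
    """sections: list of (section_title, [sentences]) from company articles.
    Returns (sentence, event_label) pairs; unmatched sections yield 'none'."""
    labeled = []
    for title, sentences in sections:
        label = SECTION_TO_EVENT.get(title, "none")
        labeled.extend((s, label) for s in sentences)
    return labeled
```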

Bumble Bee Workers Give Up Sleep to Care for Offspring that Are Not Their Own
