Jason Kessler | Indiana University

Papers by Jason Kessler

Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ

Scattertext is an open source tool for visualizing linguistic variation between document categories in a language-independent way. The tool presents a scatterplot, where each axis corresponds to the rank-frequency a term occurs in a category of documents. Through a tie-breaking strategy, the tool is able to display thousands of visible term-representing points and find space to legibly label hundreds of them. Scattertext also lends itself to a query-based visualization of how the use of terms with similar embeddings differs between document categories, as well as a visualization for comparing the importance scores of bag-of-words features to univariate metrics.
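To make the rank-frequency axes described in this abstract concrete, here is a minimal sketch in plain Python. It is not Scattertext's API or implementation: the function names `dense_rank_scale` and `rank_frequency_coords`, the whitespace tokenization, and the use of dense ranks scaled to [0, 1] are all assumptions made for illustration. In particular, it does not implement the tie-breaking strategy the abstract mentions, so terms with equal counts in a category share the same position on that axis.

```python
from collections import Counter


def dense_rank_scale(counts):
    """Map each term's count to its dense rank, scaled to the range [0, 1].

    Terms with equal counts receive the same value (no tie-breaking).
    """
    distinct = sorted(set(counts.values()))
    rank_of = {count: i for i, count in enumerate(distinct)}
    top = max(len(distinct) - 1, 1)  # avoid division by zero for a single rank
    return {term: rank_of[count] / top for term, count in counts.items()}


def rank_frequency_coords(docs_a, docs_b):
    """Return {term: (x, y)}: x is the scaled rank in category A, y in category B."""
    # Hypothetical tokenization: lowercase and split on whitespace.
    count_a = Counter(tok for doc in docs_a for tok in doc.lower().split())
    count_b = Counter(tok for doc in docs_b for tok in doc.lower().split())
    vocab = set(count_a) | set(count_b)
    # Terms absent from a category get a count of zero there.
    xs = dense_rank_scale({t: count_a.get(t, 0) for t in vocab})
    ys = dense_rank_scale({t: count_b.get(t, 0) for t in vocab})
    return {t: (xs[t], ys[t]) for t in vocab}


if __name__ == "__main__":
    coords = rank_frequency_coords(
        ["tax cut tax", "jobs"],
        ["health care", "jobs jobs tax"],
    )
    for term, (x, y) in sorted(coords.items()):
        print(f"{term}: x={x:.2f}, y={y:.2f}")
```

Under this sketch, a term far from the diagonal is relatively more frequent in one category than in the other, which matches the kind of contrast the scatterplot is described as showing.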

The JDPA Sentiment Corpus for the Automotive Domain

This chapter presents a rich annotation scheme for mentions, co-reference, meronymy, sentiment expressions, modifiers of sentiment expressions including neutralizers, negators, and intensifiers, and describes a large corpus annotated with this scheme. We define the various annotation types, provide examples, and show statistics on occurrence and inter-annotator agreement. This resource is the largest sentiment-topical corpus to date and is publicly available. It helps quantify sentiment phenomena, and allows for the construction of advanced sentiment systems and enables direct comparison of different algorithms.

Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ

Proceedings of ACL 2017, System Demonstrations

Scattertext is an open source tool for visualizing linguistic variation between document categories in a language-independent way. The tool presents a scatterplot, where each axis corresponds to the rank-frequency a term occurs in a category of documents. Through a tie-breaking strategy, the tool is able to display thousands of visible term-representing points and find space to legibly label hundreds of them. Scattertext also lends itself to a query-based visualization of how the use of terms with similar embeddings differs between document categories, as well as a visualization for comparing the importance scores of bag-of-words features to univariate metrics.

Polling the Blogosphere: A Rule-Based Approach to Belief Classification

Polling the blogosphere: a rule-based approach to belief classification

International Conference on Weblogs and Social Media, 2008

The research described here is part of a larger project with the objective of determining if a writer believes a proposition to be true or false. This task requires a deep understanding of a proposition's semantic context, which is far beyond NLP's state of the art. In light of this difficulty, this paper presents a shallow semantic framework that addresses the sub-problem of finding a proposition's truth-value at the sentence level. The framework consists of several classes of linguistic elements that, when linked to a proposition through ...

Targeting sentiment expressions through supervised ranking of linguistic configurations

Proceedings of the Third International AAAI Conference on Weblogs and Social Media, May 1, 2009

User generated content is extremely valuable for mining market intelligence because it is unsolicited. We study the problem of analyzing users' sentiment and opinion in their blog, message board, etc. posts with respect to topics expressed as a search query. In the scenario we consider, the matches of the search query terms are expanded through coreference and meronymy to produce a set of mentions. The mentions are contextually evaluated for sentiment and their scores are aggregated (using a data structure we ...

Targeting Sentiment Expressions through Supervised Ranking of Linguistic Configurations

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

ArXiv, 2021

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on tasks involving data from domains different from that on which they were pretrained can lead to suboptimal performance. Recent work has explored approaches to adapt pretrained language models to new domains by incorporating additional pretraining using domain-specific corpora and task data. We propose an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers. We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora. In datasets from four disparate domains, we find adaptive tokenization on a pretrained RoBERTa model provides >97% of the performance benefits of domain sp...

The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain

This paper presents a rich annotation scheme for mentions, co-reference, meronymy, sentiment expressions, modifiers of sentiment expressions including neutralizers, negators, and intensifiers, and describes a large corpus annotated with this scheme. We describe how this corpus relates to recent, state-of-the-art work in sentiment analysis, and define the various annotation types, provide examples, and show statistics on occurrence and inter-annotator agreement. This resource is the largest sentiment-topical corpus to date and is publicly available. It helps quantify sentiment phenomena, and allows for the construction of advanced sentiment systems and enables direct comparison of different algorithms.

System and Method to Acquire Paraphrases

Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ

Proceedings of ACL 2017, System Demonstrations, 2017

Scattertext is an open source tool for visualizing linguistic variation between document categories in a language-independent way. The tool presents a scatterplot, where each axis corresponds to the rank-frequency a term occurs in a category of documents. Through a tie-breaking strategy, the tool is able to display thousands of visible term-representing points and find space to legibly label hundreds of them. Scattertext also lends itself to a query-based visualization of how the use of terms with similar embeddings differs between document categories, as well as a visualization for comparing the importance scores of bag-of-words features to univariate metrics.

OpinionFinder

Proceedings of HLT/EMNLP on Interactive Demonstrations, 2005

The JDPA Sentiment Corpus for the Automotive Domain

An updated, expanded description of the JDPA Sentiment Corpus, which will appear in the upcoming Handbook of Linguistic Annotation.

OpinionFinder: A system for subjectivity analysis

OpinionFinder is a system that performs subjectivity analysis, automatically identifying when opinions, sentiments, speculations, and other private states are present in text. Specifically, OpinionFinder aims to identify subjective sentences and to mark various aspects of the subjectivity in these sentences, including the source (holder) of the subjectivity and words that are included in phrases expressing positive or negative sentiments.

Polling the Blogosphere: a rule-based approach to belief classification

The research described here is part of a larger project with the objective of determining if a writer believes a proposition to be true or false. This task requires a deep understanding of a proposition’s semantic context, which is far beyond NLP’s state of the art. In light of this difficulty, this paper presents a shallow semantic framework that addresses the sub-problem of finding a proposition’s truth-value at the sentence level. The framework consists of several classes of linguistic elements that, when linked to a proposition through specific lexico-syntactic connectors, change its truth-value. A pilot evaluation of a system implementing this framework yields promising results.

Targeting Sentiment Expressions through Supervised Ranking of Linguistic Configurations

User generated content is extremely valuable for mining market intelligence because it is unsolicited. We study the problem of analyzing users’ sentiment and opinion in their blog, message board, etc. posts with respect to topics expressed as a search query. In the scenario we consider, the matches of the search query terms are expanded through coreference and meronymy to produce a set of mentions. The mentions are contextually evaluated for sentiment and their scores are aggregated (using a data structure we introduce called the sentiment propagation graph) to produce an aggregate score for the input entity. An extremely crucial part in the contextual evaluation of individual mentions is finding which sentiment expressions are semantically related to (target) which mentions; this is the focus of our paper. We present an approach where potential target mentions for a sentiment expression are ranked using supervised machine learning (Support Vector Machines) where the main features are the syntactic configurations (typed dependency paths) connecting the sentiment expression and the mention. We have created a large English corpus of product discussion blogs annotated with semantic types of mentions, coreference, meronymy and sentiment targets. The corpus proves that coreference and meronymy are not marginal phenomena but are really central to determining the overall sentiment for the top-level entity. We evaluate a number of techniques for sentiment targeting and present results which we believe push the current state-of-the-art.

