Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen (original) (raw)

Achieving precise labeling of emotions and sentiments, as well as detecting irony, hatefulness, and offensiveness, remains a challenge, requiring further testing and refinement. We tested 10 large language models across five sentiment tasks: emotion, hatefulness, irony, offensiveness, and sentiment. We ranked them by average accuracy across all five.

The results highlight clear distinctions between the tools:

Experimental results: sentiment analysis benchmark

Loading Chart

Ranking: Tools are ranked according to their average accuracy rates aggregated across all tested categories: emotion, hatefulness, irony, offensiveness, and sentiment.

For further details, read the methodology of our benchmark.

Overall accuracy

Combining all tasks, the models’ total accuracy scores provide a holistic view of their capabilities:

1. Emotion detection

Emotion detection is a challenging task in sentiment analysis, often requiring models to discern subtle cues in language. Here’s how the models performed:

Emotion detection had a wide spread: 14 points between the top and bottom models. This makes it one of the two tasks that most clearly separate models.

2. Hatefulness detection

Detecting hateful content is crucial for Twitter sentiment classification and other moderation tasks. The results revealed notable differences:

Hatefulness had the widest spread of any task: 17 points. If moderation is your use case, pick from the top of this column rather than from the average ranking.

3. Irony detection

Irony detection is an area where semantic evaluation plays a pivotal role. Both models delivered high sentiment analysis benchmark performance, but GPT-4o emerged as a clear leader:

This was the easiest task in the set. Even the lowest score was 82%. For work that depends on catching irony or sarcasm, any of these models is a safe starting point.

4. Offensiveness detection

Detecting offensive content is critical for maintaining healthy online communities. The models’ sentiment analysis benchmark performances in this task were as follows:

No model reached 76% on the offensiveness metric. The whole field ranged from 65% to 75%. Context drives this task, and the dataset’s borderline cases trip up every model.

5. Sentiment analysis

The overarching sentiment analysis task focused on classifying data into positive, negative, and neutral sentiments. Accuracy scores for this task varied significantly between the models:

The full range was 3 points, from 72% to 75%. No model handled three-way sentiment well. If the project needs reliable positive, negative, and neutral labels, none of these models is ready to run without a human check.

Observations and insights

Tasks are not equally hard

Irony was easy for every model (82% to 91%). Sentiment and offensiveness were hard for every model, with all scores between 65% and 75%. Pick a model for the task you actually have, not for its average rank.

Emotion and hatefulness separate models best

These two tasks had the widest score gaps: 14 and 17 points. If your use case is emotion tracking or moderation, the choice of model matters more here than anywhere else.

A high average can hide a weak task

GPT 5.5 ranked first overall and remained strong across the board. But Claude Opus 4.8 ranked eighth overall, scoring 86% on irony. Read the column for your task, not the average.

Benchmark dataset and methodology

Analysis dataset

We used the TweetEval dataset, built for sentiment analysis on real Twitter messages.1 It is part of the Association for Computational Linguistics (ACL) work on semantic evaluation. The dataset ships with pre-labeled training and test sets across five task types:

These tasks align with real-world machine-learning approaches, making them ideal for evaluating the experimental results of the two models.

Models tested

We tested 10 large language models, all through the OpenRouter API so the setup was the same for each:

GPT 5.5, ChatGPT 5.4 mini, Claude Sonnet 4.6, Claude Opus 4.8, Gemini 3.1-pro, Gemini 3.5 Flash, Qwen 3.6 Plus, Kimi k2.6, GLM 5.1, and Minimax M2.7.

Experimental setup

We kept every setting the same across all 10 models.

Sample

We used the first 200 tweets of each task’s official test set, with the dataset’s own gold labels. The same 200 tweets went to every model, so the comparison is like-for-like.

Prompting

We used zero-shot prompts: a plain task instruction and the raw tweet, with no worked examples. The model returned one label and nothing else.

We wrote the prompts so they gave nothing away. We did not name the benchmark, call the model an “annotator,” or hint that it was being graded. Naming the test can change how a model answers, so we left it out. The emotion prompt, for example, asked the model to pick one of anger, joy, optimism, or sadness and reply with hat word.

Generation settings

We set temperature to 0, which makes the output as steady as the model allows. We set the token limit to 4,096. The high limit matters for reasoning models: with a small limit they spend the whole budget on hidden reasoning and return a blank answer. The extra room lets them finish reasoning and still print the label. Models that do not reason answer in one short word, so the limit costs nothing there.

Reading the answers

We mapped each reply to a label in steps: first an exact match, then a short list of synonyms (for example, “happy” maps to joy), then a search for any label inside a longer reply. Replies we could not read were counted as wrong.

Metric

The score for each task is not raw accuracy. We used the metric that the TweetEval authors set for each task:

Macro-F1 and macro-recall both weight each class the same, no matter how often it appears. This is the right choice here because classes like hate or irony are rare, and plain accuracy would let a model look good by always picking the common label. The average column is the mean of these five scores.

Reliability

A few models hit rate limits during the run and dropped some calls. We re-ran the failed rows at low speed to avoid the limits and repeated this until nothing failed. The final results have no failed calls and no unreadable replies.

Setup limitations

We used a 200-tweet slice of each test set, not the full set, so these numbers do not line up with the published TweetEval leaderboard. The comparison across our 10 models still holds, because every model saw the same tweets.

The 200-tweet slice is fixed, not random, so it is reproducible but not a random sample. Each task also used a single prompt at temperature 0. A different prompt, or few-shot examples, would shift the absolute numbers.

We used datasets with public gold labels. This carries a risk of contamination, where a model has seen the labels during training. We cannot rule it out, but the scores were well short of perfect, which suggests it was not a major factor. For the next version, we plan to test tweets whose labels have not been published.

Because the sample is 200 tweets per task, small gaps carry sampling noise. We treat a one- to two-point difference as a tie rather than a ranking.

Which model to pick

The full scores are in the table above. This section is shorter: it maps common needs to the model that fits.

A reminder that runs through all of this: read the column for your task, not the average. A model can rank mid-table overall and still lead the one task you care about.

Don’t miss our benchmarks and data-driven insights. The button opens Google; selecting AIMultiple confirms that you wish to see AIMultiple more often in Google search results.

GoogleAdd as preferred source

Further reading

Cite this benchmark

Pick the format that matches where you're publishing. Pasting the link version into your CMS preserves the backlink.

Ezgi Arslan, PhD. (2026) - "Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen". Published online at AIMultiple.com. Retrieved June 15, 2026, from: https://aimultiple.com/sentiment-analysis-benchmark [Online Resource]

PhD., E. A. (2026, June 15). Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen. AIMultiple. https://aimultiple.com/sentiment-analysis-benchmark

@misc{phd2026, author = {PhD., Ezgi Arslan,}, title = {{Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen}}, year = {2026}, month = jun, howpublished = {\url{https://aimultiple.com/sentiment-analysis-benchmark}}, note = {AIMultiple. Retrieved June 15, 2026} }

Ezgi Arslan, PhD.

Ezgi Arslan, PhD.

Industry Analyst

Ezgi holds a PhD in Business Administration with a specialization in finance and serves as an Industry Analyst at AIMultiple. She drives research and insights at the intersection of technology and business, with expertise spanning sustainability, survey and sentiment analysis, AI agent applications in finance, answer engine optimization, firewall management, and procurement technologies.

View Full Profile