Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen (original) (raw)

Achieving precise labeling of emotions and sentiments, as well as detecting irony, hatefulness, and offensiveness, remains a challenge, requiring further testing and refinement. We tested 10 large language models across five sentiment tasks: emotion, hatefulness, irony, offensiveness, and sentiment. We ranked them by average accuracy across all five.

The results highlight clear distinctions between the tools:

GPT 5.5 achieved the best overall accuracy (80%),
Minimax M2.7 (72%) recorded the lowest overall performance.

Experimental results: sentiment analysis benchmark

Loading Chart

Ranking: Tools are ranked according to their average accuracy rates aggregated across all tested categories: emotion, hatefulness, irony, offensiveness, and sentiment.

For further details, read the methodology of our benchmark.

Overall accuracy

Combining all tasks, the models’ total accuracy scores provide a holistic view of their capabilities:

GPT 5.5 ranked first at 80%. It never dropped below 73% in any task, which made it the most consistent model in the test.
Claude Sonnet 4.6 came second at 79%. It scored the single highest result in the benchmark: 82% on hatefulness.
Qwen 3.6 Plus and ChatGPT 5.4 mini tied for third at 78%. ChatGPT 5.4 mini is the smallest model near the top, yet it led offensiveness detection and tied for first in irony.
Kimi k2.6 scored 77%, with steady results and no clear weak task.
Gemini 3.1-pro and GLM 5.1 tied at 76%. Gemini 3.1-pro tied for first in emotion detection but ranked low in hatefulness.
Claude Opus 4.8 scored 74%. It was held back by emotion detection (68%), its weakest category.
Gemini 3.5 Flash scored 73%. Its hatefulness result (65%) was the lowest in that task.
Minimax M2.7 ranked last at 72%. It scored lowest in emotion, irony, and offensiveness.

1. Emotion detection

Emotion detection is a challenging task in sentiment analysis, often requiring models to discern subtle cues in language. Here’s how the models performed:

GPT 5.5 and Gemini 3.1-pro tied for first at 80%.
Qwen 3.6 Plus followed at 79%.
Kimi k2.6 scored 78%, and GLM 5.1 scored 77%.
ChatGPT 5.4 mini reached 76%, and Claude Sonnet 4.6 reached 75%.
Gemini 3.5 Flash scored 73%.
Claude Opus 4.8 scored 68%.
Minimax M2.7 scored lowest at 66%.

Emotion detection had a wide spread: 14 points between the top and bottom models. This makes it one of the two tasks that most clearly separate models.

2. Hatefulness detection

Detecting hateful content is crucial for Twitter sentiment classification and other moderation tasks. The results revealed notable differences:

Claude Sonnet 4.6 led at 82%, the highest single score in the benchmark.
GPT 5.5 followed closely at 80%.
Qwen 3.6 Plus scored 77%.
Kimi k2.6 and GLM 5.1 both scored 76%.
Minimax M2.7 scored 75%.
ChatGPT 5.4 mini scored 72%.
Gemini 3.1-pro and Claude Opus 4.8 both scored 71%.
Gemini 3.5 Flash scored lowest at 65%.

Hatefulness had the widest spread of any task: 17 points. If moderation is your use case, pick from the top of this column rather than from the average ranking.

3. Irony detection

Irony detection is an area where semantic evaluation plays a pivotal role. Both models delivered high sentiment analysis benchmark performance, but GPT-4o emerged as a clear leader:

GPT 5.5, Claude Sonnet 4.6, Qwen 3.6 Plus, and ChatGPT 5.4 mini tied for first at 91%.
Gemini 3.1-pro, GLM 5.1, and Gemini 3.5 Flash each scored 87%.
Claude Opus 4.8 scored 86%, and Kimi k2.6 scored 85%.
Minimax M2.7 scored lowest at 82%.

This was the easiest task in the set. Even the lowest score was 82%. For work that depends on catching irony or sarcasm, any of these models is a safe starting point.

4. Offensiveness detection

Detecting offensive content is critical for maintaining healthy online communities. The models’ sentiment analysis benchmark performances in this task were as follows:

ChatGPT 5.4 mini led at 75%.
GPT 5.5 scored 73%, and Claude Sonnet 4.6 scored 72%. Claude Opus 4.8 scored 70%.
Qwen 3.6 Plus, Kimi k2.6, Gemini 3.1-pro, and GLM 5.1 all scored 69%.
Gemini 3.5 Flash scored 68%.
Minimax M2.7 scored lowest at 65%.

No model reached 76% on the offensiveness metric. The whole field ranged from 65% to 75%. Context drives this task, and the dataset’s borderline cases trip up every model.

5. Sentiment analysis

The overarching sentiment analysis task focused on classifying data into positive, negative, and neutral sentiments. Accuracy scores for this task varied significantly between the models:

GPT 5.5, Qwen 3.6 Plus, ChatGPT 5.4 mini, and Gemini 3.1-pro tied for first at 75%.
Kimi k2.6, Claude Opus 4.8, Gemini 3.5 Flash, and Minimax M2.7 all scored 74%.
Claude Sonnet 4.6 scored 73%.
GLM 5.1 scored lowest at 72%.

The full range was 3 points, from 72% to 75%. No model handled three-way sentiment well. If the project needs reliable positive, negative, and neutral labels, none of these models is ready to run without a human check.

Observations and insights

Tasks are not equally hard

Irony was easy for every model (82% to 91%). Sentiment and offensiveness were hard for every model, with all scores between 65% and 75%. Pick a model for the task you actually have, not for its average rank.

Emotion and hatefulness separate models best

These two tasks had the widest score gaps: 14 and 17 points. If your use case is emotion tracking or moderation, the choice of model matters more here than anywhere else.

A high average can hide a weak task

GPT 5.5 ranked first overall and remained strong across the board. But Claude Opus 4.8 ranked eighth overall, scoring 86% on irony. Read the column for your task, not the average.

Benchmark dataset and methodology

Analysis dataset

We used the TweetEval dataset, built for sentiment analysis on real Twitter messages.1 It is part of the Association for Computational Linguistics (ACL) work on semantic evaluation. The dataset ships with pre-labeled training and test sets across five task types:

Emotion detection: naming the feeling in a tweet, such as anger, joy, optimism, or sadness. Example tweet and label: “#Deppression is real. Partners w/ #depressed people truly dont understand the depth in which they affect us. Add in #anxiety &makes it worse” is labeled sad.2
Hatefulness detection: flagging hate speech in a tweet. Example tweet and label: “Trump wants to deport illegal aliens with ‘no judges or court cases’ #MeToo I am solidly behind this action The thought of someone illegally entering a country & showing no respect for its laws, should be protected by same laws is ludacris! #DeportThemAll” is labeled hateful.3
Irony detection: spotting ironic intent. Example tweet and label: “People who tell people with anxiety to ‘just stop worrying about it’ are my favorite kind of people #not #educateyourself” is labeled irony.4
Offensiveness detection: classifying tweets with offensive language. Example tweet and label: “#ConstitutionDay It’s very odd for the alt right conservatives to say that we are ruining the constitution because we want #GunControlNow but they are the ones ruining the constitution getting upset because foreigners are coming to this land who are not White wanting to live” is labeled offensive.5
Sentiment classification: assigning a positive, negative, or neutral label. Example tweet and label: “Can’t wait to try this – Google Earth VR – this stuff really is the future of exploration….” is labeled positive.6

These tasks align with real-world machine-learning approaches, making them ideal for evaluating the experimental results of the two models.

Models tested

We tested 10 large language models, all through the OpenRouter API so the setup was the same for each:

GPT 5.5, ChatGPT 5.4 mini, Claude Sonnet 4.6, Claude Opus 4.8, Gemini 3.1-pro, Gemini 3.5 Flash, Qwen 3.6 Plus, Kimi k2.6, GLM 5.1, and Minimax M2.7.

Experimental setup

We kept every setting the same across all 10 models.

Sample

We used the first 200 tweets of each task’s official test set, with the dataset’s own gold labels. The same 200 tweets went to every model, so the comparison is like-for-like.

Prompting

We used zero-shot prompts: a plain task instruction and the raw tweet, with no worked examples. The model returned one label and nothing else.

We wrote the prompts so they gave nothing away. We did not name the benchmark, call the model an “annotator,” or hint that it was being graded. Naming the test can change how a model answers, so we left it out. The emotion prompt, for example, asked the model to pick one of anger, joy, optimism, or sadness and reply with hat word.

Generation settings

We set temperature to 0, which makes the output as steady as the model allows. We set the token limit to 4,096. The high limit matters for reasoning models: with a small limit they spend the whole budget on hidden reasoning and return a blank answer. The extra room lets them finish reasoning and still print the label. Models that do not reason answer in one short word, so the limit costs nothing there.

Reading the answers

We mapped each reply to a label in steps: first an exact match, then a short list of synonyms (for example, “happy” maps to joy), then a search for any label inside a longer reply. Replies we could not read were counted as wrong.

Metric

The score for each task is not raw accuracy. We used the metric that the TweetEval authors set for each task:

Emotion: macro-F1
Sentiment: macro-recall
Hatefulness: macro-F1
Irony: F1 of the irony class
Offensiveness: macro-F1

Macro-F1 and macro-recall both weight each class the same, no matter how often it appears. This is the right choice here because classes like hate or irony are rare, and plain accuracy would let a model look good by always picking the common label. The average column is the mean of these five scores.

Reliability

A few models hit rate limits during the run and dropped some calls. We re-ran the failed rows at low speed to avoid the limits and repeated this until nothing failed. The final results have no failed calls and no unreadable replies.

Setup limitations

We used a 200-tweet slice of each test set, not the full set, so these numbers do not line up with the published TweetEval leaderboard. The comparison across our 10 models still holds, because every model saw the same tweets.

The 200-tweet slice is fixed, not random, so it is reproducible but not a random sample. Each task also used a single prompt at temperature 0. A different prompt, or few-shot examples, would shift the absolute numbers.

We used datasets with public gold labels. This carries a risk of contamination, where a model has seen the labels during training. We cannot rule it out, but the scores were well short of perfect, which suggests it was not a major factor. For the next version, we plan to test tweets whose labels have not been published.

Because the sample is 200 tweets per task, small gaps carry sampling noise. We treat a one- to two-point difference as a tie rather than a ranking.

Which model to pick

The full scores are in the table above. This section is shorter: it maps common needs to the model that fits.

Best all-round choice: GPT 5.5. It ranked first and stayed strong on every task, so it is the safe default when your work mixes several sentiment jobs.
Content moderation and hate speech: Claude Sonnet 4.6. It scored highest of any model on hatefulness. GPT 5.5 is a close second.
Offensive-language detection on a budget: ChatGPT 5.4 mini. It led offensiveness and matched the top irony scores, which is rare for a smaller, cheaper model.
Emotion and sentiment tracking: Gemini 3.1-pro or Qwen 3.6 Plus. Both sit at the top of these two columns. Use them for mood and opinion work rather than moderation.
Irony and sarcasm: almost any model here. Scores ran from 82% to 91%, so this task rarely drives the choice. Pick the cheapest model that meets your other needs.
Steady, general-purpose use: Kimi k2.6. No standout task, but no weak one either.
Use with care for high-stakes work: Gemini 3.5 Flash and Minimax M2.7 ranked at the bottom. Gemini 3.5 Flash was weakest on hate speech, so avoid it for moderation in particular.

A reminder that runs through all of this: read the column for your task, not the average. A model can rank mid-table overall and still lead the one task you care about.

Don’t miss our benchmarks and data-driven insights. The button opens Google; selecting AIMultiple confirms that you wish to see AIMultiple more often in Google search results.

Add as preferred source

Cite this benchmark

Pick the format that matches where you're publishing. Pasting the link version into your CMS preserves the backlink.

Ezgi Arslan, PhD. (2026) - "Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen". Published online at AIMultiple.com. Retrieved June 15, 2026, from: https://aimultiple.com/sentiment-analysis-benchmark [Online Resource]

PhD., E. A. (2026, June 15). Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen. AIMultiple. https://aimultiple.com/sentiment-analysis-benchmark

@misc{phd2026, author = {PhD., Ezgi Arslan,}, title = {{Sentiment Analysis Benchmark Testing: ChatGPT, Claude & Qwen}}, year = {2026}, month = jun, howpublished = {\url{https://aimultiple.com/sentiment-analysis-benchmark}}, note = {AIMultiple. Retrieved June 15, 2026} }

Ezgi Arslan, PhD.

Industry Analyst

Ezgi holds a PhD in Business Administration with a specialization in finance and serves as an Industry Analyst at AIMultiple. She drives research and insights at the intersection of technology and business, with expertise spanning sustainability, survey and sentiment analysis, AI agent applications in finance, answer engine optimization, firewall management, and procurement technologies.

View Full Profile