How to Use the Hugging Face Transformers Library for Sentiment Analysis

Last Updated : 23 Jul, 2025

The Hugging Face Transformers library is a popular choice for developers working on Natural Language Processing (NLP) projects. It simplifies access to pretrained models such as BERT, GPT, and RoBERTa, so developers can use advanced models without extensive deep learning expertise. The library supports tasks including text classification, translation, summarization, and question answering.

This article walks you through the essentials of using the Hugging Face Transformers library, from installation through working with pre-trained models.

Why Use Hugging Face Transformers?

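The Hugging Face library offers several benefits, including access to pre-trained models such as BERT, GPT, and RoBERTa, matching tokenizers, and ready-to-use tools for training and evaluation.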

Using the Hugging Face Library for Sentiment Analysis: Step-by-Step Guide

Step 1: Installing the Required Libraries

To begin, you need to install the necessary libraries:

pip install transformers datasets torch

These libraries provide tools to access pre-trained models (transformers), datasets (datasets), and the PyTorch framework (torch), which is required to run the models.
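Before moving on, it can help to confirm that the three packages are visible to the Python interpreter you plan to use. The snippet below is a minimal sketch that relies only on the standard library; the helper name `missing_packages` is made up for illustration and is not part of any Hugging Face API.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The three packages installed in this step
required = ["transformers", "datasets", "torch"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

`find_spec` only locates a package without importing it, so the check stays fast even for a large library such as PyTorch.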

Step 2: Loading the IMDb Dataset

We’ll use the IMDb dataset, a common benchmark for binary sentiment classification, where each review is classified as positive or negative.

```python
from datasets import load_dataset

# Load the IMDb dataset
dataset = load_dataset('imdb')
print(dataset)
```

**Output:**

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

The **load_dataset()** function allows you to load datasets directly from the Hugging Face Hub. Here, we are loading the IMDb dataset, which contains movie reviews labeled as either positive or negative.

Step 3: Loading a Pre-trained BERT Tokenizer

We will use a pre-trained BERT tokenizer to convert text into token IDs that the model can understand. BERT’s tokenizer splits text into subword tokens, so rare or unseen words can be represented as combinations of known pieces while the vocabulary stays at a fixed, manageable size.

```python
from transformers import AutoTokenizer

# Load the tokenizer for a pretrained BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
```
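To make subword splitting concrete, here is a toy sketch of the greedy longest-match-first strategy that WordPiece-style tokenizers use. It is not the library's implementation: the function `wordpiece_tokenize` and the tiny vocabulary are hypothetical, and the real tokenizer also handles lowercasing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary has roughly 30,000 entries.
toy_vocab = {"play", "##ing", "##ed", "un", "##afford", "##able", "movie"}

print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unaffordable", toy_vocab))  # ['un', '##afford', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

The second call shows the benefit: "unaffordable" is not in the vocabulary as a whole word, yet it can still be represented from three known pieces.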

Step 4: Tokenizing the Dataset

We must preprocess the dataset by applying the tokenizer to each example. The tokenizer converts each review text into tokens, ensuring it fits within the model's maximum input length.

```python
# Tokenizing function
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True, max_length=512)

# Apply tokenization to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
```

In this step, we create a **preprocess_function()** that tokenizes the input text, truncates it to at most 512 tokens, and pads shorter sequences so that the inputs processed together share a common length. We then map this function over the entire dataset in batches.
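The effect of truncation and padding can be sketched in plain Python. This is a simplified illustration rather than the tokenizer's actual code: the helper `truncate_and_pad` and the token IDs are made up, a pad ID of 0 is assumed, and the real tokenizer additionally keeps the closing special token when it truncates.

```python
def truncate_and_pad(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest sequence in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token ID sequences of different lengths
batch = [[11, 12, 13, 14, 15, 16, 17], [21, 22, 23]]
encoded = truncate_and_pad(batch, max_length=5)

print(encoded["input_ids"])       # [[11, 12, 13, 14, 15], [21, 22, 23, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside the token IDs.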

Step 5: Loading a Pre-trained BERT Model for Sequence Classification

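We’ll load a pre-trained BERT model (bert-base-uncased) with a classification head sized for our binary classification task (IMDb reviews are classified as either positive or negative). This head is newly initialized rather than pre-trained, which is why the model must be fine-tuned before its predictions are meaningful.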

```python
from transformers import AutoModelForSequenceClassification

# Load a pretrained BERT model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
```

Here, we specify that our model will have two output labels, corresponding to the binary classification task.
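With two output labels, the model produces two raw scores (logits) per review. The sketch below shows how such a pair could be turned into a label and a probability with a softmax. The logit values are invented for illustration, and the mapping assumes the IMDb convention of 0 for negative and 1 for positive.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label mapping for IMDb: 0 = negative, 1 = positive
id2label = {0: "negative", 1: "positive"}

def predict_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return id2label[best], probs[best]

# Hypothetical logits for a single review
label, prob = predict_label([-1.2, 2.3])
print(label, round(prob, 3))  # positive 0.971
```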

Step 6: Setting Up Training Arguments

To fine-tune our model, we need to specify training arguments. This includes setting batch sizes, learning rate, evaluation strategy, number of epochs, and where to store results.

```python
from transformers import TrainingArguments

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
```

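Here, we set a learning rate of 2e-5, a batch size of 8 per device for both training and evaluation, and 3 training epochs, with a weight decay of 0.01 as regularization to reduce overfitting. Evaluation runs at the end of each epoch, and outputs such as checkpoints are written to ./results.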
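These settings determine how many optimizer updates the run performs. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation. The `linear_decay_lr` helper is illustrative only and assumes the learning rate decays linearly to zero with no warmup, which is how the Trainer's default schedule is commonly described.

```python
import math

num_train_examples = 25_000  # size of the IMDb train split
batch_size = 8               # per_device_train_batch_size
num_epochs = 3               # num_train_epochs
base_lr = 2e-5               # learning_rate

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_decay_lr(step):
    """Learning rate after `step` updates, assuming linear decay to zero without warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

print(steps_per_epoch, total_steps)  # 3125 9375
print(linear_decay_lr(0))            # 2e-05
print(linear_decay_lr(total_steps))  # 0.0
```

With more than one GPU, or with gradient accumulation enabled, the number of updates per epoch shrinks accordingly.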

Step 7: Selecting the Train and Test Sets

The IMDb dataset already comes with separate train and test splits, so we simply select them from the tokenized dataset. Evaluating on the test split shows how the model performs on unseen data after fine-tuning.

```python
# Select the train and test splits
train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]
```

Step 8: Initializing the Trainer

The Hugging Face Trainer class simplifies the training loop by handling gradient updates, evaluation, and logging. You only need to pass the model, training arguments, dataset, and tokenizer.

```python
from transformers import Trainer

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,  # recent transformers releases prefer processing_class=tokenizer
)
```
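The Trainer in this article is created without a metrics function, so evaluation reports loss and throughput only. If accuracy is also wanted, one option is to pass a function through the Trainer's `compute_metrics` argument. The version below is a minimal sketch written in plain Python, so it accepts nested lists as well as NumPy arrays. It assumes the evaluation output unpacks into a `(logits, labels)` pair.

```python
def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p) == int(y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Made-up logits and labels for four reviews
example_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2], [0.0, 1.0]]
example_labels = [1, 0, 1, 1]
print(compute_metrics((example_logits, example_labels)))  # {'accuracy': 0.75}
```

To use it, add `compute_metrics=compute_metrics` to the `Trainer(...)` call. The evaluation results should then include the accuracy under a key prefixed with `eval_`.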

Step 9: Training the Model

Now we can fine-tune the pre-trained BERT model on the IMDb dataset. The **train()** method runs the training loop and reports evaluation results at the end of each epoch.

```python
# Train the model
trainer.train()
```

**Output:**

[Image: training output]

Step 10: Evaluating the Model

After training, we evaluate the model on the test set to check how well it generalizes to new, unseen data.

```python
# Evaluate the model
results = trainer.evaluate()
print(results)
```

**Output:**

{'eval_loss': 0.31074509024620056, 'eval_runtime': 756.7467, 'eval_samples_per_second': 33.036, 'eval_steps_per_second': 4.13, 'epoch': 3.0}
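The numbers in this output can be cross-checked with a little arithmetic. The sketch below copies the reported values and assumes that `eval_runtime` is measured in seconds and that `eval_loss` is the mean cross-entropy over the test examples. Under that assumption, `exp(-eval_loss)` is the geometric mean of the probability the model assigned to the correct label.

```python
import math

# Values copied from the evaluation output above
results = {
    "eval_loss": 0.31074509024620056,
    "eval_runtime": 756.7467,
    "eval_samples_per_second": 33.036,
    "eval_steps_per_second": 4.13,
    "epoch": 3.0,
}

# Runtime multiplied by throughput recovers the number of examples and batches evaluated.
samples = round(results["eval_runtime"] * results["eval_samples_per_second"])
steps = round(results["eval_runtime"] * results["eval_steps_per_second"])

# Geometric mean probability assigned to the correct label
mean_correct_prob = math.exp(-results["eval_loss"])

print(samples, steps)               # 25000 3125
print(round(mean_correct_prob, 2))  # 0.73
```

The 25,000 samples match the size of the IMDb test split, and 3,125 steps match 25,000 reviews evaluated in batches of 8.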

Conclusion

In this article, we showed how to use Hugging Face’s Transformers library to fine-tune a pre-trained BERT model for sentiment analysis using the IMDb dataset. Hugging Face simplifies the process of working with transformers by providing pre-trained models, tokenizers, and ready-to-use tools for training and evaluation.