How to Use the Hugging Face Transformers Library for Sentiment Analysis
Last Updated : 23 Jul, 2025
The Hugging Face Transformers library has become a popular choice for developers working on Natural Language Processing (NLP) projects. It provides a common interface to pre-trained models such as BERT, GPT, and RoBERTa, so developers can apply them without building or training deep learning architectures from scratch. The library supports tasks including text classification, translation, summarization, and question answering.
This article walks you through the essentials of the Hugging Face Transformers library, from installation to fine-tuning and evaluating a pre-trained model.
Why Use Hugging Face Transformers?
The Hugging Face Transformers library offers several benefits:
- **Pre-trained Models:** Hugging Face provides numerous pre-trained models that are readily available for tasks such as text classification, text generation, and translation.
- **Ease of Use:** The library abstracts away the complexity of using transformer models, allowing you to focus on your task.
- **Integration with PyTorch and TensorFlow:** You can use Hugging Face models with either framework.
- **Customizable:** Hugging Face models can be fine-tuned for your specific task, whether it is text classification, question answering, or summarization.
Using the Hugging Face Library for Sentiment Analysis: Step-by-Step Guide
Step 1: Installing the Required Libraries
To begin, you need to install the necessary libraries:
pip install transformers datasets torch accelerate
These libraries provide the pre-trained models and training utilities (transformers), dataset loading (datasets), and the PyTorch framework that runs the models (torch). Recent versions of the Trainer class also require accelerate when used with PyTorch.
Step 2: Loading the IMDb Dataset
We’ll use the IMDb dataset, a common benchmark for binary sentiment classification, where each review is classified as positive or negative.
Python `
from datasets import load_dataset

# Load the IMDb dataset from the Hugging Face Hub
dataset = load_dataset('imdb')
print(dataset)
`
**Output:**
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
})
The **load_dataset()** function loads datasets directly from the Hugging Face Hub. Here, we load the IMDb dataset, which contains movie reviews labeled as either positive or negative.
Step 3: Loading a Pre-trained BERT Tokenizer
We will use a pre-trained BERT tokenizer to convert text into token IDs that the model can process. BERT’s tokenizer splits text into subword tokens (WordPiece), so a fixed-size vocabulary can still represent rare or unseen words as combinations of smaller pieces.
Python `
from transformers import AutoTokenizer

# Load the tokenizer for a pre-trained BERT model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
`
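To make the subword splitting concrete, here is a minimal sketch of greedy longest-match tokenization in the spirit of WordPiece. The vocabulary `TOY_VOCAB` and the helper names `wordpiece` and `tokenize` are hypothetical and exist only for illustration; the real bert-base-uncased tokenizer has a vocabulary of roughly 30,000 entries and also handles punctuation and special tokens.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
TOY_VOCAB = {"[UNK]", "un", "##believ", "##able", "movie", "the", "was"}


def wordpiece(word, vocab):
    """Split one lowercase word by repeatedly taking the longest matching piece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and apply the subword split to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


print(tokenize("The movie was unbelievable"))
# ['the', 'movie', 'was', 'un', '##believ', '##able']
```

Even though "unbelievable" is not in the toy vocabulary, it is still represented by three known pieces instead of a single unknown token.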
Step 4: Tokenizing the Dataset
We must preprocess the dataset by applying the tokenizer to each example. The tokenizer converts each review text into tokens, ensuring it fits within the model's maximum input length.
Python `
# Tokenizing function
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True, max_length=512)

# Apply tokenization to the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
`
In this step, we create a **preprocess_function()** that tokenizes the input text, truncates it to at most 512 tokens (the maximum input length of bert-base-uncased), and pads shorter reviews so that the sequences tokenized together have the same length. We then map this function over the entire dataset in batches.
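The effect of truncation and padding can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the helper `pad_and_truncate` is hypothetical, the token IDs are illustrative, and the slice-based truncation here drops the final separator token, which the real tokenizer keeps.

```python
PAD_ID = 0  # ID of the [PAD] token in bert-base-uncased


def pad_and_truncate(batch, max_length):
    """Cut each sequence to max_length, then pad to the longest sequence left."""
    truncated = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two illustrative sequences of different lengths
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 2307, 102]]
out = pad_and_truncate(batch, max_length=5)
print(out["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 2307]]
print(out["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what lets the model tell real tokens from padding, which is why the tokenizer returns it alongside the input IDs.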
Step 5: Loading a Pre-trained BERT Model for Sequence Classification
We’ll load a pre-trained BERT model (bert-base-uncased) with a sequence classification head on top. The head is newly initialized rather than pre-trained, and we size it for our binary task (IMDb reviews are either positive or negative); fine-tuning is what trains it.
Python `
from transformers import AutoModelForSequenceClassification

# Load a pre-trained BERT model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
`
Here, we specify that our model will have two output labels, corresponding to the binary classification task.
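With two output labels, the model produces two raw scores (logits) per review. A minimal sketch of how such scores are usually turned into a prediction follows; the logit values are made up, the helper names are hypothetical, and the mapping of 0 to negative and 1 to positive follows the IMDb label convention.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # IMDb label convention


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for one review: [score for negative, score for positive]
label, confidence = predict_label([-1.2, 2.3])
print(label, round(confidence, 3))
```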
Step 6: Setting Up Training Arguments
To fine-tune our model, we need to specify training arguments. This includes setting batch sizes, learning rate, evaluation strategy, number of epochs, and where to store results.
Python `
from transformers import TrainingArguments

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # newer transformers releases name this eval_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
`
Here, we set a learning rate of 2e-5, a per-device batch size of 8, and train for 3 epochs, evaluating at the end of each epoch. A weight decay of 0.01 adds regularization to help reduce overfitting.
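These settings determine how long training runs. As a rough sketch, the number of optimizer steps can be estimated from the dataset size, assuming a single device and no gradient accumulation; the second helper assumes the Trainer's default linear learning-rate schedule without warmup. Both function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs,
                         num_devices=1, grad_accum=1):
    """Estimate optimizer steps for a full training run."""
    effective_batch = batch_size * num_devices * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate under a linear decay to zero with no warmup (assumed default)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


# 25,000 training reviews, batch size 8, 3 epochs
total = total_training_steps(25_000, 8, 3)
print(total)  # 9375
print(linear_decay_lr(total // 2, total, 2e-5))  # roughly half the initial rate
```

On this dataset each epoch takes 3,125 steps, so a larger batch size or fewer epochs shortens training proportionally.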
Step 7: Selecting the Train and Test Splits
The IMDb dataset already comes with separate train and test splits, so we simply select them from the tokenized dataset. Holding out the test split lets us evaluate the model's performance on unseen data after fine-tuning.
Python `
# Select the train and test splits
train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]
`
Step 8: Initializing the Trainer
The Hugging Face Trainer class simplifies the training loop by handling gradient updates, evaluation, and logging. You only need to pass the model, training arguments, dataset, and tokenizer.
Python `
from transformers import Trainer

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
)
`
Step 9: Training the Model
Now, we can fine-tune the pre-trained BERT model on the IMDb dataset. The **train()** method runs the training loop and, with the arguments above, evaluates the model at the end of each epoch.
Python `
# Train the model
trainer.train()
`
Step 10: Evaluating the Model
After training, we evaluate the model on the test set to check how well it generalizes to new, unseen data.
Python `
# Evaluate the model
results = trainer.evaluate()
print(results)
`
**Output:**
{'eval_loss': 0.31074509024620056, 'eval_runtime': 756.7467, 'eval_samples_per_second': 33.036, 'eval_steps_per_second': 4.13, 'epoch': 3.0}
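The evaluation output above contains the loss and throughput figures but no accuracy, since the Trainer was not given a metric function. A minimal sketch of one is shown below, assuming it is passed to the Trainer through its `compute_metrics` argument. It is written in plain Python so it runs on lists as well as NumPy arrays (`np.argmax` would be the more common choice), and the toy logits and labels are made up for the demonstration.

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews and their true labels
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the Trainer should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`.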
Conclusion
In this article, we showed how to use the Hugging Face Transformers library to fine-tune a pre-trained BERT model for sentiment analysis on the IMDb dataset. Hugging Face simplifies working with transformers by providing pre-trained models, tokenizers, and ready-to-use tools for training and evaluation.