Transformers-Tutorials

Hi there!

This repository contains demos I made with the Transformers library by 🤗 HuggingFace. Currently, all of them are implemented in PyTorch.

NOTE: if you are not familiar with HuggingFace and/or Transformers, I highly recommend checking out our free course, which introduces you to several Transformer architectures (such as BERT, GPT-2, T5, BART, etc.) and gives an overview of the HuggingFace libraries, including Transformers, Tokenizers, Datasets, Accelerate and the Hub.

For an overview of the HuggingFace ecosystem for computer vision (June 2022), refer to this notebook and its corresponding video.

Currently, it contains the following demos:

... more to come! 🤗

If you have any questions regarding these demos, feel free to open an issue on this repository.

By the way, I was also the main contributor who added the following algorithms to the library:

All of them were an incredible learning experience. I recommend that anyone contribute an AI algorithm to the library!

Data preprocessing

Regarding preparing your data for a PyTorch model, there are a few options. The most straightforward one is to define a regular PyTorch Dataset:

```python
import torch
from torch.utils.data import Dataset

class CustomTrainDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.df = df
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get item
        item = self.df.iloc[idx]
        text = item['text']
        label = item['label']
        # encode text
        encoding = self.tokenizer(text, padding="max_length", max_length=128, truncation=True, return_tensors="pt")
        # remove batch dimension which the tokenizer automatically adds
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        # add labels (the model's forward pass expects the keyword "labels")
        encoding["labels"] = torch.tensor(label)

        return encoding
```

Instantiating the dataset then happens as follows:

```python
from transformers import BertTokenizer
import pandas as pd

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
df = pd.read_csv("path_to_your_csv")

train_dataset = CustomTrainDataset(df=df, tokenizer=tokenizer)
```

Accessing the first example of the dataset can then be done as follows:

```python
encoding = train_dataset[0]
```

In practice, one creates a corresponding DataLoader, which allows you to get batches from the dataset:

```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
```
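
The training loop in the next section also uses an eval_dataloader. A minimal sketch of how one could be created, assuming a separate validation CSV (the file name below is just a placeholder):

```python
# assumption: a held-out validation split stored in its own CSV file
val_df = pd.read_csv("path_to_your_validation_csv")
val_dataset = CustomTrainDataset(df=val_df, tokenizer=tokenizer)

# no shuffling needed for evaluation
eval_dataloader = DataLoader(val_dataset, batch_size=4, shuffle=False)
```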

I often check whether the data is created correctly by fetching the first batch from the data loader, and then printing out the shapes of the tensors, decoding the input_ids back to text, etc.

```python
batch = next(iter(train_dataloader))
for k, v in batch.items():
    print(k, v.shape)

# decode the input_ids of the first example of the batch
print(tokenizer.decode(batch['input_ids'][0].tolist()))
```

Loading a custom dataset as a Dataset object can be done as follows (you can install datasets using pip install datasets):

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
                                          'test': 'my_test_file.csv'})
```

Here I'm loading local CSV files, but other formats are supported as well (including JSON, Parquet and txt), and you can also load data from an in-memory Pandas dataframe or dictionary, for instance. You can check out the docs for all details.
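
For example, a minimal sketch of creating a Dataset from an in-memory Pandas dataframe or a plain Python dictionary (the column names below are just placeholders):

```python
from datasets import Dataset

# from a Pandas dataframe
dataset = Dataset.from_pandas(df)

# from a plain Python dictionary
dataset = Dataset.from_dict({"text": ["first example", "second example"],
                             "label": [0, 1]})
```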

Training frameworks

Regarding fine-tuning Transformer models (or more generally, PyTorch models), there are a few options. The most basic one is a regular PyTorch training loop, shown below; an alternative using the Trainer API is sketched right after it:

```python
import torch
from transformers import BertForSequenceClassification

# instantiate pre-trained BERT model with randomly initialized classification head
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# I almost always use a learning rate of 5e-5 when fine-tuning Transformer based models
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# put model on GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 3  # example value; set the number of epochs for your use case

for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for batch in train_dataloader:
        # put batch on device
        batch = {k: v.to(device) for k, v in batch.items()}

        # forward pass
        outputs = model(**batch)
        loss = outputs.loss

        train_loss += loss.item()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    print(f"Loss after epoch {epoch}:", train_loss/len(train_dataloader))

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in eval_dataloader:
            # put batch on device
            batch = {k: v.to(device) for k, v in batch.items()}

            # forward pass
            outputs = model(**batch)
            loss = outputs.loss

            val_loss += loss.item()

    print(f"Validation loss after epoch {epoch}:", val_loss/len(eval_dataloader))
```

Citation

Feel free to cite me when you use some of my tutorials :)

```bibtex
@misc{rogge2025transformerstutorials,
  author = {Rogge, Niels},
  title = {Tutorials},
  url = {https://github.com/NielsRogge/tutorials},
  year = {2025}
}
```