Text-to-Text Transfer Transformer (T5) in Data Augmentation

Last Updated : 31 Jul, 2025

Data augmentation in NLP is a technique used to create additional training data by slightly modifying existing text. This helps machine learning models perform better, especially when the original dataset is small. Whether we're building a model for text classification, summarization or question answering, data augmentation can make a big difference.

Some common techniques used for data augmentation in NLP include:

In this article, we'll focus on how the **Text-to-Text Transfer Transformer (T5)** can be used to generate new training data and improve NLP model performance.

Text-to-Text Transfer Transformer

Text-to-Text Transfer Transformer (T5) is a large transformer model trained on the Colossal Clean Crawled Corpus (C4). It was released as a pre-trained model capable of handling various NLP tasks such as translation, summarization, question answering and classification.

T5 treats every NLP task as a text-to-text problem. This means both the input and output are plain text, regardless of the task. For example:

**Example 1:** For English-to-German translation, the input could be "translate English to German: That is a book." and the output would be the translated sentence in German.

**Example 2:** For sentiment analysis, the input might be "sentiment analysis: I love this product." and the output would be "positive".

T5 allows training on multiple tasks by using different prefixes in the input to indicate the task type. This approach enables a single model to handle diverse NLP tasks effectively. It has shown strong performance across many benchmarks and is widely used for generating synthetic data in data augmentation workflows.
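The prefix convention can be illustrated with a small helper. This is only a sketch: `build_t5_input` is a hypothetical name, the first two prefixes are taken from the examples above and `summarize` is added for illustration. A given checkpoint only responds reliably to the prefixes it was actually trained with.

```python
def build_t5_input(task_prefix: str, text: str) -> str:
    """Prepend a task prefix so a single model can tell tasks apart."""
    return f"{task_prefix.strip()}: {text.strip()}"


# Illustrative (prefix, text) pairs; the last one is an added example
examples = [
    ("translate English to German", "That is a book."),
    ("sentiment analysis", "I love this product."),
    ("summarize", "Data augmentation creates extra training text."),
]

for prefix, text in examples:
    print(build_t5_input(prefix, text))
```

Because the task is encoded in the input string itself, the same tokenizer and model can be reused for every task without changing the architecture.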

How to Use T5 for Data Augmentation

There are multiple ways to use the T5 (Text-to-Text Transfer Transformer) model for data augmentation in NLP tasks.

1. Using T5 Directly

Similar to back translation, T5 can be used without additional training by leveraging its pre-trained summarization capabilities. In this approach:
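As a rough sketch of this idea, each text can be passed through T5 with the `summarize:` prefix and the summary kept as an extra sample with the original label. The helper names (`augment_with_summaries`, `load_t5_summarizer`), the `t5-small` checkpoint, the generation settings and the rule of skipping unchanged outputs are all assumptions made for illustration, not a prescribed recipe.

```python
from typing import Callable, Dict, List


def augment_with_summaries(
    rows: List[Dict[str, str]],
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return the original rows plus one summarized copy of each row."""
    augmented = list(rows)
    for row in rows:
        summary = summarize(row["text"]).strip()
        # Skip empty outputs and summaries that merely repeat the input
        if summary and summary.lower() != row["text"].strip().lower():
            augmented.append({"text": summary, "label": row["label"]})
    return augmented


def load_t5_summarizer(model_name: str = "t5-small") -> Callable[[str], str]:
    """Build a summarizer from a pre-trained T5 checkpoint (downloads weights)."""
    # Imported lazily so the helper above works without these packages
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    model.eval()

    def summarize(text: str) -> str:
        # "summarize: " is the task prefix for summarization
        inputs = tokenizer("summarize: " + text, return_tensors="pt",
                           max_length=512, truncation=True)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_length=64, num_beams=4)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return summarize
```

A call such as `augment_with_summaries(rows, load_t5_summarizer())` would then return the original rows followed by their summarized variants. Summaries tend to be most useful for longer inputs, since very short sentences leave little to compress.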

2. Fine-Tuning T5 for Custom Data Augmentation

T5 can also be fine-tuned on specific tasks to generate high-quality synthetic data. Two effective strategies are:

1. Masked Span Prediction

Figure: Fine-tuning T5 on the masked word prediction task
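To make the masked span format concrete, the sketch below builds one input/target pair using T5's sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...). The function name `mask_spans`, the whitespace tokenization and the hand-picked spans are assumptions for illustration; real pre-training samples spans randomly over subword tokens.

```python
from typing import List, Tuple


def mask_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each (start, end) token span with a sentinel token.

    Spans use Python slice indices and must be sorted and non-overlapping.
    """
    input_parts, target_parts = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        input_parts.extend(tokens[cursor:start])  # unmasked text before the span
        input_parts.append(sentinel)              # placeholder for the hidden span
        target_parts.append(sentinel)             # target repeats the sentinel...
        target_parts.extend(tokens[start:end])    # ...followed by the hidden words
        cursor = end
    input_parts.extend(tokens[cursor:])
    target_parts.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(input_parts), " ".join(target_parts)


tokens = "Thank you for inviting me to your party last week".split()
masked_input, target = mask_spans(tokens, [(2, 4), (8, 9)])
print(masked_input)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)        # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

For augmentation, the masked input is given to the model and its predicted spans are spliced back into the sentence, which yields variants that share the surrounding context but differ in the filled-in words.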

2. Paraphrase Generation

Figure: Fine-tuning T5 for paraphrase generation using the PAWS dataset
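One plausible way to prepare PAWS for this fine-tuning step is to keep only the pairs labelled as paraphrases and turn each into a prefixed input and a target. PAWS rows carry `sentence1`, `sentence2` and `label` fields, with `label == 1` marking a paraphrase. The function name, the output keys and the toy rows below are made up for illustration.

```python
from typing import Dict, Iterable, List


def paws_to_t5_examples(rows: Iterable[Dict]) -> List[Dict[str, str]]:
    """Keep true paraphrase pairs (label == 1) and format them for T5."""
    examples = []
    for row in rows:
        if row["label"] != 1:
            continue  # non-paraphrase pairs would teach the wrong mapping
        examples.append({
            "input_text": f"paraphrase: {row['sentence1']}",
            "target_text": row["sentence2"],
        })
    return examples


# Toy rows in the PAWS layout (invented sentences, not taken from the dataset)
toy_rows = [
    {"sentence1": "The cat chased the dog.",
     "sentence2": "The dog was chased by the cat.", "label": 1},
    {"sentence1": "The cat chased the dog.",
     "sentence2": "The dog chased the cat.", "label": 0},
]

for example in paws_to_t5_examples(toy_rows):
    print(example)
```

The resulting `input_text` and `target_text` pairs can then be tokenized and fed to a standard sequence-to-sequence training loop.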

Model Variants and Considerations

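T5 is available in multiple sizes: T5-Small (about 60 million parameters), T5-Base (about 220 million), T5-Large (about 770 million), T5-3B (about 3 billion) and T5-11B (about 11 billion).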

Larger models tend to produce better results but require more computational resources and training time. However, this is typically a one-time effort and the resulting model can be reused across various NLP tasks for effective data augmentation.

Implementation of Data Augmentation

1. Installation and Imports

```python
# Notebook cell: T5Tokenizer also needs the sentencepiece package
!pip install transformers torch datasets sentencepiece

import pandas as pd
import numpy as np
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
```

2. Setting Device for Computation

Automatically use the GPU if one is available, otherwise fall back to the CPU.

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```

**Output:**

```
Using device: cpu
```

3. Loading T5 Paraphrasing Model

```python
class T5Paraphraser:
    def __init__(self, model_name="ramsrigouthamg/t5_paraphraser"):
        # Load the tokenizer and the fine-tuned paraphrasing checkpoint
        self.tokenizer = T5Tokenizer.from_pretrained(model_name)
        self.model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

    def paraphrase(self, text, num_return_sequences=3, max_length=128):
        # Task prefix plus an explicit end-of-sequence marker
        input_text = f"paraphrase: {text} </s>"
        inputs = self.tokenizer.encode(
            input_text, return_tensors="pt", max_length=max_length, truncation=True
        ).to(device)

        # Sample several candidates; top_k, top_p and temperature control diversity
        outputs = self.model.generate(
            inputs, max_length=max_length, num_return_sequences=num_return_sequences,
            do_sample=True, top_k=50, top_p=0.95, temperature=0.8
        )

        # Decode and drop exact duplicates, so fewer than
        # num_return_sequences results may come back
        return list(set(self.tokenizer.decode(output, skip_special_tokens=True) for output in outputs))
```

4. Initialising Model

```python
paraphraser = T5Paraphraser()

sample_sentences = [
    "Python is a powerful language.",
    "Deep learning requires large datasets.",
    "Artificial intelligence is evolving rapidly."
]

# Print each sentence followed by its sampled paraphrases
for text in sample_sentences:
    print(f"\nOriginal: {text}")
    for i, p in enumerate(paraphraser.paraphrase(text), 1):
        print(f"Paraphrase {i}: {p}")
```

**Output:**

Figure: T5 paraphraser output

5. Augmenting a Text Classification Dataset

```python
def get_sample_data():
    return pd.DataFrame({
        'text': [
            "Great product quality", "Fast delivery", "Excellent service",
            "Poor design", "Terrible support"
        ],
        'label': ['positive', 'positive', 'positive', 'negative', 'negative']
    })

def augment_dataset(df, text_col='text', label_col='label', ratio=0.5):
    df = df.copy()  # leave the caller's DataFrame unchanged
    augmented = []

    # Paraphrase a random fraction of the rows in each class
    for label in df[label_col].unique():
        samples = df[df[label_col] == label].sample(frac=ratio)
        for text in samples[text_col]:
            for p in paraphraser.paraphrase(text, num_return_sequences=2):
                # A paraphrase keeps the label of its source sentence
                augmented.append({text_col: p, label_col: label, 'source': 'augmented'})

    df['source'] = 'original'
    return pd.concat([df, pd.DataFrame(augmented)], ignore_index=True)

df = get_sample_data()
aug_df = augment_dataset(df)

print("\nAugmented Dataset:")
print(aug_df.head())
```

**Output:**

Figure: Augmentation on the sample dataset
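Sampling-based generation can return the input sentence unchanged, or the same paraphrase twice with different casing or spacing. A light filter before adding rows to the dataset may help keep the augmented data varied. The sketch below is one possible approach; `filter_paraphrases` and its normalization rule (lowercasing and collapsing whitespace) are assumptions, not part of the pipeline above.

```python
from typing import List


def filter_paraphrases(original: str, candidates: List[str]) -> List[str]:
    """Drop empty candidates, copies of the original and near-exact duplicates."""

    def normalize(s: str) -> str:
        # Compare case-insensitively and ignore repeated whitespace
        return " ".join(s.lower().split())

    seen = {normalize(original)}
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(candidate)  # keep the first spelling encountered
    return kept


print(filter_paraphrases(
    "Fast delivery",
    ["fast  delivery", "Quick shipping", "quick shipping", ""],
))
```

Inside `augment_dataset`, the call `paraphraser.paraphrase(text, ...)` could be wrapped as `filter_paraphrases(text, paraphraser.paraphrase(text, ...))` to apply the same check to every sampled sentence.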

6. Batch Processing for Large Datasets

```python
def batch_paraphrase(texts, batch_size=5, num_return=1):
    results = []

    # Walk through the texts in chunks of batch_size; each text in a chunk
    # is still paraphrased with its own generate() call
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        for text in batch:
            results.append(text)  # keep the original sentence
            results.extend(paraphraser.paraphrase(text, num_return_sequences=num_return))

    return results

augmented_batch = batch_paraphrase(sample_sentences, batch_size=2, num_return=2)
print("\nBatch Paraphrased Output:")
for text in augmented_batch:
    print(text)
```

**Output:**

Figure: Batch paraphrased output

7. Analysis of Augmented Data

Show the proportion of original vs. augmented data.

```python
def print_stats(df):
    print("\nSample Counts by Source:")
    # normalize=True returns fractions, formatted here as percentages
    print(df['source'].value_counts(normalize=True).apply(lambda x: f"{x*100:.1f}%"))

print_stats(aug_df)
```

**Output:**

Figure: Proportion of augmented and original samples

The output shows the share of original and augmented samples, confirming that the paraphraser added new rows to the dataset.