LLM synthetic generation to enhance online content moderation generalization in hate speech scenarios
1 Introduction
The increasing influence of social media and online platforms has amplified the spread of hate speech, with damaging consequences for individuals, communities, and society at large [1]. Mitigating hate speech is among the most pressing concerns of modern societies, as it affects institutions, governments, public and private entities, and, of course, citizens. Hate speech, broadly defined as language that discriminates against, belittles, or incites violence against individuals or groups based on attributes such as race, gender, religion, or sexual orientation, poses significant threats to both mental health and social cohesion [2].

Automated detection of hate speech is critical, not only to maintain safe online environments but also to protect vulnerable populations from continuous exposure to harmful content. Without efficient identification and moderation, hate speech can escalate conflicts, reinforce societal divisions, and perpetuate structural inequalities [3]. Targeted groups experience psychological harm such as anxiety, low self-esteem, and estrangement, which emphasizes the need for strong, scalable strategies to manage these adverse impacts [4].

Hate speech can vary dramatically across languages, cultures, and contexts, making it difficult to define and identify consistently. Even within a single language, hate speech can manifest in diverse ways. Moreover, the volume of content generated daily on social platforms necessitates automation, as manual moderation is both impractical and insufficient for effectively controlling the spread of harmful content. As a result, the development of hate speech datasets is paramount for training machine learning algorithms to recognize these patterns accurately. However, the creation of such datasets is fraught with challenges, particularly in acquiring positive samples [5].

In addition, sourcing hate speech samples requires navigating ethical and legal considerations [6], as well as technical obstacles, such as accessing the online platforms where hate speech is prevalent. Recently, these challenges have been further compounded by social media platforms limiting access to their data. For instance, _X_ (formerly _Twitter_), historically a significant source of hate speech data due to its high volume of user-generated content, has limited API access for academic research [7]. This policy shift creates substantial barriers for researchers who rely on direct data acquisition to study hate speech patterns and build comprehensive datasets.

The generation of synthetic text data, particularly for specialized tasks like hate speech detection, has long posed a costly and technically demanding challenge. Traditional non-AI-based methods for synthetic data generation often involve simple manipulations, such as altering characters or substituting words within original sentences, to introduce variety while retaining the core meaning [8]. These techniques are primarily designed to make text classification models more robust by exposing them to noisy or slightly altered data. While they contribute a degree of resilience against minor variations, they fail to provide new semantic views beyond the patterns already present in the original dataset.
Recent advancements in LLMs have revolutionized the field of text generation by enabling the creation of synthetic text that closely mirrors human language. For example, interactive chatbots such as ChatGPT [9] or _Mistral_ [10] are built on GPT-based LLMs and exemplify these innovations. These models have been trained on vast internet-scale corpora encompassing a broad spectrum of linguistic expressions and tones, ranging from neutral to harmful, offering unprecedented capabilities in generating contextually relevant and semantically rich content. Unlike traditional methods, LLMs can produce diverse, realistic, and coherent sentences that extend beyond simple paraphrasing, enhancing the quality and variety of synthetic data available for training purposes.

In this paper, we introduce a method for semantic data augmentation using LLMs that operate without moderation filters, i.e., models that have not undergone instruction tuning specifically designed to limit or avoid potentially contentious or harmful responses. Our approach focuses on enhancing hate speech datasets by generating additional samples that retain semantic coherence and relevance to the targeted categories. This method strengthens the generalization capabilities of various text classification models, particularly in low-resource contexts, by enriching original datasets with synthetic examples, all without requiring any changes to the model architectures themselves.
The hate speech domain serves as our case study. We evaluate our approach on four widely used hate speech datasets: (1) _Call me Sexist But... (CMSB)_—a dataset containing sexist remarks collected from social networks; (2) _ETHOS_—a dataset encompassing various types of hate speech, including violence, racism, and discrimination based on disability; (3) _Stormfront_—a dataset with white supremacy content reflecting harmful ideologies; and (4) _Antiasian_—a dataset containing offensive content targeting the Asian community.
With this article we bring the following contributions:
- Robust data augmentation technique for hate speech: We propose a semantic data augmentation technique using unfiltered LLMs to generate additional hate speech samples that are contextually coherent and relevant. In addition, we provide a comparative analysis with existing text data augmentation methods, demonstrating that our approach is consistently more robust.
- Enhanced performance in low-resource settings: Our method enriches datasets with synthetic samples, significantly boosting the performance of text classification models in low-resource environments without necessitating any changes to the underlying model architectures.
The rest of the paper is structured as follows. Section 2 reviews the state of the art in synthetic data generation and contrasts it with our method. Section 3 describes our method and the evaluation methodology. Section 4 details the experimental setup: the datasets, the augmentation baselines, the model selection, and the training and evaluation techniques employed. Section 5 presents the experiments on each of the datasets. Finally, Sect. 6 contains the final remarks.
2 Related work
Data augmentation (DA) refers to a range of techniques that increase the diversity of training data without the need for additional data collection. The primary goal of DA is to create varied versions of existing data, either by applying transformations or by generating synthetic samples, so that the augmented data serves as a regularizer at training time [8].
In areas like image processing, data augmentation techniques are both powerful and straightforward to apply, with methods such as transposition or random erasure proving highly effective [11, 12]. Furthermore, diffusion-based models have recently advanced to the point where they can produce images almost indistinguishable from those created by humans [13].

Although similar strategies have been investigated for text, it is less clear how to produce successful augmented examples, as the input space is discrete. Traditional text data augmentation techniques have often focused on character- or word-level manipulations to introduce variation and help models become more resilient to noise [14]. By generating slightly altered versions, these approaches expose models to noise, encouraging them to learn more robust representations that generalize beyond specific word choices or minor typographic errors. Moreover, _seq2seq_ techniques like _BackTranslation_ [15] go beyond shallow alterations, capturing more of the semantic nuance needed for complex NLP tasks.

However, the emergence and rapid advancement of LLMs have sparked a new research direction in which these models are leveraged to generate synthetic text data. Text augmentation brings unique challenges, particularly in assessing the diversity of generated samples and ensuring the synthetic data accurately aligns with the intended class. The success of prominent approaches often hinges on the fine-tuning of prompts, a process known as prompt engineering, to maximize relevance and coherence in the generated text.
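The round trip at the heart of _BackTranslation_ can be sketched in a few lines. The toy word-level dictionaries below are only stand-ins for real MT models (an actual pipeline would use trained English-to-German and German-to-English translators); the deliberately imperfect reverse table is what produces the rephrasing.

```python
# Toy stand-ins for trained MT systems: a real pipeline would call an
# en->de and a de->en neural translator here. The reverse table is not a
# perfect inverse of the forward one, which is exactly what yields a paraphrase.
EN_TO_DE = {"the": "der", "cat": "katze", "sat": "sass"}
DE_TO_EN = {"der": "the", "katze": "cat", "sass": "rested"}

def translate(text: str, table: dict) -> str:
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in text.split())

def back_translate(text: str) -> str:
    """Translate into the intermediate language, then back to the source."""
    intermediate = translate(text, EN_TO_DE)
    return translate(intermediate, DE_TO_EN)
```

With these tables, `back_translate("the cat sat")` returns `"the cat rested"`: the sentence survives the round trip with its meaning intact but its surface form altered.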
2.1 Traditional data augmentation techniques for noise robustness
In text data augmentation, traditional methods primarily involve modifying characters or words within existing sentences to help classification models become more resilient to noise. This approach is particularly valuable in domains like hate speech detection, where data is often sourced from social media platforms that contain frequent spelling errors, abbreviations, and deliberate alterations to evade moderation.
Character-level manipulations typically include four key operations: insertion, deletion, swapping, and substitution [16]. These modifications mimic common typographical errors and introduce minor variability that encourages models to generalize across noisy inputs [17]. For instance, Yang et al. [18] propose a boosting iterative method for filtering low-quality generations based on an ensemble of metrics such as perplexity.

At the word level, similar techniques are applied, with some adaptations. Unlike with characters, inserting random words into sentences is challenging, as it often disrupts sentence coherence. Instead, techniques such as synonym replacement are commonly used, where words are swapped for contextually similar alternatives sourced from dictionaries or lexical databases [19]. This method enhances the model's ability to understand relationships between words that are semantically alike, helping it recognize meaning in varied linguistic contexts. These simple but effective methods have been applied to a wide variety of downstream tasks such as emotion recognition [20], named entity recognition (NER) [21], and natural language inference (NLI) [22].

Despite their effectiveness in increasing robustness to noise, these approaches have limitations. They primarily enhance surface-level variability without adding new semantic depth to the data. As a result, while these techniques can improve model generalization against minor variations, they fail to introduce the richer perspectives or nuanced expressions necessary for complex tasks like hate speech detection.
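The four character-level operations and dictionary-based synonym replacement can be sketched as follows. This is a minimal illustration, not a production implementation; libraries such as `nlpaug` offer mature versions of these augmenters.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def char_augment(text: str, op: str, rng: random.Random) -> str:
    """Apply one character-level edit: insert, delete, swap, or substitute."""
    chars = list(text)
    if len(chars) < 2:
        return text
    i = rng.randrange(len(chars) - 1)  # leave room for the swap neighbor
    if op == "insert":
        chars.insert(i, rng.choice(ALPHABET))
    elif op == "delete":
        del chars[i]
    elif op == "swap":
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    elif op == "substitute":
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)

def synonym_replace(text: str, synonyms: dict, rng: random.Random) -> str:
    """Replace each word with a random alternative from a small lexicon;
    words without an entry are kept as-is."""
    options = [synonyms.get(w, [w]) for w in text.split()]
    return " ".join(rng.choice(opts) for opts in options)
```

Each edit produces a noisy variant of the same sentence, so the label can safely be reused; as the surrounding text notes, none of these edits adds new semantic content.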
Our proposed method naturally incorporates some of these traditional augmentations by generating diverse sentences that preserve the original style and, in some cases, even mimic typographical quirks commonly found in social media text, achieving both surface-level variability and deeper semantic richness.
2.2 Data augmentation using non-large language models
In earlier stages of text data augmentation, sequence-to-sequence (seq2seq) models, such as recurrent neural networks (RNNs) [23] or long short-term memories (LSTMs) [24], and later transformer architectures [25], were the primary tools for generating augmented text data. These models have proven effective in producing contextually coherent and semantically relevant text by leveraging representations of words and entire sequences.

Kobayashi et al. used bidirectional RNNs to predict missing words within sentences based on surrounding context and specific labels, facilitating context-sensitive word substitutions [26]. Kumar et al. used masked language modelling (MLM), similar to _BERT_ pretraining [27], to mask parts of the original sentence [28] and then let the transformer encoder yield a semantic variation of the source, enhancing data diversity with syntactically consistent variations.

One of the most popular and useful techniques is _BackTranslation_. This method involves translating a sentence into an intermediate language and then back into the original, yielding natural rephrasings influenced by the auxiliary language's syntactic and lexical structures [15]. Originally implemented with encoder-decoder models, specifically gated recurrent units (GRUs) [29] trained to translate between English and German, _BackTranslation_ has become more widely adopted with transformer-based models due to their flexibility and proficiency in natural language generation. This family of methods has been instrumental in introducing diversity to training data, yet it still faces limitations in producing highly varied samples without semantic drift.

2.3 Relying on LLMs for data augmentation
The emergence of LLMs has transformed the field of data augmentation, enabling the generation of synthetic text data that is both contextually rich and semantically precise. LLMs offer unprecedented flexibility for generating diverse and coherent text, driven by their vast pre-training on diverse language corpora.
A prominent area within LLM-based augmentation is prompt engineering, where specific prompts are crafted to elicit synthetic data that aligns with the target dataset’s characteristics. These approaches range from straightforward rephrasings [30] to sophisticated prompts tailored for particular domains or applications [31]. However, while effective, prompt-based methods sometimes yield limited benefits when adding synthetic data to large datasets, as excessive data can inadvertently reduce model performance [32, 33].
Moreover, recent work has demonstrated the effectiveness of LLMs in generating counterfactual hard samples that help models increase their generalization capabilities, specifically in natural language inference tasks. Wu et al. [34] trained a GPT-2 architecture to transform source samples given a control code such as _negation_, where the augmented sample is a subtle negation of the original one, or _shuffle_, where entities are swapped, completely altering the meaning. In addition, Dixit et al. [35] used GPT-3 as a counterfactual editor to flip the labels of original samples by negating the hypothesis. Finally, Chen et al. [36] similarly let GPT-3 edit selected chunks of the original sentence, guiding the generation with a target label different from the initial one.

Kruschwitz et al. [37] and Pendzel et al. [38] both explore synthetic hate speech generation through fine-tuning. Kruschwitz et al. fine-tune GPT-3 Curie on neutral and hate speech datasets, but a filtering process revealed that up to 96% of the generated samples lacked sufficient toxicity, ultimately offering no consistent performance gains. Pendzel et al. apply a toxicity threshold of 0.7 to samples generated by a fine-tuned GPT-2, a method that risks compromising diversity and novelty. In contrast, our approach leverages prompt engineering and in-context learning with an unmoderated LLM, thereby generating synthetic samples that are both appropriately toxic and diverse without the need for fine-tuning, effectively overcoming the limitations of traditional augmentation methods.

Additionally, LLMs have opened new avenues for “steering” generation toward specific targets, known as controlled generation. Although traditionally not part of data augmentation, controlled generation aligns the model’s outputs with specific labels or desired semantic styles, enhancing dataset relevance. For instance, methods have been developed to refine reasoning by prompting the model to self-correct on challenging tasks, thus aligning outputs with correct labels in a zero-shot setting [39]. Stylistic consistency can also be achieved by encoding the dataset and utilizing the mean representation to guide new generations, achieving stylistic alignment without model fine-tuning [40]. These advancements underscore the versatility of LLMs in data augmentation, providing high-quality, semantically aligned synthetic data.
2.4 Augmentation in hate speech detection for low-resource text datasets
Creating datasets for hate speech classification presents unique challenges, especially in collecting positive samples that are both consistent and representative of the target behavior. Gathering such data often involves locating specific user hubs on social media or other online platforms, yet this alone is not sufficient [5]. Labeling hate speech accurately demands a deep understanding of the topic’s social and cultural nuances, as well as context-specific knowledge. This complexity increases the risk of bias in the resulting classification models, especially when annotators lack expertise in sociological factors specific to hate speech.

The subjectivity inherent to hate speech annotation further complicates the process. Unlike other domains where practical annotation methods (e.g., Amazon Mechanical Turk) can be applied effectively, hate speech detection suffers from interpretive variability that is difficult to eliminate, even with rigorous annotation protocols [41]. Techniques such as using multiple annotators per sample and pre-screening them help to mitigate some bias, yet the ambiguity of hate speech categories persists, often resulting in inconsistent labels across annotators.

Generally, real data tends to be of higher quality than synthetic data, which is why some researchers aim to expand low-resource hate speech datasets using other collected sources. For instance, Ilan and Vilenchik [5] compile hate speech data from multiple sources (HARALD) to augment low-resource datasets, enhancing model generalization despite the lack of strict topic alignment. Similarly, Khullar et al. [42] explore cross-linguistic generalization in hate speech detection by leveraging named entity recognition (NER) to mask specific tokens, translating the sentence into a target language, and then using an encoder to unmask the tokens within the new context. This approach improves cross-lingual performance by maintaining the original meaning in the target language.

Other researchers focus on augmenting data complexity to improve model robustness. For example, the authors of [43] employ fine-tuned encoders on hate speech datasets to adjust token sampling from a large language model, steering its output towards adversarial examples that are harder for classifiers to predict. This approach introduces more challenging instances, allowing models to develop a more nuanced understanding of hate speech content.
A noteworthy trend in some prior studies is the practice we refer to as pseudo-data-leaking, where data augmentation is applied to the set intended for evaluation, indirectly inflating metrics. Although the exact test samples are not added to the training set, semantically similar samples are introduced, potentially providing the model with implicit “hints” about the test data. For instance, Dai et al. generate rephrasings using what they call the augmentation set, which is then used for the evaluation [30]. Our approach deliberately avoids such methods, ensuring that the validation set remains untouched during augmentation to uphold the rigor and reliability of model assessment.
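A leakage-safe pipeline follows directly from this observation: split the data first, then augment only the training portion. Below is a minimal sketch; `augment_fn` is a placeholder for any of the augmentation methods discussed, and the split logic is intentionally simplistic.

```python
import random

def leak_free_augment(samples, labels, augment_fn, test_ratio=0.2, seed=0):
    """Split first, then augment ONLY the training portion, so no sample
    derived from test data can leak into training.

    augment_fn: callable text -> text (placeholder for any augmenter).
    Returns (train, test) as lists of (text, label) pairs.
    """
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train = [(samples[i], labels[i]) for i in idx[:cut]]
    test = [(samples[i], labels[i]) for i in idx[cut:]]
    # Synthetic samples are derived exclusively from training samples.
    train += [(augment_fn(x), y) for x, y in train]
    return train, test
```

Because the test split is carved out before `augment_fn` ever runs, no synthetic sample can be a paraphrase of a held-out example.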
3 Methods
In this work, we introduce a novel text data augmentation method that uses LLMs to generate synthetic samples that are semantically aligned with the original text while preserving its style. Our approach aims to distill the vast internet-scale knowledge encapsulated in LLMs into new, high-quality data points. This allows for the creation of samples that go beyond mere rephrasing, capturing the original intent and stylistic nuances yet introducing diversity that enriches the dataset. This method builds upon our previous research, LoRDS-GEN [44], extending its foundational ideas to a broader augmentation context.
3.1 Demonstration-based prompting
Our method leverages a technique known as demonstration-based prompting, where samples from the original dataset are incorporated into the prompt to guide the LLM’s output towards the intended data distribution. Traditionally, this approach encourages the LLM to produce rephrased versions of the input samples [30, 32]. However, our method takes a distinct approach by enforcing prompt instructions and applying various token sampling mechanisms (see Sect. 4.4) to ensure that the generated samples are both diverse and semantically aligned with the original label or topic (e.g., hate speech). This approach enables the creation of new samples that retain the core semantic essence while introducing meaningful variability across generated examples (Fig. 1).
Fig. 1
Prompt and label sharing augmentation
The prompt structure is carefully designed to achieve this balance, consisting of four specific components. First, we establish the role of the LLM, framing it as an expert with sociological insight, particularly in the context of hate speech detection, to ensure that the generated content is sensitive to the task. Second, we set a contextual condition that directs the LLM to interpret the original sample within the framework of the specific label, helping it capture the intended meaning and tone accurately. Third, we define the task, guiding the LLM to generate multiple samples based on the original text, while preserving the same informal slang, stylistic elements, and tone. The model is explicitly asked to produce a designated number of diverse outputs (n) that maintain the linguistic style yet vary in wording and phrasing.
Finally, we implement an output format requirement that forces the LLM to adhere to a structured format, facilitating automatic parsing and integration without the need for human review. This schema has been iteratively refined, evolving from simpler prompt versions based on human evaluation of sample quality and diversity. Each component of the prompt contributes to a cohesive system that directs the LLM to generate high-quality, contextually appropriate synthetic data aligned with the original dataset.
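The four-part prompt described above can be sketched as a small template builder. This is an illustrative reconstruction, not the exact prompt used in the paper; the wording of each component and the function name are assumptions.

```python
# Sketch of the four prompt components: role, contextual condition,
# task definition, and output format (wording is illustrative).
def build_prompt(samples, label_keyword, n):
    """Assemble a demonstration-based prompt.

    samples       -- original texts sharing the same label
    label_keyword -- e.g. "sexist", "hate speech", or "neutral"
    n             -- number of diverse outputs requested
    """
    role = ("You are an expert sociologist specialised in online "
            "hate speech detection.")                       # 1) role assignment
    context = (f"The texts below are examples of {label_keyword} "
               "content from social media.")                # 2) contextual condition
    task = (f"Write {n} new, diverse texts of the same class. Keep the "
            "informal slang, style, and tone, but vary the wording "
            "and phrasing.")                                # 3) task definition
    fmt = (f"Return exactly {n} outputs, one per line, "
           f"numbered 1 to {n}.")                           # 4) output format
    demos = "\n".join(f"- {s}" for s in samples)
    return "\n\n".join([role, context, task, fmt, "Examples:", demos])

prompt = build_prompt(["example tweet"], "sexist", 3)
```

The structured output format in component 4 is what allows the generated samples to be parsed automatically, without human review.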
3.2 Label-sharing augmentation
Given a dataset D composed of samples \((x_i, y_i)\), where each \(x_i\) represents the sample text and each \(y_i\) its corresponding label (e.g., a category such as hate speech or sexism), our goal is to generate synthetic text samples that closely align with the original dataset distribution while providing new semantic points of view. To achieve this, we utilize a generator G, which, in this case, is a large language model (LLM). The generator takes both the original sample text \(x_i\) and its label \(y_i\), embedded into a carefully crafted prompt, and outputs synthetic text samples \(x_i^s\).
The process of labeling the synthetic samples relies on an important assumption: if an original sample \((x_i, y_i)\) is representative of the underlying distribution of the dataset D, then the synthetic sample \((x_i^s, y_i)\) generated by G, when conditioned on \((x_i, y_i)\), should also belong to the same distribution. This assumption, which we refer to as label-sharing, means that the synthetic samples generated from the original inputs can be associated with the same labels as their originals. In other words, we expect the LLM to generate text that retains the semantic and stylistic features of the original class, preserving the label's relevance while introducing new linguistic variability (Table 1).
Table 1 Notation for the proposed method
Once the generator G has produced the synthetic samples, we construct a new dataset \(D_s\) by combining the original samples \((x_i, y_i)\) with their synthetic counterparts \((x_i^s, y_i)\). This expanded dataset, now containing both original and synthetic samples, is intended to improve the robustness and generalization of text classification models by exposing them to a broader range of variations that still align with the original semantics.
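The label-sharing construction of \(D_s\) amounts to a short loop. In this minimal sketch, `generate` stands in for the LLM call and is an assumption; the point is only that every synthetic text inherits the label of the sample it was conditioned on.

```python
# Label-sharing augmentation: each synthetic sample x_i^s keeps the
# label y_i of the original sample it was generated from.
def augment(dataset, generate, n=3):
    """dataset: list of (text, label) pairs; returns D_s = D + synthetic pairs."""
    augmented = list(dataset)                       # keep the originals
    for text, label in dataset:
        for synthetic in generate(text, label, n):  # n outputs per sample
            augmented.append((synthetic, label))    # label-sharing assumption
    return augmented

# toy generator standing in for the LLM
fake_llm = lambda text, label, n: [f"{text} (variant {i})" for i in range(n)]
d_s = augment([("some tweet", "hate")], fake_llm)
# d_s now holds the original plus three synthetic samples, all labelled "hate"
```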
Our methodology is based on the premise that a well-designed prompt can effectively guide the LLM, through in-context learning, to generate samples that belong to the same semantic class as the one provided in the prompt. This is a significant assumption that we later demonstrate empirically; reaching it required two crucial steps in the experimental process:
- LLM benchmarking: We sought an LLM that, on one hand, was capable of generating hate speech content in an unsupervised manner—that is, one that would adhere to the label of the original sample on which it was based—and, on the other hand, could contribute distinctive elements so that the generated samples were not mere rephrasings, thereby introducing new semantic dimensions. We also tested other state-of-the-art LLMs such as Gemma (google/gemma-7bFootnote 2) and LLaMa3 (meta-llama/Meta-Llama-3-8BFootnote 3); however, these models were not even consistently capable of generating offensive content without triggering their moderation filters.
- Prompt refinement: The prompt design underwent several improvements. The most notable one came from introducing Role Assignment, which provided the LLM with specific context regarding how we wanted it to “think” when generating the sample. Additionally, specifying that we wanted the samples to retain the original writing style and/or slang proved to be a key differentiating element.
3.3 Training and evaluation
We train various transformer-based text classification models on both the original dataset D and the augmented dataset \(D_s\). For a fair comparison, we ensure that the same random seeds and pre-training states are used across all models. This setup allows us to isolate the impact of the synthetic data by controlling for other variables that could affect model performance. Finally, we compare the performance of these models on different hate speech datasets, analyzing whether the synthetic samples in \(D_s\) lead to improvements in classification accuracy and robustness over the original dataset D alone.
Our method is not designed to enhance generalization for models trained on large, high-quality datasets. Both empirical evidence and recent studies [32] indicate that real data generally surpasses synthetic data in quality, often resulting in either diminished performance or no noticeable improvement when synthetic data is added. Instead, our approach targets low-resource datasets, which are typically more sparse and contain semantic gaps that synthetic data can help fill. This focus is particularly crucial in the domain of hate speech detection, where positive samples are challenging to obtain, making data augmentation essential for improved model performance.
The objective of this study is to evaluate the impact of our augmentation method across multiple hate speech datasets, with a particular focus on simulating varying initial dataset sizes. To achieve this, we conduct a series of experiments ranging from extreme low-resource scenarios—with as few as 16 data points—to experiments using the entire dataset. Our primary aim is to provide a reliable augmentation method tailored for scenarios where data points are minimal.
For each experiment, we select 10 random seeds and vary training set sizes in powers of 2, up to each dataset's full size. Each training size is a random, class-balanced sample extracted from the original dataset. For each size, we average the training results across the different seeds, comparing the performance of the same classifier trained on the original dataset D and the augmented dataset \(D_s\). This approach allows us to quantitatively assess the impact of adding synthetic data across simulated scenarios with varying dataset sizes, giving insight into the method's effectiveness in low-resource conditions (see Algorithm 1).
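The evaluation protocol above can be sketched as follows. The `train_and_eval` callable (returning an F1-score for a given training subset) is a stand-in for the full fine-tuning pipeline; only the class-balanced sampling and per-size averaging are shown.

```python
import random
from statistics import mean

def balanced_subset(dataset, size, seed):
    """Draw a class-balanced random sample of `size` (text, label) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in dataset:
        by_label.setdefault(y, []).append(x)
    subset = []
    for y, xs in by_label.items():
        subset += [(x, y) for x in rng.sample(xs, size // len(by_label))]
    return subset

def evaluate_sizes(dataset, sizes, seeds, train_and_eval):
    """Average the metric across seeds for every training-set size.
    Comparing D and D_s means calling this once per training set."""
    return {size: mean(train_and_eval(balanced_subset(dataset, size, seed))
                       for seed in seeds)
            for size in sizes}
```

The seed controls both the subset selection and (in the real pipeline) the classification head's initialization, so averaging over seeds smooths out both sources of variance.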
Algorithm 1
The pseudo-code for generation, training and evaluation.
4 Experimental setup
The following section details the hate speech datasets used to evaluate our method, along with the specific techniques and parameters employed for generating synthetic samples.
4.1 Datasets
We focus on hate speech detection, employing four well-curated datasets. This selection is critical because, in demonstration-based prompting, the quality of synthetic samples depends heavily on accurate labeling.
Call Me Sexist But... (CMSB) [[41](/article/10.1007/s00607-025-01518-8#ref-CR41 "Samory M, Sen I, Kohne J, Flöck F, Wagner C (2021) “Call me sexist, but...” : revisiting sexism detection using psychological scales and adversarial samples. In: Proceedings of the international AAAI conference on web and social media, vol 15. Association for the Advancement of Artificial Intelligence (AAAI), pp 573–584. https://doi.org/10.1609/icwsm.v15i1.18085
")\]—This binary classification dataset comprises sexist and neutral samples sourced from social media. The dataset authors implemented a detailed taxonomy or _codebook_ to ensure labeling accuracy, using it to verify inter-annotator agreement on the _Amazon MTurk platform_. _CMSB_ is particularly challenging due to its inclusion of adversarial samples, which were created by removing intent from sexist examples to test model robustness.ETHOS [[45](/article/10.1007/s00607-025-01518-8#ref-CR45 "Mollas I, Chrysopoulou Z, Karlos S, Tsoumakas G (2022) ETHOS: an online hate speech detection dataset. Complex Intell Syst 8(6):4663–4678. https://doi.org/10.1007/s40747-021-00608-2
arXiv:2006.08328
[cs, stat]")\]—_ETHOS_ is available in both binary and multi-label formats. It includes diverse samples from _Reddit_ that contain various forms of hate speech, including racism, sexism, and attacks based on religion, ethnicity, nationality, disability, sexual orientation, and gender identity. The binary version poses a particular challenge as the label broadly indicates the presence of any of these hate speech types, resulting in significant topic variability.

Stormfront [[46](/article/10.1007/s00607-025-01518-8#ref-CR46 "de Gibert O, Perez N, García-Pablos A, Cuadros M (2018) Hate speech dataset from a white supremacy forum. In: Fišer D, Huang R, Prabhakaran V, Voigt R, Waseem Z, Wernimont J (eds) Proceedings of the 2nd workshop on abusive language online (ALW2). Association for Computational Linguistics, Brussels, pp 11–20. https://doi.org/10.18653/v1/W18-5102
")\]—This hate speech dataset comprises English samples from _Stormfront_, a white supremacist online forum. Developed for sentence-level annotation, the dataset captures explicit forms of hate speech and includes categories like “HATE”, “NOHATE” and a “RELATION” label, which marks content that depends on multiple sentences to convey hate. Through strict guidelines and a web-based annotation tool, annotators labeled over 10, sentences, achieving a decent inter-annotator agreement.Antiasian (COVID-HATE) [[47](/article/10.1007/s00607-025-01518-8#ref-CR47 "He B, Ziems C, Soni S, Ramakrishnan N, Yang D, Kumar S (2021) racism is a virus: anti-Asian hate and counterspeech in social media during the COVID-19 crisis. arXiv. doi:1048550/arXiv.2005.12423
")\]—This dataset, assembled to study anti-Asian hate and counter-speech on _Twitter_ during the _COVID-19_ pandemic, contains over 206 million tweets spanning 14 months. The authors used a keyword-based approach to identify hate, counter-speech, and neutral tweets, resulting in a hand-labeled subset of 3,355 tweets with high inter-annotator agreement.4.2 Augmentation baselines
In addition to vanilla training—where models are trained without any data augmentation—we selected five different augmentation baselines for comparison. For each baseline, we generate an identical number of augmented samples (\(n=3\)).
BackTranslation [[15](/article/10.1007/s00607-025-01518-8#ref-CR15 "Sennrich R, Haddow B, Birch A (2016) Improving neural machine translation models with monolingual data. In: Erk K, Smith NA (eds) Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Berlin, pp 86–96. https://doi.org/10.18653/v1/P16-1009
")\]—We include BackTranslation as a baseline since it is one of the most widely used text augmentation techniques. Specifically, we use the official implementation from the googletrans library,[Footnote 4](#Fn4) applying three different language pairs: (1) English-German-English, (2) English-Spanish-English, and (3) English-Chinese-English. This approach generates three unique variations of each sample by translating to an intermediate language and back to English.NLPAug [[48](/article/10.1007/s00607-025-01518-8#ref-CR48 "Ma E (2019) NLP augmentation. https://github.com/makcedward/nlpaug
")\]—NLPAug is a comprehensive library offering various traditional character- and word-based augmentation techniques,[Footnote 5](#Fn5) as well as context-sensitive, encoder-based methods that modify words based on their surrounding context. We selected nine augmentation algorithms from this library, divided into three categories: - _Character augmentation_—We use four methods that apply changes at the character level: delete, insert, swap, and substitute.
- _Word augmentation_—Three word-level techniques are included: delete, swap, and substitute, with the latter performing synonym replacement using a lexical database.
- _Encoder-based augmentation_—We use two BERT-based methods [27]: insert, which injects a new word at a random position based on contextual embeddings, and substitute, which masks and replaces words using BERT embeddings.
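To make the character- and word-level operations concrete, here are toy, single-edit versions of three of them. These are simplified illustrations, not the NLPAug implementations, which expose many more options (edit rates, protected tokens, etc.).

```python
import random

# Toy versions of NLPAug-style operations: one random edit per call.
def char_delete(text, rng):
    """Remove one randomly chosen character."""
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def char_swap(text, rng):
    """Swap two adjacent, randomly chosen characters."""
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def word_delete(text, rng):
    """Remove one randomly chosen word."""
    words = text.split()
    words.pop(rng.randrange(len(words)))
    return " ".join(words)

rng = random.Random(0)
augmented = [op("you are all the same", rng)
             for op in (char_delete, char_swap, word_delete)]
```

Because these edits are purely surface-level, they preserve the original data distribution closely, which is exactly the property discussed later when comparing traditional methods to LLM-based generation.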
For a fair comparison, we decompose these methods into three different baselines: (1) NLPAugMax, an ensemble of the seven character- and word-level augmentation methods in which, at each evaluation point, we select the top-3 metrics of the whole set and average them; (2) InsBERT, the BERT-based insert contextual augmentation; and (3) SubBERT, the corresponding substitute operation.
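Our reading of the NLPAugMax ensemble can be expressed in a few lines: at each evaluation point, keep the three best of the seven character/word method scores and average them. The method names and scores below are illustrative.

```python
from statistics import mean

def nlpaug_max(scores):
    """scores: dict mapping each of the seven character/word augmentation
    methods to its metric at the current evaluation point. Returns the
    average of the three best, per the NLPAugMax ensemble."""
    top3 = sorted(scores.values(), reverse=True)[:3]
    return mean(top3)

f1 = nlpaug_max({"char_del": 0.61, "char_ins": 0.58, "char_swap": 0.60,
                 "char_sub": 0.63, "word_del": 0.59, "word_swap": 0.62,
                 "word_sub": 0.64})  # averages 0.64, 0.63, and 0.62
```

Note that this makes NLPAugMax an optimistic (oracle-like) baseline, since the best methods are picked per evaluation point rather than fixed in advance.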
LLM rephrasing [30]—Since our method is LLM-based, we include AugGPT as a comparable baseline, which prompts the LLM to rephrase original sentences rather than generating varied and diverse samples. As the original authors did not specify the sampling mechanisms used, we generate each rephrased sample n times using the same prompt, setting a moderately high temperature of 1.2 to introduce variability in the augmented outputs.
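The rephrasing baseline amounts to issuing the same minimal prompt n times, so that all variability comes from token sampling rather than from the prompt itself. In this sketch, `sample_llm` is a stand-in for the actual stochastic LLM call (run at temperature 1.2 in our experiments).

```python
# AugGPT-style baseline sketch: one fixed rephrasing prompt, sampled n
# times; `sample_llm` stands in for the temperature-1.2 LLM call.
def rephrase_n(text, n, sample_llm):
    prompt = f"Please, rephrase this text: {text}"
    return [sample_llm(prompt) for _ in range(n)]

variants = rephrase_n("some tweet", 3, lambda p: p.upper())
```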
4.3 Model selection
For our augmentation method, we utilize the Mistral-7B-Instruct-v0.2 language model [[10](/article/10.1007/s00607-025-01518-8#ref-CR10 "Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, Casas D, Bressand F, Lengyel G, Lample G, Saulnier L, Lavaud LR, Lachaux M-A, Stock P, Scao TL, Lavril T, Wang T, Lacroix T, Sayed WE (2023) Mistral 7B. arXiv. doi:10.48550/arXiv.2310.06825
")\], available via the _Hugging Face_ platform.[Footnote 6](#Fn6) This model was chosen for two key reasons: first, despite its relatively modest size, it demonstrates performance comparable to much larger models like GPT\\(-\\)3.5 \[[49](/article/10.1007/s00607-025-01518-8#ref-CR49 "Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R (2022) Training language models to follow instructions with human feedback.
arXiv. doi:10.48550/arXiv.2203.02155
")\], offering a balance between computational efficiency and capability. Second, it operates without moderation mechanisms, allowing us to generate hate speech-related samples directly, without requiring complex jailbreaking techniques.To evaluate the impact of our augmentation method, we compare classifier performance on the original dataset and the augmented dataset. This comparison is conducted both before and after incorporating synthetic data. We use three state-of-the-art transformer encoders to build the text classifiers. The selected models include BERT (bert-base-uncased), RoBERTa (roberta-base), and DeBERTa (deberta-v3-base) [[27](/article/10.1007/s00607-025-01518-8#ref-CR27 "Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. https://doi.org/10.48550/arXiv.1810.04805
"), [50](/article/10.1007/s00607-025-01518-8#ref-CR50 "Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized bert pretraining approach"), [51](/article/10.1007/s00607-025-01518-8#ref-CR51 "He P, Liu X, Gao J, Chen W (2021) DeBERTa: decoding-enhanced BERT with disentangled attention. arXiv")\].4.4 Generation mechanisms
The effectiveness of our method is largely driven by the carefully designed prompt used to generate synthetic samples. For the CMSB dataset, we use the keyword sexist to define the positive class, while for ETHOS the keyword hate speech is employed, since different forms of hate speech coexist in that dataset. For Stormfront we choose white supremacist hate speech, and for Antiasian we choose antiasian discourse. For all datasets, the negative class is uniformly represented by the keyword neutral. These keywords guide the LLM to align the generated text with the semantic characteristics of the respective labels.
To ensure quality and variability in the generated samples, we employ two token sampling mechanisms during the LLM output:
- Typical sampling (typical_p) [52]—This method evaluates the conditional probabilities of predicted tokens against a randomness threshold, choosing tokens that are less likely but still contextually plausible. This approach enhances diversity by avoiding overly predictable outputs. For this parameter, we use a value of 0.8.Footnote 7
- Repetition penalty (repetition_penalty) [53]—This technique penalizes tokens that have already been generated, reducing the likelihood of repetitive outputs and encouraging the LLM to produce distinct samples. We apply a penalty factor of 1.2.Footnote 8
For temperature, we retain the default value of 1.0. We found that increasing the temperature while simultaneously altering the other parameters led to overly chaotic and inconsistent generations. Additionally, to optimize performance and reduce hardware demands, we quantize the LLM weights to 4 bits, enabling efficient inference on a single GPU.
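The two sampling mechanisms can be illustrated with toy, list-based implementations (library implementations operate on logit tensors, but the logic is the same). These sketches follow the published definitions: typical sampling keeps tokens whose surprisal is closest to the distribution's entropy, and the CTRL-style repetition penalty shrinks the logits of already-generated tokens.

```python
import math

def typical_filter(probs, typical_p=0.8):
    """Typical sampling: keep the tokens whose surprisal -log p is closest
    to the entropy, until their cumulative mass reaches typical_p."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    order = sorted((i for i, p in enumerate(probs) if p > 0),
                   key=lambda i: abs(-math.log(probs[i]) - entropy))
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= typical_p:
            break
    return sorted(kept)          # candidate token ids to sample from

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """CTRL-style penalty: divide positive logits (multiply negative ones)
    of already-generated tokens by the penalty factor."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out
```

On a uniform distribution every token is exactly "typical", so none is filtered; on a skewed one, only the tokens near the entropy survive, which is what discourages both overly predictable and implausible continuations.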
4.5 Metrics and hyperparameters
To compare generalization across methods, we use the F1-score, calculated against a dedicated test set drawn exclusively from the original dataset. This test set is used solely for evaluation purposes and does not influence the generation process (as noted in Sect. 2.4).
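For reference, the binary F1-score used throughout is the harmonic mean of precision and recall over the positive class:

```python
# Plain binary F1-score (the evaluation metric used throughout).
def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```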
For the classification task, all models share a consistent architecture, differing only in the encoder used. The pooled output from the encoder feeds into a binary classification head, consisting of a single linear layer. We train these models using the AdamW optimizer, running for a maximum of 30 epochs. The learning rates are set to 1e-5 for the encoders and 1e-3 for the classification head, providing an effective balance for fine-tuning and classification.
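The two learning rates map naturally onto optimizer parameter groups. This is a minimal sketch of that setup; the `classifier.` prefix is an assumption about the head's parameter names, and parameter objects are stubbed as plain strings rather than tensors.

```python
# Sketch of the two-rate setup as AdamW-style parameter groups:
# 1e-5 for the encoder, 1e-3 for the binary classification head.
def param_groups(named_params, head_prefix="classifier."):
    encoder = [p for n, p in named_params if not n.startswith(head_prefix)]
    head = [p for n, p in named_params if n.startswith(head_prefix)]
    return [{"params": encoder, "lr": 1e-5},   # encoder fine-tuning rate
            {"params": head, "lr": 1e-3}]      # classification head rate

groups = param_groups([("encoder.layer.0.weight", "w0"),
                       ("classifier.weight", "w1")])
# groups[0] holds encoder params at 1e-5, groups[1] the head at 1e-3
```

The lower encoder rate keeps the pre-trained representations mostly intact, while the freshly initialized head is allowed to move two orders of magnitude faster.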
4.6 Equipment and timing
All generation and training processes were conducted using a 24GB VRAM NVIDIA GeForce RTX 3090 GPU. Training durations varied depending on the dataset size, ranging from a few minutes for smaller subsets to over an hour for larger ones. In total, the experimental setup encompassed 4 datasets, 3 encoder models, 13 augmentation techniques, approximately 10 different sub-sample sizes, and 10 random seeds per size. This resulted in approximately 2600 individual training sessions, collectively requiring around 55 compute days.
For data generation, the optimized Mistral-7B-Instruct-v0.2 model demonstrated an efficient processing speed of roughly 6 s per sample. Given the training dataset sizes—approximately 8k samples for CMSB, 2k samples for Stormfront, and 500 samples each for ETHOS and _Antiasian_—this translated to an average generation time of around 25 h per prompt across all tested configurations. This setup underscores the computational efficiency and scalability of our approach, even when applied to extensive experimental scenarios.
5 Results
In this work we analyze the impact of synthetic data generated by LLMs on datasets of varying sizes, with a particular focus on assessing their potential to improve generalization in low-resource scenarios. To achieve this, we train different text classification models built on transformer encoder architectures, comparing their effectiveness across a range of dataset sizes. These sizes span from extremely limited subsets—representing highly constrained, low-resource conditions—to the complete dataset, providing a comprehensive view of how synthetic data augmentation impacts classification performance.
It is true that LLMs possess strong zero-shot classification capabilities and could potentially be used for hate speech detection. However, our focus is on enhancing dedicated models that are smaller, more efficient, and faster, as they remain the preferable option for deployment when appropriately fine-tuned [[54](/article/10.1007/s00607-025-01518-8#ref-CR54 "Bucher MJJ, Martini M (2024) Fine-tuned “small” llms (still) significantly outperform zero-shot generative ai models in text classification ( arXiv:2406.08660
)
https://doi.org/10.48550/arXiv.2406.08660
.
arXiv:2406.08660
")\]. Moreover, during our experiments, we attempted to use other LLMs to label the generated samples for filtering purposes, but we found that in a zero-shot setting the frequency of errors or refusals was high enough to discard this approach.The training samples for each dataset size are selected randomly from the original training sets, maintaining equal representation across classes. This study does not address data imbalance; instead, we focus on datasets where class equality is preserved for clarity in evaluation. To ensure reliable and reproducible results, each dataset size is evaluated through multiple training runs using 10 different random seeds. These seeds influence two critical factors: (1) the initialization of the classification layer’s weights and (2) the random selection of training subsets on which augmentation techniques are applied. Finally, the validation and test sets are the original ones, with the original size for each dataset.
For data augmentation, we apply a fixed augmentation factor of 3, meaning that for every original sample, three synthetic samples are generated across all augmentation methods. This augmentation process is consistently applied to all selected subsets, ensuring that the impact of synthetic data is measured uniformly across methods and dataset sizes. By systematically varying the dataset size and applying multiple augmentation techniques, we aim to quantify the effectiveness of LLM-generated synthetic data in improving the robustness and generalization of text classification models, particularly in low-resource conditions. The training and generation techniques are detailed in Sect. 4.
5.1 Impact of semantic breadth on data augmentation effectiveness
The results demonstrate that while our method may not achieve the highest performance in every proposed scenario, it stands out as the most consistent. Across nearly all datasets and configurations, incorporating our generated data consistently leads to improvements over the vanilla baseline. This consistency underscores the robustness of our approach in enhancing model generalization, particularly in low-resource scenarios. However, our method is not always the optimal choice, with the LLM-Rephrasing approach emerging as a strong competitor in certain contexts, such as low-resource settings in CMSB (Table 2) and Antiasian (Table 4).
The reason for this lies in the specific nature of these datasets. Both CMSB and Antiasian have narrowly focused semantic scopes, dealing exclusively with sexism and anti-Asian hate speech, respectively. Unlike broader datasets such as ETHOS, which encompasses a wide range of hate speech categories (e.g., racism, sexism, and other forms of discrimination) and allows for more diverse semantic augmentation, these specific datasets already possess a dense and cohesive semantic structure. As a result, the introduction of novel semantic perspectives provides limited additional benefit. In these cases, it seems that simpler rephrasing techniques, which primarily modify sentence structure while preserving the dataset’s existing semantic density, can suffice to enhance model performance.
This finding highlights an important insight: the effectiveness of data augmentation is influenced not only by the method itself but also by the inherent characteristics of the dataset. For datasets with broad and diverse semantics, such as ETHOS, our method excels by generating varied samples that capture underrepresented patterns and nuances, effectively expanding the dataset’s semantic coverage (Table 3). However, for more focused datasets like CMSB and Antiasian, where the training and test data share a tightly defined semantic distribution, simpler rephrasing methods may be equally effective.
Table 2 CMSB averaged F1-scores across 10 random seeds for each encoder and sample size. B is BERT, R is RoBERTa and D is DeBERTa
Table 3 ETHOS averaged F1-scores across 10 random seeds for each encoder and sample size
5.2 Variability comes at a cost
The introduction of synthetic data generated by LLMs has proven to be a pivotal advancement in low-resource scenarios. By leveraging LLMs to produce novel, semantically rich samples, we address the deficiencies that arise from a scarcity of real data. Our results consistently confirm that demonstration-based generation strategies significantly enhance the performance of text classifiers in low-resource settings, provided that prompts are meticulously crafted to ensure both the quality and variability of the generated samples.
Interestingly, this advantage is not confined to low-resource scenarios. Our method also yields performance improvements when substantial amounts of real data are available, though these gains are less consistent and generally more modest. For instance, in the CMSB dataset, we observe a notable average gain of \(+0.025\) in F1-score for large data samples within the [512–2048] size range (Table 2). However, this improvement pales in comparison to the substantial impact observed in limited scenarios, where gains reach \(+0.068\) in the [16–96] range and \(+0.053\) in the [96–512] range. A similar trend is evident in the ETHOS dataset. Performance gains average \(+0.031\) in F1-score for the [16–48] range and \(+0.028\) for the [48–128] range in low-resource settings, compared to a smaller gain of \(+0.015\) in the [128–500] range as the proportion of real data increases. In the Antiasian dataset, we observe a slight variation of \(+0.014\) in F1-score within the [128–500] range, while the margin is significantly higher in extreme conditions, such as a gain of \(+0.107\) in the [16–48] range (see Table 4). Similarly, for the Stormfront dataset, the gains are \(+0.053\) in the [16–96] range and diminish to \(+0.011\) in the [384–1536] range (Table 5).
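The per-window gains above are obtained by averaging, over the subset sizes falling inside a size range, the F1 difference between augmented and vanilla training. A small sketch, with illustrative numbers rather than the paper's actual scores:

```python
from statistics import mean

def windowed_gain(vanilla, augmented, window):
    """Average F1 gain of augmented over vanilla training for the
    subset sizes falling inside `window` = (lo, hi).
    vanilla/augmented map subset size -> averaged F1-score."""
    lo, hi = window
    sizes = [s for s in vanilla if lo <= s <= hi]
    return mean(augmented[s] - vanilla[s] for s in sizes)

gain = windowed_gain({16: 0.50, 32: 0.52, 512: 0.68},
                     {16: 0.57, 32: 0.58, 512: 0.70}, (16, 96))
# averages the +0.07 and +0.06 gains inside the [16, 96] window
```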
Table 4 Antiasian averaged F1-scores across 10 random seeds for each encoder and sample size
These observations suggest that while our method provides benefits in larger datasets, the magnitude of improvement diminishes as the amount of real data increases. Traditional augmentation methods, which minimally alter the original text, tend to preserve the original data distribution more faithfully. This preservation ensures that the augmented data remains stylistically and semantically consistent with the training set. In contrast, our method introduces slight deviations in style without fine-tuning the LLM, leading to potential shifts in the data distribution as the dataset scales up. While this variability is advantageous in low-resource scenarios—where it helps fill semantic gaps and introduces diversity—it may be less beneficial in larger datasets where the original distribution is already well-represented. Even so, when performance is averaged across encoders, our method remains the strongest augmentation option in every scenario (Table 5).
Table 5 Averaged F1-scores across the encoder performances for each window size
Traditional methods excel in maintaining fidelity to the original dataset, making them more suitable when the dataset is extensive and comprehensive. Our method, however, shines in scenarios where diversity and novel instances are most needed, particularly in low-resource settings. Despite the diminishing returns in high-resource scenarios, it’s noteworthy that our approach still demonstrates consistent improvements when averaging performance across different classifiers. This consistency indicates that, although the benefits may be less pronounced, our method remains a valuable tool for enhancing model performance across various data regimes (Table 4).
5.3 Encoder-specific responses to synthetic data augmentation
Our analysis indicates that the impact of synthetic data augmentation varies significantly across different encoders. Each model—BERT, RoBERTa, and _DeBERTa_—responds uniquely to the introduction of synthetic data, with noticeable differences in performance consistency and sensitivity to data volume.
RoBERTa exhibits higher variance in performance when augmented with synthetic data, resulting in less consistent improvements compared to the other encoders. This inconsistency is evident across multiple datasets and configurations. For instance, in the CMSB dataset (Table 2), while our method improves RoBERTa's F1-score from 0.497 to 0.563 at the smallest data size [16–96], the gains are less pronounced or even marginal at larger data sizes. In the [512–2048] range, the F1-score increases only slightly from 0.686 to 0.687, with NLPAugMax being a better option. Moreover, in high-resource Antiasian scenarios, traditional augmentation techniques also yield better metrics.
This high variability suggests that RoBERTa may be more sensitive to the introduction of synthetic data. The fluctuations in performance could be attributed to RoBERTa's pre-training objectives and architectural nuances, which might make it more susceptible to shifts in data distribution caused by synthetic samples. This sensitivity underscores the need for careful consideration when applying data augmentation techniques to RoBERTa, as the benefits may not be as consistently realized as with other encoders.
In contrast, DeBERTa tends to shine more prominently in high-resource scenarios. As the amount of real data increases, DeBERTa's performance improvements become more substantial. For instance, in the CMSB dataset (Table 2), DeBERTa's F1-score with our method jumps from 0.610 in the vanilla setting to 0.679 at the largest data size [512–2048]. This significant gain indicates that DeBERTa effectively leverages the abundance of data, including the synthetic samples, to enhance its performance.
A similar trend is observed in the Stormfront dataset (Table 6). DeBERTa shows a considerable improvement at the [384–1536] data size when augmented with our method, achieving an F1-score of 0.801 compared to 0.775 in the vanilla setting and 0.777 with the LLM-Rephrasing counterpart. This pattern suggests that DeBERTa's advanced architectural features—such as disentangled attention mechanisms and improved parametrization—allow it to capture complex patterns more effectively as the dataset grows.
Table 6 Stormfront averaged F1-scores across 10 random seeds for each encoder and sample size
For BERT, the performance improvements are generally more consistent across different data sizes and augmentation methods, positioning it as a reliable baseline. However, it does not exhibit the same level of enhancement in high-resource scenarios as DeBERTa, nor does it show the same degree of variability as RoBERTa.
These observations highlight the importance of tailoring data augmentation strategies to specific encoder architectures. For RoBERTa, the higher variance in performance indicates a need for more carefully curated synthetic data or perhaps alternative augmentation techniques that align more closely with its learning characteristics. On the other hand, DeBERTa appears to benefit more from synthetic data in high-resource settings, making it a strong candidate when large datasets—including augmented samples—are available (Table 5).
5.4 The critical role of prompt design in synthetic data generation
The design of prompts used to generate synthetic data with LLMs plays a pivotal role in the quality and effectiveness of the augmented data. In our study, we compare our meticulously crafted prompt—which has undergone manual quality and variability assessments—to a simpler prompt focused solely on rephrasing the original instances, referred to as LLM-Rephrasing. This prompt is just: “Please, rephrase this text: {text}” [30]. The performance comparison between these two approaches varies significantly across different datasets, highlighting the importance of prompt design in data augmentation.
In the CMSB dataset (Table 2), simple rephrasing appears to perform on par with our more elaborate prompt, especially in low-resource settings. For instance, using the BERT encoder with the smallest data size [16–96], the LLM-Rephrasing approach achieves an F1-score of 0.537, slightly surpassing our method’s score of 0.524. Similarly, with the DeBERTa encoder, LLM-Rephrasing attains an F1-score of 0.449 compared to our method’s 0.443. This marginal difference suggests that for datasets with a narrow semantic focus, like CMSB, simple rephrasing may suffice to enhance model performance.
A well-designed prompt guides the LLM to produce synthetic data that enriches the dataset, enabling the model to learn from a broader spectrum of examples. This is particularly important in low-resource settings, where the original data may not cover all the nuances and contexts relevant to the task. Manual review of the prompt and generated data further ensures that the synthetic samples are both relevant and diverse. By assessing the quality and variability of the outputs, we can refine the prompt to better align with the desired outcomes, enhancing the overall effectiveness of the data augmentation process.
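The manual quality and variability assessment described above can be complemented by simple automatic checks. The sketch below uses word-level Jaccard overlap to reject both off-topic and near-verbatim candidates; the thresholds are illustrative, not the values used in our experiments:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def filter_synthetic(original: str, candidates: list[str],
                     min_sim: float = 0.2, max_sim: float = 0.9) -> list[str]:
    """Keep candidates that stay on-topic (>= min_sim) but are not
    near-duplicates of the original or of each other (<= max_sim)."""
    kept: list[str] = []
    for cand in candidates:
        sim = jaccard(original, cand)
        if not (min_sim <= sim <= max_sim):
            continue  # off-topic or near-verbatim copy
        if any(jaccard(cand, k) > max_sim for k in kept):
            continue  # duplicate of an already kept sample
        kept.append(cand)
    return kept
```

A lexical filter of this kind is only a coarse proxy for semantic relevance, which is why manual review of both the prompt and a sample of the generated data remains part of the process.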
The varying effectiveness of simple rephrasing versus a more elaborate prompt suggests that data augmentation strategies should be tailored to the dataset characteristics and the specific hate speech subcategories present. For narrowly focused datasets such as CMSB, which predominantly capture hate speech targeting sexism, simple rephrasing is often sufficient. The existing data already spans the required semantic space, and the model benefits from exposure to different phrasings of the same concepts. In contrast, for broader or more diverse datasets like ETHOS, which include multiple hate speech subcategories—such as racism, sexism, or religious discrimination—a more elaborate prompt is essential. This approach generates semantically diverse data, enabling the model to generalize more effectively across varied contexts and to handle the nuances inherent in each hate speech category. Our detailed subcategory analysis further confirms that while simpler augmentation may yield competitive results in focused settings, our method’s adaptability leads to consistent improvements in both overall and subcategory-specific F1-scores. As summarized in Table 5, these trends underscore the importance of matching augmentation complexity to the semantic and categorical diversity of the hate speech present in the dataset.
6 Conclusion
In this paper, we propose an automated data augmentation method leveraging LLMs, specifically tailored for low-resource datasets in the domain of hate speech detection. Our study systematically evaluates the impact of LLM-generated synthetic data on different text classification architectures across datasets of varying sizes, ranging from extremely limited samples to those with more substantial data availability.
The results of our analysis demonstrate that our method consistently outperforms traditional augmentation techniques, such as character- and word-based manipulations, as well as rephrasing-based LLM methods. By focusing solely on the training set for data augmentation, our approach exhibits superior performance in low-resource scenarios, where the scarcity of quality data typically hinders model generalization. Furthermore, our method remains robust and competitive even when applied to datasets with moderate amounts of high-quality original data. The design of the prompt and token sampling strategies introduce variability while preserving semantic alignment, positioning our approach as a more reliable alternative to other state-of-the-art LLM-based techniques.
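The variability introduced by token sampling can be illustrated with a minimal sketch of temperature plus nucleus (top-p) sampling; the toy vocabulary, logits, and parameter values here are illustrative, not the settings used in our experiments:

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float = 1.0,
                 top_p: float = 0.9, rng=None) -> str:
    """Draw one next token via temperature + nucleus (top-p) sampling."""
    rng = rng or random.Random()
    # Temperature rescaling: >1 flattens the distribution (more variety),
    # <1 sharpens it (closer to the single most likely continuation).
    scaled = {tok: l / temperature for tok, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = sorted(((tok, math.exp(v) / z) for tok, v in scaled.items()),
                   key=lambda kv: kv[1], reverse=True)
    # Nucleus: keep the smallest prefix of tokens whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the nucleus and draw.
    r = rng.random() * mass
    acc = 0.0
    for tok, p in kept:
        acc += p
        if r <= acc:
            return tok
    return kept[-1][0]
```

Restricting sampling to the nucleus is what keeps generations semantically aligned with the original instance, while the residual randomness within it supplies the lexical variability that benefits low-resource training.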
However, as the size of the dataset grows and includes high-quality original samples, the benefits of synthetic augmentation diminish. This is likely due to slight deviations in the synthetic samples from the original data distribution, which can introduce noise and reduce classifier performance on larger-scale predictions. While these deviations are minor, they underscore the challenges inherent in maintaining perfect semantic fidelity in generated data.
The experimentation conducted in this study involved generating a fixed number of synthetic samples per original sample across all methods. Future work will explore the implications of varying the augmentation factor, particularly examining how generating larger volumes of synthetic data impacts both quality and classifier performance. Another line of future research will investigate the impact of synthetic augmentation on class imbalance, studying whether specific proportions of synthetic data can mitigate imbalance-related challenges. Moreover, we aim to analyze how increasing the number of original samples influences the quality and diversity of generated data, particularly in scenarios where the LLM has richer context to draw upon during generation. Lastly, we plan to adapt our method to other small LLMs, such as LLaMA 3 or Gemma, bypassing their moderation mechanisms if necessary. By addressing these areas, we hope to refine our method further, extending its applicability and effectiveness in both low-resource and balanced dataset scenarios. This continued exploration will enhance our understanding of how to best leverage LLMs for data augmentation, particularly in complex tasks such as hate speech detection, where nuanced and diverse datasets are critical for reliable model performance.
7 Ethical considerations
We fully acknowledge and share the concerns regarding the ethical implications of using unfiltered large language models (LLMs) for hate speech generation. The potential for misuse of synthetic hate speech content is significant, and we disassociate ourselves from any malicious applications of this technology.
Our approach is intended solely to enhance the robustness of hate speech detection models by providing additional, diverse training data, thereby contributing to safer and more effective content moderation systems. The synthetic data generated through our method is used exclusively in controlled experimental settings for research purposes.
To mitigate risks, all generated content is stored and processed in secure environments with restricted access, and it is not used outside the scope of academic research. In support of transparency and reproducibility, we include in the Appendix a representative example of the prompt used to guide the generation process.
Our work does not aim to normalize or trivialize hate speech, but rather to support the development of more resilient automated moderation systems. All procedures followed institutional ethical guidelines for research involving sensitive content.
References
- Ştefăniţă O, Buf D-M (2021) Hate speech in social media and its effects on the LGBT community: a review of the current research. Roman J Commun Public Relat 23:47–55. https://doi.org/10.21018/rjcpr.2021.1.322
- Das M, Mathew B, Saha P, Goyal P, Mukherjee A (2020) Hate speech in online social media. ACM SIGWEB Newsl 2020(Autumn):1–8. https://doi.org/10.1145/3427478.3427482
- Lopez-Sanchez M, Müller A (2021) On simulating the propagation and countermeasures of hate speech in social networks. Appl Sci 11(24):12003. https://doi.org/10.3390/app112412003
- Mathew B, Dutt R, Goyal P, Mukherjee A (2019) Spread of hate speech in online social media. In: Proceedings of the 10th ACM conference on web science, pp 173–182. https://doi.org/10.1145/3292522.3326034
- Ilan T, Vilenchik D (2022) HARALD: Augmenting hate speech data sets with real data. In: Goldberg Y, Kozareva Z, Zhang Y (eds) Findings of the association for computational linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, pp 2241–2248. https://doi.org/10.18653/v1/2022.findings-emnlp.165
- Alkiviadou N (2019) Hate speech on social media networks: towards a regulatory framework? Inf Commun Technol Law 28(1):19–35. https://doi.org/10.1080/13600834.2018.1494417
- Murtfeldt R, Alterman N, Kahveci I, West JD (2024) RIP Twitter API: a eulogy to its vast research contributions. arXiv
- Feng SY, Gangal V, Wei J, Chandar S, Vosoughi S, Mitamura T, Hovy E (2021) A survey of data augmentation approaches for NLP. arXiv
- Mohamadi S, Mujtaba G, Le N, Doretto G, Adjeroh DA (2023) ChatGPT in the age of generative AI and large language models: a concise survey. arXiv:2307.04251v2
- Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, Casas D, Bressand F, Lengyel G, Lample G, Saulnier L, Lavaud LR, Lachaux M-A, Stock P, Scao TL, Lavril T, Wang T, Lacroix T, Sayed WE (2023) Mistral 7B. arXiv. https://doi.org/10.48550/arXiv.2310.06825
- Perez L, Wang J (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv. https://doi.org/10.48550/arXiv.1712.04621
- Cubuk ED, Zoph B, Shlens J, Le Q (2020) RandAugment: practical automated data augmentation with a reduced search space. In: Advances in neural information processing systems, vol 33. Curran Associates, Inc., pp 18613–18624
- Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. arXiv. https://doi.org/10.48550/arXiv.2006.11239
- Praveen Gujjar J, Prasanna Kumar HR, Guru Prasad MS (2023) Advanced NLP framework for text processing. In: 2023 6th international conference on information systems and computer networks (ISCON), pp 1–3. https://doi.org/10.1109/ISCON57294.2023.10112058
- Sennrich R, Haddow B, Birch A (2016) Improving neural machine translation models with monolingual data. In: Erk K, Smith NA (eds) Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers). Association for Computational Linguistics, Berlin, pp 86–96. https://doi.org/10.18653/v1/P16-1009
- Wei J, Zou K (2019) EDA: easy data augmentation techniques for boosting performance on text classification tasks. In: Inui K, Jiang J, Ng V, Wan X (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, pp 6382–6388. https://doi.org/10.18653/v1/D19-1670
- Belinkov Y, Bisk Y (2018) Synthetic and natural noise both break neural machine translation. arXiv. https://doi.org/10.48550/arXiv.1711.02173
- Yang H, Li K (2023) Boosting text augmentation via hybrid instance filtering framework. In: Rogers A, Boyd-Graber J, Okazaki N (eds) Findings of the association for computational linguistics: ACL 2023. Association for Computational Linguistics, Toronto, pp 1652–1669. https://doi.org/10.18653/v1/2023.findings-acl.105
- Pavlick E, Rastogi P, Ganitkevitch J, Van Durme B, Callison-Burch C (2015) PPDB 2.0: better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In: Zong C, Strube M (eds) Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: short papers). Association for Computational Linguistics, Beijing, pp 425–430. https://doi.org/10.3115/v1/P15-2070
- Mohammad F, Khan M, Nawaz Khan Marwat S, Jan N, Gohar N, Bilal M, Al-Rasheed A (2023) Text augmentation-based model for emotion recognition using transformers. Comput Mater Contin 76(3):3523–3547. https://doi.org/10.32604/cmc.2023.040202
- Tarján B, Szaszák G, Fegyó T, Mihajlik P (2020) Deep transformer based data augmentation with subword units for morphologically rich online ASR. arXiv
- Longpre S, Wang Y, DuBois C (2020) How effective is task-agnostic data augmentation for pretrained transformers? In: Findings of the association for computational linguistics: EMNLP 2020. Association for Computational Linguistics, pp 4401–4411. https://doi.org/10.18653/v1/2020.findings-emnlp.394
- Schmidt RM (2019) Recurrent neural networks (RNNs): a gentle introduction and overview. arXiv. https://doi.org/10.48550/arXiv.1912.05911
- Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–80. https://doi.org/10.1162/neco.1997.9.8.1735
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2023) Attention is all you need. arXiv. https://doi.org/10.48550/arXiv.1706.03762
- Kobayashi S (2018) Contextual augmentation: data augmentation by words with paradigmatic relations. In: Walker M, Ji H, Stent A (eds) Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: human language technologies, volume 2 (short papers). Association for Computational Linguistics, New Orleans, pp 452–457. https://doi.org/10.18653/v1/N18-2072
- Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. https://doi.org/10.48550/arXiv.1810.04805
- Kumar V, Choudhary A, Cho E (2021) Data augmentation using pre-trained transformer models. arXiv
- Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv. https://doi.org/10.48550/arXiv.1412.3555
- Dai H, Liu Z, Liao W, Huang X, Cao Y, Wu Z, Zhao L, Xu S, Liu W, Liu N, Li S, Zhu D, Cai H, Sun L, Li Q, Shen D, Liu T, Li X (2023) AugGPT: leveraging ChatGPT for text data augmentation. arXiv
- Yoo KM, Park D, Kang J, Lee S-W, Park W (2021) GPT3Mix: leveraging large-scale language models for text augmentation. arXiv
- Li Z, Zhu H, Lu Z, Yin M (2023) Synthetic data generation with large language models for text classification: potential and limitations. arXiv
- Ye J, Gao J, Li Q, Xu H, Feng J, Wu Z, Yu T, Kong L (2022) ZeroGen: efficient zero-shot learning via dataset generation. arXiv
- Wu T, Ribeiro MT, Heer J, Weld D (2021) Polyjuice: generating counterfactuals for explaining, evaluating, and improving models. In: Zong C, Xia F, Li W, Navigli R (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers). Association for Computational Linguistics, pp 6707–6723. https://doi.org/10.18653/v1/2021.acl-long.523
- Dixit T, Paranjape B, Hajishirzi H, Zettlemoyer L (2022) CORE: a retrieve-then-edit framework for counterfactual data generation. arXiv
- Chen Z, Gao Q, Bosselut A, Sabharwal A, Richardson K (2023) DISCO: distilling counterfactuals with large language models. arXiv
- Kruschwitz U, Schmidhuber M (2024) LLM-based synthetic datasets: applications and limitations in toxicity detection. In: Kumar R, Ojha AK, Malmasi S, Chakravarthi BR, Lahiri B, Singh S, Ratan S (eds) Proceedings of the fourth workshop on threat, aggression & cyberbullying @ LREC-COLING-2024. ELRA and ICCL, Torino, pp 37–51. https://aclanthology.org/2024.trac-1.6/
- Pendzel S, Wullach T, Adler A, Minkov E (2023) Generative AI for hate speech detection: evaluation and findings. arXiv:2311.09993. https://doi.org/10.48550/arXiv.2311.09993
- Zelikman E, Wu Y, Mu J, Goodman ND (2022) STaR: bootstrapping reasoning with reasoning. arXiv
- Konen K, Jentzsch S, Diallo D, Schütt P, Bensch O, Baff RE, Opitz D, Hecking T (2024) Style vectors for steering generative large language models. arXiv
- Samory M, Sen I, Kohne J, Flöck F, Wagner C (2021) “Call me sexist, but...” : revisiting sexism detection using psychological scales and adversarial samples. In: Proceedings of the international AAAI conference on web and social media, vol 15. Association for the Advancement of Artificial Intelligence (AAAI), pp 573–584. https://doi.org/10.1609/icwsm.v15i1.18085
- Khullar A, Nkemelu D, Nguyen VC, Best ML (2024) Hate speech detection in limited data contexts using synthetic data generation. ACM J Comput Sustain Soc 2(1):4–1418. https://doi.org/10.1145/3625679
- Hartvigsen T, Gabriel S, Palangi H, Sap M, Ray D, Kamar E (2022) ToxiGen: a large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv
- Girón A, Collell G, Hassan F, Huertas-Tato J, Camacho D (2025) Low-resource dataset synthetic generation for hate speech detection. In: Barhamgi M, Wang H, Wang X, Aïmeur E, Mrissa M, Chikhaoui B, Boukadi K, Grati R, Maamar Z (eds) Web information systems engineering—WISE 2024 PhD symposium, demos and workshops. Springer, Singapore, pp 75–89
- Mollas I, Chrysopoulou Z, Karlos S, Tsoumakas G (2022) ETHOS: an online hate speech detection dataset. Complex Intell Syst 8(6):4663–4678. https://doi.org/10.1007/s40747-021-00608-2. arXiv:2006.08328 [cs, stat]
- de Gibert O, Perez N, García-Pablos A, Cuadros M (2018) Hate speech dataset from a white supremacy forum. In: Fišer D, Huang R, Prabhakaran V, Voigt R, Waseem Z, Wernimont J (eds) Proceedings of the 2nd workshop on abusive language online (ALW2). Association for Computational Linguistics, Brussels, pp 11–20. https://doi.org/10.18653/v1/W18-5102
- He B, Ziems C, Soni S, Ramakrishnan N, Yang D, Kumar S (2021) Racism is a virus: anti-Asian hate and counterspeech in social media during the COVID-19 crisis. arXiv. https://doi.org/10.48550/arXiv.2005.12423
- Ma E (2019) NLP augmentation. https://github.com/makcedward/nlpaug
- Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R (2022) Training language models to follow instructions with human feedback. arXiv. https://doi.org/10.48550/arXiv.2203.02155
- Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv
- He P, Liu X, Gao J, Chen W (2021) DeBERTa: decoding-enhanced BERT with disentangled attention. arXiv
- Meister C, Pimentel T, Wiher G, Cotterell R (2023) Locally typical sampling. arXiv
- Keskar NS, McCann B, Varshney LR, Xiong C, Socher R (2019) CTRL: a conditional transformer language model for controllable generation. arXiv
- Bucher MJJ, Martini M (2024) Fine-tuned “small” LLMs (still) significantly outperform zero-shot generative AI models in text classification. arXiv:2406.08660. https://doi.org/10.48550/arXiv.2406.08660