ShieldHead: Decoding-time Safeguard for Large Language Models
Abstract
In light of the widespread deployment of Large Language Models (LLMs), the responsibility for safeguarding and regulating LLM-generated content has taken on heightened significance. Recent LLM-based moderation methods, e.g., LlamaGuard, have shown remarkable promise in identifying safety risks in both the inputs and the outputs of human-AI interactions. However, integrating an LLM-based safeguard into a chatbot system requires an additional inference stage with a moderation LLM of billions of parameters, which significantly increases computational cost and reduces overall efficiency. In this paper, we demonstrate that simply learning a classification head on the last-layer hidden states of the dialogue model provides a strong capability to identify harmful content. The classification head, referred to as ShieldHead, serves as an auxiliary branch in parallel with the next-token-prediction LM head, enabling the detection of potential risks in past text sequences. Additionally, a label disambiguation technique is employed to supervise ShieldHead with both token-level and sentence-level labels, which further enhances its performance. ShieldHead is highly efficient during inference, providing real-time moderation results alongside token-wise streaming output during the chatbot system’s decoding phase. Extensive experimental results demonstrate the superiority of the proposed framework: state-of-the-art performance on the XSTest and SafeRLHF datasets, while running about 300× faster (<1ms) than previous LLM-based moderation models and using about 99% fewer parameters than LlamaGuard.
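The parallel-head idea described in the abstract can be illustrated with a minimal sketch. It assumes a single linear layer followed by a sigmoid as the safety head; the abstract does not specify the head's actual architecture, its training procedure, or the label disambiguation step, so none of those are modeled here. All names (`decode_step`, `shield_w`, `HIDDEN`, `VOCAB`) are hypothetical, and random vectors stand in for the dialogue model's last-layer hidden states.

```python
import math
import random

def linear(weights, bias, h):
    """Apply one linear layer; weights is [out][in], h is [in]."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_step(hidden, lm_w, lm_b, shield_w, shield_b):
    """One decoding step: both heads read the same last-layer hidden state."""
    logits = linear(lm_w, lm_b, hidden)                 # next-token logits (LM head)
    next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
    # Assumed safety head: one linear unit + sigmoid giving an "unsafe" score.
    risk = sigmoid(linear(shield_w, shield_b, hidden)[0])
    return next_token, risk

random.seed(0)
HIDDEN, VOCAB = 8, 5  # toy sizes, chosen only for illustration

# Randomly initialised stand-ins for trained parameters.
lm_w = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
lm_b = [0.0] * VOCAB
shield_w = [[random.gauss(0, 1) for _ in range(HIDDEN)]]
shield_b = [0.0]

# Stand-in hidden states for three generated positions.
hidden_states = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(3)]

results = [decode_step(h, lm_w, lm_b, shield_w, shield_b) for h in hidden_states]
for token_id, risk in results:
    print(token_id, round(risk, 3))
```

Because the safety score reuses a hidden state the dialogue model has already computed, the extra cost per token in this sketch is one small matrix-vector product rather than a second forward pass through a separate moderation model, which is the efficiency argument the abstract makes.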
Anthology ID:
2025.findings-acl.932
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Publisher:
Association for Computational Linguistics
Pages:
18129–18143
URL:
https://aclanthology.org/2025.findings-acl.932/
DOI:
10.18653/v1/2025.findings-acl.932
Cite (ACL):
Zitao Xuan, Xiaofeng Mao, Da Chen, Xin Zhang, Yuhan Dong, and Jun Zhou. 2025. ShieldHead: Decoding-time Safeguard for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18129–18143, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
ShieldHead: Decoding-time Safeguard for Large Language Models (Xuan et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-acl.932.pdf