ShieldHead: Decoding-time Safeguard for Large Language Models


Abstract

In light of the widespread deployment of Large Language Models (LLMs), the responsibility for safeguarding and regulating LLM-generated content has taken on heightened significance. Recent advancements in LLM-based moderation methods, e.g., LlamaGuard, have demonstrated remarkable promise in identifying safety risks associated with both inputs and outputs in human-AI interactions. However, integrating LLM-based safeguards into a chatbot system requires an additional inference stage involving a moderation LLM with billions of parameters, which significantly increases computational costs and reduces overall efficiency. In this paper, we demonstrate that simply learning a classification head on the last-layer hidden states of the dialogue model provides a strong capability to identify harmful content. The classification head, referred to as ShieldHead, serves as an auxiliary branch in parallel with the next-token-prediction LM head, enabling the detection of potential risks in past text sequences. Additionally, a label disambiguation technique is employed to supervise ShieldHead with both token-level and sentence-level labels, which further enhances its performance. ShieldHead exhibits remarkable efficiency during inference, providing real-time moderation results alongside token-wise streaming output during the chatbot system’s decoding phase. Extensive experimental results demonstrate the superiority of the proposed framework: state-of-the-art performance on the XSTest and SafeRLHF datasets while running about **300×** faster (**<1ms**) than previous LLM-based moderation models and using **99%** fewer parameters than LlamaGuard.
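The parallel-branch idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a binary safe/unsafe head implemented as a single linear layer with a sigmoid, uses plain Python lists in place of tensors, and all function names, weights, and the 0.5 threshold are hypothetical. It only shows how one last-layer hidden state could feed both the LM head and a small safety head at each decoding step.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def lm_head(hidden: Vector, vocab_weights: Sequence[Vector]) -> List[float]:
    """Next-token logits: one dot product per vocabulary row."""
    return [dot(row, hidden) for row in vocab_weights]


def shield_head(hidden: Vector, weights: Vector, bias: float) -> float:
    """Hypothetical binary safety head (assumption, not from the paper):
    sigmoid of a linear projection, read as P(text so far is unsafe)."""
    return 1.0 / (1.0 + math.exp(-(dot(weights, hidden) + bias)))


def decode_step(
    hidden: Vector,
    vocab_weights: Sequence[Vector],
    shield_weights: Vector,
    shield_bias: float,
) -> Tuple[int, float]:
    """Both branches consume the same hidden state, so the safety score
    needs no second forward pass through a separate moderation model."""
    logits = lm_head(hidden, vocab_weights)
    next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
    risk = shield_head(hidden, shield_weights, shield_bias)
    return next_token, risk


def moderate_stream(
    hidden_states: Sequence[Vector],
    vocab_weights: Sequence[Vector],
    shield_weights: Vector,
    shield_bias: float,
    threshold: float = 0.5,  # illustrative cut-off, not a value from the paper
) -> List[Tuple[int, float, bool]]:
    """Return (token id, risk score, flagged) for every decoding step."""
    results = []
    for hidden in hidden_states:
        token, risk = decode_step(hidden, vocab_weights, shield_weights, shield_bias)
        results.append((token, risk, risk >= threshold))
    return results


# Toy data: two decoding steps, hidden size 2, vocabulary of 3 tokens.
demo = moderate_stream(
    hidden_states=[[1.0, 0.0], [0.0, 2.0]],
    vocab_weights=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    shield_weights=[-1.0, 1.0],
    shield_bias=0.0,
)
for token, risk, flagged in demo:
    print(token, round(risk, 3), flagged)
```

In a real system the hidden states would come from the dialogue model's final layer and the head would be trained on safety labels; the sketch only conveys why the added per-token cost is a single small projection rather than a full moderation-model inference.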

Anthology ID:

2025.findings-acl.932

Volume:

Findings of the Association for Computational Linguistics: ACL 2025

Month:

July

Year:

2025

Address:

Vienna, Austria

Editors:

Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar

Venue:

Findings

Publisher:

Association for Computational Linguistics

Pages:

18129–18143

URL:

https://aclanthology.org/2025.findings-acl.932/

DOI:

10.18653/v1/2025.findings-acl.932

Cite (ACL):

Zitao Xuan, Xiaofeng Mao, Da Chen, Xin Zhang, Yuhan Dong, and Jun Zhou. 2025. ShieldHead: Decoding-time Safeguard for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18129–18143, Vienna, Austria. Association for Computational Linguistics.

Cite (Informal):

ShieldHead: Decoding-time Safeguard for Large Language Models (Xuan et al., Findings 2025)

PDF:

https://aclanthology.org/2025.findings-acl.932.pdf