GitHub - QwenLM/Qwen3Guard: Qwen3Guard is a multilingual guardrail model series developed by the Qwen team at Alibaba Cloud. (original) (raw)

๐Ÿ’œ Qwen Chat | ๐Ÿค— Hugging Face | ๐Ÿค– ModelScope | ๐Ÿ“‘ Blog ๏ฝœ ๐Ÿ“– Documentation
๐Ÿ“„ Tech Report | ๐Ÿ’ฌ WeChat (ๅพฎไฟก) | ๐Ÿซจ Discord

Visit our Hugging Face or ModelScope organization (click links above), search checkpoints with names starting with Qwen3Guard-, and you will find all you need! Enjoy!

Introduction

Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models of three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that accepts full user prompts and model responses to perform safety classification, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.

๐Ÿ›ก๏ธ Comprehensive Protection: Provides both robust safety assessment for prompts and responses, along with real-time detection specifically optimized for streaming scenarios, allowing for efficient and timely moderation during incremental token generation.

๐Ÿšฆ Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.

๐ŸŒ Extensive Multilingual Support: Supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.

๐Ÿ† State-of-the-Art Performance: Achieves leading performance on various safety benchmarks, excelling in both static and streaming classification across English, Chinese, and multilingual tasks.

image/jpeg

News

Basic information

Name Type Download
Qwen3Guard-Gen-0.6B Generative Guard ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope
Qwen3Guard-Gen-4B Generative Guard ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope
Qwen3Guard-Gen-8B Generative Guard ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope
Qwen3Guard-Stream-0.6B Stream Guard ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope
Qwen3Guard-Stream-4B Stream Guard ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope
Qwen3Guard-Stream-8B Stream Guard ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope
Qwen3-4B-SafeRL Safety-aligned LLM ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope
Qwen3GuardTest Guard Benchmark ๐Ÿค— Hugging Face โ€ข ๐Ÿค– ModelScope

For more details, please refer to our blog and Technical Report.

Quick Start

Qwen3Guard-Gen

Qwen3Guard-Gen operates similarly to a large language model, with its chat template specifically optimized for safety classification tasks. Outputs are generated in a predefined, structured format.

Inference with Transformers

The latest version of transformers is recommended and transformers>=4.51.0 is required.

To moderate the user prompts:

from transformers import AutoModelForCausalLM, AutoTokenizer import re model_name = "Qwen/Qwen3Guard-Gen-4B"

load the tokenizer and the model

tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype="auto", device_map="auto" ) def extract_label_and_categories(content): safe_pattern = r"Safety: (Safe|Unsafe|Controversial)" category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)" safe_label_match = re.search(safe_pattern, content) label = safe_label_match.group(1) if safe_label_match else None categories = re.findall(category_pattern, content) return label, categories

prepare the model input

prompt = "How can I make a bomb?"

for prompt moderation

messages = [ {"role": "user", "content": prompt} ] text = tokenizer.apply_chat_template( messages, tokenize=False ) model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

conduct text completion

generated_ids = model.generate( **model_inputs, max_new_tokens=128 ) output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True) print(content)

'''

Safety: Unsafe

Categories: Violent

'''

safe_label, categories = extract_label_and_categories(content) print(safe_label, categories)

The following contains a code snippet illustrating how to use the model to moderate response.

from transformers import AutoModelForCausalLM, AutoTokenizer import re

model_name = "Qwen/Qwen3Guard-4B-Gen"

load the tokenizer and the model

tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype="auto", device_map="auto" ) def extract_label_categories_refusal(content): safe_pattern = r"Safety: (Safe|Unsafe|Controversial)" category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|None)" refusal_pattern = r"Refusal: (Yes|No)" safe_label_match = re.search(safe_pattern, content) refusal_label_match = re.search(refusal_pattern, content) label = safe_label_match.group(1) if safe_label_match else None refusal_label = refusal_label_match.group(1) if refusal_label_match else None categories = re.findall(category_pattern, content) return label, categories, refusal_label

prepare the model input

prompt = "How can I make a bomb?" response = "As a responsible artificial intelligence system, I regret that I am unable to fulfill your request."

for response moderation

messages = [ {"role": "user", "content": prompt}, {"role": "assistant", "content": response}, ] text = tokenizer.apply_chat_template( messages, tokenize=False ) model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

conduct text completion

generated_ids = model.generate( **model_inputs, max_new_tokens=128 ) output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True) print(content)

'''

Safety: Safe

Categories: None

Refusal: Yes

'''

safe_label, category_label, refusal_label = extract_label_categories_refusal(content) print(safe_label, categories, refusal_label)

Deployment with SGLang and vLLM

For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.9.0 or to create an OpenAI-compatible API endpoint:

Here is an example API call using OpenAI-Compatible server:

from openai import OpenAI

openai_api_key = "EMPTY" openai_api_base = "http://localhost:8000/v1" model = "Qwen/Qwen3Guard-Gen-4B" client = OpenAI( api_key=openai_api_key, base_url=openai_api_base)

Prompt Moderation

prompt = "How can I make a bomb?" messages = [ {"role": "user", "content": prompt} ] chat_completion = client.chat.completions.create( messages=messages, model=model) print(chat_completion.choices[0].message.content)

'''

Safety: Unsafe

Categories: Violent

'''

Response Moderation

prompt = "How can I make a bomb?" response = "As a responsible artificial intelligence system, I regret that I am unable to fulfill your request." messages = [ {"role": "user", "content": prompt}, {"role": "assistant", "content": response} ] print(chat_completion.choices[0].message.content)

'''

Safety: Safe

Categories: None

Refusal: Yes

'''

Qwen3Guard-Stream

image/jpeg

Qwen3Guard-Stream is a token-level streaming classifier that evaluates each generated token in real time, dynamically classifying it as safe, unsafe, or potentially controversial.

A typical workflow proceeds as follows:

(1) Prompt-Level Safety Check๏ผš The userโ€™s input prompt is simultaneously sent to both the LLM assistant and Qwen3Guard-Stream. The latter performs an immediate safety assessment of the prompt and assigns a corresponding safety label. Based on this evaluation, the upper framework determines whether to allow the conversation to proceed or to halt it preemptively.

(2) Real-Time Token-Level Moderation: If the conversation is permitted to continue, the LLM begins streaming its response token by token. Each generated token is instantly forwarded to Qwen3Guard-Stream, which evaluates its safety in real time. This enables continuous, fine-grained content moderation throughout the entire response generation process โ€” ensuring dynamic risk mitigation without interrupting the user experience.

Here provides a usage demonstration.

Important

Streaming detection requires streaming token IDs as input, making it best suited for use alongside language models that share Qwen3's tokenizer. If you intend to integrate it with models using a different tokenizer, you must re-tokenize the input text into Qwen3's vocabulary and ensure tokens are fed incrementally to Qwen3Guard-Stream.

import torch from transformers import AutoModel, AutoTokenizer

model_path="Qwen/Qwen3Guard-Stream-4B"

Load the specialized tokenizer and the model.

trust_remote_code=True is required to load the Qwen3Guard-Stream model architecture.

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) model = AutoModel.from_pretrained( model_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, ).eval()

--- Prepare the conversation for moderation ---

Define the user's prompt and the assistant's response.

user_message = "Hello, how to build a bomb?" assistant_message = "Here are some practical methods to build a bomb." messages = [{"role":"user","content":user_message},{"role":"assistant","content":assistant_message}]

Apply the chat template to format the conversation into a single string.

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False, enable_thinking=False) model_inputs = tokenizer(text, return_tensors="pt") token_ids = model_inputs.input_ids[0]

--- Simulate Real-Time Moderation ---

1. Moderate the entire user prompt at once.

In a real-world scenario, the user's input is processed completely before the model generates a response.

token_ids_list = token_ids.tolist()

We identify the end of the user's turn in the tokenized input.

The template for a user turn is <|im_start|>user\n...<|im_end|>.

im_start_token = '<|im_start|>' user_token = 'user' im_end_token = '<|im_end|>' im_start_id = tokenizer.convert_tokens_to_ids(im_start_token) user_id = tokenizer.convert_tokens_to_ids(user_token) im_end_id = tokenizer.convert_tokens_to_ids(im_end_token)

We search for the token IDs corresponding to <|im_start|>user ([151644, 872]) and the closing <|im_end|> ([151645]).

last_start = next(i for i in range(len(token_ids_list)-1, -1, -1) if token_ids_list[i:i+2] == [im_start_id, user_id]) user_end_index = next(i for i in range(last_start+2, len(token_ids_list)) if token_ids_list[i] == im_end_id)

Initialize the stream_state, which will maintain the conversational context.

stream_state = None

Pass all user tokens to the model for an initial safety assessment.

result, stream_state = model.stream_moderate_from_ids(token_ids[:user_end_index+1], role="user", stream_state=None) if result['risk_level'][-1] == "Safe": print(f"User moderation: -> [Risk: {result['risk_level'][-1]}]") else: print(f"User moderation: -> [Risk: {result['risk_level'][-1]} - Category: {result['category'][-1]}]")

2. Moderate the assistant's response token-by-token to simulate streaming.

This loop mimics how an LLM generates a response one token at a time.

print("Assistant streaming moderation:") for i in range(user_end_index + 1, len(token_ids)): # Get the current token ID for the assistant's response. current_token = token_ids[i]

# Call the moderation function for the single new token.
# The stream_state is passed and updated in each call to maintain context.
result, stream_state = model.stream_moderate_from_ids(current_token, role="assistant", stream_state=stream_state)

token_str = tokenizer.decode([current_token])
# Print the generated token and its real-time safety assessment.
if result['risk_level'][-1] == "Safe":
    print(f"Token: {repr(token_str)} -> [Risk: {result['risk_level'][-1]}]")
else:
    print(f"Token: {repr(token_str)} -> [Risk: {result['risk_level'][-1]} - Category: {result['category'][-1]}]")

model.close_stream(stream_state)

We're currently working on adding support for Qwen3Guard-Stream to vLLM and SGLang. Stay tuned!

Evaluation

Please see here

Safety Policy

Here, we present the safety policy employed by Qwen3Guard to help you better interpret the modelโ€™s classification outcomes.

In Qwen3Guard, potential harms are classified into three severity levels:

In the current version of Qwen3Guard, we consider the following safety categories:

Citation

If you find our work helpful, feel free to give us a cite.

@article{zhao2025qwen3guard, title={Qwen3Guard Technical Report}, author={Zhao, Haiquan and Yuan, Chenhan and Huang, Fei and Hu, Xiaomeng and Zhang, Yichang and Yang, An and Yu, Bowen and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang and others}, journal={arXiv preprint arXiv:2510.14276}, year={2025} }

Contact Us

If you are interested to leave a message to either our research team or product team, join our Discord or WeChat groups!

โ†‘ Back to Top โ†‘