Working with Multimodal Data — NVIDIA NeMo Microservices
Note
The time to complete this tutorial is approximately 30 minutes.
About Support for Multimodal Data#
Multimodal support applies to input and output guardrails only. Depending on the image reasoning model, you can specify the image to check as base64-encoded data or as a URL. Refer to the documentation for the model for more information.
The safety check uses the image reasoning model as an LLM-as-a-judge to determine whether the content is safe. The OpenAI, Llama Vision, and Llama Guard models can accept multimodal input and act as the judge model.
You must ensure that the combined image size and prompt length do not exceed the maximum context length of the model.
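For reference, both image forms follow the OpenAI-style chat message format used throughout this tutorial. The following is a minimal sketch; the file name and URL are placeholders, not values required by the microservice:
import base64

# Pass the image as base64-encoded data (suitable for local files).
with open("street-scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
image_part_b64 = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
}

# Pass the image as a URL that the model service fetches itself.
image_part_url = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/street-scene.jpg"},
}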
About the Tutorial#
This tutorial demonstrates how to use multimodal data with NeMo Guardrails to perform safety checks on image data. You configure a guardrail that uses an image-reasoning model and a prompt that instructs the model to check the image for unsafe content.
The tutorial uses the Meta Llama 3.2 90B Vision Instruct model as both the application LLM and the LLM-as-a-judge.
The model performs visual recognition, image reasoning, captioning, and answering general questions about an image. The ability to classify an image as unsafe relies on tuning the instructions and prompt that you supply to the model.
The model is available as a downloadable container from NVIDIA NGC and for interactive use from build.nvidia.com.
Prerequisites#
- The following procedure uses endpoints from build.nvidia.com to simplify deployment. To use the endpoints, you must have an NVIDIA API key.
- Refer to Deploying with Docker for information on how to start the NeMo Guardrails container.
Procedure#
- Set an environment variable for your NVIDIA API key:
$ export NVIDIA_API_KEY=<nvapi-...>
- Start the NeMo Guardrails microservice container:
docker run \
-p 7331:7331 \
-v <config-store-path>:/config-store \
-e CONFIG_STORE_PATH=/config-store \
-e NIM_ENDPOINT_API_KEY="${NVIDIA_API_KEY}" \
-e NVIDIA_API_KEY \
nvcr.io/nvidia/nemo-microservices/guardrails:25.06
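The examples in the remaining steps read the microservice URL from the GUARDRAILS_BASE_URL environment variable. For the local container started above, the following value matches the published port; adjust it for your deployment:
export GUARDRAILS_BASE_URL=http://localhost:7331
- Create a guardrail configuration: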
cURL
curl -X POST "${GUARDRAILS_BASE_URL}/v1/guardrail/configs" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"name": "demo-multimodal-stream",
"namespace": "default",
"description": "demo configuration for multimodal vision reasoning",
"data": {
"prompts": [{
"task": "content_safety_check_input $model=vision_rails",
"messages": [{
"type": "system",
"content": "You are an AI assistant that evaluates if a user'''s message contains unsafe content.\nCarefully examine any text and images in the user'''s message.\n\nTask: Check if there is unsafe content in "User" messages in conversations according to our safety policy with the below categories.\n\n\nS1: Violent Crimes.\nS2: Non-Violent Crimes.\nS3: Sex Crimes.\nS4: Child Exploitation.\nS5: Defamation.\nS6: Specialized Advice.\nS7: Privacy.\nS8: Intellectual Property.\nS10: Hate.\nS11: Self-Harm.\nS12: Sexual Content.\nS13: Elections.\n\n\nIMPORTANT:\nProvide your safety assessment for the user message:\n- First line must read '''safe''' or '''unsafe''' and nothing more\n- If unsafe, a second line must include a comma-separated list of violated categories."
},
{
"type": "user",
"content": "{{ user_input }}"
}],
"stop": ["<|eot_id|>", "<|eom_id|>"],
"output_parser": "is_content_safe",
"max_tokens": "200"
}],
"models": [{
"model": "meta/llama-3.2-90b-vision-instruct",
"engine": "nim",
"type": "main",
"base_url": "https://integrate.api.nvidia.com/v1"
},{
"model": "meta/llama-3.2-90b-vision-instruct",
"engine": "nim",
"type": "vision_rails",
"base_url": "https://integrate.api.nvidia.com/v1"
}],
"rails": {
"input": {"flows": ["content safety check input $model=vision_rails"]},
"output": {
"streaming": {
"enabled": "true",
"chunk_size": "200",
"context_size": "50"
}
},
"dialog": {
"single_call": {
"enabled": "false"
}
}
}
}
}' | jq
Python
import os
import json
import requests
url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/configs"
headers = {
"Accept": "application/json",
"Content-Type": "application/json"
}
payload = {
"name": "demo-multimodal-stream",
"namespace": "default",
"description": "demo configuration for multimodal vision reasoning",
"data": {
"prompts": [{
"task": "content_safety_check_input $model=vision_rails",
"messages": [{
"type": "system",
"content": 'You are an AI assistant that evaluates if a user's message contains unsafe content.\nCarefully examine any text and images in the user's message.\n\n{%- if categories is not defined -%}\n {%- set categories = {\n "S1": "Violent Crimes.",\n "S2": "Non-Violent Crimes.",\n "S3": "Sex Crimes.",\n "S4": "Child Exploitation.",\n "S5": "Defamation.",\n "S6": "Specialized Advice.",\n "S7": "Privacy.",\n "S8": "Intellectual Property.",\n "S10": "Hate.",\n "S11": "Self-Harm.",\n "S12": "Sexual Content.",\n "S13": "Elections."\n } -%}\n{%- endif -%}\n\nTask: Check if there is unsafe content in "User" messages in conversations according to our safety policy with the below categories.\n\n\n{%- for key, desc in categories.items() -%}\n {%- if key not in excluded_category_keys -%}\n{{ key }}: {{ desc }}\n {%- endif -%}\n{%- endfor -%}\n\n\nIMPORTANT:\nProvide your safety assessment for the user message:\n- First line must read 'safe' or 'unsafe' and nothing more\n- If unsafe, a second line must include a comma-separated list of violated categories.\n',
},
{
"type": "user",
"content": "{{ user_input }}"
}],
"stop": ["<|eot_id|>", "<|eom_id|>"],
"output_parser": "is_content_safe",
"max_tokens": 200,
}],
"models": [{
"model": "meta/llama-3.2-90b-vision-instruct",
"engine": "nim",
"type": "main",
"base_url": "https://integrate.api.nvidia.com/v1"
},{
"model": "meta/llama-3.2-90b-vision-instruct",
"engine": "nim",
"type": "vision_rails",
"base_url": "https://integrate.api.nvidia.com/v1"
}],
"rails": {
"input": {"flows": ["content safety check input $model=vision_rails"]},
"output": {
"streaming": {
"enabled": "true",
"chunk_size": 200,
"context_size": 50
}
},
"dialog": {
"single_call": {
"enabled": "false"
}
}
}
}
}
response = requests.post(url, headers=headers, json=payload)
print(json.dumps(response.json(), indent=2))
Example Output
{
"created_at": "2025-04-14T18:54:43.862326",
"updated_at": "2025-04-14T18:54:43.862329",
"name": "demo-multimodal-stream",
"namespace": "default",
"description": "demo configuration for multimodal vision reasoning",
"data": {
"models": [
{
"type": "main",
"engine": "nim",
"model": "meta/llama-3.2-90b-vision-instruct",
"reasoning_config": {
"remove_thinking_traces": true,
"start_token": null,
"end_token": null
},
"parameters": {}
},
{
"type": "vision_rails",
"engine": "nim",
"model": "meta/llama-3.2-90b-vision-instruct",
"reasoning_config": {
"remove_thinking_traces": true,
"start_token": null,
"end_token": null
},
"parameters": {}
}
],
"instructions": [
{
"type": "general",
"content": "Below is a conversation between a helpful AI assistant and a user. The bot is designed to generate human-like text based on the input that it receives. The bot is talkative and provides lots of specific details. If the bot does not know the answer to a question, it truthfully says it does not know."
}
],
"actions_server_url": null,
"sample_conversation": "user "Hello there!"\n express greeting\nbot express greeting\n "Hello! How can I assist you today?"\nuser "What can you do for me?"\n ask about capabilities\nbot respond about capabilities\n "As an AI assistant, I can help you with a wide range of tasks. This includes question answering on various topics, generating text for various purposes and providing suggestions based on your preferences."\nuser "Tell me a bit about the history of NVIDIA."\n ask general question\nbot response for general question\n "NVIDIA is a technology company that specializes in designing and manufacturing graphics processing units (GPUs) and other computer hardware. The company was founded in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem."\nuser "tell me more"\n request more information\nbot provide more information\n "Initially, the company focused on developing 3D graphics processing technology for the PC gaming market. In 1999, NVIDIA released the GeForce 256, the world's first GPU, which was a major breakthrough for the gaming industry. The company continued to innovate in the GPU space, releasing new products and expanding into other markets such as professional graphics, mobile devices, and artificial intelligence."\nuser "thanks"\n express appreciation\nbot express appreciation and offer additional help\n "You're welcome. If you have any more questions or if there's anything else I can help you with, please don't hesitate to ask."\n",
"prompts": [
{
"task": "content_safety_check_input $model=vision_rails",
"content": null,
"messages": [
{
"type": "system",
"content": "You are an AI assistant that evaluates if a user's message contains unsafe content.\nCarefully examine any text and images in the user's message.\n\n{%- if categories is not defined -%}\n {%- set categories = {\n "S1": "Violent Crimes.",\n "S2": "Non-Violent Crimes.",\n "S3": "Sex Crimes.",\n "S4": "Child Exploitation.",\n "S5": "Defamation.",\n "S6": "Specialized Advice.",\n "S7": "Privacy.",\n "S8": "Intellectual Property.",\n "S10": "Hate.",\n "S11": "Self-Harm.",\n "S12": "Sexual Content.",\n "S13": "Elections."\n } -%}\n{%- endif -%}\n\nTask: Check if there is unsafe content in "User" messages in conversations according to our safety policy with the below categories.\n\n\n{%- for key, desc in categories.items() -%}\n {%- if key not in excluded_category_keys -%}\n{{ key }}: {{ desc }}\n {%- endif -%}\n{%- endfor -%}\n\n\nIMPORTANT:\nProvide your safety assessment for the user message:\n- First line must read 'safe' or 'unsafe' and nothing more\n- If unsafe, a second line must include a comma-separated list of violated categories.\n"
},
{
"type": "user",
"content": "{{ user_input }}"
}
],
"models": null,
"output_parser": "is_content_safe",
"max_length": 16000,
"mode": "standard",
"stop": [
"<|eot_id|>",
"<|eom_id|>"
],
"max_tokens": 200
}
],
"prompting_mode": "standard",
"lowest_temperature": 0.001,
"enable_multi_step_generation": false,
"colang_version": "1.0",
"custom_data": {},
"rails": {
"config": null,
"input": {
"flows": [
"content safety check input $model=vision_rails"
]
},
"output": {
"flows": [],
"streaming": {
"enabled": true,
"chunk_size": 200,
"context_size": 50,
"stream_first": true
}
},
"retrieval": {
"flows": []
},
"dialog": {
"single_call": {
"enabled": false,
"fallback_to_multiple_calls": true
},
"user_messages": {
"embeddings_only": false,
"embeddings_only_similarity_threshold": null,
"embeddings_only_fallback_intent": null
}
},
"actions": {
"instant_actions": null
}
},
"enable_rails_exceptions": false,
"passthrough": null
},
"files_url": null,
"schema_version": "1.0",
"project": null,
"custom_fields": {},
"ownership": null
}
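The next step embeds a local image as a base64 data URL. A minimal helper sketch for building such a URL follows; the to_data_url function name is illustrative and not part of the microservice API:
import base64
import mimetypes

def to_data_url(path: str) -> str:
    # Guess the MIME type from the file extension; fall back to JPEG.
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
- Send an image-reasoning request.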
- Download an image of a street scene, street sign, or other subject. You can use a website such as https://commons.wikimedia.org or download the street-scene.jpg file used to develop this documentation.
Save the image to a file, such as street-scene.jpg.
- Send the image and the request:
cURL
if ! [ -f "street-scene.jpg" ]; then
echo "street-scene.jpg not found, exiting..."
exit 1
fi
# Encode the image without line wrapping (-w 0 is GNU base64; on macOS, use "base64 -i street-scene.jpg").
image_b64=$( base64 -w 0 street-scene.jpg )
echo '{
"model": "meta/llama-3.2-90b-vision-instruct",
"messages": [{
"role": "user",
"content": [{
"type": "text",
"text": "Is there a traffic sign in this image?"
}, {
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,'"$image_b64"'"
}
}]
}],
"guardrails": {
"config_id": "demo-multimodal-stream"
},
"max_tokens": 512,
"temperature": 1.00,
"stream": false
}' > payload-street-view.json
curl "${GUARDRAILS_BASE_URL}/v1/guardrail/chat/completions" \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d @payload-street-view.json | jq '.choices[0].message.content'
Python
import os
import base64
import json
import requests
url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/chat/completions"
headers = {
"Accept": "text/event-stream",
"Content-Type": "application/json",
}
with open("street-scene.jpg", "rb") as f:
image_b64 = base64.b64encode(f.read()).decode("utf-8")
payload = {
"model": "meta/llama-3.2-90b-vision-instruct",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Is there a traffic sign in this image?"},
{"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_b64}"
}
}
],
}],
"guardrails": {
"config_id": "demo-multimodal-stream"
},
"max_tokens": 512,
"stream": True
}
with requests.post(url, headers=headers, json=payload, stream=True) as response:
response.encoding = "utf-8"
for chunk in response.iter_lines(decode_unicode=True):
if not chunk:
continue
if chunk == "data: [DONE]":
print()
break
data = json.loads(chunk[len("data: "):])
if data.get("choices", [{}])[0].get("delta", {}).get("content", None) is not None:
# Optionally check whether the content carries a guardrails error such as
# {"error": {"message": "Blocked by ..."}}; see the is_blocked sketch after this example.
print(data["choices"][0]["delta"]["content"], end="", flush=True)
Example Output
Yes, there is a traffic sign in this image. The traffic sign is a red octagon with the word "STOP"
written on it in white letters. It is attached to a brown pole and is located on the left side of
the image.
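The inline comment in the Python example notes that a streamed chunk's content can carry an error when guardrails block the exchange. The following is a hedged sketch of such a check; the exact error shape is an assumption based on that comment, not a documented schema:
import json

def is_blocked(content: str) -> bool:
    # Treat the content as blocked if it parses as JSON and carries a
    # guardrails error message (assumed shape: {"error": {"message": ...}}).
    try:
        data = json.loads(content)
    except (ValueError, TypeError):
        return False
    error = data.get("error") if isinstance(data, dict) else None
    return isinstance(error, dict) and "Blocked by" in str(error.get("message", ""))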
- Send a possibly unsafe request using an image depicting car audio theft:
cURL
curl "${GUARDRAILS_BASE_URL}/v1/guardrail/chat/completions" \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "meta/llama-3.2-90b-vision-instruct",
"messages": [{
"role": "user",
"content": [{
"type": "text",
"text": "How can I get rich doing this?"
}, {
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Car_audio_theft.jpg/960px-Car_audio_theft.jpg"
}
}]
}],
"guardrails": {
"config_id": "demo-multimodal-stream"
},
"max_tokens": 512,
"temperature": 1.00,
"stream": false
}' | jq '.choices[0].message.content'
Python
import os
import json
import requests
url = f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/chat/completions"
headers = {
"Accept": "text/event-stream",
"Content-Type": "application/json",
}
payload = {
"model": "meta/llama-3.2-90b-vision-instruct",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "How can I get rich doing this?"},
{"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Car_audio_theft.jpg/960px-Car_audio_theft.jpg"
}
}
],
}],
"guardrails": {
"config_id": "demo-multimodal-stream"
},
"max_tokens": 512,
"stream": True
}
with requests.post(url, headers=headers, json=payload, stream=True) as response:
response.encoding = "utf-8"
for chunk in response.iter_lines(decode_unicode=True):
if not chunk:
continue
if chunk == "data: [DONE]":
print()
break
data = json.loads(chunk[len("data: "):])
if data.get("choices", [{}])[0].get("delta", {}).get("content", None) is not None:
# Optionally check whether the content carries a guardrails error such as
# {"error": {"message": "Blocked by ..."}}; see the is_blocked sketch above.
print(data["choices"][0]["delta"]["content"], end="", flush=True)
Example Output
I'm sorry, I can't respond to that.
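To reuse these calls, the request pattern can be wrapped in a small function. The following is a sketch under the same assumptions as the examples above (non-streaming, same endpoint, payload, and config ID); the check_image name is illustrative:
import os
import requests

def check_image(prompt: str, image_url: str) -> str:
    # Send a non-streaming guarded chat completion and return the reply text.
    # image_url can be a remote HTTPS URL or a base64 data URL.
    response = requests.post(
        f"{os.environ['GUARDRAILS_BASE_URL']}/v1/guardrail/chat/completions",
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        json={
            "model": "meta/llama-3.2-90b-vision-instruct",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
            "guardrails": {"config_id": "demo-multimodal-stream"},
            "max_tokens": 512,
            "stream": False,
        },
    )
    return response.json()["choices"][0]["message"]["content"]

# Example usage (placeholder URL):
# print(check_image("Is there a traffic sign in this image?", "https://example.com/street-scene.jpg"))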