Nemotron Nano 9B V2 API | AIMLAPI (original) (raw)

Nemotron Nano 9B V2

NVIDIA Nemotron Nano 9B V2 is a compact yet capable language model built to balance performance, efficiency, and accessibility.

Nemotron Nano 9B V2 API Overview

NVIDIA Nemotron Nano 9B V2 is a state-of-the-art large language model (LLM) designed for efficient and high-throughput text generation, particularly excelling in complex reasoning tasks. Leveraging a hybrid Mamba-Transformer architecture, this model balances inference speed, accuracy, and moderate resource consumption.

Technical Specifications

Architecture: Hybrid Mamba-Transformer
Parameter count: 9 Billion
Training data: 20 trillion tokens, FP8 training precision
Context window: 131,072 tokens

Performance Benchmarks

Reasoning Accuracy: Matches or exceeds similarly sized models across benchmarks like GSM8K, MATH, AIME, MMLU, and GPQA.
Code Generation: 71.1% accuracy on LiveCodeBench, supporting 43 programming languages.
Memory Efficiency: INT4 quantization allows deployment on GPUs with 22 GiB memory while supporting massive context windows.

Key Features

Hybrid Mamba-Transformer Architecture: Combines efficient Mamba-2 state space layers with selective Transformer self-attention to accelerate long-context reasoning without sacrificing accuracy.
High Throughput: Achieves up to 6x faster inference speed compared to similar-sized models, such as Qwen3-8B, in reasoning-heavy scenarios.
Long Context Support: Can process sequences up to 128,000 tokens on commodity hardware, enabling extensive document comprehension and multi-document summarization.

Nemotron Nano 9B V2 API Pricing

Input: $0.05486 / 1M tokens
Output: $0.21944 / 1M tokens

‍

Code Sample

Comparison with Other Models

vs Qwen3-8B: Nemotron Nano uses a hybrid Mamba-Transformer architecture replacing most self-attention layers with Mamba-2 layers, resulting in up to 6x faster inference on reasoning-heavy tasks. It supports extremely long contexts (128K tokens) on a single GPU versus Qwen3-8B’s conventional transformer design with shorter context windows.

vs GPT-3.5: While GPT-3.5 is widely adopted for general NLP tasks with broad integration, Nemotron Nano 9B V2 specializes in efficient long-context reasoning and multi-step problem solving with better throughput on NVIDIA hardware.

vs Claude 2: Claude 2 focuses on safety and instruction-following with broad conversational abilities, but Nemotron Nano places more emphasis on mathematical/scientific reasoning and coding accuracy with dedicated controllable reasoning budget features.

vs PaLM 2: PaLM 2 targets high accuracy on broad AI benchmarks and multi-lingual tasks but generally demands more extensive hardware resources. Nemotron Nano excels in deployability with a smaller footprint, supporting effectively longer contexts and faster inference speeds specifically on NVIDIA GPU architectures, making it pragmatic for large-scale enterprise or edge applications.