AI Inference

NVIDIA Dynamo

Scale and serve AI inference—fast.

Overview
Features
Starting Options
Benefits
Adopters
Use Cases
Customer Testimonials
Resources
Next Steps

Overview

The Operating System of AI

Efficiently serving today’s frontier language models often requires resources that exceed the capacity of a single GPU—or even an entire node—making distributed, multi-node deployment essential for AI inference.

NVIDIA Dynamo is an open source, distributed inference-serving framework built to deploy models in multi-node environments at data center scale. It supports open source inference engines—including SGLang, NVIDIA TensorRT™ LLM, and vLLM—and simplifies the complexities of distributed serving by disaggregating inference phases across different GPUs, intelligently routing requests to the appropriate GPU to avoid redundant computation, and extending GPU memory through data caching to cost-effective storage tiers.

NVIDIA NIM™ microservices will include NVIDIA Dynamo capabilities, providing a quick and easy deployment option. NVIDIA Dynamo will also be supported and available with NVIDIA AI Enterprise.

What Is Distributed Inference?

Distributed inference is the process of running AI model inference across multiple computing devices or nodes to maximize throughput by parallelizing computations.

This approach enables efficient scaling for large-scale AI applications, such as generative AI, by distributing workloads across GPUs or cloud infrastructure. Distributed inference improves overall performance and resource utilization by allowing users to optimize latency and throughput for the unique requirements of each workload.

A Closer Look at NVIDIA Dynamo

Low-latency distributed inference framework for scaling reasoning AI models.

Independent benchmarks show that NVIDIA GB200 NVL72 combined with NVIDIA Dynamo improves mixture-of-experts (MoE) model throughput by up to 15x compared to NVIDIA Hopper-based systems.

Independent benchmarks show that NVIDIA Dynamo combined with wide expert parallel on NVIDIA GB200 NVL72 improves mixture-of-experts (MoE) model throughput by up to 7x compared to NVIDIA B200-based systems.

The GB200 NVL72 connects 72 GPUs via high-speed NVIDIA NVLink™, enabling low-latency expert communication critical for MoE reasoning models. NVIDIA Dynamo enhances efficiency through disaggregated inference, splitting prefill and decode phases across nodes for independent optimization. Together, GB200 NVL72 and NVIDIA Dynamo form a high-performance stack optimized for large-scale MoE inference.

Features

Explore the Features of NVIDIA Dynamo

Disaggregated Serving

Separates the context (prefill) and generation (decode) phases of large language model (LLM) inference across distinct GPUs, enabling independent GPU allocation and optimization to increase requests served per GPU.

LLM-Aware Router

Routes inference traffic efficiently, minimizing costly recomputation for repeated or overlapping requests to preserve compute resources while ensuring balanced load distribution across large GPU fleets.

KV Caching to Storage

Instantly offloads KV cache from limited GPU memory to scalable, cost-efficient storage, such as CPU RAM, local SSDs, or network storage.

Topology-Optimized Kubernetes Serving (Grove)

Enables efficient scaling and declarative startup ordering of interdependent AI inference components in single-node and multi-node setups using a unified Kubernetes custom resource.

GPU Planner

Monitors GPU capacity in distributed inference environments and dynamically allocates GPU workers across context and generation phases to resolve bottlenecks and optimize performance.

Low-Latency Communication Library (NIXL)

Accelerates data movement in distributed inference settings while simplifying transfer complexities across diverse hardware, including GPUs, CPUs, networks, and storage.

AIConfigurator

Removes the guesswork from disaggregated serving clusters by recommending optimal prefill and decode configs and model parallel strategies tailored to the model, GPU budget, and SLOs.

AIPerf

Benchmark generative AI model performance across any inference solution, with detailed metrics via command-line output and in-depth performance reports.
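To make the GPU Planner idea above concrete, here is a minimal Python sketch of shifting workers between the context (prefill) and generation (decode) phases based on backlog. The function name `rebalance`, its parameters, and the one-worker-at-a-time policy are illustrative assumptions, not NVIDIA Dynamo's actual API or algorithm.

```python
def rebalance(prefill_workers, decode_workers, prefill_queue, decode_queue,
              min_per_phase=1):
    """Move one worker toward the phase with the larger backlog per worker.

    All names here are hypothetical. Worker counts are assumed to be >= 1.
    """
    prefill_pressure = prefill_queue / prefill_workers
    decode_pressure = decode_queue / decode_workers

    # Prefill is the bottleneck: borrow a worker from decode if one can be spared.
    if prefill_pressure > decode_pressure and decode_workers > min_per_phase:
        return prefill_workers + 1, decode_workers - 1

    # Decode is the bottleneck: borrow a worker from prefill if one can be spared.
    if decode_pressure > prefill_pressure and prefill_workers > min_per_phase:
        return prefill_workers - 1, decode_workers + 1

    # Balanced, or no worker can be moved without starving a phase.
    return prefill_workers, decode_workers
```

Calling `rebalance(2, 2, prefill_queue=10, decode_queue=2)` shifts one worker from decode to prefill. A production planner would also weigh SLOs and the cost of moving a worker, which this sketch ignores.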

Accelerate Distributed Inference

NVIDIA Dynamo is fully open source, giving you complete transparency and flexibility. Deploy NVIDIA Dynamo, contribute to its growth, and seamlessly integrate it into your existing stack.

Check it out on GitHub and join the community!

Benefits

The Benefits of NVIDIA Dynamo

Seamlessly Scale From One GPU to Thousands of GPUs

Streamline and automate GPU cluster setup with prebuilt, easy-to-deploy tools and enable dynamic autoscaling with real-time LLM-specific metrics, avoiding over- or under-provisioning of GPU resources.

Increase Inference Serving Capacity While Reducing Costs

Leverage advanced LLM inference serving optimizations like disaggregated serving and topology-aware autoscaling to increase the number of inference requests served without compromising user experience.

Future-Proof Your AI Infrastructure and Avoid Costly Migrations

Open and modular design allows you to easily pick and choose the inference-serving components that suit your unique needs, ensuring compatibility with your existing AI stack and avoiding costly migration projects.

Accelerate Time to Deploy New AI Models in Production

NVIDIA Dynamo supports all major frameworks, including NVIDIA TensorRT LLM, vLLM, SGLang, PyTorch, and more, so you can quickly deploy new generative AI models regardless of their backend.

Use Cases

Deploying AI with NVIDIA Dynamo

Find out how you can drive innovation with NVIDIA Dynamo.

Serving Reasoning Models

Reasoning models generate more tokens to solve complex problems, increasing inference costs. NVIDIA Dynamo optimizes these models with features like disaggregated serving. This approach separates the prefill and decode computational phases onto distinct GPUs, allowing AI inference teams to optimize each phase independently. The result is better resource utilization, more queries served per GPU, and lower inference costs. When combined with NVIDIA GB200 NVL72, NVIDIA Dynamo boosts performance by up to 15x.
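The prefill/decode split described above can be illustrated with a toy Python sketch. Everything here is an assumption for illustration: the `DisaggregatedServer` class, the worker names, and the placeholder "model" are hypothetical and do not reflect NVIDIA Dynamo's real interfaces. The point is only the shape of the flow: one pool handles prefill, a separate pool handles decode, and the KV state is handed from one to the other.

```python
from itertools import cycle


def prefill(prompt_tokens):
    """Prefill: process the full prompt once and return a stand-in KV cache."""
    return {"kv_len": len(prompt_tokens)}


def decode(kv, max_new_tokens):
    """Decode: emit tokens one at a time, growing the KV cache each step."""
    out = []
    for _ in range(max_new_tokens):
        out.append(kv["kv_len"])  # placeholder token id; a real engine samples from the model
        kv["kv_len"] += 1
    return out


class DisaggregatedServer:
    """Hypothetical server that assigns each phase to its own worker pool."""

    def __init__(self, prefill_workers, decode_workers):
        # Round-robin over each pool independently, so pools can be sized separately.
        self._prefill = cycle(prefill_workers)
        self._decode = cycle(decode_workers)

    def serve(self, prompt_tokens, max_new_tokens):
        p = next(self._prefill)
        d = next(self._decode)
        kv = prefill(prompt_tokens)          # would execute on worker p
        tokens = decode(kv, max_new_tokens)  # would execute on worker d after a KV transfer
        return p, d, tokens
```

Because the two pools are independent, a deployment could, for example, run one prefill worker alongside two decode workers: `DisaggregatedServer(["p0"], ["d0", "d1"])`.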

Kubernetes AI Scaling

As AI models grow too large to fit on a single node, serving them efficiently becomes a challenge. Distributed inference requires splitting models across multiple nodes, which adds complexity in orchestration, scaling, and communication in Kubernetes-based environments. Ensuring these nodes function as a cohesive unit—especially under dynamic workloads—demands careful management. NVIDIA Dynamo simplifies this by using Grove, which seamlessly handles scheduling, scaling, and serving, so you can focus on deploying AI—not managing infrastructure.

Scalable AI Agents

AI agents generate massive amounts of KV cache as they work with multiple models—LLMs, retrieval systems, and specialized tools—in real time. This KV cache often exceeds the capacity of GPU memory, creating a bottleneck for scaling and performance.

Caching KV data to host memory or external storage overcomes GPU memory limitations and extends capacity, enabling AI agents to scale beyond what GPU memory alone allows. NVIDIA Dynamo simplifies this with its KV Cache Manager and integrations with open source tools like LMCache, ensuring efficient cache management and scalable AI agent performance.
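As a rough illustration of offloading KV data to a cheaper tier, the Python sketch below keeps a small "GPU" tier and spills least-recently-used entries to a "host" tier instead of discarding them. The `TieredKVCache` class and its methods are hypothetical names chosen for this example; they are not the KV Cache Manager or LMCache API.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small 'GPU' tier that spills LRU entries to a 'host' tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # most recently used entries sit at the end
        self.host = {}            # stand-in for CPU RAM, local SSD, or network storage

    def put(self, key, kv_block):
        self.gpu[key] = kv_block
        self.gpu.move_to_end(key)
        self.host.pop(key, None)  # an entry lives in exactly one tier
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_block = self.gpu.popitem(last=False)
            self.host[old_key] = old_block  # offload instead of discarding

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.host:
            block = self.host.pop(key)
            self.put(key, block)  # promote back to the fast tier
            return block
        return None  # miss: the caller would have to recompute the KV data
```

With `gpu_capacity=2`, inserting three blocks moves the oldest one to the host tier, and a later `get` on that block brings it back without recomputation.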

Code Generation

Code generation often requires iterative refinement to adjust prompts, clarify requirements, or debug outputs based on the model’s responses. This back-and-forth forces the context to be recomputed with each user turn, increasing inference costs. NVIDIA Dynamo optimizes this process by enabling context reuse.

NVIDIA Dynamo’s LLM-aware router intelligently manages KV cache across multi-node GPU clusters. It routes requests based on cache overlap, directing them to GPUs with the highest reuse potential. This minimizes redundant computation and ensures balanced performance in large-scale deployments.
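Routing on cache overlap can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Worker` class, the block size of 4 tokens, and the scoring rule (matched prefix blocks minus a load penalty) are all hypothetical and are not taken from NVIDIA Dynamo's router.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical GPU worker that tracks cached token-block prefixes and load."""
    name: str
    active_requests: int = 0
    cached_prefixes: set = field(default_factory=set)


def block_prefixes(tokens, block_size=4):
    """Return cumulative block-aligned prefixes of the token list as hashable tuples."""
    full_blocks = len(tokens) // block_size
    return [tuple(tokens[: (i + 1) * block_size]) for i in range(full_blocks)]


def overlap_blocks(worker, tokens, block_size=4):
    """Count the leading blocks of the request that the worker already has cached."""
    matched = 0
    for prefix in block_prefixes(tokens, block_size):
        if prefix not in worker.cached_prefixes:
            break
        matched += 1
    return matched


def route(workers, tokens, block_size=4, load_weight=1.0):
    """Pick the worker with the best (cache overlap - load penalty) score."""
    def score(w):
        return overlap_blocks(w, tokens, block_size) - load_weight * w.active_requests

    best = max(workers, key=score)
    best.active_requests += 1
    best.cached_prefixes.update(block_prefixes(tokens, block_size))
    return best
```

A follow-up turn that extends an earlier prompt scores highly on the worker that served the first turn, while an unrelated request falls through to the less loaded worker. Tuning `load_weight` trades cache reuse against load balance.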

Customer Testimonials

What Are Industry Leaders Saying About NVIDIA Dynamo?

CoreWeave

“As AI moves from experimental pilots to continuous, large-scale production, the underlying infrastructure must be as dynamic as the models it supports. Supporting NVIDIA Dynamo allows us to offer a more seamless, resilient environment for deploying complex AI agents. This foundation provides the durability and high-performance orchestration required to move the industry’s most ambitious agentic workloads into global production.”

Chen Goldberg, EVP of Product & Engineering at CoreWeave

Together AI

“AI natives require inference that can reliably and efficiently scale with their application. NVIDIA Dynamo 1.0, combined with cutting-edge inference research from Together AI, helps us deliver a high-performance stack to offer accelerated, cost-effective inference for large-scale production workloads.”

Vipul Ved Prakash, cofounder and CEO of Together AI

Pinterest

“Delivering an intuitive, multimodal AI experience to hundreds of millions of users requires real-time intelligence at global scale. As a significant adopter in open source, we’re committed to building scalable AI technologies. With NVIDIA Dynamo optimizing our deployment, we’re expanding the seamless and personalized experiences we deliver, powered by high-performance AI infrastructure.”

Matt Madrigal, CTO of Pinterest

Customer Stories

How Industry Leaders Are Enhancing Model Deployment With the NVIDIA Dynamo Platform

Adopters

Leading Adopters Across All Industries

NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Cost for Agentic AI

Built to accelerate the next generation of agentic AI, NVIDIA Blackwell Ultra delivers breakthrough inference performance with dramatically lower cost. Cloud providers such as Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying NVIDIA GB300 NVL72 systems at scale for low-latency and long-context use cases, such as agentic coding and coding assistants.

This is enabled by deep co-design across NVIDIA Blackwell, NVLink™, and NVLink Switch for scale-out; NVFP4 for low-precision accuracy; and NVIDIA Dynamo and TensorRT™ LLM for speed and flexibility—as well as development with community frameworks SGLang, vLLM, and more.

Explore Key Results

Resources

The Latest in NVIDIA Inference

Blogs
Sessions
Training
Videos

Get the Latest News

Read about the latest inference updates and announcements for NVIDIA Dynamo Inference Server.

See All Inference Blogs

Explore Technical Blogs

Read technical walkthroughs on how to get started with inference.

See All Technical LLM Inference Blogs

Take a Deep Dive

Get tips and best practices for deploying, running, and scaling AI models for inference for generative AI, LLMs, recommender systems, computer vision, and more.

Read Now

View All Blogs

Boosting LLM Inference Performance

Watch our NVIDIA Dynamo Office Hour recording to learn how to optimize LLM serving with NVIDIA Dynamo. Discover how to meet SLAs and boost interactivity and throughput using LLM-aware routing, disaggregated serving, and dynamic autoscaling on open source models and inference backends.

Watch On-Demand Office Hours

Low-Latency Distributed Inference for Scaling LLMs

Learn how to deploy and scale reasoning LLMs using NVIDIA Dynamo. Explore advanced serving techniques like disaggregated prefill and decode, and see how NVIDIA NIM enables fast, production-ready deployment of next-gen AI inference at scale.

Watch On-Demand GTC Session

Kubernetes-Native AI Serving

Discover Grove, a Kubernetes-native solution for orchestrating complex AI inference workloads. Available as part of NVIDIA Dynamo or as a standalone deployment, Grove bridges the gap between AI frameworks and Kubernetes through a powerful API, making scalable, efficient AI inference on Kubernetes easier than ever.

Watch On-Demand Office Hours

View More Sessions

Quick-Start Guide

New to NVIDIA Dynamo and want to deploy your model quickly? Make use of this quick-start guide to begin your NVIDIA Dynamo journey.

Read Now

Tutorials

Getting started with NVIDIA Dynamo can lead to many questions. Explore this repository to familiarize yourself with NVIDIA Dynamo’s features and find guides and examples that can help ease migration.

Read Now

NVIDIA Brev

Unlock NVIDIA GPU power in seconds with NVIDIA Brev—instant access, automatic setup, and flexible deployment on top cloud platforms. Start building and scaling your AI projects right away.

Explore Now

How to Optimize AI Serving With NVIDIA Dynamo AIConfigurator

AIConfigurator takes the guesswork out of disaggregated serving. It recommends the best configurations to meet your performance goals based on your model, GPU budget, and SLOs. In this video, you’ll learn how to get started with AIConfigurator.

Watch Now

Scaling Inference With SGLang and NVIDIA Dynamo

Watch the recorded SGLang × NVIDIA Meetup to explore inference performance at scale with insights from the SGLang and NVIDIA Dynamo teams. Learn about the latest advancements and integration strategies to optimize AI inference in your applications.

Watch Now

Advanced Techniques for Efficient AI Inference

This video dives into the three key levers of AI inference—quality, cost, and speed—and how test-time scaling impacts each. Learn how NVIDIA Dynamo gives you precise control through advanced techniques like disaggregation, KV offloading, and KV routing, empowering you to optimize large model deployments without trade-offs.

Watch Now

View More Videos

Next Steps

Ready to Get Started?

Download on GitHub and join the community!

For Developers

Explore everything you need to start developing with NVIDIA Dynamo, including the latest documentation, tutorials, technical blogs, and more.

Start Developing

Get in Touch

Talk to an NVIDIA product specialist about moving from pilot to production with the security, API stability, and support of NVIDIA AI Enterprise.

Learn How Snapchat Is Using Triton to Enhance the Shopping Experience

See How Triton Model Analyzer Optimizes Model Deployment

Read the Generative AI Performance Analyzer Guide

Read About Serving Model Pipelines on Triton With Ensemble Models

Deploy on Amazon SageMaker

Deploy on Google Vertex AI

Deploy on Azure ML Studio

Deploy on Oracle Cloud

Read the Press Release | Read the Tech Blog

Deploying, Optimizing, and Benchmarking LLMs

Learn how to serve LLMs efficiently with step-by-step instructions. We’ll cover how to easily deploy an LLM across multiple backends and compare their performance, as well as how to fine-tune deployment configurations for optimal performance.

Watch On-Demand GTC Session

Move Enterprise AI Use Cases From Development to Production

Learn what AI inference is, how it fits into your enterprise's AI deployment strategy, what the key challenges in deploying enterprise-grade AI use cases are, why a full-stack AI inference solution is needed to address these challenges, what the main components of a full-stack platform are, and how to deploy your first AI inference solution.

Watch On-Demand Session

Harness the Power of Cloud-Ready AI Inference Solutions

Explore how the NVIDIA AI inferencing platform seamlessly integrates with leading cloud service providers, simplifying deployment and expediting the launch of LLM-powered AI use cases.

Watch On-Demand Session

View More Sessions

NVIDIA LaunchPad

In hands-on labs, experience fast and scalable AI using NVIDIA Dynamo. You’ll be able to immediately unlock the benefits of NVIDIA’s accelerated computing infrastructure and scale your AI workloads.

Explore Now

View All Blogs

Top 5 Reasons Why Dynamo Is Simplifying Inference

NVIDIA Dynamo Inference Server simplifies the deployment of AI models at scale in production, letting teams deploy trained AI models from any framework from local storage or cloud platform on any GPU- or CPU-based infrastructure.

Watch Now

Deploy HuggingFace’s Stable Diffusion Pipeline With Dynamo

This video showcases deploying the Stable Diffusion pipeline available through the HuggingFace diffuser library. We use Dynamo Inference Server to deploy and run the pipeline.

Watch Now

Getting Started With NVIDIA Dynamo Inference Server

Dynamo Inference Server is an open-source inference solution that standardizes model deployment and enables fast and scalable AI in production. Because of its many features, a natural question to ask is, where do I begin? Watch to find out.

Watch Now

View All Blogs