AI Inference

NVIDIA Dynamo

Scale and serve AI inference—fast.

Overview

The Operating System of AI

Efficiently serving today’s frontier language models often requires resources that exceed the capacity of a single GPU—or even an entire node—making distributed, multi-node deployment essential for AI inference.

NVIDIA Dynamo is an open source, distributed inference-serving framework built to deploy models in multi-node environments at data center scale. It supports open source inference engines—including SGLang, NVIDIA TensorRT™ LLM, and vLLM—and simplifies the complexities of distributed serving by disaggregating inference phases across different GPUs, intelligently routing requests to the appropriate GPU to avoid redundant computation, and extending GPU memory through data caching to cost-effective storage tiers.

NVIDIA NIM™ microservices will include NVIDIA Dynamo capabilities, providing a quick and easy deployment option. NVIDIA Dynamo will also be supported and available with NVIDIA AI Enterprise.

What Is Distributed Inference?

Distributed inference is the process of running AI model inference across multiple computing devices or nodes to maximize throughput by parallelizing computations. 

This approach enables efficient scaling for large-scale AI applications, such as generative AI, by distributing workloads across GPUs or cloud infrastructure. Distributed inference improves overall performance and resource utilization by allowing users to optimize latency and throughput for the unique requirements of each workload.

A Closer Look at NVIDIA Dynamo

Low-latency distributed inference framework for scaling reasoning AI models.

Independent benchmarks show that NVIDIA GB200 NVL72 combined with NVIDIA Dynamo improves mixture-of-experts model throughput by up to 15x compared to NVIDIA Hopper-based systems.

Independent benchmarks show that NVIDIA Dynamo combined with wide expert parallel on NVIDIA GB200 NVL72 improves mixture-of-experts (MoE) model throughput by up to 7x compared to NVIDIA B200-based systems.

The GB200 NVL72 connects 72 GPUs via high-speed NVIDIA NVLink™, enabling low-latency expert communication critical for MoE reasoning models. NVIDIA Dynamo enhances efficiency through disaggregated inference, splitting prefill and decode phases across nodes for independent optimization. Together, GB200 NVL72 and NVIDIA Dynamo form a high-performance stack optimized for large-scale MoE inference.

Features

Explore the Features of NVIDIA Dynamo

Disaggregated Serving

Separates large language model (LLM) context and generation phases across distinct GPUs, enabling independent GPU allocation and optimization to increase requests served per GPU.
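As a rough illustration of the split described above, the sketch below models the two phases as separate functions that hand off a KV cache. It is a toy under stated assumptions, not the Dynamo API: `KVCache`, `prefill`, and `decode` are hypothetical names, and the arithmetic "next token" rule is a placeholder standing in for a real model step.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    # Hypothetical stand-in for the attention state built from the prompt.
    tokens: list


def prefill(prompt_tokens):
    """Context phase: process the whole prompt once and build the cache."""
    return KVCache(tokens=list(prompt_tokens))


def decode(kv, max_new_tokens):
    """Generation phase: emit one token per step, extending the cache."""
    generated = []
    for _ in range(max_new_tokens):
        # Placeholder rule in place of a real forward pass.
        next_token = (sum(kv.tokens) + len(kv.tokens)) % 100
        kv.tokens.append(next_token)
        generated.append(next_token)
    return generated


# In a disaggregated setup the two calls would run on different GPUs,
# with the cache transferred between them.
kv = prefill([1, 2, 3])
new_tokens = decode(kv, max_new_tokens=2)
print(new_tokens)
```

Because the only thing passed between the phases is the cache, each phase can be given its own pool of GPUs and sized independently.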

LLM-Aware Router

Routes inference traffic efficiently, minimizing costly recomputation of repeat or overlapping requests to preserve compute resources while ensuring balanced load distribution across large GPU fleets.
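One way to picture cache-aware routing is a score that rewards prefix reuse and penalizes load. The sketch below is illustrative only and does not reflect Dynamo's actual router: `Worker`, `pick_worker`, and the `load_penalty` weight are hypothetical, and real systems track cached blocks rather than whole token tuples.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    # Token prefixes whose KV cache this worker is assumed to hold.
    cached_prefixes: list = field(default_factory=list)


def shared_prefix_len(a, b):
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(workers, prompt_tokens, load_penalty=2.0):
    """Pick the worker with the best trade-off of cache reuse vs. load."""
    def score(worker):
        reuse = max(
            (shared_prefix_len(p, prompt_tokens) for p in worker.cached_prefixes),
            default=0,
        )
        return reuse - load_penalty * worker.active_requests

    return max(workers, key=score)


workers = [
    Worker("gpu-0", active_requests=1, cached_prefixes=[(1, 2, 3, 4, 5, 6)]),
    Worker("gpu-1", active_requests=0, cached_prefixes=[(9, 9)]),
]
print(pick_worker(workers, (1, 2, 3, 4, 5, 6, 7, 8)).name)  # long shared prefix
print(pick_worker(workers, (7, 8)).name)                    # no reuse anywhere
```

With these inputs the first prompt goes to the busier worker because six cached tokens outweigh its load, while the second prompt, which matches no cache, goes to the idle worker.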

KV Caching to Storage

Instantly offloads KV cache from limited GPU memory to scalable, cost-efficient storage, such as CPU RAM, local SSDs, or network storage.
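The idea of offloading instead of discarding can be sketched with a small two-tier cache. This is a minimal model under assumptions, not Dynamo's implementation: `TieredKVCache` is a hypothetical class, the "gpu" tier is an LRU-ordered dictionary, and the "host" dictionary stands in for CPU RAM, SSD, or network storage.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier cache: a small fast tier backed by a larger slow tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> KV data, kept in LRU order
        self.host = {}            # stand-in for cheaper storage tiers

    def put(self, block_id, kv_block):
        self.gpu[block_id] = kv_block
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            # Offload the least recently used block instead of dropping it.
            old_id, old_block = self.gpu.popitem(last=False)
            self.host[old_id] = old_block

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.host:
            # Promote back to the fast tier; no recomputation is needed.
            kv_block = self.host.pop(block_id)
            self.put(block_id, kv_block)
            return kv_block
        return None  # full miss: the caller would have to recompute


cache = TieredKVCache(gpu_capacity=2)
for i in range(3):
    cache.put(f"block-{i}", f"kv-{i}")
print(sorted(cache.gpu), sorted(cache.host))
```

A block that falls out of the fast tier remains retrievable from the slower tier, which is what lets a later request with the same context skip the prefill work.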

Topology-Optimized Kubernetes Serving (Grove)

Enables efficient scaling and declarative startup ordering of interdependent AI inference components in single-node and multi-node setups using a unified Kubernetes custom resource.

GPU Planner

Monitors GPU capacity in distributed inference environments and dynamically allocates GPU workers across context and generation phases to resolve bottlenecks and optimize performance.
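A simple way to think about this kind of rebalancing is a proportional split driven by queue depth. The function below is a hypothetical sketch, not the planner's real policy: `plan_workers`, its parameters, and the use of queue length as the only signal are all assumptions made for illustration.

```python
def plan_workers(total_gpus, prefill_queue, decode_queue, min_per_phase=1):
    """Split GPUs between prefill and decode in proportion to queued work."""
    if total_gpus < 2 * min_per_phase:
        raise ValueError("not enough GPUs to cover both phases")
    backlog = prefill_queue + decode_queue
    if backlog == 0:
        prefill = total_gpus // 2  # nothing queued: split evenly
    else:
        prefill = round(total_gpus * prefill_queue / backlog)
    # Keep at least min_per_phase workers on each side.
    prefill = max(min_per_phase, min(prefill, total_gpus - min_per_phase))
    return {"prefill": prefill, "decode": total_gpus - prefill}


print(plan_workers(8, prefill_queue=30, decode_queue=10))
print(plan_workers(8, prefill_queue=100, decode_queue=0))
```

Rerunning such a function on fresh metrics shifts workers toward whichever phase is backing up, while the floor keeps either phase from being starved entirely.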

Low-Latency Communication Library (NIXL)

Accelerates data movement in distributed inference settings while simplifying transfer complexities across diverse hardware, including GPUs, CPUs, networks, and storage.

AIConfigurator

Removes the guesswork from disaggregated serving clusters by recommending optimal prefill and decode configs and model parallel strategies tailored to the model, GPU budget, and SLOs.

AIPerf

Benchmarks generative AI model performance across any inference solution, with detailed metrics via command-line output and in-depth performance reports.

Accelerate Distributed Inference

NVIDIA Dynamo is fully open source, giving you complete transparency and flexibility. Deploy NVIDIA Dynamo, contribute to its growth, and seamlessly integrate it into your existing stack.

Check it out on GitHub and join the community!

Benefits

The Benefits of NVIDIA Dynamo

Seamlessly Scale From One GPU to Thousands of GPUs

Streamline and automate GPU cluster setup with prebuilt, easy-to-deploy tools and enable dynamic autoscaling with real-time LLM-specific metrics, avoiding over- or under-provisioning of GPU resources.

Increase Inference Serving Capacity While Reducing Costs

Leverage advanced LLM inference serving optimizations like disaggregated serving and topology-aware autoscaling to increase the number of inference requests served without compromising user experience.

Future-Proof Your AI Infrastructure and Avoid Costly Migrations

Open and modular design allows you to easily pick and choose the inference-serving components that suit your unique needs, ensuring compatibility with your existing AI stack and avoiding costly migration projects.

Accelerate Time to Deploy New AI Models in Production

NVIDIA Dynamo’s support for all major frameworks, including NVIDIA TensorRT LLM, vLLM, SGLang, PyTorch, and more, lets you quickly deploy new generative AI models regardless of their backend.

Dynamo Ecosystem Partners

Alibaba
AstraZeneca
AWS
Baseten
BlackRock
ByteDance
Cineca
Cloudian
Cognition
CoreWeave
DDN
Dell Technologies
Everpure
Gcore
Google
Google Cloud
Harmonic
Hitachi Vantara
HPE
IBM
Intel
Lalamove
LMCache
llm-d
Meituan
Microsoft Azure
Mooncake
Nebius
NetApp
Oracle
PayPal
Pinterest
PPIO
Prime Intellect
Ray Serve
Rednote
SGL
SkyPilot
SoftBank
Tencent
TikTok Ecommerce
Together AI
VAST
Volcengine
vLLM
WEKA
WPS
Zouyeoung

Use Cases

Deploying AI with NVIDIA Dynamo

Find out how you can drive innovation with NVIDIA Dynamo.

Serving Reasoning Models

Reasoning models generate more tokens to solve complex problems, increasing inference costs. NVIDIA Dynamo optimizes these models with features like disaggregated serving. This approach separates the prefill and decode computational phases across distinct GPUs, allowing AI inference teams to optimize each phase independently. The result is better resource utilization, more queries served per GPU, and lower inference costs. When combined with the NVIDIA GB200 NVL72, these gains compound, boosting performance by up to 15x.

Customer Testimonials

What Are Industry Leaders Saying About NVIDIA Dynamo?

CoreWeave

“As AI moves from experimental pilots to continuous, large-scale production, the underlying infrastructure must be as dynamic as the models it supports. Supporting NVIDIA Dynamo allows us to offer a more seamless, resilient environment for deploying complex AI agents. This foundation provides the durability and high-performance orchestration required to move the industry’s most ambitious agentic workloads into global production.”

Chen Goldberg, EVP of Product & Engineering at CoreWeave

Together AI

“AI Natives require inference that can reliably and efficiently scale with their application. NVIDIA Dynamo 1.0, combined with cutting-edge inference research from Together AI, helps us deliver a high performance stack to offer accelerated, cost-effective inference for large scale production workloads.”

Vipul Ved Prakash, cofounder and CEO of Together AI

Pinterest

“Delivering an intuitive, multimodal AI experience to hundreds of millions of users requires real-time intelligence at global scale. As a significant adopter in open source, we’re committed to building scalable AI technologies. With NVIDIA Dynamo optimizing our deployment, we’re expanding the seamless and personalized experiences we deliver, powered by high-performance AI infrastructure.”

Matt Madrigal, CTO of Pinterest

Customer Stories

How Industry Leaders Are Enhancing Model Deployment With the NVIDIA Dynamo Platform

Adopters

Leading Adopters Across All Industries

Amazon
American Express
Azure AI Translator
Encord
GE Healthcare
Infosys
Intelligent Voice
NIO
Siemens Energy
Trax Retail
USPS
Yahoo Japan

NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Cost for Agentic AI

Built to accelerate the next generation of agentic AI, NVIDIA Blackwell Ultra delivers breakthrough inference performance with dramatically lower cost. Cloud providers such as Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying NVIDIA GB300 NVL72 systems at scale for low-latency and long-context use cases, such as agentic coding and coding assistants.

This is enabled by deep co-design across NVIDIA Blackwell, NVLink™, and NVLink Switch for scale-out; NVFP4 for low-precision accuracy; and NVIDIA Dynamo and TensorRT™ LLM for speed and flexibility—as well as development with community frameworks SGLang, vLLM, and more.

Resources

The Latest in NVIDIA Inference

Get the Latest News

Read about the latest inference updates and announcements for NVIDIA Dynamo Inference Server.

Explore Technical Blogs

Read technical walkthroughs on how to get started with inference.

Take a Deep Dive

Get tips and best practices for deploying, running, and scaling AI models for inference for generative AI, LLMs, recommender systems, computer vision, and more.

Next Steps

Ready to Get Started?

Download on GitHub and join the community!

For Developers

Explore everything you need to start developing with NVIDIA Dynamo, including the latest documentation, tutorials, technical blogs, and more.

Get in Touch

Talk to an NVIDIA product specialist about moving from pilot to production with the security, API stability, and support of NVIDIA AI Enterprise.

Read the Press Release | Read the Tech Blog
