NVIDIA Rubin CPX (Context Processing Unit)#

Product Overview#

NVIDIA Rubin CPX (full name Rubin Context Processing Unit) is the first GPU designed specifically for ultra-long-context AI inference. NVIDIA announced it on September 9, 2025, with shipments planned for H2 2026. It uses a monolithic design with 128 GB of GDDR7 memory and delivers 30 PFLOPS of FP4 compute. Its memory bandwidth is only 2 TB/s, which is a deliberate trade-off: the chip is optimized for the context-processing stage, which is compute-bound, rather than the generation stage, which is memory-bandwidth-bound.

Rubin CPX works alongside the Rubin GPU (which handles the generation stage) and the Vera CPU (which handles scheduling) to form a decoupled inference architecture. In the Vera Rubin NVL144 CPX rack, 144 Rubin CPX GPUs, 144 Rubin GPUs, and 36 Vera CPUs deliver 8 EFLOPS of total compute, 7.5× that of the GB300 NVL72.

Core Specifications#

| Item | Parameter |
| --- | --- |
| Architecture | Rubin (CPX dedicated variant) |
| Process | TSMC N3P (estimated) |
| Packaging | Monolithic, non-MCM |
| Memory | 128 GB GDDR7 (consumer-grade memory, not HBM) |
| Memory Bandwidth | 2 TB/s |
| FP4 (NVFP4) | 30 PFLOPS (sparse, official claim) |
| FP8 / FP16 | Not publicly disclosed |
| Attention Acceleration | 3× (vs GB300 NVL72) |
| TDP | ~500–600 W (estimated, pending official confirmation) |
| Board Form Factor | Independent GPU (paired with Rubin GPU) |
| Announcement Date | 2025-09-09 |
| Shipment Date | H2 2026 |

⚠️ Design Philosophy: CPX's 2 TB/s bandwidth is far lower than that of HBM-based parts (B200: 8 TB/s, Rubin R200: 22 TB/s). This is intentional: the context-processing stage is limited by compute, whereas it is the generation stage that is limited by memory bandwidth. Using lower-bandwidth GDDR7 instead of HBM significantly reduces cost.
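
The compute-bound versus bandwidth-bound distinction can be made concrete with a roofline-style estimate. The sketch below is illustrative only: the peak compute and bandwidth come from the specification table, while `weight_reuse_intensity`, the 0.5 bytes per FP4 weight, and the 8,192-token chunk size are assumptions introduced here, and the model ignores activation and KV-cache traffic.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Roofline bound: limited by either peak compute or memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


def weight_reuse_intensity(tokens_per_pass: int,
                           bytes_per_weight: float = 0.5) -> float:
    """Simplified model (assumption): each weight is read once per pass and
    contributes 2 FLOPs (multiply + add) per token; activations are ignored."""
    return 2 * tokens_per_pass / bytes_per_weight


CPX_PEAK_FLOPS = 30e15  # 30 PFLOPS NVFP4 (sparse), from the spec table
CPX_BANDWIDTH = 2e12    # 2 TB/s GDDR7, from the spec table

cpx_ridge = ridge_point(CPX_PEAK_FLOPS, CPX_BANDWIDTH)
decode_intensity = weight_reuse_intensity(1)      # one token per pass
prefill_intensity = weight_reuse_intensity(8192)  # hypothetical 8K-token chunk

print(f"ridge point: {cpx_ridge:,.0f} FLOP/byte")
print(f"decode intensity:  {decode_intensity:,.0f} FLOP/byte, "
      f"compute-bound: {decode_intensity >= cpx_ridge}")
print(f"prefill intensity: {prefill_intensity:,.0f} FLOP/byte, "
      f"compute-bound: {prefill_intensity >= cpx_ridge}")
```

Under these assumptions, single-token decoding sits far below the ridge point and cannot use the available compute, while a large prefill chunk sits above it, which is the workload the low-bandwidth design targets.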

Decoupled Inference Architecture#

| Stage | Execution Unit | Bottleneck Type | Role |
| --- | --- | --- | --- |
| Context Stage (Context / Pre-fill) | Rubin CPX | Compute-bound | Process 1M+ token input |
| Generation Stage (Generation / Decode) | Rubin GPU | Memory-bandwidth-bound | Token-by-token output generation |
| Scheduling / Preprocessing | Vera CPU | I/O-bound | Request scheduling + KV cache management |

Traditional approach (GB200/B200): the same GPU handles both stages, so its high-bandwidth memory is wasted during the compute-bound context stage. CPX approach: a dedicated CPX handles the context stage while the Rubin GPU focuses on generation, improving total throughput by 6.5× (NVIDIA official data).
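
The hand-off between the two stages can be sketched as a toy pipeline. This is not NVIDIA Dynamo's API or any real serving stack: `KVCache`, `PrefillWorker`, `DecodeWorker`, and `serve` are hypothetical names, and the "model" is a placeholder arithmetic rule chosen only so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the per-request attention state passed between stages."""
    entries: list[int] = field(default_factory=list)


class PrefillWorker:
    """Plays the context-stage role: consumes the whole prompt in one pass."""

    def run(self, prompt: list[int]) -> KVCache:
        return KVCache(entries=list(prompt))


class DecodeWorker:
    """Plays the generation-stage role: emits one token per step."""

    def run(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        output = []
        for _ in range(max_new_tokens):
            # Toy rule, not a real language model: every step rereads the
            # entire cache, which is why decode stresses memory bandwidth.
            token = sum(cache.entries) % 100
            cache.entries.append(token)
            output.append(token)
        return output


def serve(prompt: list[int], max_new_tokens: int,
          prefill: PrefillWorker, decode: DecodeWorker) -> list[int]:
    """Scheduler role: route the request through prefill, then decode."""
    cache = prefill.run(prompt)               # context stage
    return decode.run(cache, max_new_tokens)  # generation stage


result = serve([1, 2, 3], 3, PrefillWorker(), DecodeWorker())
print(result)
```

Because the two workers share only the cache object, each stage could in principle run on different hardware, which is the property the decoupled design relies on.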

Vera Rubin NVL144 CPX Rack#

| Item | Parameter |
| --- | --- |
| CPX GPU Count | 144 |
| Rubin GPU Count | 144 |
| Vera CPU Count | 36 |
| Total FP4 Compute | 8 EFLOPS |
| Total High-Bandwidth Memory | 100 TB |
| Total Memory Bandwidth | 1.7 PB/s |
| vs GB300 NVL72 | Compute 7.5×, memory capacity ~14× |
| Networking | Quantum-X800 InfiniBand / Spectrum-X Ethernet + ConnectX-9 |
| Scheduling Software | NVIDIA Dynamo platform |
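
As a rough cross-check, the CPX portion of the rack totals can be derived from the per-chip figures in this document. The sketch below only multiplies published numbers; treating the remainder as the Rubin GPU contribution is an assumption, since the source does not break the totals down.

```python
CPX_COUNT = 144
CPX_PFLOPS = 30      # per-chip NVFP4 (sparse), from the spec table
CPX_MEMORY_GB = 128  # per-chip GDDR7
CPX_BW_TBS = 2       # per-chip bandwidth, TB/s

RACK_EFLOPS = 8      # rack-level totals from the table above
RACK_BW_TBS = 1700   # 1.7 PB/s expressed in TB/s

cpx_eflops = CPX_COUNT * CPX_PFLOPS / 1000
cpx_memory_tb = CPX_COUNT * CPX_MEMORY_GB / 1000
cpx_bw_tbs = CPX_COUNT * CPX_BW_TBS

cpx_compute_share = cpx_eflops / RACK_EFLOPS
cpx_bandwidth_share = cpx_bw_tbs / RACK_BW_TBS

# Assumption: whatever the CPX chips do not supply comes from the Rubin GPUs.
remaining_eflops = RACK_EFLOPS - cpx_eflops

print(f"CPX compute:   {cpx_eflops:.2f} EFLOPS ({cpx_compute_share:.0%} of rack)")
print(f"CPX memory:    {cpx_memory_tb:.1f} TB GDDR7")
print(f"CPX bandwidth: {cpx_bw_tbs} TB/s ({cpx_bandwidth_share:.0%} of rack)")
print(f"Remaining compute: {remaining_eflops:.2f} EFLOPS")
```

On these figures the CPX chips supply roughly half of the rack's FP4 compute but only a small fraction of its memory bandwidth, consistent with the split between compute-bound and bandwidth-bound stages.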

Comparison with Rubin R200#

| Metric | Rubin CPX | Rubin R200 (Training GPU) |
| --- | --- | --- |
| Positioning | Dedicated to the inference context stage | General-purpose (training + inference) |
| Memory | 128 GB GDDR7 | 288 GB HBM4 |
| Bandwidth | 2 TB/s | 22 TB/s |
| FP4 Compute | 30 PFLOPS | 50 PFLOPS (sparse) |
| Design | Monolithic | 6-chip MCM |
| TDP | ~500–600 W (estimated) | ~1,800 W |
| Cost | Low (GDDR7 vs HBM4) | High |
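
The table's figures can be turned into two derived ratios: compute per kilowatt and compute per byte of memory bandwidth. This is a back-of-the-envelope sketch; the TDP values are the estimates quoted above, not confirmed specifications, and the helper names are introduced here for illustration.

```python
def pflops_per_kw(pflops: float, watts: float) -> float:
    """FP4 compute delivered per kilowatt of board power."""
    return pflops * 1000 / watts


def flop_per_byte(pflops: float, bandwidth_tbs: float) -> float:
    """Peak compute available per byte of memory traffic (FLOP/byte)."""
    return (pflops * 1e15) / (bandwidth_tbs * 1e12)


# Rubin CPX: 30 PFLOPS, 2 TB/s, estimated TDP range of 500 to 600 W
cpx_kw_worst = pflops_per_kw(30, 600)
cpx_kw_best = pflops_per_kw(30, 500)
cpx_balance = flop_per_byte(30, 2)

# Rubin R200: 50 PFLOPS (sparse), 22 TB/s, ~1,800 W
r200_kw = pflops_per_kw(50, 1800)
r200_balance = flop_per_byte(50, 22)

print(f"CPX:  {cpx_kw_worst:.0f} to {cpx_kw_best:.0f} PFLOPS/kW, "
      f"{cpx_balance:,.0f} FLOP/byte")
print(f"R200: {r200_kw:.1f} PFLOPS/kW, {r200_balance:,.0f} FLOP/byte")
```

If the estimated TDP holds, CPX offers roughly twice the FP4 compute per kilowatt of R200, while R200 has far more bandwidth per unit of compute, which matches their respective roles.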

Suitable Scenarios#

  • Ultra-long-context inference (1M+ tokens, code generation, video understanding)
  • Multi-turn dialog systems (context-stage throughput is critical)
  • RAG (Retrieval-Augmented Generation) with large document inputs
  • Inference-dedicated clusters (separated from training clusters)
  • ❌ Large-scale model training (not a target scenario)
  • ❌ High-bandwidth-demand workloads (the generation stage is still handled by the Rubin GPU)

Return on Investment (ROI)#

NVIDIA official data (SemiAnalysis citation):

  • Single rack (NVL144 CPX): ~$50M
  • Annual revenue contribution: $1.5B–$2.5B (inference as a service)
  • ROI: 30–50×
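
The quoted multiple follows directly from the two figures above. The sketch below assumes "ROI" here means annual revenue divided by rack purchase price; it ignores power, hosting, and other operating costs, which the source does not quantify.

```python
RACK_COST_USD = 50e6            # ~$50M per NVL144 CPX rack
ANNUAL_REVENUE_LOW_USD = 1.5e9  # lower bound of the quoted revenue range
ANNUAL_REVENUE_HIGH_USD = 2.5e9 # upper bound of the quoted revenue range

# Assumption: ROI is treated as revenue over purchase price, not net profit.
roi_low = ANNUAL_REVENUE_LOW_USD / RACK_COST_USD
roi_high = ANNUAL_REVENUE_HIGH_USD / RACK_COST_USD

print(f"Revenue multiple: {roi_low:.0f}x to {roi_high:.0f}x")
```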

References#