
Cambricon MLU590 (Siyuan 590)#

Product Overview#

Cambricon MLU590 (product name Siyuan 590) is Cambricon's third-generation cloud AI training and inference chip. It was released in 2024, with volume shipments in 2025. Built on a 7nm process with chiplet packaging, it delivers 256 TFLOPS of FP16 and 512 TOPS of INT8 compute. According to the test data cited on this page, it is the first Cambricon chip to surpass NVIDIA H20 in energy efficiency (52.3 vs 49.8 TFLOPS/W).

Positioning: an all-round card for both inference and training. Single-card FP16 compute is 2× that of MLU370 (256 vs 128 TFLOPS), with only about a 50 W increase in typical power consumption (~200 W to 250 W). It is a cost-effective choice for training and serving large models on domestic hardware.

Core Specifications#

| Item | Parameter |
| --- | --- |
| Architecture | 3rd-generation MLU architecture (Da Vinci-like) |
| Process | 7nm (TSMC, estimated) |
| Packaging | Chiplet |
| NPU Core Count | 128 (or 32 large AI cores, depending on the counting method) |
| FP16 | 256 TFLOPS |
| FP32 | ~64 TFLOPS (estimated, 1/4 of FP16) |
| INT8 | 512 TOPS |
| HBM Capacity | 48 GB (estimated, pending official confirmation) |
| HBM Bandwidth | ~400 GB/s (estimated, pending official confirmation) |
| TDP | 250 W (typical) / 300 W (max) |
| Interconnect | MLU-Link 3.0 (8-way high-speed interconnect; up to 16 chips per supercompute node) |
| Board Form Factor | PCIe Gen5 ×16 / OAM |
| Mass Production | 2024 release, 2025 mass shipment |
| Unit Price (Estimated) | ~$8,000–10,000 |

⚠️ Specification Note: HBM capacity and bandwidth are estimates that official sources have not fully disclosed. Defer to Cambricon's official data sheet once it is published.
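The relationships between the headline figures can be checked with a few lines of arithmetic. The sketch below only divides numbers copied from the specification table; the constant and function names are illustrative. Note that peak FP16 divided by TDP comes out at roughly 1 TFLOPS/W, so the 52.3 TFLOPS/W efficiency figure quoted elsewhere on this page is evidently not this ratio, and the page does not state how it is derived.

```python
# Peak figures copied from the specification table (several are estimates).
FP16_TFLOPS = 256.0
FP32_TFLOPS = 64.0      # listed as "estimated, 1/4 of FP16"
INT8_TOPS = 512.0
TDP_TYPICAL_W = 250.0
TDP_MAX_W = 300.0


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


fp32_to_fp16 = ratio(FP32_TFLOPS, FP16_TFLOPS)
int8_to_fp16 = ratio(INT8_TOPS, FP16_TFLOPS)
# Peak FP16 throughput per watt at typical and at maximum board power.
peak_per_watt_typical = ratio(FP16_TFLOPS, TDP_TYPICAL_W)
peak_per_watt_max = ratio(FP16_TFLOPS, TDP_MAX_W)

if __name__ == "__main__":
    print(f"FP32 / FP16 ratio:        {fp32_to_fp16:.2f}")
    print(f"INT8 / FP16 ratio:        {int8_to_fp16:.2f}")
    print(f"Peak FP16 per W (250 W):  {peak_per_watt_typical:.3f} TFLOPS/W")
    print(f"Peak FP16 per W (300 W):  {peak_per_watt_max:.3f} TFLOPS/W")
```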

Comparison with MLU370#

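| Metric | MLU370 | MLU590 | Improvement |
| --- | --- | --- | --- |
| Process | 7nm | 7nm (Chiplet) | Same process, packaging upgrade |
| FP16 | 128 TFLOPS | 256 TFLOPS | +100% (2×) |
| INT8 | 256 TOPS | 512 TOPS | +100% (2×) |
| TDP | ~200 W | 250–300 W | +25–50% |
| Interconnect | MLU-Link 2.0 | MLU-Link 3.0 | Bandwidth improved |
| Energy Efficiency | ~40 TFLOPS/W | 52.3 TFLOPS/W | +31% |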
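The generation-over-generation gains follow directly from the two columns. As a minimal sketch, the helper below recomputes the percentage change for each numeric row; the `pct_change` name and the `rows` dictionary are illustrative, and the MLU370 TDP and efficiency values are the approximate figures from the table.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change going from old to new."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100.0


# (MLU370 value, MLU590 value), copied from the comparison table.
rows = {
    "FP16 (TFLOPS)": (128.0, 256.0),
    "INT8 (TOPS)": (256.0, 512.0),
    "TDP, typical (W)": (200.0, 250.0),   # MLU370 TDP is approximate
    "TDP, max (W)": (200.0, 300.0),
    "Energy efficiency (as quoted)": (40.0, 52.3),
}

improvements = {name: pct_change(old, new) for name, (old, new) in rows.items()}

if __name__ == "__main__":
    for name, value in improvements.items():
        print(f"{name:32s} {value:+.1f}%")
```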

Comparison with Competitors (2024–2025 Domestic)#

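| Metric | MLU590 | NVIDIA H20 | Ascend 910C | Gap |
| --- | --- | --- | --- | --- |
| FP16 | 256 TFLOPS | ~300 TFLOPS | ~780 TFLOPS | -15% vs H20, -67% vs 910C |
| INT8 | 512 TOPS | ~600 TOPS | ~1,600 TOPS | -15% vs H20, -68% vs 910C |
| Energy Efficiency | 52.3 TFLOPS/W | 49.8 TFLOPS/W | Not disclosed | +5% vs H20 |
| Software Ecosystem | Cambricon Neuware | CUDA | CANN | Ecosystem disadvantage |
| Price | ~$8–10K | ~$20K+ | ~$12K | Price advantage |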

Energy efficiency breakthrough: MLU590 reaches 52.3 TFLOPS/W in ResNet-50 training, the first time a Cambricon chip has surpassed H20's 49.8 TFLOPS/W (test data attributed to the Institute of Computing Technology, Chinese Academy of Sciences).
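The gap figures and the price claim in the comparison can be reproduced from the table's own numbers. The sketch below is illustrative only: the `Card` class and helper names are hypothetical, and each price is collapsed to a single assumed point (the midpoint of ~$8–10K for MLU590, the lower bound of ~$20K+ for H20).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    fp16_tflops: float
    price_kusd: float  # assumed single-point price in thousands of USD


CARDS = [
    Card("MLU590", 256.0, 9.0),        # midpoint of ~$8-10K (assumption)
    Card("NVIDIA H20", 300.0, 20.0),   # lower bound of ~$20K+ (assumption)
    Card("Ascend 910C", 780.0, 12.0),
]


def gap_pct(ours: float, theirs: float) -> float:
    """Relative gap of ours against theirs, in percent (negative = behind)."""
    return (ours - theirs) / theirs * 100.0


def tflops_per_kusd(card: Card) -> float:
    """Peak FP16 TFLOPS bought per thousand USD."""
    return card.fp16_tflops / card.price_kusd


baseline = CARDS[0]
gaps = {c.name: gap_pct(baseline.fp16_tflops, c.fp16_tflops) for c in CARDS[1:]}
value = {c.name: tflops_per_kusd(c) for c in CARDS}

if __name__ == "__main__":
    for name, gap in gaps.items():
        print(f"MLU590 FP16 vs {name}: {gap:+.1f}%")
    for name, v in value.items():
        print(f"{name}: {v:.1f} peak FP16 TFLOPS per $1K")
```

On these rough inputs, MLU590 offers more peak FP16 per dollar than H20 but less than Ascend 910C, so the "price advantage" in the table is best read as a lower sticker price rather than the best compute per dollar.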

MLU-Link Interconnect#

| Item | Parameter |
| --- | --- |
| Protocol | MLU-Link 3.0 (Cambricon self-developed) |
| Max Interconnect | 8-way (direct) / 16 chips (supercompute node) |
| vs NVLink 5 | Lower bandwidth, but open standard |
| Cluster Expansion | Supports PyTorch DistributedDataParallel |
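Data-parallel training such as PyTorch DistributedDataParallel synchronizes gradients across devices with an all-reduce. The sketch below simulates a ring all-reduce in plain Python across 8 simulated devices, chosen to match the 8-way figure purely for illustration. It does not call MLU-Link, CNCL, or any Cambricon API, and the source does not say whether the vendor library uses a ring; the function names are hypothetical.

```python
def ring_all_reduce(buffers: list[list[float]]) -> list[list[float]]:
    """Sum-all-reduce equal-length buffers using the ring algorithm."""
    n = len(buffers)
    if n == 0:
        return []
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all buffers must have the same length")
    data = [list(b) for b in buffers]
    if n == 1:
        return data

    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    chunks = [range(bounds[i], bounds[i + 1]) for i in range(n)]

    def exchange(first_chunk_offset: int, accumulate: bool) -> None:
        for step in range(n - 1):
            # Snapshot outgoing payloads so all sends in a step are simultaneous.
            sends = []
            for rank in range(n):
                c = (rank + first_chunk_offset - step) % n
                sends.append((c, [data[rank][i] for i in chunks[c]]))
            for rank, (c, payload) in enumerate(sends):
                dst = (rank + 1) % n  # each rank sends to its right neighbour
                for i, v in zip(chunks[c], payload):
                    data[dst][i] = data[dst][i] + v if accumulate else v

    # Phase 1, reduce-scatter: rank r ends up owning the full sum of chunk r+1.
    exchange(first_chunk_offset=0, accumulate=True)
    # Phase 2, all-gather: the reduced chunks circulate until every rank has all.
    exchange(first_chunk_offset=1, accumulate=False)
    return data


def average_gradients(grads: list[list[float]]) -> list[list[float]]:
    """All-reduce then divide by the device count, as data parallelism does."""
    n = len(grads)
    return [[v / n for v in buf] for buf in ring_all_reduce(grads)]


if __name__ == "__main__":
    simulated = [[float(rank + 1)] * 10 for rank in range(8)]
    print(average_gradients(simulated)[0])
```

Each device sends and receives only one chunk per step, which is why the ring pattern keeps per-link traffic roughly constant as the device count grows.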

Cambricon Neuware Software Stack#

| Layer | Tool | Description |
| --- | --- | --- |
| AI Framework | Neuware runtime | PyTorch / TensorFlow compatible |
| Graph Compiler | BangC Compiler | Similar to XLA, automatic operator fusion |
| Quantization Tool | Neuware quantization tool | INT8 / FP8 post-training quantization |
| Communication Library | CNCL | Collective communication (similar to NCCL) |
| Model Library | ModelZoo | Pre-optimized ResNet / BERT / GPT |
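To make the post-training quantization row concrete, the sketch below shows symmetric per-tensor INT8 quantization in plain Python. It illustrates the general technique only; it is not the vendor tool's API or algorithm, and the function names are hypothetical.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"codes={codes} scale={scale:.6f} worst-case error={worst:.6f}")
```

The rounding error per element is bounded by half the scale, which is why tensors with a few large outliers quantize poorly under a single per-tensor scale.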

Suitable Scenarios#

  • Domestic large model training (below 100B parameters, price-performance advantage)
  • Inference as a Service (energy efficiency surpasses H20)
  • Government/SOE AI projects (supply chain security)
  • Computer vision (ResNet-50 optimized)
  • ❌ Trillion-parameter LLM training (compute disadvantage)
  • ❌ Workloads with a strong CUDA ecosystem dependency (requires migration to Cambricon Neuware)
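A rough sizing check helps show what the parameter-count guidance above means in cards. The sketch below estimates the minimum number of cards needed just to hold a model's weights, assuming the estimated 48 GB of HBM per card from the specification table and an assumed 90% usable fraction. It ignores activations, KV cache, gradients, and optimizer state, so training needs considerably more; all names are illustrative.

```python
import math

HBM_GB_PER_CARD = 48.0   # estimated in the specification table, not confirmed
BYTES_PER_PARAM = {"fp16": 2, "int8": 1}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


def min_cards_for_weights(
    params_billions: float,
    dtype: str,
    hbm_gb: float = HBM_GB_PER_CARD,
    usable_fraction: float = 0.9,  # assumed headroom for runtime buffers
) -> int:
    """Lower bound on the card count needed to hold the weights."""
    usable_gb = hbm_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billions, dtype) / usable_gb)


if __name__ == "__main__":
    for size in (7, 70, 100):
        for dtype in ("fp16", "int8"):
            cards = min_cards_for_weights(size, dtype)
            print(f"{size}B parameters in {dtype}: at least {cards} card(s)")
```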

Product Evolution#

| Product | Release | FP16 TFLOPS | Status |
| --- | --- | --- | --- |
| MLU270 | 2020 | 16 TFLOPS | EOL |
| MLU370 | 2022 | 128 TFLOPS | Current mainstream |
| MLU590 | 2024 | 256 TFLOPS | Current flagship |
| MLU690 | 2025+ | ~512 TFLOPS (estimated) | Next generation |

Key Features#

  • Chiplet packaging: 7nm process with chiplet packaging, optimized for yield and cost
  • Energy efficiency leadership: 52.3 TFLOPS/W, surpasses H20
  • MLU-Link 3.0: 8-way interconnect, supports medium-scale clusters
  • Inference + training all-round: Single card handles both scenarios
  • Weaknesses: FP16 compute is still lower than H20 and 910C; the software ecosystem is about 5 years old, versus 18 years for CUDA
