Etched Sohu (Transformer-dedicated ASIC)

Product Overview

Etched Sohu is billed as the world's first ASIC dedicated to the Transformer architecture. It was announced in June 2024 by Etched AI, a US chip startup founded in 2022. Sohu hard-codes the Transformer attention mechanism into silicon, has no programmable layers, and is designed for inference only (training and fine-tuning are not supported). Each chip carries 144 GB of HBM3E memory and is claimed to reach 62,500 tokens/sec on Llama 70B at batch size=1, about 89× an NVIDIA H100 at the same batch size.

⚠️ Important Limitation: Sohu supports only the dense Transformer attention architecture. It does not support:

  • Multi-modal models (LLaVA, Qwen-VL, and others with vision encoders)
  • Diffusion models (Stable Diffusion, video generation models)
  • MoE models with dynamic expert routing (DeepSeek V4, Mixtral, Qwen3-235B-A22B)
  • SSM/Mamba architecture models
  • Any other model that is not a dense Transformer

Core Specifications

| Item | Parameter |
| --- | --- |
| Architecture | Transformer-dedicated ASIC (non-programmable) |
| Process | TSMC 4nm (estimated) |
| HBM Type | HBM3E |
| HBM Capacity | 144 GB |
| HBM Bandwidth | ~6.03 TB/s (1.8× that of H100 SXM5) |
| Compute (Llama 70B) | 62,500 tokens/sec (batch size=1) |
| 8-chip server performance | 500,000 tokens/sec (Llama 70B) |
| TDP | Not publicly disclosed |
| Form Factor | PCIe (estimated) |
| Release Date | June 2024 |
| Market Status | Not publicly sold (as of April 2026) |
| Availability | Only demonstrated controlled benchmarks to investors |
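
Several figures in this table are related by simple arithmetic. The following Python sketch recomputes them from the table's own values. The variable names are illustrative, the 8-chip figure assumes perfectly linear scaling, and the H100 bandwidth is derived from the stated 1.8× ratio rather than taken from a datasheet.

```python
# Values copied from the Core Specifications table.
SOHU_TOKENS_PER_SEC = 62_500      # Llama 70B, single chip
CHIPS_PER_SERVER = 8
SOHU_HBM_TBPS = 6.03              # approximate HBM bandwidth, TB/s
BANDWIDTH_RATIO_VS_H100 = 1.8     # stated ratio to H100 SXM5

# Assumption: throughput scales linearly with the number of chips.
server_tokens_per_sec = SOHU_TOKENS_PER_SEC * CHIPS_PER_SERVER

# H100 SXM5 bandwidth implied by the stated ratio (derived, not measured).
implied_h100_tbps = SOHU_HBM_TBPS / BANDWIDTH_RATIO_VS_H100

print(f"8-chip server: {server_tokens_per_sec:,} tokens/sec")
print(f"Implied H100 SXM5 bandwidth: {implied_h100_tbps:.2f} TB/s")
```

The 8-chip result matches the table's 500,000 tokens/sec, which suggests the server figure is the per-chip figure multiplied by eight.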

Technical Architecture

Fixed-Function Transformer Unit

Sohu has no general-purpose compute units; all compute resources are dedicated to Transformer attention computation:

| Hardware-Implemented Function | Description |
| --- | --- |
| Attention computation | Direct hardware implementation, no kernel launch overhead |
| QKV projection | Fixed-function unit |
| KV cache processing | Dedicated hardware circuitry |
| Feed-forward network (FFN) | Hard-coded into silicon |
| No scheduler overhead | No OS, no driver, no kernel scheduling |

Architecture Comparison with GPU

| Dimension | NVIDIA H100 | Etched Sohu |
| --- | --- | --- |
| Programmability | ✅ Fully programmable (CUDA) | Completely non-programmable |
| Supported model architectures | All architectures | Only Transformer attention |
| Batch size=1 performance | ~700 tokens/sec | 62,500 tokens/sec (89×) |
| Batch size=32 performance | ~9,000 tokens/sec | Not publicly disclosed (advantage shrinks) |
| Ecosystem | CUDA, vLLM, TensorRT-LLM | Proprietary compiler (extremely high migration cost) |
| Suitable scenarios | All scenarios | Only dense Transformer inference |

Performance Details

Impact of Batch Size on Performance

| Batch Size | H100 Performance | Sohu Performance | Sohu Advantage |
| --- | --- | --- | --- |
| 1 | ~700 tokens/sec | 62,500 tokens/sec | 89× |
| 8 | ~4,000 tokens/sec | Not publicly disclosed | Advantage shrinks |
| 32 | ~9,000 tokens/sec | Not publicly disclosed | Advantage further shrinks |
| >32 | GPU amortizes overhead via batching | Not publicly disclosed | Possibly no advantage |

Key Insight: Sohu's advantage is most significant in batch size=1 (real-time interaction) scenarios. In high-concurrency scenarios (batch size > 32), GPUs amortize overhead via batching, and Sohu's advantage may disappear.
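
One way to see why batching narrows the gap is a toy latency model, sketched here under stated assumptions: each GPU decoding step costs a fixed overhead plus a per-sequence cost, so throughput at batch size b is b / (fixed + per_seq × b). The model and the names `fit_two_point` and `throughput` are illustrative and come from neither Etched nor NVIDIA. The two constants are fitted to the approximate H100 figures quoted above for batch sizes 1 and 8.

```python
def fit_two_point(b1, t1, b2, t2):
    """Fit fixed and per-sequence step cost from two (batch, tokens/sec) points."""
    # Step time implied by each point: fixed + per_seq * b = b / t
    s1, s2 = b1 / t1, b2 / t2
    per_seq = (s2 - s1) / (b2 - b1)
    fixed = s1 - per_seq * b1
    return fixed, per_seq


def throughput(batch, fixed, per_seq):
    """Tokens/sec predicted by the toy model at a given batch size."""
    return batch / (fixed + per_seq * batch)


# Approximate H100 figures quoted in this section (tokens/sec).
fixed, per_seq = fit_two_point(1, 700, 8, 4_000)

for b in (1, 8, 32):
    print(f"batch={b:>2}: ~{throughput(b, fixed, per_seq):,.0f} tokens/sec")

# Stated Sohu figure divided by the H100 batch size=1 figure.
print(f"Sohu vs H100 at batch=1: {62_500 / 700:.0f}x")
```

Under these assumptions the model predicts roughly 8,000 tokens/sec at batch size 32, in the same range as the quoted ~9,000. As the batch grows, the fixed cost is spread over more sequences, which is the amortization effect described above. Sohu's behavior at larger batch sizes is not publicly disclosed, so the sketch models only the GPU side.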

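8-Chip Server Performance

  • Llama 70B: 500,000 tokens/sec (8-chip Sohu server)
  • Comparison: 8× H100 SXM5 server, ~64,000 tokens/sec (batch size=32)
  • Advantage: ~7.8× (500,000 / 64,000). The Sohu figure equals 8 × the per-chip 62,500 tokens/sec quoted at batch size=1, while the H100 figure is at batch size=32, so the two are not measured at the same batch size.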
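
The 7.8× figure can be reproduced from the two bullet values. As a cross-check, the sketch below also multiplies the single-H100 batch size=32 figure from the batch-size table by eight. That second baseline assumes linear scaling and is a derived illustration, not a number stated in the source.

```python
SOHU_SERVER_TPS = 500_000        # 8-chip Sohu server, Llama 70B
H100_SERVER_TPS = 64_000         # 8x H100 SXM5, batch size=32 (approximate)
H100_SINGLE_TPS_B32 = 9_000      # single H100, batch size=32 (approximate)

stated_ratio = SOHU_SERVER_TPS / H100_SERVER_TPS

# Cross-check: eight single-GPU figures added together (assumes linear scaling).
scaled_baseline = 8 * H100_SINGLE_TPS_B32
cross_check_ratio = SOHU_SERVER_TPS / scaled_baseline

print(f"Stated baseline: {stated_ratio:.1f}x")
print(f"Scaled single-GPU baseline ({scaled_baseline:,} tokens/sec): {cross_check_ratio:.1f}x")
```

The two baselines differ (64,000 versus 72,000 tokens/sec), so the advantage lands between roughly 6.9× and 7.8× depending on which approximate H100 figure is used.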

Suitable Scenarios & Limitations

✅ Suitable Scenarios

  • Dense Transformer inference: Llama, Qwen, Mistral, and other standard Transformer models
  • Real-time interactive AI: extremely low latency at batch size=1 (<10ms)
  • High-throughput inference service: an 8-chip server can reach 500,000 tokens/sec (the advantage over GPUs at batch size > 32 is not publicly disclosed)

❌ Unsuitable Scenarios

| Model Type | Example | Sohu Support Status |
| --- | --- | --- |
| Multi-modal models | LLaVA, Qwen-VL | ❌ Not supported |
| Diffusion models | Stable Diffusion, Sora | ❌ Not supported |
| MoE models | DeepSeek V4, Mixtral | ❌ Not supported (dynamic expert routing) |
| SSM/Mamba | Mamba, RWKV | ❌ Not supported |
| Training/fine-tuning | Any training task | ❌ Not supported (inference only) |
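
The support rules in this table can be expressed as a simple pre-deployment check. The sketch below is hypothetical: `ModelTraits`, its fields, and `sohu_blockers` are names invented for illustration and do not correspond to any Etched tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTraits:
    """Hypothetical description of a workload; field names are illustrative."""
    name: str
    has_vision_encoder: bool = False
    is_diffusion: bool = False
    uses_moe_routing: bool = False
    is_ssm: bool = False
    needs_training: bool = False


def sohu_blockers(model: ModelTraits) -> list[str]:
    """Return the reasons from the table that rule the workload out (empty if none)."""
    rules = [
        (model.has_vision_encoder, "multi-modal model with vision encoder"),
        (model.is_diffusion, "diffusion model"),
        (model.uses_moe_routing, "MoE with dynamic expert routing"),
        (model.is_ssm, "SSM/Mamba architecture"),
        (model.needs_training, "training or fine-tuning (inference only)"),
    ]
    return [reason for flag, reason in rules if flag]


for m in (
    ModelTraits("Llama 70B"),
    ModelTraits("Mixtral", uses_moe_routing=True),
    ModelTraits("Qwen-VL", has_vision_encoder=True),
):
    blockers = sohu_blockers(m)
    print(m.name, "->", "supported" if not blockers else "; ".join(blockers))
```

Returning a list of reasons instead of a single boolean makes it clear which row of the table excludes a given workload.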

Comparison with Competitors

| Metric | Etched Sohu | NVIDIA H200 | NVIDIA B200 | Groq LPU |
| --- | --- | --- | --- | --- |
| Batch size=1 performance (tokens/sec) | 62,500 | ~800 | ~1,500 | 80,000+ |
| Batch size=32 performance (tokens/sec) | Not disclosed | ~10,000 | ~18,000 | ~40,000 |
| Programmability | ❌ Non-programmable | ✅ CUDA | ✅ CUDA | ❌ Limited programmability |
| Model architecture support | Only Transformer | All architectures | All architectures | Transformer + partial |
| Ecosystem | Proprietary compiler | CUDA full ecosystem | CUDA full ecosystem | Limited ecosystem |
| Market status | Not yet available | ✅ Available | ✅ Available | ✅ Available |

Company Background & Funding

| Item | Details |
| --- | --- |
| Company Name | Etched AI |
| Founding Date | 2022 |
| Headquarters | California, USA |
| Total Funding | Nearly $1 billion |
| Latest Funding | $500 million (valuation $5 billion) |
| Investors | Not publicly disclosed (prominent VCs) |
| 2nd-Gen Product | More advanced process, targeting inference + prefill-heavy training |

2nd-Generation Product Roadmap

  • More advanced process: Targeting TSMC 3nm or a more advanced node
  • Expanded functionality: Targeting simultaneous support for inference and prefill-heavy training workloads
  • Smaller, lower-power version: Targeting edge inference scenarios
  • Release date: 2027-2028 (estimated)

Launch Date & Availability

  • Official Announcement: June 2024
  • Current Status: Not publicly available for sale (as of April 2026)
  • Availability:
    • Only demonstrated controlled benchmarks to investors
    • No public cloud rental channels
    • No public pricing information
  • Expected availability: H2 2026 or 2027 (estimated)