Etched Sohu (Transformer-dedicated ASIC)
Product Overview
Etched Sohu is the world's first ASIC dedicated to the Transformer architecture, announced in June 2024 by the US chip startup Etched AI (founded 2022). Sohu hard-codes the Transformer computation, including the attention mechanism, into silicon with no programmable layers. It is designed for inference only and does not support training or fine-tuning. Each chip carries 144 GB of HBM3E memory and is reported to reach 62,500 tokens/sec on Llama 70B at batch size=1, roughly 89× an NVIDIA H100 at the same batch size (~700 tokens/sec).
⚠️ Important Limitation: Sohu supports only the dense Transformer architecture. It does not support:
- Multi-modal models with vision encoders (LLaVA, Qwen-VL, etc.)
- Diffusion models (Stable Diffusion, video generation models)
- MoE models with dynamic expert routing (DeepSeek V4, Mixtral, Qwen3-235B-A22B)
- SSM/Mamba architecture models
- Any other model that is not a dense Transformer
Core Specifications
| Item | Parameter |
|---|---|
| Architecture | Transformer-dedicated ASIC (non-programmable) |
| Process | TSMC 4nm (estimated) |
| HBM Type | HBM3E |
| HBM Capacity | 144 GB |
| HBM Bandwidth | ~6.03 TB/s (1.8× that of H100 SXM5) |
| Throughput (Llama 70B) | 62,500 tokens/sec (batch size=1) |
| 8-chip server performance | 500,000 tokens/sec (Llama 70B) |
| TDP | Not publicly disclosed |
| Form Factor | PCIe (estimated) |
| Release Date | June 2024 |
| Market Status | Not publicly sold (as of April 2026) |
| Availability | Only demonstrated controlled benchmarks to investors |
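The derived figures in the table can be cross-checked with simple arithmetic. The sketch below assumes an H100 SXM5 memory bandwidth of 3.35 TB/s (NVIDIA's published figure, not stated in this document); the constant names are illustrative.

```python
# Cross-check the derived figures in the specification table.
# Assumption: H100 SXM5 HBM3 bandwidth is 3.35 TB/s (NVIDIA's published spec,
# not taken from this document).
SOHU_BANDWIDTH_TBS = 6.03
H100_SXM5_BANDWIDTH_TBS = 3.35
SOHU_TOKENS_PER_SEC = 62_500  # per chip, Llama 70B, batch size=1
CHIPS_PER_SERVER = 8

# Ratio quoted in the "HBM Bandwidth" row.
bandwidth_ratio = SOHU_BANDWIDTH_TBS / H100_SXM5_BANDWIDTH_TBS

# The 8-chip figure equals the per-chip figure times 8, i.e. it implies
# perfectly linear scaling across chips.
server_tokens_per_sec = SOHU_TOKENS_PER_SEC * CHIPS_PER_SERVER

print(f"Bandwidth ratio vs H100 SXM5: {bandwidth_ratio:.1f}x")
print(f"8-chip server throughput: {server_tokens_per_sec:,} tokens/sec")
```

Note that the 500,000 tokens/sec server figure is exactly eight times the per-chip number, so it presupposes no multi-chip scaling loss.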
Technical Architecture
Sohu has no general-purpose compute units; all compute resources are dedicated to Transformer operations:
| Hardware-Implemented Function | Description |
|---|---|
| Attention computation | Direct hardware implementation, no kernel launch overhead |
| QKV projection | Fixed-function unit |
| KV cache processing | Dedicated hardware circuitry |
| Feed-forward network (FFN) | Hard-coded into silicon |
| No scheduler overhead | No OS, no driver, no kernel scheduling |
Architecture Comparison with GPU
| Dimension | NVIDIA H100 | Etched Sohu |
|---|---|---|
| Programmability | ✅ Fully programmable (CUDA) | ❌ Completely non-programmable |
| Supported model architectures | All architectures | Only Transformer attention |
| Batch size=1 performance | ~700 tokens/sec | 62,500 tokens/sec (89×) |
| Batch size=32 performance | ~9,000 tokens/sec | Not publicly disclosed (advantage shrinks) |
| Ecosystem | CUDA, vLLM, TensorRT-LLM | Proprietary compiler (extremely high migration cost) |
| Suitable scenarios | All scenarios | Only dense Transformer inference |
| Batch Size | H100 Performance | Sohu Performance | Sohu Advantage |
|---|---|---|---|
| 1 | ~700 tokens/sec | 62,500 tokens/sec | 89× |
| 8 | ~4,000 tokens/sec | Not publicly disclosed | Advantage shrinks |
| 32 | ~9,000 tokens/sec | Not publicly disclosed | Advantage further shrinks |
| >32 | GPU amortizes overhead via batching | Not publicly disclosed | Possibly no advantage |
Key Insight: Sohu's advantage is most significant in batch size=1 (real-time interaction) scenarios. In high-concurrency scenarios (batch size > 32), GPUs amortize overhead via batching, and Sohu's advantage may disappear.
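The ratios in the batch-size table can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures quoted above; undisclosed Sohu numbers are kept as `None` rather than guessed, and the function and variable names are illustrative.

```python
from typing import Optional

# (batch size, H100 tokens/sec, Sohu tokens/sec or None when undisclosed)
BENCHMARKS = [
    (1, 700, 62_500),
    (8, 4_000, None),
    (32, 9_000, None),
]


def speedup(h100_tps: float, sohu_tps: Optional[float]) -> Optional[float]:
    """Return the Sohu/H100 throughput ratio, or None if Sohu's figure is undisclosed."""
    if sohu_tps is None:
        return None
    return sohu_tps / h100_tps


for batch, h100_tps, sohu_tps in BENCHMARKS:
    ratio = speedup(h100_tps, sohu_tps)
    label = f"{ratio:.0f}x" if ratio is not None else "undisclosed"
    print(f"batch={batch:>2}: H100 {h100_tps:>5} tok/s, Sohu speedup {label}")

# How much the H100's own throughput grows from batch size 1 to 32,
# which is the amortization effect described in the Key Insight.
h100_batching_gain = 9_000 / 700
print(f"H100 gain from batching (1 -> 32): {h100_batching_gain:.1f}x")
```

The H100's throughput rises by roughly 13× between batch size 1 and 32, which is why the 89× figure cannot be extrapolated to batched serving without Sohu numbers at the same batch sizes.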
- Llama 70B: 500,000 tokens/sec (8-chip Sohu server)
- Comparison: an 8× H100 SXM5 server reaches ~64,000 tokens/sec (batch size=32)
- Advantage: about 7.8× (500,000 / 64,000, at high batch size)
Suitable Scenarios & Limitations
✅ Suitable Scenarios
- Dense Transformer inference: Llama, Qwen, Mistral, and other standard Transformer models
- Real-time interactive AI: extremely low latency at batch size=1 (<10 ms)
- High-concurrency inference service: an 8-chip server can reach 500,000 tokens/sec
❌ Unsuitable Scenarios
| Model Type | Example | Sohu Support Status |
|---|---|---|
| Multi-modal models | LLaVA, Qwen-VL | ❌ Not supported |
| Diffusion models | Stable Diffusion, Sora | ❌ Not supported |
| MoE models | DeepSeek V4, Mixtral | ❌ Not supported (dynamic expert routing) |
| SSM/Mamba | Mamba, RWKV | ❌ Not supported |
| Training/fine-tuning | Any training task | ❌ Not supported (inference only) |
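The support table can be read as a simple screening rule for candidate workloads. The sketch below encodes it that way; the `Workload` class, its fields, and `sohu_blockers` are hypothetical names made up for illustration and do not correspond to any Etched tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a model workload; field names are illustrative."""
    name: str
    dense_transformer: bool          # standard dense Transformer model
    has_vision_encoder: bool = False
    uses_moe_routing: bool = False
    training: bool = False


def sohu_blockers(w: Workload) -> list[str]:
    """Return the reasons, taken from the table above, that a workload would not fit Sohu."""
    reasons = []
    if w.training:
        reasons.append("training/fine-tuning (inference only)")
    if w.has_vision_encoder:
        reasons.append("multi-modal vision encoder")
    if w.uses_moe_routing:
        reasons.append("dynamic expert routing (MoE)")
    if not w.dense_transformer:
        reasons.append("not a dense Transformer (e.g. diffusion, SSM/Mamba)")
    return reasons


workloads = [
    Workload("Llama 70B inference", dense_transformer=True),
    Workload("Mixtral inference", dense_transformer=False, uses_moe_routing=True),
    Workload("Stable Diffusion", dense_transformer=False),
    Workload("Llama 70B fine-tuning", dense_transformer=True, training=True),
]

for w in workloads:
    blockers = sohu_blockers(w)
    status = "suitable" if not blockers else "unsupported: " + "; ".join(blockers)
    print(f"{w.name}: {status}")
```

Only the first workload passes; every other entry trips at least one row of the table.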
Comparison with Competitors
| Metric | Etched Sohu | NVIDIA H200 | NVIDIA B200 | Groq LPU |
|---|---|---|---|---|
| Batch size=1 performance (tokens/sec) | 62,500 | ~800 | ~1,500 | 80,000+ |
| Batch size=32 performance (tokens/sec) | Not disclosed | ~10,000 | ~18,000 | ~40,000 |
| Programmability | ❌ Non-programmable | ✅ CUDA | ✅ CUDA | ❌ Limited programming |
| Model architecture support | Only Transformer | All architectures | All architectures | Transformer + partial |
| Ecosystem | Proprietary compiler | CUDA full ecosystem | CUDA full ecosystem | Limited ecosystem |
| Market status | Not yet available | ✅ Available | ✅ Available | ✅ Available |
Company Background & Funding
| Item | Details |
|---|---|
| Company Name | Etched AI |
| Founding Date | 2022 |
| Headquarters | California, USA |
| Total Funding | Nearly $1 billion |
| Latest Funding | $500 million (valuation $5 billion) |
| Investors | Not publicly disclosed (prominent VCs) |
| 2nd-Gen Product | More advanced process, targeting inference + prefill-heavy training |
2nd-Generation Product Roadmap
- More advanced process: Targeting TSMC 3nm or a more advanced node
- Expanded functionality: Targeting simultaneous support for inference and prefill-heavy training workloads
- Smaller, lower-power version: Targeting edge inference scenarios
- Release date: 2027-2028 (estimated)
Launch Date & Availability
- Official Announcement: June 2024
- Current Status: Not publicly available for sale (as of April 2026)
- Availability:
  - Only demonstrated controlled benchmarks to investors
  - No public cloud rental channels
  - No public pricing information
- Expected availability: H2 2026 or 2027 (estimated)
External Links