A Reddit test shows an ASUS Spark cluster, at roughly a third the price and a quarter the power draw, runs LLMs less than 5x slower than a $20,000 RTX 6000 setup. The AI inference cost-efficiency inflection point has arrived.

What this is

A developer applied 4-bit quantization (a technique that compresses model weights and reduces compute requirements) to MiniMax-M2.7 (an open-source Chinese-English bilingual LLM) and ran it on two hardware setups: one with two NVIDIA RTX 6000 GPUs (~$20,000, 1450W power draw), the other with two ASUS Ascent GX10 ("Spark") units (~$7,000, 365W power draw). The raw speed numbers favored the expensive setup: the RTX 6000 was 2.7x faster at prompt processing and 4.88x faster at text generation. But factor in the 2.9x price gap and the 4x power gap, and the Spark cluster's cost-efficiency becomes highly attractive: measured as energy consumed per 1 million tokens generated, the two setups are nearly tied. That said, the Spark's 100W idle draw is on the high side, and under high-concurrency requests its performance drops sharply because of limited KV-cache (the memory that caches context during LLM inference) capacity.
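To see why "nearly tied" holds, here is a minimal back-of-the-envelope sketch in Python. The power figures and the 4.88x generation-speed ratio come from the test above; the absolute Spark throughput is a hypothetical placeholder, and the final ratio does not depend on it.

```python
# Back-of-the-envelope check of the "nearly tied" claim:
# energy per token = power draw / generation throughput.
# Power figures and the 4.88x speed ratio come from the test;
# the absolute Spark throughput (tok/s) is an ASSUMED placeholder.

SPARK_POWER_W = 365        # dual ASUS Ascent GX10, from the test
RTX_POWER_W = 1450         # dual RTX 6000, from the test
SPEED_RATIO = 4.88         # RTX generation speed vs. Spark, from the test

spark_tok_s = 20.0                      # assumed throughput, illustration only
rtx_tok_s = spark_tok_s * SPEED_RATIO   # implied RTX throughput

def wh_per_million_tokens(power_w: float, tok_s: float) -> float:
    """Watt-hours consumed to generate 1M tokens at steady state."""
    seconds = 1_000_000 / tok_s
    return power_w * seconds / 3600

spark_wh = wh_per_million_tokens(SPARK_POWER_W, spark_tok_s)
rtx_wh = wh_per_million_tokens(RTX_POWER_W, rtx_tok_s)

print(f"Spark:    {spark_wh:,.0f} Wh per 1M tokens")
print(f"RTX 6000: {rtx_wh:,.0f} Wh per 1M tokens")
print(f"ratio (RTX/Spark): {rtx_wh / spark_wh:.2f}")  # ~0.81, independent of the assumed tok/s
```

In other words, the RTX setup generates tokens about 4.9x faster but draws about 4x the power, so its energy per million tokens lands within roughly 20% of the Spark's.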

Industry view

This test confirms a judgment of ours: AI inference hardware is diverging at an accelerating pace. For batch workloads that don't demand hard real-time performance, expensive high-end GPUs are no longer the only option; cheap accelerators suffice. But the risks are just as plain: cheap hardware struggles in real business scenarios with high concurrency and long contexts. In the test, when the Spark cluster processed two long-text requests in parallel, its tight KV-cache capacity forced request throttling and performance collapsed. An enterprise that picks low-spec hardware to save money may end up with a degraded user experience, where response latency and throughput bottlenecks offset the hardware savings.
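To make the concurrency bottleneck concrete, here is a rough KV-cache sizing sketch. The layer count, head count, and head dimension below are illustrative assumptions, not MiniMax-M2.7's actual configuration; the point is only how the footprint scales with context length and concurrent requests.

```python
# Rough KV-cache sizing, to show why concurrency collides with memory.
# Per token, a transformer caches one K and one V vector per layer:
#   bytes/token = 2 * layers * kv_heads * head_dim * bytes_per_elem
# The model dimensions below are ILLUSTRATIVE, not MiniMax-M2.7's real config.

LAYERS = 60          # assumed layer count
KV_HEADS = 8         # assumed KV heads (grouped-query attention)
HEAD_DIM = 128       # assumed head dimension
BYTES_PER_ELEM = 2   # fp16/bf16 cache entries

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

def kv_cache_gib(context_len: int, concurrent_requests: int) -> float:
    """KV-cache footprint for N concurrent requests at a given context length."""
    return bytes_per_token * context_len * concurrent_requests / 2**30

print(f"{bytes_per_token / 1024:.0f} KiB per token")
print(f"1 x 64K-token context: {kv_cache_gib(65_536, 1):.1f} GiB")
print(f"2 x 64K-token context: {kv_cache_gib(65_536, 2):.1f} GiB")
```

Each additional concurrent long-context request adds a full cache footprint on top of the model weights, which is why the second parallel request in the test was enough to force throttling on the memory-constrained Spark cluster.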

Impact on regular people

For enterprise IT: NVIDIA is no longer the only option for deploying internal AI tools. Low-concurrency, non-real-time scenarios such as internal knowledge bases can save substantial budget with cheaper hardware.

For individual careers: the barrier to running LLMs locally is dropping fast. A cluster around $7,000 (roughly 50,000 RMB) can already run top-tier open-source models, giving individual developers far more room to experiment.

For the consumer market: desktop-class AI machines will gain more chip choices, and manufacturers will prioritize power consumption and thermal management over raw compute stacking.