A Reddit test shows an ASUS Spark cluster, at roughly a third the price and a quarter the power draw, runs LLMs less than 5x slower than a $20,000 RTX 6000 setup. The AI inference cost-efficiency inflection point has arrived.

What this is

A developer applied 4-bit quantization (a technique that compresses model weights and reduces compute requirements) to MiniMax-M2.7 (an open-source Chinese-English bilingual LLM) and ran it on two hardware setups: one with two NVIDIA RTX 6000 GPUs (~$20,000, 1450W power draw), the other with two ASUS Ascent GX10 ("Spark") units (~$7,000, 365W power draw). The raw speed numbers favored the expensive setup: the RTX 6000 was 2.7x faster at prompt processing and 4.88x faster at text generation. But factor in the 2.9x price gap and the 4x power gap, and the Spark cluster's cost-efficiency becomes highly attractive: measured as energy consumed per 1 million tokens generated, the two setups are nearly tied. That said, the Spark's 100W idle draw is on the high side, and under high-concurrency requests its performance drops sharply because of limited KV-cache (the memory that caches context during LLM inference) capacity.
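To see why "nearly tied" holds, here is a minimal back-of-the-envelope sketch in Python. The power figures and the 4.88x generation-speed ratio come from the test above; the absolute Spark throughput is a hypothetical placeholder, and the final ratio does not depend on it.

```python
# Back-of-the-envelope check of the "nearly tied" claim:
# energy per token = power draw / generation throughput.
# Power figures and the 4.88x speed ratio come from the test;
# the absolute Spark throughput (tok/s) is an ASSUMED placeholder.

SPARK_POWER_W = 365        # dual ASUS Ascent GX10, from the test
RTX_POWER_W = 1450         # dual RTX 6000, from the test
SPEED_RATIO = 4.88         # RTX generation speed vs. Spark, from the test

spark_tok_s = 20.0                      # assumed throughput, illustration only
rtx_tok_s = spark_tok_s * SPEED_RATIO   # implied RTX throughput

def wh_per_million_tokens(power_w: float, tok_s: float) -> float:
    """Watt-hours consumed to generate 1M tokens at steady state."""
    seconds = 1_000_000 / tok_s
    return power_w * seconds / 3600

spark_wh = wh_per_million_tokens(SPARK_POWER_W, spark_tok_s)
rtx_wh = wh_per_million_tokens(RTX_POWER_W, rtx_tok_s)

print(f"Spark:    {spark_wh:,.0f} Wh per 1M tokens")
print(f"RTX 6000: {rtx_wh:,.0f} Wh per 1M tokens")
print(f"ratio (RTX/Spark): {rtx_wh / spark_wh:.2f}")  # ~0.81, independent of the assumed tok/s
```

In other words, the RTX setup generates tokens about 4.9x faster but draws about 4x the power, so its energy per million tokens lands within roughly 20% of the Spark's.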

Industry view

This test confirms a judgment of ours: AI inference hardware is diverging at an accelerating pace. For batch workloads that don't demand hard real-time performance, expensive high-end GPUs are no longer the only option; cheap accelerators suffice. But the risks are just as plain: cheap hardware struggles in real business scenarios with high concurrency and long contexts. In the test, when the Spark cluster processed two long-text requests in parallel, its tight KV-cache capacity forced request throttling and performance collapsed. An enterprise that picks low-spec hardware to save money may end up with a degraded user experience, where response latency and throughput bottlenecks offset the hardware savings.
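To make the concurrency bottleneck concrete, here is a rough KV-cache sizing sketch. The layer count, head count, and head dimension below are illustrative assumptions, not MiniMax-M2.7's actual configuration; the point is only how the footprint scales with context length and concurrent requests.

```python
# Rough KV-cache sizing, to show why concurrency collides with memory.
# Per token, a transformer caches one K and one V vector per layer:
#   bytes/token = 2 * layers * kv_heads * head_dim * bytes_per_elem
# The model dimensions below are ILLUSTRATIVE, not MiniMax-M2.7's real config.

LAYERS = 60          # assumed layer count
KV_HEADS = 8         # assumed KV heads (grouped-query attention)
HEAD_DIM = 128       # assumed head dimension
BYTES_PER_ELEM = 2   # fp16/bf16 cache entries

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

def kv_cache_gib(context_len: int, concurrent_requests: int) -> float:
    """KV-cache footprint for N concurrent requests at a given context length."""
    return bytes_per_token * context_len * concurrent_requests / 2**30

print(f"{bytes_per_token / 1024:.0f} KiB per token")
print(f"1 x 64K-token context: {kv_cache_gib(65_536, 1):.1f} GiB")
print(f"2 x 64K-token context: {kv_cache_gib(65_536, 2):.1f} GiB")
```

Each additional concurrent long-context request adds a full cache footprint on top of the model weights, which is why the second parallel request in the test was enough to force throttling on the memory-constrained Spark cluster.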

Impact on regular people

For enterprise IT: NVIDIA is no longer the only option for deploying internal AI tools. Low-concurrency, non-real-time scenarios such as internal knowledge bases can save substantial budget with cheaper hardware.

For individual careers: the barrier to running LLMs locally is dropping fast. A cluster around $7,000 (roughly 50,000 RMB) can already run top-tier open-source models, giving individual developers far more room to experiment.

For the consumer market: desktop-class AI machines will gain more chip choices, and manufacturers will prioritize power consumption and thermal management over raw compute stacking.