Phenomenon and Business Essence

An engineer used two RTX PRO 6000 Blackwell GPUs (96GB VRAM each, market price ~35,000 RMB per unit) to run the Qwen3.5-122B model on a local server at 198 tok/s. Three verification rounds measured 197, 200, and 198 tok/s, cross-checked with a curl request that generated 2000 tokens in 12.7 seconds. The key figure: total hardware cost of roughly 150,000-200,000 RMB can replicate AI API services that cost tens of thousands of RMB per month. The cost moat protecting the per-call cloud AI rental model is collapsing.
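
The curl check amounts to timing one long generation and dividing output tokens by wall-clock seconds. Below is a minimal sketch of the same measurement in Python, assuming an OpenAI-compatible endpoint on localhost:30000 (SGLang's default port) and a placeholder model name; both are assumptions to adjust for your own setup.

```python
# Minimal throughput cross-check against a local OpenAI-compatible endpoint.
# Assumptions: the server (e.g. SGLang) listens on localhost:30000 and the
# model name below matches whatever was passed at launch -- adjust both.
import time
import requests

URL = "http://localhost:30000/v1/chat/completions"  # assumed SGLang default port
MODEL = "qwen-local"                                 # hypothetical served model name

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Write a long essay about container shipping."}],
    "max_tokens": 2000,
    "temperature": 0.7,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.time() - start
resp.raise_for_status()

# OpenAI-compatible responses report generated tokens in usage.completion_tokens.
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.0f} tok/s")
```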

Cross-Industry Analogy: Containers Annihilating Breakbulk Docks

Before Malcom McLean introduced the shipping container in 1956, breakbulk stevedores monopolized port profits through information asymmetry and operational barriers. After containerization, handling costs fell from $5.83 per ton to $0.16, and breakbulk docks vanished within a decade.

Today's logic is identical: cloud AI providers (Alibaba Cloud, Baidu AI Cloud, Azure) build billing moats on opaque compute and per-token pricing. This verification shows that combining PCIe topology optimization, SGLang's b12x MoE kernel, and NEXTN speculative decoding can boost inference speed by over 65%; the "container cranes" that once had to be rented are becoming standardized equipment anyone can buy. The analogy holds at its core: technical barriers turn into engineering manuals, and pricing power transfers with them.
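
The 65%+ figure reads as the product of several independent gains rather than a single trick. A back-of-the-envelope sketch: the 18% topology gain is the figure quoted in the community discussion below, while the kernel and speculative-decoding factors are hypothetical placeholders chosen only to illustrate how such gains compound.

```python
# Back-of-the-envelope: independent inference optimizations multiply.
# Only the 18% PCIe-topology figure comes from the article; the kernel and
# speculative-decoding factors are hypothetical placeholders for illustration.
optimizations = {
    "PCIe switch topology (vs. CPU root complex)": 1.18,  # from the quoted benchmark
    "fused MoE kernel": 1.15,                              # hypothetical
    "speculative decoding acceptance gain": 1.22,          # hypothetical
}

total = 1.0
for name, factor in optimizations.items():
    total *= factor
    print(f"{name:48s} x{factor:.2f} -> cumulative x{total:.2f}")

print(f"Combined speedup: {(total - 1) * 100:.0f}%")  # ~66% with these placeholders
```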

Industry Shakeout and Endgame Projection

Using Andrew Grove's "strategic inflection point" framework:

  • 12-month eliminations: "AI integrators" that are pure API-call resellers, with no proprietary compute and no model-tuning capability, earning only arbitrage margins. Once customers realize self-built costs fall below annual fees, contracts won't renew.
  • 18-24 month pressure zone: Small and mid-sized cloud AI API providers. Large customers (annual API spend exceeding 500,000 RMB) will be the first to migrate to local deployment, leaving only zero-ops tail customers; ARPU will plummet.
  • Winners: Two types. ① System integrators offering turnkey local deployment services (hardware + tuning + ops for a one-time fee); ② vertical-scenario model fine-tuning providers (generic models become commodities; proprietary data becomes the value).
  • Endgame: Within 3 years, over 70% of enterprises with daily call volumes above 100,000 will migrate to local or hybrid architecture, and the cloud AI API market shifts from a growth market to zero-sum competition.

The Boss's Two Exit Paths

Path One (Defensive): Lock in existing cloud API contracts while launching an in-house evaluation in parallel. Immediately request a 12-month pricing commitment letter from current providers; at the same time, assign one Linux-savvy engineer to complete a local deployment POC (proof of concept) within a month, with rented test hardware costing roughly 30,000-50,000 RMB. Then decide with data.
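
"Decide with data" here means a break-even comparison. A rough sketch using the figures cited in this article (150,000-200,000 RMB for hardware, API spend in the tens of thousands of RMB per month); the power, ops, and amortization numbers are placeholder assumptions, not data from the source.

```python
# Rough break-even: self-hosted dual-GPU server vs. per-token cloud API spend.
# Hardware and API figures come from the article; power, ops and amortization
# period are placeholder assumptions -- substitute your own numbers.
hardware_cost = 180_000          # RMB, one-time (mid-point of 150k-200k)
amortization_years = 3           # assumed useful life
power_and_ops_monthly = 4_000    # RMB/month, assumed electricity + maintenance
api_spend_monthly = 40_000       # RMB/month, "tens of thousands" per the article

self_hosted_monthly = hardware_cost / (amortization_years * 12) + power_and_ops_monthly
payback_months = hardware_cost / (api_spend_monthly - power_and_ops_monthly)

print(f"Self-hosted effective cost: {self_hosted_monthly:,.0f} RMB/month")
print(f"Cloud API spend:            {api_spend_monthly:,.0f} RMB/month")
print(f"Hardware payback period:    {payback_months:.1f} months")
```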

Path Two (Offensive): Package local inference capability as a service and sell it back to industry peers. Procure one dual-GPU server for 150,000-200,000 RMB and offer private AI inference rental to small and medium enterprises in the same industry; your marginal cost approaches zero while they still pay per token. Step one: quote five industry peers to test market acceptance, which requires no additional investment.

Community Discussion

"PCI topology (GPUs directly connected via PCIe switch) runs 18% faster than routing through CPU root complex — for MoE models, sync latency matters more than bandwidth, since each forward pass only activates ~10B parameters, making message packets extremely small." — u/Visual-Synthesizer

"397B GGUF model fully loaded into 192GB memory achieving 79 tok/s — that's the real surprise. It means ultra-large parameter models can now run locally without data center-grade hardware." — u/LocalLLaMA community user