A developer shared a striking number this week: burning through 89 million tokens in a day cost just 4.39 RMB. Behind that figure is not a simple price war, but DeepSeek using disk caching to reshape the pricing logic of LLM inference.
What this is
When an LLM generates text, every new token must attend to all of the preceding context. KV Cache (Key-Value Cache: storing the intermediate attention keys and values so they can be reused instead of recomputed) is a standard optimization, but traditional setups can only keep this cache in expensive GPU VRAM, making it impossible to share across different users.
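To make the mechanism concrete, here is a minimal, self-contained sketch of single-head attention with a KV cache (toy NumPy code, not DeepSeek's implementation): each token's key and value vectors are computed once, appended to the cache, and reused by every later step instead of being recomputed.

```python
import numpy as np

def attention_step(q, K_cache, V_cache):
    """Attend a single new query against all cached keys/values."""
    scores = q @ K_cache.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

def generate_with_kv_cache(token_embeddings, Wq, Wk, Wv):
    """Process tokens one by one, reusing cached K/V instead of
    recomputing them for the whole prefix at every step."""
    K_cache, V_cache, outputs = [], [], []
    for x in token_embeddings:          # x: embedding of one token
        K_cache.append(x @ Wk)          # computed once, then reused forever
        V_cache.append(x @ Wv)
        q = x @ Wq
        out = attention_step(q, np.stack(K_cache), np.stack(V_cache))
        outputs.append(out)
    return np.stack(outputs), np.stack(K_cache), np.stack(V_cache)

# Toy usage: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
outs, K, V = generate_with_kv_cache(tokens, Wq, Wk, Wv)
print(outs.shape, K.shape)  # (4, 8) (4, 8)
```

The K and V arrays returned here are exactly the data that grows with context length, which is why where they live (VRAM versus disk) drives the economics.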
DeepSeek's move: architectural changes that compress the cache by 5-13x, small enough to live on cheap disks. As long as two requests share a prefix (a common system prompt, say), the cached results can be read straight from disk, skipping the redundant computation. On cache hits, API prices drop by 90%, and time-to-first-token for a 128K-token context shrinks from 13 seconds to 500 milliseconds.
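The prefix matching can be pictured as block-level hashing of the token sequence: hash the prompt cumulatively in fixed-size blocks, and any request whose leading blocks hash to entries already on disk can skip recomputing those tokens. The sketch below is a conceptual illustration only; the block size, hashing scheme, and dict-as-disk storage are assumptions for the example, not DeepSeek's actual on-disk format.

```python
import hashlib

BLOCK = 64  # tokens per cache block (illustrative granularity, an assumption)

def block_hashes(token_ids):
    """Hash the prompt cumulatively, block by block: two requests that share
    the same leading tokens produce the same leading hashes."""
    hashes, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

disk_cache = {}  # hash -> KV block; a dict stands in for disk storage here

def store_prefix(token_ids, kv_blocks):
    for hh, kv in zip(block_hashes(token_ids), kv_blocks):
        disk_cache[hh] = kv

def cached_prefix_tokens(token_ids):
    """How many leading tokens can skip recomputation on this request."""
    hit_blocks = 0
    for hh in block_hashes(token_ids):
        if hh not in disk_cache:
            break
        hit_blocks += 1
    return hit_blocks * BLOCK

# Two requests that share the same system prompt (same leading token ids)
system_prompt_ids = list(range(200))                  # pretend token ids
request_a = system_prompt_ids + [1001, 1002, 1003]
request_b = system_prompt_ids + [2001, 2002]

store_prefix(request_a, kv_blocks=["kv"] * len(block_hashes(request_a)))
print(cached_prefix_tokens(request_b))                # 192: three full blocks hit
```

Because the hashing is cumulative, the match always runs from the start of the prompt: the moment one block differs, everything after it misses.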
Industry view
We note that this marks a shift in LLM competition from training compute to inference engineering. DeepSeek's trade of disk for VRAM sets a new cost baseline for the industry. However, we must point out that the mechanism leans heavily on "prefix consistency": change even a single word early in the system prompt and the cache for everything after it is invalidated.
Some development teams point out that this fragility makes enterprise bills hard to control: once a business needs frequent prompt tweaks, the hit rate plummets and costs snap back up. Furthermore, disk read/write latency under high concurrency is less predictable than VRAM, a lingering risk for enterprise applications that promise strict response times.
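A back-of-the-envelope calculation shows why the hit rate dominates the bill. The relative prices below are illustrative assumptions only, taking a cache-hit input token at 10% of the cache-miss price per the roughly 90% discount described above:

```python
# Illustrative relative prices (assumptions), not actual published pricing.
MISS_PRICE = 1.0   # relative cost per input token on a cache miss
HIT_PRICE = 0.1    # relative cost per input token on a cache hit

def blended_input_cost(tokens, hit_rate):
    hits = tokens * hit_rate
    misses = tokens - hits
    return hits * HIT_PRICE + misses * MISS_PRICE

daily_tokens = 89_000_000
baseline = blended_input_cost(daily_tokens, 0.9)   # the "everything matches" case
for rate in (0.9, 0.5, 0.1):
    ratio = blended_input_cost(daily_tokens, rate) / baseline
    print(f"hit rate {rate:.0%}: {ratio:.1f}x the baseline bill")
```

In this toy model, the same daily volume costs nearly five times more at a 10% hit rate than at 90%, which is exactly the cost rebound those teams worry about.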
Impact on regular people
For enterprise IT: API costs for integrating LLMs drop significantly, but request structures have to be redesigned, forcing developers to freeze system prompts and the layout of conversation history to keep hit rates up (see the sketch after this list).
For individual careers: The bar for writing prompts is rising. They must not only be correct but also "positionally stable," and knowing how to keep prefixes consistent will become a basic skill in operating AI applications.
For the consumer market: The sudden drop in inference costs means consumer applications relying on ultra-long contexts (like long-term memory companions, long-document analysis) finally have a sustainable business model that doesn't burn cash.
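For the enterprise integration point above, the practical discipline is simple: everything static stays byte-identical at the front of the request, and anything that varies per request goes at the end. A minimal sketch, assuming an OpenAI-compatible chat-completions interface; the model name, prompt text, and client setup are assumptions for illustration, not taken from the article:

```python
# Assumed OpenAI-compatible message format; prompt contents are hypothetical.
SYSTEM_PROMPT = "You are a support assistant for ExampleCorp."   # never edited per request
FEW_SHOT = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security > Reset."},
]

def build_messages(history, new_user_turn):
    # BAD (breaks prefix matching): f"{SYSTEM_PROMPT} Today is {today}".
    # Any per-request value in the system prompt changes the very first tokens.
    # GOOD: everything that varies goes at the end of the message list.
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + FEW_SHOT
        + history                      # append-only; never reorder or rewrite old turns
        + [{"role": "user", "content": new_user_turn}]
    )

history = []
messages = build_messages(history, "My invoice is missing, what should I do?")
print(len(messages), "messages; static prefix first, variable turn last")
# resp = client.chat.completions.create(model="deepseek-chat", messages=messages)
```

The commented-out call at the bottom is where a real request would go; the point is that cache hits depend only on the prefix staying byte-identical from one request to the next, not on any special flag in the call.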