What Happened

A developer running llama.cpp on an Apple M4 Pro with 48GB of unified memory tested TurboQuant KV cache quantization against standard q4_0 baselines on Gemma 4 26B A4B-it Q4_K_M. Using QJL with a Fast Walsh-Hadamard Transform (FWHT) as a structured rotation on attention heads (dk=256/512), the tq3j/q4_0 configuration scored 37/37 on quality tests and 8/8 on Needle-in-a-Haystack (NIAH) retrieval, with compression reaching approximately 3.1 bits per K channel. The tq2j/q4_0 variant scored 36/37, its single miss being an empty response. Both configurations delivered a 34% speed increase over q4_0/q4_0 at 131K context, and TurboQuant outperformed q4_0 from 4K context onward.
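
For readers new to the rotation trick: QJL-style schemes multiply each K head by an orthogonal transform before quantizing, so outlier channels get spread across the whole head dimension instead of blowing up a single quantization group, and the FWHT is a cheap O(d log d) choice of such a rotation. Below is a minimal NumPy sketch of the transform itself, illustrative only and not the llama.cpp implementation, assuming a power-of-two head dimension:

    import numpy as np

    def fwht(x):
        # Fast Walsh-Hadamard Transform over the last axis.
        # The last axis length must be a power of two (e.g. a head dim of 256).
        x = x.copy()
        n = x.shape[-1]
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                a = x[..., i:i + h].copy()
                b = x[..., i + h:i + 2 * h].copy()
                x[..., i:i + h] = a + b
                x[..., i + h:i + 2 * h] = a - b
            h *= 2
        return x / np.sqrt(n)  # orthonormal scaling: the transform acts as a rotation

    # Rotating each K head before low-bit quantization smears outlier channels
    # across the whole head, which is what makes ~3-bit K storage viable.
    k_head = np.random.randn(8, 256).astype(np.float32)  # hypothetical: 8 tokens, dk = 256
    k_rotated = fwht(k_head)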

Separately, per-layer outlier-aware adaptive K quantization on Qwen2.5 and Qwen3 models produced perplexity scores that beat current public fork implementations at comparable bits-per-value: Qwen2.5 1.5B hit 11.514 PPL versus q8_0's 11.524 at 6.21 bpv; Qwen2.5 7B reached 8.927 versus q8_0's 8.949 at 6.41 bpv; and Qwen3 8B landed within the confidence interval of both f16 and q8_0 at just 5.125 bpv.
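
The post does not spell out the calibration procedure, but the general shape of per-layer outlier-aware allocation is easy to sketch: measure how heavy-tailed each layer's K activations are on a calibration run, then keep a wider K type for outlier-prone layers while the rest drop lower. The sketch below is only an illustration of that idea; the function, threshold, and statistics are invented here and are not the author's recipe:

    import numpy as np

    def pick_k_type(k_samples, kurtosis_threshold=8.0):
        # k_samples: (tokens, channels) K activations for one layer, collected on a
        # calibration run. Layers with heavy-tailed (outlier-prone) channels keep a
        # higher-bit K type; well-behaved layers can drop to a lower-bit type.
        # The threshold is a made-up number for illustration, not from the post.
        x = k_samples - k_samples.mean(axis=0)
        var = x.var(axis=0) + 1e-8
        kurtosis = (x ** 4).mean(axis=0) / var ** 2   # per-channel kurtosis (Gaussian ~ 3)
        return "q8_0" if kurtosis.max() > kurtosis_threshold else "q4_0"

    # Hypothetical calibration tensors, one per layer; real ones would come from
    # running the model over calibration text and dumping K activations.
    layer_k = [np.random.randn(1024, 256) for _ in range(4)]
    recipe = [pick_k_type(k) for k in layer_k]
    print(recipe)  # one chosen K type per layer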

Why It Matters

The KV cache is the primary memory bottleneck for long-context inference on consumer hardware. Cutting the K cache from 8 bits to roughly 3 bits without measurable accuracy loss means indie developers and small teams can run 131K-context sessions on a single 48GB Apple Silicon machine. The per-layer outlier-aware approach also suggests that much of the quality gap in existing quantization schemes comes from uniform bit allocation rather than from the base quantizer itself, which opens a practical path to better compression without new hardware.

  • 34% throughput gain at 131K context is directly usable in RAG pipelines and document processing applications.
  • Qwen3 8B at 5.125 bpv within f16 CI means smaller teams can deploy competitive models on budget GPU instances.
  • Per-layer variance analysis on Gemma 4 indicates mixed K-type recipes per layer could push results further.
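
To put the 48GB / 131K-context claim in rough numbers, a back-of-the-envelope KV cache estimate helps. The architecture values below are placeholders rather than Gemma's actual configuration; substitute the layer count, KV head count, and head dimension from your model's GGUF metadata:

    def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, k_bits, v_bits):
        # Rough KV cache size in GiB, ignoring per-block scale/zero-point overhead.
        per_token_bits = n_layers * n_kv_heads * head_dim * (k_bits + v_bits)
        return n_ctx * per_token_bits / 8 / 1024 ** 3

    # Placeholder architecture (48 layers, 8 KV heads, head_dim 256); read the real
    # values from your model's GGUF metadata before trusting the output.
    shape = dict(n_ctx=131_072, n_layers=48, n_kv_heads=8, head_dim=256)
    print("f16 K/V:       %5.1f GiB" % kv_cache_gib(**shape, k_bits=16, v_bits=16))
    print("q8_0 K/V:      %5.1f GiB" % kv_cache_gib(**shape, k_bits=8, v_bits=8))
    print("~3.1b K, 4b V: %5.1f GiB" % kv_cache_gib(**shape, k_bits=3.1, v_bits=4))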

Asia-Pacific Angle

Qwen2.5 and Qwen3 are the dominant open-weight models for Chinese-language tasks and are widely deployed by Southeast Asian teams building multilingual products. These PPL results confirm that per-layer outlier-aware quantization preserves Qwen model quality at significantly lower memory cost, making local on-premise deployment more viable for teams in markets where cloud API costs are high or data residency requirements apply. Developers in China, Vietnam, and Indonesia building document-heavy or long-context applications on Qwen can directly apply this calibration approach in llama.cpp today.

Action Item This Week

Pull the current llama.cpp main branch, run your Qwen2.5 7B or Qwen3 8B model with the existing K quantization options, and benchmark PPL using the wikitext-2 dataset. Record your baseline bpv and PPL score so you have a concrete reference point when TurboQuant per-layer calibration patches land in the public fork or main branch.
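
A minimal harness for that baseline run is sketched below. It assumes a llama.cpp build in the working directory, the standard wikitext-2-raw test file, and the --cache-type-k flag and "PPL =" output format of current llama.cpp builds; verify both against your checkout, and adjust the model path, context size, and -ngl for your hardware:

    import re
    import subprocess

    def wikitext_ppl(model_path, k_type, n_ctx=8192):
        cmd = [
            "./llama-perplexity",
            "-m", model_path,
            "-f", "wikitext-2-raw/wiki.test.raw",
            "-c", str(n_ctx),
            "-ngl", "99",                  # offload all layers; drop for CPU-only runs
            "--cache-type-k", k_type,      # K cache type: f16, q8_0, q4_0, ...
        ]
        run = subprocess.run(cmd, capture_output=True, text=True)
        # Current builds report "Final estimate: PPL = <value> +/- <error>";
        # check your build's output if this regex comes back empty.
        match = re.search(r"PPL\s*=\s*([0-9.]+)", run.stdout + run.stderr)
        return float(match.group(1)) if match else None

    model = "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf"   # hypothetical path
    for k_type in ("f16", "q8_0", "q4_0"):
        print(k_type, wikitext_ppl(model, k_type))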