What Happened
A Reddit user on r/LocalLLaMA (u/herpnderpler) posted a reproducible configuration achieving approximately 40 tokens per second on the Qwen3.6-35B-A3B model using a single NVIDIA RTX 3080 with 12GB of VRAM. The result relies on llama-cpp-turboquant — a GPU-accelerated fork of llama.cpp authored by TheTom — with a custom turbo3 KV cache quantization type applied to both the K and V caches. The post has drawn attention from the GPU-poor inference community.
Why It Matters
The 35B-A3B model family from Qwen (Alibaba) uses a Mixture-of-Experts architecture with only 3 billion active parameters per forward pass, making it a practical target for consumer GPU inference. However, fitting a 260,000-token context window on 12GB of VRAM is a non-trivial memory engineering problem. The turbo3 KV cache type appears to compress the attention cache aggressively enough to make this feasible without offloading to CPU RAM — a configuration that would otherwise collapse throughput on a mid-range GPU.
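To see why aggressive cache compression matters, a back-of-envelope sizing helps. The numbers below are purely illustrative (the layer count and per-token KV width are placeholders, not the model's published architecture), but they show how quickly an unquantized FP16 KV cache outgrows a 12GB card at long context:

# Illustrative KV cache sizing; layers and kv_width are assumptions, not Qwen's specs.
# bytes = 2 (K and V) * layers * kv_width * bytes_per_element * context_tokens
layers=48; kv_width=512; ctx=262144
fp16_bytes=$(( 2 * layers * kv_width * 2 * ctx ))
q4_bytes=$(( 2 * layers * kv_width * ctx / 2 ))   # roughly 4 bits per element
echo "FP16 cache: $(( fp16_bytes / 1048576 )) MiB; ~4-bit cache: $(( q4_bytes / 1048576 )) MiB"

Even under these rough assumptions, the full-precision cache alone would exceed the card several times over, which is the gap an aggressive KV cache type has to close.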
For engineering teams evaluating local LLM deployment — particularly those running workloads on developer workstations or edge servers with consumer-grade GPUs — a validated configuration that unlocks long-context inference at this price point is operationally significant. A single RTX 3080 card currently trades on the used market for well under $400.
The Multi-Stage Prompt Architecture
The user also notes that reasoning mode is disabled (--reasoning off), opting instead for a manual four-stage prompt harness: ask → validate → review → refine/accept. The rationale given is that "time-to-first-acceptable-solution" is lower with explicit pipeline stages than with the model's built-in chain-of-thought loop. This is a relevant workflow note for teams building agentic pipelines where latency per turn matters more than single-pass accuracy.
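A minimal sketch of what such a harness could look like, driven against a local llama-server instance over its OpenAI-compatible chat endpoint. The stage prompts, host, and acceptance check are illustrative assumptions; the post does not publish the actual harness:

#!/usr/bin/env bash
# Hypothetical ask -> validate -> review -> refine/accept loop (not the poster's script).
# Assumes a llama-server listening locally and jq available for JSON handling.
HOST="http://127.0.0.1:8080"

chat() {
  curl -s "$HOST/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$1" '{messages:[{role:"user",content:$p}], temperature:0.6, top_p:0.95}')" \
    | jq -r '.choices[0].message.content'
}

task="Write a function that merges overlapping intervals."

draft=$(chat "$task")                                                            # stage 1: ask
defects=$(chat "List concrete defects in this solution, or say NONE: $draft")    # stage 2: validate
verdict=$(chat "Given these defects: $defects. Reply ACCEPT or REVISE only.")    # stage 3: review

if [[ "$verdict" == *ACCEPT* ]]; then                                            # stage 4: refine/accept
  echo "$draft"
else
  chat "Revise the solution to address the defects. Solution: $draft Defects: $defects"
fi

The point of the structure is that each stage is a short, bounded turn rather than one long chain-of-thought generation, which is where the claimed time-to-first-acceptable-solution win would come from.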
The Technical Detail
The full reproducible configuration requires compiling llama.cpp's turboquant fork with the following CMake flags:
cmake -B build -D GGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FORCE_MMQ=ON

The server is then launched with:
- --cache-type-k turbo3 and --cache-type-v turbo3 — the core differentiator enabling compressed KV storage
- --flash-attn on — standard Flash Attention 2 for memory-efficient attention computation
- --ctx-size 0 --fit on — dynamic context sizing that fills available VRAM rather than pre-allocating a fixed buffer
- --jinja — Jinja templating for prompt formatting
- Model source: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M via Hugging Face, a 4-bit K-quant quantization from Unsloth
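Put together, the build and launch sequence would look roughly like the following. The repository URL and binary path are assumptions for illustration; only the CMake and server flags themselves come from the post:

# Build the fork with the CUDA flags above (repo URL is hypothetical).
git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
cmake -B build -D GGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j

# Launch with the reported flags; -hf pulls the GGUF directly from Hugging Face.
./build/bin/llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  --flash-attn on \
  --ctx-size 0 --fit on \
  --jinja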
Sampling parameters follow Qwen3's official non-thinking mode recommendations: temperature 0.6, top-p 0.95, top-k 20, min-p 0.0, repeat-penalty 1.0, presence-penalty 0.0. The GGML_CUDA_FORCE_MMQ flag forces matrix multiplication quantization kernels on CUDA, which typically favors throughput over latency on mid-range GPUs.
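Assuming the fork keeps upstream llama-server's request schema, those sampling values can also be pinned per request rather than at launch; a sketch against the native /completion endpoint (host and prompt are illustrative):

curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Summarize the tradeoffs of KV cache quantization.",
        "n_predict": 512,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "repeat_penalty": 1.0,
        "presence_penalty": 0.0
      }'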
The turbo3 cache type is not part of upstream llama.cpp as of this writing. It is specific to TheTom's turboquant fork, and the quantization method's internals are not detailed in the source post. Engineers attempting to replicate this should treat it as an experimental build path, not a stable production dependency.
What To Watch
- Upstream merge status: Watch whether turbo3 or equivalent aggressive KV quantization gets proposed for upstream llama.cpp. The ggml project has previously merged community quantization schemes after community validation — a PR or RFC would be a strong signal.
- Unsloth GGUF updates: The UD-Q4_K_M variant from Unsloth is the quantization used here. Any updated quants targeting lower VRAM footprint for the 35B-A3B family could further extend context window feasibility on 8GB cards.
- Qwen3 model releases: Alibaba's Qwen team has been releasing model variants on a short cadence. A higher-density MoE release or a revised 35B architecture could shift these benchmarks materially within 30 days.
- Community replication: The r/LocalLLaMA thread will likely produce independent throughput measurements across different GPU SKUs (3070 Ti, 4070, RX 7900 GRE). Cross-GPU data would validate whether the gains are CUDA-architecture-specific or broadly applicable.