What Happened
A Reddit user on r/LocalLLaMA (u/herpnderpler) posted a reproducible configuration achieving approximately 40 tokens per second on the Qwen3.6-35B-A3B model using a single NVIDIA RTX 3080 with 12GB of VRAM. The result relies on llama-cpp-turboquant — a GPU-accelerated fork of llama.cpp authored by TheTom — with a custom turbo3 KV cache quantization type applied to both the K and V caches. The post has drawn attention from the GPU-poor inference community.
Why It Matters
The 35B-A3B model family from Qwen (Alibaba) uses a Mixture-of-Experts architecture with only 3 billion active parameters per forward pass, making it a practical target for consumer GPU inference. However, fitting a 260,000-token context window on 12GB of VRAM is a non-trivial memory engineering problem. The turbo3 KV cache type appears to compress the attention cache aggressively enough to make this feasible without offloading to CPU RAM — a configuration that would otherwise collapse throughput on a mid-range GPU.
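To see why aggressive cache compression matters, a back-of-envelope sizing helps. The numbers below are purely illustrative (the layer count and per-token KV width are placeholders, not the model's published architecture), but they show how quickly an unquantized FP16 KV cache outgrows a 12GB card at long context:

# Illustrative KV cache sizing; layers and kv_width are assumptions, not Qwen's specs.
# bytes = 2 (K and V) * layers * kv_width * bytes_per_element * context_tokens
layers=48; kv_width=512; ctx=262144
fp16_bytes=$(( 2 * layers * kv_width * 2 * ctx ))
q4_bytes=$(( 2 * layers * kv_width * ctx / 2 ))   # roughly 4 bits per element
echo "FP16 cache: $(( fp16_bytes / 1048576 )) MiB; ~4-bit cache: $(( q4_bytes / 1048576 )) MiB"

Even under these rough assumptions, the full-precision cache alone would exceed the card several times over, which is the gap an aggressive KV cache type has to close.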
For engineering teams evaluating local LLM deployment — particularly those running workloads on developer workstations or edge servers with consumer-grade GPUs — a validated configuration that unlocks long-context inference at this price point is operationally significant. A single RTX 3080 card currently trades on the used market for well under $400.
The Multi-Stage Prompt Architecture
The user also notes that reasoning mode is disabled (--reasoning off), opting instead for a manual four-stage prompt harness: ask → validate → review → refine/accept. The rationale given is that "time-to-first-acceptable-solution" is lower with explicit pipeline stages than with the model's built-in chain-of-thought loop. This is a relevant workflow note for teams building agentic pipelines where latency per turn matters more than single-pass accuracy.
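A minimal sketch of what such a harness could look like, driven against a local llama-server instance over its OpenAI-compatible chat endpoint. The stage prompts, host, and acceptance check are illustrative assumptions; the post does not publish the actual harness:

#!/usr/bin/env bash
# Hypothetical ask -> validate -> review -> refine/accept loop (not the poster's script).
# Assumes a llama-server listening locally and jq available for JSON handling.
HOST="http://127.0.0.1:8080"

chat() {
  curl -s "$HOST/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$1" '{messages:[{role:"user",content:$p}], temperature:0.6, top_p:0.95}')" \
    | jq -r '.choices[0].message.content'
}

task="Write a function that merges overlapping intervals."

draft=$(chat "$task")                                                            # stage 1: ask
defects=$(chat "List concrete defects in this solution, or say NONE: $draft")    # stage 2: validate
verdict=$(chat "Given these defects: $defects. Reply ACCEPT or REVISE only.")    # stage 3: review

if [[ "$verdict" == *ACCEPT* ]]; then                                            # stage 4: refine/accept
  echo "$draft"
else
  chat "Revise the solution to address the defects. Solution: $draft Defects: $defects"
fi

The point of the structure is that each stage is a short, bounded turn rather than one long chain-of-thought generation, which is where the claimed time-to-first-acceptable-solution win would come from.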
The Technical Detail
The full reproducible configuration requires compiling llama.cpp's turboquant fork with the following CMake flags:
cmake -B build -D GGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FORCE_MMQ=ON

The server is then launched with:
- --cache-type-k turbo3 and --cache-type-v turbo3 — the core differentiator enabling compressed KV storage
- --flash-attn on — standard Flash Attention 2 for memory-efficient attention computation
- --ctx-size 0 --fit on — dynamic context sizing that fills available VRAM rather than pre-allocating a fixed buffer
- --jinja — Jinja templating for prompt formatting
- Model source: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M via Hugging Face, a 4-bit K-quant quantization from Unsloth
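Put together, the build and launch sequence would look roughly like the following. The repository URL and binary path are assumptions for illustration; only the CMake and server flags themselves come from the post:

# Build the fork with the CUDA flags above (repo URL is hypothetical).
git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
cmake -B build -D GGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j

# Launch with the reported flags; -hf pulls the GGUF directly from Hugging Face.
./build/bin/llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M \
  --cache-type-k turbo3 --cache-type-v turbo3 \
  --flash-attn on \
  --ctx-size 0 --fit on \
  --jinja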
Sampling parameters follow Qwen3's official non-thinking mode recommendations: temperature 0.6, top-p 0.95, top-k 20, min-p 0.0, repeat-penalty 1.0, presence-penalty 0.0. The GGML_CUDA_FORCE_MMQ flag forces matrix multiplication quantization kernels on CUDA, which typically favors throughput over latency on mid-range GPUs.
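Assuming the fork keeps upstream llama-server's request schema, those sampling values can also be pinned per request rather than at launch; a sketch against the native /completion endpoint (host and prompt are illustrative):

curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Summarize the tradeoffs of KV cache quantization.",
        "n_predict": 512,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "repeat_penalty": 1.0,
        "presence_penalty": 0.0
      }'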
The turbo3 cache type is not part of upstream llama.cpp as of this writing. It is specific to TheTom's turboquant fork, and the quantization method's internals are not detailed in the source post. Engineers attempting to replicate this should treat it as an experimental build path, not a stable production dependency.
What To Watch
- Upstream merge status: Watch whether turbo3 or equivalent aggressive KV quantization gets proposed for upstream llama.cpp. The ggml project has previously merged community quantization schemes after community validation — a PR or RFC would be a strong signal.
- Unsloth GGUF updates: The UD-Q4_K_M variant from Unsloth is the quantization used here. Any updated quants targeting lower VRAM footprint for the 35B-A3B family could further extend context window feasibility on 8GB cards.
- Qwen3 model releases: Alibaba's Qwen team has been releasing model variants on a short cadence. A higher-density MoE release or a revised 35B architecture could shift these benchmarks materially within 30 days.
- Community replication: The r/LocalLLaMA thread will likely produce independent throughput measurements across different GPU SKUs (3070 Ti, 4070, RX 7900 GRE). Cross-GPU data would validate whether the gains are CUDA-architecture-specific or broadly applicable.