What Happened

Unsloth, the open-source quantization team, published KLD (Kullback-Leibler divergence) benchmarks for Qwen3.6-35B-A3B GGUF quantizations in June 2025, claiming their quants reach the Pareto frontier (best quality per disk byte) in 21 out of 22 test cases. The full GGUF model set is available at huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF. The announcement post on r/LocalLLaMA has accumulated 229 upvotes and 57 comments as of publication.

The post doubles as a transparency report addressing community complaints about Unsloth's frequent re-uploads, attributing roughly 95% of the root causes to external factors, including upstream llama.cpp bugs and official model template changes from vendors.

Why It Matters

For engineers running local inference, quant selection directly determines whether a model fits in VRAM and how much quality degrades at a given size. KLD benchmarking, which measures divergence from the full-precision model's output distribution, is a more principled quality signal than perplexity alone. If Unsloth's Pareto dominance holds under independent review, it shifts the default recommendation for Qwen3.6-35B deployments away from competing providers like Bartowski.
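
For intuition, here is a minimal sketch of the metric, assuming you have already captured per-token logits from both models on the same evaluation text (the array names are illustrative):

```python
import numpy as np

def kld_per_token(logits_fp: np.ndarray, logits_q: np.ndarray) -> np.ndarray:
    """Per-token D_KL(P_fp || P_q) from raw logits.

    Both arrays are (n_tokens, vocab_size), scored on the same text:
    logits_fp from the full-precision reference, logits_q from the quant.
    """
    def log_softmax(x: np.ndarray) -> np.ndarray:
        # Numerically stable log-softmax over the vocab axis.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(logits_fp)
    log_q = log_softmax(logits_q)
    p = np.exp(log_p)
    # D_KL(P || Q) = sum over vocab of p * (log p - log q), per position.
    return (p * (log_p - log_q)).sum(axis=-1)

# Lower mean KLD over a held-out corpus means closer to full precision:
# score = kld_per_token(fp_logits, q_logits).mean()
```

At scale, recent builds of llama.cpp's llama-perplexity tool expose the same workflow: save reference logits from the full-precision run with --kl-divergence-base, then score each quant against them with --kl-divergence.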

More broadly, the post surfaces a systemic problem in the GGUF ecosystem: quantization quality is a moving target tied to llama.cpp stability, CUDA driver versions, and upstream model releases. Teams pinning to specific quants without tracking upstream fixes risk silent quality regressions.
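
Pinning at least makes that trade-off explicit. A minimal sketch with huggingface_hub, assuming a GGUF filename and commit hash (both illustrative here), locks a deployment to one exact upload so a later re-upload cannot silently swap the weights:

```python
from huggingface_hub import hf_hub_download

# Pin to an exact repo revision (commit hash) so re-uploads of the same
# filename cannot silently change the weights you deploy. The filename
# and hash below are illustrative; take the real hash from the repo's
# "Files and versions" page.
path = hf_hub_download(
    repo_id="unsloth/Qwen3.6-35B-A3B-GGUF",
    filename="Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",  # hypothetical filename
    revision="abc1234",                           # hypothetical commit hash
)
print(path)
```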

MiniMax M2.7 NaN Contamination

The post discloses an active quality-control incident across multiple providers. According to Unsloth's investigation, 38% of Bartowski's MiniMax M2.7 quants (10 out of 26) contain NaN values: IQ3_XXS, IQ3_XS, IQ3_M, Q3_K_M, Q3_K_L, Q3_K_XL, Q4_K_S, Q4_1, and Q5_K_S at chunk-32 boundaries, plus IQ1_S crashing at chunk 311. Unsloth reports it identified and patched its own affected quants, 5 out of 23 (21%): UD-Q4_K_S, UD-Q4_K_M, UD-Q4_K_XL, UD-Q5_K_S, and MXFP4_MOE. Bartowski has not yet released patches but is described as actively working on a fix. Engineers currently running Bartowski MiniMax M2.7 quants at Q3 or Q4 sizes should treat outputs as potentially corrupted.
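
The chunk numbers above come from perplexity-style evaluation, where a NaN surfaces in the running per-chunk output. A rough way to screen a local quant for the same failure is to wrap llama.cpp's perplexity tool and flag the first non-finite chunk; the output format parsed here is an assumption and varies across llama.cpp builds:

```python
import re
import subprocess
import sys

def first_bad_chunk(model_path: str, text_file: str) -> int | None:
    """Scan llama.cpp perplexity output for the first non-finite chunk.

    Assumes a `llama-perplexity` binary on PATH that prints running
    per-chunk values like `[32]nan`; treat the parse as a sketch.
    """
    result = subprocess.run(
        ["llama-perplexity", "-m", model_path, "-f", text_file],
        capture_output=True, text=True,
    )
    combined = result.stdout + result.stderr
    for m in re.finditer(r"\[(\d+)\](nan|inf|[0-9.]+)", combined, re.IGNORECASE):
        if m.group(2).lower() in ("nan", "inf"):
            return int(m.group(1))
    return None

if __name__ == "__main__":
    bad = first_bad_chunk(sys.argv[1], sys.argv[2])
    print(f"first non-finite chunk: {bad}" if bad is not None else "all chunks finite")
```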

The Technical Detail

CUDA 13.2 Gibberish Bug

Unsloth confirms a CUDA 13.2 regression that causes low-bit quantizations across all models to produce gibberish output. This is not model-specific. NVIDIA has acknowledged the issue; a fix is confirmed for CUDA 13.3 by contributor johnnynunez in llama.cpp issue #21255. The temporary workaround is to downgrade to CUDA 13.1. Affected issue trackers: Unsloth issue #4849, llama.cpp issues #21255 and #21371.
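
Until 13.3 ships, a deployment script can refuse to run low-bit inference on the affected runtime. A minimal guard, querying the CUDA runtime's cudaRuntimeGetVersion() via ctypes (the shared-library name below varies by platform and install):

```python
import ctypes

def cuda_runtime_version() -> tuple[int, int]:
    """Query the installed CUDA runtime via cudaRuntimeGetVersion().

    The encoded value is major*1000 + minor*10 (e.g. 13020 -> 13.2).
    The library name varies: adjust for versioned .so files or the
    cudart64_*.dll naming on Windows.
    """
    libcudart = ctypes.CDLL("libcudart.so")  # Linux default; adjust as needed
    version = ctypes.c_int()
    rc = libcudart.cudaRuntimeGetVersion(ctypes.byref(version))
    if rc != 0:
        raise RuntimeError(f"cudaRuntimeGetVersion failed with code {rc}")
    v = version.value
    return v // 1000, (v % 1000) // 10

major, minor = cuda_runtime_version()
if (major, minor) == (13, 2):
    raise SystemExit(
        "CUDA 13.2 detected: low-bit GGUF quants produce gibberish on this "
        "runtime (llama.cpp issue #21255). Downgrade to 13.1 or wait for 13.3."
    )
```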

Qwen3.5 SSM Layer Findings

Unsloth previously shared 7TB of research artifacts identifying which layers should remain unquantized in Qwen3.5 SSM models, specifically the ssm_out and related ssm_* tensors. According to the post, most quant providers have since updated their releases based on these findings. Unsloth claims it now also leads on KLD vs. disk space for Qwen3.5.
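
For anyone producing their own quants, recent llama.cpp builds expose per-tensor overrides on llama-quantize that can keep such layers at higher precision. A sketch of the idea; the exact flag syntax and pattern matching should be checked against `llama-quantize --help` for your build, and the file paths here are illustrative:

```python
import subprocess

# Keep the SSM tensors at f16 while quantizing the rest to Q4_K_M.
# The --tensor-type PATTERN=TYPE override and the patterns below are
# assumptions to verify against your llama.cpp build's help output.
subprocess.run(
    [
        "llama-quantize",
        "--tensor-type", "ssm_out=f16",  # keep ssm_out unquantized
        "--tensor-type", "ssm_=f16",     # assumed pattern for other ssm_* tensors
        "model-f16.gguf",                # illustrative input path
        "model-Q4_K_M.gguf",             # illustrative output path
        "Q4_K_M",
    ],
    check=True,
)
```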

Gemma 4 Re-upload Attribution

Gemma 4 was re-uploaded four times. Three re-uploads were attributed to approximately 10–20 llama.cpp bug fixes, some co-investigated by Unsloth. The fourth was triggered by an official Google chat template update that required all providers to refresh their releases. Approximately 30 PRs are associated with Gemma 4 fixes in the llama.cpp repository, according to the post.

What To Watch

  • CUDA 13.3 release: The fix for the low-bit gibberish bug is staged for 13.3. Watch NVIDIA's CUDA toolkit release channel; until 13.3 ships, any low-bit GGUF inference on 13.2 is unreliable.
  • Bartowski MiniMax M2.7 patch: 10 of 26 quants remain NaN-contaminated as of this writing. Expect a patched release within days given the active investigation.
  • Independent KLD replication: Unsloth's 21/22 Pareto frontier claim for Qwen3.6 will face scrutiny from the LocalLLaMA community. Watch for third-party benchmark posts within the next two weeks.
  • llama.cpp upstream velocity: With ~30 PRs required to stabilize Gemma 4 alone, teams depending on stable GGUF inference should monitor llama.cpp release notes before pinning production quants.