What Happened
Oobabooga, the developer behind the widely used text-generation-webui, published five detailed benchmark reports evaluating GGUF quantization performance across four models: Gemma 4 26B-A4B, Gemma 4 E4B, Qwen3.5-35B-A3B, and Qwen3.5-27B. The reports are hosted on the LocalBench Substack, with the 31B analysis available for free and the remaining reports behind a paywall, according to the Reddit post by u/Plenty_Extent_9047 on r/LocalLLaMA (61 upvotes as of publication).
Each report covers roughly 70 to 90 individual GGUF quantizations, sourced from providers including Unsloth, Bartowski, LM Studio, GGML, Mradermacher, AesSedai, and Ubergarm.
Why It Matters
Local inference practitioners routinely face a poorly documented tradeoff: which quantization level preserves model quality at a given VRAM budget? Most public benchmarks use WikiText perplexity, which correlates weakly with real-world chat performance. Oobabooga's methodology targets that gap directly.
With 70-90 quants evaluated per model, this represents one of the most exhaustive public comparisons of GGUF quantization fidelity available for current-generation MoE and dense models. For teams running Gemma 4 or Qwen3.5 at the edge or on consumer hardware, the reports provide actionable quant selection data that would otherwise require significant internal testing infrastructure to replicate.
The paywall structure also signals a broader trend: high-compute evaluation work — the kind that generates trustworthy benchmarks — is increasingly difficult to sustain as a free community resource. The post author notes that "running these benchmarks takes a lot of time and money," and suggests oobabooga may periodically release paid reports for free.
The Technical Detail
The benchmark methodology uses KL Divergence rather than perplexity on WikiText. KL Divergence measures how much the probability distribution of a quantized model diverges from the full-precision reference model, providing a more direct signal of degradation across diverse prompt types.
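Concretely, the per-token divergence is D_KL(P‖Q) = Σ_x P(x) log(P(x)/Q(x)), where P is the full-precision model's next-token distribution and Q is the quantized model's. The snippet below is a minimal sketch of that computation, not oobabooga's actual harness; it assumes you have already collected logits from both models at the same token positions, and all names are illustrative:

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_kl(ref_logits: np.ndarray, quant_logits: np.ndarray) -> np.ndarray:
    """Per-token KL(P_ref || P_quant) for two logit arrays of shape
    (num_tokens, vocab_size); returns one divergence per token position."""
    log_p = log_softmax(ref_logits)    # full-precision reference
    log_q = log_softmax(quant_logits)  # quantized candidate
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Usage: run both models over the same prompts, collect logits at each
# position, then average: mean_kl = token_kl(ref, quant).mean()
```

Averaging this per-token divergence rewards a quant for matching the reference's full output distribution rather than its fit to one corpus, which is why it can surface degradation that WikiText perplexity misses.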
The evaluation dataset spans approximately 250,000 tokens across six categories:
- Coding
- General chat
- Tool calling
- Science
- Non-Latin scripts
- Long documents
This multi-domain approach matters for MoE architectures like Gemma 4 26B-A4B (4B active parameters) and Qwen3.5-35B-A3B (3B active parameters), where expert routing behavior can degrade unevenly across quantization levels and task types. A coding-focused quant might preserve perplexity on WikiText while showing measurable KL divergence on tool-calling prompts — exactly the failure mode this methodology is designed to surface.
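As a toy illustration of that failure mode, the sketch below aggregates per-prompt KL scores by category; the numbers are invented for demonstration and are not from the reports:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-prompt results for one quant: (category, mean token KL).
# Values are invented purely for illustration.
results = [
    ("coding", 0.012), ("coding", 0.015),
    ("general_chat", 0.009), ("general_chat", 0.011),
    ("tool_calling", 0.041), ("tool_calling", 0.038),
]

by_category = defaultdict(list)
for category, kl in results:
    by_category[category].append(kl)

for category, scores in sorted(by_category.items()):
    print(f"{category:>13}: mean KL = {mean(scores):.4f}")

# The overall mean (~0.021) looks acceptable, while the tool-calling rows
# show roughly 4x the divergence of chat prompts -- the kind of uneven
# degradation a single aggregate score would hide.
```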
The models under evaluation:
- Gemma 4 26B-A4B: Google's MoE model with 26B total / 4B active parameters
- Gemma 4 E4B: A dense 4B variant in the Gemma 4 family
- Qwen3.5-35B-A3B: Alibaba's MoE with 35B total / 3B active parameters
- Qwen3.5-27B: Dense 27B model from Alibaba's Qwen3.5 series
Quant providers covered include the major community distributors. Notably, Ubergarm specializes in high-quality IQ-series quants that often outperform same-bit standard GGUF quants — making cross-provider comparison at equivalent bit depths particularly relevant.
What To Watch
Over the next 30 days, several developments are worth tracking:
- Free report releases: The post author indicates oobabooga may periodically unlock paid reports. The Gemma 4 26B-A4B and Qwen3.5-35B-A3B reports are the most operationally relevant for teams running MoE models on consumer or prosumer hardware — watch the LocalBench Substack for access.
- Quant provider responses: Bartowski and Unsloth both iterate quickly on quantization approaches. Public KL Divergence data at this scale typically prompts updated quant releases from these providers within weeks.
- Methodology adoption: KL Divergence-based GGUF evaluation is not yet standard. If oobabooga's reports gain traction, expect llama.cpp maintainers and other benchmark projects to incorporate similar multi-domain evaluation datasets.
- Qwen3.5 coverage expansion: Alibaba continues to release additional Qwen3.5 variants. Follow-up reports covering larger or smaller model sizes are plausible if community support for the Substack grows.