GGUF
11 articles tagged with this topic
Qwen3.6 GGUF Benchmarks
Unsloth claims top KLD-vs-disk-space performance for Qwen3.6-35B-A3B quants in 21 of 22 Pareto frontier comparisons.
Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga
Oobabooga published 5 benchmark reports, each covering 70-90 GGUF quants of Gemma 4 and Qwen 3.5 models, using a KL divergence methodology.
Qwen3.5-9B GGUF Quant Rankings: Q8_0 Dominates KLD Scores
KLD benchmarks across community GGUF quants show Q8_0 variants cluster near 0.001 KLD, with quality degrading sharply below Q5.
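For context on the metric behind these rankings, here is a minimal sketch of per-token KL divergence between a full-precision reference distribution `p` and a quantized model's distribution `q`. This is an illustration of the formula only, not the benchmark's actual implementation (llama.cpp's tooling computes it over a full evaluation corpus):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// D_KL(p || q) = sum_i p_i * log(p_i / q_i), for one next-token distribution.
// A score near 0.001 means the quant's output distribution barely deviates
// from the full-precision model's.
double token_kld(const std::vector<double>& p, const std::vector<double>& q) {
    double kld = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0) kld += p[i] * std::log(p[i] / (q[i] + 1e-12));
    }
    return kld;
}
```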
On-Device AI Model Deployment in Practice, Part 5 (Loading Large Models on Android)
Step-by-step JNI bridge implementation for running quantized LLMs on Android using llama.cpp.
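As a hedged sketch of what such a bridge's native side can look like: the Java/Kotlin class and method names below (`com.example.LlamaBridge.loadModel`) are hypothetical, while the llama.cpp calls follow a recent `llama.h`; the article's actual implementation may differ.

```cpp
#include <jni.h>
#include "llama.h"  // llama.cpp C API header

// Native side of a hypothetical bridge.
// Kotlin side would declare:  external fun loadModel(path: String): Long
extern "C" JNIEXPORT jlong JNICALL
Java_com_example_LlamaBridge_loadModel(JNIEnv* env, jobject /*thiz*/, jstring jpath) {
    llama_backend_init();  // backend init; do this once per process in real code
    const char* path = env->GetStringUTFChars(jpath, nullptr);
    llama_model_params mparams = llama_model_default_params();
    llama_model* model = llama_model_load_from_file(path, mparams);  // nullptr on failure
    env->ReleaseStringUTFChars(jpath, path);
    return reinterpret_cast<jlong>(model);  // opaque handle for the managed side
}
```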
Unsloth Releases Full GGUF Quant Suite for MiniMax M2.7
Unsloth uploads 22 GGUF quantizations of MiniMax M2.7, ranging from 1-bit (60.7 GB) to BF16 (457 GB).
MiniMax-M2.7 229B MoE Gets First GGUF Quants for Apple Silicon
MiniMax-M2.7 (229B MoE) quantized to Q3_K_L (110GB) and Q8_0 (243GB) GGUF formats, now on HuggingFace.
Gemma 4 Local CUDA Setup: Precision Traps and Real Benchmarks
Running Gemma 4 locally on CUDA requires strict dtype matching at KV cache boundaries or output degenerates silently.
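One way to pin those precisions with llama.cpp's C API is to set the KV cache element types explicitly when creating the context. A minimal sketch, assuming a recent `llama.h` (the article's exact Gemma 4 setup may differ):

```cpp
#include "llama.h"

// Force both halves of the KV cache to the same element type so attention
// reads and writes agree on precision.
llama_context* make_context(llama_model* model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_F16;  // K cache precision
    cparams.type_v = GGML_TYPE_F16;  // V cache precision, kept identical to K
    return llama_init_from_model(model, cparams);
}
```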
Gemma 4 26B: Q8 mmproj Unlocks 60K+ Context With Vision
Switching from F16 to Q8_0 mmproj on Gemma 4 26B adds ~30K context tokens with no vision quality loss.
HunyuanOCR 1B Runs at 90 t/s on GTX 1060 via GGUF
Tencent's HunyuanOCR 1B model runs at 90 tokens/sec on a GTX 1060 via GGUF, enabling local OCR on budget hardware.
Run Qwen3-Coder 80B Locally at 54GB With Apex Quantization
A community GGUF quantization shrinks Qwen3-Coder 80B to 54.1GB, making fast local coding inference practical.
Harmonic-9B: Two-Stage Qwen3-9B Fine-Tune for Agent Use Cases
Community researcher releases Harmonic-9B, a staged fine-tune of Qwen3-9B targeting reliable tool-calling and structured reasoning.