llama.cpp
30 articles tagged with this topic
Qwen3.6 GGUF Benchmarks
Unsloth claims top KLD-vs-disk-space performance for Qwen3.6-35B-A3B quants in 21 of 22 Pareto frontier comparisons.
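In these comparisons, a quant sits on the Pareto frontier when no other quant beats it on both axes at once, i.e. smaller on disk and lower in KLD. A minimal sketch of that filter with made-up numbers, not Unsloth's actual tooling:

```python
def pareto_frontier(points):
    """Keep (disk_gb, kld) pairs not dominated by any other point,
    i.e. no rival quant is both smaller on disk and lower in KLD."""
    frontier = []
    for size, kld in sorted(points):  # ascending disk size
        if not frontier or kld < frontier[-1][1]:
            frontier.append((size, kld))
    return frontier

# Illustrative numbers only, not the article's measurements:
quants = [(4.1, 0.012), (5.2, 0.006), (6.0, 0.007), (8.5, 0.001)]
print(pareto_frontier(quants))  # (6.0, 0.007) is dominated and dropped
```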
GPU-poor with ~12GB VRAM and a 3080 getting 40 tg/s on Qwen3.6-35B-A3B w/ 260K ctx
A llama.cpp fork with turbo3 KV cache quantization achieves ~40 tok/s on Qwen3.6-35B-A3B with only 12GB VRAM.
Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga
Oobabooga published 5 benchmark reports covering 70-90 GGUF quants each for Gemma 4 and Qwen 3.5 models using KL Divergence methodology.
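KL divergence here measures how far a quant's next-token distribution drifts from the full-precision model's on the same text (llama.cpp's perplexity tool exposes this as a --kl-divergence mode). A minimal numpy sketch of the per-token metric; the function names are illustrative, not oobabooga's actual harness:

```python
import numpy as np

def kld_per_token(ref_logits: np.ndarray, quant_logits: np.ndarray) -> np.ndarray:
    """KL(ref || quant) per token; inputs are (n_tokens, vocab_size) raw
    logits from the FP16 reference and the quantized model on the same text."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(ref_logits)    # reference distribution
    log_q = log_softmax(quant_logits)  # quant distribution
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Reported scores are the mean of this value over a long corpus:
# mean_kld = kld_per_token(ref, quant).mean()
```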
Gemma 4 Jailbreak System Prompt
A system prompt designed to bypass Gemma 4's safety filters is circulating on Reddit with 112 upvotes.
Local AI is the best
A Reddit post praising local AI tools contains no verifiable news, data, or technical developments.
Qwen3.5-9B GGUF Quant Rankings: Q8_0 Dominates KLD Scores
KLD benchmarks across community GGUF quants show Q8_0 variants cluster near 0.001 KLD, with quality degrading sharply below Q5.
On-Device AI Model Deployment in Practice, Part 5: Loading Large Models on Android
Step-by-step JNI bridge implementation for running quantized LLMs on Android using llama.cpp.
Unsloth Releases Full GGUF Quant Suite for MiniMax M2.7
Unsloth uploads 22 GGUF quantizations of MiniMax M2.7, ranging from 1-bit (60.7 GB) to BF16 (457 GB).
MiniMax-M2.7 229B MoE Gets First GGUF Quants for Apple Silicon
MiniMax-M2.7 (229B MoE) quantized to Q3_K_L (110GB) and Q8_0 (243GB) GGUF formats, now on HuggingFace.
KV Cache Compression Breakthrough: Structural Rewrite of Local LLM Deployment Costs
llama.cpp achieves 6.8x KV cache compression, cutting 131K-context VRAM from 8.2GB to 1.2GB and rewriting the procurement math for local AI hardware.
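The arithmetic checks out: an f16 KV cache scales linearly with context length, and dividing the quoted 8.2 GB by 6.8 lands on the reported 1.2 GB. A back-of-envelope sketch (the model dimensions below are assumptions for illustration, not from the article):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per position
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed dims: 32 layers, 4 KV heads of dim 128 (illustrative only)
f16_gb = kv_cache_bytes(32, 4, 128, 131_072, bytes_per_elem=2) / 1e9
print(round(f16_gb, 1))     # ~8.6 GB at f16, in the ballpark of the quoted 8.2 GB
print(round(8.2 / 6.8, 2))  # 1.21 GB after 6.8x compression, matching the claim
```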
The Rise of Local OCR Models: The Countdown to the End of Bill Recognition Outsourcing
llama.cpp now enables local OCR deployment, letting enterprises bypass cloud APIs and forcing repricing in the annual bill-recognition outsourcing market.
Local LLMs Lose Tool Call Accuracy After 8–9 Chained Calls
Qwen 32B, Gemma 9B, and Command R 32B all fail similarly after 8+ tool calls — attention dilution, not context limits.
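This is straightforward to reproduce against any local model: chain dependent tool calls through an OpenAI-compatible endpoint (llama-server exposes one) and log the depth at which the model first passes a wrong argument. A rough sketch; the port, tool schema, and key-chaining task are illustrative assumptions, not the original poster's setup:

```python
import json
import requests

URL = "http://localhost:8080/v1/chat/completions"  # llama-server default port

TOOLS = [{"type": "function", "function": {
    "name": "lookup",
    "description": "Return the value stored under a key.",
    "parameters": {"type": "object",
                   "properties": {"key": {"type": "string"}},
                   "required": ["key"]},
}}]

def first_failure_depth(max_depth: int = 12) -> int:
    """Chain lookups k0 -> k1 -> ... and return the depth where the model
    first calls the tool with the wrong key (or stops calling it)."""
    messages = [{"role": "user", "content":
                 "Call lookup with key 'k0'. Each result names the next key to look up."}]
    for depth in range(max_depth):
        msg = requests.post(URL, json={
            "messages": messages, "tools": TOOLS, "temperature": 0,
        }).json()["choices"][0]["message"]
        calls = msg.get("tool_calls") or []
        args = json.loads(calls[0]["function"]["arguments"]) if calls else {}
        if args.get("key") != f"k{depth}":
            return depth  # the model lost the chain here
        messages.append(msg)
        messages.append({"role": "tool", "tool_call_id": calls[0]["id"],
                         "content": f"value found; the next key is k{depth + 1}"})
    return max_depth

print(first_failure_depth())
```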
Qwen 3.5 35B Benchmarks: Vulkan vs ROCm on AMD Strix Halo
Vulkan wins token generation (~57.5 t/s) while ROCm leads prompt processing (~1052 t/s) on AMD Ryzen AI MAX+ 395.
Fixing Gemma 4 Tool Calls in llama.cpp: Root Causes Explained
Four bugs in llama.cpp's Gemma 4 chat template handling caused tool call results to crash or loop.
Controlling Gemma 4 Thinking Tokens via System Prompts
Users struggle to reliably toggle Gemma 4's reasoning mode via system prompts, unlike Qwen-30B-A3B.
Local LLM Setup Guide for RTX 5070 12GB VRAM
Choosing local AI models for chat, writing, and music on a 12GB VRAM RTX 5070 build.
Google Edge Gallery App: First Impressions from LocalLLaMA Community
A LocalLLaMA user shares early impressions of Google's Edge Gallery on-device AI app for Android.
Gemma 4 Local CUDA Setup: Precision Traps and Real Benchmarks
Running Gemma 4 locally on CUDA requires strict dtype matching at KV cache boundaries or output degenerates silently.
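The knobs involved are llama.cpp's KV cache type settings; the safe pattern is to pin the K and V cache types explicitly rather than mixing defaults (a quantized V cache also requires flash attention). A sketch via the llama-cpp-python bindings, assuming its type_k/type_v/flash_attn constructor parameters, which you should verify against your installed version:

```python
from llama_cpp import Llama, GGML_TYPE_Q8_0

llm = Llama(
    model_path="gemma-4-26b-q4_k_m.gguf",  # illustrative filename
    n_ctx=32_768,
    n_gpu_layers=-1,
    flash_attn=True,        # required by llama.cpp for a quantized V cache
    type_k=GGML_TYPE_Q8_0,  # keep K and V cache dtypes deliberately matched
    type_v=GGML_TYPE_Q8_0,
)
print(llm("Sanity check: 2 + 2 =", max_tokens=8)["choices"][0]["text"])
```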
Gemma-4 E4B Vision Benchmarked: Scores 0.27 vs Qwen3.5-4B's 0.5
Community testing shows Gemma-4 E4B scores 0.27 on 100 vision tasks vs Qwen3.5-4B's baseline 0.5, raising red flags for multimodal use.
llama.cpp llama-bench Adds -fitc and -fitt Benchmark Flags
llama-bench gains -fitc and -fitt flags from build b4679, enabling finer control over benchmark timing output.
GGML Adds Q1_0 1-Bit Quantization: Run 8B Models at 1.15GB
GGML now supports Q1_0 1-bit quantization, shrinking Bonsai 8B models to 1.15GB for CPU-only inference.
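The size is consistent with a genuinely ~1-bit format once block scales are counted; effective bits per weight fall straight out of the article's own numbers:

```python
params = 8e9      # Bonsai 8B, approximate parameter count
file_gb = 1.15    # Q1_0 file size reported above
bits_per_weight = file_gb * 1e9 * 8 / params
print(round(bits_per_weight, 2))  # ~1.15: 1-bit payload plus scale/metadata overhead
```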
llama.cpp Q8_0 Gets 3.1x Speedup on Intel Arc GPUs via SYCL Fix
A 200-line SYCL patch fixes missing reorder optimization for Q8_0, boosting Arc B70 from 4.88 to 15.24 t/s.
37 LLMs Benchmarked on MacBook Air M5 32GB: Full Speed Results
Community benchmark of 37 local LLMs on M5 Air 32GB using llama-bench reveals MoE models as clear winners for speed-to-quality ratio.
Best Local LLM for Agentic Coding on a Single RTX 4090
A 4090 owner benchmarks GLM-4.7, Nemotron-30B, and Qwen3-Coder for local agentic coding via llama.cpp.
APEX Quantization vs K-Quants: Why MoE Coding Models Need Different Compression
APEX quantization targets MoE architecture coherence layers at Q8, outperforming generic K-quants for multi-file coding agents.
Qwen3.5 vs Gemma4 vs Cloud LLMs: Python Turtle Drawing Benchmark
A Reddit user benchmarks local and cloud LLMs on Python turtle graphics, revealing Gemma4 and Gemini share visual style.
Gemma 4 26B: Q8 mmproj Unlocks 60K+ Context With Vision
Switching from F16 to Q8_0 mmproj on Gemma 4 26B adds ~30K context tokens with no vision quality loss.
HunyuanOCR 1B Runs at 90 t/s on GTX 1060 via GGUF
Tencent's HunyuanOCR 1B model runs at 90 tokens/sec on a GTX 1060 via GGUF, enabling local OCR on budget hardware.
Local AI Goes Mainstream When the Tooling Becomes Boring Infrastructure
A Reddit argument: local LLM adoption hinges on reliable tooling stacks, not benchmark gains, mirroring Docker's container revolution.
LLM Test Prompts That Reveal Real Model Quality for Builders
Community-sourced prompts expose reasoning gaps in local LLMs, helping solo builders pick reliable models for production workflows.