llama.cpp
30 articles tagged with this topic
Qwen3.6 GGUF Benchmarks
Unsloth claims top KLD-vs-disk-space performance for Qwen3.6-35B-A3B quants in 21 of 22 Pareto frontier comparisons.
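In these comparisons, a quant sits on the Pareto frontier when no other quant beats it on both axes at once, i.e. smaller on disk and lower in KLD. A minimal sketch of that filter with made-up numbers, not Unsloth's actual tooling:

```python
def pareto_frontier(points):
    """Keep (disk_gb, kld) pairs not dominated by any other point,
    i.e. no rival quant is both smaller on disk and lower in KLD."""
    frontier = []
    for size, kld in sorted(points):  # ascending disk size
        if not frontier or kld < frontier[-1][1]:
            frontier.append((size, kld))
    return frontier

# Illustrative numbers only, not the article's measurements:
quants = [(4.1, 0.012), (5.2, 0.006), (6.0, 0.007), (8.5, 0.001)]
print(pareto_frontier(quants))  # (6.0, 0.007) is dominated and dropped
```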
GPU-poor with ~12GB VRAM and a 3080 getting 40 tg/s on Qwen3.6-35B-A3B w/ 260K ctx
A llama.cpp fork with turbo3 KV cache quantization achieves ~40 tok/s on Qwen3.6-35B-A3B with only 12GB VRAM.
Gemma 4 and Qwen 3.5 GGUFs: Detailed Analysis by oobabooga
Oobabooga published 5 benchmark reports covering 70-90 GGUF quants each for Gemma 4 and Qwen 3.5 models using KL Divergence methodology.
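KL divergence here measures how far a quant's next-token distribution drifts from the full-precision model's on the same text (llama.cpp's perplexity tool exposes this as a --kl-divergence mode). A minimal numpy sketch of the per-token metric; the function names are illustrative, not oobabooga's actual harness:

```python
import numpy as np

def kld_per_token(ref_logits: np.ndarray, quant_logits: np.ndarray) -> np.ndarray:
    """KL(ref || quant) per token; inputs are (n_tokens, vocab_size) raw
    logits from the FP16 reference and the quantized model on the same text."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(ref_logits)    # reference distribution
    log_q = log_softmax(quant_logits)  # quant distribution
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Reported scores are the mean of this value over a long corpus:
# mean_kld = kld_per_token(ref, quant).mean()
```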
Gemma 4 Jailbreak System Prompt
A system prompt designed to bypass Gemma 4's safety filters is circulating on Reddit with 112 upvotes.
Local AI is the best
A Reddit post praising local AI tools contains no verifiable news, data, or technical developments.
Qwen3.5-9B GGUF Quant Rankings: Q8_0 Dominates KLD Scores
KLD benchmarks across community GGUF quants show Q8_0 variants cluster near 0.001 KLD, with quality degrading sharply below Q5.
On-Device AI Model Deployment in Practice, Part 5: Loading Large Models on Android
Step-by-step JNI bridge implementation for running quantized LLMs on Android using llama.cpp.
Unsloth Releases Full GGUF Quant Suite for MiniMax M2.7
Unsloth uploads 22 GGUF quantizations of MiniMax M2.7, ranging from 1-bit (60.7 GB) to BF16 (457 GB).
MiniMax-M2.7 229B MoE Gets First GGUF Quants for Apple Silicon
MiniMax-M2.7 (229B MoE) quantized to Q3_K_L (110GB) and Q8_0 (243GB) GGUF formats, now on HuggingFace.
KV Cache Compression Breakthrough: Structural Rewrite of Local LLM Deployment Costs
llama.cpp achieves 6.8x KV cache compression, cutting 131K-context VRAM from 8.2GB to 1.2GB and rewriting the procurement math for local AI hardware.
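The arithmetic checks out: an f16 KV cache scales linearly with context length, and dividing the quoted 8.2 GB by 6.8 lands on the reported 1.2 GB. A back-of-envelope sketch (the model dimensions below are assumptions for illustration, not from the article):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per position
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed dims: 32 layers, 4 KV heads of dim 128 (illustrative only)
f16_gb = kv_cache_bytes(32, 4, 128, 131_072, bytes_per_elem=2) / 1e9
print(round(f16_gb, 1))     # ~8.6 GB at f16, in the ballpark of the quoted 8.2 GB
print(round(8.2 / 6.8, 2))  # 1.21 GB after 6.8x compression, matching the claim
```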
The Rise of Local OCR Models: The Countdown to the End of Bill Recognition Outsourcing
llama.cpp now enables local OCR deployment, letting enterprises bypass cloud APIs and forcing repricing in the annual bill-recognition outsourcing market.
Local LLMs Lose Tool Call Accuracy After 8–9 Chained Calls
Qwen 32B, Gemma 9B, and Command R 32B all fail similarly after 8+ tool calls — attention dilution, not context limits.
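This is straightforward to reproduce against any local model: chain dependent tool calls through an OpenAI-compatible endpoint (llama-server exposes one) and log the depth at which the model first passes a wrong argument. A rough sketch; the port, tool schema, and key-chaining task are illustrative assumptions, not the original poster's setup:

```python
import json
import requests

URL = "http://localhost:8080/v1/chat/completions"  # llama-server default port

TOOLS = [{"type": "function", "function": {
    "name": "lookup",
    "description": "Return the value stored under a key.",
    "parameters": {"type": "object",
                   "properties": {"key": {"type": "string"}},
                   "required": ["key"]},
}}]

def first_failure_depth(max_depth: int = 12) -> int:
    """Chain lookups k0 -> k1 -> ... and return the depth where the model
    first calls the tool with the wrong key (or stops calling it)."""
    messages = [{"role": "user", "content":
                 "Call lookup with key 'k0'. Each result names the next key to look up."}]
    for depth in range(max_depth):
        msg = requests.post(URL, json={
            "messages": messages, "tools": TOOLS, "temperature": 0,
        }).json()["choices"][0]["message"]
        calls = msg.get("tool_calls") or []
        args = json.loads(calls[0]["function"]["arguments"]) if calls else {}
        if args.get("key") != f"k{depth}":
            return depth  # the model lost the chain here
        messages.append(msg)
        messages.append({"role": "tool", "tool_call_id": calls[0]["id"],
                         "content": f"value found; the next key is k{depth + 1}"})
    return max_depth

print(first_failure_depth())
```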
Qwen 3.5 35B Benchmarks: Vulkan vs ROCm on AMD Strix Halo
Vulkan wins token generation (~57.5 t/s) while ROCm leads prompt processing (~1052 t/s) on AMD Ryzen AI MAX+ 395.
Fixing Gemma 4 Tool Calls in llama.cpp: Root Causes Explained
Four bugs in llama.cpp's Gemma 4 chat template handling caused tool call results to crash or loop.
Controlling Gemma 4 Thinking Tokens via System Prompts
Users struggle to reliably toggle Gemma 4's reasoning mode via system prompts, unlike Qwen-30B-A3B.
Local LLM Setup Guide for RTX 5070 12GB VRAM
Choosing local AI models for chat, writing, and music on a 12GB VRAM RTX 5070 build.
Google Edge Gallery App: First Impressions from LocalLLaMA Community
A LocalLLaMA user shares early impressions of Google's Edge Gallery on-device AI app for Android.
Gemma 4 Local CUDA Setup: Precision Traps and Real Benchmarks
Running Gemma 4 locally on CUDA requires strict dtype matching at KV cache boundaries or output degenerates silently.
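The knobs involved are llama.cpp's KV cache type settings; the safe pattern is to pin the K and V cache types explicitly rather than mixing defaults (a quantized V cache also requires flash attention). A sketch via the llama-cpp-python bindings, assuming its type_k/type_v/flash_attn constructor parameters, which you should verify against your installed version:

```python
from llama_cpp import Llama, GGML_TYPE_Q8_0

llm = Llama(
    model_path="gemma-4-26b-q4_k_m.gguf",  # illustrative filename
    n_ctx=32_768,
    n_gpu_layers=-1,
    flash_attn=True,        # required by llama.cpp for a quantized V cache
    type_k=GGML_TYPE_Q8_0,  # keep K and V cache dtypes deliberately matched
    type_v=GGML_TYPE_Q8_0,
)
print(llm("Sanity check: 2 + 2 =", max_tokens=8)["choices"][0]["text"])
```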
Gemma-4 E4B Vision Benchmarked: Scores 0.27 vs Qwen3.5-4B's 0.5
Community testing shows Gemma-4 E4B scores 0.27 on 100 vision tasks vs Qwen3.5-4B's baseline 0.5, raising red flags for multimodal use.
llama.cpp llama-bench Adds -fitc and -fitt Benchmark Flags
llama-bench gains -fitc and -fitt flags from build b4679, enabling finer control over benchmark timing output.
GGML Adds Q1_0 1-Bit Quantization: Run 8B Models at 1.15GB
GGML now supports Q1_0 1-bit quantization, shrinking Bonsai 8B models to 1.15GB for CPU-only inference.
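The size is consistent with a genuinely ~1-bit format once block scales are counted; effective bits per weight fall straight out of the article's own numbers:

```python
params = 8e9      # Bonsai 8B, approximate parameter count
file_gb = 1.15    # Q1_0 file size reported above
bits_per_weight = file_gb * 1e9 * 8 / params
print(round(bits_per_weight, 2))  # ~1.15: 1-bit payload plus scale/metadata overhead
```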
llama.cpp Q8_0 Gets 3.1x Speedup on Intel Arc GPUs via SYCL Fix
A 200-line SYCL patch fixes missing reorder optimization for Q8_0, boosting Arc B70 from 4.88 to 15.24 t/s.
37 LLMs Benchmarked on MacBook Air M5 32GB: Full Speed Results
Community benchmark of 37 local LLMs on M5 Air 32GB using llama-bench reveals MoE models as clear winners for speed-to-quality ratio.
Best Local LLM for Agentic Coding on a Single RTX 4090
A 4090 owner benchmarks GLM-4.7, Nemotron-30B, and Qwen3-Coder for local agentic coding via llama.cpp.
APEX Quantization vs K-Quants: Why MoE Coding Models Need Different Compression
APEX quantization targets MoE architecture coherence layers at Q8, outperforming generic K-quants for multi-file coding agents.
Qwen3.5 vs Gemma4 vs Cloud LLMs: Python Turtle Drawing Benchmark
A Reddit user benchmarks local and cloud LLMs on Python turtle graphics, revealing Gemma4 and Gemini share visual style.
Gemma 4 26B: Q8 mmproj Unlocks 60K+ Context With Vision
Switching from F16 to Q8_0 mmproj on Gemma 4 26B adds ~30K context tokens with no vision quality loss.
HunyuanOCR 1B Runs at 90 t/s on GTX 1060 via GGUF
Tencent's HunyuanOCR 1B model runs at 90 tokens/sec on a GTX 1060 via GGUF, enabling local OCR on budget hardware.
Local AI Goes Mainstream When the Tooling Becomes Boring Infrastructure
A Reddit argument: local LLM adoption hinges on reliable tooling stacks, not benchmark gains, mirroring Docker's container revolution.
LLM Test Prompts That Reveal Real Model Quality for Builders
Community-sourced prompts expose reasoning gaps in local LLMs, helping solo builders pick reliable models for production workflows.