
vLLM

8 articles tagged with this topic

Qwen 32B · llama.cpp

Local LLMs Lose Tool Call Accuracy After 8–9 Chained Calls

Qwen 32B, Gemma 9B, and Command R 32B all fail similarly after 8+ tool calls — attention dilution, not context limits.

Apr 8 · 4 min read
Gemma 4 · vLLM

Running Gemma 4 26B-A4B on vLLM: Community Troubleshooting Notes

Developers report mixed results deploying Gemma 4 26B-A4B on vLLM, with INT4 quants too slow on DGX Spark GB10.

Apr 6 · 1 min read
Qwen · vLLM

Agent Swarms + Continuous Batching Cut LLM Task Time 36x

Running 50 parallel agents on Qwen 27B drops a 42-minute research job to 70 seconds using continuous batching.

Apr 6 · 2 min read
llama.cpp · Ollama

Local AI Goes Mainstream When the Tooling Becomes Boring Infrastructure

A Reddit argument: local LLM adoption hinges on reliable tooling stacks, not benchmark gains, mirroring Docker's container revolution.

Apr 6 · 2 min read
vLLM · Gemma 4

Run Gemma 4 26B Locally with vLLM and NVFP4 Quantization

A working bash script runs Gemma 4 26B via vLLM with NVFP4 quantization in Docker on consumer hardware.

Apr 6 · 2 min read
Qwen 3.5 · llama.cpp

Qwen 3.5 Tool Calling Bugs: What's Broken and How to Fix Them

Four confirmed bugs break Qwen 3.5 tool calling in agentic setups. Here's what's fixed, what's still open, and client-side workarounds.

Apr 6 · 2 min read
vLLM · PagedAttention

vLLM PagedAttention: From Memory Management to Production Deployment

vLLM's PagedAttention raises GPU memory utilization from 60% to 95%+ using OS paging concepts for LLM inference.

Apr 5 · 2 min read
Hermes Agent · Nous Research

Hermes Agent: Best Open-Source Local LLM Agent Framework in 2025

Nous Research's Hermes Agent offers per-model tool call parsers, Ollama/vLLM support, and MIT license at 22k stars.

Apr 5 · 2 min read