vLLM
8 articles tagged with this topic
Local LLMs Lose Tool Call Accuracy After 8–9 Chained Calls
Qwen 32B, Gemma 9B, and Command R 32B all degrade the same way after 8+ chained tool calls; the cause is attention dilution, not context limits.
Running Gemma 4 26B-A4B on vLLM: Community Troubleshooting Notes
Developers report mixed results deploying Gemma 4 26B-A4B on vLLM, with INT4 quants running too slowly on the DGX Spark GB10.
Agent Swarms + Continuous Batching Cut LLM Task Time 36x
Running 50 parallel agents on Qwen 27B drops a 42-minute research job to 70 seconds using continuous batching.
Local AI Goes Mainstream When the Tooling Becomes Boring Infrastructure
A Reddit argument: local LLM adoption hinges on reliable tooling stacks, not benchmark gains, mirroring Docker's container revolution.
Run Gemma 4 26B Locally with vLLM and NVFP4 Quantization
A working bash script runs Gemma 4 26B via vLLM with NVFP4 quantization in Docker on consumer hardware.
Qwen 3.5 Tool Calling Bugs: What's Broken and How to Fix Them
Four confirmed bugs break Qwen 3.5 tool calling in agentic setups. Here's what's fixed, what's still open, and which client-side workarounds help.
vLLM PagedAttention: From Memory Management to Production Deployment
vLLM's PagedAttention raises GPU memory utilization from 60% to 95%+ by applying OS-style paging to KV-cache management.
Hermes Agent: Best Open-Source Local LLM Agent Framework in 2025
Nous Research's Hermes Agent offers per-model tool-call parsers, Ollama/vLLM support, and an MIT license, with 22k GitHub stars.