
vLLM

8 articles tagged with this topic

Qwen 32B · llama.cpp

Local LLMs Lose Tool Call Accuracy After 8–9 Chained Calls

Qwen 32B, Gemma 9B, and Command R 32B all fail similarly after 8+ tool calls — attention dilution, not context limits.

Apr 8 · 4 min read
Gemma 4 · vLLM

Running Gemma 4 26B-A4B on vLLM: Community Troubleshooting Notes

Developers report mixed results deploying Gemma 4 26B-A4B on vLLM, with INT4 quants too slow on DGX Spark GB10.

Apr 6 · 1 min read
Qwen · vLLM

Agent Swarms + Continuous Batching Cut LLM Task Time 36x

Running 50 parallel agents on Qwen 27B drops a 42-minute research job to 70 seconds using continuous batching.

Apr 6 · 2 min read
llama.cpp · Ollama

Local AI Goes Mainstream When the Tooling Becomes Boring Infrastructure

A Reddit argument: local LLM adoption hinges on reliable tooling stacks, not benchmark gains, mirroring Docker's container revolution.

Apr 6 · 2 min read
vLLM · Gemma 4

Run Gemma 4 26B Locally with vLLM and NVFP4 Quantization

A working bash script runs Gemma 4 26B via vLLM with NVFP4 quantization in Docker on consumer hardware.

Apr 6 · 2 min read
Qwen 3.5 · llama.cpp

Qwen 3.5 Tool Calling Bugs: What's Broken and How to Fix Them

Four confirmed bugs break Qwen 3.5 tool calling in agentic setups. Here's what's fixed, what's still open, and client-side workarounds.

Apr 6 · 2 min read
vLLM · PagedAttention

vLLM PagedAttention: From Memory Management to Production Deployment

vLLM's PagedAttention raises GPU memory utilization from 60% to 95%+ using OS paging concepts for LLM inference.

Apr 5 · 2 min read
Hermes Agent · Nous Research

Hermes Agent: Best Open-Source Local LLM Agent Framework in 2025

Nous Research's Hermes Agent offers per-model tool call parsers, Ollama/vLLM support, and MIT license at 22k stars.

Apr 5 · 2 min read