What Happened

A Reddit user shared a working bash script that runs Google's Gemma 4 26B (A4B mixture-of-experts variant) locally using vLLM with NVFP4 quantization. The setup pulls the bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 model from Hugging Face, patches vLLM's gemma4.py model file via Docker, and serves the model as an OpenAI-compatible API endpoint. GPU memory utilization is capped at 88%, max context is 512 tokens, and the server handles one sequence at a time, which suggests the setup targets a single high-end consumer GPU.

Solo Founder Angle

For a one-person company, this setup provides a private, zero-API-cost inference endpoint for a capable 26B MoE model. Concrete workflow:

  • Use the provided bash script as-is — it handles Docker image patching, directory setup, and container launch automatically.
  • Point your existing OpenAI SDK calls to localhost:[PORT] with model name gemma-4-26b-a4b-it-nvfp4; beyond the base URL, no code changes are needed.
  • Pair with tools like LangChain, LlamaIndex, or Open WebUI for a full local AI stack.
  • Store the HF_TOKEN as an environment variable in your shell profile — the script validates it on startup.
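The SDK-repointing step above amounts to sending standard OpenAI-style chat requests to the local server. Here is a minimal sketch of what that request looks like, using only the standard library; the port (8000, vLLM's default) is an assumption, so substitute whatever port the script actually exposes:

```python
import json
import urllib.request

# Assumption: the local server listens on port 8000 (vLLM's default).
# Replace with the port configured in the bash script.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the local model."""
    payload = {
        "model": "gemma-4-26b-a4b-it-nvfp4",  # served model name from the script
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # stay well inside the 512-token context window
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To call the live endpoint once the container is up:
#   with urllib.request.urlopen(build_chat_request("Classify this ticket: ...")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the payload shape is the same one the OpenAI SDK emits, pointing the SDK's base_url at the same address works identically.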

The 512-token context limit rules out long documents, but it is fine for classification, short-form generation, and API prototyping.
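A cheap pre-flight check keeps prompts inside that window. The 4-characters-per-token ratio below is a rough rule-of-thumb heuristic, not Gemma's actual tokenizer, so treat it as an estimate only:

```python
def fits_context(prompt: str, max_output_tokens: int,
                 context_limit: int = 512, chars_per_token: float = 4.0) -> bool:
    """Rough pre-flight check: estimated prompt tokens plus the requested
    output budget must fit inside the model's context window.
    chars_per_token is a heuristic, not the model's real tokenizer."""
    est_prompt_tokens = len(prompt) / chars_per_token
    return est_prompt_tokens + max_output_tokens <= context_limit

# A short classification prompt fits; a multi-page document does not.
```

For production use, swap the heuristic for a real token count from the model's tokenizer.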

Why It Matters for Indie Builders

NVFP4 stores weights in roughly 4 bits, cutting VRAM requirements to about a quarter of FP16 and making a 26B model fit on single-GPU workstations. Running locally means no per-token costs, no data leaving your machine, and no rate limits, which matters for batch processing and for client work with sensitive data. Because the endpoint is OpenAI-compatible, you can swap between local and cloud inference by changing one URL, a cost-control lever that requires no refactoring.
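The one-URL swap can be as small as a single environment variable. In this sketch the variable name USE_LOCAL_LLM is hypothetical (pick your own), and the local port 8000 is an assumed vLLM default:

```python
import os

def inference_base_url() -> str:
    """Pick local vs. cloud inference from one environment variable.
    USE_LOCAL_LLM is a hypothetical switch name; port 8000 is vLLM's default."""
    if os.environ.get("USE_LOCAL_LLM") == "1":
        return "http://localhost:8000/v1"
    return "https://api.openai.com/v1"

# Flip the switch and everything downstream follows the URL:
os.environ["USE_LOCAL_LLM"] = "1"
local_url = inference_base_url()   # local endpoint
del os.environ["USE_LOCAL_LLM"]
cloud_url = inference_base_url()   # cloud endpoint
```

Feed the returned URL into your OpenAI client's base_url and the rest of the codebase never knows which backend it is talking to.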

Action Item This Week

If you have an NVIDIA GPU with 16GB+ VRAM, clone the script, export your HF_TOKEN, and run it. Test latency against your current API provider on a 10-prompt benchmark using a task you actually run in production, then calculate your monthly savings at your real usage volume.
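The savings calculation at the end is simple arithmetic; here it is as a small helper so you can plug in your own numbers. Every figure in the example is a placeholder, not real pricing:

```python
def monthly_savings(tokens_per_month: int,
                    cloud_price_per_1m_tokens: float,
                    local_fixed_cost: float = 0.0) -> float:
    """Estimated monthly savings from moving inference local.
    cloud_price_per_1m_tokens: blended $/1M tokens from your current provider.
    local_fixed_cost: your estimate of electricity plus amortized hardware."""
    cloud_cost = tokens_per_month / 1_000_000 * cloud_price_per_1m_tokens
    return cloud_cost - local_fixed_cost

# Placeholder numbers: 50M tokens/month at $0.50/1M, $10/month in power.
# monthly_savings(50_000_000, 0.50, 10.0) -> 15.0
```

If the result is negative at your real volume, the cloud API is still cheaper and the local box is purely a privacy and rate-limit play.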