What Happened

A Reddit user shared a working bash script that runs Google's Gemma 4 26B (A4B mixture-of-experts variant) locally using vLLM with NVFP4 quantization. The setup pulls the bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4 model from Hugging Face, patches vLLM's gemma4.py model file via Docker, and serves the model as an OpenAI-compatible API endpoint. GPU memory utilization is capped at 88%, max context is 512 tokens, and the server handles one sequence at a time, which suggests the setup targets a single high-end consumer GPU.

Solo Founder Angle

For a one-person company, this setup provides a private, zero-API-cost inference endpoint for a capable 26B MoE model. Concrete workflow:

  • Use the provided bash script as-is — it handles Docker image patching, directory setup, and container launch automatically.
  • Point your existing OpenAI SDK calls to localhost:[PORT] with model name gemma-4-26b-a4b-it-nvfp4; beyond the base URL, no code changes are needed.
  • Pair with tools like LangChain, LlamaIndex, or Open WebUI for a full local AI stack.
  • Store the HF_TOKEN as an environment variable in your shell profile — the script validates it on startup.
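The SDK-repointing step above amounts to sending standard OpenAI-style chat requests to the local server. Here is a minimal sketch of what that request looks like, using only the standard library; the port (8000, vLLM's default) is an assumption, so substitute whatever port the script actually exposes:

```python
import json
import urllib.request

# Assumption: the local server listens on port 8000 (vLLM's default).
# Replace with the port configured in the bash script.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for the local model."""
    payload = {
        "model": "gemma-4-26b-a4b-it-nvfp4",  # served model name from the script
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,  # stay well inside the 512-token context window
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To call the live endpoint once the container is up:
#   with urllib.request.urlopen(build_chat_request("Classify this ticket: ...")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the payload shape is the same one the OpenAI SDK emits, pointing the SDK's base_url at the same address works identically.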

The 512-token context limit rules out long documents, but it is fine for classification, short-form generation, and API prototyping.
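A cheap pre-flight check keeps prompts inside that window. The 4-characters-per-token ratio below is a rough rule-of-thumb heuristic, not Gemma's actual tokenizer, so treat it as an estimate only:

```python
def fits_context(prompt: str, max_output_tokens: int,
                 context_limit: int = 512, chars_per_token: float = 4.0) -> bool:
    """Rough pre-flight check: estimated prompt tokens plus the requested
    output budget must fit inside the model's context window.
    chars_per_token is a heuristic, not the model's real tokenizer."""
    est_prompt_tokens = len(prompt) / chars_per_token
    return est_prompt_tokens + max_output_tokens <= context_limit

# A short classification prompt fits; a multi-page document does not.
```

For production use, swap the heuristic for a real token count from the model's tokenizer.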

Why It Matters for Indie Builders

NVFP4 stores weights in roughly 4 bits, cutting VRAM requirements to about a quarter of FP16 and making a 26B model fit on single-GPU workstations. Running locally means no per-token costs, no data leaving your machine, and no rate limits, which matters for batch processing and for client work with sensitive data. Because the endpoint is OpenAI-compatible, you can swap between local and cloud inference by changing one URL, a cost-control lever that requires no refactoring.
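The one-URL swap can be as small as a single environment variable. In this sketch the variable name USE_LOCAL_LLM is hypothetical (pick your own), and the local port 8000 is an assumed vLLM default:

```python
import os

def inference_base_url() -> str:
    """Pick local vs. cloud inference from one environment variable.
    USE_LOCAL_LLM is a hypothetical switch name; port 8000 is vLLM's default."""
    if os.environ.get("USE_LOCAL_LLM") == "1":
        return "http://localhost:8000/v1"
    return "https://api.openai.com/v1"

# Flip the switch and everything downstream follows the URL:
os.environ["USE_LOCAL_LLM"] = "1"
local_url = inference_base_url()   # local endpoint
del os.environ["USE_LOCAL_LLM"]
cloud_url = inference_base_url()   # cloud endpoint
```

Feed the returned URL into your OpenAI client's base_url and the rest of the codebase never knows which backend it is talking to.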

Action Item This Week

If you have an NVIDIA GPU with 16GB+ VRAM, clone the script, export your HF_TOKEN, and run it. Test latency against your current API provider on a 10-prompt benchmark using a task you actually run in production, then calculate your monthly savings at your real usage volume.
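The savings calculation at the end is simple arithmetic; here it is as a small helper so you can plug in your own numbers. Every figure in the example is a placeholder, not real pricing:

```python
def monthly_savings(tokens_per_month: int,
                    cloud_price_per_1m_tokens: float,
                    local_fixed_cost: float = 0.0) -> float:
    """Estimated monthly savings from moving inference local.
    cloud_price_per_1m_tokens: blended $/1M tokens from your current provider.
    local_fixed_cost: your estimate of electricity plus amortized hardware."""
    cloud_cost = tokens_per_month / 1_000_000 * cloud_price_per_1m_tokens
    return cloud_cost - local_fixed_cost

# Placeholder numbers: 50M tokens/month at $0.50/1M, $10/month in power.
# monthly_savings(50_000_000, 0.50, 10.0) -> 15.0
```

If the result is negative at your real volume, the cloud API is still cheaper and the local box is purely a privacy and rate-limit play.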