What Happened

A developer running an Intel Xeon E5-2680 v4 server (56 cores, ~63GB RAM, GTX 1060 5GB GPU) attempted local inference of Google's newly released Gemma 4 26B model via Ollama, documenting the setup process and performance results on Juejin (掘金) this week. The test was motivated by a desire to eliminate recurring Claude and ChatGPT subscription costs by self-hosting a capable open model.

Google has positioned Gemma 4 as "the most powerful open model at equivalent parameter counts," built on the same research architecture as Gemini 3. The model family spans four sizes: E2B (effective 2B, ~3GB quantized), E4B (effective 4B), 26B MoE (Mixture of Experts, activating ~4B parameters per inference pass), and a 31B dense flagship.

The author used llmfit — a hardware compatibility tool installed via curl -fsSL https://llmfit.axjns.dev/install.sh | sh — to determine which Gemma 4 variant the server could run. The tool recommended either the 26B or 31B variant. The 26B MoE was selected as the best balance of capability and speed.

Why It Matters

This test surfaces a practical constraint that affects anyone evaluating commodity or legacy hardware for local LLM deployment: raw compute and RAM capacity are not the binding constraints. Memory bandwidth is.

The Xeon E5-2680 v4, in its ideal four-channel configuration, delivers a theoretical memory bandwidth of approximately 76.8 GB/s according to the author's analysis. For a 26B model at 4-bit quantization, model weights occupy roughly 16–18GB. Every token generated requires moving those weights through the CPU-memory bus. The result: CPU utilization hit 100% while output speed remained visibly slow — confirmed via ps monitoring during inference.
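
A back-of-envelope calculation makes that ceiling concrete. The sketch below is a simplification that assumes every generated token streams the full quantized weight set from RAM (ignoring caches, KV-cache reads, and MoE routing); under those assumptions the theoretical upper bound lands below 5 tokens per second on this hardware.

    # Rough upper bound on CPU inference speed when memory bandwidth is
    # the bottleneck: tokens/s <= bandwidth / bytes read per token.
    MEM_BANDWIDTH_GBS = 76.8   # E5-2680 v4, four DDR4-2400 channels (theoretical)
    WEIGHTS_GB = 17.0          # ~4-bit quantized 26B weights, per the article

    max_tokens_per_s = MEM_BANDWIDTH_GBS / WEIGHTS_GB
    print(f"theoretical ceiling: {max_tokens_per_s:.1f} tokens/s")  # ~4.5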

The author's framing is precise: "It's like an eight-lane highway where only one toll booth is open." More quantization reduces the memory footprint but degrades model quality, undermining the reason for running a large model in the first place.

For engineering teams evaluating on-premise or edge deployments of models in the 20B–30B range, this test is a concrete data point: a 2016-era workstation-class CPU with ample RAM still cannot deliver acceptable inference throughput for models at this scale without a modern GPU providing high-bandwidth VRAM access.

The Technical Detail

Gemma 4 Model Variants

  • E2B: Targets smartphones and IoT; ~3GB quantized footprint
  • E4B: Mobile/edge devices with stronger reasoning; offline-capable
  • 26B MoE: 26B total parameters, ~4B activated per forward pass; designed for consumer GPUs with low latency
  • 31B Dense: Flagship variant; benchmarks cited by Google show performance exceeding models "dozens of times larger" on math and coding tasks (a footprint sketch for the two larger variants follows this list)
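
Why the 26B MoE was the speed pick, and why neither large variant fits in a few gigabytes of VRAM, both fall out of a rough footprint estimate. The sketch below is a lower bound (parameters times bits per weight); it ignores mixed-precision layers and quantization metadata, which is why the real 26B download runs ~17GB rather than 13GB. The per-token column assumes only activated expert weights are read, a simplification of how MoE runtimes actually behave.

    # Lower-bound footprint at 4-bit: params * bits / 8. Real files run
    # larger (mixed-precision embeddings, quantization metadata): the
    # 26B download is ~17GB, not 13GB.
    def footprint_gb(params_billion: float, bits: int = 4) -> float:
        return params_billion * 1e9 * bits / 8 / 1e9

    # (name, total params in billions, params touched per token in billions)
    for name, total, active in [("26B MoE", 26.0, 4.0), ("31B dense", 31.0, 31.0)]:
        print(f"{name:>9}: ~{footprint_gb(total):.1f}GB weights, "
              f"~{footprint_gb(active):.1f}GB read per token")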

Installation Path

Model download via Ollama: ollama run gemma4:26b. Download size approximately 17GB. The inference backend was Vulkan (CPU-side), as the GTX 1060's VRAM (reported as 4.00GB available) was insufficient to hold the model weights.
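
For scripting against the local model rather than using the interactive CLI, Ollama also exposes an HTTP API on port 11434. A minimal sketch, assuming the gemma4:26b tag from the article has already been pulled:

    import requests

    # Ollama serves a local HTTP API on port 11434 by default.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma4:26b",   # tag as reported in the article
            "prompt": "Explain memory-bandwidth-bound inference in one paragraph.",
            "stream": False,         # return one JSON object instead of a stream
        },
        timeout=600,                 # CPU inference at this scale is slow
    )
    print(resp.json()["response"])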

Bottleneck Analysis

The GTX 1060 5GB card — with only 4GB VRAM available — could not offload the 26B model, forcing full CPU inference. The E5-2680 v4's theoretical 76.8 GB/s bandwidth becomes the hard ceiling on tokens-per-second output. No specific tokens-per-second figure was reported; the author described the output as visibly slow in an unaccelerated video demonstration.
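
Since no throughput figure was reported, anyone reproducing the test can derive one from the same API: Ollama's non-streamed responses include eval_count (tokens generated) and eval_duration (nanoseconds spent generating). A sketch:

    import requests

    # Ollama reports eval_count (tokens generated) and eval_duration
    # (nanoseconds spent generating) with each non-streamed response.
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma4:26b", "prompt": "Count to ten.", "stream": False},
        timeout=600,
    ).json()

    tokens_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"measured throughput: {tokens_per_s:.2f} tokens/s")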

What To Watch

  • GPU follow-up: The author plans to install an Nvidia RTX 3090 (24GB VRAM) in the same server chassis. A 3090 provides roughly 936 GB/s memory bandwidth — over 12x the Xeon's theoretical ceiling — which should allow the 26B MoE model to fit entirely in VRAM and deliver substantially faster inference. Results expected in a follow-up post.
  • Gemma 4 adoption benchmarks: Google's "strongest open model at equivalent parameter counts" claim for Gemma 4 will face independent verification from the open-source benchmarking community (LMSYS, EleutherAI's evaluation harness) in the coming weeks. Watch for MMLU, HumanEval, and MATH scores from third parties.
  • Ollama compatibility updates: As Gemma 4's MoE architecture is relatively new to local runtimes, watch for Ollama version updates that may improve layer-splitting or partial GPU offload for mixed CPU/GPU configurations — which could partially rehabilitate setups like this one (see the offload sketch after this list).
  • Competitive pressure on subscriptions: If Gemma 4 26B MoE delivers Claude-comparable output on a single RTX 3090, the economics of self-hosting versus $20/month SaaS subscriptions shift meaningfully for individual developers and small teams.
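
On the mixed CPU/GPU point, Ollama already exposes a num_gpu option (the number of layers to place in VRAM, with the rest running on CPU) through the same API. A sketch of partial offload on a small-VRAM card like the 1060; the layer count here is an illustrative guess, not a tested value:

    import requests

    # "num_gpu" asks Ollama to keep this many layers in VRAM and run the
    # rest on the CPU. 8 is an illustrative guess for a ~4GB card; too
    # high a setting fails to load or falls back to CPU.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma4:26b",
            "prompt": "Hello.",
            "stream": False,
            "options": {"num_gpu": 8},
        },
        timeout=600,
    )
    print(resp.json()["response"])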