What Happened

A developer running an Intel Xeon E5-2680 v4 server (56 cores, ~63GB RAM, GTX 1060 5GB GPU) attempted local inference of Google's newly released Gemma 4 26B model via Ollama, documenting the setup process and performance results on Juejin (掘金) this week. The test was motivated by a desire to eliminate recurring Claude and ChatGPT subscription costs by self-hosting a capable open model.

Google has positioned Gemma 4 as "the most powerful open model at equivalent parameter counts," built on the same research architecture as Gemini 3. The model family spans four sizes: E2B (effective 2B, ~3GB quantized), E4B (effective 4B), 26B MoE (Mixture of Experts, activating ~4B parameters per inference pass), and a 31B dense flagship.

The author used llmfit — a hardware compatibility tool installed via curl -fsSL https://llmfit.axjns.dev/install.sh | sh — to determine which Gemma 4 variant the server could run. The tool recommended either the 26B or 31B variant. The 26B MoE was selected as the best balance of capability and speed.

Why It Matters

This test surfaces a practical constraint that affects anyone evaluating commodity or legacy hardware for local LLM deployment: raw compute and RAM capacity are not the binding constraints. Memory bandwidth is.

The Xeon E5-2680 v4, in its ideal four-channel configuration, delivers a theoretical memory bandwidth of approximately 76.8 GB/s according to the author's analysis. For a 26B model at 4-bit quantization, model weights occupy roughly 16–18GB. Every token generated requires moving those weights through the CPU-memory bus. The result: CPU utilization hit 100% while output speed remained visibly slow — confirmed via ps monitoring during inference.
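
A back-of-envelope calculation makes that ceiling concrete. The sketch below is a simplification that assumes every generated token streams the full quantized weight set from RAM (ignoring caches, KV-cache reads, and MoE routing); under those assumptions the theoretical upper bound lands below 5 tokens per second on this hardware.

    # Rough upper bound on CPU inference speed when memory bandwidth is
    # the bottleneck: tokens/s <= bandwidth / bytes read per token.
    MEM_BANDWIDTH_GBS = 76.8   # E5-2680 v4, four DDR4-2400 channels (theoretical)
    WEIGHTS_GB = 17.0          # ~4-bit quantized 26B weights, per the article

    max_tokens_per_s = MEM_BANDWIDTH_GBS / WEIGHTS_GB
    print(f"theoretical ceiling: {max_tokens_per_s:.1f} tokens/s")  # ~4.5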

The author's framing is precise: "It's like an eight-lane highway where only one toll booth is open." More quantization reduces the memory footprint but degrades model quality, undermining the reason for running a large model in the first place.

For engineering teams evaluating on-premise or edge deployments of models in the 20B–30B range, this test is a concrete data point: a 2016-era workstation-class CPU with ample RAM still cannot deliver acceptable inference throughput for models at this scale without a modern GPU providing high-bandwidth VRAM access.

The Technical Detail

Gemma 4 Model Variants

  • E2B: Targets smartphones and IoT; ~3GB quantized footprint
  • E4B: Mobile/edge devices with stronger reasoning; offline-capable
  • 26B MoE: 26B total parameters, ~4B activated per forward pass; designed for consumer GPUs with low latency
  • 31B Dense: Flagship variant; benchmarks cited by Google show performance exceeding models "dozens of times larger" on math and coding tasks (a footprint sketch for the two larger variants follows this list)
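
Why the 26B MoE was the speed pick, and why neither large variant fits in a few gigabytes of VRAM, both fall out of a rough footprint estimate. The sketch below is a lower bound (parameters times bits per weight); it ignores mixed-precision layers and quantization metadata, which is why the real 26B download runs ~17GB rather than 13GB. The per-token column assumes only activated expert weights are read, a simplification of how MoE runtimes actually behave.

    # Lower-bound footprint at 4-bit: params * bits / 8. Real files run
    # larger (mixed-precision embeddings, quantization metadata): the
    # 26B download is ~17GB, not 13GB.
    def footprint_gb(params_billion: float, bits: int = 4) -> float:
        return params_billion * 1e9 * bits / 8 / 1e9

    # (name, total params in billions, params touched per token in billions)
    for name, total, active in [("26B MoE", 26.0, 4.0), ("31B dense", 31.0, 31.0)]:
        print(f"{name:>9}: ~{footprint_gb(total):.1f}GB weights, "
              f"~{footprint_gb(active):.1f}GB read per token")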

Installation Path

Model download via Ollama: ollama run gemma4:26b. Download size approximately 17GB. The inference backend was Vulkan (CPU-side), as the GTX 1060's VRAM (reported as 4.00GB available) was insufficient to hold the model weights.
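
For scripting against the local model rather than using the interactive CLI, Ollama also exposes an HTTP API on port 11434. A minimal sketch, assuming the gemma4:26b tag from the article has already been pulled:

    import requests

    # Ollama serves a local HTTP API on port 11434 by default.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma4:26b",   # tag as reported in the article
            "prompt": "Explain memory-bandwidth-bound inference in one paragraph.",
            "stream": False,         # return one JSON object instead of a stream
        },
        timeout=600,                 # CPU inference at this scale is slow
    )
    print(resp.json()["response"])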

Bottleneck Analysis

The GTX 1060 5GB card — with only 4GB VRAM available — could not offload the 26B model, forcing full CPU inference. The E5-2680 v4's theoretical 76.8 GB/s bandwidth becomes the hard ceiling on tokens-per-second output. No specific tokens-per-second figure was reported; the author described the output as visibly slow in an unaccelerated video demonstration.
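
Since no throughput figure was reported, anyone reproducing the test can derive one from the same API: Ollama's non-streamed responses include eval_count (tokens generated) and eval_duration (nanoseconds spent generating). A sketch:

    import requests

    # Ollama reports eval_count (tokens generated) and eval_duration
    # (nanoseconds spent generating) with each non-streamed response.
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma4:26b", "prompt": "Count to ten.", "stream": False},
        timeout=600,
    ).json()

    tokens_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"measured throughput: {tokens_per_s:.2f} tokens/s")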

What To Watch

  • GPU follow-up: The author plans to install an Nvidia RTX 3090 (24GB VRAM) in the same server chassis. A 3090 provides roughly 936 GB/s memory bandwidth — over 12x the Xeon's theoretical ceiling — which should allow the 26B MoE model to fit entirely in VRAM and deliver substantially faster inference. Results expected in a follow-up post.
  • Gemma 4 adoption benchmarks: Google's "strongest open model at equivalent parameter counts" claim for Gemma 4 will face independent verification from the open-source benchmarking community (LMSYS, EleutherAI's evaluation harness) in the coming weeks. Watch for MMLU, HumanEval, and MATH scores from third parties.
  • Ollama compatibility updates: As Gemma 4's MoE architecture is relatively new to local runtimes, watch for Ollama version updates that may improve layer-splitting or partial GPU offload for mixed CPU/GPU configurations — which could partially rehabilitate setups like this one (see the offload sketch after this list).
  • Competitive pressure on subscriptions: If Gemma 4 26B MoE delivers Claude-comparable output on a single RTX 3090, the economics of self-hosting versus $20/month SaaS subscriptions shift meaningfully for individual developers and small teams.
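
On the mixed CPU/GPU point, Ollama already exposes a num_gpu option (the number of layers to place in VRAM, with the rest running on CPU) through the same API. A sketch of partial offload on a small-VRAM card like the 1060; the layer count here is an illustrative guess, not a tested value:

    import requests

    # "num_gpu" asks Ollama to keep this many layers in VRAM and run the
    # rest on the CPU. 8 is an illustrative guess for a ~4GB card; too
    # high a setting fails to load or falls back to CPU.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma4:26b",
            "prompt": "Hello.",
            "stream": False,
            "options": {"num_gpu": 8},
        },
        timeout=600,
    )
    print(resp.json()["response"])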