What Happened

Developer Simon Willison published a working uv run recipe on April 12, 2026 for running local audio transcription on macOS using Google's Gemma 4 E2B model via mlx-vlm. The tip originated from Rahim Nathwani. The model weighs in at 10.28 GB and runs entirely on-device via Apple's MLX framework.

The command to invoke it is a single one-liner:

uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio mlx_vlm.generate --model google/gemma-4-e2b-it --audio file.wav --prompt "Transcribe this audio" --max-tokens 500 --temperature 1.0

Willison tested the recipe against a 14-second WAV file. The model successfully transcribed the clip with minor errors — mishearing "This right here" as "This front here" and "how well that works" as "how that works" — indicating functional but imperfect accuracy on short-form voice input, according to Willison's post.

Why It Matters

This is a practical demonstration that Gemma 4's multimodal capabilities extend to audio, and that the pipeline is accessible enough to run with zero infrastructure setup beyond uv and a Python 3.13 environment. For engineering teams evaluating on-device or air-gapped transcription workflows, this lowers the barrier significantly.

  • No API dependency: The entire pipeline runs locally, meaning no data leaves the machine — relevant for teams handling sensitive audio.
  • MLX integration: Apple's MLX framework continues to expand its model compatibility surface. Gemma 4 joining the supported roster means more teams on Apple Silicon can run multimodal workloads without cloud egress costs.
  • uv as glue: The use of uv run with inline dependency flags (--with) means no virtual environment setup and no requirements.txt — the entire dependency chain is resolved at runtime. This pattern is increasingly common in the Python ML tooling space; a sketch of the same idea using inline script metadata follows this list.
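
A related way to express the same runtime-resolved dependencies is PEP 723 inline script metadata, which uv run also understands. The snippet below is a minimal sketch of that pattern; the file name and the version check are illustrative and not part of Willison's recipe:

# check_env.py — hypothetical file name, not from the original post
# /// script
# requires-python = ">=3.13"
# dependencies = ["mlx-vlm", "torchvision", "gradio"]
# ///
# Run with: uv run check_env.py
# uv reads the metadata block above and resolves the dependencies at
# runtime, so no requirements.txt or pre-built virtualenv is needed.
from importlib.metadata import version

for pkg in ("mlx-vlm", "torchvision", "gradio"):
    print(pkg, version(pkg))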

The Technical Detail

The model being used is google/gemma-4-e2b-it — the instruction-tuned variant of Gemma 4's E2B (presumably edge/embedded 2B parameter class) at 10.28 GB on disk. It is invoked through mlx_vlm.generate, the same CLI entry point used for image-based multimodal tasks in the mlx-vlm package, suggesting the library has unified its audio and vision input handling under a single interface.

Key parameters from Willison's invocation:

  • --max-tokens 500 — sufficient headroom for short-to-medium audio clips
  • --temperature 1.0 — default sampling temperature, not tuned for transcription accuracy; a lower-temperature variant is sketched after this list
  • Input format: .wav — no mention of other container support in the source
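
Since 1.0 is a general sampling default rather than a transcription-oriented setting, an obvious first experiment is to re-run the exact command above with the temperature lowered. The wrapper below is a rough sketch that shells out to the documented one-liner with only that flag changed; the audio path and the choice of 0.0 are assumptions, not details from the post:

# Hypothetical sketch: repeat the documented recipe with a lower sampling
# temperature to see whether transcription output becomes more consistent.
import subprocess

cmd = [
    "uv", "run", "--python", "3.13",
    "--with", "mlx_vlm", "--with", "torchvision", "--with", "gradio",
    "mlx_vlm.generate",
    "--model", "google/gemma-4-e2b-it",
    "--audio", "file.wav",              # replace with your own clip
    "--prompt", "Transcribe this audio",
    "--max-tokens", "500",
    "--temperature", "0.0",             # 1.0 in the original recipe
]
subprocess.run(cmd, check=True)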

Transcription quality on the 14-second test clip was near-accurate with two word-level errors. Willison noted the errors were phonetically plausible, suggesting the model is performing genuine speech decoding rather than hallucinating. No latency or throughput figures were reported in the source.

Dependencies

  • mlx_vlm
  • torchvision
  • gradio
  • Python 3.13

The gradio dependency is listed but not explained in the source — it may be a transitive requirement of mlx_vlm rather than a direct UI dependency for this use case.
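
A quick way to check is to query the installed package's declared requirements; if gradio appears there, the explicit --with gradio flag is redundant rather than load-bearing. This is a generic metadata lookup, not something documented in the source:

# Inspect mlx-vlm's declared dependencies to see whether gradio is
# already pulled in transitively. Run it in the same uv-managed
# environment, e.g.: uv run --python 3.13 --with mlx_vlm python deps.py
from importlib.metadata import requires

for req in requires("mlx-vlm") or []:
    print(req)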

What To Watch

  • Accuracy benchmarks: No Word Error Rate (WER) figures exist yet for Gemma 4 E2B on standard speech datasets like LibriSpeech or CommonVoice. Expect community benchmarks to surface within weeks as more developers replicate this setup; a minimal WER sketch appears after this list.
  • mlx-vlm audio support expansion: The library's unified CLI for audio and vision suggests the maintainers are building toward a general multimodal interface. Watch the mlx-vlm GitHub for additional audio format support and streaming output.
  • Competitive positioning: This puts Gemma 4 in direct comparison with Whisper (OpenAI) for local transcription on Apple Silicon. Whisper already has mature MLX support via mlx-whisper. Developer adoption will depend on whether Gemma 4's multimodal flexibility outweighs Whisper's established accuracy track record.
  • Google Gemma 4 rollout: The existence of an audio-capable edge model suggests Google is actively expanding Gemma 4's deployment surface beyond text. Additional modality announcements or model variants are plausible in the near term.
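
For context on that accuracy question, WER is just a word-level edit distance normalized by the length of the reference transcript. The dependency-free sketch below shows the metric community benchmarks would report; the example calls reuse the two misheard phrases quoted from Willison's test purely as illustration, not as a benchmark result:

# Minimal word error rate (WER): word-level edit distance divided by the
# number of reference words, via the standard dynamic-programming recurrence.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Illustrative only, using the misheard phrases mentioned in the post:
print(wer("this right here", "this front here"))     # 1 substitution over 3 words ~ 0.33
print(wer("how well that works", "how that works"))   # 1 deletion over 4 words = 0.25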