What Happened

A Chinese developer published a complete infrastructure walkthrough on Juejin detailing how to run Google's Gemma 4 model locally on Apple Silicon Macs and expose it as a public HTTPS API — using Ollama for model serving, OrbStack for containerized middleware, frp for reverse tunneling, and Nginx as the TLS termination layer on a public-facing VPS.

The architecture creates a five-hop request chain: remote client → Nginx (port 443) on a public server → frps tunnel relay → frpc container on the local Mac → a Node.js chat-api container → Ollama process at localhost:11434 → Gemma 4 inference. Responses return along the same path.
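From a client's perspective the whole chain collapses into a single HTTPS call. A minimal sketch, assuming the middleware's POST /chat endpoint described below (the domain and request body shape are illustrative; Node 18+, run as an ES module for top-level await):

// Hypothetical remote call; api.example.com stands in for the VPS domain.
// The request traverses all five hops before reaching Gemma 4 and back.
const res = await fetch("https://api.example.com/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ message: "Hello from outside the LAN", stream: false }),
});
console.log(await res.json());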

The model download is approximately 5GB according to the article. Minimum hardware requirements are listed as Apple Silicon (M1 through M4) with 8GB RAM, 16GB recommended, and 20GB+ free disk space.

Why It Matters

This pattern is gaining traction among developers who want to avoid per-token API costs for personal or small-team workloads while keeping inference on hardware they already own. The combination of Ollama's local model management, OrbStack's low-overhead Docker runtime on macOS, and frp's lightweight tunnel eliminates the need to rent GPU cloud instances for moderate-throughput use cases.

For engineering teams evaluating AI infrastructure costs, this represents a legitimate architecture decision point: local Apple Silicon inference versus managed API endpoints. The M-series GPU and unified memory provide meaningful throughput for models in the 4B–12B parameter range without per-query billing (Ollama runs inference on the GPU via Metal, not the Neural Engine).

The approach also has data residency implications. The model and inference stay on hardware the developer controls rather than a third-party inference API — relevant for teams handling sensitive data who cannot send it to external model providers under current data governance policies. One caveat: TLS terminates at Nginx, so the public VPS sees plaintext traffic and must sit inside the same trust boundary.

OrbStack's positioning as a Docker Desktop replacement on macOS is notable here. By using host.docker.internal as the bridge between the container network and the host Ollama process, the architecture avoids network namespace complexity that trips up many Docker-on-Mac setups.
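A docker-compose sketch of that bridge (the service name, build context, and port mapping are assumptions for illustration, not the article's exact file):

services:
  chat-api:
    build: ./chat-api                # hypothetical build context
    environment:
      # Resolves to the Mac host from inside the OrbStack container
      OLLAMA_URL: http://host.docker.internal:11434
    ports:
      - "3000:3000"                  # hypothetical middleware port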

The Technical Detail

The Node.js middleware wraps Ollama's native API in a conversation-management layer, using in-memory Map storage and crypto.randomUUID() for session IDs. The API surface exposed includes (a handler sketch follows the list):

  • GET /health — liveness probe
  • GET /models — enumerate available Ollama models
  • POST /chat — stateful conversation with streaming and non-streaming modes, multi-turn context
  • GET /conversations — list active sessions
  • GET /conversations/:id — full message history retrieval
  • DELETE /conversations/:id — single session teardown
  • DELETE /conversations — flush all sessions
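A minimal sketch of how that surface can be implemented with Express; the article's actual code will differ, and the request field names (message, conversationId) and port 3000 are illustrative:

const express = require("express");
const { randomUUID } = require("node:crypto");

const OLLAMA_URL = process.env.OLLAMA_URL || "http://host.docker.internal:11434";
const app = express();
app.use(express.json());

// sessionId -> array of { role, content } messages; in-memory, lost on restart
const conversations = new Map();

app.get("/health", (req, res) => res.json({ ok: true }));

app.post("/chat", async (req, res) => {
  const { message, conversationId } = req.body;
  const id = conversationId || randomUUID();
  const history = conversations.get(id) || [];
  history.push({ role: "user", content: message });

  // Non-streaming call; the accumulated history supplies multi-turn context
  const upstream = await fetch(`${OLLAMA_URL}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "gemma4", messages: history, stream: false }),
  });
  const data = await upstream.json();

  history.push(data.message); // Ollama replies with { message: { role, content }, ... }
  conversations.set(id, history);
  res.json({ conversationId: id, reply: data.message.content });
});

app.delete("/conversations/:id", (req, res) => {
  conversations.delete(req.params.id);
  res.status(204).end();
});

app.listen(3000);

The in-memory Map keeps the implementation simple, but all conversation state evaporates on container restart, which also rules out running more than one replica behind the tunnel.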

The Ollama API call uses the standard /api/chat endpoint with a stream boolean parameter forwarded from the client request. The OLLAMA_URL defaults to http://host.docker.internal:11434, which resolves to the Mac host from within the OrbStack container — a critical detail for Docker networking on macOS.
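For reference, a direct call against that endpoint with stream: true yields newline-delimited JSON fragments. A minimal consumer sketch (Node 18+, run as an ES module so top-level await works):

// Streamed chat straight against Ollama; each NDJSON line carries a
// message.content fragment until a final object arrives with done: true.
// Use host.docker.internal instead of localhost when calling from a container.
const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma4",
    messages: [{ role: "user", content: "Why is the sky blue?" }],
    stream: true,
  }),
});

const decoder = new TextDecoder();
let buffered = "";
for await (const chunk of res.body) {
  buffered += decoder.decode(chunk, { stream: true });
  let newline;
  while ((newline = buffered.indexOf("\n")) >= 0) {
    const line = buffered.slice(0, newline).trim();
    buffered = buffered.slice(newline + 1);
    if (!line) continue;
    const part = JSON.parse(line);
    if (!part.done) process.stdout.write(part.message.content);
  }
}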

The frp setup runs frpc as a container within OrbStack, tunneling to an frps instance on the public VPS at port 7000, with the exposed service forwarded on port 6100. Nginx on the public server handles TLS termination and proxies to the frp-forwarded port.
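A configuration sketch consistent with the ports the article cites; the domain, container name, local port, and certificate paths are placeholders, and recent frp releases configure frpc via TOML rather than the older INI format:

serverAddr = "vps.example.com"    # frpc.toml: the public VPS running frps
serverPort = 7000

[[proxies]]
name = "chat-api"
type = "tcp"
localIP = "chat-api"              # container name on the OrbStack network
localPort = 3000                  # hypothetical middleware port
remotePort = 6100                 # port frps exposes on the VPS

The matching Nginx server block on the VPS then terminates TLS and hands requests to that forwarded port:

server {
    listen 443 ssl;
    server_name api.example.com;
    ssl_certificate     /etc/letsencrypt/live/api.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:6100;  # the frp-forwarded port
        proxy_set_header Host $host;
        proxy_buffering off;  # let streamed chat tokens flush immediately
    }
}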

Key Ollama invocation for model pull and serve:

brew install ollama
ollama serve
ollama run gemma4

The ollama run command handles the initial download on first execution. Subsequent calls skip the download and launch the model directly.
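To confirm the model landed, Ollama's /api/tags endpoint (presumably what the middleware's GET /models wrapper queries) lists everything installed locally. A quick check from Node:

// Lists locally installed models; gemma4 should appear after the first run
const res = await fetch("http://localhost:11434/api/tags");
const { models } = await res.json();
console.log(models.map((m) => m.name));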

What To Watch

In the next 30 days, watch for:

  • Ollama release cadence: Ollama has been shipping updates roughly every 2–3 weeks. Any API-breaking change to the /api/chat response format would require updates to middleware wrappers built on this pattern.
  • Gemma 4 variant availability: Google has released multiple Gemma 4 parameter configurations. Monitor ollama.com/library/gemma4 for additional quantization options (Q4, Q8) that affect the 5GB download figure cited here.
  • OrbStack licensing changes: OrbStack remains free for personal use but has commercial licensing terms. Teams scaling this pattern across multiple developer machines should verify current pricing before standardizing on it.
  • frp alternatives: Cloudflare Tunnel and ngrok have been positioning against self-hosted frp deployments. If Cloudflare tightens its free tier restrictions on AI API traffic, expect migration guides to frp to proliferate further.
  • Apple Silicon memory bandwidth improvements: Apple's M4 Pro and M4 Max ship with higher memory bandwidth than their M3 equivalents. Performance comparisons for Gemma 4 inference across M-series generations remain sparse — community benchmarks should emerge in this window.