What Happened
Cloudflare published a technical deep dive this week detailing the infrastructure architecture behind its Workers AI platform's support for extra-large language models, including Moonshot's Kimi K2.5. According to the Cloudflare engineering blog, the team has achieved a 3x performance improvement on Kimi K2.5 since the model's initial launch on the platform, with further model additions described as "in-flight." The post outlines two core engineering decisions driving those gains: hardware configuration tuning for agentic workloads and prefill-decode (PD) disaggregation.
The timing is deliberate — Cloudflare's engineering team noted that Workers AI large model hosting was announced "a few weeks ago" and that these models have served as the backbone for agentic products, harnesses, and tools launched during the same week as this post.
Why It Matters
Cloudflare's entry into large open-source model hosting puts it in direct competition with dedicated inference providers like Together AI, Fireworks AI, and Replicate, as well as hyperscaler inference endpoints from AWS, Google, and Azure. The differentiation Cloudflare is betting on: network-edge proximity and, as this post signals, deep hardware-software co-optimization rather than raw GPU scale.
For CTOs evaluating inference infrastructure, the architectural choices here have direct cost and latency implications. Agentic workloads, the stated primary use case for Workers AI's large model tier, are structurally different from single-turn completions. Each agent turn re-submits the full context window: system prompt, tool definitions, MCP configurations, and all prior turns. This means input token volume grows with every step, making prefill throughput the dominant performance variable, not decode speed. Cloudflare has explicitly tuned for this pattern.
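A rough sketch of that arithmetic (the token counts below are illustrative assumptions, not Cloudflare figures): because every turn reprocesses the full history, total prefill work grows roughly quadratically with the number of agent steps, while decode work grows only linearly.

```python
# Illustrative sketch: why agentic loops become prefill-bound.
# BASE_CONTEXT and TOKENS_PER_TURN are assumed values for demonstration only.

BASE_CONTEXT = 4_000      # assumed: system prompt + tool/MCP definitions, in tokens
TOKENS_PER_TURN = 600     # assumed: model output + tool result appended each turn

def token_volumes(num_turns: int) -> tuple[int, int]:
    """Return (total prefill tokens, total decode tokens) over an agent run."""
    prefill_total = 0
    decode_total = 0
    context = BASE_CONTEXT
    for _ in range(num_turns):
        prefill_total += context          # the full history is reprocessed every turn
        decode_total += TOKENS_PER_TURN   # only the new output is generated
        context += TOKENS_PER_TURN        # history grows for the next turn
    return prefill_total, decode_total

prefill, decode = token_volumes(20)
print(f"prefill tokens: {prefill:,}, decode tokens: {decode:,}")
# With these assumptions, a 20-turn agent run processes roughly 16x more
# input tokens than it generates, which is why prefill throughput dominates.
```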
The broader market signal: inference infrastructure is bifurcating. General-purpose endpoints optimized for short completions are no longer sufficient for production agentic systems. Providers who fail to disaggregate compute for prefill-heavy workloads will face mounting latency penalties as agent context windows grow, a structural disadvantage that compounds with model size.
The Technical Detail
Prefill-Decode Disaggregation
Cloudflare's implementation separates the two stages of LLM inference onto distinct inference servers:
- Prefill servers handle input token processing and KV cache population. This stage is compute-bound, meaning it saturates GPU FLOPS rather than memory bandwidth.
- Decode servers handle autoregressive token generation. This stage is memory-bound, dominated by KV cache read bandwidth rather than raw compute.
The problem with co-locating both stages on a single machine, as Cloudflare's engineers describe it: prefill and decode require different GPU subsystems, and because prefill always precedes decode sequentially, the stages block each other. A GPU optimized for high memory bandwidth (decode) is underutilized during prefill, and vice versa. Disaggregation allows each server class to be provisioned and scaled independently to match its bottleneck resource.
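As a rough illustration of the split (an assumed structure for clarity, not Cloudflare's actual implementation), the two stages can be modeled as separate server classes with a KV cache handed across the boundary:

```python
# Minimal sketch of prefill/decode disaggregation. Each pool is provisioned
# against its own bottleneck: prefill against FLOPS, decode against memory bandwidth.

from dataclasses import dataclass

@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # cache entries populated during prefill

class PrefillServer:
    """Compute-bound stage: process the full prompt and populate the KV cache."""
    def prefill(self, request_id: str, prompt_tokens: list[int]) -> KVCache:
        # ... run the model over all prompt tokens in parallel (FLOPS-bound) ...
        return KVCache(request_id=request_id, num_tokens=len(prompt_tokens))

class DecodeServer:
    """Memory-bound stage: generate tokens one at a time against the KV cache."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        output: list[int] = []
        for _ in range(max_new_tokens):
            # ... each step re-reads the whole KV cache (bandwidth-bound) ...
            output.append(0)  # placeholder token
        return output

def serve(request_id: str, prompt_tokens: list[int]) -> list[int]:
    cache = PrefillServer().prefill(request_id, prompt_tokens)
    # In a real deployment the cache crosses a server boundary here; the
    # transfer mechanism is the open question noted in the KV cache section below.
    return DecodeServer().decode(cache, max_new_tokens=256)
```

Because the two classes scale independently, a prefill-heavy fleet can add FLOPS-dense nodes without overprovisioning decode capacity, and vice versa.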
This is not a novel concept. Academic work on PD disaggregation (including the Splitwise and DistServe papers) has circulated in the inference research community, and providers like DeepMind and several startups have explored similar splits. What is notable is Cloudflare operationalizing this at the network edge rather than in a centralized data center, where inter-server KV cache transfer latency becomes a more acute engineering constraint.
Hardware Configuration Strategy
According to the post, Cloudflare runs multiple hardware configurations tuned to different input/output token ratios. The engineering team identifies two opposing workload archetypes:
- Generation-heavy workloads (e.g., long-form content creation): low input token count, high output token count; decode-bound.
- Summarization or agentic workloads: high input token count (full context re-submission each turn), low-to-moderate output; prefill-bound.
For Workers AI's target use case — agentic pipelines — the team explicitly prioritized fast input token processing and fast tool-call handling over raw generation throughput. The post does not disclose specific tokens-per-second figures beyond the 3x improvement claim for Kimi K2.5.
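A hedged sketch of how ratio-based routing could look (the thresholds and configuration names below are assumptions for illustration; the post does not disclose Cloudflare's actual routing logic):

```python
# Illustrative routing rule: pick a hardware configuration from the request's
# expected input/output token ratio. Thresholds and config names are assumed.

def choose_config(input_tokens: int, expected_output_tokens: int) -> str:
    ratio = input_tokens / max(expected_output_tokens, 1)
    if ratio >= 8:
        # Summarization / agentic pattern: prefill-bound, favor FLOPS-heavy nodes.
        return "prefill-optimized"
    if ratio <= 1:
        # Long-form generation: decode-bound, favor memory-bandwidth-heavy nodes.
        return "decode-optimized"
    return "balanced"

# An agent turn resubmitting a 60k-token context to produce a short tool call:
print(choose_config(60_000, 400))   # -> "prefill-optimized"
# A short prompt asking for a long article:
print(choose_config(200, 2_000))    # -> "decode-optimized"
```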
KV Cache Considerations
The post references KV cache population as part of the prefill stage. In disaggregated architectures, the populated KV cache must be transferred from the prefill server to the decode server before generation begins. This transfer overhead is a known challenge in PD disaggregation implementations. The post does not detail Cloudflare's specific mechanism for handling it, whether via RDMA, NVLink fabric extensions, or network transport, leaving that as an open technical question.
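A back-of-envelope estimate shows why that transfer is non-trivial. The model dimensions below are hypothetical (the post does not disclose Kimi K2.5's attention configuration), but the standard per-token KV footprint formula gives a sense of scale:

```python
# Back-of-envelope sketch of the KV cache hand-off cost. All model dimensions
# are assumed for illustration, not Kimi K2.5's actual architecture.

NUM_LAYERS = 60        # assumed
NUM_KV_HEADS = 8       # assumed (grouped-query attention)
HEAD_DIM = 128         # assumed
BYTES_PER_ELEMENT = 2  # fp16/bf16

def kv_cache_bytes(num_tokens: int) -> int:
    # 2x for keys and values, per layer, per KV head, per head dimension, per token.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT * num_tokens

context_tokens = 100_000  # a large agentic context
gib = kv_cache_bytes(context_tokens) / 2**30
print(f"~{gib:.1f} GiB of KV cache to move from prefill to decode")
# At these assumed dimensions, a 100k-token context is roughly 23 GiB of state
# that must cross the prefill/decode boundary before the first output token.
```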
What To Watch
- Additional model launches on Workers AI: Cloudflare explicitly states more models are "in-flight." Expect announcements within the next 30 days, likely targeting the 70B–400B parameter range given the "extra-large" framing of this infrastructure post.
- Benchmark disclosures: The 3x speed claim for Kimi K2.5 lacks a public baseline. Watch for Cloudflare releasing tokens-per-second or time-to-first-token numbers against competing inference providers — this would be a competitive necessity as enterprise procurement cycles begin.
- Competitive responses from Fireworks AI and Together AI: Both providers have invested heavily in custom inference kernels and hardware configurations for large models. A Cloudflare entry with edge-network advantages will pressure both on latency SLAs for geographically distributed agentic workloads.
- KV cache transfer architecture disclosure: The current post leaves the inter-server KV transfer mechanism unspecified. A follow-up technical post or conference presentation detailing this would be a strong signal about the maturity of Cloudflare's disaggregated inference stack.