A Reddit user this week showed off their tinygrad setup: an NVIDIA Blackwell-architecture GPU paired with an Apple M3 Ultra, clustered via RDMA (Remote Direct Memory Access, a high-speed networking technology that lets machines access each other's memory while bypassing the CPU), bringing total available memory to nearly 2TB. They are preparing to run MoE (Mixture of Experts, a large-model architecture that activates only a subset of its parameters per token) benchmarks and are soliciting benchmark requests from the community.

What this is

Tinygrad is an open-source deep learning framework maintained by George Hotz (the hacker known as geohot and founder of comma.ai), positioned as "lighter than PyTorch and easier to hack at the lower level." It has always been niche but has a loyal following among local-inference and hardware enthusiasts. The highlight of this experiment is the hardware combination: Blackwell is NVIDIA's latest architecture and currently in extremely limited supply; the M3 Ultra is Apple's most powerful desktop chip; and connecting the two via RDMA is a decidedly non-standard configuration. The test target is MoE models, which have huge parameter counts but activate only a few experts per token during inference. That pattern places extreme demands on memory bandwidth and scheduling, which is exactly the optimization territory Tinygrad claims as its strength.
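To make the MoE point concrete, here is a minimal, framework-agnostic sketch of top-k expert routing in Python with NumPy. The dimensions, the linear gate, and the two-of-eight expert choice are illustrative assumptions for this sketch, not the configuration from the Reddit post or any specific model.

```python
import numpy as np

# Minimal top-k MoE routing sketch (illustrative values, not the benchmarked model).
d_model, n_experts, top_k = 64, 8, 2
rng = np.random.default_rng(0)

# One "expert" here is just a small weight matrix. In a real MoE the experts are
# large, and only the selected experts' weights are touched per token, which is
# why memory bandwidth and scheduling dominate inference cost.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top_k experts and mix their outputs."""
    logits = x @ gate_w                              # (tokens, n_experts) gate scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # indices of the chosen experts
    sel = np.take_along_axis(logits, top, axis=-1)   # scores of only the chosen experts
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the selected experts

    out = np.zeros_like(x)
    for i, token in enumerate(x):                    # token-by-token for clarity
        for j, e in enumerate(top[i]):
            out[i] += weights[i, j] * (token @ experts[e])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 64): each token used only 2 of the 8 experts
```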

Industry view

We note that the local AI community has always been highly enthusiastic about such experiments; the post's comment section quickly filled with benchmark suggestions. This reflects two realities. First, MoE models (such as Mixtral and DeepSeek-MoE) are becoming the mainstream choice in the open-source community, but inference optimization for them in existing frameworks is far from mature. Second, very few people have hands-on access to Blackwell hardware, so any real-world numbers are a useful reference point.

But it's worth staying grounded: Tinygrad's ecosystem and industrial adoption remain small, and PyTorch will stay the mainstream choice for the foreseeable future. As one community commenter pointed out, the results of such a "franken-cluster" experiment offer limited guidance for most developers: your hardware stack is completely different from theirs, so you cannot reuse their optimization path. Additionally, the barrier to configuring an RDMA cluster is very high, which by itself filters out the vast majority of potential users.

Impact on regular people

For enterprise IT: No direct short-term impact. This type of experiment is frontier exploration, not a production solution; enterprises need not adjust their infrastructure planning for this.

For individual careers: If you are an AI engineer, the low-level hackability of frameworks like Tinygrad is worth watching. Working at that level builds an understanding of hardware and of how operators map onto it, a skill that is increasingly valuable in model deployment and inference optimization.
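As one concrete illustration of that hackability, the sketch below assumes a recent tinygrad install and uses its documented DEBUG environment variable to show the kernels it generates for a tiny matmul. The import path, DEBUG output format, and exact behavior can differ between tinygrad versions.

```python
# A tiny tinygrad program. Run it as `DEBUG=2 python this_script.py` to see the
# kernels tinygrad schedules and their timings; higher DEBUG levels print more
# detail, including generated device code. (Assumes a recent tinygrad version.)
from tinygrad import Tensor

x = Tensor.rand(256, 256)   # random input matrix
w = Tensor.rand(256, 256)   # random weight matrix

# A matmul followed by a ReLU: tinygrad builds a lazy graph here and only
# compiles and runs device kernels when the result is materialized below.
y = (x @ w).relu()

print(y.numpy().shape)  # (256, 256); calling .numpy() forces compilation and execution
```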

For the consumer market: The Blackwell + Apple Silicon combination reaffirms a trend: the hardware ceiling for local inference keeps being pushed higher, but setups like this are still at least one or two hardware cycles away from products ordinary consumers can buy.