A developer's custom benchmark shows Mistral's Devstral Small 2 scoring over 80% across 8 code-engineering dimensions, making it the first local model in his testing to beat multiple closed-source rivals. Mainstream benchmarks may be hiding the real capabilities of open-source code models.
What This Is
The mainstream code benchmark SWE-bench only tests Python and only checks a binary pass/fail on tests, without asking whether the model introduces new bugs or writes redundant code. Finding this insufficient, a developer built Scaffold Bench, which scores 8 dimensions: precise fixes (changing only what should change), code auditing (read-only, no modifications), scope discipline (not touching unrelated code), read-only analysis, verification fixes, feature implementation, response speed, and long-context retrieval. A sketch of how one such dimension might be scored follows.
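The article doesn't publish Scaffold Bench's internals, so as a rough illustration of how a "scope discipline" check could be scored, here is a minimal TypeScript sketch (all names hypothetical, not Scaffold Bench's actual API) that fails a patch the moment it touches a file outside the task's allowed set:

```typescript
// Hypothetical scorer: a patch passes "scope discipline" only if every
// changed file is inside the task's allowed set. Names are illustrative.
interface Patch {
  changedFiles: string[];
}

function passesScopeDiscipline(patch: Patch, allowed: Set<string>): boolean {
  // Any edit outside the allowed set counts as touching "extra code".
  return patch.changedFiles.every((file) => allowed.has(file));
}

// Example: the task permits edits to a single component only.
const allowed = new Set(["src/components/Cart.tsx"]);
console.log(passesScopeDiscipline({ changedFiles: ["src/components/Cart.tsx"] }, allowed)); // true
console.log(passesScopeDiscipline({ changedFiles: ["src/components/Cart.tsx", "src/db.ts"] }, allowed)); // false
```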
In his testing, Mistral's Devstral Small 2 (24B parameters, run at Q8 quantization, an 8-bit weight format that cuts memory requirements) scored over 80% on JavaScript, TypeScript, React, Go, and SQL tasks, making it the first local model to break that threshold. According to him, it even outperformed certain versions of Sonnet and Codex. The developer previously ran a hybrid workflow (Qwen for execution, Claude for writing specs, Codex for review) but found that Devstral introduced fewer anti-patterns and less repetitive code.
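To make the quantization claim concrete, here is a back-of-envelope estimate (illustrative figures, weights only) of why Q8 roughly halves memory versus a 16-bit model, and why the "24GB+ of VRAM" figure cited later is plausible:

```typescript
// Rough weight-memory estimate for Q8 vs FP16; real usage adds KV cache,
// activations, and runtime overhead on top of these numbers.
const params = 24e9;       // Devstral Small 2: 24B parameters
const q8Bytes = 1;         // Q8 stores each weight in 8 bits = 1 byte
const fp16Bytes = 2;       // 16-bit baseline for comparison

const q8GiB = (params * q8Bytes) / 1024 ** 3;     // ≈ 22.4 GiB
const fp16GiB = (params * fp16Bytes) / 1024 ** 3; // ≈ 44.7 GiB
console.log(`Q8: ~${q8GiB.toFixed(1)} GiB, FP16: ~${fp16GiB.toFixed(1)} GiB`);
```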
Industry View
We've noticed that Mistral has long occupied an awkward position in the community, that of not being taken seriously. If this benchmark is reproducible, it suggests that open-source models may have caught up with closed-source ones in one specific vertical, code engineering. That is a substantive boon for enterprises committed to on-premise deployment.
But caveats are warranted. First, this is a single developer's custom test: the methodology hasn't been peer-reviewed, and he himself admits "maybe my benchmark itself has issues." Second, the model's generation speed (tokens per second, TPS) is on the slow side, and latency in production environments could be a hard blocker; a rough calculation below shows why. Third, the benchmark covers only frontend stacks and the Go ecosystem, with zero coverage of Python, Java, and other mainstream enterprise stacks. Good benchmark scores don't equal production readiness, and we stand by that judgment.
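The article quotes no exact TPS figure, so the generation speed here is an assumed placeholder; the point is only to show how quickly slow TPS compounds in a real workflow:

```typescript
// Illustrative latency arithmetic; the assumed TPS is a placeholder,
// not a measured figure for Devstral Small 2.
const assumedTps = 10;     // assumed tokens/second on consumer hardware
const patchTokens = 1500;  // a moderately sized multi-file patch
const seconds = patchTokens / assumedTps;
console.log(`~${(seconds / 60).toFixed(1)} minutes per patch`); // ~2.5 minutes per response
```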
Impact on Regular People
For enterprise IT: Open-source code models are approaching closed-source levels, which improves the feasibility of on-premise deployment, but a 24B model at Q8 quantization still needs 24GB+ of VRAM (see the arithmetic above); the hardware barrier remains significant.
For individual careers: Developers can run a near-closed-source-level code assistant locally, with data never leaving the machine, but it requires a high-end GPU, and generation speed may be a daily friction point.
For the consumer market: If Mistral keeps producing models like this, it could reshape the open-source AI competitive landscape, but for now it remains a niche choice: its ecosystem and toolchain are far less mature than Meta's Llama's.