What Happened

A community-maintained benchmark on r/LocalLLaMA has published scores on an extended set of NYT Connections puzzles for several local and open-weight LLMs. MiniMax-M1 scored 34.4, Gemma 4 31B scored 30.1, and Arcee Trinity Large Thinking scored 29.5. The full dataset and methodology are published on GitHub at lechmazur/nyt-connections.

Why It Matters

NYT Connections requires categorizing 16 words into 4 hidden groups — a task that tests semantic reasoning, ambiguity resolution, and common-sense knowledge rather than raw text generation. For indie developers and SMEs evaluating which local model to deploy, this benchmark provides a practical signal on reasoning quality beyond standard coding or math tasks.

  • MiniMax-M1's 34.4 score positions it as a strong open-weight reasoning model for tasks requiring semantic grouping.
  • Gemma 4 31B at 30.1 remains competitive for teams constrained to smaller hardware footprints.
  • Arcee Trinity Large Thinking at 29.5 suggests that thinking-mode specialization alone does not guarantee an edge on this task type.
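To make the task format concrete, here is a minimal, illustrative scorer for a single Connections-style puzzle. The words, groups, and scoring function are invented for illustration; this is not the lechmazur repo's actual methodology, which is documented in that repo.

```python
# Illustrative scorer for one Connections-style puzzle: 16 words, 4 hidden
# groups of 4. Puzzle contents are made up; not the benchmark's real data.

def solved_groups(predicted, answer):
    """Count how many predicted 4-word groups exactly match a hidden group."""
    answer_sets = [frozenset(g) for g in answer]
    return sum(1 for g in predicted if frozenset(g) in answer_sets)

# Hidden solution (hypothetical puzzle).
answer = [
    ["bass", "pike", "sole", "carp"],      # types of fish
    ["bolt", "dash", "sprint", "tear"],    # ways to run
    ["rose", "violet", "daisy", "iris"],   # flowers
    ["ring", "mat", "rope", "bell"],       # boxing-adjacent words
]

# A model's guess: one swapped word ruins two groups at once,
# which is what makes the puzzle a test of ambiguity resolution.
predicted = [
    ["bass", "pike", "sole", "carp"],
    ["bolt", "dash", "sprint", "tear"],
    ["rose", "violet", "daisy", "ring"],
    ["iris", "mat", "rope", "bell"],
]

print(solved_groups(predicted, answer))  # → 2
```

The exact-match rule is what makes the task unforgiving: a single ambiguous word ("iris" the flower vs. "iris" elsewhere) costs two groups, not one.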

Asia-Pacific Angle

MiniMax is a Shanghai-based AI company, making this benchmark result particularly relevant for Chinese developers evaluating domestically developed models against Western alternatives. MiniMax-M1 leading this group of open-weight models suggests it is a credible option for teams in China and Southeast Asia who need strong semantic reasoning without relying on OpenAI or Anthropic APIs. Developers in the region building consumer apps, customer support bots, or content tools in Chinese or multilingual contexts should test MiniMax-M1 directly; its training data likely includes stronger CJK language coverage than Gemma 4.

Action Item This Week

Clone the lechmazur/nyt-connections benchmark repo, run it against whichever local model you are currently evaluating, and compare your score against the published leaderboard to make a data-driven model selection decision before your next sprint.
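Once you have a score from your own run, a few lines of Python are enough to place it against the figures quoted above. The leaderboard numbers below are the three from this post; `your_score` is a placeholder to replace with your benchmark result.

```python
# Published extended NYT Connections scores quoted in this post.
leaderboard = {
    "MiniMax-M1": 34.4,
    "Gemma 4 31B": 30.1,
    "Arcee Trinity Large Thinking": 29.5,
}

your_score = 31.0  # placeholder: replace with your own benchmark run's result

# Rank published entries and see where your candidate model lands.
ranked = sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True)
beats = [name for name, score in ranked if your_score > score]

for name, score in ranked:
    print(f"{score:5.1f}  {name}")
print(f"Your model ({your_score}) outscores {len(beats)} of {len(ranked)} entries")
```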