What Happened

A thread on r/LocalLLaMA compiled practical prompts for testing common-sense reasoning in local LLMs. The original poster observed that models frequently fail simple spatial logic questions, such as whether to walk 50 meters to a car wash or drive there (the car itself has to be at the car wash, so driving is the correct answer). The tests also cover factual prioritization (Apple A6 microarchitecture, Pentium D design flaws) and basic situational reasoning. Even a Gemma 3 4B model in thinking mode (Q6_K quantization) failed multiple prompts that the larger Gemma 3 27B model passed.

Solo Founder Angle

If you are running local models via Ollama, LM Studio, or llama.cpp to automate any client-facing or internal workflow, these prompts give you a fast, free benchmark before committing to a model. Specific workflow:

  • Run the spatial logic prompts against any candidate model with a quick ollama run modelname session before integrating it into an n8n or LangChain pipeline.
  • Use the factual prioritization prompts (Apple A6, Pentium D) to check whether a model surfaces the most relevant information first, which is critical if you are building a research assistant or content summarizer.
  • Log pass/fail results in a simple Notion table or Airtable base to build your own model selection matrix over time.
  • For production RAG pipelines, prefer models that pass the common-sense spatial tests, since reasoning failures there often predict failures in multi-step task chains.
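The first two steps can be sketched as a small harness. This is a hedged sketch, not the thread's method: the prompt texts, expected-keyword checks, and model names below are illustrative placeholders, so substitute the actual prompts from the thread before relying on the results.

```python
import subprocess

# Hypothetical spot-check prompts; replace with the actual prompts from the thread.
# Each entry maps a prompt name to (prompt text, keywords a passing answer should contain).
PROMPTS = {
    "car_wash": (
        "The car wash is 50 meters away. Should I walk there or drive my car there?",
        ["drive"],  # the car itself needs to be at the car wash
    ),
    "keyboard": (
        "My keyboard is in another room. What should I do before I start typing?",
        ["keyboard"],
    ),
}

def grade(response: str, keywords: list[str]) -> bool:
    """Crude pass check: the answer must mention every expected keyword."""
    text = response.lower()
    return all(k in text for k in keywords)

def run_model(model: str) -> dict[str, bool]:
    """Send each prompt through `ollama run <model> <prompt>` and grade the reply."""
    results = {}
    for name, (prompt, keywords) in PROMPTS.items():
        out = subprocess.run(
            ["ollama", "run", model, prompt],
            capture_output=True, text=True, timeout=300,
        )
        results[name] = grade(out.stdout, keywords)
    return results

if __name__ == "__main__":
    for model in ["gemma3:4b", "gemma3:27b"]:  # placeholder model tags
        print(model, run_model(model))
```

Keyword matching is deliberately crude; it will pass some wrong answers that happen to mention the right word, so skim the raw outputs for any model you plan to ship with.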

Why It Matters for Indie Builders

Solo founders running local models to cut API costs need a reliable way to gauge model quality without paying for formal benchmark runs. Standard benchmarks like MMLU do not catch the practical reasoning failures that break real automations. A model that fails the keyboard prompt, where it should tell you to fetch the keyboard before typing, will likely also stumble on multi-step agent tasks. Knowing which quantization level and model size clears these bars lets you make hardware and workflow decisions from real data instead of marketing claims.

Action Item This Week

Pull two or three local models you currently use or are considering, run all six spatial logic prompts from this thread against each one, and record which pass. Use that result to decide which model handles your most reasoning-heavy automation task this week.
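To keep the running record the workflow above suggests, here is a minimal sketch of a pass/fail log you can later paste into Notion or Airtable. The file name, column names, and example row are assumptions for illustration.

```python
import csv
from datetime import date
from pathlib import Path

def log_result(path: str, model: str, prompt_name: str, passed: bool) -> None:
    """Append one pass/fail row to a CSV, writing a header if the file is new."""
    file = Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "model", "prompt", "passed"])
        writer.writerow([date.today().isoformat(), model, prompt_name, passed])

# Hypothetical example: record that a 4B model failed the car-wash prompt today.
log_result("model_matrix.csv", "gemma3:4b", "car_wash", False)
```

Over a few weeks this file becomes the model selection matrix: group rows by model and count passes to see which one earns the reasoning-heavy tasks.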