The Signal
A post circulating on Hacker News (972 points, 262 comments) references the "Mythos" cybersecurity benchmark and makes a pointed claim: small models found the same vulnerabilities that the headline-grabbing frontier models did. The original article at aisle.com frames this as the "jagged frontier" problem: AI capability isn't a smooth curve where bigger always wins. On specific, well-scoped tasks like vulnerability detection, smaller models can match or approximate what the expensive giants produce. This has direct cost implications for anyone building security tooling, code review agents, or automated audit pipelines.
Builder's Take
This is a leverage story, not a hype story. Here's the first-principles read:
The "jagged frontier" framing is the key concept. Capability isn't monolithic. GPT-4-class models crush small models at open-ended reasoning, creative synthesis, and long-context tasks. But for narrow, structured tasks — "does this code contain a buffer overflow?" or "is this input sanitized?" — a fine-tuned or well-prompted smaller model can close the gap dramatically.
The cost/capability calculation: If a frontier model costs 10-50x more per token than a small open-source model, and the small model gets you 80-90% of the output quality on your specific task, the math is obvious. You don't need to be a quant to see that running 100K security scans with a small model vs. a frontier model is the difference between a viable product and a money pit.
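A quick back-of-envelope check of that claim. The per-token prices below are placeholders (assumptions, not current rates); swap in real pricing from your providers before drawing conclusions:

# Hypothetical prices per 1K tokens; replace with current rates from your providers.
FRONTIER_PER_1K = 0.01    # assumed frontier API price
SMALL_PER_1K = 0.0002     # assumed hosted small open-weight model price

SCANS = 100_000
TOKENS_PER_SCAN = 2_000   # prompt + completion for one function-sized chunk (rough guess)

def total_cost(price_per_1k):
    return SCANS * TOKENS_PER_SCAN / 1_000 * price_per_1k

print(f"frontier: ${total_cost(FRONTIER_PER_1K):,.0f}")  # $2,000 at these assumptions
print(f"small:    ${total_cost(SMALL_PER_1K):,.0f}")      # $40 at these assumptions

At these made-up numbers the gap is 50x, which is the point: the exact prices matter less than the order of magnitude.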
What moat does this create or destroy?
- Destroys: The moat of "only big labs can do AI security tooling." If small models work, a solo dev with a fine-tuned Mistral or Phi-3 instance can ship a competitive security scanner.
- Creates: A moat for builders who invest in task-specific evaluation. The edge isn't model size; it's knowing exactly where small models succeed on YOUR specific use case and building reliable pipelines around that.
- The real leverage: Fine-tuning a small model on a curated vulnerability dataset is now a defensible strategy, not a compromise. Your competitor paying $500/month in API costs is your opportunity.
Caveat: the source article is thin on specifics (the HN post links to a blog, not a peer-reviewed benchmark). Treat this as a directional signal, not a guarantee. Run your own evals before betting your product on it.
Tools & Stack
Here's what a solo builder should actually look at to operationalize this:
Small Models Worth Testing
- Mistral 7B / Mixtral 8x7B — Strong code understanding, open weights, runs locally via Ollama. Free to self-host.
- Phi-3 Mini / Phi-3 Medium (Microsoft) — Punches above its weight on code tasks. Available via Azure AI or locally. Check current pricing on Azure.
- CodeLlama 7B/13B — Purpose-built for code analysis. Free via Ollama or Hugging Face Inference API (check current pricing).
- Qwen2.5-Coder — Recent strong performer on code benchmarks. Open weights, runs locally.
Running Locally
# Pull and run a code-focused model with Ollama
ollama pull codellama:7b
ollama run codellama:7b
# Or via API
curl http://localhost:11434/api/generate -d '{
"model": "codellama:7b",
"prompt": "Review this function for SQL injection vulnerabilities:\n[your code here]",
"stream ": false
}'
Evaluation Framework
- promptfoo — Open-source LLM eval framework. Run structured tests across multiple models and compare outputs. Free, self-hosted. (A sample config sketch follows the commands below.)
- Inspect (UK AISI) — Python-based eval framework for security/capability testing. Open source.
# Install promptfoo and run a quick model comparison
npx promptfoo@latest init
# Edit promptfooconfig.yaml to add your models and test cases
npx promptfoo@latest eval
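For reference, a minimal promptfooconfig.yaml might look like the sketch below. The structure (prompts, providers, tests) is promptfoo's documented schema, but treat the exact provider IDs and assertion type as assumptions and verify them against the current docs; the test case is a toy example.

# promptfooconfig.yaml (sketch): compare a local small model against a frontier baseline on one prompt
prompts:
  - "Review this code for security vulnerabilities and list each issue with a severity:\n{{code}}"

providers:
  - ollama:completion:codellama:7b   # local model via Ollama (assumed provider ID format)
  - openai:gpt-4o-mini               # frontier baseline for comparison (assumed provider ID)

tests:
  - vars:
      code: "query = \"SELECT * FROM users WHERE name = '\" + user_input + \"'\""
    assert:
      - type: icontains
        value: "sql injection"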
Fine-tuning on a Budget
- Unsloth — Fine-tune Llama/Mistral models 2x faster with 70% less VRAM. Free, open source. Runs on a single consumer GPU. (See the sketch after this list.)
- Modal.com — Serverless GPU for fine-tuning runs. Pay per second of compute. Check current pricing — often the cheapest option for one-off fine-tune jobs.
- Hugging Face AutoTrain — No-code fine-tuning UI. Check current pricing per training hour.
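For the Unsloth route, a minimal LoRA fine-tune looks roughly like the sketch below. It follows Unsloth's documented LoRA + TRL pattern, but the checkpoint name, dataset file, and hyperparameters are placeholders, and the SFTTrainer API shifts between TRL versions; check the Unsloth docs before running.

# Rough sketch of a LoRA fine-tune with Unsloth + TRL on a curated vulnerability dataset.
# Checkpoint, dataset path, and hyperparameters are placeholders, not recommendations.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.3-bnb-4bit",  # placeholder checkpoint
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization so it fits on a single consumer GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Each example: a code snippet plus the labeled findings, joined into one "text" field.
dataset = load_dataset("json", data_files="vuln_examples.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=200,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
model.save_pretrained("security-scanner-lora")  # saves the LoRA adapter weights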
Ship It This Week
Build a lightweight code security scanner CLI using a local small model.
Here's the concrete spec:
- Install Ollama and pull codellama:7b or mistral:7b.
- Write a Python script that accepts a file path as input, reads the code, chunks it into functions/classes, and sends each chunk to the local model with a structured prompt asking for vulnerability analysis (a chunking sketch follows the script below).
- Output a structured JSON report: file, function name, risk level (low/medium/high), description, suggested fix.
- Add a GitHub Action integration so it runs on every PR.
- Charge $9/month for unlimited scans on a repo. Your marginal cost: near zero (local compute or a $10/month VPS).
import sys

import ollama

def scan_file(filepath):
    # Read the target source file
    with open(filepath) as f:
        code = f.read()
    # Ask the local model for a structured vulnerability report
    prompt = f"""You are a security code reviewer. Analyze the following code for vulnerabilities.
Return a JSON array of issues with fields: severity, description, line_hint, fix_suggestion.
Code:
{code}"""
    response = ollama.generate(model='codellama:7b', prompt=prompt)
    return response['response']

if __name__ == "__main__":
    result = scan_file(sys.argv[1])
    print(result)
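The script above sends the whole file in one shot and prints raw model text. The spec calls for per-function chunking and a structured JSON report; here is a rough sketch of those two steps (helper names are illustrative, and the json.loads fallback exists because small models don't always emit valid JSON):

import ast
import json
import sys

import ollama

PROMPT = """You are a security code reviewer. Analyze the following code for vulnerabilities.
Return a JSON array of issues with fields: severity, description, line_hint, fix_suggestion.
Code:
{code}"""

def chunk_functions(code):
    # Yield (name, source) pairs for top-level functions and classes in a Python file.
    tree = ast.parse(code)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield node.name, ast.get_source_segment(code, node)

def parse_findings(raw):
    # Best-effort parse: small models sometimes return prose instead of valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return [{"severity": "unknown", "description": raw.strip()}]

def scan_file_chunked(filepath):
    with open(filepath) as f:
        code = f.read()
    report = []
    for name, snippet in chunk_functions(code):
        raw = ollama.generate(model="codellama:7b", prompt=PROMPT.format(code=snippet))["response"]
        report.append({"file": filepath, "function": name, "issues": parse_findings(raw)})
    return report

if __name__ == "__main__":
    print(json.dumps(scan_file_chunked(sys.argv[1]), indent=2))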
The moat isn't the model — it's the workflow, the UI, and the integrations you build around it. Start there. Validate that the model output is useful on real codebases before you spend a single dollar on fine-tuning.
The jagged frontier is your friend. Find where small models win. Build there.