The Signal
A post circulating on Hacker News (972 points, 262 comments) references the "Mythos" cybersecurity benchmark and makes a pointed claim: small models found the same vulnerabilities that the headline-grabbing frontier models did. The original article at aisle.com frames this as the "jagged frontier" problem: AI capability isn't a smooth curve where bigger always wins. On specific, well-scoped tasks like vulnerability detection, smaller models can match or approximate what the expensive giants produce. This has direct cost implications for anyone building security tooling, code review agents, or automated audit pipelines.
Builder's Take
This is a leverage story, not a hype story. Here's the first-principles read:
The "jagged frontier" framing is the key concept. Capability isn't monolithic. GPT-4-class models crush small models at open-ended reasoning, creative synthesis, and long-context tasks. But for narrow, structured tasks — "does this code contain a buffer overflow?" or "is this input sanitized?" — a fine-tuned or well-prompted smaller model can close the gap dramatically.
The cost/capability calculation: If a frontier model costs 10-50x more per token than a small open-source model, and the small model gets you 80-90% of the output quality on your specific task, the math is obvious. You don't need to be a quant to see that running 100K security scans with a small model vs. a frontier model is the difference between a viable product and a money pit.
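A quick back-of-envelope check of that claim. The per-token prices below are placeholders (assumptions, not current rates); swap in real pricing from your providers before drawing conclusions:

# Hypothetical prices per 1K tokens; replace with current rates from your providers.
FRONTIER_PER_1K = 0.01    # assumed frontier API price
SMALL_PER_1K = 0.0002     # assumed hosted small open-weight model price

SCANS = 100_000
TOKENS_PER_SCAN = 2_000   # prompt + completion for one function-sized chunk (rough guess)

def total_cost(price_per_1k):
    return SCANS * TOKENS_PER_SCAN / 1_000 * price_per_1k

print(f"frontier: ${total_cost(FRONTIER_PER_1K):,.0f}")  # $2,000 at these assumptions
print(f"small:    ${total_cost(SMALL_PER_1K):,.0f}")      # $40 at these assumptions

At these made-up numbers the gap is 50x, which is the point: the exact prices matter less than the order of magnitude.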
What moat does this create or destroy?
- Destroys: The moat of "only big labs can do AI security tooling." If small models work, a solo dev with a fine-tuned Mistral or Phi-3 instance can ship a competitive security scanner.
- Creates: A moat for builders who invest in task-specific evaluation. The edge isn't model size; it's knowing exactly where small models succeed on YOUR specific use case and building reliable pipelines around that.
- The real leverage: Fine-tuning a small model on a curated vulnerability dataset is now a defensible strategy, not a compromise. Your competitor paying $500/month in API costs is your opportunity.
Caveat: the source article is thin on specifics (the HN post links to a blog, not a peer-reviewed benchmark). Treat this as a directional signal, not a guarantee. Run your own evals before betting your product on it.
Tools & Stack
Here's what a solo builder should actually look at to operationalize this:
Small Models Worth Testing
- Mistral 7B / Mixtral 8x7B — Strong code understanding, open weights, runs locally via Ollama. Free to self-host.
- Phi-3 Mini / Phi-3 Medium (Microsoft) — Punches above its weight on code tasks. Available via Azure AI or locally. Check current pricing on Azure.
- CodeLlama 7B/13B — Purpose-built for code analysis. Free via Ollama or Hugging Face Inference API (check current pricing).
- Qwen2.5-Coder — Recent strong performer on code benchmarks. Open weights, runs locally.
Running Locally
# Pull and run a code-focused model with Ollama
ollama pull codellama:7b
ollama run codellama:7b
# Or via API
curl http://localhost:11434/api/generate -d '{
"model": "codellama:7b",
"prompt": "Review this function for SQL injection vulnerabilities:\n[your code here]",
"stream ": false
}'
Evaluation Framework
- promptfoo — Open-source LLM eval framework. Run structured tests across multiple models and compare outputs. Free, self-hosted. (A sample config sketch follows the commands below.)
- Inspect (UK AISI) — Python-based eval framework for security/capability testing. Open source.
# Install promptfoo and run a quick model comparison
npx promptfoo@latest init
# Edit promptfooconfig.yaml to add your models and test cases
npx promptfoo@latest eval
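For reference, a minimal promptfooconfig.yaml might look like the sketch below. The structure (prompts, providers, tests) is promptfoo's documented schema, but treat the exact provider IDs and assertion type as assumptions and verify them against the current docs; the test case is a toy example.

# promptfooconfig.yaml (sketch): compare a local small model against a frontier baseline on one prompt
prompts:
  - "Review this code for security vulnerabilities and list each issue with a severity:\n{{code}}"

providers:
  - ollama:completion:codellama:7b   # local model via Ollama (assumed provider ID format)
  - openai:gpt-4o-mini               # frontier baseline for comparison (assumed provider ID)

tests:
  - vars:
      code: "query = \"SELECT * FROM users WHERE name = '\" + user_input + \"'\""
    assert:
      - type: icontains
        value: "sql injection"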
Fine-tuning on a Budget
- Unsloth — Fine-tune Llama/Mistral models 2x faster with 70% less VRAM. Free, open source. Runs on a single consumer GPU. (See the sketch after this list.)
- Modal.com — Serverless GPU for fine-tuning runs. Pay per second of compute. Check current pricing — often the cheapest option for one-off fine-tune jobs.
- Hugging Face AutoTrain — No-code fine-tuning UI. Check current pricing per training hour.
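For the Unsloth route, a minimal LoRA fine-tune looks roughly like the sketch below. It follows Unsloth's documented LoRA + TRL pattern, but the checkpoint name, dataset file, and hyperparameters are placeholders, and the SFTTrainer API shifts between TRL versions; check the Unsloth docs before running.

# Rough sketch of a LoRA fine-tune with Unsloth + TRL on a curated vulnerability dataset.
# Checkpoint, dataset path, and hyperparameters are placeholders, not recommendations.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.3-bnb-4bit",  # placeholder checkpoint
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization so it fits on a single consumer GPU
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Each example: a code snippet plus the labeled findings, joined into one "text" field.
dataset = load_dataset("json", data_files="vuln_examples.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=200,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
model.save_pretrained("security-scanner-lora")  # saves the LoRA adapter weights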
Ship It This Week
Build a lightweight code security scanner CLI using a local small model.
Here's the concrete spec:
- Install Ollama and pull codellama:7b or mistral:7b.
- Write a Python script that accepts a file path as input, reads the code, chunks it into functions/classes, and sends each chunk to the local model with a structured prompt asking for vulnerability analysis (a chunking sketch follows the script below).
- Output a structured JSON report: file, function name, risk level (low/medium/high), description, suggested fix.
- Add a GitHub Action integration so it runs on every PR.
- Charge $9/month for unlimited scans on a repo. Your marginal cost: near zero (local compute or a $10/month VPS).
import sys

import ollama

def scan_file(filepath):
    # Read the target source file
    with open(filepath) as f:
        code = f.read()
    # Ask the local model for a structured vulnerability report
    prompt = f"""You are a security code reviewer. Analyze the following code for vulnerabilities.
Return a JSON array of issues with fields: severity, description, line_hint, fix_suggestion.
Code:
{code}"""
    response = ollama.generate(model='codellama:7b', prompt=prompt)
    return response['response']

if __name__ == "__main__":
    result = scan_file(sys.argv[1])
    print(result)
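The script above sends the whole file in one shot and prints raw model text. The spec calls for per-function chunking and a structured JSON report; here is a rough sketch of those two steps (helper names are illustrative, and the json.loads fallback exists because small models don't always emit valid JSON):

import ast
import json
import sys

import ollama

PROMPT = """You are a security code reviewer. Analyze the following code for vulnerabilities.
Return a JSON array of issues with fields: severity, description, line_hint, fix_suggestion.
Code:
{code}"""

def chunk_functions(code):
    # Yield (name, source) pairs for top-level functions and classes in a Python file.
    tree = ast.parse(code)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield node.name, ast.get_source_segment(code, node)

def parse_findings(raw):
    # Best-effort parse: small models sometimes return prose instead of valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return [{"severity": "unknown", "description": raw.strip()}]

def scan_file_chunked(filepath):
    with open(filepath) as f:
        code = f.read()
    report = []
    for name, snippet in chunk_functions(code):
        raw = ollama.generate(model="codellama:7b", prompt=PROMPT.format(code=snippet))["response"]
        report.append({"file": filepath, "function": name, "issues": parse_findings(raw)})
    return report

if __name__ == "__main__":
    print(json.dumps(scan_file_chunked(sys.argv[1]), indent=2))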
The moat isn't the model — it's the workflow, the UI, and the integrations you build around it. Start there. Validate that the model output is useful on real codebases before you spend a single dollar on fine-tuning.
The jagged frontier is your friend. Find where small models win. Build there.