The Signal
Claude.ai went down. Anthropic posted an incident report at status.anthropic.com. It hit the Hacker News front page with 100+ points and ~100 comments — meaning a non-trivial number of builders and users were blocked. Exact duration and root cause aren't detailed in the source. What matters: a widely-used production API was unavailable, and anyone routing 100% of their LLM traffic through it got burned.
This isn't a Claude-specific problem. OpenAI has gone down. Groq has gone down. Every hosted API will go down. The question is whether your product degrades gracefully or face-plants.
Builder's Take
Single-provider LLM dependency is the new single point of failure. In the old world, your database going down killed your app. Now your inference provider going down kills your app — and you don't control it.
The leverage calculation is simple:
- Cost of outage: Every minute your product is broken = churn risk, support tickets, reputation damage.
- Cost of fallback routing: ~2-4 hours of engineering, once. Maybe $0 extra at low volume if you're already on free tiers of multiple providers.
That's an asymmetric bet. Build the fallback.
The moat angle: most indie products won't do this. If you ship a resilient multi-provider setup, you can honestly market "99.9% uptime" while competitors are at the mercy of Anthropic's status page. Reliability is underrated as a solo builder differentiator — especially in B2B where downtime has real business cost for your customers.
DHH would say: don't outsource your reliability to a vendor. He'd run it on his own hardware. You probably can't, but you can hedge across vendors.
Tools & Stack
LLM Router / Fallback Options
- OpenRouter — single API that routes to Claude, GPT-4, Mistral, Llama, and more, with fallback model configuration built in. Check current pricing on their site; model costs are passed through with a small markup. This is the fastest path to multi-provider resilience (see the request sketch after this list).
- LiteLLM (open source) — drop-in proxy that normalizes API calls across 100+ providers. Self-hostable. Free. Supports fallbacks and retries natively.
- OpenAI API — obvious Claude alternative. Keep credentials ready even if Claude is your primary.
- Groq — fast inference for open models (Llama, Mixtral). Free tier available. Good emergency fallback for latency-sensitive apps.
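For the OpenRouter route, here's a minimal sketch of a request with a fallback list, assuming its OpenAI-compatible chat completions endpoint and the `models` fallback array described in its routing docs — verify field names and model IDs against the current API reference before shipping:

```python
import os

import requests

# Primary model plus ordered fallbacks in one request; OpenRouter reroutes
# server-side if the primary provider is down. The `models` field and the
# model IDs here are assumptions to check against OpenRouter's docs.
response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "anthropic/claude-3.5-sonnet",
        "models": ["openai/gpt-4o", "meta-llama/llama-3.1-70b-instruct"],
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```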
DIY Fallback with LiteLLM

```
pip install litellm
```

```python
import litellm

# Primary model plus an ordered fallback chain. If the primary errors,
# LiteLLM tries the next model in the list.
response = litellm.completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello"}],
    fallbacks=["gpt-4o", "groq/llama-3.1-70b-versatile"],
)
print(response.choices[0].message.content)
```

That's it. If Claude is down, LiteLLM automatically tries GPT-4o, then Groq. Your app keeps running. Your users never see an error.
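The retries mentioned above compose with the same call. A sketch, assuming your ANTHROPIC_API_KEY, OPENAI_API_KEY, and GROQ_API_KEY env vars are set so each hop in the chain can authenticate:

```python
import litellm

# Retry transient errors on the current model before moving down the chain.
# Assumes ANTHROPIC_API_KEY, OPENAI_API_KEY, and GROQ_API_KEY are set.
response = litellm.completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello"}],
    num_retries=2,  # per-model retries; confirm semantics in current LiteLLM docs
    fallbacks=["gpt-4o", "groq/llama-3.1-70b-versatile"],
)
```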
Status Monitoring
- status.anthropic.com — Anthropic's status page. Subscribe to email/webhook alerts.
- Better Uptime or Upptime (open source) — monitor the API endpoint yourself; don't rely on the provider to tell you they're down.
Ship It This Week
Build a resilient LLM wrapper for your existing product in one afternoon.
Here's the exact scope:
- Install LiteLLM or sign up for OpenRouter (30 min).
- Refactor your LLM call into a single `llm_complete(prompt)` function, if you haven't already (30 min).
- Add a fallback chain: Claude → GPT-4o → Groq/Llama (15 min).
- Add a `try/except` that logs which provider served the request, so you have visibility (15 min). See the sketch after this list.
- Subscribe to Anthropic + OpenAI status page email alerts (5 min).
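Stitched together, those middle steps can look like this minimal sketch. The `llm_complete` name comes from the list above; the logging setup and the hand-rolled loop are illustrative, chosen over LiteLLM's `fallbacks` kwarg so you get one log line per provider attempt:

```python
import logging

import litellm

logger = logging.getLogger("llm")

# Ordered chain: primary first, then fallbacks. Assumes ANTHROPIC_API_KEY,
# OPENAI_API_KEY, and GROQ_API_KEY are set in the environment.
MODELS = [
    "claude-3-5-sonnet-20241022",
    "gpt-4o",
    "groq/llama-3.1-70b-versatile",
]

def llm_complete(prompt: str) -> str:
    """Try each provider in order and log which one served the request."""
    last_error = None
    for model in MODELS:
        try:
            response = litellm.completion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            logger.info("request served by %s", model)
            return response.choices[0].message.content
        except Exception as exc:
            logger.warning("%s failed: %s", model, exc)
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

The manual loop trades LiteLLM's built-in fallback handling for per-provider visibility: one log line per attempt is exactly what you want the morning after an outage.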
Total: under 2 hours. You now have a product that survives any single provider outage. Ship a changelog note: "Improved reliability with automatic failover." B2B customers will notice.
If you want to go further: build a simple health-check cron (runs every 5 min, hits each provider's API with a cheap test prompt, writes status to a Redis key). Your app reads that key before routing. Zero dependency on third-party status pages.
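A minimal sketch of that cron job, reusing LiteLLM for the test prompts and assuming a local Redis; the key names, cheap-model picks, and 10-minute expiry are illustrative:

```python
import litellm
import redis

r = redis.Redis()  # assumes a local Redis instance

# One cheap model per provider; these picks are illustrative assumptions.
PROVIDERS = {
    "anthropic": "claude-3-5-haiku-20241022",
    "openai": "gpt-4o-mini",
    "groq": "groq/llama-3.1-8b-instant",
}

def check_providers() -> None:
    """Hit each provider with a tiny prompt and record up/down in Redis."""
    for name, model in PROVIDERS.items():
        try:
            litellm.completion(
                model=model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
                timeout=10,
            )
            status = "up"
        except Exception:
            status = "down"
        # Expire after 10 min so a dead cron reads as "unknown", not "up".
        r.set(f"llm_status:{name}", status, ex=600)

if __name__ == "__main__":
    check_providers()
```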