
Resilience ≠ Slowing Down — Building AI Systems That Survive Contact with Reality
I stood in front of a room full of AI startup founders at AWS Summit Los Angeles and asked: “How much does a minute of your downtime cost?”
Someone said “A lot.” It’s a very good answer.
Here’s the math for a company at $50M ARR: $50M divided by 525,600 minutes in a year comes to roughly $95 per minute. A nice dinner. Every minute. Except it’s never just a minute. It’s 45 minutes of confusion, 20 minutes of actual fixing, and 3 hours of damage-control emails.
But here’s the part nobody expects from a talk: I was wrong. I had asked the wrong question.
Not all outages are created equal
If your customer-facing storefront goes down, the world is on fire. Revenue stops. Trust erodes. Investors notice. But if your internal recommendation engine hiccups? Nobody dies. Customers don’t even notice. Fix it tomorrow.
The real question isn’t “how much does a minute cost?” It’s “a minute of WHAT going down costs me HOW MUCH?”
For AI startups this matters enormously. You’re running inference endpoints, training pipelines, vector databases, RAG retrieval layers — dozens of components with wildly different business criticality. Treating your fine-tuning pipeline with the same urgency as your real-time inference API is like putting a deadbolt on your bathroom door while leaving the front door open.
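To make the "a minute of WHAT" question concrete, here is a minimal Python sketch of the arithmetic. It reuses the $50M ARR figure from the opening and assumes revenue accrues evenly across the year. The component names echo the ones mentioned above, but the revenue-at-risk fractions are purely hypothetical placeholders, and the 65-minute outage is just the 45 minutes of confusion plus 20 minutes of fixing from the earlier example. Plug in your own numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

ARR = 50_000_000  # the $50M ARR figure used in the opening example

# Hypothetical share of revenue that stops while each component is down.
# These fractions are illustrative assumptions, not measurements.
REVENUE_AT_RISK = {
    "storefront": 1.00,
    "realtime_inference_api": 0.60,
    "recommendation_engine": 0.02,
    "fine_tuning_pipeline": 0.00,
}


def cost_per_minute(annual_revenue: float) -> float:
    """Blended revenue per minute, assuming revenue accrues evenly all year."""
    return annual_revenue / MINUTES_PER_YEAR


def outage_cost(component: str, minutes: float) -> float:
    """Direct revenue lost while one component is down for `minutes`."""
    return cost_per_minute(ARR) * REVENUE_AT_RISK[component] * minutes


if __name__ == "__main__":
    print(f"Blended: ${cost_per_minute(ARR):.2f} per minute")
    for name in REVENUE_AT_RISK:
        print(f"{name}: ${outage_cost(name, 65):,.2f} for a 65-minute outage")
```

The blended number is the same for every row. What changes is the fraction, and that fraction is what the rest of this post is about. Note that this only counts direct revenue; trust and reputation damage are not modeled.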
Business Impact Analysis: simple, brutal clarity
Three questions per user journey:
- What’s the impact if this breaks? Revenue loss? Reputation damage? Or… nothing?
- How fast must we restore it? That is the recovery time objective (RTO). Storefront: 5 minutes. Analytics dashboard: 2 hours.
- How much data can we lose? That is the recovery point objective (RPO). Transaction database: zero. Yesterday’s training data: probably fine.
BIA gives you permission to be strategic. You cannot over-engineer every component — that kills iteration speed and burns runway. But you can’t ignore resilience and pray either. The answer is targeted investment, driven by actual business impact.
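One way to capture the three answers is a small record per user journey plus a rule that turns them into a priority tier. The sketch below is only an illustration: the `Journey` class, the `tier` function, and the 15-minute cutoff are hypothetical names and thresholds of mine. The 5-minute and 2-hour RTOs and the zero RPO come from the examples above; the impact labels and the remaining RPO and RTO values are assumptions added to fill out the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Journey:
    name: str
    impact: str      # "revenue", "reputation", or "none"
    rto: timedelta   # how fast it must be restored
    rpo: timedelta   # how much data loss is tolerable


def tier(journey: Journey) -> int:
    """Map BIA answers to a priority tier (1 = invest first).

    The cutoffs are assumptions for illustration; choose your own.
    """
    if journey.impact == "none":
        return 3
    if journey.rto <= timedelta(minutes=15) or journey.rpo == timedelta(0):
        return 1
    return 2


JOURNEYS = [
    # RTO and RPO from the examples above (storefront backed by the transaction database)
    Journey("storefront checkout", "revenue", timedelta(minutes=5), timedelta(0)),
    # RTO from the example above; impact and RPO are assumed
    Journey("analytics dashboard", "reputation", timedelta(hours=2), timedelta(hours=1)),
    # "Yesterday's training data: probably fine", read here as a 24-hour RPO (assumed)
    Journey("training data refresh", "none", timedelta(hours=24), timedelta(hours=24)),
]

if __name__ == "__main__":
    for j in sorted(JOURNEYS, key=lambda j: (tier(j), j.rto)):
        print(f"tier {tier(j)}: {j.name} (RTO {j.rto}, RPO {j.rpo})")
```

The point of the tier is not precision. It is to have a written, agreed answer for where the next hour of resilience work goes.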
Five stages (none require you to stop shipping)
This is why I co-authored the AWS Startup Resiliency Baseline:
- Set objectives — Run the BIA. Know where your real exposure is.
- Design smart — Managed services as resilience shortcuts. Don’t build redundancy — buy it.
- Test it — Monthly restore tests. Prove backups work before 2 AM proves they don’t.
- Monitor what matters — If your team ignores alerts, you configured too many.
- Learn from incidents — 10-minute blameless retros. Every failure becomes a system improvement.
Five stages. Each one compounds — small decisions early become massive stability later.
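As a small illustration of the "Test it" stage, the sketch below flags data stores whose last successful restore test is missing or older than a month. Everything in it is an assumption for the example: the function name `overdue_restore_tests`, the store names, the dates, and reading "monthly" as "at most 31 days". It does not perform a restore; it only tracks whether one has been proven recently.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

MAX_AGE = timedelta(days=31)  # "monthly", interpreted here as at most 31 days


def overdue_restore_tests(
    last_success: Dict[str, Optional[date]], today: date
) -> List[str]:
    """Return stores with no successful restore test within MAX_AGE.

    A value of None means a restore has never been tested.
    """
    return sorted(
        name
        for name, when in last_success.items()
        if when is None or today - when > MAX_AGE
    )


if __name__ == "__main__":
    # Hypothetical records of the last successful restore test per store.
    records = {
        "transactions_db": date(2026, 1, 10),
        "vector_store": date(2025, 11, 2),
        "feature_store": None,
    }
    print(overdue_restore_tests(records, today=date(2026, 2, 1)))
```

A check like this could run on a schedule and feed a single alert, which fits the "monitor what matters" stage as well.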
Why this matters now
The startups winning right now aren’t the ones that never go down. They’re the ones that know what matters — that have done the BIA, set intentional recovery targets, and invested their limited hours in the components that actually drive customer value.
Resilience isn’t the opposite of speed. Ignorance is the opposite of speed. The time you save by not thinking about this? You pay it back tenfold. At 2 AM. With your hair on fire.
Presented at AWS Summit Los Angeles 2026 as part of the ReSCO program.