
Before You Promise 99.9% Uptime, Do the Math


99.9% isn’t a marketing number. It’s a downtime budget.

Once you promise availability to a customer, you’re not talking about hopes anymore. You need to measure it. You need incident history to back it up. You need recovery capability. You need contract language.

Most teams aren’t ready for any of that.

I’ve watched this play out more than once. A P1 burns through the entire month’s downtime budget in one shot, then someone in the next meeting goes “maybe we should target 99.99% instead.” More nines. That’ll fix it.

No. Stop. Do the math. Check whether the system can actually back the promise.

Start with the budget

In a 30-day month, uptime targets roughly translate to this:

Target     Downtime budget per month
99.0%      7h 12m
99.9%      43m 12s
99.95%     21m 36s
99.99%     4m 19s
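These numbers aren't magic. They fall straight out of the percentage, and you can reproduce the whole table in a few lines (assuming a 30-day month, as above):

```python
# Downtime budget for a 30-day month at a given availability target.
MONTH_SECONDS = 30 * 24 * 60 * 60  # 2,592,000 seconds

def downtime_budget(target_pct: float) -> str:
    """Return the allowed downtime as 'Xh Ym Zs' for a target like 99.9."""
    budget = MONTH_SECONDS * (1 - target_pct / 100)
    h, rem = divmod(round(budget), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}%  ->  {downtime_budget(target)}")
```

Run it before any SLA conversation. Watching the budget shrink from hours to minutes as the nines pile up makes the point faster than any slide.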

This is where meetings get real quiet.

“Just one more nine” sounds harmless until you see that 99.99% gives you about four minutes. Total. Not four minutes to find the root cause. Four minutes from the moment things break to the moment everything is fully back. One slow rollback, one DNS propagation delay, one engineer asleep when the page fires… done. Month over.

Quick gut check. Think about your last production incident. From the first alert to “all clear” - how long was that? More than 43 minutes? You just failed 99.9%.

That’s the gap between a slide deck number and what production actually looks like.

Check reality, not confidence

Before promising any number externally, I’d want a simple internal review. Real answers. Not vibes.

How much actual downtime did we have in the last three to six months? Pull your uptime dashboard in Datadog or CloudWatch. Calculate your real availability for the last 90 days. If you can’t find that data… that’s already your answer. You’re not ready to promise a number you can’t even measure.
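If you can pull outage durations from your incident tracker, the actual-availability math is trivial. A sketch, with made-up incident numbers standing in for your real history:

```python
# Actual availability over a trailing window, from incident history.
# The outage durations below are hypothetical; substitute your own.
WINDOW_DAYS = 90
WINDOW_MINUTES = WINDOW_DAYS * 24 * 60  # 129,600 minutes

outage_minutes = [47, 12, 3, 88]  # each incident's downtime, in minutes

downtime = sum(outage_minutes)
availability = 100 * (1 - downtime / WINDOW_MINUTES)
print(f"{availability:.3f}% over the last {WINDOW_DAYS} days")
```

With those example numbers you land around 99.88%, which already misses 99.9%. That's the kind of answer you want to know before the contract exists, not after.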

How fast do we detect incidents? Synthetic checks catching problems in 60 seconds? Or a customer Slack message 20 minutes later? Big difference.

How long does recovery take? Is there a runbook, or does someone figure it out live every time?

If your app is up but your payment provider is down, are you “down”? Who decides?

Fuzzy answers here mean a fuzzy SLA. Every time.

SLA, SLO, and the part nobody talks about

Teams mix these up constantly. Quick breakdown.

SLA is the customer-facing promise. Goes in the contract. Miss it and you owe credits. It’s a business commitment. Engineering doesn’t set it, but engineering has to live with it.

SLO is what the engineering team actually aims for internally. It should be tighter than the SLA. If your SLA is 99.9%, your SLO might be 99.95%. That gap is your safety margin. Without it, one bad deploy means a contract breach.

Error budget is the space between your SLO and 100%. When you burn through it in a bad month, the team slows down on risky releases. That’s the deal. It’s what makes the whole model work instead of just being a number on a wiki page nobody reads.

The healthy version: SLO is tighter than SLA. Actual performance supports both. Bad months trigger a slowdown in risky changes.

The broken version: business promises 99.9% in a contract. Engineering can’t even measure actual uptime. Nobody tracks error budget. Risky deploys keep shipping because “we’ve never been called on it.” That’s just luck running on borrowed time.
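The error-budget arithmetic is simple enough to sanity-check by hand, which is exactly why there's no excuse for not tracking it. A sketch, assuming a 99.95% internal SLO over a 30-day month and some hypothetical incidents:

```python
# Error budget: the downtime you're allowed before missing the SLO.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

slo = 99.95
budget = MONTH_MINUTES * (1 - slo / 100)  # 21.6 minutes at 99.95%

incidents = [14.0, 6.0]  # hypothetical downtime this month, in minutes
spent = sum(incidents)
remaining = budget - spent

print(f"budget {budget:.1f}m, spent {spent:.1f}m, remaining {remaining:.1f}m")
if remaining <= 0:
    print("Budget burned: freeze risky deploys until the next window.")
```

In this example, 20 minutes of downtime leaves 1.6 minutes of budget. That's the signal to stop shipping risky changes, even though the 99.9% SLA hasn't technically been breached yet. That gap doing its job is the whole point.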

Two realistic paths

When a customer asks for an uptime commitment, it usually comes down to two options.

Option A -> Promise it now

Sales gets a clean answer. Procurement friction drops. Deal moves faster.

But if the system isn’t ready, you’re signing up for credits and awkward renewal conversations. And a missed SLA does something worse than costing money. It tells the customer your internals are weaker than your contract language. Hard to recover from that.

Option B -> Narrow the promise until you’re ready

Offer a lower number. Define tighter exclusions. Delay the commitment until monitoring and resilience actually support it.

Some deals get harder. That’s real. But you can always upgrade an SLA later. Walking back a broken promise is a different conversation entirely.

I’d rather have a smaller promise I can defend than a bigger one I already know will break.

What I’d verify before signing

Before anyone puts an uptime number in a contract, I’d want these answered clearly.

What counts as “down”? Full outage only, or does degraded performance count? Is a 10-second blip downtime? What about a single region failing when you’re running multi-region?

Who measures it? Internal monitoring? External synthetic checks? The customer’s own tools? Your Datadog dashboard and the customer’s actual experience can tell very different stories. This matters more than people think.

Which maintenance windows are excluded? Weekly rolling deploys on Tuesday nights? Carve those out explicitly.

Third-party dependencies? If you depend on AWS or Stripe, their outages affect your number. In or out?

What happens when the budget is already burned? Is there a process to freeze risky changes? Or does the team just keep shipping?

Who handles the breach conversation? Who talks to the customer? Who decides on credits?

Doesn’t need to be a giant process. Just clear enough that everyone is talking about the same thing when they say “99.9%.”
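One lightweight way to make the "budget already burned" question concrete is a burn-rate check, in the style popularized by Google's SRE material: compare actual budget spend against an even spend over the window. A hypothetical sketch:

```python
# Burn rate: how fast the error budget is being spent, relative to an
# even spend across the whole window. Rate > 1 means you're on track
# to miss the target before the window ends.
def burn_rate(downtime_min: float, elapsed_days: float,
              slo_pct: float = 99.9, window_days: float = 30) -> float:
    budget_min = window_days * 24 * 60 * (1 - slo_pct / 100)
    expected_spend = budget_min * (elapsed_days / window_days)
    return downtime_min / expected_spend

# 30 minutes of downtime, 10 days into the month, at a 99.9% target:
print(round(burn_rate(30, 10), 2))  # burning roughly 2x too fast
```

A number like this turns "should we freeze risky deploys?" from a debate into a lookup.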

Reliability is earned

You don’t get more nines because a slide deck says “enterprise.”

You get them by making your deploy pipeline roll back in 2 minutes instead of 15. By adding synthetic checks so you catch outages before customers do. By writing runbooks for the top 3 failure modes so recovery doesn’t depend on one specific person being online.
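A synthetic check doesn't have to start as a product purchase. A minimal sketch using only the Python standard library — the URL and timeout here are placeholders, and in practice you'd run this on a scheduler from more than one region:

```python
# Minimal synthetic health check sketch (standard library only).
import time
import urllib.request

CHECK_URL = "https://example.com/healthz"  # hypothetical endpoint
TIMEOUT_S = 5

def check_once(url: str = CHECK_URL) -> tuple[bool, float]:
    """Probe the endpoint once; return (healthy, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            healthy = resp.status == 200
    except Exception:
        # Timeouts, DNS failures, and non-2xx handling all count as down.
        healthy = False
    return healthy, time.monotonic() - start
```

Even something this crude, run every minute and logged, answers two of the questions above: how fast you detect, and what your real availability was.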

Slow work. Boring work. The only work that counts.

If the math says you can’t support the promise yet, say so early. Honest sales conversations are cheap. Explaining why your “reasonable” SLA fell apart the first time production had a bad month… that one’s expensive.
