
Before You Promise 99.9% Uptime, Do the Math


99.9% sounds friendly until you translate it into minutes.

Once you put an uptime target in front of a customer, the conversation stops being aspirational. Now you need measurement, incident history, recovery capability, exclusions, and somebody who can defend the wording when things go sideways.

Plenty of teams are not ready for that, even when the number sounds reasonable in the room.

I have seen this blow up more than once. A bad incident burns through the whole month’s budget in one shot, and then the next conversation somehow drifts toward adding another nine, as if that solves anything.

No. Stop. Do the math. Check whether the system can actually back the promise.

Start with the budget

In a 30-day month, uptime targets roughly translate to this:

Target    Downtime budget per month
99.0%     7h 12m
99.9%     43m 12s
99.95%    21m 36s
99.99%    4m 19s

This is usually the point where the meeting gets quieter.

“Just one more nine” sounds harmless until you see the actual budget. At 99.99%, you get a little over four minutes in a month. Not four minutes to understand the incident. Four minutes for the whole thing, from failure to recovery. One slow rollback or one delayed response and the month is gone.
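The arithmetic is simple enough to sanity-check yourself. A minimal sketch, assuming a 30-day month like the table above:

```python
# Convert an uptime target into a monthly downtime budget (30-day month).
def downtime_budget(target_pct: float, days: int = 30) -> str:
    total_seconds = days * 24 * 60 * 60
    allowed = total_seconds * (1 - target_pct / 100)
    hours, rem = divmod(round(allowed), 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes}m {seconds}s"

for target in (99.0, 99.9, 99.95, 99.99):
    print(f"{target}%: {downtime_budget(target)}")
# 99.0%: 7h 12m 0s ... 99.99%: 0h 4m 19s
```

Run it whenever someone proposes another nine: each full nine divides the budget by ten.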

Check reality, not confidence

Before promising any number externally, I would want a very plain internal review. Real answers, not confidence.

How much downtime did you actually have in the last three to six months? Pull the uptime data. Calculate the real number. If nobody can do that quickly, that is already meaningful.
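As a sketch of that “calculate the real number” step, assuming you can export outage durations from your incident history (the values below are made up):

```python
from datetime import timedelta

# Hypothetical outage durations pulled from the last 90 days of incidents.
incidents = [
    timedelta(minutes=38),
    timedelta(minutes=12),
    timedelta(hours=2, minutes=5),
]
period = timedelta(days=90)

downtime = sum(incidents, timedelta())
uptime_pct = 100 * (1 - downtime / period)
print(f"Measured uptime: {uptime_pct:.3f}%")  # 99.865% with these numbers
```

If that measured number is already below what sales wants to promise, the conversation should change before the contract does.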

How fast do incidents get detected? Synthetic checks in under a minute? Or a customer message twenty minutes later? Those are very different worlds.
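A synthetic check does not have to be elaborate to beat the customer-message world. A bare-bones sketch, with a placeholder URL standing in for a real health endpoint; real setups would probe from several locations on a short interval and page on failure:

```python
import urllib.request

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers timeouts, DNS failures, HTTP errors
        return False

print(is_up("https://example.com/health"))  # hypothetical endpoint
```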

How long does recovery usually take? Is there a runbook? Is rollback practiced? Or is somebody figuring it out live every time?

If your application is up but a critical dependency is not, are you down? Who decides that, and is it written anywhere?

If the answers are fuzzy, the promise is fuzzy too.

SLA, SLO, and the part nobody talks about

Teams blur these together all the time, so it helps to separate them. The SLA is the external promise, the contractual one with credits attached. The SLO is the internal target you actually engineer against. And the error budget is the part nobody talks about: the downtime the target allows, and what you agree to do when it runs out.

The healthy version is boring in a good way. The internal target is tighter than the external promise. Actual performance supports both. When the error budget gets burned, risky changes slow down. The broken version is the one where the contract says 99.9%, nobody can measure it cleanly, and releases keep moving because everyone hopes the bad month is behind them.
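That “slow down when the budget burns” rule is concrete enough to state in code. A sketch under assumed numbers, with the internal SLO set tighter than the external SLA as described above:

```python
from datetime import timedelta

DAYS = 30
SLO = 99.95  # internal objective, tighter than an external 99.9% SLA
budget = timedelta(days=DAYS) * (1 - SLO / 100)  # 21m 36s

downtime_so_far = timedelta(minutes=15)  # hypothetical, from monitoring
remaining = budget - downtime_so_far

if remaining <= timedelta(0):
    print("Budget spent: freeze risky changes.")
elif remaining < 0.25 * budget:
    print(f"{remaining} left: risky changes slow down.")
else:
    print(f"{remaining} of error budget remaining.")
```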

Two realistic paths

When a customer asks for an uptime commitment, I think the conversation usually comes down to two real options.

You can promise it now. Sales gets the cleaner answer and the deal may move faster. But if the system is not ready, you are volunteering for credits, messy conversations, and a customer learning that your contract confidence outran your engineering reality.

Or you can narrow the promise until the system grows into it. Offer a lower number. Tighten the exclusions. Delay the commitment until your monitoring and resilience actually support it. That can make deals harder in the short term, but it is still a better conversation than explaining why the promised number fell apart in the first rough month.

I would rather make a smaller promise I can defend than a bigger one I already expect to miss.

What I’d verify before signing

Before anyone puts an uptime number in a contract, I would want these answered in plain language.

What counts as down? Full outage only? Severe degradation too? Does a brief blip count? What if a single region fails in a multi-region setup?

Who measures it? Internal monitoring, external checks, or the customer’s own tools? That matters more than people expect because different sources can tell very different stories.

Which maintenance windows are excluded, if any? If there are regular changes that could affect availability, write that down.

What about third-party dependencies? If AWS or Stripe has a bad day and your service depends on them, does that count against your number?

What happens when the budget is already burned? Is there an agreed slowdown or freeze on risky changes, or does everyone just keep shipping and hope for the best?

And who owns the breach conversation if it happens? Customer communication, credits, contract interpretation, all of that should have an owner before it is urgent.

It does not need to become a giant process document. It just needs to be clear enough that everyone means the same thing when they say “99.9%.”

Reliability is earned

You do not get more nines because a slide deck says “enterprise.”

You get them the slow way. Faster rollback. Better detection. Cleaner runbooks. Fewer single points of human dependency. It is not glamorous work, but it is the work that makes the promise believable.

Key takeaway

If the math says you cannot support the promise yet, say so early. An honest conversation before the contract is much cheaper than a defensive one after a breach.
