DevOps Is Not a Side Task
Most startups don’t have a DevOps tooling problem. They have an ownership problem.
The ask sounds simple enough. Set up CI/CD. Move infra into code. Add monitoring. Make deployments safe. Support multiple environments. Keep production stable.
But these requests get handed to one engineer as a side task while the rest of the org keeps optimizing for feature speed. That’s how you end up with half-finished Terraform, fragile pipelines, and alerts nobody trusts.
I’ve been that one engineer. The issue was almost never “we picked the wrong tool.” It was that nobody decided this work was worth protecting from the sprint backlog.
If a company wants the outcome people loosely call “DevOps,” it needs to fund the work behind it. That means clear ownership. Time to pay down operational debt. Standards for how changes ship. Not as a side quest. As real product work.
What teams usually ask for
When someone says “we need DevOps,” they usually mean: CI/CD that works. Infra as code that’s reviewable. Monitoring that catches problems before customers do. Rollback paths that hold up under pressure. Secrets management. Environments consistent enough to debug.
Not unreasonable. But people think those things show up because someone created a GitHub Actions workflow and a few YAML files. They don’t.
What that request actually costs
Under the surface, there’s a lot of unglamorous work that never makes it onto a sprint board.
Someone has to own the production path end to end. Not “keep an eye on it.” Actually own it. Know why the deploy pipeline is slow. Know why staging drifts from prod. Know why that one service keeps failing health checks on restart.
Someone has to define what “done” means for a safe deployment. Is it “the container started”? Or is it “health checks pass, metrics look normal, no error rate spike for 10 minutes”? Big difference. Especially at 2am.
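That definition of “done” is concrete enough to pin down in code. A rough sketch of the 2am version: health checks pass and the error rate stays near baseline for a full observation window. The callables and thresholds here are placeholders the caller supplies, not a prescribed standard.

```python
import time

def deploy_is_done(check_health, get_error_rate, baseline_rate,
                   window_s=600, interval_s=30, max_ratio=1.5):
    """A deploy is 'done' only if the service stays healthy and the
    error rate stays near its pre-deploy baseline for the whole window.

    check_health and get_error_rate are caller-supplied callables;
    the 10-minute window and 1.5x ratio are illustrative defaults.
    """
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if not check_health():
            return False  # container started, but it's not healthy
        if get_error_rate() > baseline_rate * max_ratio:
            return False  # error-rate spike -> not done, roll back
        time.sleep(interval_s)
    return True
```

“The container started” is just the first line of this function; everything after it is the difference that matters at 2am.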
Then there’s the boring stuff. Write the runbook before the first incident, not during it. Review IAM policies. Rotate secrets. Make sure a bad terraform apply can’t take down the database. Fix the same class of failure permanently instead of treating every page like a fresh mystery.
None of that can keep falling to whoever has the most calendar space that week.
The failure loop
The pattern is so common I could template it.
Startup ships fast early on. SSH into a box, git pull, restart the service, done. Production grows. Customers show up. Something painful happens… maybe a bad deploy with no rollback, maybe a database migration that locks a table for 20 minutes.
Leadership realizes things are fragile. “We need proper DevOps.”
The work lands on one engineer. Usually the most senior person who already has a full plate. Every sprint, platform tasks lose to product deadlines because features are visible and pipelines aren’t. Nothing gets finished deep enough. Terraform is half-migrated. Monitoring covers some services but not others. Secrets are still in .env files on someone’s laptop.
Then the next outage happens. The same org that underfunded reliability asks why reliability is weak.
That’s a prioritization problem. Not a people problem.
A practical version that works
Early-stage teams don’t need a platform team on day one. But they do need honesty about scope.
Phase 1: Make production survivable
Start with the stuff that reduces blast radius when (not if) something goes wrong.
One deployment path that everyone uses. No more “I deployed from my laptop.” Whether it’s GitHub Actions or GitLab CI or even a shell script on a shared runner… everyone uses the same path. If a deploy breaks something, you look at one place to see what changed.
Version control for everything. App code, obviously. But also infra config, env vars, deploy scripts. If it runs in production, it goes in a repo. When someone asks “what changed?” during an incident, the answer should be a git log, not a guess.
Basic monitoring on the critical path. You don’t need a full observability stack on day one. Health check endpoints on your main services, a simple uptime monitor (even UptimeRobot works), some error rate tracking. The bar: can you detect the app is down before a customer tells you?
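That detection bar is small enough to sketch in a few lines. A minimal poller, with the HTTP fetch injectable so it can be exercised without a network; the URLs and timeout are whatever fits your stack.

```python
import urllib.request

def check_endpoints(urls, timeout=5, fetch=None):
    """Hit each service's health endpoint and report which are down.

    By default this does a plain HTTP GET and treats anything but a
    2xx (or any exception) as down. `fetch` is injectable so the
    logic is testable without a live service.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status
    down = []
    for url in urls:
        try:
            status = fetch(url)
        except Exception:
            status = None  # connection refused, timeout, DNS -> down
        if status is None or not 200 <= status < 300:
            down.append(url)
    return down
```

Run it on a cron or a loop, page on a non-empty result, and you’ve cleared the bar: you find out before the customer does.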
A tested backup and restore process. Take your RDS snapshot or pg_dump output. Actually restore it somewhere. Verify the data. If you’ve never tested a restore, you don’t have backups. You have hope.
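The “verify the data” step can be as simple as comparing per-table row counts (from a `SELECT count(*)` on each side) between the live database and the restored copy. A sketch; crude on purpose:

```python
def verify_restore(expected, actual, tolerance=0):
    """Compare per-table row counts from the live database against
    the restored copy. Returns the tables that look wrong; an empty
    result is the minimum bar for calling a backup 'tested'.

    Row counts catch an empty or truncated restore, not subtle
    corruption -- but they beat never restoring at all.
    """
    problems = {}
    for table, want in expected.items():
        got = actual.get(table)
        if got is None:
            problems[table] = "missing from restore"
        elif abs(got - want) > tolerance:
            problems[table] = f"expected ~{want} rows, restore has {got}"
    return problems
```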
A rollback trigger defined before every risky change. Before you deploy, decide: what would make us roll back? How fast can we do it? If the answer to the second question is “I don’t know,” fix that first.
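Writing the trigger down before the deploy turns a 2am debate into a lookup. One way to sketch it, with metric names and thresholds invented for illustration:

```python
def should_roll_back(triggers, metrics):
    """Evaluate pre-agreed rollback triggers against observed metrics.

    `triggers` maps a metric name to a limit decided *before* the
    deploy, e.g. {"error_rate": 0.05, "p99_latency_ms": 1500}.
    Crossing any limit means roll back -- no judgment call required.
    """
    tripped = [name for name, limit in triggers.items()
               if metrics.get(name, 0) > limit]
    return tripped  # non-empty == roll back, and it tells you why
```

The point isn’t the code. It’s that the thresholds exist in writing before the change ships, not in someone’s head during the incident.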
Phase 2: Make changes safe
Once the basics are stable, add guardrails.
CI checks that block obvious bad changes. Linting, type checks, terraform plan output on PRs. Don’t need to be perfect. Just need to catch the stuff that would’ve been embarrassing in prod.
Reviewable infrastructure changes. No more terraform apply from a local machine. Run plan in CI. Show the diff in the PR. Require approval before apply. If someone can delete a security group without a review, that’s just automation without guardrails.
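A minimal version of that guardrail, sketched as a GitHub Actions workflow. The paths and directory names are illustrative; pair it with branch protection requiring an approval, and run `apply` in a separate job that only fires after merge.

```yaml
# .github/workflows/terraform-plan.yml -- names and paths are illustrative
name: terraform-plan
on:
  pull_request:
    paths: ["infra/**"]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
        working-directory: infra
      # The plan output lands in the job log, so the diff is part of
      # the review, not something run privately on a laptop.
      - run: terraform plan -input=false -no-color
        working-directory: infra
```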
Secrets management that kills long-lived credentials. Move from .env files to AWS Secrets Manager or SSM Parameter Store or Vault. No engineer should need production database credentials on their laptop. Rotate on a schedule. Log access.
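The app-side change is small. A sketch of a startup-time lookup against SSM Parameter Store, with the client injected so it’s testable without AWS credentials; in production you’d pass `boto3.client("ssm")`.

```python
def fetch_secret(name, ssm_client):
    """Read one secret from AWS SSM Parameter Store at process start,
    instead of shipping it around in a .env file.

    `ssm_client` is a boto3 SSM client in production; it's a parameter
    here so the lookup can be tested with a stub.
    """
    resp = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return resp["Parameter"]["Value"]
```

Access now goes through IAM, which means it can be scoped, rotated, and logged. None of that is true of a file on a laptop.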
Change logs that explain what happened. Every deploy traceable to a commit and a PR. Ideally a ticket. When someone asks “what changed at 3pm yesterday?” during an incident, the answer should be a link.
Runbooks for the stuff that breaks most. You know what fails. Write down the steps to fix it. Include the commands, the dashboards to check. A mediocre runbook beats no runbook every time.
Phase 3: Make the platform repeatable
Only after Phase 1 and 2 are solid would I invest in broader platform work.
Reusable Terraform modules so new services don’t start from scratch. A standard module for “ECS service with ALB and CloudWatch alarms” saves hours per service.
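What that looks like at the call site, very roughly. The module name and every input below are hypothetical; the point is that a new service is a handful of values, not a fresh pile of resources.

```hcl
# Illustrative only: "ecs-service" is a hypothetical internal module
# that bundles the task definition, ALB wiring, and CloudWatch alarms.
module "billing_service" {
  source = "../modules/ecs-service"

  name        = "billing"
  image       = "billing:latest"
  cpu         = 256
  memory      = 512
  alarm_topic = aws_sns_topic.oncall.arn
}
```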
Service templates (cookiecutter or even a well-documented repo template) so new services ship with CI/CD and monitoring from the start.
Environment parity. Staging doesn’t need to be identical to prod (skip multi-AZ), but close enough that bugs reproduce the same way.
Automated policy checks. OPA, Checkov, or even simple CI scripts that catch public S3 buckets and overly permissive security groups.
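The “simple CI script” end of that spectrum is genuinely simple. A hand-rolled sketch of one check, over a simplified slice of the shape `aws ec2 describe-security-groups` returns; Checkov and OPA do this sort of thing for you at scale.

```python
def find_open_ingress(security_groups, risky_ports=(22, 3306, 5432)):
    """Flag security-group rules that expose sensitive ports to the
    whole internet. Input mirrors a simplified slice of the EC2
    describe-security-groups response; the port list is illustrative.
    """
    findings = []
    for sg in security_groups:
        for rule in sg.get("IpPermissions", []):
            port = rule.get("FromPort")
            open_to_world = any(r.get("CidrIp") == "0.0.0.0/0"
                                for r in rule.get("IpRanges", []))
            if open_to_world and port in risky_ports:
                findings.append((sg["GroupId"], port))
    return findings  # non-empty -> fail the CI job
```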
Cost visibility. AWS Cost Explorer tags, Infracost on PRs, basic spend alerting. Finding out about a cost spike from the finance team three weeks later is not fun.
This order matters. A lot of teams jump to Phase 3 before Phase 1 is reliable. Layering more automation on unstable foundations just means you fail faster.
What startups should stop doing
Treating platform work like part-time cleanup. If pipeline maintenance keeps losing to feature tickets every sprint, nobody actually owns it. It’s a favor, not a job.
Asking for enterprise-grade reliability without funding it. One person can’t do 99.9% uptime and zero-downtime deploys and full IaC between feature PRs. The math doesn’t work.
Leaving production knowledge in one person’s head. If only one engineer knows how the deploy works or where the secrets live… that’s a single point of failure wearing a hoodie.
Delaying documentation because “we’ll clean it up later.” You won’t. The details will fade. The original engineer will leave. The next incident will take twice as long because nobody can find the context.
Calling every operational gap a tooling gap. “We need a better monitoring tool” is sometimes true. More often, the real issue is nobody’s looking at the monitoring you already have.
Keep it simple, keep it honest
If budget or headcount is tight, don’t pretend you already have mature DevOps. Keep the stack simpler. Narrow the change surface. Automate the highest-risk steps first. Be clear about what’s not solved yet.
Simple and honest beats sophisticated and fragile.
The goal isn’t to look good in architecture diagrams. It’s to make production boring. If a startup wants that, platform work needs real ownership. Not a side quest attached to whoever still has calendar space.