On-call has a bad reputation in software engineering, and often deservedly so. Being paged at 3am for an alert that didn’t need to wake anyone is demoralising. Being on-call for systems you didn’t build, don’t understand, and can’t fix is terrifying. Being paged multiple times a night for weeks is a health risk.
But on-call done well is a powerful practice. It creates direct feedback between the reliability of what you build and the experience of carrying it. When engineers are responsible for their own systems, they ship more reliable systems.
Here’s the version that worked at the European fintech firm.
The Principles That Made It Sustainable
You build it, you run it — but not alone. Each service had a primary on-call owner (the team or individual most familiar with it), but all pages during off-hours had a documented escalation path. The on-call engineer was not expected to know everything — they were expected to triage, handle what they could, and escalate the rest without hesitation.
No alert should be “known noise.” If an alert fires and the response is “oh that, it always fires, just acknowledge it” — that alert should be deleted or fixed. Every alert that fires during off-hours should represent something that genuinely requires human attention. We did a quarterly alert audit: every alert that had fired but been acknowledged without action was reviewed and either fixed, re-thresholded, or deleted.
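The audit itself can be mechanical. A minimal sketch, assuming each page record notes whether the responder did anything beyond acknowledging it (the field names and the 80% noise threshold are illustrative, not our exact tooling):

```python
from collections import Counter

# Quarterly alert audit sketch. Each event records which alert fired and
# whether the responder took any action beyond acknowledging it.
# Field names and the noise threshold are assumptions.

def audit(events, noise_threshold=0.8):
    """Flag alerts that are mostly acknowledged without action."""
    fired = Counter(e["alert"] for e in events)
    no_action = Counter(e["alert"] for e in events if not e["actioned"])
    verdicts = {}
    for alert, total in fired.items():
        noise_ratio = no_action[alert] / total
        # "review" means: fix the underlying issue, adjust the threshold,
        # or delete the alert at the audit meeting.
        verdicts[alert] = "review" if noise_ratio >= noise_threshold else "keep"
    return verdicts

events = [
    {"alert": "HighErrorRate", "actioned": True},
    {"alert": "DiskAlmostFull", "actioned": False},
    {"alert": "DiskAlmostFull", "actioned": False},
]
print(audit(events))  # {'HighErrorRate': 'keep', 'DiskAlmostFull': 'review'}
```

The point of automating the first pass is that the audit meeting starts from a shortlist, not from the full alert catalogue.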
Incidents are free. No blame culture around incidents. Every outage is a system failure, not an individual failure. We ran post-mortems for every significant incident, but the tone was investigative, not inquisitorial. The goal was to understand what happened and prevent it from happening again, not to find who to blame.
On-call load is a product quality metric. We tracked pages per on-call rotation. If the number was high, that was a signal that we were shipping unreliable code, not that the on-call engineer wasn’t handling it well. High page volume was a priority on the engineering roadmap, not a personal failing.
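Tracking that metric needs nothing elaborate. A sketch, assuming pages are logged as (rotation, service) pairs and measured against a per-rotation budget (the budget of five is an illustrative number, not the firm's actual threshold):

```python
from collections import Counter

# On-call load as a product quality metric: pages counted per rotation.
# The page-log shape and the budget are illustrative assumptions.

PAGES_PER_ROTATION_BUDGET = 5

def rotation_load(pages):
    """pages: iterable of (rotation_id, service) -> page count per rotation."""
    return Counter(rotation for rotation, _service in pages)

def over_budget(pages):
    """Rotations whose page count signals unreliable systems, not a struggling engineer."""
    return sorted(rot for rot, n in rotation_load(pages).items()
                  if n > PAGES_PER_ROTATION_BUDGET)

pages = [("2024-W21", "payment-service")] * 7 + [("2024-W22", "order-service")]
print(over_budget(pages))  # ['2024-W21'] -> goes on the roadmap, not on the engineer
```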
The Rotation Design
Rotation length: one week on, several weeks off. One-week rotations kept the on-call burden predictable and bounded; engineers could plan around it. Two-week rotations were long enough to become demoralising when they went badly.
Handoff ritual: every rotation handoff included a 30-minute sync between outgoing and incoming on-call. What was currently in a degraded state, what recent changes to watch, what the current known issues were. This prevented the incoming on-call from being blindsided.
Protected time after a bad night: an engineer who was paged between midnight and 6am got the following morning off — no standups, no meetings, work-from-wherever. This was not a formal policy; it was a norm the team held each other to. Being paged at 3am and then expected to be productive and present at 9am is how you burn people out.
The On-Call Runbooks
Every alerting rule had a corresponding runbook: what the alert means, what the likely causes are, what the diagnostic steps are, and what the resolution steps are. Not a document that assumes you already know the system — a document written for an engineer who has never seen this issue before, at 3am, possibly frightened.
Good runbook structure:
```markdown
## [Alert Name]: ServiceX ErrorRate > 5%

### What this means
ServiceX is failing more than 5% of requests. Users may be experiencing errors.

### Immediate check
1. Open Grafana dashboard: [link]
2. Check: is error rate rising or stable?
3. Check: which endpoint is erroring? (dashboard panel: "Errors by endpoint")

### Common causes and fixes
**Cause: Database connection exhaustion**
Symptom: errors concentrated on write endpoints, db.connections metric > 90
Fix: restart ServiceX pods to reset connections (`kubectl rollout restart deployment/servicex`).
Open a P1 ticket to investigate the connection leak.

**Cause: Upstream ServiceY unavailable**
Symptom: errors on all endpoints, ServiceY health check failing
Fix: check ServiceY status in monitoring. If down, follow the ServiceY runbook.
ServiceX will degrade gracefully (read-only mode) until ServiceY recovers.

### Escalation
If not resolved in 20 minutes: page [ServiceX primary owner]
If ServiceX is fully down: page [engineering lead]
```
Writing runbooks is unglamorous work. Having them makes 3am incidents significantly less frightening.
The Reliability Feedback Loop
The practice that most improved reliability over time: tracking alert frequency per service and reviewing it in planning.
Once a month, we looked at a simple table:
```
Service                 Alerts (last 30d)   P1s   Trend
───────────────────────────────────────────────────────
order-service                           2     0     ↓
pricing-service                        18     2     ↑
notification-service                    1     0     →
payment-service                        34     3     ↑↑
```
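Generating that table is a few lines of code. A sketch, assuming per-service counts for the previous and current 30-day windows (the current counts and P1s match the table above; the previous-window numbers are illustrative assumptions):

```python
# Monthly reliability review sketch. Current counts and P1s match the
# review table; previous-window counts are illustrative assumptions.

def trend(prev, curr):
    """Arrow comparing two consecutive 30-day alert counts."""
    if curr > prev:
        return "↑↑" if curr >= 2 * prev else "↑"
    return "↓" if curr < prev else "→"

def review_rows(stats):
    """stats: {service: (alerts_prev_30d, alerts_curr_30d, p1s)} -> table rows."""
    ordered = sorted(stats.items(), key=lambda kv: -kv[1][1])  # noisiest first
    return [f"{svc:<24}{curr:>4}{p1s:>6}   {trend(prev, curr)}"
            for svc, (prev, curr, p1s) in ordered]

stats = {
    "order-service": (5, 2, 0),
    "pricing-service": (11, 18, 2),
    "notification-service": (1, 1, 0),
    "payment-service": (15, 34, 3),
}
for row in review_rows(stats):
    print(row)
```

Sorting noisiest-first means the planning conversation starts with the services that need engineering time.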
payment-service at 34 alerts and 3 P1s in a month became a reliability project, not just an on-call problem. We allocated engineering time to reducing its alert volume — not because the on-call engineer was struggling, but because the system was clearly unreliable and the team that built it should fix it.
This reframing — “high alert volume is a product quality issue” — changed the conversation. It stopped being “on-call is hard” and started being “we have work to do on these specific systems.”
The Things That Still Went Wrong
Ambiguous escalation paths. A few incidents dragged because it wasn’t clear who to escalate to. The escalation contacts in runbooks were sometimes out of date. We fixed this by making the escalation list a structured document with named individuals, kept current, reviewed at each rotation handoff.
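Keeping the list current can be enforced rather than remembered. A sketch, assuming each escalation entry records when it was last verified (the names, fields, and 30-day staleness window are all illustrative):

```python
from datetime import date, timedelta

# Escalation-list freshness check, run at each rotation handoff.
# Entry shape and the 30-day window are illustrative assumptions.

MAX_AGE = timedelta(days=30)

def stale_contacts(escalations, today):
    """Entries not verified within MAX_AGE -> fix them before the handoff ends."""
    return [e["service"] for e in escalations
            if today - e["verified"] > MAX_AGE]

escalations = [
    {"service": "servicex", "contact": "alice", "verified": date(2024, 5, 20)},
    {"service": "servicey", "contact": "bob",   "verified": date(2024, 2, 1)},
]
print(stale_contacts(escalations, today=date(2024, 6, 1)))  # ['servicey']
```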
Alert fatigue from flaky tests. Our integration test suite ran in production for some services (monitoring tests). When the test infrastructure was flaky, it generated false alerts that diluted attention from real ones. The fix was to route test alerts to a separate channel from operational alerts.
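The separation can be as simple as routing on a label. A sketch, assuming alerts carry a `source` label distinguishing synthetic-test traffic from operational signals (the label and channel names are illustrative):

```python
# Routing sketch: synthetic-test alerts go to a channel reviewed during
# working hours; everything else pages the on-call. Label and channel
# names are illustrative assumptions.

def route(alert):
    labels = alert.get("labels", {})
    if labels.get("source") == "synthetic-test":
        return "test-alerts-channel"  # flaky test infra never pages anyone
    return "oncall-pager"             # real operational signal

print(route({"labels": {"source": "synthetic-test"}}))  # test-alerts-channel
print(route({"labels": {"source": "prod"}}))            # oncall-pager
```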
Off-hours deploys creating incidents. Our deployment process allowed anyone to deploy at any time. Several incidents were self-inflicted: a deploy went out on a Friday evening and something broke over the weekend. The fix was simple: deploying during off-hours required approval from the current on-call.
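The approval gate is a small policy check. A sketch, assuming business hours of 09:00–18:00 Monday to Friday (the window and function names are illustrative, not the firm's exact rules):

```python
from datetime import datetime

# Off-hours deploy gate sketch. The working-hours window is an
# illustrative assumption, not the firm's actual policy.

WORK_START, WORK_END = 9, 18  # local business hours

def is_off_hours(now: datetime) -> bool:
    weekend = now.weekday() >= 5  # Saturday (5) or Sunday (6)
    return weekend or not (WORK_START <= now.hour < WORK_END)

def may_deploy(now: datetime, oncall_approved: bool) -> bool:
    """Off-hours deploys require sign-off from the current on-call."""
    return not is_off_hours(now) or oncall_approved

friday_evening = datetime(2024, 3, 1, 20, 0)  # 2024-03-01 was a Friday
print(may_deploy(friday_evening, oncall_approved=False))  # False
print(may_deploy(friday_evening, oncall_approved=True))   # True
```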
On-call is not inherently bad. It’s a mechanism for creating accountability and a fast feedback loop between what you ship and what happens in production. The practices that make it sustainable — clear escalation, no-blame culture, runbooks, alert hygiene — are the same practices that make systems more reliable. The goal isn’t to make on-call easier to tolerate. It’s to make the systems good enough that on-call is rarely eventful.