April 16, 2025
On-Call Without Becoming a Firefighter
How to think about on-call in a sustainable way, with better signals, process, and learning instead of reactive heroics.
Andrews Ribeiro
Founder & Engineer
6 min · Intermediate · Systems
The problem
In some places, on-call means one simple thing:
- someone is responsible for responding to incidents when they happen
And in other places, on-call means this:
- the pager goes off because of noise
- the alert has no context
- another night is lost to a recurring incident
- the system depends on a few specific people
- everyone feels the system is always close to breaking
When that happens, the team starts treating on-call as if on-call itself were the problem.
But most of the time, the real problem is somewhere else:
- bad alerts
- fragile systems
- no runbook
- unclear ownership
- no structural correction after incidents
In other words:
the team becomes firefighters because the rest of the operation is already on fire all the time.
Mental model
Think about it like this:
Healthy on-call is not about enduring more incidents. It is about responding well today and reducing the chance of responding to the same incident tomorrow.
That definition helps a lot.
Because it moves the topic out of personal toughness and into reliability.
Good on-call does not depend on one heroic person.
It depends on systems, process, and learning that make the response:
- clearer
- faster
- less chaotic
- less repetitive
Breaking it down
Bad on-call is usually a symptom of bad operations
This is the first important point.
If on-call is hurting too much, it is worth asking:
- do the alerts represent real problems?
- is there enough context to respond?
- do the same incidents return without correction?
- can anyone on the team mitigate, or does the response depend on tribal knowledge?
When the answers to those questions are weak, the rotation turns into predictable overload.
That is not bad luck.
It is a badly designed operation.
Good alerts protect human attention
A lot of on-call pain starts here.
If everything alerts:
- nothing gets prioritized
- people stop trusting the signal
- noise eats energy
Good alerts usually answer:
- does this need human action right now?
- what impact does it suggest?
- what minimum context does the responder need to start investigating?
If the alert only says “something seems wrong,” it pushes the whole investigation onto the on-call person.
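As a sketch of what that minimum context can look like in practice (the service name, fields, and numbers below are all hypothetical, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Minimum context a page might carry; field names are illustrative."""
    summary: str       # what is wrong, in one line
    impact: str        # what users are likely experiencing
    first_look: str    # where to start investigating
    runbook_url: str   # link to safe mitigation steps

page = Page(
    summary="checkout p99 latency above 800ms for 10 minutes",
    impact="~5% of checkouts timing out",
    first_look="latency dashboard, then recent deploys",
    runbook_url="https://wiki.example/runbooks/checkout-latency",
)
# A page that already answers "act now? what impact? where to start?"
# saves the responder the first ten minutes of every incident.
```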
A runbook is not bureaucracy. It is panic reduction
A missing or weak runbook makes every incident feel brand new.
That is expensive.
Especially at night or under pressure.
A useful runbook does not have to be huge.
But it should help with things like:
- where to look first
- how to mitigate safely
- what has been tried before
- when to escalate
- when rollback makes sense
It does not replace judgment.
But it reduces the time wasted reinventing the start of the response.
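A minimal sketch of those fields, assuming a hypothetical checkout-service. A real runbook would be a short document per service, but the structure is what matters:

```python
# A hypothetical runbook entry expressed as data for compactness; every
# value here is made up to show the shape, not to prescribe content.
RUNBOOK = {
    "service": "checkout-service",  # hypothetical name
    "where_to_look_first": [
        "latency dashboard by endpoint",
        "recent deploys and feature flag changes",
    ],
    "safe_mitigations": [
        "scale out replicas",
        "enable degraded mode (serve cached results)",
    ],
    "tried_before": [
        "restart alone did not help; warming the cache did",
    ],
    "escalate_when": "mitigation has no effect after 15 minutes",
    "rollback_when": "the incident started right after a deploy",
}
```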
Good on-call closes the loop with postmortem and correction
If the team responds to an incident and then moves on without fixing the weakness, the pager will come back.
This is the center of the problem.
Sustainable on-call depends on:
- reducing recurrence
- improving detection
- documenting the response better
- simplifying fragile operations
Without that, on-call becomes a queue of repeated pain.
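One concrete way to make recurrence visible is to count pages by alert fingerprint. A minimal sketch, assuming the paging tool can export its history as records with a stable fingerprint per alert:

```python
from collections import Counter

def recurring_pages(pages: list[dict], min_count: int = 3) -> list[tuple[str, int]]:
    """Flag alerts that paged at least `min_count` times in the window.

    `pages` is a hypothetical export from the paging tool: one dict per
    page, each with a "fingerprint" identifying the alert that fired.
    """
    counts = Counter(p["fingerprint"] for p in pages)
    return [(fp, n) for fp, n in counts.most_common() if n >= min_count]

# Each repeat offender becomes explicit improvement work: better detection,
# a structural fix, or at minimum a better-documented response.
```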
Heroics usually hide a system deficiency
Every team knows someone who:
- knows the shortcuts
- finds the problem quickly
- saves the night
That looks impressive in the short term.
But it may also hide a dangerous dependency.
If the system is only operable with that person, the team is not safe.
A mature interview answer recognizes this.
It does not romanticize the hero.
It asks how the team reduces the need for one.
On-call is also about human load
This point is not “soft.” It is operational.
Chronically bad on-call creates:
- fatigue
- loss of context
- worse response under pressure
- turnover
- more human error
So talking about healthy on-call also means talking about:
- a viable rotation
- clear escalation
- sustainable volume
- continuous improvement to reduce load
That is not a secondary detail.
It is decent operational engineering.
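Even a back-of-envelope calculation makes the load concrete. A sketch with made-up numbers:

```python
# Back-of-envelope load check; every number below is an assumption you
# would replace with data from your own paging tool.
pages_per_week = 12       # total pages hitting the rotation
night_pages = 5           # pages outside working hours
minutes_per_page = 45     # response time plus context-switch cost

hours_lost = pages_per_week * minutes_per_page / 60
print(f"~{hours_lost:.0f}h/week of interrupts, {night_pages} broken nights")
# If one rotation absorbs a full working day of interrupts plus broken
# sleep, the fix is reducing noise and recurrence, not adding stamina.
```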
In interviews, a good answer combines reaction and prevention
Many people answer on-call questions by focusing only on:
- reacting fast
- investigating well
- escalating when necessary
All of that matters.
But a stronger answer takes the next step:
- how to avoid repetition
- how to improve the alert
- how to reduce human dependency
- how to make the operation more sustainable
That is what separates firefighting from reliability engineering.
Simple example
Imagine a service that triggers a latency alert every night, but there is almost never real user impact.
Weak answer:
“While on call, I would monitor it more closely and try to respond fast when it fires.”
That still accepts the bad system as given.
Better answer:
“If the alert fires often without requiring real action, I would treat that as a reliability problem in the on-call process itself. In the short term I would respond and confirm impact. In the medium term I would review the alert condition, the context it sends, and the threshold it uses, because on-call cannot depend on waking people up for recurring noise. If the team keeps receiving useless pages, the operation is wasting human attention and making the response worse for the incident that actually matters.”
That answer works because it shows:
- immediate response
- structural improvement
- respect for operational cost
- a sustainability mindset
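The medium-term fix in that answer can be made concrete. A minimal sketch of a revised paging condition, with purely illustrative thresholds:

```python
def should_page_latency(p99_ms: float, affected_users: int, sustained_min: int) -> bool:
    # Illustrative revision of the noisy nightly alert: page a human only
    # when latency is high, users are actually affected, and the condition
    # persists. All three thresholds are assumptions to tune against real
    # incident data, not recommended values.
    return p99_ms > 800 and affected_users > 50 and sustained_min >= 15
```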
Common mistakes
- treating on-call like a heroism test
- accepting bad alerts as a natural part of life
- focusing only on reaction and never on reducing recurrence
- depending on a few people to answer difficult incidents
- ignoring human cost as if it did not affect operational quality
How a senior thinks
Engineers who are mature in operations often think like this:
“Every page is an expensive interruption. If it does not correspond to a real problem, or if it keeps coming back for the same reason, the on-call system is failing.”
That lens is very strong.
Because it makes you respect human attention as a finite resource.
Seniority here is not about looking calm in chaos.
It is about building an environment where chaos appears less often and is more manageable when it does appear.
What the interviewer wants to see
When this topic comes up, the evaluator usually wants to understand whether you:
- see on-call as a system, not as inevitable suffering
- know how to combine incident response with structural improvement
- understand the role of alerts, runbooks, and postmortems
- think about operational sustainability, not only about putting out fires
- avoid romanticizing a heavy rotation as proof of maturity
A strong answer usually shows:
- how to respond well when the incident happens
- how to reduce on-call load and noise
- how to turn recurrence into improvement work
- how to make the team less dependent on heroes
If those elements appear, the answer is already much stronger.
Healthy on-call is not a team that suffers quietly. It is a team that learns enough to suffer less.
When the rotation becomes a routine of fires, the problem stopped being only response. It became system design.
Quick summary
What to keep in your head
- Healthy on-call depends less on individual heroics and more on good alerts, useful runbooks, and less fragile systems.
- If the rotation keeps fighting the same fires every week, the problem stopped being only operational and became structural.
- Responding to incidents is only part of on-call. Learning and reducing recurrence is what stops the team from becoming firefighters.
- In interviews, a strong answer shows how you think about sustainability, not only reaction.
Practice checklist
Use this when you answer
- Can I explain what separates healthy on-call from chaotic on-call?
- Do I know how alerts, runbooks, and postmortems reduce operational load?
- Can I show that on-call is not only reaction, but also closing the learning loop?
- Can I answer without romanticizing suffering or ignoring operational responsibility?