April 16, 2025
On-Call Without Becoming a Firefighter
How to think about on-call in a sustainable way, with better signals, process, and learning instead of reactive heroics.
Andrews Ribeiro
Founder & Engineer
6 min · Intermediate · Systems
The problem
In some places, on-call means one simple thing:
- someone is responsible for responding to incidents when they happen
And in other places, on-call means this:
- the pager goes off because of noise
- the alert has no context
- another night is lost to a recurring incident
- the system depends on a few specific people
- everyone feels the system is always close to breaking
When that happens, the team starts treating on-call as if on-call itself were the problem.
But most of the time, the real problem is somewhere else:
- bad alerts
- fragile systems
- no runbook
- unclear ownership
- no structural correction after incidents
In other words:
the team becomes firefighters because the rest of the operation is already on fire all the time.
Mental model
Think about it like this:
Healthy on-call is not about enduring more incidents. It is about responding well today and reducing the chance of responding to the same incident tomorrow.
That definition helps a lot.
Because it moves the topic out of personal toughness and into reliability.
Good on-call does not depend on one heroic person.
It depends on systems, process, and learning that make the response:
- clearer
- faster
- less chaotic
- less repetitive
Breaking it down
Bad on-call is usually a symptom of bad operations
This is the first important point.
If on-call is hurting too much, it is worth asking:
- do the alerts represent real problems?
- is there enough context to respond?
- do the same incidents return without correction?
- can anyone on the team mitigate, or does the response depend on tribal knowledge?
When the answers to those questions are weak, the rotation turns into predictable overload.
That is not bad luck.
It is a badly designed operation.
Good alerts protect human attention
A lot of on-call pain starts here.
If everything alerts:
- nothing gets prioritized
- people stop trusting the signal
- noise eats energy
Good alerts usually answer:
- does this need human action right now?
- what impact does it suggest?
- what minimum context does the responder need to start investigating?
If the alert only says “something seems wrong,” it pushes the whole investigation onto the on-call person.
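As a sketch of what that minimum context can look like in practice (the service name, fields, and numbers below are all hypothetical, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Minimum context a page might carry; field names are illustrative."""
    summary: str       # what is wrong, in one line
    impact: str        # what users are likely experiencing
    first_look: str    # where to start investigating
    runbook_url: str   # link to safe mitigation steps

page = Page(
    summary="checkout p99 latency above 800ms for 10 minutes",
    impact="~5% of checkouts timing out",
    first_look="latency dashboard, then recent deploys",
    runbook_url="https://wiki.example/runbooks/checkout-latency",
)
# A page that already answers "act now? what impact? where to start?"
# saves the responder the first ten minutes of every incident.
```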
A runbook is not bureaucracy. It is panic reduction
A missing or weak runbook makes every incident feel brand new.
That is expensive.
Especially at night or under pressure.
A useful runbook does not have to be huge.
But it should help with things like:
- where to look first
- how to mitigate safely
- what has been tried before
- when to escalate
- when rollback makes sense
It does not replace judgment.
But it reduces the time wasted reinventing the start of the response.
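A minimal sketch of those fields, assuming a hypothetical checkout-service. A real runbook would be a short document per service, but the structure is what matters:

```python
# A hypothetical runbook entry expressed as data for compactness; every
# value here is made up to show the shape, not to prescribe content.
RUNBOOK = {
    "service": "checkout-service",  # hypothetical name
    "where_to_look_first": [
        "latency dashboard by endpoint",
        "recent deploys and feature flag changes",
    ],
    "safe_mitigations": [
        "scale out replicas",
        "enable degraded mode (serve cached results)",
    ],
    "tried_before": [
        "restart alone did not help; warming the cache did",
    ],
    "escalate_when": "mitigation has no effect after 15 minutes",
    "rollback_when": "the incident started right after a deploy",
}
```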
Good on-call closes the loop with postmortem and correction
If the team responds to an incident and then moves on without fixing the weakness, the pager will come back.
This is the center of the problem.
Sustainable on-call depends on:
- reducing recurrence
- improving detection
- documenting the response better
- simplifying fragile operations
Without that, on-call becomes a queue of repeated pain.
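One concrete way to make recurrence visible is to count pages by alert fingerprint. A minimal sketch, assuming the paging tool can export its history as records with a stable fingerprint per alert:

```python
from collections import Counter

def recurring_pages(pages: list[dict], min_count: int = 3) -> list[tuple[str, int]]:
    """Flag alerts that paged at least `min_count` times in the window.

    `pages` is a hypothetical export from the paging tool: one dict per
    page, each with a "fingerprint" identifying the alert that fired.
    """
    counts = Counter(p["fingerprint"] for p in pages)
    return [(fp, n) for fp, n in counts.most_common() if n >= min_count]

# Each repeat offender becomes explicit improvement work: better detection,
# a structural fix, or at minimum a better-documented response.
```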
Heroics usually hide a system deficiency
Every team knows someone who:
- knows the shortcuts
- finds the problem quickly
- saves the night
That looks impressive in the short term.
But it may also hide a dangerous dependency.
If the system is only operable with that person, the team is not safe.
A mature interview answer recognizes this.
It does not romanticize the hero.
It asks how the team reduces the need for one.
On-call is also about human load
This point is not “soft.” It is operational.
Chronically bad on-call creates:
- fatigue
- loss of context
- worse response under pressure
- turnover
- more human error
So talking about healthy on-call also means talking about:
- a viable rotation
- clear escalation
- sustainable volume
- continuous improvement to reduce load
That is not a secondary detail.
It is decent operational engineering.
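Even a back-of-envelope calculation makes the load concrete. A sketch with made-up numbers:

```python
# Back-of-envelope load check; every number below is an assumption you
# would replace with data from your own paging tool.
pages_per_week = 12       # total pages hitting the rotation
night_pages = 5           # pages outside working hours
minutes_per_page = 45     # response time plus context-switch cost

hours_lost = pages_per_week * minutes_per_page / 60
print(f"~{hours_lost:.0f}h/week of interrupts, {night_pages} broken nights")
# If one rotation absorbs a full working day of interrupts plus broken
# sleep, the fix is reducing noise and recurrence, not adding stamina.
```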
In interviews, a good answer combines reaction and prevention
Many people answer on-call questions by focusing only on:
- reacting fast
- investigating well
- escalating when necessary
All of that matters.
But a stronger answer takes the next step:
- how to avoid repetition
- how to improve the alert
- how to reduce human dependency
- how to make the operation more sustainable
That is what separates firefighting from reliability engineering.
Simple example
Imagine a service that triggers a latency alert every night, but there is almost never real user impact.
Weak answer:
“While on call, I would monitor it more closely and try to respond fast when it fires.”
That still accepts the bad system as given.
Better answer:
“If the alert fires often without requiring real action, I would treat that as a reliability problem in the on-call process itself. In the short term I would respond and confirm impact. In the medium term I would review the alert condition, the context it sends, and the threshold it uses, because on-call cannot depend on waking people up for recurring noise. If the team keeps receiving useless pages, the operation is wasting human attention and making the response worse for the incident that actually matters.”
That answer works because it shows:
- immediate response
- structural improvement
- respect for operational cost
- a sustainability mindset
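The medium-term fix in that answer can be made concrete. A minimal sketch of a revised paging condition, with purely illustrative thresholds:

```python
def should_page_latency(p99_ms: float, affected_users: int, sustained_min: int) -> bool:
    # Illustrative revision of the noisy nightly alert: page a human only
    # when latency is high, users are actually affected, and the condition
    # persists. All three thresholds are assumptions to tune against real
    # incident data, not recommended values.
    return p99_ms > 800 and affected_users > 50 and sustained_min >= 15
```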
Common mistakes
- treating on-call like a heroism test
- accepting bad alerts as a natural part of life
- focusing only on reaction and never on reducing recurrence
- depending on a few people to answer difficult incidents
- ignoring human cost as if it did not affect operational quality
How a senior thinks
Engineers who are mature in operations often think like this:
“Every page is an expensive interruption. If it does not correspond to a real problem, or if it keeps coming back for the same reason, the on-call system is failing.”
That lens is very strong.
Because it makes you respect human attention as a finite resource.
Seniority here is not about looking calm in chaos.
It is about building an environment where chaos appears less often and is more manageable when it does appear.
What the interviewer wants to see
When this topic comes up, the evaluator usually wants to understand whether you:
- see on-call as a system, not as inevitable suffering
- know how to combine incident response with structural improvement
- understand the role of alerts, runbooks, and postmortems
- think about operational sustainability, not only about putting out fires
- avoid romanticizing a heavy rotation as proof of maturity
A strong answer usually shows:
- how to respond well when the incident happens
- how to reduce on-call load and noise
- how to turn recurrence into improvement work
- how to make the team less dependent on heroes
If those elements appear, the answer is already much stronger.
Healthy on-call is not a team that suffers quietly. It is a team that learns enough to suffer less.
When the rotation becomes a routine of fires, the problem stopped being only response. It became system design.
Quick summary
What to keep in your head
- Healthy on-call depends less on individual heroics and more on good alerts, useful runbooks, and less fragile systems.
- If the rotation keeps fighting the same fires every week, the problem stopped being only operational and became structural.
- Responding to incidents is only part of on-call. Learning and reducing recurrence is what stops the team from becoming firefighters.
- In interviews, a strong answer shows how you think about sustainability, not only reaction.
Practice checklist
Use this when you answer
- Can I explain what separates healthy on-call from chaotic on-call?
- Do I know how alerts, runbooks, and postmortems reduce operational load?
- Can I show that on-call is not only reaction, but also closing the learning loop?
- Can I answer without romanticizing suffering or ignoring operational responsibility?