
On-Call Without Becoming a Firefighter

How to think about on-call in a sustainable way, with better signals, process, and learning instead of reactive heroics.

Andrews Ribeiro

Founder & Engineer

The problem

In some places, on-call means one simple thing:

  • someone is responsible for responding to incidents when they happen

And in other places, on-call means this:

  • the pager goes off because of noise
  • the alert has no context
  • another night is lost to a recurring incident
  • the system depends on a few specific people
  • everyone feels the system is always close to breaking

When that happens, the team starts treating on-call as if on-call itself were the problem.

But most of the time, the real problem is somewhere else:

  • bad alerts
  • fragile systems
  • no runbook
  • unclear ownership
  • no structural correction after incidents

In other words:

the team becomes firefighters because the rest of the operation is already on fire all the time.

Mental model

Think about it like this:

Healthy on-call is not about enduring more incidents. It is about responding well today and reducing the chance of responding to the same thing tomorrow.

That definition helps a lot, because it moves the topic away from personal toughness and into reliability.

Good on-call does not depend on one heroic person.

It depends on systems, process, and learning that make the response:

  • clearer
  • faster
  • less chaotic
  • less repetitive

Breaking it down

Bad on-call is usually a symptom of bad operations

This is the first important point.

If on-call is hurting too much, it is worth asking:

  • do the alerts represent real problems?
  • is there enough context to respond?
  • do the same incidents return without correction?
  • can everyone mitigate, or does it depend on tribal memory?

When the answers to those questions are weak, the rotation turns into predictable overload.

That is not bad luck.

It is a badly designed operation.

Good alerts protect human attention

A lot of on-call pain starts here.

If everything alerts:

  • nothing gets prioritized
  • people stop trusting the signal
  • noise eats energy

Good alerts usually answer:

  • does this need human action right now?
  • what impact does it suggest?
  • what is the minimum context someone needs to start investigating?

If the alert only says “something seems wrong,” it pushes the whole investigation onto the on-call person.
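Those three questions can be encoded directly into how alerts are routed. Here is a minimal sketch in Python, where the payload fields and the `route` function are illustrative assumptions, not a real paging API:

```python
from dataclasses import dataclass

# Illustrative alert payload; field names are assumptions, not a real schema.
@dataclass
class Alert:
    name: str
    needs_human_action: bool  # does this need human action right now?
    user_impact: str          # what impact does it suggest?
    runbook_url: str          # minimum context to start the investigation

def route(alert: Alert) -> str:
    """Page a human only when the alert is actionable and names its impact;
    everything else becomes a ticket reviewed during working hours."""
    if alert.needs_human_action and alert.user_impact:
        return "page"
    return "ticket"

noisy = Alert("latency-p99", needs_human_action=False,
              user_impact="", runbook_url="")
print(route(noisy))  # -> ticket
```

An alert that cannot fill in those fields is the "something seems wrong" case: it is not ready to wake anyone up.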

A runbook is not bureaucracy. It is panic reduction

A missing or weak runbook makes every incident feel brand new.

That is expensive.

Especially at night or under pressure.

A useful runbook does not have to be huge.

But it should help with things like:

  • where to look first
  • how to mitigate safely
  • what was already tried before
  • when to escalate
  • when rollback makes sense

It does not replace judgment.

But it reduces the time wasted reinventing the start of the response.
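A runbook that answers those bullets can fit on one page, or even in one small structure. A sketch, with the service, alert, and every entry purely illustrative:

```python
# A minimal runbook sketch as data; all names and entries are illustrative
# assumptions, not a real incident history.
RUNBOOK = {
    "service": "checkout",
    "alert": "latency-p99",
    "look_first": ["p99 latency dashboard", "recent deploys", "error logs"],
    "mitigate_safely": ["scale out the web tier", "shed non-critical traffic"],
    "already_tried": ["restarting workers alone did not help"],
    "escalate_when": "user impact is confirmed for more than 15 minutes",
    "rollback_when": "the incident started shortly after a deploy",
}

# Under pressure, the responder reads answers instead of reinventing them.
for step in RUNBOOK["look_first"]:
    print("check:", step)
```

The format matters less than the fact that the five expensive questions are answered before the pager goes off.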

Good on-call closes the loop with postmortem and correction

If the team responds to an incident and then moves on without fixing the weakness, the pager will come back.

This is the center of the problem.

Sustainable on-call depends on:

  • reducing recurrence
  • improving detection
  • documenting the response better
  • simplifying fragile operations

Without that, on-call becomes a queue of repeated pain.
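Closing that loop starts with knowing which pages repeat. A tiny sketch that ranks alerts by how often they fired without requiring action, over a hypothetical page log:

```python
from collections import Counter

# Hypothetical page log: (alert_name, human_action_was_needed) pairs.
pages = [
    ("latency-p99", False),
    ("disk-full", True),
    ("latency-p99", False),
    ("latency-p99", False),
    ("cert-expiry", True),
]

# Alerts that page repeatedly without needing action are the noise to fix
# first: tune the condition, raise the threshold, or delete the alert.
noise = Counter(name for name, acted in pages if not acted)
for name, count in noise.most_common():
    print(f"{name}: {count} no-action pages")
```

Even this much turns "the pager hurts" into a ranked list of improvement work.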

Heroics usually hide a system deficiency

Every team knows someone who:

  • knows the shortcuts
  • finds the problem quickly
  • saves the night

That looks impressive in the short term.

But it may also hide a dangerous dependency.

If the system is only operable with that person, the team is not safe.

A mature interview answer recognizes this.

It does not romanticize the hero.

It asks how the team reduces the need for one.

On-call is also about human load

This point is not “soft.” It is operational.

Chronically bad on-call creates:

  • fatigue
  • loss of context
  • worse response under pressure
  • turnover
  • more human error

So talking about healthy on-call also means talking about:

  • a viable rotation
  • clear escalation
  • sustainable volume
  • continuous improvement to reduce load

That is not a secondary detail.

It is decent operational engineering.

In interviews, a good answer combines reaction and prevention

Many people answer on-call questions by focusing only on:

  • reacting fast
  • investigating well
  • escalating when necessary

All of that matters.

But a stronger answer takes the next step:

  • how to avoid repetition
  • how to improve the alert
  • how to reduce human dependency
  • how to make the operation more sustainable

That is what separates firefighting from reliability engineering.

Simple example

Imagine a service that triggers a latency alert every night, but there is almost never real user impact.

Weak answer:

“While on call, I would monitor it more closely and try to respond fast when it fires.”

That still accepts the bad system as given.

Better answer:

“If the alert fires often without requiring real action, I would treat that as a reliability problem in the on-call process itself. In the short term I would respond and confirm impact. In the medium term I would review the alert condition, the context it sends, and the threshold it uses, because on-call cannot depend on waking people up for recurring noise. If the team keeps receiving useless pages, the operation is wasting human attention and making the response worse for the incident that actually matters.”

That answer works because it shows:

  • immediate response
  • structural improvement
  • respect for operational cost
  • a sustainability mindset

Common mistakes

  • treating on-call like a heroism test
  • accepting bad alerts as a natural part of life
  • focusing only on reaction and never on reducing recurrence
  • depending on a few people to answer difficult incidents
  • ignoring human cost as if it did not affect operational quality

How a senior thinks

Engineers who are mature in operations often think like this:

“Every page is an expensive interruption. If it does not correspond to a real problem, or if it keeps coming back for the same reason, the on-call system is failing.”

That lens is very strong.

Because it makes you respect human attention as a finite resource.

Seniority here is not looking calm in chaos.

It is building an environment where chaos appears less often and is more manageable when it does appear.

What the interviewer wants to see

When this topic comes up, the evaluator usually wants to understand whether you:

  • see on-call as a system, not as inevitable suffering
  • know how to combine incident response with structural improvement
  • understand the role of alerts, runbooks, and postmortems
  • think about operational sustainability, not only about putting out fires
  • avoid romanticizing a heavy rotation as proof of maturity

A strong answer usually shows:

  1. how to respond well when the incident happens
  2. how to reduce on-call load and noise
  3. how to turn recurrence into improvement work
  4. how to make the team less dependent on heroes

If that appears, the answer already gets much stronger.

Healthy on-call is not a team that suffers quietly. It is a team that learns enough to suffer less.

When the rotation becomes a routine of fires, the problem is no longer just response. It is system design.

