April 2 2025
A/B Tests for Engineers: How to Experiment Without Pretending Perfect Science
How to think about product experiments in a way that is useful for engineering, without treating A/B tests like an empty statistical ritual or a magic truth button.
Andrews Ribeiro
Founder & Engineer
5 min · Intermediate · Thinking
The problem
When the topic is experimentation, teams usually get it wrong in one of two directions.
One side says:
“let’s run an A/B test for every change”
The other says:
“this is too complex, let’s just ship it”
Both can be bad.
Experimentation is not a maturity ritual.
It is a tool for reducing uncertainty in a specific kind of decision.
If you use it outside that context, you only create delay with the appearance of method.
Mental model
Think about it like this:
a good experiment exists to compare plausible hypotheses under sufficiently controlled conditions.
Three parts matter:
- hypothesis
- variation
- interpretation
If one of those three is weak, the whole test loses value.
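One way to keep the three parts honest is to write them down before any experiment code ships. A minimal sketch (the `Experiment` class and all field names are illustrative, not from any specific framework):

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Forces hypothesis, variation, and interpretation to exist up front."""
    hypothesis: str                # what we are trying to learn
    variants: dict[str, str]       # variant name -> the single change it isolates
    primary_metric: str            # how the result will be interpreted
    guardrails: list[str] = field(default_factory=list)


upgrade_test = Experiment(
    hypothesis="Highlighting the recommended plan increases paid conversion",
    variants={"A": "neutral cards", "B": "one plan visually emphasized"},
    primary_metric="completed_upgrade",
    guardrails=["early_cancellation", "billing_tickets", "checkout_errors"],
)
```

If you cannot fill in the `hypothesis` field in one sentence, the weakness is already visible before the first user is allocated.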
Breaking the problem down
Start with the hypothesis, not the tool
Bad question:
- “can we run an A/B test on this?”
Better question:
- “what exactly are we trying to learn?”
Examples of hypotheses:
- reducing onboarding steps increases activation
- changing the order of plans increases conversion
- showing feedback earlier reduces abandonment
Without a clear hypothesis, the experiment becomes a well-instrumented lottery.
Variants need to isolate the relevant change
This is a common mistake.
The team changes:
- copy
- layout
- screen order
- backend rule
- loading time
all in the same experiment.
Then they have no idea what caused the result.
If you really want to learn, you need to reduce what is being changed.
You will not always be able to do that perfectly.
But if the variant is a whole bundle of changes, the test is already weaker than it looks.
A primary metric without guardrails becomes a trap
If the only question is “did conversion go up?”, the experiment is incomplete.
You still need to look at things like:
- errors
- cancellations
- latency
- support load
- later retention
Otherwise the team improves the top of the funnel and pushes the problem downstream.
Not every context supports a serious test
Sometimes traffic is low.
Sometimes the change is mostly operational.
Sometimes the feature depends on a few large customers.
Sometimes behavior varies too much by segment.
In those cases, feigning scientific rigor can be worse than admitting the limitation.
The better answer may be:
- gradual rollout
- observational measurement
- guardrail tracking
- complementary qualitative analysis
That is not methodological weakness.
It is honesty about the context.
Experimentation also has an engineering cost
A lot of people ignore that.
To test properly, you need:
- segmentation
- coherent allocation
- stable tracking
- per-variant readouts
- care with rollback and exposure
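"Coherent allocation" in that list usually means deterministic bucketing: the same user always sees the same variant, and different experiments split independently. A common way to get that is hashing the user id together with the experiment name (a sketch, not any particular platform's API):

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministic allocation: stable per user, independent per experiment.

    Mixing the experiment name into the hash keeps one experiment's split
    uncorrelated with another's.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The assignment is stable across calls, with no per-user state to store:
assert assign_variant("user-42", "upgrade-flow") == assign_variant("user-42", "upgrade-flow")
```

Stable assignment is also what makes "per-variant readouts" trustworthy: if allocation flickers, exposure logs stop meaning anything.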
If the change is small and reversible, sometimes the cost of the experiment does not pay for itself.
The point is not to be anti-test.
The point is to recognize that experimentation also consumes product and engineering capacity.
Simple example
Imagine an upgrade flow.
Hypothesis:
- highlighting the recommended plan increases paid conversion
A reasonable experiment:
- variant A: neutral cards
- variant B: one plan with visual emphasis and a short explanation
- primary metric: completed upgrade
- guardrails: cancellation soon after upgrade, billing tickets, checkout errors
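For a binary primary metric like "completed upgrade", the read can be a plain two-proportion z-test, with no dependencies beyond the standard library. The numbers below are invented to show the shape of the readout:

```python
from math import erf, sqrt


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test on conversion rates.

    Returns (absolute lift of B over A, p-value under the normal approximation).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, p_value


# 6.0% vs 7.5% conversion on 2000 users per arm:
lift, p = two_proportion_z(conv_a=120, n_a=2000, conv_b=150, n_b=2000)
# lift = +1.5 points, p ≈ 0.06: suggestive, but not conclusive at alpha = 0.05.
```

That last line is exactly the "honest decision if the result is inconclusive" situation: the data leans one way without settling the question.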
Now imagine a bad experiment:
- you change the emphasis
- you change the displayed price
- you change the order of plans
- you change the CTA copy
If conversion goes up, you do not know what moved it.
If it goes down, you do not know either.
You spent energy to learn very little.
What usually goes wrong
- Running an experiment without an explicit hypothesis.
- Testing too many changes at once and calling it a comparison.
- Looking only at the main metric and ignoring guardrails.
- Running a test when volume does not support even a minimal read.
- Confusing gradual rollout with controlled experimentation.
- Leaving a variant live for too long just because nobody wanted to end the discussion.
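The volume point above can be checked with arithmetic before anyone writes experiment code. A standard normal-approximation formula gives a rough per-variant sample size for detecting an absolute lift in a conversion rate (defaults here assume two-sided alpha = 0.05 and power = 0.80):

```python
from math import ceil


def min_sample_per_variant(baseline: float, lift: float,
                           z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Rough per-variant n to detect an absolute lift in a conversion rate,
    using the common pooled-variance approximation."""
    p_bar = baseline + lift / 2
    n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / lift ** 2
    return ceil(n)


# Detecting a 1-point lift on a 5% baseline needs thousands of users per arm:
n = min_sample_per_variant(baseline=0.05, lift=0.01)
```

If your flow sees a few hundred users a week, that number alone tells you the A/B test will not produce even a minimal read in reasonable time, and a rollout with observation is the more honest plan.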
How someone more senior thinks
A more mature engineer usually asks:
- is an experiment really worth it here?
- what are we trying to learn, not just prove?
- what needs to stay constant for the result to remain useful?
- what would be an honest decision if the result is inconclusive?
That last question is great.
Because a lot of tests do not end in absolute truth, but in partial evidence.
And maturity shows up exactly there:
in the ability to decide without pretending the data supports more certainty than it actually does.
Interview angle
This topic can show up in questions like:
- “how would you validate this change?”
- “would you run an A/B test or a rollout?”
- “how would you measure impact without fooling yourself?”
The interviewer usually wants to see whether you:
- understand the difference between experimenting and just releasing
- know how to design a hypothesis, metric, and guardrail
- recognize contextual limitations
Weak answer:
I would run an A/B test and see which version performs better.
Strong answer:
I would only run an A/B test if I could isolate the variation and measure impact with some confidence. Otherwise, I would prefer gradual rollout with solid instrumentation and clear guardrails. The important part is not using the most sophisticated tool. It is learning something reliable enough to make a decision.
Closing
Good experimentation does not try to look like a lab.
It tries to reduce uncertainty without misleading the team.
When there is a clear hypothesis, a controlled variant, and a useful read, A/B testing helps a lot.
When those things are missing, the stronger move is often to admit the limitation and measure another way.
That looks less impressive on a slide.
But it is usually better for making decisions.
Quick summary
What to keep in your head
- An A/B test makes sense when there is a clear hypothesis, a controlled variation, and a minimally reliable metric.
- Not every change needs an experiment. Sometimes rollout with serious observation is the more honest choice.
- A bad experiment is not neutral. It burns time, delays decisions, and can legitimize weak interpretation.
- Mature engineering helps design the test mechanics and also recognizes when the context does not support that much rigor.
Practice checklist
Use this when you answer
- Can I explain which hypothesis this experiment is trying to validate?
- Are the variants really comparable, or am I mixing too many changes together?
- Do I have a main metric and guardrails that are good enough to interpret the result?
- If I cannot test this properly, can I defend a rollout with measurement instead of theatrical experimentation?