
A/B Tests for Engineers: How to Experiment Without Pretending Perfect Science

How to think about product experiments in a way that is useful for engineering, without treating A/B tests like an empty statistical ritual or a magic truth button.

Andrews Ribeiro

Founder & Engineer

The problem

When the topic is experimentation, teams usually get it wrong in one of two directions.

One side says:

“let’s run an A/B test for every change”

The other says:

“this is too complex, let’s just ship it”

Both can be bad.

Experimentation is not a maturity ritual.

It is a tool for reducing uncertainty in a specific kind of decision.

If you use it outside that context, you only create delay with the appearance of method.

Mental model

Think about it like this:

a good experiment exists to compare plausible hypotheses under sufficiently controlled conditions.

Three parts matter:

  1. hypothesis
  2. variation
  3. interpretation

If one of those three is weak, the whole test loses value.
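
One way to keep those three parts honest is to write them down before any code ships. A minimal sketch in Python; every name here is hypothetical, not a real framework:

    from dataclasses import dataclass, field

    @dataclass
    class ExperimentPlan:
        hypothesis: str           # 1. what we expect to learn, stated up front
        variants: dict[str, str]  # 2. what differs between arms, and nothing else
        primary_metric: str       # 3. how we will read the result
        guardrail_metrics: list[str] = field(default_factory=list)

    plan = ExperimentPlan(
        hypothesis="reducing onboarding steps increases activation",
        variants={"control": "current flow", "treatment": "3-step flow"},
        primary_metric="activation_rate",
        guardrail_metrics=["error_rate", "support_tickets"],
    )

If any field is hard to fill in, the test is probably not ready to run.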

Breaking the problem down

Start with the hypothesis, not the tool

Bad question:

  • “can we run an A/B test on this?”

Better question:

  • “what exactly are we trying to learn?”

Examples of hypotheses:

  • reducing onboarding steps increases activation
  • changing the order of plans increases conversion
  • showing feedback earlier reduces abandonment

Without a clear hypothesis, the experiment becomes a well-instrumented lottery.

Variants need to isolate the relevant change

This is a common mistake.

The team changes:

  • copy
  • layout
  • screen order
  • backend rule
  • loading time

all in the same experiment.

Then they have no idea what caused the result.

If you really want to learn, you need to reduce what is being changed.

You will not always be able to do that perfectly.

But if the variant is a whole bundle of changes, the test is already weaker than it looks.

A primary metric without guardrails becomes a trap

If the only question is “did conversion go up?”, the experiment is incomplete.

You still need to look at things like:

  • errors
  • cancellations
  • latency
  • support load
  • later retention

Otherwise the team improves the top of the funnel and pushes the problem downstream.
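
One concrete way to enforce this is to refuse to call a result a win while any guardrail regressed. A rough sketch; the threshold is illustrative and assumes every guardrail is lower-is-better:

    # Deltas are (treatment - control) for each metric, as fractions.
    def verdict(primary_delta: float, guardrail_deltas: dict[str, float],
                max_regression: float = 0.02) -> str:
        # Flag any guardrail that got worse than we are willing to accept.
        breached = [name for name, delta in guardrail_deltas.items()
                    if delta > max_regression]
        if breached:
            return "investigate: guardrails regressed: " + ", ".join(breached)
        if primary_delta > 0:
            return "primary metric improved and guardrails held"
        return "no win on the primary metric"

    print(verdict(0.031, {"error_rate": 0.001, "cancellations": 0.045}))

Here the conversion lift would look like a win, but the cancellation guardrail says the problem just moved downstream.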

Not every context supports a serious test

Sometimes traffic is low.

Sometimes the change is mostly operational.

Sometimes the feature depends on a few large customers.

Sometimes behavior varies too much by segment.

In those cases, pretending scientific rigor can be worse than admitting the limitation.

The better answer may be:

  • gradual rollout
  • observational measurement
  • guardrail tracking
  • complementary qualitative analysis

That is not methodological weakness.

It is honesty about the context.
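
And “traffic is low” does not have to stay a feeling. A quick sample-size estimate shows whether a test can even produce a minimal read. A sketch using the standard two-proportion approximation; the baseline and lift are made-up numbers:

    from scipy.stats import norm

    def users_per_variant(baseline: float, lift: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
        # Standard two-proportion sample-size approximation.
        p1, p2 = baseline, baseline + lift
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

    # Detecting a 1-point lift on a 5% baseline: roughly 8k users per arm.
    print(users_per_variant(0.05, 0.01))

If the product will not see that volume in a reasonable window, the list above is the honest choice.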

Experimentation also has an engineering cost

A lot of people ignore that.

To test properly, you need:

  • segmentation
  • coherent allocation
  • stable tracking
  • per-variant readouts
  • care with rollback and exposure

If the change is small and reversible, sometimes the cost of the experiment does not pay for itself.

The point is not to be anti-test.

The point is to recognize that experimentation also consumes product and engineering capacity.
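
“Coherent allocation”, for instance, usually means the same user always lands in the same variant, across sessions and services. The common trick is deterministic hashing; a minimal sketch, with the experiment name and split being illustrative:

    import hashlib

    def assign_variant(experiment: str, user_id: str,
                       treatment_pct: int = 50) -> str:
        # Hashing experiment + user makes assignment stable across sessions
        # and reproducible by any service running the same code.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return "treatment" if bucket < treatment_pct else "control"

    # Same inputs, same arm, no shared state needed.
    print(assign_variant("upgrade-flow-emphasis", "user-42"))

Even this simple version becomes real engineering work once you add exposure logging and rollback.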

Simple example

Imagine an upgrade flow.

Hypothesis:

  • highlighting the recommended plan increases paid conversion

A reasonable experiment:

  • variant A: neutral cards
  • variant B: one plan with visual emphasis and a short explanation
  • primary metric: completed upgrade
  • guardrails: cancellation soon after upgrade, billing tickets, checkout errors
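
Reading the primary metric here can be as simple as a two-proportion z-test. A sketch with invented counts:

    from math import sqrt
    from scipy.stats import norm

    def two_proportion_p(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
        # Pooled two-proportion z-test; returns the two-sided p-value.
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        return 2 * norm.sf(abs(z))

    # Invented numbers: 4.0% vs 4.6% completed upgrades, p is about 0.04.
    print(two_proportion_p(400, 10_000, 460, 10_000))

The guardrails still get their own read; a significant lift alone does not settle the decision.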

Now imagine a bad experiment:

  • you change the emphasis
  • you change the displayed price
  • you change the order of plans
  • you change the CTA copy

If conversion goes up, you do not know what moved it.

If it goes down, you do not know either.

You spent energy to learn very little.

What usually goes wrong

  • Running an experiment without an explicit hypothesis.
  • Testing too many changes at once and calling it a comparison.
  • Looking only at the main metric and ignoring guardrails.
  • Running a test when volume does not support even a minimal read.
  • Confusing gradual rollout with controlled experimentation.
  • Leaving a variant live for too long just because nobody wanted to end the discussion.

How someone more senior thinks

A more mature engineer usually asks:

  • is an experiment really worth it here?
  • what are we trying to learn, not just prove?
  • what needs to stay constant for the result to remain useful?
  • what would be an honest decision if the result is inconclusive?

That last question is great.

Because a lot of tests do not end in absolute truth, but in partial evidence.

And maturity shows up exactly there:

in the ability to decide without pretending the data supports more certainty than it actually does.

Interview angle

This topic can show up in questions like:

  • “how would you validate this change?”
  • “would you run an A/B test or a rollout?”
  • “how would you measure impact without fooling yourself?”

The interviewer usually wants to see whether you:

  • understand the difference between experimenting and just releasing
  • know how to design a hypothesis, metric, and guardrail
  • recognize contextual limitations

Weak answer:

I would run an A/B test and see which version performs better.

Strong answer:

I would only run an A/B test if I could isolate the variation and measure impact with some confidence. Otherwise, I would prefer gradual rollout with solid instrumentation and clear guardrails. The important part is not using the most sophisticated tool. It is learning something reliable enough to make a decision.

Closing

Good experimentation does not try to look like a lab.

It tries to reduce uncertainty without misleading the team.

When there is a clear hypothesis, a controlled variant, and a useful read, A/B testing helps a lot.

When those things are missing, the stronger move is often to admit the limitation and measure another way.

That looks less impressive on a slide.

But it is usually better for making decisions.
