April 2 2025
A/B Tests for Engineers: How to Experiment Without Pretending Perfect Science
How to think about product experiments in a way that is useful for engineering, without treating A/B tests like an empty statistical ritual or a magic truth button.
Andrews Ribeiro
Founder & Engineer
5 min · Intermediate · Thinking
The problem
When the topic is experimentation, teams usually get it wrong in one of two directions.
One side says:
“let’s run an A/B test for every change”
The other says:
“this is too complex, let’s just ship it”
Both can be bad.
Experimentation is not a maturity ritual.
It is a tool for reducing uncertainty in a specific kind of decision.
If you use it outside that context, you only create delay with the appearance of method.
Mental model
Think about it like this:
a good experiment exists to compare plausible hypotheses under sufficiently controlled conditions.
Three parts matter:
- hypothesis
- variation
- interpretation
If one of those three is weak, the whole test loses value.
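One way to keep the three parts honest is to write them down before any experiment code ships. A minimal sketch (the `Experiment` class and all field names are illustrative, not from any specific framework):

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Forces hypothesis, variation, and interpretation to exist up front."""
    hypothesis: str                # what we are trying to learn
    variants: dict[str, str]       # variant name -> the single change it isolates
    primary_metric: str            # how the result will be interpreted
    guardrails: list[str] = field(default_factory=list)


upgrade_test = Experiment(
    hypothesis="Highlighting the recommended plan increases paid conversion",
    variants={"A": "neutral cards", "B": "one plan visually emphasized"},
    primary_metric="completed_upgrade",
    guardrails=["early_cancellation", "billing_tickets", "checkout_errors"],
)
```

If you cannot fill in the `hypothesis` field in one sentence, the weakness is already visible before the first user is allocated.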
Breaking the problem down
Start with the hypothesis, not the tool
Bad question:
- “can we run an A/B test on this?”
Better question:
- “what exactly are we trying to learn?”
Examples of hypotheses:
- reducing onboarding steps increases activation
- changing the order of plans increases conversion
- showing feedback earlier reduces abandonment
Without a clear hypothesis, the experiment becomes a well-instrumented lottery.
Variants need to isolate the relevant change
This is a common mistake.
The team changes:
- copy
- layout
- screen order
- backend rule
- loading time
all in the same experiment.
Then they have no idea what caused the result.
If you really want to learn, you need to reduce what is being changed.
You will not always be able to do that perfectly.
But if the variant is a whole bundle of changes, the test is already weaker than it looks.
A primary metric without guardrails becomes a trap
If the only question is “did conversion go up?”, the experiment is incomplete.
You still need to look at things like:
- errors
- cancellations
- latency
- support load
- later retention
Otherwise the team improves the top of the funnel and pushes the problem downstream.
Not every context supports a serious test
Sometimes traffic is low.
Sometimes the change is mostly operational.
Sometimes the feature depends on a few large customers.
Sometimes behavior varies too much by segment.
In those cases, feigning scientific rigor can be worse than admitting the limitation.
The better answer may be:
- gradual rollout
- observational measurement
- guardrail tracking
- complementary qualitative analysis
That is not methodological weakness.
It is honesty about the context.
Experimentation also has an engineering cost
A lot of people ignore that.
To test properly, you need:
- segmentation
- coherent allocation
- stable tracking
- per-variant readouts
- care with rollback and exposure
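"Coherent allocation" in that list usually means deterministic bucketing: the same user always sees the same variant, and different experiments split independently. A common way to get that is hashing the user id together with the experiment name (a sketch, not any particular platform's API):

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministic allocation: stable per user, independent per experiment.

    Mixing the experiment name into the hash keeps one experiment's split
    uncorrelated with another's.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The assignment is stable across calls, with no per-user state to store:
assert assign_variant("user-42", "upgrade-flow") == assign_variant("user-42", "upgrade-flow")
```

Stable assignment is also what makes "per-variant readouts" trustworthy: if allocation flickers, exposure logs stop meaning anything.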
If the change is small and reversible, sometimes the cost of the experiment does not pay for itself.
The point is not to be anti-test.
The point is to recognize that experimentation also consumes product and engineering capacity.
Simple example
Imagine an upgrade flow.
Hypothesis:
- highlighting the recommended plan increases paid conversion
A reasonable experiment:
- variant A: neutral cards
- variant B: one plan with visual emphasis and a short explanation
- primary metric: completed upgrade
- guardrails: cancellation soon after upgrade, billing tickets, checkout errors
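For a binary primary metric like "completed upgrade", the read can be a plain two-proportion z-test, with no dependencies beyond the standard library. The numbers below are invented to show the shape of the readout:

```python
from math import erf, sqrt


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test on conversion rates.

    Returns (absolute lift of B over A, p-value under the normal approximation).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, p_value


# 6.0% vs 7.5% conversion on 2000 users per arm:
lift, p = two_proportion_z(conv_a=120, n_a=2000, conv_b=150, n_b=2000)
# lift = +1.5 points, p ≈ 0.06: suggestive, but not conclusive at alpha = 0.05.
```

That last line is exactly the "honest decision if the result is inconclusive" situation: the data leans one way without settling the question.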
Now imagine a bad experiment:
- you change the emphasis
- you change the displayed price
- you change the order of plans
- you change the CTA copy
If conversion goes up, you do not know what moved it.
If it goes down, you do not know either.
You spent energy to learn very little.
What usually goes wrong
- Running an experiment without an explicit hypothesis.
- Testing too many changes at once and calling it a comparison.
- Looking only at the main metric and ignoring guardrails.
- Running a test when volume does not support even a minimal read.
- Confusing gradual rollout with controlled experimentation.
- Leaving a variant live for too long just because nobody wanted to end the discussion.
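The volume point above can be checked with arithmetic before anyone writes experiment code. A standard normal-approximation formula gives a rough per-variant sample size for detecting an absolute lift in a conversion rate (defaults here assume two-sided alpha = 0.05 and power = 0.80):

```python
from math import ceil


def min_sample_per_variant(baseline: float, lift: float,
                           z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Rough per-variant n to detect an absolute lift in a conversion rate,
    using the common pooled-variance approximation."""
    p_bar = baseline + lift / 2
    n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / lift ** 2
    return ceil(n)


# Detecting a 1-point lift on a 5% baseline needs thousands of users per arm:
n = min_sample_per_variant(baseline=0.05, lift=0.01)
```

If your flow sees a few hundred users a week, that number alone tells you the A/B test will not produce even a minimal read in reasonable time, and a rollout with observation is the more honest plan.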
How someone more senior thinks
A more mature engineer usually asks:
- is an experiment really worth it here?
- what are we trying to learn, not just prove?
- what needs to stay constant for the result to remain useful?
- what would be an honest decision if the result is inconclusive?
That last question is great.
Because a lot of tests do not end in absolute truth, but in partial evidence.
And maturity shows up exactly there:
in the ability to decide without pretending the data supports more certainty than it actually does.
Interview angle
This topic can show up in questions like:
- “how would you validate this change?”
- “would you run an A/B test or a rollout?”
- “how would you measure impact without fooling yourself?”
The interviewer usually wants to see whether you:
- understand the difference between experimenting and just releasing
- know how to design a hypothesis, metric, and guardrail
- recognize contextual limitations
Weak answer:
I would run an A/B test and see which version performs better.
Strong answer:
I would only run an A/B test if I could isolate the variation and measure impact with some confidence. Otherwise, I would prefer gradual rollout with solid instrumentation and clear guardrails. The important part is not using the most sophisticated tool. It is learning something reliable enough to make a decision.
Closing
Good experimentation does not try to look like a lab.
It tries to reduce uncertainty without misleading the team.
When there is a clear hypothesis, a controlled variant, and a useful read, A/B testing helps a lot.
When those things are missing, the stronger move is often to admit the limitation and measure another way.
That looks less impressive on a slide.
But it is usually better for making decisions.
Quick summary
What to keep in your head
- An A/B test makes sense when there is a clear hypothesis, a controlled variation, and a minimally reliable metric.
- Not every change needs an experiment. Sometimes rollout with serious observation is the more honest choice.
- A bad experiment is not neutral. It burns time, delays decisions, and can legitimize weak interpretation.
- Mature engineering helps design the test mechanics and also recognizes when the context does not support that much rigor.
Practice checklist
Use this when you answer
- Can I explain which hypothesis this experiment is trying to validate?
- Are the variants really comparable, or am I mixing too many changes together?
- Do I have a main metric and guardrails that are good enough to interpret the result?
- If I cannot test this properly, can I defend a rollout with measurement instead of theatrical experimentation?