Scale Experiment Decision-Making with Programmatic Decision Rules
Decide what to do with experiment results in code
The experiment lifecycle is like the human lifecycle. First, a person or idea is born, then it develops, then it is tested, then its test ends, and then the Gods (or Product Managers) decide its worth.
But a lot of things happen during a life or an experiment. Sometimes, a person or idea is good in one way but bad in another. How are the Gods supposed to decide? They have to make some tradeoffs. There’s no avoiding it.
The key is to make these tradeoffs before the experiment and before we see the results. We do not want to decide on the rules based on our pre-existing biases about which ideas deserve to go to heaven (err… launch — I think I’ve stretched the metaphor far enough). We want to write our scripture (okay, one more) before the experiment starts.
The point of this blog is to propose that we should write how we will make decisions explicitly—not in English, which permits vague language, e.g., “we’ll consider the effect on engagement as well, balancing against revenue” and similar wishy-washy, unquantified statements — but in code.
I’m proposing an “Analysis Contract,” which enforces how we will make decisions.
A contract is a function in your favorite programming language. The contract takes the “basic results” of an experiment as arguments. Determining which basic results matter for decision-making is part of defining the contract. Usually, in an experiment, the basic results are treatment effects, the standard errors of treatment effects, and configuration parameters like the number of peeks. Given these results, the contract returns an arm or a variant of the experiment as the variant that will launch. For example, it would return either ‘A’ or ‘B’ in a standard A/B test.
It might look something like this:
int
analysis_contract(double te1, double te1_se, ….)
{
    if ((te1 / te1_se < 1.96) && (…conditions…))
        return 0; /* for variant 0 */
    if (…conditions…)
        return 1; /* for variant 1 */
    /* and so on */
}
The Experimentation Platform would then associate the contract with the particular experiment. When the experiment ends, the platform processes the contract and ships the winning variant according to the rules specified in the contract.
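To make the shape of a contract concrete, here is a minimal sketch of one fully specified contract for a two-arm test. The metric choices (revenue as the primary metric, engagement as a guardrail), the struct and function names, and the 1.96 cutoff are all assumptions for illustration; they are not taken from any real platform.

```c
#include <assert.h>

/* Hypothetical "basic results" for one metric: the estimated treatment
   effect of variant 1 relative to variant 0, and its standard error. */
struct metric_result {
    double te;    /* treatment effect estimate */
    double te_se; /* standard error of the estimate; must be > 0 */
};

/* t-statistic for a metric result. */
static double
tstat(struct metric_result m)
{
    assert(m.te_se > 0.0);
    return m.te / m.te_se;
}

/* Illustrative contract: ship variant 1 only if revenue is significantly
   up and engagement is not significantly down; otherwise keep variant 0.
   The 1.96 cutoff assumes a single look at the data (no peeking). */
int
analysis_contract(struct metric_result revenue, struct metric_result engagement)
{
    const double crit = 1.96;

    if (tstat(revenue) > crit && tstat(engagement) > -crit)
        return 1; /* variant 1 launches */
    return 0;     /* default: keep variant 0 */
}
```

Every branch is spelled out, so anyone can read off what ships in the awkward cases, such as revenue up but engagement significantly down (variant 0 stays).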
I’ll add the caveat here that this is an idea. It’s not a story about a technique I’ve seen implemented in practice, so there may be practical issues with various details that would be ironed out in a real-world deployment. I think Analysis Contracts would mitigate the problem of ad-hoc decision-making and force us to think deeply about, and pre-register, how we will deal with the most common scenario in experimentation: the metrics we expected to move a lot show statistically insignificant effects.
By using Analysis Contracts, we can…
Make decisions upfront
We do not want to change how we make decisions because of the particular dataset our experiment happened to generate.
There’s no (good) reason why we should wait until after the experiment to say whether we would ship in Scenario X. We should be able to say it before the experiment. If we are unwilling to, it suggests that we are relying on something else outside the data and the experiment results. That information might be useful, but information that doesn’t depend on the experiment results was available before the experiment. Why didn’t we commit to using it then?
Statistical inference is based on a model of behavior. In that model, we know exactly how we would make decisions — if only we knew certain parameters. We gather data to estimate those parameters and then decide what to do based on our estimates. Not specifying our decision function breaks this model, and many of the statistical properties we take for granted are just not true if we change how we call an experiment based on the data we see.
We might say: “We promise not to make decisions this way.” But then, after the experiment, the results aren’t very clear. A lot of things are insignificant. So, we cut the data in a million ways, find a few “significant” results, and tell a story from them. It’s hard to keep our promises.
The cure isn’t to make a promise we can’t keep. The cure is to make a promise the system won’t let us (quietly) break.
Be consistent, clear, and precise about how we make decisions
English is a vague language, and writing our guidelines in it leaves a lot of room for interpretation. Code forces us to decide explicitly what we will do and to state it quantitatively, e.g., how much revenue we will give up in the short run to improve our subscription product in the long run.
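As a sketch of what stating the tradeoff quantitatively could look like, the function below encodes a single exchange rate between short-run revenue and subscribers. The name `ship_on_tradeoff`, the parameter `revenue_per_subscriber`, and the choice to use point estimates only are simplifying assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical tradeoff rule: value each incremental subscriber at a fixed
   dollar amount agreed on before the experiment, then ship variant 1 only
   if the combined value of the two estimated effects is positive.
   Uses point estimates only; a real contract would likely also account
   for the standard errors. */
int
ship_on_tradeoff(double revenue_te,             /* change in short-run revenue ($ per user) */
                 double subscribers_te,         /* change in subscribers per user */
                 double revenue_per_subscriber) /* agreed long-run value of a subscriber ($) */
{
    assert(revenue_per_subscriber >= 0.0);

    double net_value = revenue_te + revenue_per_subscriber * subscribers_te;
    return net_value > 0.0 ? 1 : 0;
}
```

With a subscriber valued at $20, losing $0.10 of revenue per user to gain 0.01 subscribers per user ships; with a subscriber valued at $5, it does not. The disagreement about the right number happens before the experiment, which is the point.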
Code improves communication enormously because I don’t have to interpret what you mean. I can plug in different results and see what decisions you would have made if the results had differed. This can be incredibly useful for retrospective analysis of past experiments as well. Because we have an actual function mapping results to decisions, we can run various simulations, bootstraps, etc., and re-decide the experiment based on that data.
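Because the contract is an ordinary function, replaying it over alternative results is just a loop. The sketch below assumes a hypothetical one-metric contract and caller-supplied arrays of treatment effects and standard errors (for example, bootstrap replicates computed elsewhere); it reports the share of replicates in which variant 1 would have shipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-metric contract: ship variant 1 if the t-statistic
   exceeds 1.96, otherwise keep variant 0. */
static int
contract(double te, double te_se)
{
    assert(te_se > 0.0);
    return (te / te_se > 1.96) ? 1 : 0;
}

/* Re-run the contract on each replicate (e.g. bootstrap draws produced by
   some other tool) and return the fraction that chose variant 1. */
double
ship_share(const double *te, const double *te_se, size_t n)
{
    assert(n > 0);

    size_t shipped = 0;
    for (size_t i = 0; i < n; i++)
        shipped += (size_t)contract(te[i], te_se[i]);
    return (double)shipped / (double)n;
}
```

A share near 0 or 1 suggests the decision was robust to sampling noise; a share near one half suggests the experiment could easily have been called the other way.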
But what if I disagree with the Analysis Contract’s decision?
One of the primary objections to Analysis Contracts is that after the experiment, we might decide we had the wrong decision function. Usually, the problem is that we didn’t realize what the experiment would do to metric Y, and our contract ignores it.
Given that, there are two roads to go down:
1. If we have 1000 metrics and the true effect of an experiment on each metric is 0, some metrics will likely have large magnitude effects. One solution is to go with the Analysis Contract this time and remember to consider the metric next time in the contract. Over time, our contract will evolve to better represent our true goals. We shouldn’t put too much weight on what happens to the 20th most important metric. It could just be noise.
2. If the effect is truly outsized and we can’t get comfortable with ignoring it, the other solution is to override the contract, making sure to log somewhere prominent that this happened. Then, update the contract because we clearly care a lot about this metric. Over time, the number of times we override should be logged as a KPI of our experimentation system. As we get the decision-making function closer and closer to the best representation of our values, we should stop overriding. This can be a good way to monitor how much ad-hoc, nonstatistical decision-making goes on. If we frequently override the contract, then we know the contract doesn’t mean much, and we are not following good statistical practices. It’s built-in accountability, and it creates a cost to overriding the contract.
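Tracking overrides as a KPI could be as simple as the tally sketched below. The struct and function names are hypothetical, and a real platform would presumably persist this alongside each experiment's record instead of keeping it in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical running tally of how experiments were decided. */
struct decision_tally {
    unsigned decided;    /* experiments decided so far */
    unsigned overridden; /* of those, how many overrode the contract */
};

/* Record one decision: what the contract returned and what actually shipped. */
void
record_decision(struct decision_tally *tally, int contract_variant, int shipped_variant)
{
    assert(tally != NULL);

    tally->decided++;
    if (shipped_variant != contract_variant)
        tally->overridden++;
}

/* Override rate: the KPI to watch. Returns 0 when nothing has been decided. */
double
override_rate(const struct decision_tally *tally)
{
    assert(tally != NULL);

    if (tally->decided == 0)
        return 0.0;
    return (double)tally->overridden / (double)tally->decided;
}
```

If the rate stays high, the contracts are not capturing what the organization actually values and need revising.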
Contracts as Predicates
Contracts do not need to be fully flexible code (there are probably security issues with allowing arbitrary code to be submitted directly to an Experimentation Platform, even if it’s conceptually nice). But we can have a system that enables experimenters to specify predicates, e.g., IF TStat(Revenue) ≤ 1.96 AND TStat(Engagement) > 1.96 THEN X, etc. We can expose standard comparison operations alongside t-statistics and effect magnitudes and specify decisions that way.
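One way such a predicate system might be represented is as plain data that the platform evaluates, so no user code ever runs. Everything in the sketch below (the enums, the struct layout, the first-match-wins semantics, the default variant) is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum stat_kind { STAT_TSTAT, STAT_EFFECT }; /* which quantity to compare */
enum cmp_op { CMP_LE, CMP_GT };             /* supported comparisons */

/* One clause such as "TStat(Revenue) <= 1.96". `metric` is an index into
   the result arrays supplied by the platform. */
struct predicate {
    size_t metric;
    enum stat_kind stat;
    enum cmp_op op;
    double threshold;
};

/* A rule: IF all clauses hold THEN choose `variant`. */
struct rule {
    const struct predicate *clauses;
    size_t n_clauses;
    int variant;
};

static int
clause_holds(const struct predicate *p, const double *te, const double *te_se)
{
    double value = te[p->metric];
    if (p->stat == STAT_TSTAT) {
        assert(te_se[p->metric] > 0.0);
        value /= te_se[p->metric];
    }
    return p->op == CMP_LE ? value <= p->threshold : value > p->threshold;
}

/* Return the variant of the first rule whose clauses all hold, or
   default_variant if no rule matches. */
int
evaluate_rules(const struct rule *rules, size_t n_rules,
               const double *te, const double *te_se, int default_variant)
{
    for (size_t r = 0; r < n_rules; r++) {
        int all = 1;
        for (size_t c = 0; c < rules[r].n_clauses && all; c++)
            all = clause_holds(&rules[r].clauses[c], te, te_se);
        if (all)
            return rules[r].variant;
    }
    return default_variant;
}
```

Because rules are data, the platform can validate them, store them with the experiment before launch, and display them next to the results.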
Thanks for reading! Does your org use anything similar to an Analysis Contract? I think it’s a great solution to a tricky human problem in experimentation, but I’d love to hear anyone’s real-world experience with a more automated approach to experiment decision-making.
Zach
Connect at LinkedIn: https://linkedin.com/in/zlflynn
Scale Experiment Decision-Making with Programmatic Decision Rules was originally published in Towards Data Science on Medium.