A/B testing is a powerful tool for learning about your users, understanding your features’ impact, and making informed business decisions. To ensure you make the best decisions and are extracting the most insights from your experiments, some experimental design guidelines are essential. These guidelines can be cumbersome or confusing at times, which can lead to re-tests, which take even more time – or even lead you to make the wrong decisions! Luckily, there are statistical techniques that can take care of some of these issues so that you are protected from common A/B testing pitfalls.
The beauty of frequentist statistics, such as the t-test, which underpins Split’s statistical analyses, is that it can provide strong and clearly defined guarantees on error rates. By setting the desired level of statistical significance (e.g., 0.05, or 95% confidence), the experimenter has complete control over the chances of seeing a statistically significant result when there was no real impact (i.e., seeing a false positive). This allows you to choose the balance between confidence and time to significance that is right for you. The lower the error rate, or equivalently the higher the confidence you require, the longer your metrics will take to reach significance. On the other hand, if getting results fast is important and you can tolerate a lower level of confidence, choosing a less stringent threshold for statistical significance (e.g., 0.1, or 90% confidence) will decrease your metrics’ time to significance and allow you to detect smaller impacts.
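To make the mechanics concrete, here is a minimal sketch of the kind of two-sample comparison described above. It uses Welch's t-test with a normal approximation to the t-distribution (reasonable at typical A/B test sample sizes), and the sample numbers are hypothetical, not from any real experiment:

```python
import math

def welch_t_test(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Two-sample Welch's t-test. Uses a standard-normal approximation
    for the p-value, which is accurate for large A/B test samples."""
    se = math.sqrt(var_a / n_a + var_b / n_b)  # standard error of the difference
    t = (mean_b - mean_a) / se
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Hypothetical conversion-rate data: control (a) vs. treatment (b).
t, p = welch_t_test(mean_a=0.100, var_a=0.09, n_a=10_000,
                    mean_b=0.112, var_b=0.10, n_b=10_000)
print(f"t = {t:.2f}, p = {p:.4f}")
print("significant at alpha=0.05" if p < 0.05 else "not significant at alpha=0.05")
```

Dropping the threshold from 0.05 to 0.1 would let smaller t-statistics (and hence smaller or noisier impacts) clear the bar sooner, at the cost of a higher false positive rate.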
It’s essential to keep in mind that this error rate applies to each metric. The more metrics you have, the higher your chances of seeing a false positive result. In some cases, this can make decision making difficult, and it’s not always clear how testing multiple metrics impacts the chances of seeing a false positive.
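The growth in the chance of at least one false positive is easy to quantify if, for illustration, we assume the metrics are independent: with a per-metric error rate of alpha, the chance that at least one of m metrics is a false positive is 1 − (1 − alpha)^m.

```python
# Per-metric false positive rate (significance threshold).
alpha = 0.05

# Assuming independent metrics with no true impact, the chance of
# seeing at least one false positive grows quickly with the metric count.
for m in (1, 5, 10, 20, 50):
    p_any = 1 - (1 - alpha) ** m
    print(f"{m:>2} metrics: {p_any:.1%} chance of at least one false positive")
```

Even at a modest 10 metrics, the chance of at least one spurious "significant" result is roughly 40%, which is why corrections for multiple comparisons matter.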
A standard recommendation is to select one metric to be the key metric for your experiment and to judge the success of your test primarily on this metric alone (e.g., a primary key performance indicator, or primary KPI). If other metrics appear significantly impacted, a cautious approach would be to re-run the experiment, with those metrics acting as the key metric the next time. This allows the experimenter to ensure they have no more than a fixed chance of making an incorrect decision, because the error rate is tightly controlled when concluding on a single metric. (At Split, this chance is 5% by default.) While this approach is academically sound, in practice it can be time-consuming and overly conservative.
Experimenters more commonly limit the overall number of metrics for analysis or consider a subset of metrics to be more relevant to the tested change. The experimenter must then exercise their judgment on whether a highly unexpected impact may be a false positive and simply due to randomness in the data.
At Split, we intentionally calculate the results for every single one of your metrics for every test, so that you do not need to choose a subset or manually attach metrics to experiments. This is because features can have unexpected impacts, perhaps in a different part of the funnel or on performance metrics (such as page load times). This broad coverage can be hugely valuable, and experimenters shouldn’t need to limit the number of metrics they’re testing.
Multiple Comparison Corrections
Multiple comparison corrections are a set of statistical techniques that automatically adjust your results to control the overall chances of error, regardless of how many metrics you are testing. This means there’s one less thing to worry about – you don’t need to limit the number of metrics you’re testing or to make any subjective judgment calls about whether or not an unexpected change may be a false positive.
At Split, we support a type of multiple comparison correction which controls the false discovery rate. The false positive rate is the chance that a metric is deemed statistically significant when there was no true underlying impact; the false discovery rate is instead the chance that a statistically significant metric is a false positive. If you consider each statistically significant metric a ‘discovery,’ then false discovery rate control limits the proportion of all of your discoveries which are false.
We chose to implement this type of correction because we believe it is more tightly coupled to how experimenters make decisions. It is only the statistically significant metrics which are reported as meaningful detected impacts of your test, hence it is most valuable to know that you can be confident in each of those reported discoveries.
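The post does not spell out the exact procedure used, but the classic correction that controls the false discovery rate is the Benjamini–Hochberg procedure. A sketch, with hypothetical p-values for ten metrics:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of metrics deemed significant under
    Benjamini-Hochberg false discovery rate control at level alpha."""
    m = len(p_values)
    # Sort p-values ascending, remembering each metric's original index.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * alpha:
            k_max = rank
    # The metrics with the k_max smallest p-values are the discoveries.
    return sorted(order[:k_max])

# Hypothetical p-values for ten metrics in one experiment.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals))  # indices of the corrected discoveries
```

Note that a naive per-metric threshold of 0.05 would call five of these metrics significant, while the corrected procedure reports only the strongest discoveries, keeping the expected share of false discoveries below alpha.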
When the multiple comparison correction is applied, using the default significance threshold of 5%, you can be confident that, on average, no more than 5% of the metrics reported as statistically significant are false positives – in other words, at least 95% of your statistically significant results reflect a real, meaningful impact. This guarantee applies regardless of how many metrics you have. If you need more confidence, you can lower the significance threshold; if you need faster results, you can raise it.
To provide these strong guarantees on the error rate, you may find some metrics take longer to reach significance. For this reason, we treat your ‘key’ metrics and your ‘organizational’ metrics separately to give you additional flexibility. If you care about a specific small number of metrics, you can set these as key metrics for your experiment. They will not be penalized for the large number of other metrics calculated for your organization, and they will reach significance faster when set as key metrics. If you choose a single metric to be a key metric, the correction will not affect its time to significance.
Calibrating Statistical Rigor to Business Needs
One of the most valuable aspects of the experimentation statistics we use at Split is the ability to control the likelihood of a false positive in your results. There isn’t a one-size-fits-all answer: different businesses may need different confidence levels depending on factors like risk tolerance, typical traffic sizes, and acceptable lengths of time to wait for results.
However, as online experimentation advances and with our customers measuring more and more metrics every day, there is a need to support solutions more tailored to making business decisions and to remove any concern over measuring too many things.
With multiple comparison corrections, you can add as many metrics as you can come up with, and know that you can still report any statistically significant changes with the same fixed level of confidence.
For more quick reads on A/B testing and experimentation, check out the posts below.
- Know the difference between A/B testing and multivariate testing
- Get pointers on how to experiment during extreme seasonality
- Dig deeper into A/B/n testing and multivariate testing