This post is part of a series of learning topics presented by technology experts. Today’s post comes courtesy of Jin Xiaocheng, Data Scientist and Data Analyst, who gives us a primer on A/B testing usage and pitfalls. The content comes from a recent training session delivered by Jin in our workplace.
What is A/B testing?
A/B testing means running a controlled experiment to compare results from two options. In the world of product management it is often used to test user experience and user engagement on websites and applications.
You have probably heard of the A/B testing done by Netflix and Facebook: a new feature is rolled out to a limited group of users, or content is served up differently to different groups, and the resulting differences in user behaviour are compared to decide whether the change is worth rolling out to everyone.
A/B testing can be simple or complex, and can be effective or not, depending on the design.
Two common uses of A/B testing are:
- Evaluating Return on Investment. We spent $$$ on a product or feature and believe it is successful. But is it being used?
- Testing a new feature or web design by comparing results between groups of users
Some pitfalls of A/B testing
Jin gave us a simple case study:
Case Study 1: website upgrade
A company updates its website to increase user engagement. After the upgrade, the product owner reports that users are spending more time on the website, so “user engagement” is up. The upgrade is deemed a success. But is it really?
There are multiple reasons why users could be spending more time on the new website, including:
- The new website is more interesting and engaging – e.g. it is more fun to use, or content is surfaced better and users are staying longer to do more things
- The new website is harder to navigate and users are taking longer to find what they need
Based on the information available in Case Study 1, there is no way to know which.
Lesson One of A/B testing: Correlation is not Causation
Let’s look again at Case Study 1. Say you wanted to examine whether users spending longer on the site is evidence of a better user experience: what could you measure?
You could measure Click-Through Rate (CTR). If users are spending longer on the site, what are they clicking? Are they constantly going back and forth in menus, or are they moving through stages of a process logically?
You could test this by comparing the CTR of users on the old and new website when completing the same task.
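As a sketch of what that comparison could look like, here is a two-proportion z-test on click-through rates for the same task on the old and new site. The visit and click counts are made up purely for illustration; this is not data from the case study.

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test for a difference in click-through rate."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled click rate under the null hypothesis of no difference
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical numbers: old site 120 clicks / 2000 visits, new site 150 / 2000
z, p = two_proportion_z(120, 2000, 150, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these invented numbers the difference sits near the conventional 5% threshold, which is exactly the kind of borderline result the later lessons warn about.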
But what happens if your two groups do not vary greatly on that measurement?
In fact this happens quite often in A/B testing: after measuring, you may find there is little difference between the two groups. What do you do then?
Can you tweak your tests? Can you change what is being tested? Should you test something different?
The answer could be any of those, but you have to be wary of “hacking” your way to the desired result through changing values and tests.
It could also be that what you are testing is not as significant as you thought it might be.
Let’s look at another hypothetical example, for a holiday booking website.
Case Study 2: testing highlighting
A travel booking website wants to test whether adding a highlight to the selection button on search results encourages users to select the highlighted option. The site adds highlights to different search results when users look for a hotel in a certain area, and randomly assigns users to four groups:
- A: 10% of users: search results shown without any highlighting
- B: 40% of users: rule-based: the top ten hotels have highlighting
- C: 40% of users: highlighting based on a machine learning algorithm
- D: 10% of users: a random 10% of results have highlighting
If there is no discernible difference in the selection of hotels across these four user groups, then it can be reasonably decided that the highlighting feature is not important and isn’t needed.
In fact, in most cases, highlighting works. It helps users make decisions and can help boost the site’s revenue. This has been verified by businesses using A/B testing¹. But importantly, you need good design in testing, and you need to measure multiple things in the A/B tests: both Click-Through Rate (CTR) and revenue.
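One way to check a four-group design like Case Study 2 for any difference at all is a Pearson chi-square test of independence on the selection counts. The counts below are entirely hypothetical; the sketch only illustrates the mechanics of the test.

```python
# Hypothetical counts per group: (selected a highlighted-style result, did not)
observed = {
    "A (no highlight)":     (310, 690),
    "B (rule-based)":       (1320, 2680),
    "C (machine learning)": (1405, 2595),
    "D (random)":           (295, 705),
}

total_sel = sum(sel for sel, not_sel in observed.values())
total_not = sum(not_sel for sel, not_sel in observed.values())
grand = total_sel + total_not

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# where "expected" assumes selection rate is the same in every group
chi2 = 0.0
for sel, not_sel in observed.values():
    n = sel + not_sel
    exp_sel = n * total_sel / grand
    exp_not = n * total_not / grand
    chi2 += (sel - exp_sel) ** 2 / exp_sel + (not_sel - exp_not) ** 2 / exp_not

# Critical value for 3 degrees of freedom at the 5% level is about 7.81
verdict = "significant" if chi2 > 7.81 else "not significant"
print(f"chi-square = {chi2:.2f} ({verdict} at the 5% level)")
```

A significant statistic only says the groups differ somewhere; as the next lesson notes, you would still need to look at which groups differ, and at revenue as well as CTR, before drawing conclusions.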
Lesson Two of A/B testing: Comparing performance (e.g. CTR) is not always enough
Sometimes what appears to be a significant test result is not. The results you see could be arising from chance, or from other factors.
This opens the statistical can of worms that is the p-value. The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming there is no real difference between the two groups. Because any single experiment is partly chance, repeating the same test can produce noticeably different p-values.
Similarly, beware of false positives, and beware of ‘null’ results. “Absence of evidence is not evidence of absence”.
Jin recommends viewing ‘Dance of the P-Values’ to understand these concepts.
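The “dance” can be seen directly in a small simulation: run the exact same experiment many times with no real difference between the two arms, and watch the p-value jump around, with roughly 5% of runs looking “significant” purely by chance. This sketch uses a simple two-proportion z-test and invented click rates.

```python
import math
import random

random.seed(0)

def ab_p_value(n, rate_a, rate_b):
    """Simulate one A/B test with n users per arm; return its two-sided p-value."""
    clicks_a = sum(random.random() < rate_a for _ in range(n))
    clicks_b = sum(random.random() < rate_b for _ in range(n))
    p_pool = (clicks_a + clicks_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
    if se == 0:
        return 1.0
    z = (clicks_a / n - clicks_b / n) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Re-run the *same* null experiment 1,000 times: both arms have a true 5% CTR,
# so any "significant" result is a false positive.
p_values = [ab_p_value(1000, 0.05, 0.05) for _ in range(1000)]
false_positives = sum(p < 0.05 for p in p_values)
print(f"{false_positives} of 1000 identical null experiments looked 'significant' (p < 0.05)")
print(f"p-values ranged from {min(p_values):.3f} to {max(p_values):.3f}")
```

The spread from near 0 to near 1 across identical experiments is the dance; picking the run with the smallest p-value and reporting only that is p-value hacking.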
Lesson Three of A/B testing: Know how to interpret p-values. Beware of ‘the dance of p-values’ and p-value hacking
Sometimes A/B testing is not helpful.
Sometimes, A/B testing is not the right tool for what you want to find out. To illustrate this, Jin showed us another case study.
Case Study 3: testing ROI
You are a data scientist in e-commerce, working on a virtual credit card product similar to Afterpay. Your target is to get 1 million new users in a month. You have a budget of $2m. What is your “growth hack” plan to meet your target? How can you work out whether the plan is likely to succeed?
I’ll be honest here. I had ZERO suggestions for this one. I wouldn’t even know where to start. But some of the team’s suggestions were:
- Offer a discount over existing providers
- Use Google Analytics to try to understand trends in the market, then search for businesses in industries that might be open to targeted promotion
- Look at other providers and existing offers and promotions, and test if our budget allows us to compete
The important factor in this case is Return on Investment (ROI).
You have limited time and an ambitious target.
You could come up with a plan, and you could try different approaches and A/B test them.
But what if you spent your $2m budget and didn’t meet your 1m user target?
Is this a good project, if you can’t test whether you can reach your target?
This is the kind of question that A/B testing does not help solve. In this particular case, what was important was not the quality of tests or data, but timing. A product like this launched in 2016 had a good chance of reaching its target, given Afterpay was new and didn’t have a lot of users at that time. A similar product launched in 2020, in a saturated market, would have zero chance of meeting the same target, even with a good plan and budget.
Understand the limits of A/B testing. It is not always the best evaluation tool for the situation.
If you want to learn more about statistics, check out:
- Be careful about P hacking https://www.youtube.com/watch?v=HDCOUXE3HMM
- Power analysis https://www.youtube.com/watch?v=VX_M3tIyiYk
- Dance of P values https://www.youtube.com/watch?v=5OL1RqHrZQ8
If you want to play with some very simple before-and-after comparisons for your website, the ‘Wayback Machine’ provides a simple start: https://web.archive.org/
¹ Real-world examples of A/B testing: Booking.com has published academic papers about using advanced data science techniques, such as this one presented at the KDD conference, which shows how A/B testing is used to evaluate the results of various experiments: 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com