Statistical Significance at Mutiny
In this doc, we’ll take a closer look at what statistical significance is, how you should interpret different metrics, and then walk you through where statistical significance appears in the platform.
What is Statistical Significance?
Statistical significance is a measure of how likely it is that the results of a test or experiment are due to chance. In the context of Mutiny, it helps determine whether any observed differences between two or more versions of your website are real and meaningful, or just the result of random chance. To fully understand statistical significance, we’re going to break down a few different concepts and see how each of them affects your experiments in Mutiny.
When we compare the results of a personalization or experiment, we typically use a measure called conversion lift. We define conversion lift as the percent difference between the conversion rate for all visitors who have seen one or more active personalized experiences and the conversion rate for all visitors who have not seen any active personalized experiences.
You can interpret conversion lift as the difference between conversion rates for two different groups: those who have seen the personalized experience at least once versus those who have only seen the control.
In some cases, conversion lift may be negative, which indicates that your control has a higher conversion rate than your personalized version.
It’s important to remember that a conversion lift on its own does not mean that your test was statistically significant. It simply shows the percentage difference in conversion rate between the two groups.
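To make the definition concrete, here is a minimal Python sketch of the conversion lift calculation described above. The function names and the visitor counts are made up for illustration and are not part of the Mutiny platform.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Share of visitors who converted."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return conversions / visitors


def conversion_lift(
    personalized_conversions: int,
    personalized_visitors: int,
    control_conversions: int,
    control_visitors: int,
) -> float:
    """Relative difference between the personalized and control conversion rates."""
    personalized_rate = conversion_rate(personalized_conversions, personalized_visitors)
    control_rate = conversion_rate(control_conversions, control_visitors)
    if control_rate == 0:
        raise ValueError("lift is undefined when the control rate is zero")
    return (personalized_rate - control_rate) / control_rate


# Hypothetical counts: 6.9% personalized rate vs. 5.0% control rate.
positive = conversion_lift(69, 1000, 50, 1000)
# Hypothetical counts where the control converts better, so the lift is negative.
negative = conversion_lift(40, 1000, 50, 1000)
print(f"{positive:.0%}, {negative:.0%}")
```

A positive value means the personalized group converted at a higher rate than the control; a negative value means the opposite. Neither says anything yet about whether the difference is statistically significant.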
Significance level is a measure of how confident we are that the results of a test or experiment are accurate. It tells us how likely it is that the results we see in our sample are representative of the entire population. To help explain this, let’s set up a thought experiment.
Let’s assume that there is no actual difference between the personalized and control variations we are testing. In other words, the data that we see for the personalized and control variants are just two different random samples from the same underlying, complicated processes governing website conversion. We can now ask the question:
What is the probability that, under this assumption, we’d see the conversion lift that was observed?
This is the notion that is referred to as “significance”. In other words, if the results from an experiment make it seem unlikely that they came from the same underlying process, then we say that the results appear significant. We typically measure this probability using a p-value and choose a threshold value below which we say that the result is “statistically significant” (0.05 is a standard choice).
Said precisely, a p-value is the probability, under the assumption that the personalization has no effect, of seeing a difference between the groups’ conversions and non-conversions at least as large as the one we observed, purely through random chance.
In the screenshot above, we’re only 31% confident that the results of your experiment are reliable. We won’t consider an experience as statistically significant until your confidence level reaches 80%.
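For readers who want to see the thought experiment in numbers, here is a minimal Python sketch that computes a p-value with a two-sided two-proportion z-test. This is one common textbook approach and is used here as an assumption only. This doc does not state which test Mutiny runs or how a p-value maps to the confidence level shown in the app, and the function name and counts are hypothetical.

```python
import math


def two_proportion_p_value(
    conversions_a: int, visitors_a: int, conversions_b: int, visitors_b: int
) -> float:
    """Two-sided p-value for the difference between two conversion rates.

    Assumes both groups come from the same underlying process (no real effect)
    and uses the normal approximation, which needs reasonably large samples.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate: the best estimate of the shared rate if there is no real effect.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if standard_error == 0:
        return 1.0  # No variation at all, so there is no evidence of a difference.
    z = (rate_a - rate_b) / standard_error
    # P(|Z| >= |z|) for a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))


# The same 38% lift, observed with 1,000 and then 10,000 visitors per group.
small_sample = two_proportion_p_value(69, 1000, 50, 1000)
large_sample = two_proportion_p_value(690, 10000, 500, 10000)
print(small_sample > large_sample)
```

The same conversion lift produces a much smaller p-value when it is observed across more visitors.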
Sample size indicates the total number of people that are part of our experiment. In order for us to be confident that the results of an experiment are not due to random chance, we need a large enough sample size.
The statistics under the hood of our experiment make assumptions about having a “large enough” sample size. But what does that mean? How do we know when a sample size is “large enough”? The basic issue with small sample sizes is that rare-looking outcomes are actually fairly common. So we may get a very significant-looking result, but in fact we are being tricked, because a small sample size breaks the assumptions being made behind the scenes. A small sample can also trick us in the other direction by hiding an effect that is really there. Luckily, we can estimate the chances of being tricked and use them to determine how large a sample size is needed to be fairly certain we are not. The chance that an experiment detects a real effect of a given size, when one exists, is typically called statistical power.
Combining the notions of significance and power we can answer the question of how large a sample size needs to be in order to achieve statistical significance for a given conversion lift. The larger the conversion lift, the smaller a sample size you will need in order to reach statistical significance.
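To illustrate how lift, significance, and power combine, here is a minimal Python sketch of the standard normal-approximation sample size formula for comparing two conversion rates. It is a textbook calculation under stated assumptions, not Mutiny’s estimator: the function name, the 5% baseline rate, and the defaults of a 0.05 significance threshold and 80% power are all illustrative choices.

```python
import math
from statistics import NormalDist


def required_sample_size(
    baseline_rate: float,
    relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate visitors needed per group to detect a given relative lift."""
    control_rate = baseline_rate
    personalized_rate = baseline_rate * (1 + relative_lift)
    if not (0 < control_rate < 1 and 0 < personalized_rate < 1):
        raise ValueError("conversion rates must be strictly between 0 and 1")
    if relative_lift == 0:
        raise ValueError("a lift of zero can never be detected")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = control_rate * (1 - control_rate) + personalized_rate * (1 - personalized_rate)
    difference = personalized_rate - control_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / difference ** 2)


# Hypothetical 5% baseline conversion rate.
for lift in (0.20, 0.10):
    print(f"{lift:.0%} lift: {required_sample_size(0.05, lift)} visitors per group")
```

Halving the lift roughly quadruples the number of visitors needed, which is why small improvements take much longer to confirm.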
You can use the estimates shown in the screenshots above to understand how long your experiment will likely need to run in order to reach statistical significance. If the estimated test duration is too long, you can shorten it by increasing your sample size.
Pulling it all together
You’ll encounter different aspects of statistical significance throughout your process of building, launching, and reviewing your experiences. Below, we’ll show where in the app you may encounter these metrics, and what they all mean.
This chart shows the conversion lift between visitors that have seen personalized experiments compared to those that haven’t. A 38% conversion lift means that the conversion rate for personalized visitors is 38% higher than the conversion rate for non-personalized visitors.
Mutiny uses conversion lift to determine how many additional leads you drove compared to what would be expected in the absence of personalization. Statistical significance influences this metric, since the conversion lift Mutiny uses depends on the state of the experience:
For promoted experiences, Mutiny uses the conversion lift at the time of promotion.
For experiments, Mutiny uses the conversion lift once the experiment maintains statistical significance for at least two weeks.
500 additional leads were driven compared to what would be expected in the absence of personalization.
When building a segment, you’ll be able to take advantage of how confidence, statistical power and sample size play together.
After selecting some attributes for your segment, you’ll see estimates on the right side of the screen showing you how long we expect it will take for your test to reach statistical significance based on the conversion lift.
In this case, assuming a 20% lift in conversion rate, we expect that you’ll gain an additional 276 leads per month and that it will take you 1 week to reach statistical significance.
If we assume a smaller conversion lift, we’ll need a larger sample size (and thus, more time) to reach statistical significance:
In this case, assuming a 10% lift in conversion rate, we expect that you’ll gain an additional 138 leads per month and that it will take you 4 weeks to reach statistical significance.
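The two estimates above are consistent with some simple arithmetic, sketched below in Python. This is not Mutiny’s estimator. The baseline of 1,380 conversions per month is inferred from the numbers above (276 / 0.20) rather than stated anywhere in the app, the function names are hypothetical, and the duration rule relies on the common approximation that the required sample size grows with the inverse square of the lift.

```python
def expected_additional_leads(baseline_monthly_conversions: float, assumed_lift: float) -> float:
    """Extra leads per month if conversions grow by the assumed lift."""
    return baseline_monthly_conversions * assumed_lift


def relative_duration(reference_lift: float, new_lift: float) -> float:
    """How many times longer a test takes when the lift changes.

    Assumes steady traffic and a required sample size proportional to 1 / lift^2.
    """
    return (reference_lift / new_lift) ** 2


# Inferred baseline: 276 additional leads at a 20% lift implies about 1,380 per month.
baseline = 1380

print(round(expected_additional_leads(baseline, 0.20)))  # matches the 20% estimate
print(round(expected_additional_leads(baseline, 0.10)))  # matches the 10% estimate
print(relative_duration(0.20, 0.10))                     # 1 week becomes roughly 4 weeks
```

Halving the assumed lift halves the expected additional leads but takes roughly four times as long to confirm, which matches the move from 1 week to 4 weeks.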
Blank Conversion Lift
We won't start showing the confidence level calculation until we have "enough data". We require at least 5 conversions in control and personalized variants before we can start calculating confidence levels.
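The rule above can be sketched as a simple check in Python. The function name is hypothetical, and the sketch assumes the minimum of 5 conversions applies to each variant separately.

```python
MIN_CONVERSIONS_PER_VARIANT = 5  # threshold stated above


def has_enough_data(control_conversions: int, personalized_conversions: int) -> bool:
    """True once both variants have reached the minimum number of conversions."""
    return (
        control_conversions >= MIN_CONVERSIONS_PER_VARIANT
        and personalized_conversions >= MIN_CONVERSIONS_PER_VARIANT
    )


print(has_enough_data(control_conversions=7, personalized_conversions=3))
print(has_enough_data(control_conversions=7, personalized_conversions=5))
```

In the first call the personalized variant has only 3 conversions, so no confidence level would be shown yet; the second call meets the minimum in both variants.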
Not Yet Statistically Significant
Once we’re able to calculate conversion lift, you’ll likely see a state where your experience is not yet statistically significant. You can hover over the “i” to read a tooltip with more information.
Depending on your experience, you’ll eventually reach statistical significance. This may be indicated with a positive or negative conversion lift. In either case, you’ve learned something and can move forward with iterating on your experience!
Don't be a stranger
If you have any questions, we’re here to help! Please feel free to contact us at any time, either through Intercom chat or via firstname.lastname@example.org.