lesson
A/B Testing and Experimentation
Nick Disabato
A/B testing is only as good as the ideas you put into it. Unresearched
ideas are not as good as researched ideas, where "not as good" means
they are less likely to win and less likely to make money.
Experimentation is a powerful tool for validating design decisions and
measuring their impact on important metrics.
By applying the scientific method to your product, you can make informed
choices backed by real data.
Before we begin on this one, a bit of a disclaimer. I know a lot of people know me as the A/B testing person, and one of the number one reasons people get this course is to learn how to run an A/B test. We'll get there, don't worry. But if you skipped to this part wanting to learn how to run an A/B test and didn't look at any of the research material that I spent a lot of time working on, I'm going to suggest you're doing so at your own significant economic peril. A/B testing is only as good as the ideas you put into it, and your unresearched ideas are not as good as researched ideas. By "good," I mean they're less likely to win, they're less likely to make you as much money as you think they can, and you're playing with fire.
That's it. That's the whole argument. I don't know how to run an A/B testing program that's unresearched, so if you skipped to this part, go back, work through everything else, follow those directions, and then come back here. This is the last lesson for a reason.
It's the last major section for a reason: by now you've come in with a lot of well-researched ideas, and you want to figure out how to de-risk them when you put them in front of paying customers. That's why we're here. So for the rest of you, congratulations, you've made it to the last part. Very, very excited.
Experimentation is the process of basically applying the scientific method to design decisions. Is it going to work or is it not going to work by whatever definition of work you so desire? Will it improve conversion rate? Will it increase customer satisfaction? Will it improve upsell take rate?
Will it reduce churn? These are all valid questions. You probably come in with one of them, thinking that a given decision is going to move the needle on this metric, in this direction, for these reasons. One option is to just make that change and then measure the analytics over time; that's what happens with redesigns or big dynamic content changes.
Those are still experiments; they're just not A/B tests. An A/B test is a very specific kind of experiment. And there's no wrong way to do the former: you make a change, write it down as an annotation, remember to come back in two weeks, and then look at the period-over-period or year-over-year change.
That's a perfectly fine way to run an experiment. I don't hate it, and you can go off and do that. But if you run an A/B test instead, the data ends up quite a bit cleaner, because both versions run over materially the same time period. Macroeconomic fluctuations, day-of-the-week fluctuations, whatever phases of the moon happen to be shaping customer behavior: all of it is accounted for.
You do this with what's called an experimentation framework, which serves a JavaScript widget to each visitor and randomizes them into either the control, the original version of whatever your product was, or the variant, the thing you're trying to change. Those are specific terms of art. You build that variant, and if a visitor is randomized into it, they're shown the change; if they're in the control, they're shown the original.
Either way, each visitor is tracked through the conversion funnel so you get a clear sense of how people are actually behaving. So there's a change, and then there are goals being measured, and both of those matter a lot. You sign up for one of these frameworks; they're usually paid software, which is how these businesses stay supported.
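Under the hood, the assignment step in most of these frameworks boils down to deterministically bucketing a visitor ID so the same person always sees the same branch. Here's a minimal sketch of that idea in Python; treat it as a hypothetical illustration of the concept, not how Convert or any specific vendor actually implements it.

```python
import hashlib

def assign_variant(visitor_id: str, experiment_id: str,
                   variants: tuple[str, ...] = ("control", "variant")) -> str:
    """Deterministically bucket a visitor into one branch of an experiment.

    Hashing the visitor ID together with the experiment ID means the same
    person always lands in the same branch, and different experiments
    bucket independently of each other.
    """
    digest = hashlib.sha256(f"{experiment_id}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Example: the same visitor always gets the same answer for a given test.
print(assign_variant("visitor-12345", "homepage-headline"))
```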
There are a lot of different framework options out there. I'll provide some resources after this for the ones I recommend, because the landscape changes quite a bit over time. Then you turn your design decision into what's called a hypothesis. A hypothesis has three components. The first is the change you're making: we'll update the headline from this to that, control to variant.
The second is a given metric: it will make conversion rate change in this particular way. And the third is the minimum detectable effect: we want it to go up by x%, say 5%. You arrive at those numbers by calculating what's called a minimum sample size for the experiment. You usually don't want an experiment to run longer than a month, because it ceases to be profitable, but you do want statistically significant results to come out of it, so you need to know how many people have to come into each branch, control and variant, and for how long. I'd say the minimum run is about one week, so you can account for weekend-based fluctuations and that sort of thing, and the maximum is one month. So, you have a site. We'll go over to this palace of Times New Roman: an A/B testing sample size calculator that I've been using for over 13 years. You can tell it's made by a statistician, which is great.
What I normally do is move this up to 95%, because we want really confident tests with a lot of certainty. Then you plug in your overall conversion rate. Let's say you're an online store with a 5% conversion rate. You can see here that the default would mean an extremely massive change in everything that's going on, so let's say it's more like a 10% lift.
So 10% of 5% is 0.5 percentage points, which means you're looking for results between 4.5% and 5.5%. And you can see here that you need 50,435 people to come in per variation; double that and you get 100,870 total in order to run the test. Now, what happens when we play with this number?
Well, let's talk a little bit about what happens when you increase the minimum detectable effect. If you say it's going to be more like a 20% change, you need way fewer people to come in. That's because the experiment is either a giant home run or a disaster; anything outside of that band is one or the other, right? So if we change it to this, the required sample size gets cut to roughly a quarter, but as a result you're effectively able to run fewer experiments, because you can only really leverage the big home runs. Whereas if you go the other way and test for something really subtle, the required sample balloons, and that won't scare off somebody at an enterprise business.
Best Buy probably gets that many visitors every day. But you, looking at that number, are probably horrified by it and want to retreat back to the security blanket of a bigger effect, which makes perfect sense. High-traffic businesses can run more tests and get a higher win rate, because they can detect more subtle wins more confidently. This is how the rich get richer: they already have the traffic, and they're able to leverage that kind of subtlety. They're also, and this is key, able to segment that traffic.
So if you're running tests exclusively for new visitors, or for mobile, or for people coming in from a specific traffic source like Facebook ads, you're able to get subtler results out of everything happening here. But for us, the kind of stores and clients that I work for, they're usually somewhere around here, right? So you end up with this as your minimum detectable effect, and from it you can calculate the minimum sample size that feeds back into your hypothesis. Ultimately, your hypothesis frames the fundamental thing your experiment is meant to answer: does rolling out this design decision make sense for our business? Is it likely to move the needle for us?
Is it likely to be low-risk? Ultimately, either you launch the thing or you don't; it's a pretty binary question, right? And maybe the test is inconclusive, showing that the change doesn't move things materially, and you get a decent result anyway and roll it out because it's a usability improvement. Great. You've just de-risked the possibility of doing nothing and getting a lower-quality result.
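One more thing before we open the tool. If you'd rather sanity-check those sample-size numbers in code instead of the calculator, here's a minimal sketch using statsmodels. The 5% baseline and the 10% versus 20% relative lifts come from the example above; the 95% power setting is my assumption, standing in for the "really confident tests" knob, and since every calculator makes slightly different assumptions, the output will land in the same ballpark as the 50,435-per-branch figure rather than matching it to the digit.

```python
# Minimum sample size per branch for a two-proportion A/B test.
# Assumes statsmodels is installed: pip install statsmodels
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05   # current conversion rate: 5%
alpha = 0.05      # significance level (95% confidence)
power = 0.95      # assumption: high power, mirroring the "really confident" setting

for relative_lift in (0.10, 0.20):
    variant_rate = baseline * (1 + relative_lift)
    effect = abs(proportion_effectsize(baseline, variant_rate))  # Cohen's h
    n_per_branch = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power,
        ratio=1.0, alternative="two-sided",
    )
    print(f"{relative_lift:.0%} lift: ~{n_per_branch:,.0f} visitors per branch, "
          f"~{2 * n_per_branch:,.0f} total")

# A 20% lift needs roughly a quarter of the traffic that a 10% lift does,
# which is exactly the trade-off the calculator shows.
```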
So with that in mind, this is Convert.com. I've built a dummy project for myself, and there are no goals or anything on it yet. Basically, you go in, and here's my website. I'm just going to change some key bit of the headline, or insert a headline, so I go in and hit A/B test here. An A/A test, for comparison, is something you use to determine the overall deviation in your primary metrics.
I don't typically recommend it, because even A/A results can be pretty noisy, and it also eats into the time you have available for actively testing things. A split URL test is one that runs across several different URLs. A multivariate test is one where you're testing many different sections at once: testing three different elements multivariately means eight different variations, because it's 2 times 2 times 2, and as a result the required traffic jumps to around 400,000 visitors, which is a lot.
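To make that multivariate math concrete, here's a tiny sketch. The three page sections are hypothetical examples, and the per-branch figure is the 50,435 from the sample-size walkthrough above.

```python
# Why multivariate tests get expensive: every combination is its own branch.
sections = ["headline", "hero image", "call-to-action copy"]  # hypothetical examples
options_per_section = 2                              # original vs. changed
branches = options_per_section ** len(sections)      # 2 * 2 * 2 = 8 variations
visitors_per_branch = 50_435                         # from the earlier calculator example
print(f"{branches} variations -> {branches * visitors_per_branch:,} visitors needed")
```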
So keep that in mind. And then, yeah, that one is more of a redirect test. Here's what we're going to be doing today: you go through, and it's just going to load my website.
Great, lovely. It takes a little bit of time to load, and you get this curious hover state while it does. And there we go. You hover over the different elements, tap things, and make changes. If I wanted to change a different element or reorder things, I could: this is basically a what-you-see-is-what-you-get, or WYSIWYG, editor, and most A/B testing frameworks have this functionality. You can hide and unhide elements with CSS, and you can inject CSS and JavaScript using this little guy here, both for the entire experience and for a given variant.
But here, we're just going to change this to something like... great. A really dumb test; I don't think it's actually going to make any appreciable difference. But if I hit Save, and then create a variant out of it, I'm able to launch this, and half of the people who view the page will see "we're the best consultancy."
And if I want this to be my primary metric, I select it here, because obviously what I care about is people who are applying. It takes this element's ID and creates a goal out of it. So we have this one change and this one goal, and really you want to be testing as few changes as possible so that you know which change created which effect. You really, really don't want to do three or four things at once to try and juice the number. Remember how the multivariate test blew up into eight variations?
You'd rather do one test, then another, then another, keeping what works and throwing away what doesn't. That's the methodology you want here. So we go back to Summary. We have the JavaScript widget installed on all of our pages; you can see View Code here.
Fine, lovely. Then I change this to Active, and we have an A/B test running. We wait a little while, and then we analyze the results. Great! So how do you analyze the results?
When I go into Convert.com, it will tell me a bit about what the results are. If I want to do it the old-school statistics way, I can use what's called a chi-squared test. What's the difference between a chi-squared test and what your framework reports for you? Usually, your framework is using predictive, Bayesian A/B testing statistics, so it can kind of juice the numbers by working with smaller sample sizes, right?
But the old-school statistics way is to calculate your sample size, wait a specific amount of time based on how much traffic you're currently getting, and then run a chi-squared test on what you have. Trials is the number of visitors, and successes is the number of conversions. So if the two samples have, for the sake of argument, roughly the same numbers, great, the test comes back as roughly the same. And then we change the variant's successes to something higher.
Now, at 99% confidence, sample two is more successful. That means that in our variant, meaningfully more people clicked this cool link I have at the top, because I changed my headline to something preposterous. Then we call it and write up a report. Usually A/B testing frameworks like Convert let you export a PDF or take screenshots of what the statistics look like. And then you roll the thing out. That's it.
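If you want to run that chi-squared check yourself instead of relying on an online calculator, here's a minimal sketch using SciPy. The visitor and conversion counts are hypothetical placeholders chosen to match the 5% versus 5.5% example from earlier, not the exact numbers from the demo.

```python
# Chi-squared test on A/B results: did the variant convert at a different rate?
# Assumes SciPy is installed: pip install scipy
from scipy.stats import chi2_contingency

visitors_a, conversions_a = 50_435, 2_522   # control, ~5.0% (hypothetical counts)
visitors_b, conversions_b = 50_435, 2_774   # variant, ~5.5% (hypothetical counts)

# 2x2 table: rows are branches, columns are converted / did not convert.
table = [
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference between control and variant.")
else:
    print("No significant difference; treat the test as inconclusive.")
```

A p-value below your chosen threshold is the old-school equivalent of the framework calling a winner at that confidence level.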
That is essentially what an A/B test is. It's brainlessly easy to run, and it's an incredibly powerful tool. One caution: you do not want to peek at the results while the test is running, because that can affect the statistics. If you go to Evan Miller's page "How Not to Run an A/B Test," you can see how results can flip between statistically insignificant and significant partway through, tempting you to stop the trial early. So I really encourage not peeking at an A/B test except to bug-fix it and make sure things are working properly.
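To see why peeking is so dangerous, here's a small simulation sketch of my own (an illustration of Evan Miller's point, not his code): it runs a batch of A/A tests, where there is no real difference between branches, and checks significance at several interim looks. Checking repeatedly makes at least one "significant" result show up far more often than the roughly 5% you'd see from a single look at the end.

```python
# How peeking inflates false positives: simulate A/A tests (no real effect)
# and check significance at several interim looks instead of once at the end.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)

n_experiments = 1_000   # simulated A/A tests
n_per_branch = 20_000   # final visitors per branch
baseline = 0.05         # both branches convert at 5%
looks = 10              # number of interim peeks
alpha = 0.05

checkpoints = np.linspace(n_per_branch / looks, n_per_branch, looks).astype(int)
stopped_early, significant_at_end = 0, 0

for _ in range(n_experiments):
    a = rng.random(n_per_branch) < baseline
    b = rng.random(n_per_branch) < baseline
    peeks = []
    for n in checkpoints:
        ca, cb = int(a[:n].sum()), int(b[:n].sum())
        table = [[ca, n - ca], [cb, n - cb]]
        _, p, _, _ = chi2_contingency(table)
        peeks.append(p < alpha)
    stopped_early += any(peeks)       # "significant" at any peek
    significant_at_end += peeks[-1]   # significant only at the final look

print(f"False positive rate with peeking:  {stopped_early / n_experiments:.1%}")
print(f"False positive rate, single look:  {significant_at_end / n_experiments:.1%}")
```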
When you are bug-fixing these experiments, you can hit the preview link here, and it gives you two different kinds of preview links: one basically re-renders the change in the existing preview, and the other forces a cookie on the live site for a forced variation. Different frameworks go about this differently, but ultimately this is how you end up launching on a framework like Convert.com. What you get out of all of this is clarity on what you should be doing next, and running an A/B test and getting real experimental data in is the most forceful way to earn that clarity.
And if you've done it right, with all the data reporting in and a decent level of confidence, 95% or above, then you'll be able to roll out the winner. I think a lot of people are mystified by this process. Ultimately, it's a matter of getting the tool, configuring your goals, and putting the right hypotheses in, and that involves researching the right design decisions. And ultimately, you'll want to do this as frequently as possible.
You want to maximize your active testing time. So while one batch of tests is running, I'm already building out the next one, so that the moment I stop these tests I can start the next and leverage my experimentation time as much as humanly possible. You do this through research, you do this through prioritization, and once you've got everything prioritized, you work on the execution side of things. This is possible, and I think it's possible for a lot of people.
It's really, really simple to put these things together. But it's prone to bugs and to errors in data analysis, and usually the problem is human. With that in mind, that's how to run A/B tests, and we'll wrap up with our next video and talk about next steps for you.