This is the blog post that led to Myna. Sign up now and help us beta test the world’s fastest A/B testing product!
 
Were I a betting man, I would wager this: the supermarket nearest to you is laid out with fresh fruit and vegetables near the entrance, and dairy and bread towards the back of the shop. I’m quite certain I’d win this bet enough times to make it worthwhile. This layout is, of course, no accident. By placing essentials in the corners, the store forces shoppers to traverse the entire floor to get their weekly shop. This increases the chance of an impulse purchase and hence the store’s revenue.
 
I don’t know who developed this layout, but at some point someone must have tested it and it obviously worked. The same idea applies online, where it is incredibly easy to change the “layout” of a store. Where the supermarket might shuffle around displays or change the lighting, the online retailer might change the navigational structure or wording of their landing page. I call this process content optimisation.
 
Any prospective change should be tested to ensure it has a positive effect on revenue (or some other measure, such as clickthroughs). The industry standard method for doing this is A/B testing. However, it is well known in the academic community that A/B testing is significantly suboptimal. In this post I’m going to explain why, and how you can do better.
There are several problems with A/B testing:

- A/B testing is suboptimal. It simply doesn’t increase revenue as much as better methods.
- A/B testing is inflexible. You can’t, for example, add a new choice to an already running test.
- A/B testing has a tedious workflow. To do it correctly, you have to make lots of seemingly arbitrary choices (p-value, experiment size) to run an experiment.
 
The methods I’m going to describe, which are known as bandit algorithms, solve all these problems. But first, let’s look at the problems of A/B testing in more detail.
 Suboptimal Performance
 
Explaining the suboptimal performance of A/B testing is tricky without getting into a bit of statistics. Instead of doing that, I’m going to describe the essence of the problem in a (hopefully) intuitive way. Let’s start by outlining the basic A/B testing scenario, so there is no confusion. In the simplest situation there are two choices, A and B, under test. Normally one of them is already running on the site (let’s call that one A), and the other (B) is what we’re considering replacing A with. We run an experiment and then look for a significant difference, where I mean significance in the statistical sense. If B is significantly better we replace A with B; otherwise we keep A on the site.
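To make that concrete, here is a minimal sketch of the significance test this usually boils down to: a two-proportion z-test on conversion counts. The visitor and conversion numbers are invented purely for illustration.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical results: conversions out of visitors for each choice.
conv_a, n_a = 120, 4000   # choice A (currently on the site)
conv_b, n_b = 145, 4000   # choice B (the proposed change)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se                                # test statistic
p_value = 2 * (1 - norm.cdf(abs(z)))                # two-sided p-value

if p_value < 0.05:
    print(f"Significant (p = {p_value:.3f}): replace A with B")
else:
    print(f"Not significant (p = {p_value:.3f}): keep A")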
 
The key problem with A/B testing is that it doesn’t respect what the significance test is actually saying. When a test shows B is significantly better than A, it is right to throw out A. However, when there is no significant difference the test is not saying that B is no better than A, but rather that the data does not support any conclusion. A might be better than B, B might be better than A, or they might be the same. We just can’t tell with the data that is available*. It might seem we could just run the test until a significant result appears, but that runs into the problem of repeated significance testing errors. Oh dear! Whatever we do, if we stick exclusively with A/B testing we’re going to make mistakes, and probably more than we realise.
 
A/B testing is also suboptimal in another way: it doesn’t take advantage of information gained during the trial. Every time you display a choice you get information, such as a click, a purchase, or an indifferent user leaving your site. This information is really valuable, and you could make use of it in your test, but A/B testing simply discards it. There are good statistical reasons not to use information gained during a trial within the A/B testing framework, but if we step outside that framework we can.
 
* Technically, the reason for this is that the probability of a type II error increases as the probability of a type I error decreases. We control the probability of a type I error with the p-value, and this is typically set low. So if we drop option B when the test is not significant we have a high probability of making a type II error.
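Here is a rough sketch of that trade-off. The experiment size, the true conversion rates (which the experimenter never knows), and the significance levels are all made-up assumptions for illustration: for a fixed experiment, tightening the type I error rate (smaller alpha) pushes up the type II error rate (beta).

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical setup: A truly converts at 3.0%, B at 3.5%,
# and we run 4000 visitors per arm.
p_a, p_b, n = 0.030, 0.035, 4000

se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha / 2)                      # two-sided critical value
    power = 1 - norm.cdf(z_crit - (p_b - p_a) / se)       # approximate power
    beta = 1 - power                                      # P(type II error)
    print(f"alpha = {alpha:.2f}  ->  beta ~ {beta:.2f}")
```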
 Inflexible
 
The A/B testing setup is very rigid. You can’t add new choices to the test, so you can’t, say, test the best news item to display on the front page of a site. You can’t dynamically adjust what you display based on information you have about the user — say, what they purchased last time they visited. You also can’t easily test more than two choices.
 Workflow
 
To set up an A/B experiment you need to choose the significance level and the number of trials. These choices are often arbitrary, but they can have a major impact on results. You then need to monitor the experiment and, when it concludes, implement the results. There are a lot of manual steps in this workflow.
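To give a feel for how those up-front choices drive the experiment size, here is a sketch of a standard sample-size calculation for comparing two conversion rates. The baseline rate, minimum uplift worth detecting, significance level, and power below are assumptions picked for the example, not recommendations.

```python
from math import ceil, sqrt
from scipy.stats import norm

# Made-up inputs: all of these must be chosen before the experiment starts.
p_a = 0.030       # baseline conversion rate of choice A
p_b = 0.035       # smallest uplift for B we care about detecting
alpha = 0.05      # significance level (type I error rate)
power = 0.80      # desired power (1 - type II error rate)

z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value
z_beta = norm.ppf(power)

# Standard approximation for the visitors needed *per arm*.
n = ((z_alpha + z_beta) ** 2
     * (p_a * (1 - p_a) + p_b * (1 - p_b))
     / (p_b - p_a) ** 2)

print(f"Roughly {ceil(n)} visitors per arm")   # ~20,000 with these numbers
```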
 Make out like a Bandit
 
Algorithms for solving the so-called bandit problem address all the problems with A/B testing. To summarise, they give optimal results (to within constant factors), they are very flexible, and they have a fire-and-forget workflow.
 
So, what is the bandit problem? You have a set of choices you can make. On the web these could be different images to display, or different wordings for a button, and so on. Each time you make a choice you get a reward. For example, you might get a reward of 1 if a button is clicked, and a reward of 0 otherwise. Your goal is to maximise your total reward over time. This clearly fits the content optimisation problem.
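As a concrete framing, here is a minimal sketch of that setup in code. The button wordings and click-through rates are invented for the example; the point is just the shape of the problem: repeatedly pick a choice, observe a 0/1 reward, and try to make the running total as large as possible.

```python
import random

# Hypothetical choices and their click-through rates (unknown to the algorithm).
CLICK_RATES = {"Buy now": 0.04, "Add to basket": 0.05, "Get started": 0.06}

def show(choice: str) -> int:
    """Display a choice to a visitor and return a reward: 1 = click, 0 = no click."""
    return 1 if random.random() < CLICK_RATES[choice] else 0

# A bandit algorithm's job is to decide which choice to show each visitor
# so that this total ends up as large as possible.
total_reward = 0
for _ in range(10_000):
    choice = random.choice(list(CLICK_RATES))   # placeholder policy: pick at random
    total_reward += show(choice)

print("Total clicks:", total_reward)
```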
 
The bandit problem has been studied for over 50 years, but only in the last ten years have practical algorithms been developed. We studied one such paper in UU. The particular details of the algorithm we studied are not important (if you are interested, read the paper – it’s very simple); here I want to focus on the general principles of bandit algorithms.
 
The first point is that the bandit problem explicitly includes the idea that we make use of information as it arrives. This leads to what is called the exploration-exploitation dilemma: do we try many different choices to gain a better estimate of their reward (exploration) or try the choices that have worked well in the past (exploitation)?
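One of the simplest ways to see the dilemma in code is the epsilon-greedy strategy (a common textbook illustration, not necessarily the algorithm from the paper mentioned above): with a small probability we explore a random choice, otherwise we exploit the choice with the best observed click rate so far. The choices and click rates are again made up.

```python
import random

CLICK_RATES = {"Buy now": 0.04, "Add to basket": 0.05, "Get started": 0.06}
EPSILON = 0.1   # fraction of visitors used for exploration

shows = {c: 0 for c in CLICK_RATES}    # times each choice was displayed
clicks = {c: 0 for c in CLICK_RATES}   # clicks each choice received

def estimated_rate(c: str) -> float:
    return clicks[c] / shows[c] if shows[c] else 0.0

total_reward = 0
for _ in range(10_000):
    if random.random() < EPSILON:
        choice = random.choice(list(CLICK_RATES))       # explore
    else:
        choice = max(CLICK_RATES, key=estimated_rate)   # exploit
    reward = 1 if random.random() < CLICK_RATES[choice] else 0
    shows[choice] += 1
    clicks[choice] += reward
    total_reward += reward

print("Total clicks:", total_reward)
print({c: round(estimated_rate(c), 3) for c in CLICK_RATES})
```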
 
The performance of an algorithm is typically measured by its regret, which is the average difference between its actual performance and the best possible performance. It has been shown that the best possible regret increases logarithmically with the number of choices made, and modern bandit algorithms are optimal (see the UU paper, for instance).
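To make regret concrete, here is a sketch that estimates it for UCB1, one well-known bandit algorithm with a logarithmic regret guarantee (again, not necessarily the algorithm from the paper above). Regret is estimated as the expected clicks the best choice would have earned minus the clicks the algorithm actually earned, using the same invented click rates as before.

```python
import math
import random

CLICK_RATES = {"Buy now": 0.04, "Add to basket": 0.05, "Get started": 0.06}
choices = list(CLICK_RATES)

shows = {c: 0 for c in choices}
clicks = {c: 0 for c in choices}

def ucb1_score(total_shows: int, c: str) -> float:
    # Mean reward plus an exploration bonus that shrinks as a choice is shown more.
    if shows[c] == 0:
        return float("inf")                    # show every choice at least once
    mean = clicks[c] / shows[c]
    return mean + math.sqrt(2 * math.log(total_shows) / shows[c])

total_reward = 0
n_visitors = 10_000
for t in range(1, n_visitors + 1):
    choice = max(choices, key=lambda c: ucb1_score(t, c))
    reward = 1 if random.random() < CLICK_RATES[choice] else 0
    shows[choice] += 1
    clicks[choice] += reward
    total_reward += reward

# Estimated regret: expected clicks of the best choice minus clicks actually earned.
best_possible = n_visitors * max(CLICK_RATES.values())
print("Estimated regret:", best_possible - total_reward)
```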
 
Bandit algorithms are very flexible. They can deal with as many choices as necessary. Variants of the basic algorithms can handle addition and removal of choices, selection of the best k choices, and exploitation of information known about the visitor.
 
Bandits are also simple to use. Many of the algorithms have no parameters to set, and unlike A/B testing there is no need to monitor them; they will continue working indefinitely.
 
Finally, we know bandits work on the web, as much of the current research on them is coming out of Google, Microsoft, Yahoo!, and other big Internet companies.
 
So there you have it. Stop wasting time on A/B testing and make out like a bandit!
 Join Our Merry Band
 
Finally, you probably won’t be surprised to hear we are developing a content optimisation system based on bandit algorithms. I am giving a talk on this at the Multipack Show and Tell in Birmingham this Saturday.
 
We are currently building a prototype, and are looking for people to help us evaluate it. If you want more information, or would like to get involved, get in touch and we’ll let you know when we’re ready to go.
Update: In case you missed it at the top, Myna is our content optimisation system based on bandit algorithms and we’re accepting beta users right now!