Our friends at Loho are looking for an intern over summer. We’ve previously blogged about the (fun and interesting) work we’ve done for Loho: using Scala, Lift, and a heavy does of Javascript we’ve created a really simple interface for setting up complex VoIP services. (All credit goes to Alex for the great concept — our work is simply implementing his vision.) The internship could extend this work, or address other parts of the stack right down to mucking around with VoIP hardware. It’s an awesome chance to play with a lot of exciting technology. If you fit the bill (student, free for 8-10 weeks over summer, can relocate to Cambridge) get in touch with Alex at Loho. If you get the internship and decide to focus on the Scala/Lift parts we’ll certainly help you get to grips with the code base.
Posts in the ‘General’ category
Internship This Summer at Loho
Monday, May 23rd, 2011Stop A/B Testing and Make Out Like a Bandit
Friday, February 11th, 2011Update: The system described below is now live, and we’re accepting beta users!
Were I a betting man, I would wager this: the supermarket nearest to you is laid out with fresh fruit and vegetables near the entrance, and dairy and bread towards the back of the shop. I’m quite certain I’d win this bet enough times to make it worthwhile. This layout is, of course, no accident. By placing essentials in the corners, the store forces shoppers to traverse the entire floor to get their weekly shop. This increases the chance of an impulse purchase and hence the store’s revenue.
I don’t know who developed this layout, but at some point someone must have tested it and it obviously worked. The same idea applies online, where it is incredibly easy to change the “layout” of a store. Where the supermarket might shuffle around displays or change the lighting, the online retailer might change the navigational structure or wording of their landing page. I call this process content optimisation.
Any prospective change should be tested to ensure it has a positive effect on revenue (or some other measure, such as clickthroughs). The industry standard method for doing this is A/B testing. However, it is well known in the academic community that A/B testing is significantly suboptimal. In this post I’m going to explain why, and how you can do better.
There are several problems with A/B testing:
- A/B testing is suboptimal. It simply doesn’t increase revenue as much as better methods.
- A/B testing is inflexible. You can’t, for example, add a new choice to an already running test.
- A/B testing has a tedious workflow. To do it correctly, you have to make lots of seemingly arbitrary choices (p-value, experiment size) to run an experiment.
The methods I’m going to describe, which are known as bandit algorithms, solve all these problems. But first, let’s look at the problems of A/B testing in more detail.
Suboptimal Performance
Explaining the suboptimal performance of A/B testing is tricky without getting into a bit of statistics. Instead of doing that, I’m going to describe the essence of the problem in a (hopefully) intuitive way. Let’s start by outlining the basic A/B testing scenario, so there is no confusion. In the simplest situation are two choices, A and B, under test. Normally one of them is already running on the site (let’s call that one A), and the other (B) is what we’re considering replacing A with. We run an experiment and then look for a significant difference, where I mean significance in the statistical sense. If B is significantly better we replace A with B, otherwise we keep A on the site.
The key problem with A/B testing is it doesn’t respect what the significance test is actually saying. When a test shows B is significantly better than A, it is right to throw out A. However, when there is no significant difference the test is not saying that B is no better than A, but rather that the data does not support any conclusion. A might be better than B, B might be better than A, or they might be the same. We just can’t tell with the data that is available*. It might seem we could just run the test until a significant result appears, but that runs into the problem of repeated significance testing errors. Oh dear! Whatever we do, if we stick exclusively with A/B testing we’re going to make mistakes, and probably more than we realise.
A/B testing is also suboptimal in another way — it doesn’t take advantage of information gained during the trial. Every time you display a choice you get information, such as a click, a purchase, or an indifferent user leaving your site. This information is really valuable, and you could make use of it in your test, but A/B testing simply discards it. There are good statistical reasons to not use information gained during a trial within the A/B testing framework, but if we step outside that framework we can.
* Technically, the reason for this is that the probability of a type II error increases as the probability of a type I error decreases. We control the probability of a type I error with the p-value, and this is typically set low. So if we drop option B when the test is not significant we have a high probability of making a type II error.
Inflexible
The A/B testing setup is very rigid. You can’t add new choices to the test, so you can’t, say, test the best news item to display on the front page of a site. You can’t dynamically adjust what you display based on information you have about the user — say, what they purchased last time they visited. You also can’t easily test more than two choices.
Workflow
To setup an A/B experiment you need to choose the significance level and the number of trials. These choices are often arbitrary, but they can have a major impact on results. You then need to monitor the experiment and, when it concludes, implement the results. There are a lot of manual steps in this workflow.
Make out like a Bandit
Algorithms for solving the so-called bandit problem address all the problems with A/B testing. To summarise, they give optimal results (to within constant factors), they are very flexible, and they have a fire-and-forget workflow.
So, what is the bandit problem? You have a set of choices you can make. On the web these could be different images to display, or different wordings for a button, and so on. Each time you make a choice you get a reward. For example, you might get a reward of 1 if a button is clicked, and reward of 0 otherwise. Your goal is to maximise your total reward over time. This clearly fits the content optimisation problem.
The bandit problem has been studied for over 50 years, but only in the last ten years have practical algorithms been developed. We studied one such paper in UU. The particular details of the algorithm we studied are not important (if you are interested, read the paper – it’s very simple); here I want to focus on the general principles of bandit algorithms.
The first point is that the bandit problem explicitly includes the idea that we make use of information as it arrives. This leads to what is called the exploration-exploitation dilemma: do we try many different choices to gain a better estimate of their reward (exploration) or try the choices that have worked well in the past (exploitation)?
The performance of an algorithm is typically measured by its regret, which is the average difference between its actual performance and the best possible performance. It has been shown that the best possible regret increases logarithmically with the number of choices made, and modern bandit algorithms are optimal (see the UU paper, for instance).
Bandit algorithms are very flexible. They can deal with as many choices as necessary. Variants of the basic algorithms can handle addition and removal of choices, selection of the best k choices, and exploitation of information known about the visitor.
Bandits are also simple to use. Many of the algorithms have no parameters to set, and unlike A/B testing there is no need to monitor them — they will continue working indefinitely.
Finally, we know bandits work on the web, as much of the current research on them is coming out of Google, Microsoft, Yahoo!, and other big Internet companies.
So there you have it. Stop wasting time on A/B testing and make out like a bandit!
Join Our Merry Band
Finally, you probably won’t be surprised to hear we are developing a content optimisation system based on bandit algorithms. I am giving a talk on this at the Multipack Show and Tell in Birmingham this Saturday.
We are currently building a prototype, and are looking for people to help us evaluate it. If you want more information, or would like to get involved, get in touch and we’ll let you know when we’re ready to go.
Update: In case you missed it at the top, Myna is our content optimisation system based on bandit algorithms and we’re accepting beta users right now!
The University of Untyped
Monday, January 10th, 2011We’ve recently started a reading group at Untyped. As consultants we need to maintain our expertise, so every Friday we tackle something new for a few hours. Given our love of Universities (we average three degrees per Untypist) and our even greater love of grandiose corporate training (hello, Hamburger University!) we have named this program Untyped University.
Broadly, we’re covering the business of the web and the business of building the web. The online business is, from certain angles, quite simple. The vast majority of businesses can be viewed as a big pipeline, sucking in visitors from the Internet-at-large, presenting some message to the user, and then hoping they click “Buy”. At each stage of the pipeline people drop out. They drop out right at the beginning if the site isn’t ranked high enough on search terms or has poorly targetted ads. They abandon the website if the design is wrong, or the site is slow, or the offer isn’t targeted correctly. Each step of this pipeline has tools and techniques that can be used to retain users, which we’ll be covering. The flipside of this is the pipeline that delivers the site, starting with data stores, going through application servers, and finishing at the browser or other client interface. Here we’ll be looking at the technologies and patterns for building great sites.
So far we’ve run a couple of sessions. The first covered bandit algorithms, and the second Amazon’s Dynamo. We’ll blog about these soon. We’ve started a Mendeley group to store our reading (though not everything we cover in future will be in published form.) Do join in if it takes your fancy!
Formalising Bonds with the Informal
Tuesday, April 20th, 2010There is an interesting move underway by the US Securities and Exchange Commission (SEC) to more precisely define the meaning of certain asset backed securities (like the now infamous mortgage-backed securities that were central to the recent crash). The NY times has covered the story from a high level, but what of particular interest to me is the proposal to specify the meaning of the bonds in Python. This is a step is the right direction but Python is not the answer.
The core problem here is to give a clear and unambiguous meaning to a bond. This requires the language in which it is written is precisely defined. Python is not precisely defined. There is only a prose definition of the language, which is inadequate in the same way that the prose definitions of bonds are inadequate, and of course there are differences between various versions and implementation of Python. Since Python is not precisely defined the only meaning one can give to a program in Python is whatever the particular implementation one uses does with that program.
In contrast there are languages that are formally defined, such Standard ML and Scheme. These would make a sound basis for the formal definition of bonds. In turns out that functional languages also make a good (meaning expressive and convenient) basis for the formal definition of bonds. There is a great paper on expressing contracts in Haskell and at least one company has implemented this idea in a commercial system (in O’Caml, I believe). So my advice to the SEC: use an appropriate subset of Scheme or Standard ML, or hire someone to create a formally defined DSL, but don’t use a language without a formal definition if precision is your goal.
Siesta time
Tuesday, November 4th, 2008When I went travelling in Spain I had a siesta just about every day. There are very practical reasons for doing so: it is so damn hot in the middle of the day, and, despite being very close to the Prime Meridian, Spain is on +2GMT in summer so the evenings last forever. Another benefit of siestas: I felt great!
This little anecdote is designed to entice to view
this graphic from The Boston Globe. Within you’ll find everything you ever wanted to know about napping. Now a lot of it shouldn’t come as a surprise. Everyone knows the mid-afternoon lull (at Untyped Midlands it tends to lead to a frenzy of piano playing or drumming, for reasons I don’t understand) but few of us heed the urge to sleep. Perhaps we should. Remember to plan your naps: either get a full cycle (1.5 hours) or stop your nap after about 45 minutes. If you wake in the middle of deep sleep you’ll feel terrible.
If you have problems getting to sleep, I recommend a cat as a snoozing companion. They’re always ready for a nap and purring is very relaxing. Furthermore, a good alarm cat will stop your afternoon nap extending too close to dinner time.
Of Interest 09/09/2008
Tuesday, September 9th, 2008- Vroom! Vroom! (read if you know some Haskell)
- The user interface documentation for Chrome, Google’s browser, goes into some detail justifying the design choices, and so makes for interesting reading. The design claims to gain inspiration from the games WipEout and WipEout 2097.
Of Interest 14/08/2008
Thursday, August 14th, 2008- Gorgeous pictures of Afghanistan, taken in the 1970s and 2000-2003.
- I found this article on path finding in games interesting. I was amazed to see how broken path finding using way-points is, yet in my (quite limited) experience playing many of the games featured I can’t recall running into any problems. It really does show that bar for good enough is quite low, at least for the casual gamer. I was also surprised that the better technique, a navigation mesh, isn’t used more frequently. It is so simple.
- How much would you pay for an Olympic medal? For the Australian sports program, a bronze costs $15 million and silver or gold cost a cool $40 million. The (very limited) data suggests medals scale linearly with investment.
Undeleting Files on the Mac
Thursday, July 24th, 2008I spent a good portion of last week attempting to recover about 30GB of movies that had been deleted from a Mac with a 60GB hard disk. When a file is deleted its normally left intact on the hard disk except for a marker saying its space can be reused. This means that deleted files can be fairly reliably recovered, so long as the space hasn’t since been used for other purposes. We found the movies were missing only a few days after they were deleted, and they took up half the hard disk, so I was fairly confident they could be in part recovered.
Of course that’s great in theory but in practice how I was I going to recover those files? A quick bit of Googling discovered three programs that will attempt to recover deleted files on the Mac: Boomerang, FileSalvage, and Data Rescue II. I downloaded a trial copy of each and set to work. Here’s how they performed:
- Boomerang ran very quickly but only found some 29MB of the missing 30GB of movies. Of the three programs I tested it is the easiest to use, with only a few options for the most common problems. It also does a better job of sticking to the Mac interface conventions than the other two.
- FileSalvage took many hours to search the hard disk. It found lots of files, but it didn’t identify many as fragments of movies. Additionally the interface is very clunky. It doesn’t use the standard Mac widgets and selecting file types in Expert mode is a real pain.
- Data Rescue II ran fairly quickly and found almost all the lost data. Success! It is fairly easy to use. The guided standard mode does a good job of leading you through the recovery process, and the many options in Expert mode are explained well. Like the other programs it uses non-standard widgets, and this needlessly detracts from its usability.
So in my testing Data Rescue II was the clear winner. Don’t read too much into this, as I was only looking for movie data; one of the other programs might work better for a different type of file. However, if you’ve deleted some files that you want to recover I would start with Data Rescue II, then try Boomerang, and only then try FileSalvage (and go to bed while it’s running). Finally, if you have two Macs a firewire cable and target disk mode will make the whole recovery process a bit simpler.
Now what I want to know is: why would a Mac developer invent their own user interface widgets unless they really want that amateur feel to their product? Is there something about Cocoa programming that makes it easier to create, say, your own tab component than use the system one?
Of Interest 06/06/2008
Friday, June 6th, 2008- Ravelry, the knitting social network, raises $71K from its users. First amazing thing (to me, a non-knitter) is that Ravelry even exists. There truly is a place for everyone on the big ol’ Internet, and with each community a corresponding business opportunity. Second interesting thing is that the donation drive was user initiated. I wonder if it will be a regular occurrence. I like the idea of community supported social networks, but I doubt it is a sustainable model. Indeed Ravelry’s primary sources of income are advertising and affiliation fees
- Lessons on building online communities from the people behind Flickr. Two key points: Firstly, it takes a lot of effort to grow a community, and the creators have to get involved in that early stage. Later on, it’s best to get out of the way and let the members decide for themselves what the community is like. The discussion of Flickr’s design, intentionally personal but unobtrusive, is a great point and reminds me of the service at the best restaurants.
- If I understand Clay Shirky correctly, drinking gin is the original form of blogging. Bottoms up! Seriously, at least skim read the text. Note that some evidence suggests at least part of thesis does not hold.
Update: More thoughts on Ravelry
Of Interest 07/03/2008
Friday, March 7th, 2008- TRIZ is a methodology for creativity. Worth looking at.
- The iPhone SDK has got every geek drooling over his keyboard, and for good reason. I think this model will work.
- Is 1000 True Fans the path to happiness? The argument goes that 1000 true fans will provide sufficient income for an artist, and that isn’t that many people. Assuming your work requires only you to produce it this seems a plausible number, but I think the scarcity of true fans is a big roadblock. You might need 100,000 fans before you get 1000 who are really into what you do.
- Looking at what you can do with rather more than 1000 true fans, interesting workplace experiments from 37Signals. I like the four-day week; funding passions is too normative; the release of the iPhone SDK means discretionary spending accounts gets the thumbs up from Dave.