Cliff’s plugin lets you test and optimise the content of your WordPress posts and pages. Why not try it on your web site?
We want to try something a bit more interactive with Untyped University, our pretentiously titled training program. Instead of just posting papers to Mendeley we’re going to hang out on G+. This should allow for easy discussion with our peers, which is to say: you.
For this session we’re going to hack on Opa. Normally we read through a paper, but we think hacking will work better in this medium. The goal is not (necessarily) to write something useful in Opa but rather to understand its model for web development. We’re not seeking to advocate Opa, nor are we experts in the language.
As G+ doesn’t yet support organisations, get in touch with me (email noel at untyped, or message Noel Welsh on G+) and ask to be added to my UU circle. We’ll be online on Friday 9 Sept from 13:37 (GMT+1), and will invite everyone in the circle to the hangout. See you there!
Twelve thousand hits, some thirty emails, and over a dozen new beta testers. That’s what happened when a blog post of ours spent ten hours on the Hacker News front page. It was definitely fun getting all that attention, despite the rush of traffic taking our little server off the web for a while. (Installing WP-Cache brought it back.)
Myna is the system described in the blog post, and we’re accepting beta users right now. If you’re interested in content optimisation on your website, and want better results than A/B testing will deliver, do take a look. Obviously getting this surge of traffic from HN is incredibly valuable to us. However, I don’t have any suggestions for repeating the event: when I submitted the blog post to HN some months ago it disappeared without a trace. Certainly being active answering questions on HN helped keep it on the front page, and that position netted us a fairly steady thousand hits an hour.
If you’re one of the people who read our blog post, thanks for the interest! It’s very exciting for us to know that our idea for improving content optimisation resonates with so many people, and we’re looking forward to getting Myna out of beta and seeing where it takes us.
We’ve recently completed a very fun and interesting job working on a new interface for managing VoIP phone systems. We have a VoIP phone, provided by Loho, who were also our client for this project. It’s great — we can forward calls to our mobiles, cart the phone around with us (plug it into a network connection and it just works), and it even emails us our voice messages. The only thing not great about our phone is the configuration interface. Luckily, that’s what this project set out to solve.
The brief was to implement an elegant online phone configuration system. Alex, Director at Loho, provided the vision. We provided two weeks of development time, which was enough to create a working prototype. Alex has asked us to not give away too many details about the system, but I can show you a few screenshots. First up, here’s the main screen:
The very stylish main menu of the VoIP administration tool we’ve built for Loho.
Doesn’t give away much, does it? A bit more interesting is a detail of editing a configuration:
Also very stylish: editing the configuration of a voice menu
Here I’m editing a voice menu — one of those “Press 1 if you’re interested in giving us all your money” type things.
We think we’ve created a very nice system. Loho tell us they were overwhelmed with interest at a recent trade fair, suggesting we’re not alone in our opinion. While the interface is an important aspect of the work, the backend (which I can talk about!) is just as important. The main task was defining a data model to capture the rich feature set that Loho provide. This turned out to be very similar to designing a programming language and its intermediate representation. For example, we use a continuation-passing style representation to avoid maintaining a stack on the server side. Our representation distinguishes between tail calls and normal function calls to avoid excessive resource consumption on the VoIP side. Relational databases don’t do a very good job of storing recursive data structures, like the AST of a programming language, so we used Mongo for the data store. In addition to its flexible data model, Mongo is web scale, which has given us an immediate status boost at local programmer meetups.
The backend code is implemented in Scala and Lift. There are actually two interfaces to the service. One is the nice interface the users see, and the other is a REST interface that is called by the Asterisk AGI scripts that implement the VoIP functionality. The Asterisk system doesn’t handle all the functionality we represent internally, so the REST interface includes a small interpreter that executes intermediate steps until we arrive at something Asterisk can handle.
This is the blog post that led to Myna. Sign up now and help us beta test the world’s fastest A/B testing product!
Were I a betting man, I would wager this: the supermarket nearest to you is laid out with fresh fruit and vegetables near the entrance, and dairy and bread towards the back of the shop. I’m quite certain I’d win this bet enough times to make it worthwhile. This layout is, of course, no accident. By placing essentials in the corners, the store forces shoppers to traverse the entire floor to get their weekly shop. This increases the chance of an impulse purchase and hence the store’s revenue.
I don’t know who developed this layout, but at some point someone must have tested it and it obviously worked. The same idea applies online, where it is incredibly easy to change the “layout” of a store. Where the supermarket might shuffle around displays or change the lighting, the online retailer might change the navigational structure or wording of their landing page. I call this process content optimisation.
Any prospective change should be tested to ensure it has a positive effect on revenue (or some other measure, such as clickthroughs). The industry standard method for doing this is A/B testing. However, it is well known in the academic community that A/B testing is significantly suboptimal. In this post I’m going to explain why, and how you can do better.
There are several problems with A/B testing, which I’ll describe in detail below.
The methods I’m going to describe, which are known as bandit algorithms, solve all these problems. But first, let’s look at the problems of A/B testing in more detail.
Explaining the suboptimal performance of A/B testing is tricky without getting into a bit of statistics. Instead of doing that, I’m going to describe the essence of the problem in a (hopefully) intuitive way. Let’s start by outlining the basic A/B testing scenario, so there is no confusion. In the simplest situation there are two choices, A and B, under test. Normally one of them is already running on the site (let’s call that one A), and the other (B) is what we’re considering replacing A with. We run an experiment and then look for a significant difference, where I mean significance in the statistical sense. If B is significantly better we replace A with B, otherwise we keep A on the site.
The key problem with A/B testing is it doesn’t respect what the significance test is actually saying. When a test shows B is significantly better than A, it is right to throw out A. However, when there is no significant difference the test is not saying that B is no better than A, but rather that the data does not support any conclusion. A might be better than B, B might be better than A, or they might be the same. We just can’t tell with the data that is available*. It might seem we could just run the test until a significant result appears, but that runs into the problem of repeated significance testing errors. Oh dear! Whatever we do, if we stick exclusively with A/B testing we’re going to make mistakes, and probably more than we realise.
A/B testing is also suboptimal in another way — it doesn’t take advantage of information gained during the trial. Every time you display a choice you get information, such as a click, a purchase, or an indifferent user leaving your site. This information is really valuable, and you could make use of it in your test, but A/B testing simply discards it. There are good statistical reasons to not use information gained during a trial within the A/B testing framework, but if we step outside that framework we can.
* Technically, the reason for this is that the probability of a type II error increases as the probability of a type I error decreases. We control the probability of a type I error with the p-value, and this is typically set low. So if we drop option B when the test is not significant we have a high probability of making a type II error.
The A/B testing setup is very rigid. You can’t add new choices to the test, so you can’t, say, test the best news item to display on the front page of a site. You can’t dynamically adjust what you display based on information you have about the user — say, what they purchased last time they visited. You also can’t easily test more than two choices.
To set up an A/B experiment you need to choose the significance level and the number of trials. These choices are often arbitrary, but they can have a major impact on results. You then need to monitor the experiment and, when it concludes, implement the results. There are a lot of manual steps in this workflow.
Algorithms for solving the so-called bandit problem address all the problems with A/B testing. To summarise, they give optimal results (to within constant factors), they are very flexible, and they have a fire-and-forget workflow.
So, what is the bandit problem? You have a set of choices you can make. On the web these could be different images to display, or different wordings for a button, and so on. Each time you make a choice you get a reward. For example, you might get a reward of 1 if a button is clicked, and reward of 0 otherwise. Your goal is to maximise your total reward over time. This clearly fits the content optimisation problem.
The bandit problem has been studied for over 50 years, but only in the last ten years have practical algorithms been developed. We studied one such paper in UU. The particular details of the algorithm we studied are not important (if you are interested, read the paper – it’s very simple); here I want to focus on the general principles of bandit algorithms.
The first point is that the bandit problem explicitly includes the idea that we make use of information as it arrives. This leads to what is called the exploration-exploitation dilemma: do we try many different choices to gain a better estimate of their reward (exploration) or try the choices that have worked well in the past (exploitation)?
The performance of an algorithm is typically measured by its regret, which is the average difference between its actual performance and the best possible performance. It has been shown that the best possible regret increases logarithmically with the number of choices made, and modern bandit algorithms are optimal (see the UU paper, for instance).
Bandit algorithms are very flexible. They can deal with as many choices as necessary. Variants of the basic algorithms can handle addition and removal of choices, selection of the best k choices, and exploitation of information known about the visitor.
Bandits are also simple to use. Many of the algorithms have no parameters to set, and unlike A/B testing there is no need to monitor them — they will continue working indefinitely.
So there you have it. Stop wasting time on A/B testing and make out like a bandit!
Finally, you probably won’t be surprised to hear we are developing a content optimisation system based on bandit algorithms. I am giving a talk on this at the Multipack Show and Tell in Birmingham this Saturday.
We are currently building a prototype, and are looking for people to help us evaluate it. If you want more information, or would like to get involved, get in touch and we’ll let you know when we’re ready to go.
Update: In case you missed it at the top, Myna is our content optimisation system based on bandit algorithms and we’re accepting beta users right now!
The second paper we looked at in UU is Amazon’s 2007 paper on Dynamo. Dynamo is an example of a new type of database dubbed NoSQL, and Riak is an open-source implementation of the Dynamo architecture. Studying Dynamo is worthwhile for a number of reasons:
In the old days everyone used relational databases and it was good. Then along came the web, and with the web a tidal wave of data, and things were not good. The tradeoffs made by relational databases (maintaining the famous ACID properties) made them unsuitable for tasks where response time and availability were paramount. This is the case for many web applications. For example, it doesn’t really matter if my Facebook status updates aren’t immediately visible to all my friends, but it does matter if my browser hangs for a minute while the back-end tries to get a write lock on the status table.
NoSQL databases make a different set of tradeoffs, and achieve different performance characteristics as a result. Typically, NoSQL databases focus on scalability, fast response times, and availability, and give up atomicity and consistency. This tradeoff is formalised via the CAP Theorem, which states that a distributed system cannot provide consistency, availability, and partition tolerance all at the same time (although two out of three of these properties are achievable at once). Dynamo provides availability and partition tolerance at the expense of consistency. Other NoSQL databases may make different tradeoffs. SQL databases typically provide consistency and availability at the expense of partition tolerance.
The Dynamo paper can be difficult to read. The main issue we had is that the authors don’t always motivate the different components of the system. For example, consistent hashing is one of the earlier concepts introduced in the paper, but it is difficult to see why it is used and how it contributes to increased availability until later on. It is best to approach each section of the article as a self-contained idea, and wait until the end to see how they are combined. It took us two sessions to get through the paper, so don’t be surprised if you find it slow going.
The paper starts by laying out the properties required of Dynamo. We’ve talked about the tradeoff between consistency, availability, and partition tolerance above. Some of the other properties are:
Dynamo is the fusion of a lot of ideas that have been developed in the field of distributed systems. Rather than duplicate the paper, I want to discuss four points that I found interesting:
If you take one point from Dynamo, let it be the usefulness of consistent hashing. The basic idea of consistent hashing is to decouple the value of a key from the machine it is stored on. If you do this you can add and remove machines from your data store without breaking anything. If you don’t, you’re in a world of pain.
Consistent hashing is best explained via an example of doing it wrong. Say you have N machines serving as your data store. Given a key, you want to work out which machine stores the data. A simple way to do so (which is what Reddit did) is to calculate key mod N. Now suppose that due to increased load you want to add a machine to your data store. Now key mod (N+1) won’t give the same result, so you can’t find your data any more. To fix this you have to flush out the data and reinsert it, which will take a long time. Or you can use consistent hashing from the outset.
In consistent hashing you arrange the space of hash keys into a ring. Each server inserts a token into the ring, and is responsible for keys that lie in the range from its token to the nearest preceding token. This is illustrated in the image to the left. The small circles indicate the tokens, and the colours the segments of the hash ring allocated to each server.
Adding a new server only requires coordination with the server that previously occupied that part of the hash space. In the original consistent hashing paper tokens were inserted at random. For Dynamo it was found that a more structured system worked better. I’ll leave the details of this and other issues (in particular, routing and replication) to the paper.
Although it isn’t part of the main thrust of the paper, I found it interesting that Amazon measure response time and other variables at the 99.9th percentile. Amazon have a very good reputation, and for other companies looking to achieve the same stature it is good to know the goal to aim for.
I’ve recently implemented feedback control (in particular, proportional error control) for a database connection pool. (I’ll blog about this in a bit.) It’s interesting that Dynamo uses a similar method to balance tasks within each node (Section 6.5). I think we’re going to see more self-regulating systems in the future. The work at RADLab is a good example of what might make it into production in a few years.
By scheduling tasks itself Dynamo is performing a task typically handled by the operating system. I think in the future this will be more commonplace, with the distinction between operating system and application program becoming increasingly blurred. The Managed Runtime Initiative is one project that aims to do this.
We’ve recently started a reading group at Untyped. As consultants we need to maintain our expertise, so every Friday we tackle something new for a few hours. Given our love of Universities (we average three degrees per Untypist) and our even greater love of grandiose corporate training (hello, Hamburger University!) we have named this program Untyped University.
Broadly, we’re covering the business of the web and the business of building the web. The online business is, from certain angles, quite simple. The vast majority of businesses can be viewed as a big pipeline, sucking in visitors from the Internet-at-large, presenting some message to the user, and then hoping they click “Buy”. At each stage of the pipeline people drop out. They drop out right at the beginning if the site isn’t ranked high enough on search terms or has poorly targeted ads. They abandon the website if the design is wrong, or the site is slow, or the offer isn’t targeted correctly. Each step of this pipeline has tools and techniques that can be used to retain users, which we’ll be covering. The flipside of this is the pipeline that delivers the site, starting with data stores, going through application servers, and finishing at the browser or other client interface. Here we’ll be looking at the technologies and patterns for building great sites.
So far we’ve run a couple of sessions. The first covered bandit algorithms, and the second Amazon’s Dynamo. We’ll blog about these soon. We’ve started a Mendeley group to store our reading (though not everything we cover in future will be in published form). Do join in if it takes your fancy!
Being in the office more frequently has given me more opportunity to work in a team; for better or worse (and mostly worse, in my opinion) the PhD students in my lab tended to work alone. An interesting example of the benefits of teamwork arose a few days ago. We are in the process of shifting Kahu from its current colo server to a VM. As part of this move we want to benchmark the two servers, to make sure the new VM isn’t going to give us an unexpected performance hit. To do the benchmarking we needed to log more performance statistics, and adding logging to our code was the task Dave and I were working on. I started with a grand plan: we would write three libraries, one for logging, one for benchmarking, and one for logging benchmark data. I remember feeling frustrated as we discussed it, because the plan was clear in my mind and I thought I could easily get on and code it alone. However, as we talked more it became apparent that the existing tools we had were good enough for most of what we wanted. In the end we wrote a single macro and were done. From three libraries down to four lines of code is a pretty good reduction in effort.