Go to Yahoo and you’ll see a riot of news stories, links to services such as Yahoo Messenger, and right at the bottom of the page… a link to the service that launched the company, the Yahoo Directory. It is laughable now to think that a human-maintained directory could keep pace with powerful search engines and an exponentially growing web, but for a long time Yahoo’s directory was the web’s best way to find information.
Early search engines considered only the text of a web page when searching. They could find all the pages that contained the words you were looking for, but they could not tell you which ones were important, leaving you to sift through the hundreds or thousands of results. In this situation going to Yahoo’s directory was a short-cut to finding the most important pages on any topic. It might take you six or seven clicks to drill down through the hierarchy but you’d get there in the end.
Then along came Google. Enter some keywords and Google will give you all the pages that match those keywords, ordered by importance. In most cases, Google is sufficient. The key innovation in Google is that it recognises that people indicate the pages they consider important by linking to them. Google uses that information to rank pages via the PageRank algorithm. With millions of web pages, individual variations don’t count for much, and what you end up with is the pages generally considered best.
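To make the link-counting idea concrete, here is a minimal sketch of the textbook PageRank iteration on a made-up four-page web. The page names, the `toy_web` dictionary, the damping factor of 0.85 and the fixed iteration count are all assumptions for illustration; this is certainly not how Google runs it in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page "c" comes out on top because three pages link to it, and "a" beats "b" and "d" because its single incoming link comes from a highly ranked page. Nobody was asked to rate anything.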
There is a big idea behind the PageRank algorithm, which is that people implicitly tell you what is important to them. The Yahoo model was to explicitly ask people what is important to them. This doesn’t scale. Implicit measures scale better, and in fact perform better as size grows, as noise in the measurements cancels out.
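The claim that noise cancels out as size grows can be illustrated with a small simulation. This is only a sketch under a strong assumption: each implicit signal is the true importance plus independent, unbiased Gaussian noise, which real link data will not satisfy exactly. The function name `estimate_spread` and all the parameters are invented for the example.

```python
import random
import statistics


def estimate_spread(n_signals, trials=200, noise=1.0, seed=0):
    """How much does the averaged estimate wobble when built from n_signals?

    Each signal is the (hypothetical) true importance plus Gaussian noise.
    Returns the standard deviation of the averaged estimate across trials.
    """
    rng = random.Random(seed)
    true_importance = 0.0
    estimates = []
    for _ in range(trials):
        signals = [true_importance + rng.gauss(0.0, noise) for _ in range(n_signals)]
        estimates.append(statistics.fmean(signals))
    return statistics.stdev(estimates)


for n in (10, 100, 1000):
    print(n, round(estimate_spread(n), 3))
```

Under these assumptions the spread shrinks roughly with the square root of the number of signals, so a hundred times more signals gives an estimate about ten times steadier.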
Fast-forward to today. Among the hottest things on the web at the moment are services like del.icio.us that ask people to tag pages or assign ratings to them. This is another example of explicit feedback. You don’t have to look long at the most popular tags on del.icio.us to see that the demographics are skewed heavily towards the geek end. Now explicit feedback is great if you can get it, but while you can expect obsessive-compulsive geeks like myself to meticulously organise their links, you can’t expect the man-on-the-street to do the same. What is needed is implicit feedback.
For services like del.icio.us that aggregate pages on a common theme, there are a number of algorithms that will apply PageRank in a topic-sensitive manner (for example, the aptly named topic-sensitive PageRank). For services that maintain ratings of pages, there has been quite a lot of work on implicit ratings, collated in this excellent recent survey. Several studies show that reading time is correlated with interest. If you run a web site, this is one statistic you can easily gather, and one that the Attention.XML spec includes. If you write web browsers there are a bunch of other measurements you can gather, such as scrolling and bookmarking. Of course for most of us these measurements are out of reach!
Explicit measures have one advantage: they are much easier to get started with. If I rate a blog post as “good”, the meaning is unambiguous. If I spend 30 seconds reading it you need some sort of model to convert that into a rating, and to build that model you need a fair amount of data, knowledge of machine learning techniques (we can help), and probably some beefy hardware to handle large numbers of users. For small services this may be too much, but for large services it is the way forward.
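To give a feel for what "some sort of model" might look like, here is a deliberately tiny sketch: a one-feature logistic regression that maps reading time to the probability that a reader would have rated the post "good". Everything in it is an assumption for illustration: the training pairs are invented, the log transform of the reading time is just one plausible choice, and the names `fit_reading_time_model` and `implicit_rating` are hypothetical. A real model would need far more data and would have to deal with things like post length and tabs left open.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_reading_time_model(samples, learning_rate=0.5, epochs=5000):
    """Fit P(liked | seconds read) by plain gradient descent.

    `samples` is a list of (seconds_read, liked) pairs, with liked 0 or 1.
    Returns (centre, weight, bias) for use with implicit_rating().
    """
    features = [math.log1p(seconds) for seconds, _ in samples]
    centre = sum(features) / len(features)  # centring helps gradient descent
    weight = bias = 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, (_, liked) in zip(features, samples):
            error = sigmoid(weight * (x - centre) + bias) - liked
            grad_w += error * (x - centre)
            grad_b += error
        weight -= learning_rate * grad_w / len(samples)
        bias -= learning_rate * grad_b / len(samples)
    return centre, weight, bias


def implicit_rating(seconds, model):
    """Estimated probability that someone who read for `seconds` liked the post."""
    centre, weight, bias = model
    return sigmoid(weight * (math.log1p(seconds) - centre) + bias)


# Invented training data: (seconds spent reading, explicit "good" rating).
toy_samples = [
    (3, 0), (5, 0), (8, 0), (12, 0), (15, 1),
    (20, 0), (30, 1), (45, 1), (60, 1), (90, 1),
]

model = fit_reading_time_model(toy_samples)
print(round(implicit_rating(30, model), 2))
```

Note that the sketch still needs a handful of explicit ratings to train on. That is the usual bargain: a little explicit feedback from the geeks buys you an implicit rating for everybody else.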