Go to Yahoo and you’ll
see a riot of news stories, links to services such
as Yahoo Messenger, and right at the bottom of the page… a
link to the service that launched the company, the
Yahoo Directory. It is laughable now to think that a
human-maintained directory could keep pace with
powerful search engines and an exponentially growing web, but
for a long time Yahoo’s directory was the web’s best way
to find information.
Early search engines considered only the text of a
web page when searching. They could find all the pages
that contained the words you were looking for, but they
could not tell you which ones were important, leaving you
to sift through the hundreds or thousands of results. In
this situation going to Yahoo’s directory was a short-cut
to finding the most important pages on any topic. It
might take you six or seven clicks to drill down through
the hierarchy but you’d get there in the end.
Then along came Google. Enter some keywords and Google will give you all the
pages that match those keywords, ordered by importance. In
most cases, Google is sufficient. The key innovation in
Google is that it recognises that people indicate the pages
they consider important by linking to them. Google uses
that information to rank pages via the
PageRank algorithm. With millions of web pages, individual variations don’t
count for much, and what you end up with is the
pages generally considered best.
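
To make the link-voting idea concrete, here is a minimal sketch of a PageRank-style calculation. It is a simplified textbook version, not Google's actual implementation. The damping factor of 0.85, the fixed iteration count, and the tiny three-page link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by repeatedly passing each page's score along its links.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "a" and "b" both link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_links)
best_first = sorted(toy_ranks, key=toy_ranks.get, reverse=True)
```

In this toy graph "c" comes out on top because two pages link to it, "a" follows because the highly ranked "c" links to it, and "b" trails because nobody links to it at all.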
There is a big idea behind the PageRank algorithm,
which is that people implicitly tell you what is important
to them. The Yahoo model was to explicitly ask people what
is important to them. This doesn’t scale. Implicit
measures scale better, and in fact perform better as size
grows, as noise in the measurements cancels out.
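
The claim that noise cancels out as size grows can be illustrated with a small simulation. This is only a sketch under invented assumptions: each implicit signal is modelled as the page's true quality plus Gaussian noise, and the quality value, noise level, and sample sizes are made up for the example.

```python
import random


def average_error(true_quality, num_signals, noise=1.0, trials=20, seed=0):
    """Mean absolute error when estimating quality from noisy signals.

    Each signal is the true quality plus Gaussian noise; the estimate is
    the plain average of the signals. The error is averaged over several
    independent trials so that one lucky draw does not dominate.
    """
    rng = random.Random(seed)
    total_error = 0.0
    for _ in range(trials):
        signals = [true_quality + rng.gauss(0.0, noise) for _ in range(num_signals)]
        estimate = sum(signals) / num_signals
        total_error += abs(estimate - true_quality)
    return total_error / trials


# Hypothetical page with true quality 0.7, observed through very noisy signals.
error_with_few = average_error(0.7, num_signals=10)
error_with_many = average_error(0.7, num_signals=1000)
```

With only ten signals the estimate is typically far from the true value; with a thousand it is much closer, even though every individual signal is just as noisy as before.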
Fast-forward to today. Among the hottest things on
the web at the moment are services like del.icio.us
that ask people to tag pages or assign ratings to them. This
is another example of explicit feedback. You don’t have to
look long at the most popular tags on del.icio.us to see that the demographics are skewed heavily towards
the geek end. Now explicit feedback is great if you can
get it, but while you can expect obsessive-compulsive
geeks like myself to meticulously organise their links,
you can’t expect the man-on-the-street to do the same.
What is needed is implicit feedback.
For services like del.icio.us that aggregate pages on
a common theme, there are a number of algorithms that
will apply PageRank in a topic-sensitive manner (for
example, the aptly named topic-sensitive PageRank). For services that maintain ratings of pages, there has
been quite a lot of work on implicit ratings, collated in
this excellent recent survey. Several studies show that reading time is correlated
with interest. If you run a web site, this is one
statistic you can easily gather, and one that the
Attention.XML
spec includes. If you write web browsers there are a bunch of other
measurements you can gather, such as scrolling and bookmarking.
Of course for most of us these measurements are out of reach!
Explicit measures have one advantage: they are much easier
to get started with. If I rate a blog post
as “good”, the meaning is unambiguous. If I
spend 30 seconds reading it you need some sort of model to
convert that into a rating, and to build that model you
need a fair amount of data, knowledge of machine learning
techniques (we can help), and probably some beefy hardware
to handle large numbers of users. For small services this
may be too much, but for large services it is the
way forward.
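
As a rough illustration of what such a model might look like, here is a minimal sketch that learns to turn reading time into a probability of interest. It assumes you have a handful of readings where the user also gave an explicit "good" or "not good" rating, and it uses plain logistic regression fitted by gradient descent. The function names, the training data, and the choice of model are all hypothetical; a real service would need far more data and more features than reading time alone.

```python
import math


def fit_reading_time_model(samples, learning_rate=0.1, epochs=2000):
    """Fit p(liked) = sigmoid(w * minutes_read + b) by gradient descent.

    samples is a list of (seconds_read, liked) pairs, where liked is 1 if
    the reader explicitly rated the post as good and 0 otherwise.
    """
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for seconds, liked in samples:
            minutes = seconds / 60.0
            predicted = 1.0 / (1.0 + math.exp(-(w * minutes + b)))
            grad_w += (predicted - liked) * minutes
            grad_b += predicted - liked
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def predicted_interest(model, seconds):
    """Convert a reading time in seconds into an implicit rating in (0, 1)."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * seconds / 60.0 + b)))


# Invented training data: short visits were rated "not good", long ones "good".
training = [(5, 0), (10, 0), (15, 0), (20, 0),
            (60, 1), (90, 1), (120, 1), (180, 1)]
model = fit_reading_time_model(training)
quick_glance = predicted_interest(model, 8)
long_read = predicted_interest(model, 150)
```

Once fitted, the model gives a low score to an eight-second glance and a high score to a two-and-a-half-minute read, so reading time can stand in for a rating when the reader never gives one.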