How google does what it does

October 29, 2010

This post is inspired by a friend of mine, who very late one night recently made a valiant attempt (given the circumstances) to explain to me how the google website-ranking system works.  I was surprised to hear that at its heart it is a simple – although quite clever – application of linear algebra.  A couple of days later, in one of those funny occurrences of suddenly encountering the same concept/thing multiple times after having spent a lifetime never hearing about it, I actually came across a very similar piece of theory in my research, and read a bit more about it.  So I thought I’d explain it for the interested among you.  However, I am swiftly learning that attempting to write an entire introduction to a mathematical subject in one blog post is a foolish thing to do!  Usually what seems to happen is that I spend the entire post writing about some fundamental aspect of the subject, and then give up and explain the rest very quickly and inadequately.  So I’m afraid I’m going to have to assume you know basic linear algebra…if not, then you might want to stop reading now.

Now, other than indexing websites and providing a portal through which to access them, clearly the most crucial aspect of a search engine is the ordering system it uses to list the sites.  There needs to be some way of assigning an “importance score” to each webpage, such that the ones which people are most likely to want view come first.  Arguably the sole reason google are as successful as they are is a very effective method of doing this invented by Larry Page while he was at university, conveniently called PageRank.  The system uses the links to a page to determine its score, and crucially, it measures not just the number of these links but their “quality”;  that is, it assigns higher importance to links coming from pages which themselves have a high score.

In his Guardian column last Saturday, after having unleashed the full extent of his fury at some poor unsuspecting tabloid for getting a statistic slightly wrong (don’t get me wrong, the media needs more people like him) Ben Goldacre mentions in passing that if there are at least 23 people in a room, the probability that 2 of them will have the same birthday is over 50%.  This is known as “the birthday paradox”, and while not technically being an actual paradox, it is highly counterintuitive, as probabilistic results often tend to be.  The counterintuitivity comes from the fact that people tend to assume the question is: “if I am in a room with some people, what is the probability of someone having the same birthday as me?” If there are 22 other people, then this gives only 22 possibilities.  But if we don’t specify the actual birth-date, the number of pairs of people, and hence the number of possible birthday-matches, becomes$23\choose 2$(the number of ways of choosing 2 things from 23 things), which is 253.  It is quite easy to believe that there is some likelihood of one of these pairs of people having the same birthday.