How Google does what it does

This post is inspired by a friend of mine, who very late one night recently made a valiant attempt (given the circumstances) to explain to me how Google's website-ranking system works.  I was surprised to hear that at its heart it is a simple – although quite clever – application of linear algebra.  A couple of days later, in one of those funny occurrences of suddenly encountering the same concept multiple times after having spent a lifetime never hearing about it, I actually came across a very similar piece of theory in my research, and read a bit more about it.  So I thought I’d explain it for the interested among you.  However, I am swiftly learning that attempting to write an entire introduction to a mathematical subject in one blog post is a foolish thing to do!  Usually what seems to happen is that I spend the entire post writing about some fundamental aspect of the subject, and then give up and explain the rest very quickly and inadequately.  So I’m afraid I’m going to have to assume you know basic linear algebra…if not, then you might want to stop reading now.

Now, other than indexing websites and providing a portal through which to access them, clearly the most crucial aspect of a search engine is the ordering system it uses to list the sites.  There needs to be some way of assigning an “importance score” to each webpage, such that the ones which people are most likely to want to view come first.  Arguably the sole reason for Google’s success is a very effective method of doing this, invented by Larry Page while he was at university and conveniently called PageRank.  The system uses the links to a page to determine its score, and crucially, it measures not just the number of these links but their “quality”; that is, it assigns higher importance to links coming from pages which themselves have a high score.

PageRank is a modified version of the notion of centrality in a network, so I shall first explain this concept.  A network is just another name for a graph, which I have talked about in the past: graphs just consist of a bunch of nodes, connected by lines.  Because of this simple formulation, there are various ways in which they can be encoded, other than by the standard one of a pretty picture.  One of these is as an adjacency matrix: label the nodes 1 to n, then put a 1 in the (i,j) position if node i is joined to node j, and a 0 otherwise.  For example, the 4-cycle (four nodes arranged in a square, each joined to its two neighbours) has the following adjacency matrix:

A=\left( \begin{array}{cccc} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ \end{array} \right)

This is very useful, as to a certain extent it reduces problems about networks – which can be very complicated, and are still the object of a great deal of research – to linear algebra, which is very well understood (in fact it is one of the few subjects in mathematics which could be called complete in any way).
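To make this concrete, here is the 4-cycle’s adjacency matrix written out in code – a minimal sketch in Python, using numpy (my own illustration, with nothing official about it):

    import numpy as np

    # Adjacency matrix of the 4-cycle: entry (i, j) is 1 exactly when
    # node i is joined to node j (numpy indexes from 0, so row 0
    # corresponds to node 1, and so on).
    A = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])

    # The links have no direction, so the matrix is symmetric.
    assert (A == A.T).all()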

What we want is to find a way to assign a relative importance (centrality) to each node, which takes into account not only the number of links to it, but the centrality of each of the nodes it is linked to.  Linear algebra simplifies this problem considerably.  As a reminder, an eigenvalue and corresponding eigenvector of a matrix A are, respectively, a constant value \lambda and a vector \mathbf{v} such that:

A\mathbf v = \lambda \mathbf v
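Numerically, any linear algebra package will produce eigenvalue/eigenvector pairs, and we can check this defining property on our adjacency matrix.  A rough sketch, again in Python with numpy:

    import numpy as np

    A = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])

    # eig returns all eigenvalue/eigenvector pairs at once; for the
    # 4-cycle the eigenvalues are 2, -2 and 0 (twice), in some order.
    eigenvalues, eigenvectors = np.linalg.eig(A)

    # Check A v = lambda v for each pair (eigenvectors are the columns).
    for lam, v in zip(eigenvalues, eigenvectors.T):
        assert np.allclose(A @ v, lam * v)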

The prefix eigen- comes from German (as, it seems, does much mathematical terminology!), and means “own”, or “innate”.  This is slightly misleading (different matrices can have the same eigenvalues), but it reflects the fact that the eigenvalues and eigenvectors contain a lot of information about a matrix, as does the fact that the eigenvalues of a matrix are often referred to as its spectrum.  Geometrically, if we think of a matrix as representing some linear transformation (a rotation, shifting, or stretching of space in some way), then an eigenvector is a vector whose direction is fixed by the action of the transformation.  The corresponding eigenvalue is the factor by which the magnitude of the vector is increased or decreased.

But that isn’t really relevant to us at present, as we are not considering the matrix to be a transformation, but simply an encoding of the network’s structure.  So what significance do eigen-things have here?  Well, let’s carry on using the example of a 4-cycle.   An eigenvector in this case will be of the form:

\bf v =\left( \begin{array}{c} v_1 \\ v_2 \\ v_3 \\ v_4 \\ \end{array} \right)

So, putting this and the adjacency matrix A into the above formula, we get:

\left( \begin{array}{c} v_2 + v_4 \\ v_1 + v_3 \\ v_2 + v_4 \\ v_1 + v_3 \\ \end{array} \right)=\left( \begin{array}{c} \lambda v_1 \\ \lambda v_2 \\ \lambda v_3 \\ \lambda v_4 \\ \end{array} \right) ,

which simply represents the connections in the network.  Let the value v_i represent the centrality of node i.  Then, for example, the line:

v_2+v_4=\lambda v_1

tells us that the centrality of node 1 is proportional to the sum of the centralities of the two nodes connected to it.

(At this point you might be wondering which eigenvalue we are using.  Luckily we don’t need to worry about this!  The Perron-Frobenius theorem says, among other things, that if a matrix A has only positive entries (or, more generally, has non-negative entries and encodes a connected network – the version we need, since adjacency matrices contain zeros), then it has a unique largest eigenvalue, for which there is a corresponding eigenvector which itself has only positive entries.  This also does away with the need to speculate as to what a negative centrality score might signify.)
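In code, then, computing centrality scores just means extracting the eigenvector belonging to that largest eigenvalue.  A sketch of how one might do it for the 4-cycle (Python/numpy again, my own illustration):

    import numpy as np

    A = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])

    # eigh is the routine for symmetric matrices; it returns the
    # eigenvalues in ascending order, so the Perron-Frobenius
    # eigenvalue is the last one.
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    v = eigenvectors[:, -1]
    v = v / v.sum()         # normalise so the centralities sum to 1

    print(eigenvalues[-1])  # 2.0
    print(v)                # [0.25 0.25 0.25 0.25]: by symmetry, every
                            # node of the 4-cycle is equally central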

Now, back to Google.  Instead of nodes, imagine webpages.  Instead of lines, imagine links between webpages.  Then the centrality score of a page is a very crude version of the situation we wanted: each webpage has a centrality which is proportional to the sum of the centralities of the webpages it is linked to.  So not only are we taking into account how many links a website has, but also how significant each of these links is.

Of course, PageRank is a bit more complicated than this.  For a start, it matters which way the links are pointing, so our picture of the internet now needs arrows on the lines, showing which page links to which.  Technically, what PageRank measures is the probability that a person randomly clicking on links will end up at that page after an infinity of time has passed.  This sounds rather complicated, but it isn’t really.  It is just the result of a “random walk” through the internet; the PageRank algorithm simulates this random walk.

It works like this: start off by assigning an equal value to each page (to simplify things I will just use 1).  We then divide each page’s value equally among its outgoing links.  What will happen when we simulate the process is that each page will confer its value on those that it is linked to, and with enough iterations we will reach a steady state.  So instead of a 1 if two websites are linked, we put the fraction 1/n in the (i,j) position if website j links to website i, where n is the number of links coming out of website j.  For our example, suppose that page 1 links to page 2, page 2 links to page 3, page 3 links to page 1, and page 4 links to pages 1 and 3.  In this case we get the following:

A=\left( \begin{array}{cccc} 0 & 0 & 1 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1/2 \\ 0 & 0 & 0  & 0 \\ \end{array} \right)

Note that the entries in the fourth column are 1/2, because there are two links coming out of website 4; the more links a page has on it, the less important each of them is considered to be.  We then give each page an equal starting PageRank, which appears in the corresponding position in the (not yet eigen-) vector.  Again, in this case I will use 1, so we have:

\bf v= \left( \begin{array}{c} 1 \\ 1 \\ 1 \\ 1 \\ \end{array} \right)
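In code, building this matrix from the list of links, and giving every page a starting value of 1, might look as follows – a sketch assuming the link structure described above:

    import numpy as np

    # Toy web: page 1 links to 2, page 2 links to 3, page 3 links to 1,
    # and page 4 links to both 1 and 3.
    links = {1: [2], 2: [3], 3: [1], 4: [1, 3]}

    n = 4
    A = np.zeros((n, n))
    for j, targets in links.items():
        for i in targets:
            # Page j shares its value equally among its outgoing links.
            A[i - 1, j - 1] = 1 / len(targets)

    v = np.ones(n)   # equal starting PageRank of 1 for every page
    print(A @ v)     # [1.5 1. 1.5 0.]: the first iteration, as below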

The relevant eigenvalue here turns out to be 1 (see below), so to simulate the random surfing action, we just multiply the matrix by the vector, over and over.  After the first iteration we have the vector:

\left( \begin{array}{c} 3/2 \\ 1 \\ 3/2 \\ 0 \\ \end{array} \right) ,

which already gives us a good idea of how relatively “important” the pages are: page 4 has no incoming links, so it has a PageRank of 0, and pages 1 and 3 have a higher score than page 2 as they have more incoming links.  If I were doing this properly, then I would continue to multiply by the same matrix, and after enough iterations we would have an eigenvector \bf v satisfying:

A\mathbf v = \mathbf v

which would be our desired PageRank values.
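For a toy example this size we can simply ask a linear algebra package for that vector directly.  A quick numerical check (my own sketch, in Python/numpy):

    import numpy as np

    A = np.array([[0, 0, 1, 0.5],
                  [1, 0, 0, 0  ],
                  [0, 1, 0, 0.5],
                  [0, 0, 0, 0  ]])

    # Some of this matrix's eigenvalues are complex, so pick out the
    # one equal to 1 and take the real part of its eigenvector.
    eigenvalues, eigenvectors = np.linalg.eig(A)
    k = np.argmin(np.abs(eigenvalues - 1))
    v = np.real(eigenvectors[:, k])

    print(v / v.sum())   # [1/3, 1/3, 1/3, 0]

Notice that pages 1 to 3 come out exactly tied, while page 4, with no incoming links, keeps a score of 0.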

However, I’m not doing it properly!  In reality, a “damping factor” is included; this simulates people getting bored of clicking on links and starting off at another, unrelated page (randomly clicking is fairly boring after all), and it ensures that the system actually reaches a steady state (without it, the simple iteration above would never converge for our example, just cycle endlessly around the loop 1 → 2 → 3).  With the damping included, the matrix entries are genuine transition probabilities, and the matrix is column-stochastic: each of its columns sums to 1.  Stochastic matrices happen to have the nice property that the largest eigenvalue assured by the Perron-Frobenius theorem is always 1, and this is why we don’t need a \lambda in the above equation.
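To give a flavour of what the damped version looks like, here is a minimal sketch of the iteration.  The damping factor of 0.85 is the value usually quoted in descriptions of PageRank; the code itself is my own illustration, not Google’s algorithm:

    import numpy as np

    A = np.array([[0, 0, 1, 0.5],
                  [1, 0, 0, 0  ],
                  [0, 1, 0, 0.5],
                  [0, 0, 0, 0  ]])
    n = A.shape[0]
    d = 0.85            # probability of following a link, rather than
                        # jumping to a random page out of boredom
    v = np.ones(n) / n  # start with equal rank on every page

    for _ in range(100):
        v = d * (A @ v) + (1 - d) / n

    print(v)   # roughly [0.326, 0.315, 0.321, 0.038]: page 1 narrowly
               # tops the list, and page 4 sits firmly at the bottom

The damping also breaks the three-way tie we saw above: the random jumps feed a little rank into page 4, which passes it on to pages 1 and 3, nudging them just ahead of page 2.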

There are, of course, various problems with this system.  To name but three: it is based on the assumption that people surf the internet randomly, which of course they don’t; it is unfairly biased against new websites; and it is theoretically possible to artificially increase a site’s rankings with devious linkage.  But it is currently the best we’ve got (or so Google’s world dominance seems to suggest).

Interestingly, it has been suggested that a PageRank-like system should be implemented to measure the importance – or “impact factor” – of academic journals.  Let I be the impact of a journal in a given year, let N be the number of times that articles published in the journal in the previous two years were cited elsewhere, and let T be the total number of articles published by the journal in the previous two years.  The current system uses the simple equation:

I=N/T,

which is clearly rather crude, and currently the subject of much academic grumbling.  Could JournalRank do a better job?
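To put illustrative numbers to it: under the current system, a journal whose articles from the previous two years were cited 300 times (N = 300), and which published 150 articles in that period (T = 150), scores I = 300/150 = 2; and that 2 counts a citation from an obscure newsletter exactly the same as one from Nature.  A JournalRank, like PageRank, would weight each citation by the standing of the journal it came from.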


4 Responses to How Google does what it does

  1. Hi Adam,

    Excellent stuff once again. Apparently Google tweaks the algorithm every couple of weeks to try to maintain the value of the results, which keeps the search engine ranking optimisation specialists vigilant about best practices.

    Personally I have noticed that Google is not as good at finding answers to questions as it used to be.

    On top of the PageRank model which you describe here, other models are applied to influence the values, based on some of the issues you mention.

    Three key factors add weight to the score:
    1: the actual text in the link description (“click here” is bad; ideally it will match a search phrase)
    2: the actual name of the page being requested
    3: the domain name

    Duplicate content and, interestingly from the perspective of your model, the quality of the links away from the page are also key factors.

    Google wants the page to be part of the journey of discovery to quality content, in case it’s not the actual destination.

    Funnily enough, I was wondering about discussing the mathematical modelling of some networking concepts with you, with reference to a social network project I am working on (www.CloudClever.com). Perhaps I can suggest this for an upcoming article?

    • Adam Bohn says:

      Hi Dan,

      Very interesting…as I read about it I became more and more aware of how complicated it actually was, and almost gave up trying to explain it! But the rough gist is there. Funny that they count outgoing links…that would seem to suggest that you could boost your rankings just by including lots of links? I’m sure they’ve thought of that anyway.

      That looks like an intriguing project you’ve got on the go. I’ll keep in mind the post idea, although they’re rather few and far between these days. Or how about this for a deal: I’ll write about social networking, and you tell me how to optimise my coffee website!? Those damn Adwords are bleeding me dry…

  2. Grrr, Adwords! Certainly not for the uninitiated! Yes, let me help you: send me your account and I will take a quick look to see if I can spot anything obvious.

    Did you know that Google now uses a form of PageRank on the Adwords pages too? This caused outcry amongst advertisers when it was applied, since the last thing they want to do is add links to other sites for traffic which they have paid for…

    In terms of adding outgoing links to your site, the general rule is to have more coming in than going out, and I would say try to link to pages with an equivalent or higher PageRank than your own.

    In terms of the maths behind the social network I am building, we should discuss this privately for now.

  3. Teylor says:

    Thank you. Nice article…
