This post is inspired by a friend of mine, who very late one night recently made a valiant attempt (given the circumstances) to explain to me how the google website-ranking system works. I was surprised to hear that at its heart it is a simple – although quite clever – application of linear algebra. A couple of days later, in one of those funny occurrences of suddenly encountering the same concept/thing multiple times after having spent a lifetime never hearing about it, I actually came across a very similar piece of theory in my research, and read a bit more about it. So I thought I’d explain it for the interested among you. However, I am swiftly learning that attempting to write an entire introduction to a mathematical subject in one blog post is a foolish thing to do! Usually what seems to happen is that I spend the entire post writing about some fundamental aspect of the subject, and then give up and explain the rest very quickly and inadequately. So I’m afraid I’m going to have to assume you know basic linear algebra…if not, then you might want to stop reading now.
Now, other than indexing websites and providing a portal through which to access them, clearly the most crucial aspect of a search engine is the ordering system it uses to list the sites. There needs to be some way of assigning an “importance score” to each webpage, such that the ones which people are most likely to want to view come first. Arguably the sole reason google are as successful as they are is a very effective method of doing this, invented by Larry Page while he was at university and conveniently called PageRank. The system uses the links to a page to determine its score, and crucially, it measures not just the number of these links but their “quality”; that is, it assigns higher importance to links coming from pages which themselves have a high score.
PageRank is a modified version of the notion of centrality in a network, so I shall first explain this concept. A network is just another name for a graph, which I have talked about in the past: they just consist of a bunch of nodes, connected by lines. Because of this simple formulation, there are various ways in which they can be encoded, other than by the standard one of a pretty picture. One of these is as an adjacency matrix: label the nodes 1 to $n$, then put a 1 in the $(i, j)$ position if node $i$ is joined to node $j$, and a 0 otherwise. For example, the 4-cycle network (nodes 1, 2, 3 and 4 joined in a square, each to its two neighbours) has the following adjacency matrix:

$$A = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$
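To make this concrete, here is that same matrix built and sanity-checked in a few lines of Python (the language choice and variable names are mine, purely for illustration):

```python
# Adjacency matrix of the 4-cycle: each node is joined to its two
# neighbours around the square (1-2, 2-3, 3-4, 4-1).
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

# Undirected links go both ways, so the matrix is symmetric...
symmetric = all(A[i][j] == A[j][i] for i in range(4) for j in range(4))

# ...and every node has exactly two neighbours, so each row sums to 2.
degrees = [sum(row) for row in A]

print(symmetric, degrees)
```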
This is very useful, as to a certain extent it reduces problems about networks – which can be very complicated, and are still the object of a great deal of research – to linear algebra, which is very well understood (in fact it is one of the few subjects in mathematics which could be called complete in any way).
What we want is to find a way to assign a relative importance (centrality) to each node, which takes into account not only the number of links to it, but the centrality of each of the nodes it is linked to. Linear algebra simplifies this problem considerably. As a reminder, an eigenvalue and corresponding eigenvector of a matrix $A$ are, respectively, a constant value $\lambda$ and a vector $\mathbf{x}$ such that:

$$A\mathbf{x} = \lambda\mathbf{x}$$
The prefix eigen- comes from German (as, it seems, does most mathematical notation!), and means “own” or “innate”. This is slightly misleading (different matrices can have the same eigenvalues), but it reflects the fact that the eigenvalues/vectors contain a lot of information about a matrix, as does the fact that the set of eigenvalues of a matrix is often referred to as its spectrum. Geometrically, if we think of a matrix as representing some linear transformation (a rotation, shifting, or stretching of space in some way), then an eigenvector is a vector whose direction is fixed by the action of the transformation. The corresponding eigenvalue is the factor by which the magnitude of the vector is increased or decreased.
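As a tiny illustration of that geometric picture (a toy example of my own, not part of the network story): a matrix that stretches the x-direction fixes the direction of any vector along that axis.

```python
# This matrix stretches the x-direction by a factor of 3 and leaves the
# y-direction alone, so (1, 0) is an eigenvector with eigenvalue 3.
A = [[3, 0],
     [0, 1]]
x = [1, 0]

# Multiply A by x by hand: the result points the same way, 3 times longer.
Ax = [A[0][0] * x[0] + A[0][1] * x[1],
      A[1][0] * x[0] + A[1][1] * x[1]]

print(Ax)  # [3, 0], i.e. 3 times x
```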
But that isn’t really relevant to us at present, as we are not considering the matrix to be a transformation, but simply an encoding of the network’s structure. So what significance do eigen-things have here? Well, let’s carry on using the example of a 4-cycle. An eigenvector in this case will be of the form:

$$\mathbf{x} = (x_1, x_2, x_3, x_4)^T$$
So, putting this and the adjacency matrix $A$ into the above formula, we get:

$$\begin{aligned} \lambda x_1 &= x_2 + x_4 \\ \lambda x_2 &= x_1 + x_3 \\ \lambda x_3 &= x_2 + x_4 \\ \lambda x_4 &= x_1 + x_3 \end{aligned}$$
which simply represents the connections in the network. Let the value $x_i$ represent the centrality of node $i$. Then, for example, the line:

$$\lambda x_1 = x_2 + x_4$$
tells us that the centrality of node 1 is proportional to the sum of the centralities of the two nodes connected to it.
(At this point you might be wondering which eigenvalue we are using. Luckily we don’t need to worry about this! The Perron–Frobenius theorem says, among other things, that if a matrix $A$ has only positive entries (a version also holds for non-negative matrices like ours, provided the network is connected), then it has a unique largest eigenvalue, for which there is a corresponding eigenvector which itself has only positive entries. This also does away with the need to speculate as to what a negative centrality score might signify.)
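If you want to see this eigenvector in practice, numpy will happily produce it for the 4-cycle (a sketch assuming numpy is available; by symmetry every node ends up with the same centrality):

```python
import numpy as np

# Adjacency matrix of the 4-cycle again.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

vals, vecs = np.linalg.eigh(A)        # A is symmetric, so eigh applies
leading = vecs[:, np.argmax(vals)]    # eigenvector of the largest eigenvalue
centrality = leading / leading.sum()  # normalise the scores to sum to 1

print(max(vals), centrality)  # largest eigenvalue 2; every node scores 1/4
```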
Now, back to google. Instead of nodes imagine webpages. Instead of lines, imagine links between webpages. Then the centrality score of a page is a very crude version of the situation we wanted: each webpage has a centrality which is proportional to the sum of that of each of the webpages it is linked to. So not only are we taking into account how many links a website has, but how significant each of these links are.
Of course, PageRank is a bit more complicated than this. For a start it matters which way the links are pointing. So let’s take a new picture of the internet – this time with arrows. Technically what PageRank measures is the probability that a person randomly clicking on links will end up at that page after an infinity of time has passed. This sounds rather complicated, but it isn’t really. It is just the result of a “random walk” through the internet; the PageRank algorithm simulates this random walk.
It works like this: start off by assigning an equal value to each page (to simplify things I will just use 1). We then divide each page’s value equally among its outgoing links. What will happen when we simulate the process is that each page will confer its value to those that it links to, and with enough iterations we will reach a steady state. So instead of a 1 if two websites are linked, we put the fraction $1/L_j$ in the $(i, j)$ position if website $j$ links to website $i$, where $L_j$ is the number of links coming out of website $j$. Taking as our example four sites, where site 1 links to sites 2 and 3, site 2 to sites 1 and 3, site 3 to site 1, and site 4 to sites 1 and 3, we get the following:

$$M = \begin{pmatrix} 0 & 1/2 & 1 & 1/2 \\ 1/2 & 0 & 0 & 0 \\ 1/2 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
Note that the entries in the fourth column are 1/2, because there are two links coming out of website 4; the more links a page has on it, the less important each of them is considered to be. We then give each page an equal starting PageRank, which appears in the corresponding position in the (not yet eigen-) vector. Again I will use 1, so we have:

$$\mathbf{v}_0 = (1, 1, 1, 1)^T$$
The eigenvalue in this case will eventually be 1 (see below), so to simulate the random surfing action, we just multiply these together. After the first iteration we have the vector:

$$\mathbf{v}_1 = M\mathbf{v}_0 = (2, \tfrac{1}{2}, \tfrac{3}{2}, 0)^T$$
which already gives us a good idea of how relatively “important” the pages are: page 4 has no incoming links, so it has a PageRank of 0, and pages 1 and 3 have a higher score than page 2 as they have more incoming links. If I were doing this properly, then I would continue to multiply by the same matrix, and after enough iterations we would have an eigenvector satisfying:

$$M\mathbf{v} = \mathbf{v}$$
which would be our desired PageRank values.
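Those remaining iterations are easy to hand to a computer; here is a minimal sketch using the example matrix, with names of my own choosing:

```python
# The link matrix from the example: M[i][j] = 1/L_j if site j links to
# site i, where L_j is the number of links out of site j.
M = [
    [0.0, 0.5, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]

def step(v):
    """One round of the random walk: multiply the vector by M."""
    return [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]

v = [1.0, 1.0, 1.0, 1.0]  # equal starting PageRank for every site
for _ in range(60):       # enough iterations to reach the steady state
    v = step(v)

# v is now (numerically) a fixed point: M v = v, our PageRank vector.
print([round(x, 4) for x in v])
```

Note that the ordering 1, 3, 2, 4 already visible after one iteration survives in the limit; the extra iterations only refine the scores.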
However, I’m not doing it properly! It would be too long and complicated to write here. In reality, a “damping factor” is included; this simulates people getting bored of clicking on links and starting afresh at another, unrelated page (randomly clicking is fairly boring, after all), and it ensures that the system reaches a steady state. Also, in reality the matrix has probabilities for entries. In fact it is a (column-)stochastic matrix, in which every column sums to 1. Stochastic matrices happen to have the nice property that the largest eigenvalue assured by the Perron–Frobenius theorem is always 1, and this is why we don’t need $\lambda$ in the above equation.
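A damped version of the walk can be sketched like so ($d = 0.85$ is the figure usually quoted for Google’s original setup; the matrix is the one from the example, and the rest of the names are mine):

```python
# With probability d the surfer follows a link; with probability 1 - d
# they get bored and jump to one of the n pages uniformly at random.
M = [
    [0.0, 0.5, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
d, n = 0.85, 4

v = [1.0 / n] * n  # start from the uniform distribution
for _ in range(100):
    v = [(1 - d) / n + d * sum(M[i][j] * v[j] for j in range(n))
         for i in range(n)]

# The scores now form a probability distribution, and even the link-less
# page 4 gets the small baseline score (1 - d) / n.
print([round(x, 4) for x in v])
```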
There are, of course, various problems with this system. To name but three: it is based on the assumption that people surf the internet randomly, which of course they don’t; it is unfairly biased against new websites; and it is theoretically possible to artificially increase a site’s ranking with devious linkage. But it is currently the best we’ve got (or so google’s world dominance seems to suggest).
Interestingly, it has been suggested that a PageRank-like system should be implemented to measure the importance – or “impact factor” – of academic journals. Let $I$ be the impact of a journal in a given year, let $C$ be the number of times that articles published in the journal in the previous two years were cited elsewhere during that year, and let $N$ be the total number of articles published by the journal in the previous two years. The current system uses the simple equation:

$$I = \frac{C}{N}$$
which is clearly rather crude, and currently the subject of much academic grumbling. Could JournalRank do a better job?
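For comparison, the current formula is simple enough to fit in a couple of lines (the numbers below are invented, purely to illustrate the arithmetic):

```python
# Impact factor I = C / N for a hypothetical journal.
citations = 420  # C: citations this year to articles from the previous two years
articles = 150   # N: articles published in the previous two years

impact_factor = citations / articles
print(impact_factor)  # 2.8
```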