All models are wrong (or: lies, damned lies and statistics)*

November 10, 2010

The statistician is seen with a certain amount of disdain (or possibly sympathy) by their pure mathematical brethren. And it is with that firmly in mind that I (as a fledgling statistician) take the reins of this worthy blog.

We have some idea of what mathematics is from Adam’s posts; but what is statistics?  Statistics is applied maths with uncertainty. In statistics mathematical techniques are used to model and quantify our uncertainty about reality. Modelling climate change, predicting the outcome of elections, wrecking the financial system and ensuring the casino always wins: statistics is everywhere. And uncertainty is the key to statistics.

In order to get across an understanding of what uncertainty is I will try to describe some of the different kinds we face and how statistics deals with them.  The five levels in the following taxonomy lie on a continuum running from complete certainty to complete uncertainty, and provide a means of measuring the range and limitations of statistics in different situations.** The further we go along this continuum the less effective statistics is at prediction and inference, and many problems in statistics and quantitative social sciences like economics come from not recognising just how far along the continuum we are.

How google does what it does

October 29, 2010

This post is inspired by a friend of mine, who very late one night recently made a valiant attempt (given the circumstances) to explain to me how the google website-ranking system works.  I was surprised to hear that at its heart it is a simple – although quite clever – application of linear algebra.  A couple of days later, in one of those funny occurrences of suddenly encountering the same concept/thing multiple times after having spent a lifetime never hearing about it, I actually came across a very similar piece of theory in my research, and read a bit more about it.  So I thought I’d explain it for the interested among you.  However, I am swiftly learning that attempting to write an entire introduction to a mathematical subject in one blog post is a foolish thing to do!  Usually what seems to happen is that I spend the entire post writing about some fundamental aspect of the subject, and then give up and explain the rest very quickly and inadequately.  So I’m afraid I’m going to have to assume you know basic linear algebra…if not, then you might want to stop reading now.

Now, other than indexing websites and providing a portal through which to access them, clearly the most crucial aspect of a search engine is the ordering system it uses to list the sites.  There needs to be some way of assigning an “importance score” to each webpage, such that the ones which people are most likely to want view come first.  Arguably the sole reason google are as successful as they are is a very effective method of doing this invented by Larry Page while he was at university, conveniently called PageRank.  The system uses the links to a page to determine its score, and crucially, it measures not just the number of these links but their “quality”;  that is, it assigns higher importance to links coming from pages which themselves have a high score.

A bunch of people, in a room

February 16, 2010

Partly in a bid to keep the interest of the small band of readers I appear to have gained since my last post, and partly out of sheer laziness, I am again going to dispense with serious mathematics this week, and instead discuss what interesting things can be said about: some people, in a room.

In his Guardian column last Saturday, after having unleashed the full extent of his fury at some poor unsuspecting tabloid for getting a statistic slightly wrong (don’t get me wrong, the media needs more people like him) Ben Goldacre mentions in passing that if there are at least 23 people in a room, the probability that 2 of them will have the same birthday is over 50%.  This is known as “the birthday paradox”, and while not technically being an actual paradox, it is highly counterintuitive, as probabilistic results often tend to be.  The counterintuitivity comes from the fact that people tend to assume the question is: “if I am in a room with some people, what is the probability of someone having the same birthday as me?” If there are 22 other people, then this gives only 22 possibilities.  But if we don’t specify the actual birth-date, the number of pairs of people, and hence the number of possible birthday-matches, becomes$23\choose 2$(the number of ways of choosing 2 things from 23 things), which is 253.  It is quite easy to believe that there is some likelihood of one of these pairs of people having the same birthday.