Thursday, 1 December 2011

Correlation is not causation, and other easy statistics

The local free paper's property section had an article, a reprinted press release from a property website, which noted that odd-numbered houses are more valuable than even-numbered houses, and that lower numbers are more valuable than higher numbers. The article never explicitly claims that the difference in house prices is caused by the numbering - but on the other hand the existence of the article, and much of its wording, strongly implies that its authors would like their readers to believe this.

Anyway, it makes a good uncontentious example of "correlation is not causation" (and a few other statistical lessons), so I'm going to step through their claims in detail in case I need such an example later. Yes, clichés involving aquatic creatures and round wooden containers may come to mind.

So, the figures:

House type               Average price (£)
Odd number               207,202
Even number              206,664
Number 1                 229,411
Number 2                 222,273
Number 13                203,892
1 to 20, except 13       "near the top of the list"
No number - name only    "an average 90,000 more than numbered homes"

So, obviously - with one exception - the number of a house is not directly affecting its price. The reduction in price - by around 1-2% - for being the famously "unlucky" number 13 is understandable: many people won't buy a house numbered 13, so the number of potential buyers is lower, so the number of offers will be lower, which will (slightly) reduce the average price.

That's that one explained. What about the rest?

Firstly, note the fairly small difference between odd and even houses: around 500 pounds in 200,000 - or around 0.25%. They don't say how big the sample size was in total, but that's unlikely to be a statistically significant difference. The numbers have all been quoted to the ridiculous precision only available by taking a mean of a large sample (which lets us conclude which sort of average was in use, too) - both round to 207,000.
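To make the arithmetic concrete, here is a rough sketch of the odd/even comparison. The two means are the article's figures; the standard deviation and the sample sizes are pure assumptions of mine (the article gives neither), so the z values only show how much the answer depends on numbers we don't have.

```python
import math

# Figures quoted in the article.
odd_mean = 207_202
even_mean = 206_664

gap = odd_mean - even_mean          # absolute difference in pounds
relative_gap = gap / even_mean      # difference as a fraction of the even-number mean


def round_to_thousand(price):
    """Round a price to the nearest 1,000 pounds."""
    return round(price / 1000) * 1000


def z_score(gap, sd, n_per_group):
    """z statistic for a difference of two means, assuming both groups
    share the same standard deviation and the same sample size."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return gap / standard_error


# ASSUMPTION: a spread of 100,000 pounds between individual house prices.
# The article does not report any measure of spread; this is illustrative only.
ASSUMED_SD = 100_000

print(f"gap: {gap} pounds ({relative_gap:.2%})")
print("rounded:", round_to_thousand(odd_mean), round_to_thousand(even_mean))

# ASSUMPTION: hypothetical sample sizes per group.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,} per group -> z = {z_score(gap, ASSUMED_SD, n):.2f}")
```

Under that assumed spread, the gap only clears the conventional 1.96 threshold once each group has several hundred thousand houses in it, which is why the missing sample size matters.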

In so far as there is a real effect, the relatively large difference in price between numbers 1 and 2 (about 7,000 pounds) is likely to explain it.

So what about lower numbers being higher priced? This is a real effect, but it's not (of course) as simple as lower numbers being intrinsically more valuable. It's a combination of two factors, and their relation with the allocation algorithms for house numbers.

House numbering is most commonly either done sequentially with odd numbers on one side of the street and even on the other side, ascending in the same direction - or for less linear housing developments, sometimes as a single sequence of numbers along the street. Numbers are generally not skipped unless there's a "gap where a house could be".

So, as a consequence, on an average terraced street, house 1 (and often, but not as often, house 2) will be end-terrace houses, which are much more valuable. There will also be another set of end-terrace houses at the other end of the terrace, but there is no standardisation whatsoever in terrace lengths, so the increase in value there is scattered throughout the house numbers.

Then there's street length. In general, shorter streets have better houses. 138 Smith Street is almost certainly going to be on an urban street, or maybe on a main road in the suburbs with lots of traffic. Either way, not the greatest for house prices. 6 Jones Street, on the other hand, could be on a similar street, or it could equally be on a side street in an expensive rural village. Averaged over many streets, that explains lower numbers being higher priced.
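The street-length effect is easy to reproduce with made-up data. The sketch below is a toy simulation, not a model of any real housing market: the street lengths, base prices and noise level are all invented. The important design choice is that a house's price depends only on the kind of street it is on, never on its number.

```python
import random
from statistics import mean


def simulate_houses(n_streets=500, seed=1):
    """Return (street_kind, house_number, price) tuples for invented streets.

    ASSUMPTIONS (all hypothetical): half the streets are short and expensive,
    half are long and cheaper. Price = street's base price + random noise;
    the house number plays no part in the price.
    """
    rng = random.Random(seed)
    houses = []
    for _ in range(n_streets):
        if rng.random() < 0.5:
            kind, length, base = "short", rng.randint(5, 20), 300_000
        else:
            kind, length, base = "long", rng.randint(50, 150), 180_000
        for number in range(1, length + 1):
            price = base + rng.gauss(0, 30_000)
            houses.append((kind, number, price))
    return houses


houses = simulate_houses()

# Across all streets: low numbers vs high numbers.
low_all = mean(p for _, n, p in houses if n <= 20)
high_all = mean(p for _, n, p in houses if n > 20)

# Within the long streets only: the same comparison.
low_long = mean(p for k, n, p in houses if k == "long" and n <= 20)
high_long = mean(p for k, n, p in houses if k == "long" and n > 20)

print(f"all streets:  1-20 -> {low_all:,.0f}   21+ -> {high_all:,.0f}")
print(f"long streets: 1-20 -> {low_long:,.0f}   21+ -> {high_long:,.0f}")
```

Across all streets the low numbers come out well ahead, because every house on a short street has a low number. Within the long streets alone the gap all but disappears, since the number itself was never in the price formula.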

The final figure - un-numbered houses being significantly (over 40%) more expensive than numbered houses - has much the same sort of obvious real cause. Un-numbered houses will include mansions, converted farms, actual farms, and other large isolated buildings, with asking prices potentially into the millions of pounds.

This is why "correlation is causation" is so tempting - the house numbers don't cause the price differences; position on the street and type of street do. However, position and type (plus local standard numbering algorithm) also cause the numbers - so there's a very noticeable correlation on a statistical level. Additionally, house number and house price are really easy to measure quantitatively - "type" and "numbering algorithm" are much more qualitative.

Note that the correlation is therefore no use as an individual predictor: what street a house is on will tell you far more about its price than its position on the street, and the effect is sufficiently subtle that - apart from end terrace effects - it probably won't be reliably visible on a single street. This is another useful statistical reminder: a difference in averages can be virtually irrelevant if you're looking at individual cases.
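One way to put a number on "virtually irrelevant" is to ask how often a randomly chosen odd-numbered house would be worth more than a randomly chosen even-numbered one. The sketch below assumes prices in each group are normally distributed with the same spread of 100,000 pounds. Both of those are my assumptions rather than anything in the article, and real house prices are skewed, so treat the result as indicative only.

```python
import math


def prob_first_exceeds(mean_a, mean_b, sd):
    """P(A > B) for independent normal variables A and B with a common sd.

    A - B is normal with mean (mean_a - mean_b) and sd of sd * sqrt(2),
    so the answer is the standard normal CDF at the standardised gap.
    """
    z = (mean_a - mean_b) / (sd * math.sqrt(2))
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Means from the article; the spread is a hypothetical value for illustration.
ASSUMED_SD = 100_000

p_odd_beats_even = prob_first_exceeds(207_202, 206_664, ASSUMED_SD)
p_one_beats_thirteen = prob_first_exceeds(229_411, 203_892, ASSUMED_SD)

print(f"P(odd house > even house):     {p_odd_beats_even:.3f}")
print(f"P(number 1 > number 13 house): {p_one_beats_thirteen:.3f}")
```

Under those assumptions the odd/even comparison is barely distinguishable from a coin toss, and even the much larger gap between number 1 and number 13 leaves plenty of individual number 13s worth more than individual number 1s.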