Thursday, March 3, 2011

The Trouble With Statistics

It is often said that statistics can be made to prove anything. In the words of Gregg Easterbrook: "Torture numbers, and they'll confess to anything."

This highlights what is probably the most dangerous aspect of working with statistics. Suppose you want to know who will win an upcoming match, say, Blackburn v Chelsea. One person says Chelsea are the better team and will clearly win. Another person says, no, Blackburn are good at home and Chelsea are poor away, so Blackburn will win. Yet another person says that Zhirkov is back for Chelsea and their away record is much better when he's in the lineup. Then someone else points out that Chelsea just had a Champions League match, and their record after such midweek matches is... And on and on. Pick whichever outcome you want, and there are statistics that will "prove" you are right.

The problem behind this anecdote is well known to statisticians as overfitting. If you look at enough different variables, you will find some that appear to predict the outcome purely by coincidence. Indeed, even in completely random data, some patterns are bound to appear.
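To see how easily this happens, here is a small illustrative sketch (the data and numbers are made up, not from any real season): generate one season of random results and a pile of random "predictor" variables, then check how well the best one appears to do.

    import numpy as np

    rng = np.random.default_rng(0)

    n_matches = 28      # roughly one season of weekly results
    n_variables = 200   # many candidate "explanations"

    outcome = rng.integers(0, 2, size=n_matches)                    # random win/loss
    predictors = rng.integers(0, 2, size=(n_variables, n_matches))  # random yes/no variables

    # Fraction of matches each random variable "predicts" correctly.
    accuracy = (predictors == outcome).mean(axis=1)
    print(f"Best accidental accuracy: {accuracy.max():.0%}")  # often around 75%, despite pure chance

None of these variables has any real relationship to the outcome, yet the best of them looks like a genuinely useful predictor.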

What may be news to non-statisticians, however, is that there is also a well-known solution to this problem: regularization. Rather than comparing possible explanations simply by how well they fit the data, you introduce a "cost" for each variable used in the explanation. The best explanation is the one that strikes the best balance between the quality of the fit and the simplicity of the explanation. In essence, regularization is just the application of Occam's razor.
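To make that concrete, here is a minimal sketch of such a cost function, assuming a linear model and an L1 penalty (the "lasso"), a standard convex stand-in for charging a price per variable used; the post does not commit to a particular penalty, so these choices and names are illustrative.

    import numpy as np

    def regularized_cost(weights, X, y, cost_per_variable):
        """Quality of fit plus a price for each variable the explanation uses."""
        fit_error = np.mean((X @ weights - y) ** 2)               # how well we fit the data
        complexity = cost_per_variable * np.sum(np.abs(weights))  # the Occam's razor term
        return fit_error + complexity

Minimizing the first term alone rewards explanations that fit the data well; the second term pushes weights toward zero, so a variable stays in the explanation only if it improves the fit by more than it costs.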

This raises another question, however: how much should each variable cost? We can determine the right cost by looking at last year's data. For example, before week 29 in the Premier League, we find the best explanations for the first 28 weeks of last year's data using various costs per variable. Then we compare those explanations based on how well they predict the results of the remaining 10 weeks' matches. (For statistics geeks, this technique is called "cross-validation".)
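A hedged sketch of that procedure (with hypothetical data shapes: one row per match, ten matches per week in a 20-team league, and scikit-learn's Lasso standing in for whatever regularized model is used):

    import numpy as np
    from sklearn.linear_model import Lasso

    def pick_cost(X_last_season, y_last_season, candidate_costs, weeks=28, per_week=10):
        """Choose the cost per variable that best predicts last season's held-out weeks."""
        split = weeks * per_week  # first 28 weeks for fitting
        X_train, y_train = X_last_season[:split], y_last_season[:split]
        X_test, y_test = X_last_season[split:], y_last_season[split:]

        held_out_error = {}
        for cost in candidate_costs:
            model = Lasso(alpha=cost).fit(X_train, y_train)
            held_out_error[cost] = np.mean((model.predict(X_test) - y_test) ** 2)

        return min(held_out_error, key=held_out_error.get)  # lowest error on weeks 29-38

Whichever cost makes the best predictions on those held-out weeks is the one we keep.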

It stands to reason that the cost that worked best last year will also work best this year, because the ideal cost should simply be a function of how much data we have. It is much easier to get accidental correlations over 5 weeks of data than over 25 weeks, for the same reason that it is easier to flip 5 heads in a row than 25 (a 1-in-32 chance versus roughly 1 in 34 million). By doing the experiment with last year's data, we can figure out which cost works best with 28 weeks of data and then use it again this year.

Regularization makes playing with statistics much, much safer. We can introduce all of the variables our friends tell us will predict the result: league record, home record, whether Zhirkov is playing, whether the team just had a midweek match, and anything else we can think of. We add a cost per variable, and the best explanation will keep the variables that matter and drop the ones that do not.
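As an illustration of that last step (again hypothetical: made-up data, and scikit-learn's Lasso as one possible regularized model), including an irrelevant variable alongside useful ones shows the penalty pricing it out:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n_matches = 280
    variables = ["league_record", "home_record", "zhirkov_playing",
                 "midweek_match", "pure_noise"]

    # Imaginary match data: one column per candidate variable.
    X = rng.normal(size=(n_matches, len(variables)))

    # Suppose only the first two variables actually drive the result.
    y = 1.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n_matches)

    model = Lasso(alpha=0.1).fit(X, y)
    for name, coef in zip(variables, model.coef_):
        print(f"{name:16s} {coef:+.2f}")  # irrelevant variables end up at (or near) 0.00

The variables that matter keep sizable coefficients; the rest are excluded from the explanation.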
