Monday, January 21, 2013

The Ultimate Statistic for Predicting Match Outcomes

I've been reading a lot of posts recently comparing two statistics, shots (S) and shots-on-target (SoT), for predicting match outcomes.
Teams who win the battle for Shots on Target show a significantly increased probability of winning...
Neither of these statistics, however, can hold a candle to the ultimate statistic, G. Forget S and SoT. It turns out that teams with higher G win the game 100% of the time!

Of course, G is goals. And I'm not telling you anything new by saying that the team who scores more goals always wins. That comes straight from the rule book.

Now, I bring this up not to make a dumb joke (well, not just to make a dumb joke) but rather to make a complaint about the comparison of shots versus shots-on-target.

The root of my complaint is this fact: every goal is a shot and a shot-on-target. We can think of shots-on-target as goals + blocks + saves. And we can think of shots as goals + blocks + saves + shots-off-target. Both of these statistics are goals + other-stuff.

If we compare these two statistics by checking which better correlates with winning, then we are essentially comparing which version of goals + other-stuff better correlates with just plain goals. (This is especially true if we compare ratios of these statistics to match outcomes, which seems to be the usual practice now.)

One way that we could see a better correlation is if one version of "other stuff" is better correlated with goals than the other. But another way to get a better correlation is for one version of "other stuff" to simply be smaller than the other. In particular, if both versions of "other stuff" are pure randomness, then whichever one is smaller will produce a better correlation. That is, goals + small randomness will correlate better with goals than goals + large randomness.
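Here is a minimal sketch of that last point, using made-up numbers rather than real match data. It assumes goals are a count of successes from 10 chances at 15% each, and that "other stuff" is a coin-flip count that has nothing to do with goals. The function names and all of the parameters are my own inventions for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def count_successes(rng, tries, p):
    """Number of successes in `tries` independent attempts with probability p."""
    return sum(rng.random() < p for _ in range(tries))


def simulate(n=20000, seed=0):
    """Correlate goals with goals + small noise and with goals + large noise.

    All parameters are hypothetical; the noise is independent of goals.
    """
    rng = random.Random(seed)
    goals = [count_successes(rng, 10, 0.15) for _ in range(n)]
    small_noise = [count_successes(rng, 4, 0.5) for _ in range(n)]   # few extra events
    large_noise = [count_successes(rng, 12, 0.5) for _ in range(n)]  # many extra events
    stat_small = [g + e for g, e in zip(goals, small_noise)]
    stat_large = [g + e for g, e in zip(goals, large_noise)]
    return pearson(stat_small, goals), pearson(stat_large, goals)


if __name__ == "__main__":
    r_small, r_large = simulate()
    print("goals + small randomness vs goals:", round(r_small, 3))
    print("goals + large randomness vs goals:", round(r_large, 3))
```

Neither version of "other stuff" knows anything about goals here, yet the smaller one should come out well ahead. When the noise is independent of goals, the correlation is sd(goals) / sqrt(var(goals) + var(noise)), which for these invented parameters is roughly 0.75 versus 0.55.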

Should we be worried about that happening here? Yes, we should, because shots-on-target is much smaller than shots (usually 2-3 times smaller).

We can eliminate this concern, however, by removing "goals" from these formulas. That is, we can compare shots that are not goals to shots on target that are not goals to see which better correlates with winning the match.
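The comparison can be sketched roughly as follows. This is only an outline under my own assumptions: one row per team per match, raw counts rather than ratios, and the result coded as 1 for a win, 0.5 for a draw and 0 for a loss. The field names and the toy rows are invented for illustration and are not real match data.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def compare_predictors(rows):
    """Correlate each statistic with the match result, with and without goals.

    `rows` is a list of dicts with the hypothetical keys
    "shots", "shots_on_target", "goals" and "result".
    """
    result = [r["result"] for r in rows]
    columns = {
        "shots": [r["shots"] for r in rows],
        "shots_on_target": [r["shots_on_target"] for r in rows],
        # The same statistics with the goals subtracted out.
        "shots_minus_goals": [r["shots"] - r["goals"] for r in rows],
        "sot_minus_goals": [r["shots_on_target"] - r["goals"] for r in rows],
    }
    return {name: pearson(values, result) for name, values in columns.items()}


# Invented rows, only to show the expected input shape.
toy_rows = [
    {"shots": 14, "shots_on_target": 6, "goals": 2, "result": 1.0},
    {"shots": 9, "shots_on_target": 3, "goals": 1, "result": 0.5},
    {"shots": 11, "shots_on_target": 2, "goals": 0, "result": 0.0},
    {"shots": 17, "shots_on_target": 7, "goals": 3, "result": 1.0},
    {"shots": 6, "shots_on_target": 4, "goals": 1, "result": 0.0},
    {"shots": 12, "shots_on_target": 5, "goals": 1, "result": 0.5},
]

if __name__ == "__main__":
    for name, r in compare_predictors(toy_rows).items():
        print(name, round(r, 3))
```

Six invented rows say nothing about football, of course. The point is only that the two goal-free columns are what should be compared, fed with a full season of real matches.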

I tested this using the matches from this season of the Barclays Premier League (~450 matches so far). When we use the usual statistics, we find what others have claimed: shots-on-target has a higher correlation with winning. The correlation coefficient for shots-on-target is .476 versus .299 for shots, which is a huge difference. However, if we subtract out the goals from these, the correlation coefficient for shots-on-target is .117 versus .151 for shots. Now, shots becomes the better predictor. This is what we would expect to happen if the advantage for shots-on-target were simply due to its being smaller.

More importantly, this shows that including shots-off-target in our statistic increases the correlation with winning, at least once we remove from our statistic the very thing (goals) we were trying to predict. This only seems fair since, if we are allowed to include goals as part of our statistic to predict match outcomes, then we should just use the ultimate statistic: goals itself.

Afterword

The above analysis is not meant to prove that shots rather than shots-on-target is the better statistic for analyzing matches. Instead, the above analysis is meant to show that comparing these two statistics by which better correlates with winning is not sensible since one statistic can correlate better simply by being smaller.

(There are more sensible ways to compare these statistics, for example, by looking at which has less regression to the mean, amongst other things...)
