Bookmaker Analysis NRL 2009-2015

Foreword

This is the first of many R-Rated posts to come! Now before you get excited, when I say R-Rated I mean that I will be providing R (programming language) code snippets throughout the post for those interested in re-creating some of the work I have done, and/or doing their own analysis. For those not interested in the R-Rated stuff, avert your precious eyes!!

Introduction

It's time to take a look at the bookmakers' historical odds data (and in fact this should be one of the very first things you look at). Also (and for me this is actually in retrospect), the bookmakers' predictions should be the benchmark for any predictive modelling you do yourself. You will see this a bit later on, but trust me, the bookmakers are very good at predicting the outcomes of matches. In fact I would wager that the best predictive model or system you come up with will only be about on par with the bookmakers' long-term precision!

But all is not lost. Even if the bookmakers are as good as, or maybe even better than, you at predicting game outcomes, analysing the bookmakers' information can still help identify what constitutes a safe bet and what constitutes a poor or risky bet.

The following takes a look at how good the bookmakers are at predicting games, and at opportunities for potentially profitable staking strategies, using statistical analysis of the bookmakers' odds for winning and losing predictions.

The data set

Of course what you need first and foremost is a decent data set of bookmaker odds. Luckily I have a sample up on my website which can be downloaded for such analysis. Some of the data manipulation is already done for you such as normalising the odds, calculating the implied probability, vig and overround, labeling true positive, false positive events etc. A couple of quick notes on the data set:

There is only one odds column (bookie_back_odds) and this represents the average of multiple bookmaker odds close to closing of bets.
Odds are the decimal odds of a team winning
The probability is simply calculated using 1/Odds
Normalised odds/probabilities are just rescaled (weighted) so that the two teams' probabilities in a match sum to 100%
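As a rough illustration of those notes, the sketch below recomputes the derived values for a single match, using the two odds from the first game in the file (1.69 and 2.15). The variable names are my own, and the vig formula (overround / (1 + overround)) is an assumption that happens to reproduce the values in the data, not a documented definition:

```r
# Decimal odds for the two teams in one match (first game in the data set)
odds <- c(1.69, 2.15)

# Implied probability is simply 1 / odds
prob <- 1 / odds

# Overround: how far the implied probabilities add up beyond 100%
overround <- sum(prob) - 1

# Assumed vig definition: the overround as a share of the total book
vig <- overround / (1 + overround)

# Normalised probabilities: rescale so the two teams sum to exactly 1
prob_norm <- prob / sum(prob)

print(round(c(overround = overround, vig = vig), 3))
print(round(prob_norm, 3))
```

Rounded, these come out at 0.057, 0.054 and 0.56/0.44, which lines up with the overround, vig and bookie_back_prob_norm columns for that game.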

The key variables of interest will be:

win_lose (binary indicator if the team won (1) or lost (0))
bookie_back_prob_norm (bookies probability that a team will win normalised to 100%)
bookie_label (binary indicator of the bookies prediction if the team will win or lose)
TP (True positive results (where predicted win and actual result was a win))
FP (False positive results (where bookie predicted a win and result was a loss))

OK, let's get started!!

Load and inspect data

First let's load the historic data from the maxwellAI website and call it b.

b <- read.csv ("http://maxwellai.com/wp-content/uploads/2016/03/nrl_2009_2015_bookmaker_odds.csv", header = TRUE)

Now let's take a quick look at the list of features in the file:

colnames(b)

##  [1] "match_id"              "match_team_id"        
##  [3] "season"                "round_no"             
##  [5] "game_no"               "team_descr"           
##  [7] "team_against_descr"    "score"                
##  [9] "score_against"         "home_away"            
## [11] "win_margin"            "win_lose"             
## [13] "bookie_back_odds"      "bookie_back_prob"     
## [15] "bookie_back_odds_norm" "bookie_back_prob_norm"
## [17] "overround"             "vig"                  
## [19] "bookie_label"          "TP"                   
## [21] "FP"                    "TN"                   
## [23] "FN"

And quickly take a look at the data (just the top 5 entries)

str(head(b, n=5))

## 'data.frame':    5 obs. of  23 variables:
##  $ match_id             : int  1 1 2 2 3
##  $ match_team_id        : int  1 2 4 3 5
##  $ season               : int  2009 2009 2009 2009 2009
##  $ round_no             : int  1 1 1 1 1
##  $ game_no              : int  1 1 2 2 3
##  $ team_descr           : Factor w/ 16 levels "Brisbane Broncos",..: 1 10 7 14 4
##  $ team_against_descr   : Factor w/ 16 levels "Brisbane Broncos",..: 10 1 14 7 12
##  $ score                : int  19 18 17 16 18
##  $ score_against        : int  18 19 16 17 10
##  $ home_away            : Factor w/ 2 levels "A","H": 2 1 2 1 2
##  $ win_margin           : int  1 0 1 0 8
##  $ win_lose             : int  1 0 1 0 1
##  $ bookie_back_odds     : num  1.69 2.15 1.4 2.93 1.42
##  $ bookie_back_prob     : num  0.592 0.465 0.714 0.341 0.704
##  $ bookie_back_odds_norm: num  1.6 2.03 1.33 2.78 1.35
##  $ bookie_back_prob_norm: num  0.56 0.44 0.677 0.323 0.669
##  $ overround            : num  0.057 0.057 0.056 0.056 0.053
##  $ vig                  : num  0.054 0.054 0.053 0.053 0.05
##  $ bookie_label         : int  1 0 1 0 1
##  $ TP                   : int  1 0 1 0 1
##  $ FP                   : int  0 0 0 0 0
##  $ TN                   : int  0 1 0 1 0
##  $ FN                   : int  0 0 0 0 0

Analyse Bookmakers historic precision

Right! Now let's have a look at how good the (collective) bookmaker actually is at predicting winners!

Subset data and calculate precision

First we need to subset the data so that bookie_label == 1 (we are only interested in the bookies' win picks). Then we can calculate the precision of the bookmaker. We will just add the precision column to the data frame for convenience:

bp <- subset(b, b$bookie_label == 1)
#add precision column
bp$precision <- sum(bp$TP)/(sum(bp$FP) + sum(bp$TP))
#report the overall precision
round (max(bp$precision), digits=2)

## [1] 0.65

Observations

We observe that the overall precision is 0.65! This means that historically, across 2009-2015, the team the bookmaker picked to win actually won 65% of the time, which is really quite good.

Visualising precision across years

Now, let's have a quick look at what that looks like visually across each year by creating a simple line graph of the average precision by round. Also let's add a linear trend line and a 0.5 'cut-off' line to see if we can identify any simple trends (note I have omitted the R code to create the plots, which is a bit lengthy):
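For anyone wanting to reproduce something similar, here is a minimal sketch of one way to do it. It is not the code used for the plots in this post: the helper name precision_by_round and the toy data frame are made up for illustration, and the plotting step assumes ggplot2 is installed.

```r
# Toy stand-in for the bookie-favourite rows (bp); the real data has the same columns
toy <- data.frame(
  season   = rep(c(2009, 2010), each = 6),
  round_no = rep(rep(1:3, each = 2), times = 2),
  TP       = c(1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1),
  FP       = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0)
)

# Precision per season and round: TP / (TP + FP)
precision_by_round <- function(df) {
  agg <- aggregate(cbind(TP, FP) ~ season + round_no, data = df, FUN = sum)
  agg$precision <- agg$TP / (agg$TP + agg$FP)
  agg[order(agg$season, agg$round_no), ]
}

pr <- precision_by_round(toy)
print(pr)

# One panel per season, with a linear trend line and the 0.5 cut-off line
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p1 <- ggplot(pr, aes(round_no, precision)) +
    geom_line() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    geom_hline(yintercept = 0.5, linetype = "dashed", colour = "red") +
    facet_wrap(~ season)
}
```

With the real data you would call precision_by_round(bp) and then plot(p1) to draw the chart.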

Observations

We can see that:

for the majority of seasons, the predictive precision per round increases as the season progresses. The exceptions are 2013 and 2015: 2013 started off very strong but bombed in the later rounds (after round 24), while 2015 was relatively flat.
in most years the grand final (last round) was picked correctly.
seasons 2010 and 2015 appear to have the lowest overall accuracy (the trend line is closest to the 0.5 cut-off line)
the majority of the time, the precision is above the 0.5 cut-off line, however there are a few very low precision rounds. It is difficult to identify any specific low precision rounds that recur across the seasons

Bookmakers probability distribution

Now let's have a look at how the bookmakers' probabilities compare between the false positives (when they picked a team to win and it lost) and the true positives. This is going to tell us how well calibrated the probabilities are and will be useful for determining appropriate staking strategies.

First let's take a quick look at the summary statistics for the normalised probabilities for true positive events and false positive events:

For true positive events:

bTP <- subset(bp,bp$TP ==1)
#save third quartile for use later
bpTPq <-quantile(bTP$bookie_back_prob_norm, c(.75)) 

print (summary(bTP$bookie_back_prob_norm))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5010  0.5870  0.6470  0.6561  0.7070  0.9380

For false positive events:

bFP <- subset(bp,bp$FP ==1)
#save third quartile for use later
bpFPq <-quantile(bFP$bookie_back_prob_norm, c(.75)) 
print (summary(bFP$bookie_back_prob_norm))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5010  0.5630  0.6180  0.6267  0.6740  0.9120

And lets take a look at that visually in a simple box-plot:

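#load ggplot2 for the plots
library(ggplot2)
#box-plot of normalised probability for false positives (0) and true positives (1)
p2 <- ggplot(bp,aes(factor(TP),bookie_back_prob_norm))
p2 <- p2 + geom_boxplot()
plot(p2)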

Great! We can see straight away that the probabilities are well calibrated because the mean of the probability of the false positive events is lower than the mean of the probability of the true positives. This means that even though the bookies have mis-predictions (false positives), their implied probability is lower for these events, which is really good (for them, not so much for the punters).
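A more direct way to check calibration is to bin the quoted probabilities and compare the average quoted probability in each bin with how often those teams actually won. The sketch below shows the idea on simulated data (not the NRL file); the function name calibration_table and the 10% bin width are arbitrary choices of mine:

```r
set.seed(1)
# Simulated stand-in: 1,000 favourites whose true win chance equals the quoted probability
n <- 1000
prob <- runif(n, 0.5, 0.95)
won  <- rbinom(n, size = 1, prob = prob)

# For each probability bin: number of games, mean quoted probability, observed win rate
calibration_table <- function(prob, won, breaks = seq(0.5, 1, by = 0.1)) {
  bin <- cut(prob, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin       = levels(bin),
    n         = as.vector(table(bin)),
    predicted = as.vector(tapply(prob, bin, mean)),
    observed  = as.vector(tapply(won, bin, mean))
  )
}

calib <- calibration_table(prob, won)
print(calib)
```

On the real data the equivalent call would be calibration_table(bp$bookie_back_prob_norm, bp$win_lose). The closer the predicted and observed columns track each other, the better calibrated the odds.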

We also observe that the third quartile for false positives is 0.67 (equivalent to 1.49 odds) and that there are a large number of true positives >0.67. Now this is important: it means that when the bookmaker gives odds-on prices with >0.67 implied probability (<1.49 odds), then in the overwhelming majority of cases the bookmaker will accurately pick the winner. We can actually calculate the bookmaker's precision in this region, which is 75%!! This means if you bet on anything with odds <1.49, then you should win ~75% of your bets:

#note: the cut-off used here is bpTPq, the third quartile of the true positives (~0.71)
bp75 <- subset(b, b$bookie_back_prob_norm> bpTPq)
#add precision column
bp75$precision <- sum(bp75$TP)/(sum(bp75$FP) + sum(bp75$TP))
#report the overall precision
round (max(bp75$precision), digits=2)

## [1] 0.75

If you want to look at it from an odds perspective, we can create the same box-plot but with odds on the y axis instead of the probability. Note that the plot should be reversed, because higher (i.e. longer) odds mean lower probability. I personally prefer looking at it from a probability point of view, but in some scenarios it might be good to look from an odds point of view. Let's see:

p3 <- ggplot(bp,aes(factor(TP),bookie_back_odds_norm))
p3 <- p3 + geom_boxplot()
plot(p3)

Yes, the box-plots are reversed, but you can make the same inference as above: if the odds-on price is <1.49 then in the majority of cases the bookie will be correct!

But what does it all mean??

So what does all this mean for us?? Well, for one, the bookmaker is going to be damn hard to beat! They have well calibrated probabilities, a precise predictive model, and on top of this a little leeway due to the overround/vig. But like I said in the introduction, all is not lost; just in doing this simple exercise we have learnt a great deal about the bookmakers, and we can try to use this analysis to help us out in our own 'informed' gambling. So let's review what we know and think about how we can use it logically.
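To put a number on that leeway: if we take the bookmaker's normalised probability as the true chance of winning, the expected profit on a $1 bet is prob_norm * odds - 1. A minimal sketch using the first game in the data set (odds 1.69 and 2.15); the variable names are illustrative only:

```r
# Decimal odds for both teams in the first game of the data set
odds <- c(1.69, 2.15)

# Normalised probabilities, as in the bookie_back_prob_norm column
prob_norm <- (1 / odds) / sum(1 / odds)

# Expected profit per $1 staked if prob_norm is the true win probability
ev <- prob_norm * odds - 1
print(round(ev, 3))
```

Both sides come out at roughly -0.054, i.e. the vig with the sign flipped. In other words, if the bookmaker's probabilities are right, a punter backing either team blindly expects to lose about 5 cents in the dollar.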

Ok, so we know that the bookmaker predicts winners correctly about 75% of the time when odds on are <1.49. This means that if we pick games when the odds are <1.49 then we also have a 75% chance of a return on a bet… but does it mean we will make a profit? We should just be able to place bets on anything with odds <1.49 and get a return, right?

Let's see. I will quickly make a simulated bet with the logic of staking on everything which is >0.67 in probability (equivalent to <1.49 odds):

#create a new dataframe using bp for simple simulated staking
bss <- bp
#create a 'stake' column and just set it to a fixed staking value
bss$stake <- 10
#find the 3rd quartile of the bookie probability
bssFPq <-quantile(bFP$bookie_back_prob_norm, c(.75)) 
bssTPq <-quantile(bTP$bookie_back_prob_norm, c(.75)) 
#create stake column using logic
bss$stake <- ifelse(bss$bookie_back_prob_norm > bssFPq , bss$stake,0)
#now simulate profit/loss on the stake amount
bss$profit <- ifelse (bss$TP == 1, (bss$stake*bss$bookie_back_odds-bss$stake),-bss$stake)
#create a total profit/loss column
bss$profit_total <- sum(bss$profit)
#now create a cumulative profit for graphing
bss$profit_cum <-  round(cumsum(bss$profit),2)

#print the amount of profit we would make
print(max(bss$profit_total))

## [1] -284.6

Bummer!! That's -$284.60, so we actually made a loss. Seems it's not so simple after all!! So what went wrong here? Well, there must be enough losing bets (false positives) above the threshold to offset our small profits.

Let's take a look at how many false positives there are which are > 0.67:

sum(ifelse(bss$bookie_back_prob_norm >bssFPq & bss$FP ==1, 1,0))

## [1] 120

So there are 120 of these occurrences. Ok, so that means straight up we are going to lose $1,200 ($10 stake * the number of false positives).

Now let's count the number of occurrences of true positives which are >0.67 probability:

sum(ifelse(bss$bookie_back_prob_norm >bssFPq & bss$TP ==1, 1,0))

## [1] 338

Ok, so there are 338 of these. That seems like a lot more, but obviously not enough... let's see why. The average of the bookmaker odds above 0.67 probability is 1.28:

mean(subset(bp$bookie_back_odds,bp$bookie_back_prob_norm >bssFPq))

## [1] 1.276114

So the profit on our winning bets above 0.67 is (on average) going to be about $946:

# (1.28 * stake * number of winning bets) - stake * number of winning bets
(1.28 * 10 * 338) - (10 * 338)

## [1] 946.4

So we can see straight away that we are going to make a loss, because this value ($946) is smaller than the amount lost on the false positives at the same threshold ($1,200).

We can use this knowledge to create a simple formula to check if we can profit from any sort of ‘short odds’ betting. We can use:

((TPd*TPb)-TPb) - FPb

where

TPd is the average decimal odds of the true positives above/within the threshold
FPb is the number of false positive bets above/within the threshold
TPb is the number of true positive bets above/within the threshold

So we could reduce our previous staking simulation to: (1.28*338)-338 - 120

((1.28 *338)-338)-120

## [1] -25.36

Which is about -25 stake units (roughly -$254 at our $10 stake). Since this is negative we aren't going to make any sort of money staking like this, because the number of false positives outweighs the number of true positives times the profit portion of the odds on the true positives.
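The same check can be wrapped up as a pair of small helper functions. This is just a sketch: the function names are made up here, and the inputs are the counts and average odds reported above (338 true positives, 120 false positives, average odds of 1.28). The second helper rearranges the formula to give the average odds the winners would need for the strategy to break even.

```r
# Profit in stake units from backing every favourite above a threshold
# TPd: average decimal odds of the winning bets
# TPb: number of winning bets, FPb: number of losing bets
short_odds_edge <- function(TPd, TPb, FPb) {
  (TPd * TPb - TPb) - FPb
}

# Average odds needed on the winners just to break even
# (solves (TPd * TPb - TPb) - FPb = 0 for TPd)
break_even_odds <- function(TPb, FPb) {
  1 + FPb / TPb
}

edge <- short_odds_edge(TPd = 1.28, TPb = 338, FPb = 120)
be   <- break_even_odds(TPb = 338, FPb = 120)
print(edge)          # negative, so a loss
print(round(be, 2))  # 1.36
```

So with 120 losers for every 338 winners, the winners would need to pay about 1.36 on average, which is well above the 1.28 actually on offer in this region.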

Right! So is there anything that might work? Well, when we looked at the box plot there looked to be as many outliers in the false positive range above 0.67 as there were true positives in this range. So what about if we just use the range between the 3rd quartiles (75th percentiles) of the false positives and true positives?

p2b <- ggplot(bp,aes(factor(TP),bookie_back_prob_norm))
p2b <- p2b + geom_boxplot()
p2b <- p2b + geom_hline(aes(yintercept=bpFPq), col = "coral", linetype = "dashed", size = 0.8)
p2b <- p2b + geom_hline(aes(yintercept=bpTPq), col = "coral", linetype = "dashed", size = 0.8)
plot(p2b)

Sounds like a pretty simple strategy, maybe too good to be true, so let's quickly check how it performs. We will just use a really simple fixed staking calculation on the data, with the logic that we bet $10 on every game which has an implied probability of between ~0.67 and ~0.71:

#create a 'stake' column and just set it to a fixed staking value
bp$stake <- 10

#save 75th percentile (3rd quartile) of the true positives and false positives
bpTPq <-quantile(bTP$bookie_back_prob_norm, c(.75)) 
bpFPq <-quantile(bFP$bookie_back_prob_norm, c(.75)) 
#set up basic staking logic 
  #if the bookie back odds are between 3rd q of the FP and 3rd Q of the TP then we are going to stake $10 otherwise we are not going to stake ($0)
bp$stake <- ifelse(bp$bookie_back_prob_norm > bpFPq & bp$bookie_back_prob_norm < bpTPq, bp$stake,0)
#now simulate profit/loss on the stake amount
bp$profit <- ifelse (bp$TP == 1, (bp$stake*bp$bookie_back_odds-bp$stake),-bp$stake)
#create a total profit/loss column
bp$profit_total <- sum(bp$profit)
#now create a cumulative profit for graphing
bp$profit_cum <-  round(cumsum(bp$profit),2)

#print the amount of profit we would make
print(max(bp$profit_total))

## [1] 42.2

Yay!!!! All that hard work and we got $42.20!! On my way to money town oh yeah!!

Ahem… back to reality. Let's chart that baby up and see what it looks like..

p4 <- ggplot(bp, aes(match_id,profit_cum))
p4 <- p4 + geom_path(aes(col=profit_cum), size = 0.5)
p4 <- p4 + geom_hline(aes(yintercept=0.0), col = "red", linetype = "dashed", size = 0.8)
plot(p4)

Not so pretty… BUT what really stands out is that profit never dropped below 0… now that is interesting…

So it appears that betting in this region is profitable (just). Let's explore a little bit further. The total amount staked across all seasons was $1,500:

sum(bp$stake)

## [1] 1500

So if we made $42 then return on investment (ROI) is ~3%:

(sum(bp$profit)/sum(bp$stake))

## [1] 0.02813333

That is we invested $1,500 over the term and ended up with $1,542 (profit $42) for a total return of…. ~3%… amazing.. (I am being sarcastic, BUT making any profit as you will come to know is actually quite a challenge)
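Since the two staking simulations above share the same logic, it can be handy to wrap it in a function so different probability windows can be tried quickly. The following is a sketch only: flat_stake_sim is a name I have made up, and the toy data frame is invented purely so the example runs on its own (it borrows the column names from the real data).

```r
# Flat-stake simulation: bet a fixed amount whenever the normalised
# probability lies strictly inside (lower, upper)
flat_stake_sim <- function(df, lower, upper, stake = 10) {
  bet    <- df$bookie_back_prob_norm > lower & df$bookie_back_prob_norm < upper
  staked <- ifelse(bet, stake, 0)
  # Winning bets return stake * odds, losing bets lose the stake
  profit <- ifelse(df$TP == 1, staked * df$bookie_back_odds - staked, -staked)
  list(
    profit   = sum(profit),
    staked   = sum(staked),
    roi      = sum(profit) / sum(staked),
    n_bets   = sum(bet),
    hit_rate = sum(bet & df$TP == 1) / sum(bet)
  )
}

# Invented toy data: five favourites, one of which lost (TP = 0)
toy <- data.frame(
  bookie_back_prob_norm = c(0.60, 0.68, 0.69, 0.70, 0.80),
  bookie_back_odds      = c(1.60, 1.40, 1.40, 1.35, 1.20),
  TP                    = c(1, 1, 0, 1, 1)
)

res <- flat_stake_sim(toy, lower = 0.67, upper = 0.71)
str(res)
```

On the real data, flat_stake_sim(bp, bpFPq, bpTPq) should give the profit, amount staked and ROI for the window between the two third quartiles.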

So how many bets did we actually make?

sum(ifelse(bp$stake ==10, 1,0))

## [1] 150

150 bets! That's not many… how many games did we forgo betting on?

sum(ifelse(bp$stake ==0, 1,0))

## [1] 1242

1,242! Ok, so we ended up betting and profiting small on a low number of games, but came out on top. What was our staking accuracy??

It's the number of bets we placed that made a profit over the total number of bets we placed:

(sum(ifelse(bp$profit >0, 1,0)))/
  (sum(ifelse(bp$stake >0, 1,0)))

## [1] 0.7466667

About 75%!! Well, that's pretty damn good!
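Keep in mind that 150 bets is a small sample, so that hit rate comes with a fair bit of uncertainty. As a quick sketch of how to gauge it, base R's binom.test gives an exact binomial confidence interval; the 112 wins used below is inferred from 0.7466667 * 150 rather than read from the data directly:

```r
# Winning bets and total bets in the window between the two third quartiles
bets <- 150
wins <- round(0.7466667 * bets)  # 112, inferred from the reported hit rate

# Exact (Clopper-Pearson) 95% confidence interval for the hit rate
ci <- binom.test(wins, bets)$conf.int
print(round(ci, 2))
```

If the lower end of that interval sits below the break-even hit rate for the odds in this window, the small profit could easily be down to luck.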

Recap

Ok lets just recap what we have learned.

If we use the bookmakers' normalised probability to predict the outcome of matches then we would have achieved a 65% precision across seasons 2009-2015
If we stake on matches where the normalised probability lies between the 3rd quartiles of the false positive and true positive results then we can achieve a small profit (~3% ROI), which makes this a relatively safe bet
Using a simple staking threshold to back strong favourites did not achieve a profit because the number of false positives outweighed the number of true positives times the odds.

Exploring long odds

So what if we look at the same scenario but for true negative and false negative events, on the chance that we can gain some profit from betting on long odds (>2.0)?

I’ll just jump straight to the box-plots for these, and quickly plot our preferred betting range as before:

#teams the bookie picked to lose
bn <- subset(b, b$bookie_label == 0)
bTN <- subset(bn,bn$TN ==1)
bFN <- subset(bn,bn$FN ==1)
#save third quartiles of the true negatives and false negatives
bnTNq <-quantile(bTN$bookie_back_prob_norm, c(.75)) 
bnFNq <-quantile(bFN$bookie_back_prob_norm, c(.75)) 
p5 <- ggplot(bn,aes(factor(TN),bookie_back_prob_norm))
p5 <- p5 + geom_boxplot()
p5 <- p5 + geom_hline(aes(yintercept=bnTNq), col = "coral", linetype = "dashed", size = 0.5)
p5 <- p5 + geom_hline(aes(yintercept=bnFNq), col = "coral", linetype = "dashed", size = 0.5)
plot(p5)

Right, now let's put our opposite hats on. Looking at the box-plots we see that the plots are reversed compared to our previous probability distribution plots, which is what we expect for well calibrated probabilities in the true/false negative space. But now what we want to explore is the area where there are a higher number of false negatives than there are true negatives. This is because a false negative means the model has predicted a loss, but the result was a win. So let's see if we can leverage this, because these are long odds and so will pay out more.

Using a similar approach as before, let's explore placing bets on games which are above the 3rd quartile of the true negative results and below the 3rd quartile of the false negatives (the region between the dashed lines):

#create a 'stake' column and just set it to a fixed staking value
bn$stake <- 10
#set up basic staking logic 
  #if the bookie probability is between the 3rd quartile of the TN and the 3rd quartile of the FN we are going to stake $10 otherwise we are not going to stake ($0)
bn$stake <- ifelse(bn$bookie_back_prob_norm > bnTNq & bn$bookie_back_prob_norm < bnFNq, bn$stake,0)
#now simulate profit/loss on the stake amount
bn$profit <- ifelse (bn$FN == 1, (bn$stake*bn$bookie_back_odds-bn$stake),-bn$stake)
#create a total profit/loss column
bn$profit_total <- sum(bn$profit)
#now create a cumulative profit for graphing
bn$profit_cum <-  round(cumsum(bn$profit),2)

#print the amount of profit we would make
print(max(bn$profit_total))

## [1] -120

Bugger!! Looks like backing long odds in the region of higher false negatives doesn't quite pay off. We are down $120, which means that this would have wiped out our profit from the short odds betting… Back to the drawing board on that one..

Summary

While brief, the above is a simple review of how good the bookmakers are at predicting winners (65% precision across 2009-2015) and gives a brief insight into the structure of the bookmakers' odds. It shows that the bookmakers' probabilities are well calibrated (the mean of the false positives is lower than the mean of the true positives) and additionally shows that there is a theoretical window of short odds betting which can make a very small, long term return on investment (~3%).

Now it's time to explore any strategies which might make a higher ROI!! Stay tuned!!