Staking Strategy – Part 1

Overview of strategy

  • Only stake on games from round 5 onwards (due to the model's historic performance in early rounds and the specific features which drive the model's precision)

  • Only stake on games in which the model's perceived probability is greater than the bookmaker's implied probability, and only when the perceived probability is also greater than the average of the model's false positive perceived probabilities (FPpp)

  • Stake using a single bookmaker (in this case Sportsbet)

  • Use a fixed stake of 5% of the current bank

  • Do not stake using any exotic bet types (such as accumulators, multi bets etc.); stake only on the simple head to head bets offered by the bookmaker (N.B. Sportsbet only offers head to head bets without draws, i.e. you cannot bet on the outcome of an NRL match being a draw)
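
These rules can be captured in a few lines of code. The following is a minimal Python sketch, not the production system; the function name and inputs (the model's perceived probability, the bookmaker's decimal odds and the historic average FPpp) are my own illustrative choices:

```python
def stake_amount(round_no, model_prob, bookmaker_odds, avg_fppp, bank,
                 stake_fraction=0.05, min_round=5):
    """Return the stake for a single head to head bet, or 0.0 if the
    strategy rules say not to bet. Inputs are illustrative: model_prob is
    the model's perceived probability of a win, bookmaker_odds are decimal
    odds, avg_fppp is the average false positive perceived probability."""
    implied_prob = 1.0 / bookmaker_odds       # bookmaker's implied probability
    if round_no < min_round:                  # rule: only stake from round 5 onwards
        return 0.0
    if model_prob <= implied_prob:            # rule: need an edge over the bookmaker
        return 0.0
    if model_prob <= avg_fppp:                # rule: must clear the average FPpp
        return 0.0
    return stake_fraction * bank              # rule: fixed 5% of the current bank

# Round 7, model says 0.68, odds of 1.80 (implied 0.556), historic avg FPpp 0.60
print(stake_amount(7, 0.68, 1.80, 0.60, bank=1000.0))  # 50.0
```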

The odds are NOT in your favour

Introduction

The following outlines the staking strategy I will be using for the 2016 season. The strategy is specifically tuned to the performance of the chosen predictive model, and the detail and reasoning behind it are presented below.

This is the sad reality:

You have come up with a model (mathematical, clairvoyant or other) which can predict the outcome of matches over a season with >50% accuracy. Great! However, this does NOT mean you will be able to make a profit. In fact, it is quite probable that you will lose money even if your predictive power is >50%. Although the following might come off as a little heavy on the technical speak, all I am trying to say is that in order to make a profit, you need a model which can predict winners with greater precision than the bookmaker's implied odds, or a model which lets you optimise staking on true positive results and mitigate staking on false positive results. This is much more difficult than simply predicting who will win a match. It also identifies one of the issues with perceived probabilities generated by machine learning algorithms: they generally don't reflect the true [*] probability of an event occurring (which is BAD if we are using these probabilities for staking).

[*] Of course, no one knows the true probability of an event occurring; however, some probabilities more closely reflect 'reality' than others.

As discussed earlier, the evaluation/performance measures of a two class classification model are calculated or derived from the model's true positive/negative and false positive/negative results. While all of these measures are important in determining the predictive power of a model, some are of greater importance when evaluating whether a model can generate long term profit. Specifically, when we back a team to win a head to head match and it loses, it hurts! (our dignity and our wallet). In predictive model evaluation these instances are called 'false positives' (basically, the model tells us to back a team but the actual result is a loss). So it's really important for our model to have as few false positives as possible (and, of course, as many true positives as possible).

Without getting too technical, recall that the evaluation measure precision (sometimes called positive predictive value) is derived by dividing the true positives by the sum of true positives and false positives (TP/(TP+FP)). Because higher precision means fewer false positives, this measure is quite predictive of how the model will perform in making a profit. In addition, because we are dealing with head to head staking (where we can only back a team to win), we are exclusively dealing with positive predictions (positive class = win). This means that model and staking evaluation is done primarily using true positives and false positives, because there will be no true negative (and hence no false negative) results. The following outlines some simple measures one can use to evaluate whether a predictive model has the potential to make a profit.
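
For the record, precision is trivial to compute once the true positive and false positive counts are known. A one-function Python sketch, with counts made up for illustration:

```python
def precision(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)

# Example: 60 winning picks (TP) and 40 losing picks (FP) over a season
print(precision(60, 40))  # 0.6
```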

Evaluating the model's potential for profit

These are the facts (keeping it very simple):

  • For a head to head match where there are no lay bets (i.e. you can only back the winner):
  • If a bookmaker offered evens on every match (backing a winner at odds = 2.0, implied probability = 0.5) and your model had a precision of 50% across all stakes made, you would break even (profit/loss = $0)
  • You can only make a long term profit using a fixed staking strategy if your model's precision is greater than the average implied probability offered by the bookmaker over the matches on which you have staked (i.e. where the model has predicted the positive class and you have staked on its recommendation). This measure can be termed the model's potential for profitability (MPP): the further the model's precision is above the implied probability, the greater the MPP.
  • If precision is less than the implied probability offered by the bookmaker, it may still be possible to make a profit using a proportional staking strategy, but only if the model's conditional potential for profitability (MCPP) is positive. If the MCPP is positive, the model's perceived probabilities are said to be well calibrated. If a model's MCPP is negative, it may be possible to (re)calibrate the perceived probabilities using regression or classification techniques.
  • If both MPP and MCPP are negative and the model's perceived probabilities cannot be calibrated so that MCPP becomes positive, then the model cannot make a profit
  • Therefore, in addition to having a model with high precision, you must calculate and understand the MPP and MCPP of the predictive model using historic odds information. A model with high precision but negative MCPP will lose money using a proportional staking strategy.

So if MPP and MCPP are so important, what the heck are they? Well, they are simple measures for quickly evaluating your model to determine whether you can actually make money or whether you need to go back to the drawing board.

The measure of the model's potential for profitability (MPP) can be calculated as:

Predictive model's average precision – average implied probability

Expressed as:

MPP = mp – P(A)

where

mp = the average precision of the predictive model

P(A) = the average implied probability

If this measure is positive then you have the potential to make a profit from fixed staking, and the higher it is, the greater the potential for profit (using any staking strategy). If this measure is negative then you cannot use a fixed staking strategy and expect to make a long term profit.

If this measure is negative, it may still be possible to make a profit IF the model's conditional potential for profitability (MCPP) is positive.

The conditional potential for profitability (MCPP) can be calculated as:

(Perceived probability of true positives – implied probability of true positives) + (implied probability of false positives – perceived probability of false positives)

Expressed as:

MCPP = (P(ATP) – P(BTP)) + (P(BFP) – P(AFP))

Where:

P(ATP) = the average perceived probability of true positive events

P(AFP) = the average perceived probability of false positive events

P(BTP) = the average implied probability of true positive events

P(BFP) = the average implied probability of false positive events

If MCPP is positive, we can say that the perceived probabilities are well calibrated against the implied probabilities and there is potential to make a profit using a proportional staking strategy even if MPP is negative. If both MPP and MCPP are negative, then the predictive model cannot be profitable. When evaluating MPP and MCPP results we can infer the following (a code sketch of both measures appears after the list):

MPP positive -> potential for profit using fixed or proportional staking

MPP negative & MCPP positive -> potential for profit using proportional staking

MPP positive & MCPP negative -> potential for profit using fixed staking (the model's precision exceeds the average implied probability, but a proportional strategy driven by the poorly calibrated perceived probabilities is likely to lose money)

MPP & MCPP negative -> no potential for profit
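
Here is a minimal Python sketch of both measures, following the definitions above; all input numbers are invented purely for illustration:

```python
from statistics import mean

def mpp(model_precision, implied_probs):
    """MPP = mp - P(A): model precision minus the average implied probability."""
    return model_precision - mean(implied_probs)

def mcpp(perceived_tp, implied_tp, perceived_fp, implied_fp):
    """MCPP = (P(ATP) - P(BTP)) + (P(BFP) - P(AFP)), where each term is an
    average over the true positive / false positive staked games."""
    return ((mean(perceived_tp) - mean(implied_tp))
            + (mean(implied_fp) - mean(perceived_fp)))

# A model that is more confident than the market on its winners and less
# confident than the market on its losers: negative MPP but positive MCPP
print(mpp(0.58, [0.60, 0.62, 0.58]))   # -0.02 -> fixed staking will lose
print(mcpp(perceived_tp=[0.70, 0.75], implied_tp=[0.60, 0.65],
           perceived_fp=[0.52, 0.55], implied_fp=[0.62, 0.60]))  # 0.175 -> well calibrated
```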

Let's have a look at how this works in practice.

MPP example

If the average of the bookmaker odds over the term of your investment is 1.667 (i.e. an implied probability of 60%), then you need a model precision of 60% to break even. That is, if you have an MPP of 0 you will break even.

Proof:

Given:

  • 10 matches
  • fixed stake (fs) of $10 on each match
  • odds of 1/0.6 ≈ 1.667 (equivalent to an implied probability of 60%)
  • 60% precision (6 correct picks (true positives, TP) and 4 incorrect picks (false positives, FP))

MPP = 0.6 – 0.6 = 0. We should break even using a fixed stake. Let's see:

Profit under the given scenario

= profit gained from correct picks (true positives) – losses from incorrect picks (false positives)

= ((odds * fs * TP) – (fs * TP)) – (fs * FP)

= ((1.667 * 10 * 6) – (10 * 6)) – (10 * 4)

= (100 – 60) – 40

= 40 – 40

= 0

Yup, looks like it holds true.

If under the same conditions your model's precision was 70%, MPP would be 0.7 – 0.6 = 0.1. Since this is positive we expect to make a profit. Let's check:

profit = profit gained from correct picks (true positives) – losses from incorrect picks (false positives)

= ((odds * fs * TP) – (fs * TP)) – (fs * FP)

= ((1.667 * 10 * 7) – (10 * 7)) – (10 * 3)

= (116.7 – 70) – 30 ≈ 16.7
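
Both scenarios are easy to verify in code; here is a small sketch of the fixed staking calculation used above:

```python
def fixed_stake_profit(odds, stake, tp, fp):
    """Fixed staking profit: winnings on true positives minus stakes
    lost on false positives."""
    return (odds * stake * tp - stake * tp) - stake * fp

odds = 1 / 0.6  # decimal odds for an implied probability of 60%
print(round(fixed_stake_profit(odds, 10, 6, 4), 2))  # 0.0   -> break even at 60% precision
print(round(fixed_stake_profit(odds, 10, 7, 3), 2))  # 16.67 -> profit at 70% precision
```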

Great, so all we need to do to quickly check whether our model has the potential to make a profit is to calculate MPP. But what if MPP is negative?

Well, that's when we need to look at MCPP. This measure looks at the difference between our true positive and false positive perceived probabilities with reference to the implied probabilities. The idea is to determine whether we can overcome lower model precision and still make a profit by optimising staking on true positives and mitigating staking on false positives.

Stay tuned for more detail on the staking strategy.

 

Deep Dive – Part 01 – Introduction, summary and feature list

Introduction

The purpose of this post is to provide details on the methodology behind the development of the predictive model (PM) which will be used to predict the outcomes of matches in the 2016 NRL season. Additionally, it provides details on the theoretical performance of the machine learning (ML) algorithms on the 2009-2015 season data. I also outline some of the software I use to assist in creating the predictive model; however, I must say that I have no affiliation whatsoever with any software vendor – I just used the software I deemed fit for purpose based on cost and/or familiarity.

Some of the terms and concepts may be unfamiliar to the reader and, where these are not listed in the glossary, I encourage you to do your own research if you are interested. In most cases I will not be delving into great detail on the mathematics of the models, but I will provide the evaluation measure results for each of the algorithms. For clarification, I use the terms predictive model and machine learning algorithm somewhat interchangeably; the difference is that a PM is the entire prediction system, whereas an ML algorithm is simply an algorithm which is initialised on a data set inside that system. For example, the PM might comprise a system which extracts data from a database, transforms and manipulates the data into features, applies an ML algorithm to the data to predict outcomes and finally returns the results to a spreadsheet.

Summary of Method

To develop the PM I had to do a lot of groundwork, and this is not unlike any other traditional predictive modelling project. Specifically, when starting from base principles I break my predictive modelling projects down into steps including:

  • Data mining/compilation/extraction
  • Data validation/standardisation/transformation
  • Analysis and (statistical) inspection of data
  • Feature generation
  • ML Algorithm generation/initialisation
  • ML testing and evaluation
  • Generation of predictive model incorporating best ML algorithm

For the NRL experiment I started from ground zero, so I had to follow these steps fairly closely. Specifically, I had to:

  • extract disparate NRL data (such as historic match results/statistics and historic bookmaker odds for all games) from a range of sources (websites, books etc.)
    • I used a semi-automated (programmatic) and in some cases manual process to do this
  • transform and standardise the data for input into a database
    • I initially compiled spreadsheets of data in Microsoft Excel and subsequently loaded them into a holding database (SQL Server 2014) for further manipulation
  • create a database and import the data
    • I used SQL Server 2014 and Azure DB
  • validate and inspect the data
    • I used a range of software tools to assist in data inspection, including R, Python, SQL and Excel, to plot information, produce histograms, inspect statistics of features etc.
  • generate features for input into the ML algorithms
    • I used a range of software tools to generate features, including R, Python and SQL
  • train a range of ML algorithms using the generated features as inputs
    • I used Azure ML in conjunction with R and Python to develop the ML algorithms
  • evaluate and compare the performance of the machine learning algorithms (a minimal illustration follows this list)
    • I used Microsoft Azure ML in conjunction with R and Python to evaluate the ML algorithms
  • select the best performing algorithm and incorporate it into the PM system
    • I used Microsoft SQL Server 2014 in conjunction with Microsoft Azure to manage the execution of the predictive model
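
As a minimal illustration of the train/evaluate steps, here is a stand-in sketch in Python using scikit-learn rather than Azure ML (which is what I actually used); the feature matrix and outcomes below are randomly generated placeholders, not NRL data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1608, 10))                        # placeholder feature matrix
y = (X[:, 0] + rng.normal(size=1608) > 0).astype(int)  # placeholder outcome (1 = win)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Precision on held-out games is the headline measure for staking purposes
print(precision_score(y_test, clf.predict(X_test)))

# predict_proba yields the perceived probabilities fed into the MPP/MCPP checks
perceived = clf.predict_proba(X_test)[:, 1]
```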

In addition to this (and this is something I have not traditionally had to do), I had to evaluate the performance of the ML algorithms with respect to return on investment (ROI). That is, I had to ensure I had the best performing algorithm which could also be optimised for appropriate staking on games.

In the following section/s I won't delve into the details of the data extraction/standardisation/validation process, or the database build process (although these are extremely important foundation steps); however, I will say:

I could not find a single (free) source of information which had all of the data I wanted, and in some cases I could not automate the data gathering process. This made it an extremely time consuming exercise, and there are ultimately some features which I did not compile that may have been useful for training the ML algorithms. Of course, this means the current machine learning algorithms have the potential to get even better as time progresses and I slowly acquire more data.

Additionally, if you plan to undertake a similar experiment, I would highly recommend that you get familiar with and utilise a relational database management system (RDBMS) such as SQL Server, Oracle Database, PostgreSQL or MySQL to store your data. While it is possible to store all your information in spreadsheets, there are numerous long term advantages to using an RDBMS.

Data inspection and feature generation

I won't bore you with the details behind the statistical inspection of each variable in the raw data set, but I will give you the key things you need to know. Firstly, we are analysing the compiled data from seasons 2009-2015. The number of total matches in this period is 1,608. In each season there are 201 matches played, comprising 192 regular season games and nine finals games (four in the first finals week, two semi finals, two preliminary finals and one grand final). The number of matches per round type has remained consistent across 2009-2015 and the same number of matches per round is scheduled for 2016. Since the inception of the golden point rule there have been only 8 matches (0.5% of the representative data set) which ended in a draw. I determined through later predictive modelling that the chance of a draw was small enough to be probabilistically insignificant, and therefore chose to use binary (two class) classification algorithms for predicting the outcome of matches. The advantage of this is that binary classification algorithms tend to be much simpler than multiclass algorithms, which has some computational benefit. The disadvantage is that the algorithms will never classify a result as a draw. It should be noted, though, that the majority of bookmakers only offer head to head odds (and no odds for a draw) on NRL games, so the consensus seems to be that draws are rare enough to ignore.

Also, because I only store the raw match statistics in the database (such as score, venue, match date etc.), it is necessary to create summary variables for use as input features to the ML algorithms. These summary features are what ultimately 'train' the algorithm to learn which features are important in predicting the outcome. The list of features I generated as inputs is presented below. Note that there are some features I would like to use as inputs which I just haven't had time to compile: things like cumulative penalties to date, tackles for and against, player team shifts, number of injuries leading into a match, whether a 'star' player is playing etc. The thing is, though, if you include these features you need them for every single game in your data set, which means I would have to have a much larger database of information (if anyone is generous enough to donate info, let me know!!).

In reality the list of features I have is fairly limited; however, as I show later, it is still good enough for >0.5 prediction accuracy. The advantage of having a relatively small set of features is that they are easy to capture on an ongoing basis with limited time/resources, so there is some benefit in the cost/time analysis (how difficult the information is to maintain and capture versus what it ultimately gives you). Additionally, as I said before, assuming some of the missing variables actually do have a positive impact on the predictive power of the algorithm, the current models can only get better (this is what machine learning is all about).

NB:

It is very important, if you choose to create summary features like I have, that you don't summarise information which is inclusive of the current row (the current match). It seems obvious, but when you run a cumulative total in most software programs (even in Excel) the cumulative total includes the current row (go check for yourself). As an example, if you are creating a cumulative total of the number of matches the team has won in the last five rounds, make sure you only run the cumulative total on matches up to and excluding the current game.
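
To make the point concrete, here is a minimal pandas sketch of the wrong and right way to build a 'wins in the last five matches' feature (the column names are illustrative, not my actual schema):

```python
import pandas as pd

# Toy match history for one team, in date order: 1 = win, 0 = loss
df = pd.DataFrame({"won": [1, 0, 1, 1, 0, 1, 1, 1]})

# WRONG: the rolling window includes the current match's own result
df["last_5_wins_leaky"] = df["won"].rolling(5, min_periods=1).sum()

# RIGHT: shift one row first so only matches BEFORE the current game are summed
df["sum_last_5_wins"] = df["won"].shift(1).rolling(5, min_periods=1).sum().fillna(0)

print(df)
```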

And without further ado, I present the feature list:

NRL Predictive Modelling feature list

Feature – Description

season – NRL season (year)
round_no – season round number
game_no – game number in the round
round_type_code – type of round (Regular, Preliminary Final, Semi Final, Grand Final)
dpt_month – month number in the year the match is played
dpt_week – week number in the year the match is played
dpt_day – day number in the week the match is played
dpt_doy – day number in the year the match is played
dpt_hour – hour of the day the match is played
dpt_minute – minute of the hour the match is played
days_from_last_game – number of days since the team last played
team_descr – team description/code
team_against_descr – description/code of the opposing team
home_away – whether the team is playing at home or away
venue_desc – description of the venue the match is held at
venue_loc – location of the venue the match is held at
sum_last_5_wins – number of games the team has won in the last five matches (irrespective of season)
sum_last_8_wins – number of games the team has won in the last eight matches (irrespective of season)
sum_season_ladder_points – number of points the team has on the NRL ladder
sum_season_win_margin – the team's cumulative total of winning margins for the season
sum_season_lose_margin – the team's cumulative total of losing margins for the season
season_points_for – the team's cumulative points scored to date
season_points_against – the team's cumulative points conceded to date
total_season_wins – the team's cumulative win total for the season
total_season_losses – the team's cumulative loss total for the season
sum_season_last_5_wins – number of games the team has won in the last five matches (within the season)
sum_season_last_8_wins – number of games the team has won in the last eight matches (within the season)

Stay tuned for Part 02 – Generation and evaluation of ML algorithms, where the actual generation and performance of the machine learning algorithms will be presented!!