Deep Dive – Part 01 – Introduction, summary and feature list

Introduction

The purpose of this post is to provide details on the methodology behind the development of the predictive model (PM) which will be used for predicting the outcome of the matches of the 2016 NRL season. Additionally, it provides details on the theoretical performance of the machine learning (ML) algorithms on the 2009-2015 season data. I also outline some of the software I used to assist in creating the predictive model; however, I must say that I have no affiliation whatsoever with any software vendor – I just used the software I deemed fit for purpose based on cost and/or familiarity.

Some of the terms and concepts may be unfamiliar to the reader, and where these are not listed in the glossary, I encourage you to do your own research into the terms if you are interested. In most cases I will not be delving into great detail behind the mathematics of the models, but will be providing details on the evaluation measure results for each of the algorithms. For clarification, I use the terms predictive model and machine learning algorithm somewhat interchangeably; the difference is that a PM is the entire prediction system, whereas an ML algorithm is simply an algorithm which is initialised on a data set inside that system. For example, the PM might comprise a system which extracts data from a database, transforms and manipulates the data into features, subsequently uses an ML algorithm on the data to predict outcomes and finally returns the results to a spreadsheet.
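
To make that distinction a little more concrete, here is a minimal sketch of what such a system might look like. This is purely illustrative and not the system described in this series: the database, table, column and file names are all hypothetical, and SQLite plus scikit-learn stand in for the SQL Server/Azure ML stack discussed later.

```python
# Minimal sketch of a predictive model (PM) as a whole system; the ML
# algorithm (here a logistic regression) is just one step inside it.
# All table, column and file names below are hypothetical.
import sqlite3
import pandas as pd
from sklearn.linear_model import LogisticRegression

def run_predictive_model(db_path: str, out_path: str) -> None:
    # 1. Extract raw match data from a database
    with sqlite3.connect(db_path) as conn:
        matches = pd.read_sql("SELECT * FROM match_results", conn)

    # 2. Transform/manipulate the data into features (assumed numeric here)
    feature_cols = ["sum_last_5_wins", "sum_season_ladder_points",
                    "days_from_last_game"]
    X, y = matches[feature_cols], matches["team_won"]

    # 3. The ML algorithm itself: fit and predict win probabilities
    model = LogisticRegression(max_iter=1000).fit(X, y)
    matches["predicted_win_prob"] = model.predict_proba(X)[:, 1]

    # 4. Return the results to a spreadsheet
    matches.to_excel(out_path, index=False)

run_predictive_model("nrl.db", "predictions.xlsx")
```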

Summary of Method

To develop the PM, I had to do a lot of ground work, and this is not unlike any other traditional predictive modelling project. Specifically, when starting from base principles I break my predictive modelling projects down into various steps including:

  • Data mining/compilation/extraction
  • Data validation/standardisation/transformation
  • Analysis and (statistical) inspection of data
  • Feature generation
  • ML Algorithm generation/initialisation
  • ML testing and evaluation
  • Generation of predictive model incorporating best ML algorithm

For the NRL experiment I started from ground zero, so I had to follow these steps fairly closely. Specifically, I had to:

  • extract disparate NRL data (such as historic match results/statistics, historic bookmaker odds for all games) from a range of sources (websites, books etc.)
    • I used a semi-automated (programmatic) and in some cases manual process to do this
  • transform this data and standardise the data for input into a database
  • I initially compiled spreadsheets of data in Microsoft Excel and subsequently loaded them into a holding database (SQL Server 2014) for further manipulation (see the sketch after this list)
  • create a database and import data into the database
    • I used SQL Server 2014 and Azure DB
  • validate and inspect the data
    • I used a range of software tools to assist in data inspection including R, Python, SQL programming language and Excel to plot information, produce histograms, inspect statistics of features etc.
  • generate features for input into ML algorithms
    • I used a range of software tools to generate features, including R, Python and the SQL programming language
  • train and program a range of ML algorithms using generated features as inputs
    • I used Azure ML in conjunction with R and Python to develop ML algorithms
  • evaluate and compare the performance of machine learning algorithms
    • I used Microsoft Azure ML in conjunction with R and Python to evaluate the ML algorithms
  • select the best performing algorithm and incorporate into the PM system
    • I used Microsoft SQL Server 2014 in conjunction with Microsoft Azure to manage the execution of the predictive model
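
To make the transform/load and validate/inspect steps above a little more concrete, here is a minimal sketch of that kind of workflow. It is not the actual build (which used Excel, SQL Server 2014 and Azure DB); SQLite stands in as the holding database, and the file, table and column names are made up.

```python
# Sketch: load a compiled spreadsheet into a holding database, then read it
# back for a quick statistical inspection. File, table and column names are
# illustrative only.
import sqlite3
import pandas as pd

# 1. Read the hand-compiled spreadsheet of match data
matches = pd.read_excel("nrl_match_results_2009_2015.xlsx")

# 2. Push it into a holding database table
with sqlite3.connect("nrl_holding.db") as conn:
    matches.to_sql("raw_match_results", conn, if_exists="replace", index=False)

    # 3. Pull it back out and inspect it
    raw = pd.read_sql("SELECT * FROM raw_match_results", conn)

print(raw.describe())            # summary statistics for each numeric column
raw["points_for"].hist(bins=20)  # e.g. a histogram of points scored per match
```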

In addition to this (and this is something I have not traditionally had to do) I had to evaluate the performance of the ML algorithms with respect to return on investment (ROI). That is, I had to ensure that I had the best performing algorithm, and one which could also be optimised for appropriate staking on games.
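
As a rough illustration of what an ROI-style evaluation involves (flat staking only, with made-up predictions and decimal odds; the actual evaluation and staking optimisation are covered in later parts):

```python
# Sketch: evaluate a set of predictions by return on investment rather than
# accuracy alone. Flat $1 stakes, decimal (head-to-head) bookmaker odds.
# All numbers below are illustrative only.
predictions = [
    # (predicted_home_win, home_actually_won, decimal_odds_taken)
    (True,  True,  1.80),
    (True,  False, 1.65),
    (False, True,  2.10),   # model backed the other side, so no stake here
    (True,  True,  1.95),
]

stake = 1.0
outlay = 0.0
returned = 0.0

for predicted_win, actual_win, odds in predictions:
    if not predicted_win:
        continue                  # only stake when the model backs the team
    outlay += stake
    if actual_win:
        returned += stake * odds  # decimal odds include the stake

roi = (returned - outlay) / outlay  # here 0.25, i.e. a 25% return on turnover
print(f"Staked: {outlay:.2f}, Returned: {returned:.2f}, ROI: {roi:.2%}")
```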

In the following section/s I won’t delve into the details of the data extraction/standardisation/validation process, or the database build process (although these are extremely important foundation steps); however, I will say this:

I could not find a single (free) source of information which had all of the data I wanted, and in some cases I could not automate the data gathering process. Therefore, this was an extremely time-consuming exercise, and there are ultimately some features which I did not compile that may have been useful for training the ML algorithms. Of course, this means that the current machine learning algorithms have the potential to get even better as time progresses and I slowly acquire more data.

Additionally, if you plan to undertake a similar experiment to mine, I would highly recommend that you get familiar with and utilise a relational database management system (RDBMS) such as SQL Server, Oracle Database, PostgreSQL or MySQL to store your data. Whilst it is possible to store all your information in spreadsheets, there are numerous long-term advantages to using an RDBMS.

Data inspection and feature generation

I won’t bore you with the details behind the statistical inspection of each variable in the raw data set, although I will give you the key things you need to know. Firstly, we are analysing the compiled data from seasons 2009-2015. The number of total matches in this period is 1,608. In each season there are 201 matches played, comprising 192 regular season games, four quarter-final games, two semi-final games, two preliminary final games and one grand final. The number of matches per round type has remained consistent across years 2009-2015, and the same number of matches per round are scheduled for 2016.

Since the inception of the golden point rule there have only been 8 matches (0.5% of the representative data set) which ended in a draw. I determined through later predictive modelling that the chance of a draw was small enough to be probabilistically insignificant, and therefore chose to utilise binary (two-class) classification algorithms for predicting the outcome of matches. The advantage of this is that binary classification algorithms tend to be much simpler than multiclass algorithms, which has some computational advantage. The disadvantage is that the algorithm/s will never classify a result as a draw. It should be noted, though, that the majority of bookmakers only offer head-to-head odds (and not odds for a draw) on NRL games, so there is a seeming consensus that draws are rare enough to disregard.
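
As a hedged illustration of that two-class setup (scikit-learn is used here as a stand-in for the Azure ML/R/Python pipeline, and the file and column names are hypothetical): drop the rare draws, define the target simply as “did the team win”, and compare algorithms on held-out matches.

```python
# Sketch: treat match outcome as a two-class problem (win / not win),
# dropping the handful of draws, and evaluate on a held-out split.
# File and column names are illustrative, not the actual database schema.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

data = pd.read_csv("nrl_features_2009_2015.csv")

# Remove the rare drawn matches so the target is strictly binary
data = data[data["result"] != "draw"]
data["team_won"] = (data["result"] == "win").astype(int)

feature_cols = ["sum_last_5_wins", "sum_season_ladder_points",
                "season_points_for", "season_points_against",
                "days_from_last_game"]
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_cols], data["team_won"], test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("AUC:     ", roc_auc_score(y_test, probs))
```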

Also, because I only store the raw match statistics in the database (such as score, venue, match date etc.) it is necessary to create summary variables for use as input features into the ML algorithm/s. These summary features are what ultimately ‘train’ the algorithm to ‘learn’ which features are important in predicting the outcome. The list of features I generated as inputs into the algorithm is presented below. Note that there are some features which I would like to use as inputs which I just haven’t had time to compile, namely things like cumulative penalties to date, tackles for and against, player team shifts, number of injuries leading into a match, whether a ‘star’ player is playing, and so on. The thing is, though: if you include these features, you need them for every single game in your data set, which means I would need a much larger database of information (if anyone is generous enough to donate info let me know!!).

In reality the list of features I have is fairly limited; however, as I show later, it is still good enough for >0.5 prediction accuracy. The advantage of having a relatively small set of features is that they are easy to capture on an ongoing basis with limited time/resources, so they fare well in a cost/time analysis (how difficult the information is to maintain and capture versus what it ultimately gives you). Additionally, as I said before, assuming some of the missing variables actually do have a positive impact on the predictive power of the algorithm, the current models can only get better (this is what machine learning is all about).

NB:

It is very important that if you choose to create summary features like I have, you don’t summarise information which is inclusive of the current row (the current match). It seems obvious, but when you run a cumulative total in most software programs (even in Excel) the cumulative total includes the current row (go check for yourself). As an example, if you are creating a cumulative total of the number of matches the team has won in the last five rounds, make sure you only run the cumulative total on matches up to, but excluding, the current game.
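
Here is a minimal pandas sketch of that point (the column names are hypothetical): shifting each team’s win flag down one row before taking the rolling sum guarantees the current match never contributes to its own summary feature.

```python
# Sketch: build sum_last_5_wins per team WITHOUT including the current match.
# shift(1) pushes each result down one row, so the rolling window only ever
# sees games played before the row being summarised. Column names are made up.
import pandas as pd

matches = pd.DataFrame({
    "team":  ["Broncos"] * 6,
    "round": [1, 2, 3, 4, 5, 6],
    "won":   [1, 0, 1, 1, 0, 1],
})

matches["sum_last_5_wins"] = (
    matches.groupby("team")["won"]
           .transform(lambda s: s.shift(1).rolling(5, min_periods=1).sum())
)

print(matches)
# The round 6 row shows sum_last_5_wins = 3.0: three wins across rounds 1-5,
# with the current (round 6) result excluded.
```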

And without further ado, I present the feature list:

NRL Predictive Modelling feature list

  • season – NRL season (year)
  • round_no – season round number
  • game_no – game number in the round
  • round_type_code – type of round (Regular, Preliminary Final, Semi Final, Grand Final)
  • dpt_month – month number in the year the match is played
  • dpt_week – week number in the year the match is played
  • dpt_day – day number in the week the match is played
  • dpt_doy – day number in the year the match is played
  • dpt_hour – hour in the day the match is played
  • dpt_minute – minute in the hour the match is played
  • days_from_last_game – number of days since the team last played
  • team_descr – team description/code
  • team_against_descr – description/code of the opposing team
  • home_away – whether the team is playing at home or away
  • venue_desc – description of the venue the match is held at
  • venue_loc – location of the venue the match is held at
  • sum_last_5_wins – the number of games the team has won in its last five matches (irrespective of season)
  • sum_last_8_wins – the number of games the team has won in its last eight matches (irrespective of season)
  • sum_season_ladder_points – the number of points the team has on the NRL ladder
  • sum_season_win_margin – the team’s cumulative total of winning margins for the season
  • sum_season_lose_margin – the team’s cumulative total of losing margins for the season
  • season_points_for – the team’s cumulative points scored for the season to date
  • season_points_against – the team’s cumulative points conceded for the season to date
  • total_season_wins – the team’s cumulative win total for the season
  • total_season_losses – the team’s cumulative loss total for the season
  • sum_season_last_5_wins – the number of games the team has won in its last five matches (within the season)
  • sum_season_last_8_wins – the number of games the team has won in its last eight matches (within the season)

Stay tuned for Part 02 – Generation and evaluation of ML algorithms, where the actual generation and performance of the machine learning algorithms will be presented!!