Slide 1

Slide 1 text

Machine Learning Outside the Kaggle Lines
Craig Franklin

Slide 2

Slide 2 text

About me
- Developer at Seer Medical
- Recent convert to Aussie Rules Football fandom
- tipresias, cfranklin11
- craigfranklin.dev
- @englishcraig

Slide 3

Slide 3 text

- Competitions & learning
- Good for learning
- Level playing field
- Same rules, data, objective

Kaggle is fine

Slide 4

Slide 4 text

What is footy (tipping)?
Robert Merkel, Jacknstock at English Wikipedia / Public domain

Slide 5

Slide 5 text

- Australian rules football: contact sport, oblong ball
- Footy tipping: office betting pool
- Most correct picks wins
- Lots of novel challenges & mistakes, but learned from them

By Unknown author - Fairfax Photo Archives, Public Domain
By Flickerd - Own work, CC BY-SA 4.0

Slide 6

Slide 6 text

- Footy tipping = picking winners
- Got data, trained the model, built the app
- Unexpected input

Clearly define your problem
Photo by Patrick Tomasso on Unsplash

Slide 7

Slide 7 text

- Margin of victory for the first match is the tiebreaker
- Had a classifier, so manually entered margins
- Next season: changed to a regressor to predict margins
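A regressor that predicts margins covers both needs at once: the sign of the margin gives the tip, and its size gives the tiebreaker. The following is a minimal sketch of that idea, not the Tipresias implementation. The `Tip` class, the `tip_from_margin` function, the team names, and the rule for a predicted margin of exactly 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tip:
    winner: str
    margin: int


def tip_from_margin(home_team: str, away_team: str, predicted_home_margin: float) -> Tip:
    """Turn a regressor's predicted home-team margin into a tip."""
    # A positive margin means the home team is predicted to win.
    # Treating an exact 0 as a home win is an arbitrary choice for this sketch.
    winner = home_team if predicted_home_margin >= 0 else away_team
    # Round to whole points, with a floor of 1 so the tip never claims a draw.
    margin = max(1, round(abs(predicted_home_margin)))
    return Tip(winner=winner, margin=margin)


# Illustrative input: the model predicts the home team loses by 12.4 points.
tip = tip_from_margin("Richmond", "Carlton", predicted_home_margin=-12.4)
print(tip)
```

With this shape, the classifier's job (who wins) becomes a by-product of the regression output, so nothing has to be entered by hand.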

Slide 8

Slide 8 text

- Kaggle provides all the data for you
- Collecting data to solve your ML problem is hard
- Dynamic data sources are particularly difficult
- At the start of the first season, betting odds data were blank
- I panicked & scraped a betting site

Know the entire life stories of your data sources
Photo by CoWomen on Unsplash

Slide 9

Slide 9 text

- Rosters & betting odds change up until the start of each match
- Avoid predictions with blank or stale data
- Observe data sources as you would the data itself

Weekly Data Update Schedule

Day        Data Types                   Data Sources   Notes
Monday     Match results, player stats  afltables.com  Sometimes delayed till Tuesday or Wednesday
Tuesday    Betting odds                 footywire.com
Wednesday  Team rosters                 afl.com.au     For Thursday match only
Thursday   Team rosters                 afl.com.au     For all later matches
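One way to avoid predicting with blank or stale data is to refuse to run when a source has not been refreshed recently enough. This is a rough sketch under stated assumptions: the `MAX_AGE` limits, the data type keys, and the `check_freshness` helper are hypothetical and only loosely follow the weekly schedule.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical freshness limits per data type; tune these to the real schedule.
MAX_AGE = {
    "match_results": timedelta(days=7),
    "betting_odds": timedelta(days=2),
    "team_rosters": timedelta(days=1),
}


class StaleDataError(Exception):
    """Raised when a data source is missing or older than allowed."""


def check_freshness(data_type: str, last_updated: Optional[datetime], now: datetime) -> None:
    """Raise StaleDataError unless the data type was updated recently enough."""
    if last_updated is None:
        raise StaleDataError(f"{data_type} has never been fetched")

    age = now - last_updated
    if age > MAX_AGE[data_type]:
        raise StaleDataError(
            f"{data_type} is {age} old, but the limit is {MAX_AGE[data_type]}"
        )


# Rosters fetched three hours before prediction time pass the check silently.
prediction_time = datetime(2019, 3, 21, 12, 0)
check_freshness("team_rosters", datetime(2019, 3, 21, 9, 0), now=prediction_time)
```

Raising instead of logging a warning is a deliberate choice here: a loud failure before the match is cheaper than a confident prediction built on last week's odds.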

Slide 10

Slide 10 text

- Collecting and cleaning your own data is a lot of work
- Web scrapers & undocumented APIs are difficult to maintain
- I thought I was clever & original

Get by with a little help from your friends

Slide 11

Slide 11 text


Slide 12

Slide 12 text

- fitzRoy: R package for AFL data
- R packages can be a good source of data
- More effort upfront, but lower maintenance
- Example: AFL website changing to a JS-heavy UI

https://github.com/jimmyday12/fitzRoy

Slide 13

Slide 13 text

- What values can be missing? What values are unique?
- Index: team, season, round number

Make your assumptions explicit

Slide 14

Slide 14 text

- 2010: Replay the Grand Final

TFW you realise you gotta do it all again next week
Getty Images
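The 2010 replay is exactly the kind of case that breaks a "team, season, round number is unique" assumption. A small check like the one below would surface it as soon as the data arrive. This is a sketch only: the column names, the `assert_unique_index` helper, and the round number in the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical column names for the assumed unique index.
INDEX_COLUMNS = ["team", "season", "round_number"]


def assert_unique_index(data_frame: pd.DataFrame) -> None:
    """Fail loudly if any team/season/round combination appears more than once."""
    duplicated = data_frame[data_frame.duplicated(subset=INDEX_COLUMNS, keep=False)]
    assert duplicated.empty, (
        f"Expected one row per {INDEX_COLUMNS}, but found duplicates:\n{duplicated}"
    )


# The drawn 2010 Grand Final and its replay share a season and a round.
# The round number 26 is an illustrative placeholder.
matches = pd.DataFrame(
    {
        "team": ["Collingwood", "St Kilda", "Collingwood", "St Kilda"],
        "season": [2010, 2010, 2010, 2010],
        "round_number": [26, 26, 26, 26],
        "date": pd.to_datetime(["2010-09-25", "2010-09-25", "2010-10-02", "2010-10-02"]),
    }
)

try:
    assert_unique_index(matches)
except AssertionError as error:
    print(error)
```

Once the check fails, the fix is a modelling decision (for example, adding the date to the index), but at least it is a decision rather than a silent bug.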

Slide 15

Slide 15 text

- Data bugs are often silent
- Spot checks can catch them, but are time-consuming
- Raising errors codifies assumptions about valid data

Data bugs could be hiding anywhere
Photo by Katie Moum on Unsplash

Slide 16

Slide 16 text

- Time-series models (ARIMA, Elo) need data sorted by date/time
- Spent days debugging an Elo model with 50% accuracy

When doing datetime-sensitive calculations, make sure your rows are in the correct order

    # Adjacent string literals concatenate into a single error message
    assert data_frame["date"].is_monotonic_increasing, (
        "Data must be sorted by date to calculate cumulative values "
        "or make predictions with time-series models."
    )

    # Running total of wins for each team within each year
    data_frame.groupby(["team", "year"])["match_wins"].cumsum()
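To see why row order matters for a model like Elo, it helps to run the same results through a rating update in two different orders. The sketch below uses the standard Elo expected-score formula; the K-factor, starting rating, team labels, and match results are made up for illustration and are not the values used in the project.

```python
from typing import Dict, List, Tuple

K_FACTOR = 32  # assumed value; real models tune this
STARTING_RATING = 1500.0  # assumed starting rating for every team


def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo win probability for a team against an opponent."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def run_elo(matches: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Update ratings from (date, winner, loser) tuples, in list order."""
    ratings: Dict[str, float] = {}
    for _date, winner, loser in matches:
        winner_rating = ratings.get(winner, STARTING_RATING)
        loser_rating = ratings.get(loser, STARTING_RATING)
        change = K_FACTOR * (1 - expected_score(winner_rating, loser_rating))
        ratings[winner] = winner_rating + change
        ratings[loser] = loser_rating - change
    return ratings


# Made-up results: team A wins the first match, team B wins the rematch.
matches = [("2019-03-21", "A", "B"), ("2019-03-28", "B", "A")]

in_order = run_elo(sorted(matches))  # ISO date strings sort chronologically
out_of_order = run_elo(sorted(matches, reverse=True))

print(in_order)
print(out_of_order)
```

The two runs end with different ratings even though the results are identical, because each update depends on the ratings left by the previous one. Unsorted rows therefore produce plausible-looking but wrong ratings without raising any error.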

Slide 17

Slide 17 text

- Different stats start in different years
- Lots of blank values
- Imputing didn't make sense, so decided to fill with 0s
- Caused lots of bugs

Joining disparate data sets is risky

Data set                 First season  # blank seasons
Match results            1897          0
Player scoring stats     1897          0
Basic player stats*      1965          68
Advanced player stats**  1999          102
Betting odds             2010          113

*For basic in-game events like kicks, tackles, etc.
**For less-common in-game events or ones that require player location.

Slide 18

Slide 18 text

- When joining data sets & filling with zeros, you can't know which zeros are valid
- Some values should never be blank (team, season, round number)
- Assertions are important, because you can't write tests for these zeros, and they result in bad training/predictions

Make sure there are no dodgy zeros

    # Select only the columns that must never contain zeros
    data_to_check = data_frame[NEVER_ZERO_COLUMNS]
    # Keep the rows where any of those columns is 0
    zeros_data_frame = data_to_check[(data_to_check == 0).any(axis=1)]

    assert zeros_data_frame.empty, (
        "An invalid fillna produced index column values of 0:\n"
        f"{zeros_data_frame}"
    )
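A complementary approach is to stop the dodgy zeros from being created in the first place, by filling blanks only in the stat columns and checking the index columns before filling. This is a sketch under assumptions, not the project's code: the column names, the `fill_missing_stats` helper, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical column groups; only stat columns are allowed to be filled with 0.
INDEX_COLUMNS = ["team", "season", "round_number"]
STAT_COLUMNS = ["kicks", "tackles"]


def fill_missing_stats(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Fill blank stats with 0, but refuse to hide blank index values."""
    missing_index = data_frame[data_frame[INDEX_COLUMNS].isna().any(axis=1)]
    assert missing_index.empty, (
        f"Index columns {INDEX_COLUMNS} must never be blank:\n{missing_index}"
    )
    # Passing a dict limits fillna to the named columns.
    return data_frame.fillna({column: 0 for column in STAT_COLUMNS})


# Made-up rows: the earlier season has no player stats recorded.
joined = pd.DataFrame(
    {
        "team": ["Carlton", "Carlton"],
        "season": [1964, 1965],
        "round_number": [1, 1],
        "kicks": [float("nan"), 200.0],
        "tackles": [float("nan"), float("nan")],
    }
)

filled = fill_missing_stats(joined)
print(filled)
```

Because the index columns are never touched by the fill, a blank team or round still shows up as a blank and trips the assertion instead of turning into a zero that looks valid.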

Slide 19

Slide 19 text

- Second season: regressor instead of classifier
- Kaggle is temporary: no tech debt
- Project has to be maintained, so make it pleasant to work with

Optimise for maintainability first, accuracy second
Photo by Cesar Carlevarino Aragon on Unsplash

Slide 20

Slide 20 text

- Kaggle is all about performance
- Project models can be complicated, just weigh the cost/benefit
- Diminishing returns with increased complexity
- Abandoned first model/app because it was too messy

Calculate the cost/benefit of complexity
Photo by StellrWeb on Unsplash

Slide 21

Slide 21 text

The Joy of Production

Slide 22

Slide 22 text

- Deployed to Heroku
- Crashed because it didn't have the C++ library Boost.Python
- Re-architected for Docker

Know your system-level dependencies (or control them)
• Do you need any of the Boost C++ libraries, gcc, or g++?
• Do you need to control your environment with Docker?

Slide 23

Slide 23 text

- Second season: deployed to Heroku, crashed
- Player data used too much memory
- Re-architected for DigitalOcean
- Data pipelines & ML models are memory hungry
- Know your data usage & how much memory your server has

Know your server's specs
Photo by Jordan Rowland on Unsplash
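A rough way to know your data usage before deploying is to measure how much memory a data frame takes and compare it with what the server offers. The sketch below is only an estimate under assumptions: the 512 MB budget, the headroom multiplier, and both helper functions are hypothetical, and real pipelines can peak well above the size of the data at rest.

```python
import pandas as pd

MEMORY_BUDGET_MB = 512  # hypothetical server limit; check your own plan
HEADROOM = 3.0  # assumed multiplier, since joins and copies need extra memory


def memory_usage_mb(data_frame: pd.DataFrame) -> float:
    """Memory used by a data frame in MB, including string contents."""
    return data_frame.memory_usage(deep=True).sum() / 1024**2


def assert_fits_in_memory(
    data_frame: pd.DataFrame,
    budget_mb: float = MEMORY_BUDGET_MB,
    headroom: float = HEADROOM,
) -> None:
    """Fail early if the data, with headroom, would exceed the budget."""
    estimated_mb = memory_usage_mb(data_frame) * headroom
    assert estimated_mb <= budget_mb, (
        f"Estimated peak of {estimated_mb:.1f} MB exceeds budget of {budget_mb} MB"
    )


# Made-up player data, just to exercise the check.
player_data = pd.DataFrame({"player": ["A", "B", "C"], "kicks": [10, 12, 8]})
assert_fits_in_memory(player_data)
print(f"{memory_usage_mb(player_data):.4f} MB")
```

Running a check like this locally, against full-size data, is cheaper than finding the limit through a crash in production.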

Slide 24

Slide 24 text

Predicting the bounce of an oblong ball
Flickerd / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)

Slide 25

Slide 25 text

- Rough start, but came back to win in final match

2018 Season Results

Tipper          Correct Tips*
Tipresias (me)  140
Top Coworker    139
Oddsmakers      140

* Regular season only

Slide 26

Slide 26 text

- Added data, improved model
- Rough start, rough middle, rough finish
- Success isn't guaranteed
- Kaggle has a static test set, sport is chaotic and each season is unique
- Random tipper's instinct can beat the odds as well as the machines

2019 Season Results

Tipper          Correct Tips*
Tipresias (me)  133
Top Coworker    138
Oddsmakers      135

* Regular season only

Slide 27

Slide 27 text

Thank you
All the tips: tipresias.net
All the code: tipresias
All the slides: craigfranklin.dev
All the complaints: @englishcraig