Craig
August 26, 2020

Machine Learning Outside the Kaggle Lines: PyConJP 2020

Kaggle competitions are great, but what do you do when you have a cool idea for your own machine-learning project? Learn about dealing with dirty data, working around other people's bugs, and keeping it all running when building from zero to production. Hear about the mistakes that I've made so you can avoid them yourself.

Transcript

  1. About me
     - Backend developer at Vinomofo
     - Recent convert to Aussie Rules Football fandom
     - tipresias, cfranklin11
     - craigfranklin.dev
     - @englishcraig
  2. Kaggle is fine
     - Competitions & learning
     - Good for learning
     - Level playing field
     - Same rules, data, objective
  3. - Australian-rules football: contact sport, oblong ball
     - Footy tipping: office betting pool
     - Most correct picks wins
     - Lots of novel challenges & mistakes, but learned from them
     (Photos: By Unknown author - Fairfax Photo Archives, Public Domain; By Flickerd - Own work, CC BY-SA 4.0)
  4. Clearly define your problem
     - Footy tipping = picking winners
     - Got data, trained the model, built the app
     - Unexpected input
     (Photo by Patrick Tomasso on Unsplash)
  5. - Margin of victory for first match is tie breaker
     - Had classifier, so manually entered margins
     - Next season: changed to regressor for margins
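A minimal sketch of why a margin regressor covers both needs: the sign of the predicted margin gives the tip, and its size gives the tie-breaker. The `tip_from_margin` helper and the team names are hypothetical and not taken from the project's code.

```python
from typing import Tuple


def tip_from_margin(
    home_team: str, away_team: str, predicted_home_margin: float
) -> Tuple[str, int]:
    """Turn a predicted home-team margin into a (winner, margin) tip."""
    # A non-negative margin means the home team is predicted to win
    winner = home_team if predicted_home_margin >= 0 else away_team
    # Tipping competitions ask for a whole-number margin
    return winner, round(abs(predicted_home_margin))


print(tip_from_margin("Richmond", "Carlton", 12.4))
print(tip_from_margin("Richmond", "Carlton", -7.6))
```

A classifier only produces the first element of that tuple, which is why the margins had to be entered by hand.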
  6. Know the entire life stories of your data sources
     - Kaggle provides all the data for you
     - Collecting data to solve your ML problem is hard
     - Dynamic data sources are particularly difficult
     - Start of first season, betting odds data were blank
     - I panicked & scraped a betting site
     (Photo by CoWomen on Unsplash)
  7. Weekly Data Update Schedule
     - Rosters & betting odds change up until the start of each match
     - Avoid predictions with blank or stale data
     - Observe data sources as you would the data itself

     Day        Data Types                   Data Sources   Notes
     Monday     Match results, Player stats  afltables.com  Sometimes delayed till Tuesday or Wednesday
     Tuesday    Betting odds                 footywire.com
     Wednesday  Team rosters                 afl.com.au     For Thursday match only
     Thursday   Team rosters                 afl.com.au     For all later matches
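One way to avoid predicting with blank or stale data is to check each source's last successful fetch before running the model. The sketch below is an assumption about how that could look: the `assert_fresh` helper, the source name, and the age limit are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def assert_fresh(
    source_name: str,
    fetched_at: Optional[datetime],
    now: datetime,
    max_age: timedelta,
) -> None:
    """Raise if a data source was never fetched or is older than max_age."""
    if fetched_at is None:
        raise ValueError(f"{source_name}: no data fetched yet")

    age = now - fetched_at
    if age > max_age:
        raise ValueError(f"{source_name}: data is {age} old, limit is {max_age}")


now = datetime(2020, 8, 26, 18, 0)
# Fetched two hours ago with a one-day limit, so this passes silently
assert_fresh("betting odds", datetime(2020, 8, 26, 16, 0), now, timedelta(days=1))
```

Raising instead of predicting makes a delayed update visible straight away, rather than showing up later as a strange tip.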
  8. Get by with a little help from your friends
     - Collecting and cleaning your own data is a lot of work
     - Web scrapers & using undocumented APIs are difficult to maintain
     - I thought I was clever & original
  9. - fitzRoy: R package for AFL data
     - R packages can be good source of data
     - More effort upfront, but lower maintenance
     - Example: AFL website changing to JS heavy UI
     https://github.com/jimmyday12/fitzRoy
  10. Make your assumptions explicit
      - What values can be missing? What values are unique?
      - Index: team, season, round number
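The questions about missing and unique values can be written down as assertions. This is a sketch only: the column names in `INDEX_COLUMNS` and the `assert_valid_index` helper are assumptions, not the project's actual schema.

```python
import pandas as pd

# Assumed column names for the (team, season, round number) index
INDEX_COLUMNS = ["team", "season", "round_number"]


def assert_valid_index(data_frame: pd.DataFrame) -> None:
    """Check that index columns are never blank and never repeated."""
    index_data = data_frame[INDEX_COLUMNS]

    missing = data_frame[index_data.isna().any(axis=1)]
    assert missing.empty, f"Index columns must never be blank:\n{missing}"

    # keep=False marks every copy of a repeated index, not just the later ones
    duplicates = data_frame[index_data.duplicated(keep=False)]
    assert duplicates.empty, f"Each index should appear only once:\n{duplicates}"


matches = pd.DataFrame(
    {
        "team": ["A", "B"],
        "season": [2010, 2010],
        "round_number": [1, 1],
        "score": [80, 75],
    }
)
assert_valid_index(matches)
```

If a team ever appears twice in the same season and round, the second assertion fails loudly instead of letting the duplicate slip into training data.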
  11. TFW you realise you gotta do it all again next week
      - 2010: Replay the Grand Final
      (Getty Images)
  12. Data bugs could be hiding anywhere
      - Data bugs are often silent
      - Spot checks can catch them, but are time consuming
      - Raising errors codifies assumptions about valid data
      (Photo by Katie Moum on Unsplash)
  13. When doing datetime-sensitive calculations, make sure your rows are in the correct order
      - Time-series models (ARIMA, Elo) need data sorted by date/time
      - Spent days debugging an Elo model with 50% accuracy

          # Fail fast if rows are out of chronological order
          assert data_frame["date"].is_monotonic_increasing, (
              "Data must be sorted by date to calculate cumulative values "
              "or make predictions with time-series models."
          )
          # Running total of wins per team per season
          data_frame.groupby(["team", "year"])["match_wins"].cumsum()
  14. Joining disparate data sets is risky
      - Different stats start in different years
      - Lots of blank values
      - Imputing didn't make sense, so decided to fill with 0s
      - Caused lots of bugs

      Data set                 First season  # blank seasons
      Match results            1897          0
      Player scoring stats     1897          0
      Basic player stats*      1965          68
      Advanced player stats**  1999          102
      Betting odds             2010          113

      *For basic in-game events like kicks, tackles, etc.
      **For less-common in-game events or ones that require player location.
  15. Make sure there are no dodgy zeros
      - When joining data sets & filling with zeros, can't know which zeros are valid
      - Some values should never be blank (team, season, round number)
      - Assertions are important: you can't test for bad zeros, and they result in bad training/predictions

          data_to_check = data_frame[NEVER_ZERO_COLUMNS]
          # Rows where any never-zero column holds a 0
          zeros_data_frame = data_to_check[(data_to_check == 0).any(axis=1)]
          assert zeros_data_frame.empty, (
              "An invalid fillna produced index column values of 0:\n"
              f"{zeros_data_frame}"
          )
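One possible way to keep filled zeros distinguishable from real ones is to record where each row came from at join time and to fill only the stat columns. The sketch below uses the `indicator` option of `DataFrame.merge`. The `join_betting_odds` helper and the `win_odds`, `has_odds`, and index column names are hypothetical.

```python
import pandas as pd

# Assumed column names for the shared index
INDEX_COLUMNS = ["team", "season", "round_number"]


def join_betting_odds(matches: pd.DataFrame, odds: pd.DataFrame) -> pd.DataFrame:
    """Left-join odds onto matches, flagging rows that had no odds data."""
    joined = matches.merge(
        odds, how="left", on=INDEX_COLUMNS, indicator="odds_source"
    )
    # "both" means the row was found in the odds data set as well
    joined["has_odds"] = joined["odds_source"] == "both"
    # Fill only the stat column, so index columns can never pick up a zero
    return joined.drop(columns="odds_source").fillna({"win_odds": 0})


matches = pd.DataFrame(
    {"team": ["A", "B"], "season": [1897, 2010], "round_number": [1, 1]}
)
odds = pd.DataFrame(
    {"team": ["B"], "season": [2010], "round_number": [1], "win_odds": [1.8]}
)
print(join_betting_odds(matches, odds))
```

The extra boolean column costs little and lets a model, or a later assertion, tell a filled zero from a measured one.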
  16. Optimise for maintainability first, accuracy second
      - Second season: regressor instead of classifier
      - Kaggle is temporary: no tech debt
      - Project has to be maintained, so make it pleasant to work with
      (Photo by Cesar Carlevarino Aragon on Unsplash)
  17. Calculate the cost/benefit of complexity
      - Kaggle is all about performance
      - Project models can be complicated, just weigh cost/benefit
      - Diminishing returns with increased complexity
      - Abandoned first model/app because too messy
      (Photo by StellrWeb on Unsplash)
  18. Know your system-level dependencies (or control them)
      - Deployed to Heroku
      - Crashed because didn't have C++ library Boost.Python
      - Re-architect for Docker
      • Do you need any of the Boost C++ libraries, gcc, or g++?
      • Do you need to control your environment with Docker?
  19. Know your server's specs
      - Second season: deployed to Heroku, crashed
      - Player data used too much memory
      - Re-architect for DigitalOcean
      - Data pipelines & ML models are memory hungry
      - Know your data usage & how much your server has
      (Photo by Jordan Rowland on Unsplash)
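A quick way to get a feel for data usage is to measure a DataFrame's in-memory size before deploying. This is a sketch with made-up data. The `memory_usage_mb` helper is hypothetical, and real pipelines usually peak well above the size of any single frame.

```python
import pandas as pd


def memory_usage_mb(data_frame: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes."""
    # deep=True measures the contents of object columns (e.g. strings),
    # not just the pointers to them
    return data_frame.memory_usage(deep=True).sum() / 1024 ** 2


# Made-up player data, only for illustration
player_data = pd.DataFrame({"player": ["A", "B"] * 1000, "kicks": range(2000)})
print(f"{memory_usage_mb(player_data):.2f} MB")
```

Comparing that figure with the memory available on the server gives an early warning before a deploy crashes.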
  20. Predicting the bounce of an oblong ball
      (Flickerd / CC BY-SA, https://creativecommons.org/licenses/by-sa/4.0)
  21. 2018 Season Results
      - Rough start, but came back to win in final match

      Tipper          Correct Tips*
      Tipresias (me)  140
      Top Coworker    139
      Oddsmakers      140

      * Regular season only
  22. 2019 Season Results
      - Added data, improved model
      - Rough start, rough middle, rough finish
      - Success isn't guaranteed
      - Kaggle has static test set, sport is chaotic and each season is unique
      - Random tipper's instinct can beat the odds as well as the machines

      Tipper          Correct Tips*
      Tipresias (me)  133
      Top Coworker    138
      Oddsmakers      135

      * Regular season only
  23. Thank you
      - All the tips: tipresias.net
      - All the code: tipresias
      - All the slides: craigfranklin.dev
      - All the complaints: @englishcraig