Slide 1

Slide 1 text

‹#› © Cloudera, Inc. All rights reserved. Moneyballing: Using Data to Win At Fantasy Football Juliet Hougland @j_houg

Slide 2

Slide 2 text

‹#› © Cloudera, Inc. All rights reserved. whoami

Slide 3

Slide 3 text

‹#› © Cloudera, Inc. All rights reserved. Why am I here?

Slide 4

Slide 4 text

‹#› © Cloudera, Inc. All rights reserved. Serious Business img: [1], [2], [3]

Slide 5

Slide 5 text

‹#› © Cloudera, Inc. All rights reserved. What is Fantasy Football? • Draft players • Play players from your pool each week • Those players score points for you according to some rules =>

Slide 6

Slide 6 text

‹#› © Cloudera, Inc. All rights reserved. Why Fantasy Football? img: [4]

Slide 7

Slide 7 text

‹#› © Cloudera, Inc. All rights reserved. Financial Account Aggregators: Theoretically creepy, actually useful!

Slide 8

Slide 8 text

‹#› © Cloudera, Inc. All rights reserved. Why Fantasy Football? img: [4]

Slide 9

Slide 9 text

‹#› © Cloudera, Inc. All rights reserved. Data Modeling

Slide 10

Slide 10 text

‹#› © Cloudera, Inc. All rights reserved. Two 500lb Gorillas in the Room img: [5], [6] 3NF Dimensionally Modeled Data Warehouse

Slide 11

Slide 11 text

‹#› © Cloudera, Inc. All rights reserved. • Application backend • No data duplication => • Low latency CRUD ops Engineered for Different Workloads • Ease and flexibility in per-key aggregations • Business Intelligence (BI) reporting • Conformed dimensions • Obvious sensible joins 3NF Dimensionally Modeled Data Warehouse img: [7]

Slide 12

Slide 12 text

‹#› © Cloudera, Inc. All rights reserved. It isn’t NoSql, it is… No (Or, please very few) Joins

Slide 13

Slide 13 text

‹#› © Cloudera, Inc. All rights reserved. Good Cluster, Bad Cluster

Slide 14

Slide 14 text

‹#› © Cloudera, Inc. All rights reserved.

Slide 15

Slide 15 text

‹#› © Cloudera, Inc. All rights reserved. Two ways to initially receive data CSVs 3NF img: [8]

Slide 16

Slide 16 text

‹#› © Cloudera, Inc. All rights reserved. • Easy to update • Data only gets written once 3ND DB Format

Slide 17

Slide 17 text

‹#› © Cloudera, Inc. All rights reserved. • Star Schema • Facts are in the center of star • Dimensions are the points • Optimized for ease of aggregate queries • Conformed dimensions Data Warehouse

Slide 18

Slide 18 text

‹#› © 2014 Cloudera, Inc. All rights reserved. Schema for Football Play Prediction SQL statements on columns make sense, because columns are like per-player tables

Slide 19

Slide 19 text

‹#› © Cloudera, Inc. All rights reserved. Supernova Schema: Collapsed Star Schema

Slide 20

Slide 20 text

‹#› © Cloudera, Inc. All rights reserved. Exhibit: SQL in your SQL

Slide 21

Slide 21 text

‹#› © Cloudera, Inc. All rights reserved. Exhibit? • Exhibit: https://github.com/jwills/exhibit • I heard you like SQL…

Slide 22

Slide 22 text

‹#› © Cloudera, Inc. All rights reserved. CREATE TABLE IF NOT EXISTS passing_game_sess AS SELECT a.psr as player, b.gid, collect_all(named_struct( 'pid', a.pid, 'yds', a.yds, 'intcpt', cast(e.pid is not null as int), 'td', cast(coalesce(c.pts, 0) as int), 'conv', cast(d.pid is not null as int))) as passes FROM football.pass …other stuff… How do we make a (Hive) Supernova table?

Slide 23

Slide 23 text

‹#› © Cloudera, Inc. All rights reserved. DESCRIBE TABLE passing_game_sess; What does this table look like? Column Name Data Type player string gid int passes array>

Slide 24

Slide 24 text

‹#› © Cloudera, Inc. All rights reserved. CREATE TABLE player_game_points STORED AS PARQUET AS SELECT a.player, a.gid, b.*, …other stuff… FROM scoring_game_sessions a LATERAL VIEW within_table(array( "select sum(yds) yds, sum(intcpt) intcpt, sum(td) td, sum(conv) conv from t1", "select (yds/25) - 2*intcpt + 2*conv + 4*td as passing from last"), passes) b …other stuff… How do we query a supernova table?

Slide 25

Slide 25 text

‹#› © Cloudera, Inc. All rights reserved. DESCRIBE TABLE player_game_points; What does this table look like? Column Name Data Type player string gid int passing int rushing int receiving int

Slide 26

Slide 26 text

‹#› © Cloudera, Inc. All rights reserved. Roster Risk Optimization

Slide 27

Slide 27 text

‹#› © Cloudera, Inc. All rights reserved. What are we predicting? • We want to maximize points • … and manage risk • Historic fantasy points are easy to calculate given granular play history • => Regression problem • => Start simple. Predict the average pts scored last year. • What is risk for us? • => Variance in points scored.

Slide 28

Slide 28 text

‹#› © Cloudera, Inc. All rights reserved. The predictor • E[points | year = 2014] = E[point | year =2013] • I always start with that predictor: • E[X | this interval] = E[X | last interval] • We can define risk as the std deviation of their expected points per game, over the season. • Risk(Player) = Sqrt(Variance(X))

Slide 29

Slide 29 text

‹#› © Cloudera, Inc. All rights reserved. How do I compute per player predictions? CREATE TABLE player_prediction AS select player_game_points.player AS player, games.seas as seas, sum(passing + rushing + receiving) AS totalPts, stddev_pop(sum(passing +rushing + receiving)) AS stddev, players.pos1 FROM player_game_points LEFT JOIN games ON player_game_points.gid = games.gid LEFT JOIN players ON players.player = player_game_points.player GROUP BY player_game_points.player, players.pos1, games.seas; !

Slide 30

Slide 30 text

‹#› © Cloudera, Inc. All rights reserved. Wide Receiver Running Back

Slide 31

Slide 31 text

‹#› © Cloudera, Inc. All rights reserved. Quarter Back Tight End

Slide 32

Slide 32 text

‹#› © Cloudera, Inc. All rights reserved. So, what do our rosters look like? file:///home/juliet/src/footballtimeviz/index.html

Slide 33

Slide 33 text

‹#› © Cloudera, Inc. All rights reserved. • Yep, but the workflow is modular • I always start with the simplest plausible model • Any other model, and an estimate of it’s variance for each player (bootstrap?) would work • What is your favorite regression algorithm? My model is simple

Slide 34

Slide 34 text

‹#› © Cloudera, Inc. All rights reserved. Where does risk come from?

Slide 35

Slide 35 text

‹#› © Cloudera, Inc. All rights reserved. Probably Injuries? img: 11

Slide 36

Slide 36 text

‹#› © Cloudera, Inc. All rights reserved. Can I find data to justify this? from BeautifulSoup import BeautifulSoup from urllib2 import urlopen import csv ! soup = BeautifulSoup(urlopen('http://www.cbssports.com/nfl/injuries')) ! outputFile = open("injuries.csv", "w") csvwriter = csv.writer(outputFile) ! for table in soup.findAll("table", attrs = {"class" : "data"}): rows = table.findAll("tr") if len(rows) > 0: team_name = rows.pop(0).find("a") for row in rows: cells = [cell.text for cell in row.findAll('td')] if len(cells) == 6: cells += team_name csvwriter.writerow(cells) ! outputFile.close()

Slide 37

Slide 37 text

‹#› © Cloudera, Inc. All rights reserved. Photo Credits 1. bit.ly/1xCEAGE 2. bit.ly/1DJvZGH 3. bit.ly/1D6wb4E 4. bit.ly/1yRu7Zp 5. bit.ly/1Fg3TEt 6. bit.ly/1zgDLnp 7. bit.ly/1CCmOuI 8. bit.ly/1zYnRmN 9. bit.ly/1uAWEWC 10. bit.ly/17bRg22 11. bit.ly/1AidZEP Find the code (but not the proprietary data) on github: https://github.com/jhlch/fantasy-football

Slide 38

Slide 38 text

‹#› © Cloudera, Inc. All rights reserved. 1. Get NFL data from armchair analysis 2. Read “Dimensional Modeling Toolkit” by Ralph Kimball 3. ??? 4. Profit Next Steps img: 1

Slide 39

Slide 39 text

‹#› © Cloudera, Inc. All rights reserved. Thank you! Juliet Hougland @j_houg