Row-orientated
• Look up location from index
• Seek to location, retrieve full row

Columnar
• Columns stored separately on disk
• Read full column
• Compression (run-length) is easier with columns of a single type
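In Hive the layout is chosen with the STORED AS clause on the table definition. The sketch below is illustrative only: the table and column names are hypothetical, and ORC stands in as one example of a columnar format (this excerpt does not name a specific one).

-- Row-orientated: each record is stored as one complete row (hypothetical table)
CREATE TABLE events_text (
  user_id    BIGINT,
  event_type STRING,
  ts         TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Columnar: ORC keeps each column separately on disk and can run-length encode it (hypothetical table)
CREATE TABLE events_orc (
  user_id    BIGINT,
  event_type STRING,
  ts         TIMESTAMP
)
STORED AS ORC;

-- A query touching only one column reads just that column's data in the columnar table
SELECT count(DISTINCT event_type) FROM events_orc;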
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1410783831919_0150, Tracking URL = http://172.18.0.152:9046/proxy/application_1410783831919_0150/
Kill Command = /home/hadoop/bin/hadoop job -kill job_1410783831919_0150
Hadoop job information for Stage-1: number of mappers: 17; number of reducers: 1
2014-09-23 16:42:12,527 Stage-1 map = 0%, reduce = 0%
2014-09-23 16:42:16,648 Stage-1 map = 47%, reduce = 0%, Cumulative CPU 19.14 sec
2014-09-23 16:42:17,746 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 51.65 sec
2014-09-23 16:42:18,783 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 51.65 sec
2014-09-23 16:42:19,818 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 53.22 sec
MapReduce Total cumulative CPU time: 53 seconds 220 msec
Ended Job = job_1410783831919_0150
Counters:
MapReduce Jobs Launched:
Job 0: Map: 17  Reduce: 1  Cumulative CPU: 53.22 sec  HDFS Read: 4459072001  HDFS Write: 8  SUCCESS
Total MapReduce CPU Time Spent: 53 seconds 220 msec
OK
Lots
Time taken: 20.462 seconds, Fetched: 1 row(s)
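The header of this log names the three settings Hive consults when choosing the reducer count; here it settled on a single reducer at compile time. A minimal sketch of overriding them in a Hive session follows; the numeric values are illustrative assumptions, not taken from the job above.

-- Lower the average input per reducer (in bytes) so Hive starts more reducers (value is illustrative)
set hive.exec.reducers.bytes.per.reducer=268435456;

-- Cap the maximum number of reducers Hive may launch (value is illustrative)
set hive.exec.reducers.max=64;

-- Or force an exact reducer count, bypassing the estimate (value is illustrative)
set mapred.reduce.tasks=8;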