Slide 1

Derek Nelson | Software Engineer, AdRoll

Slide 2

Pixel “fires”

Slide 3

Pixel “fires” → Serve ad?

Slide 4

Pixel “fires” → Serve ad? → Ad served

Slide 5

No content

Slide 6

7/2011: ~50GB/day → 6/2013: ~10TB/day (~20 billion events per day)

Slide 7

•  All MapReduce jobs are run in AWS Elastic MapReduce
•  200-300 EMR nodes at any given time
•  15-30 different jobs running at any given time

Slide 8

•  All MapReduce jobs are run in AWS Elastic MapReduce
•  200-300 EMR nodes at any given time
•  15-30 different jobs running at any given time
•  HBase (0.92): 10-12 nodes running in AWS, c1.xlarges (8 cores, 7GB RAM, 1.6TB RAID0)
•  2 c1.mediums running Thrift servers

Slide 9

Everything we do with Hadoop involves HBase.

Slide 10

0. Liquid Ads ( Storing MR job output as flexibly as possible )

Slide 11

LiquidAds/Products: Crawl

Slide 12

LiquidAds/Products: Crawl → Parse

Slide 13

LiquidAds/Products: Crawl → Parse → Write to HBase (main:image_path, main:description, main:price)
•  These components can now be used to dynamically assemble ad units
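A minimal sketch of what the "Write to HBase" step could look like with the 0.92-era client API; the table name, row key, and local variables here are assumptions, not from the talk:

// Hypothetical write of parsed product components into a products table,
// so ad units can later be assembled from the stored pieces.
HTable products = new HTable(conf, "products");
Put put = new Put(Bytes.toBytes(productUrl));  // row key: product URL (assumed)
put.add(Bytes.toBytes("main"), Bytes.toBytes("image_path"),  Bytes.toBytes(imagePath));
put.add(Bytes.toBytes("main"), Bytes.toBytes("description"), Bytes.toBytes(description));
put.add(Bytes.toBytes("main"), Bytes.toBytes("price"),       Bytes.toBytes(price));
products.put(put);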

Slide 14

Example URL patterns:
sites.livebooks.com/purchase/get/site*
https://cdn.ticketfly.com/retarget/?/receipt/973/3135/233993/*
credit-unions.findthebest*
https://apps.facebook.com/grandpoker?tag=buyersadroll_US_2013_1

URL-pair similarity output (urlA, urlB, score):
url0, url1, 0.04
url2, url10, 0.29
url3, url5, 0.13
url1, url4, 0.75

Slide 15

(diagram: product0, product1, product2 linked by similarity scores 0.27 and 0.73)

Slide 16

No content

Slide 17

Recommendations

Slide 18

HBase row (products table) → HBase row (recommendations table)

Slide 19

HBase row from Hadoop recommender output (staged in Akamai for speed)

Slide 20

… { main:impressions: 45595, main:clicks: 71, main:cost: 60.80, main:conversions: 25 }
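Reading such a row back is a single Get; a sketch under assumed table name, row key, and numeric serialization:

// Hypothetical read of the counters shown above; "stats", rowKey, and
// fixed-width numeric encoding of the values are assumptions.
HTable stats = new HTable(conf, "stats");
Result r = stats.get(new Get(Bytes.toBytes(rowKey)));
long impressions = Bytes.toLong(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("impressions")));
long clicks      = Bytes.toLong(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("clicks")));
double cost      = Bytes.toDouble(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("cost")));
long conversions = Bytes.toLong(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("conversions")));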

Slide 21

1. Counting Uniques ( Building queryable data structures with Hadoop and storing them in HBase )

Slide 22

How many unique cookies have visited my site in the past 3 days? If I combine these two targe…

Slide 23

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?

Slide 24

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?
(diagram: day 0 ∪ day 1 ∪ day 2; site traffic ∪ impression traffic)

Slide 25

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?
(diagram: day 0 ∪ day 1 ∪ day 2; site traffic ∪ impression traffic; raw cookie sets of 32MB each, 64MB combined)

Slide 26

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?
(same diagram marked ✗: unioning raw 32MB/64MB cookie sets doesn't scale)

Slide 27

No content

Slide 28

Andrew Pascoe, Data Scientist

Slide 29

No content

Slide 30

HyperLogLog++ (with MinHash)
•  Lossless unions with other HLL counters
•  Intersections…

Slide 31

HyperLogLog++ (with MinHash): 200M uniques in 50KB with 1% margin of error
•  Lossless unions with other HLL counters
•  Intersections…
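For a concrete feel for these numbers, here is a minimal union-and-count sketch using stream-lib's HyperLogLogPlus, one open-source HLL++ implementation; the library choice and precision are assumptions (the deck's HLLCounter is a custom class):

// Minimal HLL++ demo with stream-lib (an assumed stand-in for the
// custom HLLCounter). Precision 14 gives roughly 1% standard error
// in about 12KB per counter.
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class HllDemo {
  public static void main(String[] args) throws Exception {
    HyperLogLogPlus siteTraffic = new HyperLogLogPlus(14);
    HyperLogLogPlus impressionTraffic = new HyperLogLogPlus(14);
    siteTraffic.offer("cookie-a");
    siteTraffic.offer("cookie-b");
    impressionTraffic.offer("cookie-b");
    impressionTraffic.offer("cookie-c");
    siteTraffic.addAll(impressionTraffic);            // lossless union
    System.out.println(siteTraffic.cardinality());    // ~3 uniques
  }
}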

Slide 32

// Building the HLL counters with Hadoop

// Mapping over some log data
context.write(advertiser, cookie);

Slide 33

// Building the HLL counters with Hadoop

// Mapping over some log data
context.write(advertiser, cookie);

// Reducer
HLLCounter hll = new HLLCounter();
for (Text cookie : cookies) {
  hll.put(cookie.get());
}

key.set(advertiser, day);
Put put = new Put(Bytes.toBytes(key.toString()));
put.add(Bytes.toBytes("main"),
        Bytes.toBytes("hll"),    // column qualifier (assumed)
        Bytes.toBytes(hll));     // serialized HLL counter

// Put it into HBase
context.write(key, put);

Slide 34

(key → value diagram: each row key maps to serialized HLL counter values: HLLCounter0, HLLCounter1, HLLCounter2, …; reducer code as on the previous slide)

Slide 35

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Slide 36

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server

Slide 37

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server
Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on HLL counters

Slide 38

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server
Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on HLL counters
AggregationClient returns the cardinality of the resulting HLL counter

Slide 39

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server
Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on HLL counters
AggregationClient returns the cardinality of the resulting HLL counter
Custom Thrift server returns: {'size': 40302333}
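A client-side call against that IDL might look roughly like this; the service name, host, port, and the func/params payload are all assumptions (and it assumes the IDL's Text is a typedef for string, so it becomes String in generated Java):

// Hypothetical Thrift client call; "HLLService", the host/port, and the
// func/params values are assumptions, not from the talk.
TTransport transport = new TSocket("hll-thrift-host", 9090);
transport.open();
HLLService.Client client = new HLLService.Client(new TBinaryProtocol(transport));

// func = operation name, params = JSON arguments
String result = client.rpc("unique_count",
    "{\"advertiser\": 123, \"days\": [\"2013-06-10\", \"2013-06-11\"]}");
// => {'size': 40302333}
transport.close();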

Slide 40

2. Conversion tracking $$$ ( Storing state between MR jobs or steps )

Slide 41

impression How do we know when we caused a conversion?

Slide 42

impression impression How do we know when we caused a conversion?

Slide 43

impression impression impression How do we know when we caused a conversion?

Slide 44

impression impression impression impression How do we know when we caused a conversion?

Slide 45

impression impression impression impression click How do we know when we caused a conversion?

Slide 46

impression impression impression impression click conversion How do we know when we caused a conversion?

Slide 47

impression impression impression impression click conversion How do we know when we caused a conversion?

Slide 48

impression impression impression impression click conversion ✔   How do we know when we caused a conversion?

Slide 49

How do we know when we caused a conversion?
impression impression impression impression impression click → conversion ✔   conversion ✔ ?   conversion ✗

Slide 50

// Naïve approach using only Hadoop

// Mappin' over 30 days of data
context.write(cookie, event);

// Reducin'
Collections.sort(events, BY_TIMESTAMP_ASC);
Event lastEvent = null;

// Iterate over the last 30 days of this cookie's traffic
for (Event ev : events) {
  if (isConversionEvent(ev)) {
    if (lastEvent != null) {
      context.write(cookie, ev);
    }
  } else {
    lastEvent = ev;  // most recent potential trigger
  }
}
// . . .

Slide 51

// Problems…

•  Every day you're mapping over the same 29 days you looked at the previous day
•  We'd only need to look at the day we're generating conversions for…
•  If only we had random access into all possible conversion-triggering events…

Slide 52

Keeping potential conversion-triggering events in HBase
impression impression impression impression impression click

Slide 53

Preemptively store ALL potential conversion-triggering events in HBase
impression impression impression impression impression click
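A sketch of that preemptive write, assuming one row per cookie with each event stored as a timestamped version of a single cell (which is what the reducer's setMaxVersions(10000) on the next slides reads back); the table name, column names, and serialize() helper are assumptions:

// Hypothetical write path for the triggers table: one row per cookie,
// each event kept as a timestamped version of the main:event cell.
HTable triggers = new HTable(conf, "triggers");
Put put = new Put(Bytes.toBytes(cookie));
put.add(Bytes.toBytes("main"),
        Bytes.toBytes("event"),
        event.getTimestamp(),   // cell version = event time
        serialize(event));      // serialize() is an assumed helper
triggers.put(put);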

Slide 54

// Better approach using Hadoop + HBase

// Mappin' over 1 day of pixel traffic
if (isConversionEvent(event)) {
  context.write(cookie, event);
}

// Reducin'
Get get = new Get(Bytes.toBytes(cookie));
get.setMaxVersions(10000);
Result row = triggers.get(get);

List<Event> events = rowToEvents(row);

// Iterate over the last 30 days of this cookie's traffic
Event lastEvent = null;
for (Event ev : events) {
  if (isConversionEvent(ev)) {
    if (lastEvent != null) {
      context.write(cookie, ev);
    }
  } else {
    lastEvent = ev;  // most recent potential trigger
  }
}
// . . .

Slide 55

(code as on the previous slide, shown beside the timeline: impression impression impression impression click conversion)

Slide 56

So…
•  Store MapReduce output in HBase when you don't want to consume it sequentially all at once

Slide 57

So…
•  Store MapReduce output in HBase when you don't want to consume it sequentially all at once
•  Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)

Slide 58

So…
•  Store MapReduce output in HBase when you don't want to consume it sequentially all at once
•  Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)
•  Use HBase to keep state between MapReduce jobs

Slide 59

Thank you! Questions?
Derek Nelson | [email protected]
We're hiring! [email protected]