• MapReduce jobs are run in AWS Elastic MapReduce
• 15-30 different jobs running at any given time
• 2 c1.mediums running Thrift servers
• 10-12 nodes running HBase (0.92) in AWS on c1.xlarges (8 cores, 7GB RAM, 1.6TB RAID0)
• How many unique cookies have visited my site in the past 3 days?
• If I combine these two targeting lists, how many unique cookies will I be reaching?
• How many unique cookies have visited my site AND seen an impression in the past day?

[Diagram: site traffic and impression traffic as per-day cookie sets combined with ∪ (day 0, day 1, day 2); each raw set is ~32MB and a union ~64MB, marked ✗]
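One way to see why small, per-day counters can answer these questions: HyperLogLog counters can be unioned cheaply, and an intersection can be estimated by inclusion-exclusion. A minimal sketch, using stream-lib's HyperLogLog as a stand-in for the deck's own HLLCounter (the library choice, the log2m = 14 precision, and the inclusion-exclusion step are assumptions for illustration):

import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

public class UniqueCookieSketch {
  public static void main(String[] args) throws CardinalityMergeException {
    // One small counter per (traffic type, day) instead of one huge cookie set
    HyperLogLog siteDay0 = new HyperLogLog(14);
    HyperLogLog siteDay1 = new HyperLogLog(14);
    HyperLogLog siteDay2 = new HyperLogLog(14);
    HyperLogLog imprDay2 = new HyperLogLog(14);

    siteDay2.offer("cookie-abc");   // cookies observed in the logs
    imprDay2.offer("cookie-abc");
    siteDay1.offer("cookie-def");

    // "Visited my site in the past 3 days" = union of three daily counters
    ICardinality site3Days = siteDay0.merge(siteDay1, siteDay2);
    System.out.println("unique site visitors, 3 days: " + site3Days.cardinality());

    // "Visited my site AND saw an impression in the past day":
    // HLLs only union natively, so estimate the intersection by inclusion-exclusion
    long unionSize = siteDay2.merge(imprDay2).cardinality();
    long both = siteDay2.cardinality() + imprDay2.cardinality() - unionSize;
    System.out.println("site AND impression, past day: " + both);
  }
}

Each counter stays small (kilobytes rather than the tens of MB a raw cookie set needs), which is what makes keeping one per advertiser per day practical.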
// Building the HLL counters with Hadoop

// Mapping over some log data
context.write(advertiser, cookie);

// One HLL counter per <advertiser, day>:
//   key                    value
//   <advertiser0, day2>    HLLCounter2
//   <advertiser1, day0>    HLLCounter0
//   <advertiser1, day1>    HLLCounter1

// Reducer
HLLCounter hll = new HLLCounter();
for (Text cookie : cookies) {
  hll.put(cookie.get());
}

key.set(advertiser, day);
Put put = new Put(key.getBytes());
put.add(Bytes.toBytes("main"), Bytes.toBytes(hll));
// Put it into HBase
context.write(key, put);
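For reference, here is roughly what that reducer could look like as a complete class. The deck's HLLCounter isn't public, so stream-lib's HyperLogLog stands in for it; the TableReducer wiring, the "hll" column qualifier, and the row-key format are assumptions rather than the deck's exact code:

import java.io.IOException;

import com.clearspring.analytics.stream.cardinality.HyperLogLog;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

// Builds one HLL counter per <advertiser, day> key and writes it to HBase
public class HllCounterReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
  @Override
  protected void reduce(Text advertiserDay, Iterable<Text> cookies, Context context)
      throws IOException, InterruptedException {
    HyperLogLog hll = new HyperLogLog(14);
    for (Text cookie : cookies) {
      hll.offer(cookie.toString());   // add each cookie to the estimator
    }

    // Row key is the <advertiser, day> pair; the serialized counter lands in
    // the "main" family so region-side code can union counters later
    byte[] row = Bytes.toBytes(advertiserDay.toString());
    Put put = new Put(row);
    put.add(Bytes.toBytes("main"), Bytes.toBytes("hll"), hll.getBytes());
    context.write(new ImmutableBytesWritable(row), put);
  }
}

The job itself would be wired up against the counters table with TableMapReduceUtil.initTableReducerJob.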
// Querying the HLL counters in HBase

Text rpc(
  /** name of func to call */  1:Text func,
  /** JSON parameters */       2:Text params)

• Thrift client makes a call to the custom Thrift server
• The Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on the HLL counters
• AggregationClient returns the cardinality of the resulting HLL counter
• The custom Thrift server returns: {'size': 40302333}
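On the client side, that RPC might be invoked like the sketch below. Only the rpc(func, params) shape comes from the IDL above; the HllService name, host/port, and JSON parameter layout are invented for illustration, and it assumes the Thrift Text type maps to a Java String:

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class HllQueryClient {
  public static void main(String[] args) throws Exception {
    TTransport transport = new TSocket("thrift-host", 9090);
    transport.open();
    try {
      // HllService.Client is the Thrift-generated client for the rpc() service
      HllService.Client client = new HllService.Client(new TBinaryProtocol(transport));

      // Ask the server for the unique-cookie count across three daily counters;
      // the server turns this into HBase rows and unions them region-side
      String params = "{\"advertiser\": 1, \"days\": [0, 1, 2], \"op\": \"union\"}";
      String result = client.rpc("cardinality", params);
      System.out.println(result);   // e.g. {"size": 40302333}
    } finally {
      transport.close();
    }
  }
}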
// Naïve approach using only Hadoop

// Reducin'
Collections.sort(events, BY_TIMESTAMP_ASC);
Event lastEvent = null;
// Iterate over the last 30 days of this cookie's traffic
for (Event ev : events) {
  if (isConversionEvent(ev)) {
    if (lastEvent != null) {
      context.write(cookie, ev);
    }
  }
}
// . . .
// Problems…
• Every day you re-read the days you already looked at the previous day
• If only we had random access into all possible conversion-triggering events…
• We'd only need to look at the day we're generating conversions for…
// Better approach using Hadoop + HBase

// Mapping over the day's log data
if (isConversionEvent(event)) { context.write(cookie, event); }

// Reducin'
Get get = new Get(Bytes.toBytes(cookie));
get.setMaxVersions(10000);
Result row = triggers.get(get);
List<Event> events = rowToEvents(row);
Event lastEvent = null;
// Iterate over the last 30 days of this cookie's traffic
for (Event ev : events) {
  if (isConversionEvent(ev)) {
    if (lastEvent != null) {
      context.write(cookie, ev);
    }
  }
}
// . . .

[Diagram: one cookie's event timeline: impression, impression, impression, impression, click, conversion]
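The triggers table read above has to be populated somewhere. One plausible shape for that writer, not shown in the deck: an earlier job records each day's impressions and clicks under the cookie's row key, using the event timestamp as the cell version, so a single versioned Get returns the trailing history. The table name, column family, and Event serialization here are assumptions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Writes one day's conversion-triggering events into the "triggers" table,
// keyed by cookie, so later jobs can fetch a cookie's history with one Get
public class TriggerWriter {
  private final HTable triggers;

  public TriggerWriter() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    this.triggers = new HTable(conf, "triggers");   // 0.92-era client API
  }

  // Record one impression/click; the event timestamp doubles as the cell
  // version, so setMaxVersions(10000) on a Get returns the recent history
  public void record(String cookie, long eventTimestamp, byte[] serializedEvent)
      throws IOException {
    Put put = new Put(Bytes.toBytes(cookie));
    put.add(Bytes.toBytes("t"), Bytes.toBytes("event"), eventTimestamp, serializedEvent);
    triggers.put(put);
  }

  public void close() throws IOException {
    triggers.close();
  }
}

Note the column family's VERSIONS setting would need to be raised well above the default of 3 for the table to retain a month of events per cookie.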
So…
• Use Hadoop when you're going to consume your data sequentially, all at once
• Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)
• Use HBase to keep state between MapReduce jobs