Slide 1

Derek Nelson | Software Engineer, AdRoll

Slide 2

Pixel “fires”

Slide 3

Pixel “fires” → Serve ad?

Slide 4

Pixel “fires” → Serve ad? → Ad served

Slide 5

No content

Slide 6

7/2011: ~50GB/day → 6/2013: ~10TB/day (~20 billion events per day)

Slide 7

•  All MapReduce jobs are run in AWS Elastic MapReduce
•  200-300 EMR nodes at any given time
•  15-30 different jobs running at any given time

Slide 8

•  All MapReduce jobs are run in AWS Elastic MapReduce
•  200-300 EMR nodes at any given time
•  15-30 different jobs running at any given time
•  HBase (0.92): 10-12 nodes running in AWS, c1.xlarges (8 cores, 7GB RAM, 1.6TB RAID0)
•  2 c1.mediums running Thrift servers

Slide 9

Everything we do with Hadoop involves HBase.

Slide 10

0. Liquid Ads ( Storing MR job output as flexibly as possible )

Slide 11

LiquidAds/Products: Crawl

Slide 12

LiquidAds/Products: Crawl → Parse

Slide 13

LiquidAds/Products: Crawl → Parse → Write to HBase (main:image_path, main:description, main:price)
•  These components can now be used to dynamically assemble ad units
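A minimal sketch of what the "Write to HBase" step could look like with the 0.92-era client API; the table name, row key, and local variables here are assumptions, not from the talk:

// Hypothetical write of parsed product components into a products table,
// so ad units can later be assembled from the stored pieces.
HTable products = new HTable(conf, "products");
Put put = new Put(Bytes.toBytes(productUrl));  // row key: product URL (assumed)
put.add(Bytes.toBytes("main"), Bytes.toBytes("image_path"),  Bytes.toBytes(imagePath));
put.add(Bytes.toBytes("main"), Bytes.toBytes("description"), Bytes.toBytes(description));
put.add(Bytes.toBytes("main"), Bytes.toBytes("price"),       Bytes.toBytes(price));
products.put(put);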

Slide 14

Example URL patterns:
sites.livebooks.com/purchase/get/site*
https://cdn.ticketfly.com/retarget/?/receipt/973/3135/233993/*
credit-unions.findthebest*
https://apps.facebook.com/grandpoker?tag=buyersadroll_US_2013_1

URL-pair similarity output (urlA, urlB, score):
url0, url1, 0.04
url2, url10, 0.29
url3, url5, 0.13
url1, url4, 0.75

Slide 15

(diagram: product0, product1, product2 linked by similarity scores 0.27 and 0.73)

Slide 16

No content

Slide 17

Recommendations

Slide 18

HBase row (products table) → HBase row (recommendations table)

Slide 19

HBase row from Hadoop recommender output (staged in Akamai for speed)

Slide 20

… { main:impressions: 45595, main:clicks: 71, main:cost: 60.80, main:conversions: 25 }
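Reading such a row back is a single Get; a sketch under assumed table name, row key, and numeric serialization:

// Hypothetical read of the counters shown above; "stats", rowKey, and
// fixed-width numeric encoding of the values are assumptions.
HTable stats = new HTable(conf, "stats");
Result r = stats.get(new Get(Bytes.toBytes(rowKey)));
long impressions = Bytes.toLong(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("impressions")));
long clicks      = Bytes.toLong(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("clicks")));
double cost      = Bytes.toDouble(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("cost")));
long conversions = Bytes.toLong(r.getValue(Bytes.toBytes("main"), Bytes.toBytes("conversions")));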

Slide 21

1. Counting Uniques ( Building queryable data structures with Hadoop and storing them in HBase )

Slide 22

How many unique cookies have visited my site in the past 3 days? If I combine these two targe…

Slide 23

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?

Slide 24

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?
(diagram: day 0 ∪ day 1 ∪ day 2; site traffic ∪ impression traffic)

Slide 25

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?
(diagram: day 0 ∪ day 1 ∪ day 2; site traffic ∪ impression traffic; raw cookie sets of 32MB each, 64MB combined)

Slide 26

How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?
(same diagram marked ✗: unioning raw 32MB/64MB cookie sets doesn't scale)

Slide 27

No content

Slide 28

Andrew Pascoe, Data Scientist

Slide 29

No content

Slide 30

HyperLogLog++ (with MinHash)
•  Lossless unions with other HLL counters
•  Intersections…

Slide 31

HyperLogLog++ (with MinHash): 200M uniques in 50KB with 1% margin of error
•  Lossless unions with other HLL counters
•  Intersections…
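For a concrete feel for these numbers, here is a minimal union-and-count sketch using stream-lib's HyperLogLogPlus, one open-source HLL++ implementation; the library choice and precision are assumptions (the deck's HLLCounter is a custom class):

// Minimal HLL++ demo with stream-lib (an assumed stand-in for the
// custom HLLCounter). Precision 14 gives roughly 1% standard error
// in about 12KB per counter.
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class HllDemo {
  public static void main(String[] args) throws Exception {
    HyperLogLogPlus siteTraffic = new HyperLogLogPlus(14);
    HyperLogLogPlus impressionTraffic = new HyperLogLogPlus(14);
    siteTraffic.offer("cookie-a");
    siteTraffic.offer("cookie-b");
    impressionTraffic.offer("cookie-b");
    impressionTraffic.offer("cookie-c");
    siteTraffic.addAll(impressionTraffic);            // lossless union
    System.out.println(siteTraffic.cardinality());    // ~3 uniques
  }
}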

Slide 32

// Building the HLL counters with Hadoop

// Mapping over some log data
context.write(advertiser, cookie);

Slide 33

// Building the HLL counters with Hadoop

// Mapping over some log data
context.write(advertiser, cookie);

// Reducer
HLLCounter hll = new HLLCounter();
for (Text cookie : cookies) {
  hll.put(cookie.get());
}

key.set(advertiser, day);
Put put = new Put(Bytes.toBytes(key.toString()));
put.add(Bytes.toBytes("main"),
        Bytes.toBytes("hll"),    // column qualifier (assumed)
        Bytes.toBytes(hll));     // serialized HLL counter

// Put it into HBase
context.write(key, put);

Slide 34

(key → value diagram: each row key maps to serialized HLL counter values: HLLCounter0, HLLCounter1, HLLCounter2, …; reducer code as on the previous slide)

Slide 35

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Slide 36

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server

Slide 37

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server
Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on HLL counters

Slide 38

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server
Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on HLL counters
AggregationClient returns the cardinality of the resulting HLL counter

Slide 39

// Querying the HLL counters in HBase

Thrift:
Text rpc(
    /** name of func to call */
    1:Text func,
    /** JSON parameters */
    2:Text params)

Thrift client makes a call to custom Thrift server
Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on HLL counters
AggregationClient returns the cardinality of the resulting HLL counter
Custom Thrift server returns: {'size': 40302333}
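A client-side call against that IDL might look roughly like this; the service name, host, port, and the func/params payload are all assumptions (and it assumes the IDL's Text is a typedef for string, so it becomes String in generated Java):

// Hypothetical Thrift client call; "HLLService", the host/port, and the
// func/params values are assumptions, not from the talk.
TTransport transport = new TSocket("hll-thrift-host", 9090);
transport.open();
HLLService.Client client = new HLLService.Client(new TBinaryProtocol(transport));

// func = operation name, params = JSON arguments
String result = client.rpc("unique_count",
    "{\"advertiser\": 123, \"days\": [\"2013-06-10\", \"2013-06-11\"]}");
// => {'size': 40302333}
transport.close();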

Slide 40

2. Conversion tracking $$$ ( Storing state between MR jobs or steps )

Slide 41

impression How do we know when we caused a conversion?

Slide 42

impression impression How do we know when we caused a conversion?

Slide 43

impression impression impression How do we know when we caused a conversion?

Slide 44

impression impression impression impression How do we know when we caused a conversion?

Slide 45

impression impression impression impression click How do we know when we caused a conversion?

Slide 46

impression impression impression impression click conversion How do we know when we caused a conversion?

Slide 47

impression impression impression impression click conversion How do we know when we caused a conversion?

Slide 48

impression impression impression impression click conversion ✔   How do we know when we caused a conversion?

Slide 49

How do we know when we caused a conversion?
impression impression impression impression impression click → conversion ✔   conversion ✔ ?   conversion ✗

Slide 50

// Naïve approach using only Hadoop

// Mappin' over 30 days of data
context.write(cookie, event);

// Reducin'
Collections.sort(events, BY_TIMESTAMP_ASC);
Event lastEvent = null;

// Iterate over the last 30 days of this cookie's traffic
for (Event ev : events) {
  if (isConversionEvent(ev)) {
    if (lastEvent != null) {
      context.write(cookie, ev);
    }
  } else {
    lastEvent = ev;  // most recent potential trigger
  }
}
// . . .

Slide 51

// Problems…

•  Every day you're mapping over the same 29 days you looked at the previous day
•  We'd only need to look at the day we're generating conversions for…
•  If only we had random access into all possible conversion-triggering events…

Slide 52

Keeping potential conversion-triggering events in HBase
impression impression impression impression impression click

Slide 53

Preemptively store ALL potential conversion-triggering events in HBase
impression impression impression impression impression click
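A sketch of that preemptive write, assuming one row per cookie with each event stored as a timestamped version of a single cell (which is what the reducer's setMaxVersions(10000) on the next slides reads back); the table name, column names, and serialize() helper are assumptions:

// Hypothetical write path for the triggers table: one row per cookie,
// each event kept as a timestamped version of the main:event cell.
HTable triggers = new HTable(conf, "triggers");
Put put = new Put(Bytes.toBytes(cookie));
put.add(Bytes.toBytes("main"),
        Bytes.toBytes("event"),
        event.getTimestamp(),   // cell version = event time
        serialize(event));      // serialize() is an assumed helper
triggers.put(put);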

Slide 54

// Better approach using Hadoop + HBase

// Mappin' over 1 day of pixel traffic
if (isConversionEvent(event)) {
  context.write(cookie, event);
}

// Reducin'
Get get = new Get(Bytes.toBytes(cookie));
get.setMaxVersions(10000);
Result row = triggers.get(get);

List<Event> events = rowToEvents(row);

// Iterate over the last 30 days of this cookie's traffic
Event lastEvent = null;
for (Event ev : events) {
  if (isConversionEvent(ev)) {
    if (lastEvent != null) {
      context.write(cookie, ev);
    }
  } else {
    lastEvent = ev;  // most recent potential trigger
  }
}
// . . .

Slide 55

(code as on the previous slide, shown beside the timeline: impression impression impression impression click conversion)

Slide 56

So…
•  Store MapReduce output in HBase when you don't want to consume it sequentially all at once

Slide 57

So…
•  Store MapReduce output in HBase when you don't want to consume it sequentially all at once
•  Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)

Slide 58

So…
•  Store MapReduce output in HBase when you don't want to consume it sequentially all at once
•  Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)
•  Use HBase to keep state between MapReduce jobs

Slide 59

Thank you! Questions?
Derek Nelson | [email protected]
We're hiring! [email protected]