Hadoop + HBase Synergy

An overview of several AdRoll services that thrive on using Hadoop in conjunction with HBase

Derek Nelson

June 14, 2013

Transcript

  1. Derek Nelson | Software Engineer Synergy

  2. Pixel “fires”

  3. Pixel “fires” Serve ad?

  4. Pixel “fires” Serve ad? Ad served

  5. None
  6. 7/2011: ~50GB/day → 6/2013: ~10TB/day, ~20 billion events per day

  7. All MapReduce jobs are run in AWS Elastic MapReduce; 200-300 EMR nodes at any given time; 15-30 different jobs running at any given time

  8. All MapReduce jobs are run in AWS Elastic MapReduce; 200-300 EMR nodes at any given time; 15-30 different jobs running at any given time; HBase (0.92): 10-12 nodes running in AWS, c1.xlarges (8 cores, 7GB RAM, 1.6TB RAID0), 2 c1.mediums running Thrift servers
  9. Everything we do with Hadoop involves HBase

  10. 0. Liquid Ads ( Storing MR job output as flexibly as possible )
  11. LiquidAds/Products: Crawl

  12. LiquidAds/Products: Crawl → Parse

  13. Crawl → Parse → Write to HBase: <advertiser, product> main:image_path main:description main:price
      • These components can now be used to dynamically assemble ad units
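The "flexibly as possible" point above can be made concrete with a plain-Java sketch (no HBase dependency): each <advertiser, product> row carries only the main:* columns the crawl actually found, since HBase columns are sparse per row. All table, column, and value names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class ProductRowSketch {
    // rowKey -> (column -> value), standing in for one HBase table
    static final Map<String, Map<String, String>> products = new HashMap<>();

    static String rowKey(String advertiser, String product) {
        return advertiser + "\u0000" + product;   // composite key with a delimiter
    }

    static void put(String row, String col, String val) {
        products.computeIfAbsent(row, k -> new HashMap<>()).put(col, val);
    }

    public static void main(String[] args) {
        // One product parsed with all three fields...
        String p0 = rowKey("adv0", "product0");
        put(p0, "main:image_path", "s3://bucket/shoe.jpg");
        put(p0, "main:description", "Running shoe");
        put(p0, "main:price", "59.99");

        // ...and another where the parser found no price; no empty column is stored
        String p1 = rowKey("adv0", "product1");
        put(p1, "main:image_path", "s3://bucket/boot.jpg");

        // Ad assembly reads whichever components exist
        System.out.println(products.get(p0).get("main:price"));         // 59.99
        System.out.println(products.get(p1).containsKey("main:price")); // false
    }
}
```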
  14. sites.livebooks.com/purchase/get/site* https://cdn.ticketfly.com/retarget/?/receipt/973/3135/233993/* credit-unions.findthebest* https://apps.facebook.com/grandpoker?tag=buyersadroll_US_2013_1
      url0, url1, 0.04   url2, url10, 0.29   url3, url5, 0.13   url1, url4, 0.75
  15. similarity: 0.27 similarity: 0.73 <product0, product1, algorithm> <product0, product2, algorithm> product0 product1 product2
  16. None
  17. Recommendations for the shoe

  18. HBase row (products table) → HBase row (recommendations table)

  19. HBase row from Hadoop recommender output (staged in Akamai for speed)
  20. <advertiser, 2013-04-01, campaign> <advertiser, 2013-04-02, campaign> <advertiser, 2013-04-03, campaign> <advertiser, 2013-04-04, campaign> <advertiser, 2013-04-05, campaign> … { main:impressions: 45595, main:clicks: 71, main:cost: 60.80, main:conversions: 25 }
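The <advertiser, date, campaign> key order above works because HBase sorts rows lexicographically, so one advertiser's daily stats for a date range are contiguous and readable with a single scan. A plain-Java sort stands in for HBase's sorted keyspace; the key format and names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RowKeyOrderSketch {
    static String key(String advertiser, String date, String campaign) {
        // Zero-padded ISO dates sort correctly as strings
        return advertiser + "|" + date + "|" + campaign;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(key("adv0", "2013-04-03", "camp0"));
        keys.add(key("adv0", "2013-04-01", "camp0"));
        keys.add(key("adv1", "2013-04-02", "camp0"));
        keys.add(key("adv0", "2013-04-02", "camp0"));

        Collections.sort(keys);  // HBase keeps rows in this order on disk

        // A scan from adv0|2013-04-01 to adv0|2013-04-03 touches only adv0's rows,
        // in date order, before any adv1 row appears
        for (String k : keys) System.out.println(k);
    }
}
```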
  21. 1. Counting Uniques ( Building queryable data structures with Hadoop and storing them in HBase )
  22. How many unique cookies have visited my site in the past 3 days? If I combine these two targeting lists, how many unique cookies will I be reaching? How many unique cookies have visited my site AND seen an impression in the past day?
  23. How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?

  24. How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day? ∪ day 0 day 1 day 2 ∪ ∪ site traffic impression traffic

  25. How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day? ∪ day 0 day 1 day 2 ∪ ∪ site traffic impression traffic 32MB 32MB 32MB 32MB 64MB

  26. How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day? ∪ day 0 day 1 day 2 ∪ ∪ site traffic impression traffic 32MB 32MB 32MB 32MB 64MB ✗
  27. None
  28. Andrew Pascoe, Data Scientist

  29. None
  30. HyperLogLog++ (with MinHash)
      • Lossless unions with other HLL counters
      • Intersection sizes with other HLL counters
      • Set cardinality (no membership checks)

  31. HyperLogLog++ (with MinHash): 200M uniques in 50KB with 1% margin of error
      • Lossless unions with other HLL counters
      • Intersection sizes with other HLL counters
      • Set cardinality (no membership checks)
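The claims above (fixed-size counters, lossless unions, cardinality without membership) can be made concrete with a toy raw-HLL in plain Java. This is NOT AdRoll's HLLCounter: real HyperLogLog++ adds bias correction, sparse encoding, and MinHash for intersection sizes. All names here are illustrative.

```java
public class HLLSketch {
    static final int P = 10;                 // 2^10 = 1024 registers (~3.2% std error)
    final byte[] regs = new byte[1 << P];

    // splitmix64 finalizer: spreads a weak input hash over all 64 bits
    static long mix(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    public void put(String item) {
        long h = mix(item.hashCode());
        int idx = (int) (h >>> (64 - P));                  // top P bits pick a register
        int rank = Long.numberOfLeadingZeros(h << P) + 1;  // rank of the remaining bits
        if (rank > regs[idx]) regs[idx] = (byte) rank;
    }

    // Lossless union: the elementwise max of registers is exactly what a
    // counter fed the union of both streams would hold
    public void union(HLLSketch other) {
        for (int i = 0; i < regs.length; i++)
            regs[i] = (byte) Math.max(regs[i], other.regs[i]);
    }

    // Raw HLL estimate: alpha_m * m^2 / sum(2^-reg)
    public double cardinality() {
        int m = regs.length;
        double sum = 0;
        for (byte r : regs) sum += Math.pow(2, -r);
        double alpha = 0.7213 / (1 + 1.079 / m);
        return alpha * m * m / sum;
    }
}
```

With 1024 one-byte registers the counter is ~1KB regardless of how many cookies it has seen, which is the property that makes per-advertiser, per-day counters cheap to store in HBase.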
  32. // Building the HLL counters with Hadoop

      // Mapping over some log data
      context.write(advertiser, cookie);

  33. // Building the HLL counters with Hadoop

      // Mapping over some log data
      context.write(advertiser, cookie);

      // Reducer
      HLLCounter hll = new HLLCounter();
      for (Text cookie : cookies) {
        hll.put(cookie.get());
      }

      row.set(advertiser, day);
      Put put = new Put(Bytes.toBytes(row.toString()));
      put.add(Bytes.toBytes("main"),
              Bytes.toBytes("hll"),
              hll.toBytes());

      // Put it into HBase
      context.write(row, put);

  34. key                  value
      <advertiser0, day0>  HLLCounter0
      <advertiser0, day1>  HLLCounter1
      <advertiser0, day2>  HLLCounter2
      <advertiser1, day0>  HLLCounter0
      <advertiser1, day1>  HLLCounter1

      // Mapping over some log data
      context.write(advertiser, cookie);

      // Reducer
      HLLCounter hll = new HLLCounter();
      for (Text cookie : cookies) {
        hll.put(cookie.get());
      }

      row.set(advertiser, day);
      Put put = new Put(Bytes.toBytes(row.toString()));
      put.add(Bytes.toBytes("main"),
              Bytes.toBytes("hll"),
              hll.toBytes());

      // Put it into HBase
      context.write(row, put);
  35. // Querying the HLL counters in HBase
      Thrift
      Text rpc(
          /** name of func to call */
          1:Text func,
          /** JSON parameters */
          2:Text params)

  36. // Querying the HLL counters in HBase
      Thrift
      Text rpc(
          /** name of func to call */
          1:Text func,
          /** JSON parameters */
          2:Text params)
      Thrift client makes a call to custom Thrift server

  37. // Querying the HLL counters in HBase
      Thrift
      Text rpc(
          /** name of func to call */
          1:Text func,
          /** JSON parameters */
          2:Text params)
      Thrift client makes a call to custom Thrift server
      Thrift RPC uses AggregationClient with a custom CI (ColumnInterpreter) to do data-local set operations on HLL counters

  38. // Querying the HLL counters in HBase
      Thrift
      Text rpc(
          /** name of func to call */
          1:Text func,
          /** JSON parameters */
          2:Text params)
      Thrift client makes a call to custom Thrift server
      Thrift RPC uses AggregationClient with a custom CI (ColumnInterpreter) to do data-local set operations on HLL counters
      AggregationClient returns the cardinality of the resulting HLL counter

  39. // Querying the HLL counters in HBase
      Thrift
      Text rpc(
          /** name of func to call */
          1:Text func,
          /** JSON parameters */
          2:Text params)
      Thrift client makes a call to custom Thrift server
      Thrift RPC uses AggregationClient with a custom CI (ColumnInterpreter) to do data-local set operations on HLL counters
      AggregationClient returns the cardinality of the resulting HLL counter
      Custom Thrift server returns: {'size': 40302333}
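The generic rpc(func, params) endpoint above can be sketched as a by-name dispatch table in plain Java. The handler name, JSON handling, and canned result are hypothetical; a real server would parse params with a JSON library and run HBase's AggregationClient against the counters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class RpcDispatchSketch {
    // func name -> handler taking a JSON params string, returning a JSON string
    static final Map<String, Function<String, String>> handlers = new HashMap<>();

    static String rpc(String func, String params) {
        Function<String, String> h = handlers.get(func);
        if (h == null) return "{'error': 'unknown func " + func + "'}";
        return h.apply(params);
    }

    public static void main(String[] args) {
        // Hypothetical handler: would union HLL counters server-side and
        // return the resulting cardinality (value canned from the slide)
        handlers.put("union_size", params -> "{'size': 40302333}");

        System.out.println(rpc("union_size", "{'advertiser': 'adv0', 'days': 3}"));
        System.out.println(rpc("nope", "{}"));
    }
}
```

One string-typed method keeps the Thrift IDL stable while new set operations are added: only the dispatch table and the JSON contract change, not the generated client code.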
  40. 2. Conversion tracking $$$ ( Storing state between MR jobs or steps )
  41. impression How do we know when we caused a conversion?

  42. impression impression How do we know when we caused a conversion?

  43. impression impression impression How do we know when we caused a conversion?

  44. impression impression impression impression How do we know when we caused a conversion?

  45. impression impression impression impression click How do we know when we caused a conversion?

  46. impression impression impression impression click conversion How do we know when we caused a conversion?

  47. impression impression impression impression click conversion How do we know when we caused a conversion?

  48. impression impression impression impression click conversion ✔ How do we know when we caused a conversion?

  49. How do we know when we caused a conversion? impression impression impression impression impression click conversion ✔ conversion ✔ ? conversion ✗
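The rule the slides above build up to can be sketched in plain Java: a conversion is credited only when some earlier ad event (impression or click) for the same cookie falls inside a lookback window. The Event shape and the 30-day window are illustrative assumptions, not AdRoll's exact rules.

```java
import java.util.ArrayList;
import java.util.List;

public class AttributionSketch {
    static final long WINDOW_MS = 30L * 24 * 60 * 60 * 1000;  // 30-day lookback

    static class Event {
        final long ts; final String type;  // "impression", "click", "conversion"
        Event(long ts, String type) { this.ts = ts; this.type = type; }
    }

    // True if any impression/click precedes the conversion within the window
    static boolean attributed(List<Event> history, long conversionTs) {
        for (Event ev : history) {
            boolean isAd = ev.type.equals("impression") || ev.type.equals("click");
            if (isAd && ev.ts < conversionTs && conversionTs - ev.ts <= WINDOW_MS)
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Event> history = new ArrayList<>();
        history.add(new Event(1_000L, "impression"));
        history.add(new Event(2_000L, "click"));

        System.out.println(attributed(history, 3_000L));                  // inside the window: credited
        System.out.println(attributed(history, 2_000L + WINDOW_MS + 1));  // too late: not credited
    }
}
```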
  50. // Naïve approach using only Hadoop

      // Mappin' over 30 days of data
      context.write(cookie, event);

      // Reducin'
      Collections.sort(events, BY_TIMESTAMP_ASC);
      Event lastEvent = null;

      // Iterate over the last 30 days of this cookie's traffic
      for (Event ev : events) {
        if (isConversionEvent(ev)) {
          if (lastEvent != null) {
            context.write(cookie, ev);
          }
        }
      }
      // . . .
  51. // Problems…
      • Every day you're mapping over the same 29 days you looked at the previous day
      • We'd only need to look at the day we're generating conversions for…
      • If only we had random access into all possible conversion triggering events…
  52. Keeping potential conversion triggering events in HBase: impression impression impression impression click impression

  53. Preemptively store ALL potential conversion triggering events in HBase: impression impression impression impression click impression
  54. // Better approach using Hadoop + HBase

      // Mappin' over 1 day of pixel traffic
      if (isConversionEvent(event)) {
        context.write(cookie, event);
      }

      // Reducin'
      Get get = new Get(Bytes.toBytes(cookie));
      get.setMaxVersions(10000);
      Result row = triggers.get(get);

      List<Event> events = rowToEvents(row);

      // Iterate over the last 30 days of this cookie's traffic
      Event lastEvent = null;
      for (Event ev : events) {
        if (isConversionEvent(ev)) {
          if (lastEvent != null) {
            context.write(cookie, ev);
          }
        }
      }
      // . . .

  55. // Better approach using Hadoop + HBase

      // Mappin' over 1 day of pixel traffic
      if (isConversionEvent(event)) {
        context.write(cookie, event);
      }

      Get get = new Get(Bytes.toBytes(cookie));
      get.setMaxVersions(10000);
      Result row = triggers.get(get);

      List<Event> events = rowToEvents(row);

      // Iterate over the last 30 days of this cookie's traffic
      Event lastEvent = null;
      for (Event ev : events) {
        if (isConversionEvent(ev)) {
          if (lastEvent != null) {
            context.write(cookie, ev);
          }
        }
      }

      impression impression impression impression click conversion
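The slides call a rowToEvents helper without showing it. One plausible sketch, in plain Java: HBase's setMaxVersions(10000) Get returns up to that many timestamped versions of a cell, and the helper flattens them into a timestamp-ordered event list. A TreeMap stands in for the versioned cells; the encoding is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RowToEventsSketch {
    // versions maps cell timestamp -> stored value, mimicking one versioned
    // HBase cell; each put to the same row/column adds another version
    static List<String> rowToEvents(TreeMap<Long, String> versions) {
        List<String> events = new ArrayList<>();
        // TreeMap iterates in ascending key order, so the attribution loop
        // can scan this cookie's traffic oldest-to-newest
        for (Map.Entry<Long, String> cell : versions.entrySet()) {
            events.add(cell.getKey() + ":" + cell.getValue());
        }
        return events;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> cell = new TreeMap<>();
        cell.put(3_000L, "click");        // each entry mimics one stored version
        cell.put(1_000L, "impression");
        cell.put(2_000L, "impression");
        System.out.println(rowToEvents(cell));  // [1000:impression, 2000:impression, 3000:click]
    }
}
```

Storing every version of the trigger cell this way is what gives the reducer random access to a cookie's full 30-day history from a single Get, instead of re-mapping 29 days of logs.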
  56. So…
      • Store MapReduce output in HBase when you don't want to consume it sequentially all at once

  57. So…
      • Store MapReduce output in HBase when you don't want to consume it sequentially all at once
      • Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)

  58. So…
      • Store MapReduce output in HBase when you don't want to consume it sequentially all at once
      • Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)
      • Use HBase to keep state between MapReduce jobs
  59. ? jobs.engineers@adroll.com Thank you! We’re hiring! derek@adroll.com Derek Nelson