
Hadoop + HBase Synergy

An overview of several AdRoll services that thrive on using Hadoop in conjunction with HBase

Derek Nelson

June 14, 2013

Transcript

  1. All MapReduce jobs are run in AWS Elastic MapReduce. 200-300 EMR nodes at any given time; 15-30 different jobs running at any given time.
  2. All MapReduce jobs are run in AWS Elastic MapReduce. 200-300 EMR nodes at any given time; 15-30 different jobs running at any given time. HBase (0.92): 10-12 nodes running in AWS, c1.xlarges (8 cores, 7GB RAM, 1.6TB RAID0), plus 2 c1.mediums running Thrift servers.
  3. <advertiser, product> main:image_path main:description main:price Crawl Parse Write to HBase

    •  These components can now be used to dynamically assembly ad units
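
A minimal sketch of what the write-to-HBase step above might look like with the 0.92-era HBase client API. The table name, row-key encoding, and values are assumptions; only the <advertiser, product> key design and the main:* columns come from the slide.

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.client.HTable;
     import org.apache.hadoop.hbase.client.Put;
     import org.apache.hadoop.hbase.util.Bytes;

     public class ProductWriter {
       public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HTable table = new HTable(conf, "products");  // hypothetical table name

         // Row key: <advertiser, product>
         Put put = new Put(Bytes.toBytes("advertiser123|product456"));
         put.add(Bytes.toBytes("main"), Bytes.toBytes("image_path"),
                 Bytes.toBytes("/images/product456.jpg"));
         put.add(Bytes.toBytes("main"), Bytes.toBytes("description"),
                 Bytes.toBytes("Red running shoes"));
         put.add(Bytes.toBytes("main"), Bytes.toBytes("price"),
                 Bytes.toBytes("59.99"));

         table.put(put);
         table.close();
       }
     }
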
  4. <advertiser, 2013-04-01, campaign> <advertiser, 2013-04-02, campaign> <advertiser, 2013-04-03, campaign> <advertiser,

    2013-04-05, campaign> … <advertiser, 2013-04-04, campaign> { main:impressions: 45595, main:clicks: 71, main:cost: 60.80 main:conversions: 25 }
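
Because the date sits in the middle of the row key, pulling one advertiser's metrics for a date range is a plain row-key range scan. A sketch, again with a hypothetical table name and key encoding:

     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.client.HTable;
     import org.apache.hadoop.hbase.client.Result;
     import org.apache.hadoop.hbase.client.ResultScanner;
     import org.apache.hadoop.hbase.client.Scan;
     import org.apache.hadoop.hbase.util.Bytes;

     public class CampaignMetricsScan {
       public static void main(String[] args) throws Exception {
         HTable table = new HTable(HBaseConfiguration.create(), "campaign_metrics");

         // Row keys sort lexicographically, so all of an advertiser's
         // <advertiser, date, campaign> rows cluster together in date order
         Scan scan = new Scan();
         scan.setStartRow(Bytes.toBytes("advertiser123|2013-04-01"));
         scan.setStopRow(Bytes.toBytes("advertiser123|2013-04-06"));  // stop row is exclusive
         scan.addFamily(Bytes.toBytes("main"));

         ResultScanner scanner = table.getScanner(scan);
         for (Result row : scanner) {
           byte[] impressions = row.getValue(Bytes.toBytes("main"),
                                             Bytes.toBytes("impressions"));
           // ... read main:clicks, main:cost, main:conversions the same way
         }
         scanner.close();
         table.close();
       }
     }
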
  5. How many unique cookies have visited my site in the past 3 days? If I combine these two targeting lists, how many unique cookies will I be reaching? How many unique cookies have visited my site AND seen an impression in the past day?
  6. How many unique cookies have visited my site in the past 3 days? How many unique cookies have visited my site AND seen an impression in the past day?
  7. Same questions, posed as set operations. [Diagram: site traffic = day 0 ∪ day 1 ∪ day 2; that union is then combined with impression traffic.]
  8. Same questions. [Diagram: the same unions annotated with sizes: each per-day cookie set is 32MB; the union of two is 64MB.]
  9. Same questions. [Diagram: the raw-set approach marked ✗; storing and unioning 32MB-per-day cookie sets does not scale.]
  10. HyperLogLog++ (with MinHash) • Lossless unions with other HLL counters • Intersection sizes with other HLL counters • Set cardinality (no membership checks)
  11. HyperLogLog++ (with MinHash): 200M uniques in 50KB with 1% margin of error • Lossless unions with other HLL counters • Intersection sizes with other HLL counters • Set cardinality (no membership checks)
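
The deck's HLLCounter class isn't shown. As a stand-in, here's a minimal sketch of the same properties using the open-source stream-lib HyperLogLogPlus (an assumption, not necessarily what AdRoll used). The intersection here is the basic inclusion-exclusion estimate; the MinHash variant on the slide gives better intersection estimates.

     import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

     public class HllDemo {
       public static void main(String[] args) throws Exception {
         // Precision p=14 gives roughly a 1% standard error
         HyperLogLogPlus siteVisitors = new HyperLogLogPlus(14);
         HyperLogLogPlus impressions  = new HyperLogLogPlus(14);

         siteVisitors.offer("cookie-a");
         siteVisitors.offer("cookie-b");
         impressions.offer("cookie-b");
         impressions.offer("cookie-c");

         // Lossless union: |A ∪ B|
         HyperLogLogPlus union =
             (HyperLogLogPlus) siteVisitors.merge(impressions);
         long unionSize = union.cardinality();

         // Intersection estimate: |A ∩ B| = |A| + |B| - |A ∪ B|
         long intersection = siteVisitors.cardinality()
             + impressions.cardinality() - unionSize;

         System.out.println("union=" + unionSize
             + " intersection~=" + intersection);
       }
     }
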
  12. // Building the HLL counters with Hadoop

     // Mapping over some log data
     context.write(advertiser, cookie);
  13. // Building the HLL counters with Hadoop

     // Reducer
     HLLCounter hll = new HLLCounter();
     for (Text cookie : cookies) {
       hll.put(cookie.get());
     }

     // Row key is <advertiser, day>
     key.set(advertiser, day);
     Put put = new Put(key.getBytes());
     put.add(Bytes.toBytes("main"),
             Bytes.toBytes("hll"),   // the main:hll column
             hll.toBytes());         // counter serialization (method name assumed)

     // Put it into HBase
     context.write(key, put);
  14. <adver<ser0,  day0>   HLLCounter0   <adver<ser0,  day1>   HLLCounter1  

    <adver<ser0,  day2>   HLLCounter2   <adver<ser1,  day0>   HLLCounter0   <adver<ser1,  day1>   HLLCounter1   key   value   //  Reducer   HLLCounter  hll  =  new  HLLCounter();   for  (Text  cookie  :  cookies)  {    hll.put(cookie.get());   }     key.set(advertiser,  day);   Put  put  =  new  Put();   put.add(Bytes.toBytes(“main”),          Bytes.toBytes(hll));     //  Put  it  into  HBase   context.write(row,  put);   //  Mapping  over  some  log  data   context.write(advertiser,  cookie);   //  Building  the  HLL  counters  with  Hadoop  
  15. // Querying the HLL counters in HBase

     Thrift:

     Text rpc(
         /** name of func to call */
         1:Text func,
         /** JSON parameters */
         2:Text params)
  16. The Thrift client makes a call to the custom Thrift server (the rpc() method above).
  17. The Thrift RPC uses AggregationClient with a custom ColumnInterpreter (CI) to do data-local set operations on the HLL counters.
  18. AggregationClient returns the cardinality of the resulting HLL counter.
  19. The custom Thrift server returns: {"size": 40302333}
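
The handler behind that rpc() call isn't shown in the deck, so the sketch below is hypothetical in every name: HLLCounter stands in for the unshown counter class, and HllColumnInterpreter for the custom ColumnInterpreter whose "add" is HLL union, so that AggregationClient's sum over a row-key range produces the merged counter region-locally. Only the overall shape (JSON params in, a size out, AggregationClient with a custom CI) comes from the slides.

     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.client.Scan;
     import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
     import org.apache.hadoop.hbase.util.Bytes;

     public class HllRpcHandler {
       private final AggregationClient aggClient =
           new AggregationClient(HBaseConfiguration.create());

       // Would implement the Thrift-generated interface for rpc(func, params)
       public String rpc(String func, String jsonParams) throws Throwable {
         // A real handler would parse jsonParams to build the Scan
         Scan scan = new Scan();
         scan.setStartRow(Bytes.toBytes("advertiser123|2013-04-01"));
         scan.setStopRow(Bytes.toBytes("advertiser123|2013-04-04"));

         if ("union_size".equals(func)) {  // hypothetical func name
           // HllColumnInterpreter's "sum" is HLL union, evaluated
           // data-locally by the aggregation coprocessor
           HLLCounter merged = aggClient.sum(
               Bytes.toBytes("hll_counters"),  // hypothetical table name
               new HllColumnInterpreter(), scan);
           return "{\"size\": " + merged.cardinality() + "}";
         }
         throw new IllegalArgumentException("unknown func: " + func);
       }
     }
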
  20. How do we know when we caused a conversion? [Diagram: a cookie's timeline of impressions and a click, followed by conversions: one right after the click ✔, one after recent impressions ✔, one ambiguous ?, and one with no recent ad events ✗.]
  21. // Naïve approach using only Hadoop

     // Mappin' over 30 days of data
     context.write(cookie, event);

     // Reducin'
     Collections.sort(events, BY_TIMESTAMP_ASC);
     Event lastEvent = null;

     // Iterate over the last 30 days of this cookie's traffic
     for (Event ev : events) {
       if (isConversionEvent(ev)) {
         if (lastEvent != null) {
           // Attribute the conversion to the most recent ad event
           context.write(cookie, ev);
         }
       } else {
         // Remember the most recent impression/click
         lastEvent = ev;
       }
     }
     // . . .
  22. // Problems…
     • Every day you're mapping over the same 29 days you looked at the previous day
     • We'd only need to look at the day we're generating conversions for…
     • If only we had random access into all possible conversion-triggering events…
  23. // Better approach using Hadoop + HBase

     // Mappin' over 1 day of pixel traffic
     if (isConversionEvent(event)) {
       context.write(cookie, event);
     }

     // Reducin'
     // Random access into the table holding this cookie's triggering events
     Get get = new Get(Bytes.toBytes(cookie));
     get.setMaxVersions(10000);
     Result row = triggers.get(get);

     List<Event> events = rowToEvents(row);

     // Iterate over the last 30 days of this cookie's traffic
     Event lastEvent = null;
     for (Event ev : events) {
       if (isConversionEvent(ev)) {
         if (lastEvent != null) {
           context.write(cookie, ev);
         }
       } else {
         lastEvent = ev;
       }
     }
     // . . .
  24. [Same code as the previous slide, shown alongside the conversion timeline: impression, impression, …, click, conversion.]
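
The better approach presumes a triggers table that already holds each cookie's recent ad events, which is the "keep state between MapReduce jobs" idea from the closing slides. A sketch of how a daily job might maintain it, using HBase cell versions keyed by event timestamp (which is why the read side calls setMaxVersions(10000); the table would need a matching VERSIONS setting at creation). Table name and event serialization are assumptions.

     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.client.HTable;
     import org.apache.hadoop.hbase.client.Put;
     import org.apache.hadoop.hbase.util.Bytes;

     public class TriggerWriter {
       public static void main(String[] args) throws Exception {
         HTable triggers = new HTable(HBaseConfiguration.create(), "triggers");

         // One cell per event, versioned by the event's timestamp, so a
         // single Get(cookie).setMaxVersions(...) returns the recent history
         byte[] cookie = Bytes.toBytes("cookie-abc");
         long eventTimestamp = 1365000000000L;  // ms since epoch
         byte[] serializedEvent = Bytes.toBytes("impression|campaign42");

         Put put = new Put(cookie);
         put.add(Bytes.toBytes("main"), Bytes.toBytes("event"),
                 eventTimestamp, serializedEvent);
         triggers.put(put);
         triggers.close();
       }
     }
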
  25. So… • Store MapReduce output in HBase when you don't want to consume it sequentially all at once
  26. So… • Store MapReduce output in HBase when you don't want to consume it sequentially all at once • Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here)
  27. So… • Store MapReduce output in HBase when you don't want to consume it sequentially all at once • Build interesting data structures with Hadoop, and store them in HBase for querying (remember to customize your Thrift handler here) • Use HBase to keep state between MapReduce jobs