Ad Networks analytics using Hadoop and Splout SQL by IVÁN DE PRADO at Big Data Spain 2013

Ad Networks act as the middleman between advertisers and publishers on the Internet. The advertiser is the agent that wants to place a particular ad in different media. The publisher is the agent that owns the media, which are usually web pages or mobile applications.
Session presented at Big Data Spain 2013 Conference
8th Nov 2013
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2013/conference/ad-networks-analytics-using-hadoop-and-splout-sql

Big Data Spain

November 14, 2013

Transcript

  1. Ad Networks analytics using Hadoop and Splout SQL Iván de

    Prado
  2. Ad Networks analytics using Hadoop and Splout SQL Iván de

    Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt
  3. Big Data consulting & training

  5. Agenda

    1. Analytics for Ad Networks 2. Our solution (2.1 Hadoop + Splout SQL, 2.2 Splout SQL in detail, 2.3 Pre-aggregations vs. sampling) 3. Conclusions
  6. Analytics for Ad Networks

  7. Ad Networks Principal agents › Advertiser › Publisher › Web

    pages › Mobile apps Ad Network › Network of agents that mediate between advertisers and publishers › DSPs, SSPs, DMPs, ADTs, ITDs, etc
  8. For the sake of simplicity...

    Let’s consider a monolithic Ad Network › Single agent between advertisers and publishers But the solution presented here is also useful for DSPs, SSPs, DMPs, etc.
  9. Need for analytics

    For advertisers › Monitoring campaigns › Improving ROI For publishers › Improving ad placement But there can be › Tens of thousands of advertisers › Hundreds of thousands of publishers
  10. Analytics Counting impressions, clicks and CPC › For a given

    range of dates › Filtered by › Campaign › Location › Language › Browser/device › Ad type › ... or any combination of the above!
  11. Two-fold usage

    Operational › For invoicing, accounting, etc. › Limited set of parameter variations › Fixed date ranges and common aggregations › Exact results expected Exploratory › Unlimited variations of parameters › Ad-hoc filtering › Approximate results are enough
  12. Challenges Billions of events and hundreds of gigabytes per day

    › Need for a distributed system Query flexibility › Need to cope with operational and exploratory queries Web latencies › Queries must return in milliseconds
  13. Exploding

    The data needed to serve analytics panels is Big Data › Thousands of advertiser panels › Even more publisher panels But each individual agent's panel can be served by one machine › At least for 98% of advertisers/publishers › Horizontal partitioning is a good strategy
  14. Our solution

  15. Our solution

  16. Hadoop Scalable › Storage of raw data › Computing capabilities

    Good for › Creating pre-computed aggregations (views) › Generating samples of data Bad for › Serving data › On-line aggregations
  17. Splout SQL

    Scalable › Serving of full SQL queries (unlike NoSQL stores) Good for › Ad-hoc aggregations over pre-computed views › Serving low-latency web pages with concurrency
  18. A well-balanced solution

    Hadoop › Provides a scalable repository for impressions › Performs off-line pre-aggregations and sampling Splout SQL › Serves queries › Performs on-line aggregations with sub-second latency › Each partition contains data for only a few agents, which ensures performance
  19. Splout SQL (in detail)

  20. Splout SQL in detail Isolation between generation and serving

  21. Splout SQL Architecture

  22. Generation: tablespace T_ADVERTISERS, generated with 2 partitions

    Source tables (both partitioned by AID):
      ADVERTISERS (AID, Name): (U20, Doug), (U21, Ted), (U40, John)
      IMPRESSIONS (PID, AID, Amount): (S100, U20, 102), (S101, U20, 60), (S223, U40, 99)
    Partition U10 – U35:
      ADVERTISERS: (U20, Doug), (U21, Ted)
      IMPRESSIONS: (S100, U20, 102), (S101, U20, 60)
    Partition U36 – U60:
      ADVERTISERS: (U40, John)
      IMPRESSIONS: (S223, U40, 99)
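The generation step on this slide can be illustrated with a minimal Python sketch. It is not Splout SQL's generator (which runs on Hadoop); it only shows range partitioning by key. The names `LOWEST_KEY`, `UPPER_BOUNDS`, `partition_for` and `split_rows` are hypothetical, and the sketch assumes keys are equal-length strings so that plain string comparison matches the intended ordering.

```python
from bisect import bisect_left
from collections import defaultdict

# Assumed partition map taken from the slide: U10-U35 and U36-U60.
LOWEST_KEY = "U10"
UPPER_BOUNDS = ["U35", "U60"]  # inclusive upper bound of each partition


def partition_for(aid):
    """Return the index of the partition whose key range contains `aid`."""
    if aid < LOWEST_KEY or aid > UPPER_BOUNDS[-1]:
        raise KeyError(f"key {aid!r} is outside every partition range")
    # bisect_left finds the first upper bound that is >= aid.
    return bisect_left(UPPER_BOUNDS, aid)


def split_rows(rows, key_index):
    """Group rows by partition, using the column at `key_index` as the key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_for(row[key_index])].append(row)
    return dict(partitions)


# Rows as shown on the slide.
ADVERTISERS = [("U20", "Doug"), ("U21", "Ted"), ("U40", "John")]
IMPRESSIONS = [("S100", "U20", 102), ("S101", "U20", 60), ("S223", "U40", 99)]

advertiser_parts = split_rows(ADVERTISERS, key_index=0)
impression_parts = split_rows(IMPRESSIONS, key_index=1)
```

Because both tables are split by the same key, all rows for a given advertiser end up in the same partition, which is what makes a per-advertiser join answerable by a single partition.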
  23. API - Generation Command line Loading CSV files Java API

    HCatalog $ hadoop jar splout-*-hadoop.jar generate … Hive Pig
  24. Serving: query for key = 'U20', tablespace = 'T_ADVERTISERS'

    SELECT Name, sum(Amount) FROM ADVERTISERS a, IMPRESSIONS i WHERE a.AID = i.AID AND a.AID = 'U20'; The key 'U20' routes the query to Partition U10 – U35, which holds ADVERTISERS (U20, Doug), (U21, Ted) and IMPRESSIONS (S100, U20, 102), (S101, U20, 60). Partition U36 – U60, which holds ADVERTISERS (U40, John) and IMPRESSIONS (S223, U40, 99), is not queried.
  25. Serving: query for key = 'U40', tablespace = 'T_ADVERTISERS'

    SELECT Name, sum(Amount) FROM ADVERTISERS a, IMPRESSIONS i WHERE a.AID = i.AID AND a.AID = 'U40'; The key 'U40' routes the query to Partition U36 – U60, which holds ADVERTISERS (U40, John) and IMPRESSIONS (S223, U40, 99). Partition U10 – U35, which holds ADVERTISERS (U20, Doug), (U21, Ted) and IMPRESSIONS (S100, U20, 102), (S101, U20, 60), is not queried.
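The two serving slides can be mimicked with a small Python sketch. It is only an illustration of key-based routing, not Splout SQL's API: two in-memory SQLite databases stand in for the partitions, and `make_partition`, `query_by_key` and `UPPER_BOUNDS` are hypothetical names. The query is the one from the slides, with the key column qualified and an explicit GROUP BY added.

```python
import sqlite3
from bisect import bisect_left

# Assumed partition map from the slides: U10-U35 and U36-U60 (inclusive upper bounds).
UPPER_BOUNDS = ["U35", "U60"]


def make_partition(advertisers, impressions):
    """Build one stand-in partition as an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ADVERTISERS (AID TEXT PRIMARY KEY, Name TEXT)")
    conn.execute(
        "CREATE TABLE IMPRESSIONS (PID TEXT PRIMARY KEY, AID TEXT, Amount INTEGER)"
    )
    conn.executemany("INSERT INTO ADVERTISERS VALUES (?, ?)", advertisers)
    conn.executemany("INSERT INTO IMPRESSIONS VALUES (?, ?, ?)", impressions)
    return conn


PARTITIONS = [
    make_partition(
        [("U20", "Doug"), ("U21", "Ted")],
        [("S100", "U20", 102), ("S101", "U20", 60)],
    ),
    make_partition([("U40", "John")], [("S223", "U40", 99)]),
]

QUERY = """
SELECT a.Name, SUM(i.Amount)
FROM ADVERTISERS a JOIN IMPRESSIONS i ON a.AID = i.AID
WHERE a.AID = ?
GROUP BY a.Name
"""


def query_by_key(key):
    """Route the query to the single partition whose range contains `key`."""
    conn = PARTITIONS[bisect_left(UPPER_BOUNDS, key)]
    return conn.execute(QUERY, (key,)).fetchall()


doug_total = query_by_key("U20")  # answered by the first partition only
john_total = query_by_key("U40")  # answered by the second partition only
```

Each call opens exactly one partition, so query cost depends on the size of that partition rather than on the size of the whole tablespace.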
  26. API - Service Rest API JSON response

  27. API - Console

  28. Pre-aggregations vs. Sampling

  29. Operational usage Invoicing, accounting, monitoring, etc. › Exact results ›

    Constrained space of aggregations Pre-computed aggregates done in Hadoop › For example: › per day › per day per location Extended aggregations done on-line › Using Splout SQL › For example, aggregate per week based on daily stats
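The "aggregate per week based on daily stats" example can be sketched as follows. This is an illustration under assumptions: the `daily_stats` table, its columns and the sample figures are invented for the example, and an in-memory SQLite database stands in for a serving partition holding the daily aggregates that Hadoop would have pre-computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-computed daily view (in the talk this would come from Hadoop).
conn.execute(
    "CREATE TABLE daily_stats (day TEXT, campaign TEXT, impressions INTEGER, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO daily_stats VALUES (?, ?, ?, ?)",
    [
        ("2013-11-04", "c1", 1000, 10),  # Monday
        ("2013-11-05", "c1", 1200, 14),  # Tuesday
        ("2013-11-10", "c1", 800, 6),    # Sunday of the same week
        ("2013-11-11", "c1", 500, 5),    # Monday of the following week
    ],
)

# 'weekday 0' moves forward to Sunday (or stays if already Sunday);
# subtracting 6 days then yields the Monday that starts that week.
WEEKLY_QUERY = """
SELECT date(day, 'weekday 0', '-6 days') AS week_start,
       campaign,
       SUM(impressions),
       SUM(clicks)
FROM daily_stats
GROUP BY week_start, campaign
ORDER BY week_start
"""

weekly = conn.execute(WEEKLY_QUERY).fetchall()
```

The weekly totals stay exact because sums of daily sums are still sums; the same would not hold for measures such as distinct users, which cannot be rolled up from daily values.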
  30. Why not pre-compute everything?

    That would require one table for each dimension combination › For two dimensions (day, location): › day › location › location, day For n dimensions › $2^n - 1$ combinations › It explodes!
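The growth in the number of tables can be checked with a short Python sketch that enumerates every non-empty subset of a list of dimensions. The function name `dimension_combinations` and the five-dimension example are illustrative choices, not part of the talk.

```python
from itertools import combinations


def dimension_combinations(dimensions):
    """Return every non-empty combination of dimensions (one table each)."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


# The two-dimension case from the slide: 2^2 - 1 = 3 tables.
two_dims = dimension_combinations(["day", "location"])

# Five filter dimensions already need 2^5 - 1 = 31 tables.
five_dims = dimension_combinations(
    ["day", "campaign", "location", "language", "device"]
)

# Twenty dimensions would need more than a million tables.
tables_for_twenty = 2 ** 20 - 1
```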
  31. Exploratory usage

    Ad-hoc filters to learn from data › Approximate results are enough Intensive use of sampling › It can provide good accuracy with fast response Confidence interval › p = proportion › n = sample size › z = quantile of the normal distribution › $p \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$
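The confidence interval on this slide is the standard normal-approximation interval for a proportion, and it can be evaluated with a few lines of Python. The function name and the example figures (a 2% click rate observed in a sample of 100,000 rows) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p, n, confidence=0.95):
    """Normal-approximation interval p +/- z_{alpha/2} * sqrt(p * (1 - p) / n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 for 95%
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# Hypothetical example: 2% of a 100,000-row sample are clicks.
low, high = proportion_confidence_interval(p=0.02, n=100_000)
margin_100k = (high - low) / 2

# Quadrupling the sample size halves the margin of error.
low4, high4 = proportion_confidence_interval(p=0.02, n=400_000)
margin_400k = (high4 - low4) / 2
```

The margin shrinks with the square root of the sample size, which is why a narrow filter that leaves only a few matching sample rows gives a much wider interval.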
  32. Samples

    Created on Hadoop › Different sample sets › For last X days › For last year Splout SQL for serving them › On-line analytics over samples › 1 million records per second* (44 bytes per row) › Faster with data in memory • Warming data prior to use • 2.7 million records per second* * Measured on a laptop
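How a count over a sample turns into an estimate for the full data set can be shown with a small Python sketch. It assumes uniform (Bernoulli) sampling at a known rate, which the slides do not specify; the event layout, the 10% rate and the names `bernoulli_sample` and `estimate_count` are all hypothetical.

```python
import random


def bernoulli_sample(events, rate, seed):
    """Keep each event independently with probability `rate`."""
    rng = random.Random(seed)
    return [event for event in events if rng.random() < rate]


def estimate_count(sample, rate, predicate):
    """Scale the number of matching sample rows back up by 1 / rate."""
    matches = sum(1 for event in sample if predicate(event))
    return matches / rate


# Synthetic impressions: three campaigns with 10,000 events each.
events = [{"campaign": f"c{i % 3}"} for i in range(30_000)]

RATE = 0.1
sample = bernoulli_sample(events, RATE, seed=42)

true_count = sum(1 for e in events if e["campaign"] == "c0")
estimated_count = estimate_count(sample, RATE, lambda e: e["campaign"] == "c0")
```

The estimate is close for a filter that matches many rows; a filter matching only a handful of sample rows would produce a far noisier figure.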
  33. Pre-aggregations pros & cons

    Advantages › Exact results › Good for exploring the long tail Limitations › Only for a limited number of aggregation combinations › Not good for exploratory analysis
  34. Sampling pros & cons

    Advantages › Fast filtering for any set of dimensions › Good accuracy for Top N queries Limitations › Bad for narrow dimension filters › Bad for exploring the long tail › Approximate results
  35. Conclusions

  36. Conclusions

    Analytics in Ad Networks is a complex problem › Due to the amount of data › Due to the number of agents It can be solved using Hadoop + Splout SQL › By the use of partitioning › Using pre-aggregations › For operational usage › Using sampling › For exploratory profiles
  37. Questions? Iván de Prado Alonso – CEO of Datasalt www.datasalt.es

    @ivanprado @datasalt