Ad Networks analytics using Hadoop and Splout SQL by IVÁN DE PRADO at Big Data Spain 2013

Ad Networks act as the middleman between advertisers and publishers on the Internet. The advertiser is the agent that wants to place a particular ad in different media. The publisher is the agent that owns the media, which are usually web pages or mobile applications.
Session presented at Big Data Spain 2013 Conference
8th Nov 2013
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2013/conference/ad-networks-analytics-using-hadoop-and-splout-sql

Big Data Spain

November 14, 2013

Transcript

  1. Ad Networks analytics using Hadoop and Splout SQL
     Iván de Prado Alonso – CEO of Datasalt
     www.datasalt.es @ivanprado @datasalt
  2. Agenda
     1. Analytics for Ad Networks
     2. Our solution
        1. Hadoop + Splout SQL
        2. Splout SQL in detail
        3. Pre-aggregations vs. sampling
     3. Conclusions
  3. Ad Networks
     Principal agents
       › Advertiser
       › Publisher
         › Web pages
         › Mobile apps
     Ad Network
       › Network of agents that mediate between advertisers and publishers
       › DSPs, SSPs, DMPs, ADTs, ITDs, etc.
  4. For the sake of simplicity...
     Let's consider a monolithic Ad Network
       › Single agent between advertisers and publishers
     But the solution presented is also useful for DSPs, SSPs, DMPs, etc.
  5. Need for analytics
     For advertisers
       › Monitoring campaigns
       › Improving ROI
     For publishers
       › Improving ad placement
     But there can be
       › Tens of thousands of advertisers
       › Hundreds of thousands of publishers
  6. Analytics
     Counting impressions, clicks and CPC
       › For a given range of dates
       › Filtered by
         › Campaign
         › Location
         › Language
         › Browser/device
         › Ad type
       › ... or any combination of the above!
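The "any combination of the above" requirement can be sketched as a query builder that adds one `WHERE` clause per requested filter. This is only an illustration: the `events` table, its columns, the `cost` field used to derive CPC, and the helper names are assumptions made for the example, and an in-memory SQLite database stands in for the real serving layer.

```python
import sqlite3

# Hypothetical dimensions; only these column names may be filtered on.
ALLOWED_FILTERS = ("campaign", "location", "language", "device", "ad_type")

def build_query(filters):
    """Return (sql, params) for a date range plus any combination of filters."""
    unknown = set(filters) - set(ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    clauses = ["day BETWEEN ? AND ?"]
    params = []
    for column in ALLOWED_FILTERS:  # fixed order keeps the SQL deterministic
        if column in filters:
            clauses.append(f"{column} = ?")
            params.append(filters[column])
    sql = ("SELECT COALESCE(SUM(impressions), 0), COALESCE(SUM(clicks), 0), "
           "COALESCE(SUM(cost), 0.0) FROM events WHERE " + " AND ".join(clauses))
    return sql, params

def stats(conn, start, end, **filters):
    """Return (impressions, clicks, cost per click) for the given filters."""
    sql, params = build_query(filters)
    impressions, clicks, cost = conn.execute(sql, [start, end, *params]).fetchone()
    cpc = cost / clicks if clicks else None
    return impressions, clicks, cpc

# Tiny made-up data set, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, campaign TEXT, location TEXT, "
             "language TEXT, device TEXT, ad_type TEXT, "
             "impressions INTEGER, clicks INTEGER, cost REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", [
    ("2013-11-01", "c1", "ES", "es", "mobile", "banner", 1000, 10, 5.0),
    ("2013-11-02", "c1", "ES", "es", "desktop", "banner", 2000, 30, 15.0),
    ("2013-11-02", "c2", "FR", "fr", "mobile", "video", 500, 5, 4.0),
])

print(stats(conn, "2013-11-01", "2013-11-02", campaign="c1"))  # (3000, 40, 0.5)
```

Filter values are passed as bound parameters and column names are checked against a fixed whitelist, so arbitrary combinations can be served without concatenating user input into the SQL.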
  7. Two-fold usage
     Operational
       › For invoicing, accounting, etc.
       › Limited set of parameter variations
       › Fixed date ranges and common aggregations
       › Exact results expected
     Exploratory
       › Unlimited variations of parameters
       › Ad-hoc filtering
       › Approximate results are enough
  8. Challenges
     Billions of events and hundreds of gigabytes per day
       › Need for a distributed system
     Query flexibility
       › Need to cope with operational and exploratory queries
     Web latencies
       › Queries must return in milliseconds
  9. Exploding
     Data needed to serve analytics panels is Big Data
       › Thousands of advertiser panels
       › Even more for publisher panels
     But individually each agent panel can be served with one machine
       › At least for 98% of advertisers/publishers
       › Horizontal partitioning is a good strategy
  10. Hadoop
      Scalable
        › Storage of raw data
        › Computing capabilities
      Good for
        › Creating pre-computed aggregations (views)
        › Generating samples of data
      Bad for
        › Serving data
        › On-line aggregations
  11. Splout SQL
      Scalable
        › Serving of full SQL queries (unlike NoSQLs)
      Good for
        › Ad-hoc aggregations over pre-computed views
        › Serving low-latency web pages with concurrency
  12. A well-balanced solution
      Hadoop
        › Provides a scalable repository for impressions
        › Performs off-line pre-aggregations and sampling
      Splout SQL
        › Serves queries
        › Performs on-line aggregations in sub-second latencies
        › Each partition contains only data for a few agents, which ensures performance
  13. Generation
      Generate tablespace T_ADVERTISERS with 2 partitions
        › table ADVERTISERS, partitioned by AID
        › table IMPRESSIONS, partitioned by AID

      Input tables

        ADVERTISERS        IMPRESSIONS
        AID   Name         PID    AID   Amount
        U20   Doug         S100   U20   102
        U21   Ted          S101   U20   60
        U40   John         S223   U40   99

      Tablespace T_ADVERTISERS

        Partition U10 – U35
        ADVERTISERS        IMPRESSIONS
        AID   Name         PID    AID   Amount
        U20   Doug         S100   U20   102
        U21   Ted          S101   U20   60

        Partition U36 – U60
        ADVERTISERS        IMPRESSIONS
        AID   Name         PID    AID   Amount
        U40   John         S223   U40   99
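The partitioning step on this slide can be sketched as a range lookup on the partition key. This is a minimal illustration, not Splout SQL's actual generation code: the `UPPER_BOUNDS` list, `partition_for` and `split` are hypothetical names, and the key ranges are simply the two shown on the slide.

```python
from bisect import bisect_left

# Inclusive upper bound of each partition's key range, as on the slide:
# partition 0 holds U10..U35, partition 1 holds U36..U60.
LOWEST_KEY = "U10"
UPPER_BOUNDS = ["U35", "U60"]

def partition_for(aid):
    """Return the index of the partition whose key range contains `aid`."""
    index = bisect_left(UPPER_BOUNDS, aid)
    if aid < LOWEST_KEY or index == len(UPPER_BOUNDS):
        raise KeyError(f"{aid} is outside every partition range")
    return index

def split(rows, key_index):
    """Distribute rows into one list per partition, keyed on column `key_index`."""
    parts = [[] for _ in UPPER_BOUNDS]
    for row in rows:
        parts[partition_for(row[key_index])].append(row)
    return parts

advertisers = [("U20", "Doug"), ("U21", "Ted"), ("U40", "John")]
impressions = [("S100", "U20", 102), ("S101", "U20", 60), ("S223", "U40", 99)]

# Both tables are split on the same key (AID), so rows that join with each
# other always end up in the same partition.
advertiser_parts = split(advertisers, key_index=0)
impression_parts = split(impressions, key_index=1)
```

Because both tables share the partitioning key, a join on AID never needs data from more than one partition.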
  14. API - Generation
        › Command line
        › Loading CSV files
        › Java API
        › HCatalog
        › Hive
        › Pig
      $ hadoop jar splout-*-hadoop.jar generate …
  15. Serving
      SELECT Name, sum(Amount)
      FROM ADVERTISERS a, IMPRESSIONS i
      WHERE a.AID = i.AID AND a.AID = 'U20';

      For key = 'U20', tablespace='T_ADVERTIS…
        › Key U20 falls in partition U10 – U35

        Partition U10 – U35
        ADVERTISERS        IMPRESSIONS
        AID   Name         PID    AID   Amount
        U20   Doug         S100   U20   102
        U21   Ted          S101   U20   60

        Partition U36 – U60
        ADVERTISERS        IMPRESSIONS
        AID   Name         PID    AID   Amount
        U40   John         S223   U40   99
  16. Serving
      SELECT Name, sum(Amount)
      FROM ADVERTISERS a, IMPRESSIONS i
      WHERE a.AID = i.AID AND a.AID = 'U40';

      For key = 'U40', tablespace='T_ADVERTIS…
        › Key U40 falls in partition U36 – U60

        Partition U10 – U35
        ADVERTISERS        IMPRESSIONS
        AID   Name         PID    AID   Amount
        U20   Doug         S100   U20   102
        U21   Ted          S101   U20   60

        Partition U36 – U60
        ADVERTISERS        IMPRESSIONS
        AID   Name         PID    AID   Amount
        U40   John         S223   U40   99
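The two serving slides can be imitated with one small database per partition and a router that picks the partition from the key. This is a sketch under assumptions, not Splout SQL's API: in-memory SQLite databases stand in for the partitions, and `make_partition`, `PARTITIONS` and `query_by_key` are hypothetical names. The query qualifies the filter column as `a.AID`, since an unqualified `AID` would be ambiguous between the two tables.

```python
import sqlite3
from bisect import bisect_left

# Inclusive upper bounds of the two key ranges shown on the slides.
UPPER_BOUNDS = ["U35", "U60"]

def make_partition(advertisers, impressions):
    """Build one in-memory database holding a single partition's rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ADVERTISERS (AID TEXT, Name TEXT)")
    db.execute("CREATE TABLE IMPRESSIONS (PID TEXT, AID TEXT, Amount INTEGER)")
    db.executemany("INSERT INTO ADVERTISERS VALUES (?, ?)", advertisers)
    db.executemany("INSERT INTO IMPRESSIONS VALUES (?, ?, ?)", impressions)
    return db

PARTITIONS = [
    make_partition([("U20", "Doug"), ("U21", "Ted")],
                   [("S100", "U20", 102), ("S101", "U20", 60)]),
    make_partition([("U40", "John")], [("S223", "U40", 99)]),
]

QUERY = """
SELECT a.Name, SUM(i.Amount)
FROM ADVERTISERS a JOIN IMPRESSIONS i ON a.AID = i.AID
WHERE a.AID = ?
GROUP BY a.Name
"""

def query_by_key(key):
    """Run the query only on the partition whose range contains `key`."""
    index = bisect_left(UPPER_BOUNDS, key)
    if index == len(UPPER_BOUNDS):
        raise KeyError(f"{key} is outside every partition range")
    return PARTITIONS[index].execute(QUERY, (key,)).fetchall()

print(query_by_key("U20"))  # [('Doug', 162)]
print(query_by_key("U40"))  # [('John', 99)]
```

Each query touches a single small database, which is what keeps latency low regardless of how many advertisers exist overall.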
  17. Operational usage
      Invoicing, accounting, monitoring, etc.
        › Exact results
        › Constrained space of aggregations
      Pre-computed aggregates done in Hadoop
        › For example:
          › per day
          › per day per location
      Extended aggregations done on-line
        › Using Splout SQL
        › For example, aggregate per week based on daily stats
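The "aggregate per week based on daily stats" example might look like the following. It is a sketch only: the `daily_stats` table and its columns are assumed for illustration, the figures are made up, and SQLite's `strftime('%W', ...)` (week of year, weeks starting on Monday) is used as one possible definition of a week.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-computed view: one row per advertiser per day.
conn.execute("CREATE TABLE daily_stats (aid TEXT, day TEXT, impressions INTEGER)")
conn.executemany("INSERT INTO daily_stats VALUES (?, ?, ?)", [
    ("U20", "2013-11-04", 100),  # Monday
    ("U20", "2013-11-05", 120),
    ("U20", "2013-11-10", 80),   # Sunday of the same week
    ("U20", "2013-11-11", 50),   # Monday of the following week
    ("U40", "2013-11-05", 999),
])

def weekly_impressions(conn, aid):
    """Roll the daily view up to weekly totals at query time."""
    return conn.execute(
        """
        SELECT strftime('%Y-%W', day) AS week, SUM(impressions)
        FROM daily_stats
        WHERE aid = ?
        GROUP BY week
        ORDER BY week
        """,
        (aid,),
    ).fetchall()

for week, total in weekly_impressions(conn, "U20"):
    print(week, total)
```

Only the daily view has to be pre-computed off-line; the weekly figures are derived from at most a few hundred rows per advertiser when the query arrives.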
  18. Why not pre-compute everything?
      Create one table per dimension combination
        › For two dimensions (day, location):
          › day
          › location
          › location, day
      For n dimensions
        › 2^n − 1 combinations
        › It explodes!
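The count on this slide is the number of non-empty subsets of the dimension set. A short enumeration makes the growth concrete; the dimension names below are just examples.

```python
from itertools import combinations

def dimension_combinations(dimensions):
    """Return every non-empty combination of the given dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]

print(dimension_combinations(["day", "location"]))
# [('day',), ('location',), ('day', 'location')]

for n in (2, 5, 10, 20):
    print(n, "dimensions ->", 2 ** n - 1, "tables")
```

With 20 dimensions that is already more than a million pre-aggregated tables, which is why only a constrained set of aggregations is pre-computed.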
  19. Exploratory usage
      Ad-hoc filters to learn from data
        › Approximate results are enough
      Intensive use of sampling
        › It can provide good accuracy with fast response
      Confidence interval
        › p = proportion
        › n = sample size
        › z = critical value of the standard normal distribution
        $p \pm z_{\alpha/2}\,\sqrt{\dfrac{p\,(1-p)}{n}}$
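The confidence interval on this slide is the standard normal-approximation interval for a proportion. A minimal sketch of computing it follows; the function name and the example figures (200 clicks in a sample of 10,000 impressions) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a sampled proportion."""
    p = successes / n
    # z_{alpha/2}: two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = proportion_interval(successes=200, n=10_000)
print(f"click-through rate between {low:.4f} and {high:.4f}")
```

The margin shrinks with the square root of the sample size, so quadrupling the sample only halves the width of the interval.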
  20. Samples
      Created on Hadoop
        › Different sample sets
          › For last X days
          › For last year
      Splout SQL for serving them
        › On-line analytics over samples
        › 1 million records per second* (44 bytes per row)
        › Faster with data in memory
          • Warming up data prior to use
          • 2.7 million records per second*
      * Measured on a laptop
  21. Pre-aggregations pros & cons
      Advantages
        › Exact results
        › Good for exploring the long tail
      Limitations
        › Only for a constrained number of aggregation combinations
        › Not good for exploratory analysis
  22. Sampling pros & cons
      Advantages
        › Fast filtering for any set of dimensions
        › Good accuracy for Top N queries
      Limitations
        › Bad for narrow dimension filters
        › Bad for exploring the long tail
        › Approximate results
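Why sampling is "bad for narrow dimension filters" can be seen by expressing the confidence margin relative to the proportion being estimated. The sketch below reuses the normal-approximation interval from the exploratory-usage slide; the function name, the sample size and the two selectivities are illustrative assumptions, and the approximation itself becomes unreliable when very few sampled rows match.

```python
from math import sqrt
from statistics import NormalDist

def relative_margin(selectivity, sample_size, confidence=0.95):
    """Half-width of the interval divided by the estimated proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(selectivity * (1 - selectivity) / sample_size)
    return margin / selectivity

sample_size = 10_000
for selectivity in (0.2, 0.0002):  # broad filter vs. narrow filter
    expected_rows = selectivity * sample_size
    error = relative_margin(selectivity, sample_size)
    print(f"filter matching {expected_rows:g} sampled rows: +/- {error:.0%}")
```

A filter matching 20% of the sample is estimated to within a few percent, while one matching only a couple of sampled rows has a margin larger than the estimate itself, which is the long-tail limitation listed on the slide.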
  23. Conclusions
      Analytics in Ad Networks is a complex problem
        › Due to the amount of data
        › Due to the number of agents
      It can be solved using Hadoop + Splout SQL
        › By the use of partitioning
        › Using pre-aggregations
          › For operational usage
        › Using sampling
          › For exploratory usage