Ad Networks analytics using Hadoop and Splout SQL by IVÁN DE PRADO at Big Data Spain 2013


Ad Networks analytics using Hadoop and Splout SQL
Iván de Prado Alonso – CEO of Datasalt
www.datasalt.es @ivanprado @datasalt

Big Data consulting & training

Agenda
1. Analytics for Ad Networks
2. Our solution
   1. Hadoop + Splout SQL
   2. Splout SQL in detail
   3. Pre-aggregations vs. Sampling
3. Conclusions

Analytics for Ad Networks

Ad Networks
Principal agents
› Advertiser
› Publisher
› Web pages
› Mobile apps
Ad Network
› Network of agents that mediate between advertisers and publishers
› DSPs, SSPs, DMPs, ADTs, ITDs, etc.

For the sake of simplicity...
Let's consider a monolithic Ad Network
› Single agent between advertisers and publishers
But the solution presented here is also useful for DSPs, SSPs, DMPs, etc.

Need for analytics
For advertisers
› Monitoring campaigns
› Improving ROI
For publishers
› Improving ad placement
But there can be
› Tens of thousands of advertisers
› Hundreds of thousands of publishers

Analytics
Counting impressions, clicks and CPC
› For a given range of dates
› Filtered by
› Campaign
› Location
› Language
› Browser/device
› Ad type
› ... or any combination of the above!
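
The kind of query described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `events` table, its columns (including `cost`, used here to derive CPC as cost per click), the sample rows and the `campaign_stats` helper are all assumptions and do not come from the talk.

```python
import sqlite3

# Hypothetical schema: one row per day and dimension combination.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (day TEXT, campaign TEXT, location TEXT, language TEXT,"
    " browser TEXT, ad_type TEXT, impressions INTEGER, clicks INTEGER, cost REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("2013-11-01", "c1", "ES", "es", "firefox", "banner", 1000, 10, 5.0),
        ("2013-11-02", "c1", "ES", "es", "chrome", "banner", 2000, 30, 12.0),
        ("2013-11-02", "c1", "FR", "fr", "chrome", "video", 500, 5, 2.0),
        ("2013-11-03", "c2", "ES", "es", "chrome", "banner", 800, 8, 4.0),
    ],
)

# Only these columns may be used as filters (guards against SQL injection).
ALLOWED_FILTERS = {"campaign", "location", "language", "browser", "ad_type"}


def campaign_stats(conn, start_day, end_day, **filters):
    """Return (impressions, clicks, cpc) for a date range and any filter combination."""
    where = ["day BETWEEN ? AND ?"]
    params = [start_day, end_day]
    for column, value in filters.items():
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unknown filter: {column}")
        where.append(f"{column} = ?")
        params.append(value)
    sql = (
        "SELECT SUM(impressions), SUM(clicks), SUM(cost) FROM events WHERE "
        + " AND ".join(where)
    )
    impressions, clicks, cost = conn.execute(sql, params).fetchone()
    # CPC is undefined when there are no clicks (or no matching rows).
    cpc = cost / clicks if clicks else None
    return impressions, clicks, cpc


result = campaign_stats(conn, "2013-11-01", "2013-11-02", campaign="c1", location="ES")
print(result)
```

Any subset of the allowed filters can be passed as keyword arguments, which mirrors the "any combination of the above" requirement.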

Two-fold usage
Operational
› For invoicing, accounting, etc.
› Limited set of parameter variations
› Fixed date ranges and common aggregations
› Exact results expected
Exploratory
› Unlimited variations of parameters
› Ad-hoc filtering
› Approximate results are enough

Challenges
Billions of events and hundreds of gigabytes per day
› Need for a distributed system
Query flexibility
› Need to cope with operational and exploratory queries
Web latencies
› Queries must return in milliseconds

Exploding
Data needed to serve analytics panels is Big Data
› Thousands of advertiser panels
› Even more for publisher panels
But individually each agent panel can be served with one machine
› At least for 98% of advertisers/publishers
› Horizontal partitioning is a good strategy

Our solution

Hadoop
Scalable
› Storage of raw data
› Computing capabilities
Good for
› Creating pre-computed aggregations (views)
› Generating samples of data
Bad for
› Serving data
› On-line aggregations

Splout SQL
Scalable
› Serving of full SQL queries (unlike NoSQLs)
Good for
› Ad-hoc aggregations over pre-computed views
› Serving low-latency web pages with concurrency

A well-balanced solution
Hadoop
› Provides a scalable repository for impressions
› Performs off-line pre-aggregations and sampling
Splout SQL
› Serves queries
› Performs on-line aggregations with sub-second latencies
› Each partition contains only data for a few agents, which ensures performance

Splout SQL (in detail)

Splout SQL in detail
Isolation between generation and serving

Splout SQL Architecture

Generation
Generate tablespace T_ADVERTISERS with 2 partitions
› Table ADVERTISERS, partitioned by AID
› Table IMPRESSIONS, partitioned by AID
Input table ADVERTISERS
    AID   Name
    U20   Doug
    U21   Ted
    U40   John
Input table IMPRESSIONS
    PID    AID   Amount
    S100   U20   102
    S101   U20   60
    S223   U40   99
Partition U10 – U35
    ADVERTISERS: (U20, Doug), (U21, Ted)
    IMPRESSIONS: (S100, U20, 102), (S101, U20, 60)
Partition U36 – U60
    ADVERTISERS: (U40, John)
    IMPRESSIONS: (S223, U40, 99)

API - Generation
Command line
$ hadoop jar splout-*-hadoop.jar generate …
Loading CSV files
Java API
HCatalog
Hive
Pig

Serving
SELECT Name, sum(Amount) FROM ADVERTISERS a, IMPRESSIONS i WHERE a.AID = i.AID AND a.AID = 'U20';
For key = 'U20', tablespace = 'T_ADVERTISERS'
› The key falls in partition U10 – U35, which serves the query
Partition U10 – U35: ADVERTISERS (U20 Doug, U21 Ted); IMPRESSIONS (S100 U20 102, S101 U20 60)
Partition U36 – U60: ADVERTISERS (U40 John); IMPRESSIONS (S223 U40 99)

Serving
SELECT Name, sum(Amount) FROM ADVERTISERS a, IMPRESSIONS i WHERE a.AID = i.AID AND a.AID = 'U40';
For key = 'U40', tablespace = 'T_ADVERTISERS'
› The key falls in partition U36 – U60, which serves the query
Partition U10 – U35: ADVERTISERS (U20 Doug, U21 Ted); IMPRESSIONS (S100 U20 102, S101 U20 60)
Partition U36 – U60: ADVERTISERS (U40 John); IMPRESSIONS (S223 U40 99)
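
The generation and serving steps on these slides can be imitated locally as a rough sketch. It is not Splout SQL's actual implementation or API: each partition is simulated with an in-memory SQLite database, and `PARTITION_UPPER_BOUNDS`, `partition_for`, `build_partitions` and `query` are hypothetical names. Keys are compared as plain strings, which only works here because all sample keys have the same length.

```python
import sqlite3
from bisect import bisect_left

# Hypothetical partition map: inclusive upper bound of each key range
# (partition 0 covers keys up to U35, partition 1 covers U36 to U60).
PARTITION_UPPER_BOUNDS = ["U35", "U60"]

ADVERTISERS = [("U20", "Doug"), ("U21", "Ted"), ("U40", "John")]
IMPRESSIONS = [("S100", "U20", 102), ("S101", "U20", 60), ("S223", "U40", 99)]


def partition_for(key):
    """Return the index of the partition whose key range contains `key`."""
    index = bisect_left(PARTITION_UPPER_BOUNDS, key)
    if index == len(PARTITION_UPPER_BOUNDS):
        raise KeyError(f"key {key!r} is beyond the last partition")
    return index


def build_partitions():
    """'Generation': put each row into the partition that owns its AID."""
    partitions = [sqlite3.connect(":memory:") for _ in PARTITION_UPPER_BOUNDS]
    for db in partitions:
        db.execute("CREATE TABLE ADVERTISERS (AID TEXT, Name TEXT)")
        db.execute("CREATE TABLE IMPRESSIONS (PID TEXT, AID TEXT, Amount INTEGER)")
    for aid, name in ADVERTISERS:
        partitions[partition_for(aid)].execute(
            "INSERT INTO ADVERTISERS VALUES (?, ?)", (aid, name)
        )
    for pid, aid, amount in IMPRESSIONS:
        partitions[partition_for(aid)].execute(
            "INSERT INTO IMPRESSIONS VALUES (?, ?, ?)", (pid, aid, amount)
        )
    return partitions


def query(partitions, key, sql, params=()):
    """'Serving': route by key, then run the SQL on that single partition."""
    return partitions[partition_for(key)].execute(sql, params).fetchall()


SQL = (
    "SELECT a.Name, SUM(i.Amount) FROM ADVERTISERS a, IMPRESSIONS i "
    "WHERE a.AID = i.AID AND a.AID = ? GROUP BY a.Name"
)

partitions = build_partitions()
print(query(partitions, "U20", SQL, ("U20",)))
print(query(partitions, "U40", SQL, ("U40",)))
```

Because both tables are partitioned by the same key, the join never has to cross partitions, so each query touches exactly one database.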

API - Service
REST API
JSON response

API - Console

Pre-aggregations vs. Sampling

Operational usage
Invoicing, accounting, monitoring, etc.
› Exact results
› Constrained space of aggregations
Pre-computed aggregates done in Hadoop
› For example:
› per day
› per day per location
Extended aggregations done on-line
› Using Splout SQL
› For example, aggregate per week based on daily stats
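
The "aggregate per week based on daily stats" idea can be sketched as an on-line roll-up over a pre-computed daily table. SQLite is used through Python's `sqlite3` module; the `daily_stats` table, its columns and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical daily pre-aggregates, as they might be produced off-line.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_stats (aid TEXT, day TEXT, impressions INTEGER, clicks INTEGER)"
)
conn.executemany(
    "INSERT INTO daily_stats VALUES (?, ?, ?, ?)",
    [
        ("U20", "2013-11-04", 100, 3),  # Monday
        ("U20", "2013-11-05", 150, 5),  # Tuesday, same week
        ("U20", "2013-11-11", 200, 8),  # Monday of the following week
        ("U40", "2013-11-04", 50, 1),   # another advertiser
    ],
)

# On-line roll-up: group the daily rows of one advertiser by week of year.
# strftime('%W') numbers weeks starting on Monday.
WEEKLY_SQL = """
SELECT strftime('%Y-%W', day) AS week,
       SUM(impressions) AS impressions,
       SUM(clicks) AS clicks
FROM daily_stats
WHERE aid = ?
GROUP BY week
ORDER BY week
"""

weekly = conn.execute(WEEKLY_SQL, ("U20",)).fetchall()
for week, impressions, clicks in weekly:
    print(week, impressions, clicks)
```

Only the daily table has to be pre-computed; coarser granularities such as weeks or months are derived at query time from a handful of rows per advertiser.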

Why not pre-compute everything?
Create one table for each dimension combination
› For two dimensions (day, location):
› day
› location
› location, day
For n dimensions
› 2^n – 1 combinations
› It explodes!
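
A short sketch makes the growth concrete by enumerating every non-empty subset of a list of dimensions. The function name `dimension_combinations` and the six-dimension list are illustrative choices; the dimension names are borrowed from the filters listed earlier in the deck.

```python
from itertools import combinations


def dimension_combinations(dimensions):
    """Return every non-empty subset of `dimensions`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


# Two dimensions give the three tables listed on the slide.
print(dimension_combinations(["day", "location"]))

# Six dimensions already require 2**6 - 1 = 63 pre-computed tables.
six = ["day", "campaign", "location", "language", "browser", "ad_type"]
print(len(dimension_combinations(six)))
```

Each additional dimension roughly doubles the number of tables, before even counting the rows inside each one.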

Exploratory usage
Ad-hoc filters to learn from data
› Approximate results are enough
Intensive use of sampling
› It can provide good accuracy with fast response
Confidence interval
› p = proportion
› n = sample size
› z = quantile of the standard normal distribution
$$p \pm z_{\alpha/2} \sqrt{\frac{p \times (1 - p)}{n}}$$
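
The confidence interval above can be computed with the Python standard library alone. This is a minimal sketch of the normal-approximation interval for a proportion; the function name and the example figures (a 2% click-through rate observed in a sample of 100,000 impressions) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p, n, confidence=0.95):
    """Normal-approximation interval for a proportion p observed in n samples."""
    alpha = 1 - confidence
    # z_{alpha/2}: standard normal quantile that leaves alpha/2 in the upper tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


low, high = proportion_confidence_interval(p=0.02, n=100_000)
print(f"95% interval: [{low:.5f}, {high:.5f}]")
```

The margin shrinks with the square root of the sample size, so a sample four times larger only halves the error. The same formula shows why narrow filters hurt: they leave few matching rows in the sample, and a small n means a wide interval.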

Samples
Created on Hadoop
› Different sample sets
› For last X days
› For last year
Splout SQL for serving them
› On-line analytics over samples
› 1 million records per second* (44 bytes per row)
› Faster with data in memory
  • Warming data prior to use
  • 2.7 million records per second*
* Measured on a laptop

Pre-aggregations pros & cons
Advantages
› Exact results
› Good for exploring the long-tail
Limitations
› Only for a constrained number of aggregation combinations
› Not good for exploratory analysis

Sampling pros & cons
Advantages
› Fast filtering for any set of dimensions
› Good accuracy for Top N queries
Limitations
› Bad for narrow dimension filters
› Bad for exploring the long-tail
› Approximate results

Conclusions

Conclusions
Analytics in Ad Networks is a complex problem
› Due to the amount of data
› Due to the number of agents
It can be solved using Hadoop + Splout SQL
› By the use of partitioning
› Using pre-aggregations
› For operational usage
› Using sampling
› For exploratory usage

Questions?
Iván de Prado Alonso – CEO of Datasalt
www.datasalt.es @ivanprado @datasalt
