Building Real-Time Metrics Pipelines

BUILDING REAL-TIME METRICS PIPELINES SAMANTHA QUIÑONES Velocity Amsterdam, Oct. 2015

SAMANTHA QUIÑONES ABOUT ME ▸ Software Engineer since 1997 ▸
Doing “media stuff” since 2012 ▸ Principal @ AOL since 2014 ▸ @ieatkillerbees ▸ http://samanthaquinones.com

IMAGE CREDITS ▸ Clock on the roof of Our Lady
of Dormition Melkite Greek Catholic Patriarchal Cathedral, Damascus, Syria ▸ https://commons.wikimedia.org/wiki/Category:Church_clocks#/media/File:Clock_of_the_Melkite_Greek_Catholic_Church,_Damascus.jpg ▸ Bernard Gagnon ▸ CC BY-SA 3.0 ▸ The SJ train 58/637 with an Rc locomotive from Stockholm passes by Etterstad on its way on Hovedbanen to Oslo Central Station, about eight minutes late. The Loenga–Alnabru freight line is seen to the right. ▸ https://upload.wikimedia.org/wikipedia/commons/d/d7/Swedish_train_in_Norway.jpg ▸ Peter Krefting ▸ CC BY-SA 2.0

THAT AOL?

“HOW WOULD YOU LET EDITORS TEST HOW WELL DIFFERENT HEADLINES
PERFORM FOR THE SAME PIECE OF CONTENT?” Shashi Reddy, Senior Engineer, AOL

MEASURE ONCE CUT TWICE TRADITIONAL METRICS ▸ Request response time
- are we responding fast enough? ▸ Cache hit rate - are we making our backend work too hard? ▸ Resource utilization - do we have enough “hardware”?

DELAY AND PERCEPTION ▸ <100ms: Instantaneous ▸ <300ms: Perceptible delay ▸ <1000ms: System “working”
▸ <10000ms: System “slow” ▸ >10000ms: System “down” ▸ Source: O’Reilly Media

AMAZON SALES DATA EFFECTS OF LATENCY ▸ Amazon sales: -1% sales per 100ms of increased latency
▸ [Chart: sales (USD) against seconds of latency] ▸ S = S - (msL*1%) ▸ Linden, G. (2006, December 3). Make Data Useful. Data Mining (CS345). Lecture conducted from Stanford University, Stanford, CA.
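The rule of thumb on this slide can be read as a simple linear model. The sketch below is one such reading, not Amazon's actual model: the function name `projectedSales` and the clamp at zero are assumptions added for illustration.

```javascript
// Linear reading of the slide's rule of thumb: every 100 ms of added
// latency costs roughly 1% of baseline sales. Illustrative only.
function projectedSales(baselineSales, addedLatencyMs) {
  const lossFraction = (addedLatencyMs / 100) * 0.01;
  // Clamp at zero (an assumption) so large latencies never go negative.
  return Math.max(0, baselineSales * (1 - lossFraction));
}

// Half a second of extra latency on a 2000 USD baseline loses about 5%.
console.log(projectedSales(2000, 500));
```

Under this reading, ten seconds of added latency wipes out sales entirely, which is why the clamp is there.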

GOOGLE SEARCH EXPERIMENT EFFECTS OF LATENCY ▸ Google Search Experiment (4-6 weeks)
▸ [Chart: % fewer searches per day (0 to -1) against milliseconds of additional latency (50, 100, 200, 400)] ▸ Schurman, E., Brutlag, J. (2009, June 23). The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. O'Reilly Velocity. Lecture conducted from O'Reilly Media, San Jose, CA.

BEHAVIORAL METRICS MULTIVARIATE (A/B) TESTING ▸ Sort all users
into groups ▸ 1 control group receives unaltered content ▸ 1 or more groups receive altered content ▸ Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
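One common way to sort users into stable groups is to hash an identifier into a bucket. The sketch below assumes that approach. The names `fnv1a` and `assignGroup`, the group labels, and the choice of an FNV-1a style hash over UTF-16 code units are all illustrative, not taken from the talk.

```javascript
// FNV-1a style 32-bit hash: small, deterministic, dependency-free.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Map a visitor to one of the groups; by convention here, index 0 is control.
function assignGroup(visitorId, testId, groups) {
  const bucket = fnv1a(testId + ":" + visitorId) % groups.length;
  return groups[bucket];
}

const groups = ["control", "variant_a", "variant_b"];
console.log(assignGroup("e33d53be-7b7e-11e5-8bcf-feff819cdc9f", "headline_test", groups));
```

Because the assignment is a pure function of the visitor and test identifiers, the same visitor always lands in the same group without any server-side state.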

OTHER METRICS STATE MONITORING ▸ Exception logging ▸ Load monitoring
▸ System performance ▸ Application performance ▸ Cache performance

UNDERSTANDING YOUR AUDIENCE TRAFFIC METRICS

BEHAVIORAL METRICS MEASURING USER BEHAVIOR & EXPERIENCE ▸ Application path
- What does the user click on? ▸ Usage patterns - When does the user visit? Where do they come from? ▸ Mouse & attention tracking - What draws the user’s attention? ▸ RUM

TRAFFIC METRICS DEMOGRAPHIC INFORMATION COLLECTION ▸ Geographic location and region
▸ ISP ▸ Device information ▸ Anonymized user identification

CASE STUDY AOL MEDIA PLATFORM ▸ Content management ▸ Distributed
rendering farm ▸ Integrated development environment using custom DSL ▸ Content aggregation platform ▸ Machine learning platform ▸ Multi-tenant system

CASE STUDY MEASURING THE AOL MEDIA PLATFORM

CASE STUDY METRICS & ANALYTICS ▸ Omniture (revenue analytics) ▸
New Relic (APM) ▸ ELK (APM) ▸ AOL proprietary data platform (RUM & Demographics)

CASE STUDY AOL DATA LAYER ▸ Massively distributed data collection
▸ Hadoop ▸ Access via Hive & Pig ▸ Time-shared ▸ Cassandra ▸ Vertica (ingested Omniture data) ▸ Streaming Interface (raw data)

“BEACON” SERVER “BEACON” SERVER “BEACON” SERVER “BEACON” SERVER “BEACON” SERVER
RABBITMQ FARM DATA LAYER SERVICES cassandra hadoop RABBITMQ STREAMER FARM DATA LAYER STREAMER DATA LAYER STREAMER DATA LAYER STREAMER

BEACON PAYLOAD { "anonymous_id": "e33d53be-7b7e-11e5-8bcf-feff819cdc9f", "channel": "aol.us", "user_agent": "Mozilla/5.0 (Macintosh;
Intel Mac OS X 10_10_5) AppleWebKit/537.36", "referer": "www.aol.com", "location": "country=us,region=va,city=alexandria,latitude=38.819940,longitude=-77.145418", "mv_tests": "mv_test1:mv_test_pop_id;mv_test_metadata" }

~40 metrics ~1.6 KB per EVENT

~15,000 events 25 MB per SECOND

1,300,000,000 events ~2 TB per DAY
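The three figures above are consistent with one another, as a quick back-of-the-envelope check shows. The sketch assumes decimal units (1 KB = 1000 bytes) and a steady rate across the whole day, neither of which the slides state.

```javascript
// Rounded inputs from the slides; decimal units are an assumption.
const BYTES_PER_EVENT = 1600;       // ~1.6 KB per event
const EVENTS_PER_SECOND = 15000;    // ~15,000 events per second
const SECONDS_PER_DAY = 24 * 60 * 60;

const bytesPerSecond = BYTES_PER_EVENT * EVENTS_PER_SECOND;
const eventsPerDay = EVENTS_PER_SECOND * SECONDS_PER_DAY;
const bytesPerDay = bytesPerSecond * SECONDS_PER_DAY;

console.log((bytesPerSecond / 1e6).toFixed(1) + " MB per second"); // 24.0 MB per second
console.log(eventsPerDay + " events per day");                      // 1296000000 events per day
console.log((bytesPerDay / 1e12).toFixed(2) + " TB per day");       // 2.07 TB per day
```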

CASE STUDY CONTENT CREATORS WANT TO KNOW ▸ Today’s traffic
by author & vertical ▸ Top performing articles for the past hour ▸ Recent social engagement trends

CASE STUDY CONTENT SITE DEVELOPERS WANT TO KNOW ▸ API
Query Performance ▸ Details of handled exceptions ▸ How to maximize cache hit rate

THEY NEED TO KNOW NOW. THE 24-HOUR MEDIA CYCLE

“HOW WOULD YOU LET EDITORS TEST HOW WELL DIFFERENT HEADLINES
PERFORM FOR THE SAME PIECE OF CONTENT?” Shashi Reddy, Senior Engineer, AOL

IN THREE EASY FAILURES BUILDING A REAL-TIME DATA PIPELINE

THE PROOF OF CONCEPT COLLECTOR COLLECTOR COLLECTOR STREAMER COLLECTOR COLLECTOR
COLLECTOR RECEIVER COLLECTOR COLLECTOR COLLECTOR STATSD CLUSTER ELASTICSEARCH

TINY, ENCAPSULATED NANOSERVICES this.visit = function(record) { if (record.userAgent) {
var parser = new UAParser(); parser.setUA(record.userAgent); var user_agent = parser.getResult(); return { user_agent: user_agent } } return {}; };

CASE STUDY PROOF OF CONCEPT PERFORMANCE & RESULTS ▸ Message
Rate: 300 per second ▸ Receivers Needed: ~70+ ▸ StatsD imposes a number of limitations ▸ Breaks rich payloads down into discrete metrics ▸ Anything but in-flight aggregation means querying Elasticsearch
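To make the StatsD limitation concrete, the sketch below turns one beacon into independent counter lines in the StatsD line format (`<bucket>:<value>|c`). The function `toStatsdLines` and the metric names are made up for illustration and are not the proof of concept's actual code.

```javascript
// Turn one rich beacon into independent StatsD counter lines.
// Metric names are hypothetical.
function toStatsdLines(beacon) {
  // Use underscores so dots in values do not create extra namespace levels.
  const clean = (s) => String(s).replace(/[^a-zA-Z0-9_]/g, "_");
  return [
    "pageviews.total:1|c",
    "pageviews.channel." + clean(beacon.channel) + ":1|c",
    "pageviews.referer." + clean(beacon.referer) + ":1|c",
  ];
}

console.log(toStatsdLines({ channel: "aol.us", referer: "www.aol.com" }));
```

Once the payload is split this way, the link between the fields is gone: a question such as "page views on this channel from that referer" cannot be answered unless that exact combination was defined as its own metric in advance.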

An efficient real-time data pathway consists of a network of
transits and terminals, where no single node acts as both a transit and a terminal at the same time. CASE STUDY

CASE STUDY TRANSITS ▸ Short-term ▸ In-memory ▸ Volatile storage
▸ Data with life-spans up to a few seconds

CASE STUDY TERMINALS ▸ Destinations that store, ▸ Destroy, ▸
or Retransmit data

KAFKA VS RABBITMQ TOOL EVALUATION

TOOL EVALUATION - KAFKA VS RABBITMQ APACHE KAFKA ▸ Pub/Sub
Message Broker ▸ Born @LinkedIn around 2011 ▸ Apache project since 2014 ▸ Key focuses ▸ Message integrity (persistence-first model) ▸ Message order ▸ Fault tolerance

TOOL EVALUATION RABBITMQ ▸ AMQP implementation ▸ Born in 2007
▸ Acquired by Pivotal Software in 2013 ▸ Key focuses: ▸ General-purpose messaging ▸ Routing ▸ HA through Federation

TOOL EVALUATION REQUIREMENTS & CONSIDERATIONS ▸ Payloads may arrive in
any order. ▸ Some data loss is acceptable. ▸ Consumers may only want small subsets of data ▸ Need to route data to consumers in multiple datacentres / in AWS ▸ Broad support for languages

TOOL EVALUATION TRANSIT: RABBITMQ ▸ RabbitMQ’s priorities are similar to
ours ▸ Federation over at-least-once delivery ▸ Supports complex routing ▸ Allows federation over network boundaries (even when it’s dumb) ▸ Mature clients for our Big Three Stacks (Java, Node.js, PHP) ▸ Big enterprises like stuff with companies behind it
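Complex routing is what lets a consumer subscribe to only a small subset of the data. The sketch below mimics the matching rule of an AMQP topic exchange, where `*` stands for exactly one dot-separated word and `#` for zero or more. It is an illustration of the rule rather than the broker's implementation, and the routing keys are hypothetical.

```javascript
// Returns true when an AMQP-style topic pattern matches a routing key.
// "*" matches exactly one word, "#" matches zero or more words.
function topicMatches(pattern, routingKey) {
  const p = pattern.split(".");
  const k = routingKey.split(".");

  function match(pi, ki) {
    if (pi === p.length) return ki === k.length;
    if (p[pi] === "#") {
      // Let "#" swallow zero, one, two, ... of the remaining words.
      for (let next = ki; next <= k.length; next++) {
        if (match(pi + 1, next)) return true;
      }
      return false;
    }
    if (ki === k.length) return false;
    if (p[pi] === "*" || p[pi] === k[ki]) return match(pi + 1, ki + 1);
    return false;
  }

  return match(0, 0);
}

console.log(topicMatches("beacon.aol_us.*", "beacon.aol_us.pageview")); // true
console.log(topicMatches("beacon.*", "beacon.aol_us.pageview"));        // false
```

A consumer bound with `beacon.aol_us.*` would see only that channel's events, while one bound with `beacon.#` would see everything published under `beacon`.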

VERSION 1 COLLECTOR COLLECTOR COLLECTOR STREAMER COLLECTOR COLLECTOR COLLECTOR RECEIVER
COLLECTOR COLLECTOR COLLECTOR RABBITMQ ELASTICSEARCH

CASE STUDY MORE THAN JUST RABBITMQ ▸ Moved away from
Observer Pattern for data processing to a single in and a single out event. ▸ Node.js event handling is VERY fast, but the sheer number of events being created caused memory problems. ▸ Rather than tuning within the app or engine, let a back-pressure mechanism regulate the input rate.
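The talk does not say which back-pressure mechanism was used. As a minimal sketch of the general idea, the hypothetical `BoundedQueue` below tells the producer to pause once a high-water mark is reached, following the same convention as the boolean returned by `write()` on Node.js writable streams.

```javascript
// A queue that signals the producer to pause when it fills up.
// Class name and API are illustrative, not from the talk.
class BoundedQueue {
  constructor(highWaterMark) {
    this.highWaterMark = highWaterMark;
    this.items = [];
  }

  // Accepts the item, then returns false if the producer should pause.
  push(item) {
    this.items.push(item);
    return this.items.length < this.highWaterMark;
  }

  // Returns the oldest item, or undefined when the queue is empty.
  shift() {
    return this.items.shift();
  }

  // True once the consumer has drained enough for input to resume.
  get writable() {
    return this.items.length < this.highWaterMark;
  }
}

const queue = new BoundedQueue(2);
console.log(queue.push("a")); // true: keep reading input
console.log(queue.push("b")); // false: stop reading until drained
```

The producer stops reading from its source while `writable` is false, so the input rate settles at whatever the consumer can sustain instead of piling events up in memory.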

while (buffer.length > 0) { var char = buffer.shift(); if
('\n' === char) { queue.push(new Buffer(outbuf.join(''))); continue; } outbuf.push(char); } var i = 0; var tBuf = buffer.slice(); while (i < buffer.length) { var char = tBuf[i++]; if ('\n' === char) { queue.push(new Buffer(outbuf.join(''))); } outbuf.push(char); }

CASE STUDY VERSION 1 PERFORMANCE & RESULTS ▸ Message Rate:
600/s ▸ Receivers Needed: ~35+ ▸ Adding code to handle weird edge cases in data degrades performance. ▸ Micro-optimization of code leads to hard-to-fix crashes and memory leaks.

NOT KNOWING HOW TO USE A TOOL DOESN’T MEAN IT’S
BROKEN.

CASE STUDY GETTING SERIOUS ▸ Receiving data, editing it, and
routing it in the same step violates my transit/terminal separation policy. ▸ Receiver needs to be a simple transit that consumes and pushes data on to RabbitMQ ▸ Nice-to-haves: ▸ Static & dynamic optimization ▸ Clean multithreading/multiprocessing ▸ Good memory management for large, volatile in-memory data sets

TOOL EVALUATION PICKING A STACK FOR THE DATA RECEIVER -
THE PROS ▸ Node.js - Simple, easy-to-distribute, fast. ▸ Go - Native concurrency & memory management, fast compiler. ▸ Rust - C++ with modern tooling. ▸ Java - Static & dynamic optimization, good memory management & multithreading. ▸ C/C++ - Speed, good libraries for handling concurrency & memory.

TOOL EVALUATION PICKING A STACK FOR THE DATA RECEIVER -
THE CONS ▸ Node.js - Too many instances needed to manage production flow. ▸ Go - No one on my team has any familiarity. ▸ Rust - No one on my team has any desire to have any familiarity. ▸ Java - All the cool kids will pick on me. ▸ C/C++ - I like myself too much.

AN ARCHITECT MUST UNDERSTAND OTHERS’ VISIONS BEFORE EXPRESSING THEIR OWN.
ARE YOU REALLY GOING TO QUOTE YOURSELF?

mfw java :(

VERSION 2 (JAVA BOOGALOO) COLLECTOR COLLECTOR COLLECTOR STREAMER COLLECTOR COLLECTOR
COLLECTOR RECEIVER ELASTICSEARCH RABBITMQ COLLECTOR COLLECTOR COLLECTOR PROCESSOR/ ROUTER

public class StreamReader { private static final Logger logger =
Logger.getLogger(StreamReader.class.getName()); private StreamerQueue queue = new StreamerQueue(); private StreamProcessor processor; private List<StreamReader.BeaconWorkerThread> workerThreads = new ArrayList(); private RtStreamerClient client; public StreamReader(String streamerURI, AmqpClient amqpClient, String appID, String tpcFltrs, String rfFltrs, String bt) { ArrayList queueList = new ArrayList(); this.processor = new StreamProcessor(amqpClient); byte numThreads = 8; for(int i = 0; i < numThreads; ++i) { StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread(); this.workerThreads.add(worker); worker.start(); } queueList.add(this.queue); this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList); } } CREATING MULTIPLE THREADS WITH STANDALONE CONNECTIONS TO RABBITMQ SIMPLE WRAPPER AROUND NATIVE JAVA LINE STREAMER

public class StreamProcessor { private static final Logger logger =
Logger.getLogger(StreamProcessor.class.getName()); private AmqpClient amqpClient; public StreamProcessor(AmqpClient amqpClient) { this.amqpClient = amqpClient; } public void send(String data) throws Exception { this.amqpClient.send(data.getBytes()); logger.debug("Sent event " + data + " to AMQP"); } } SIMPLE PASS-THRU

[Diagram: network input, eleven linked-list queues, network output]

CASE STUDY VERSION 2 PERFORMANCE & RESULTS ▸ Message Rate:
2600/s ▸ Receivers Needed: ~10 ▸ Validity filtering is almost free in the Java receiver (can’t parse as JSON, drop it) ▸ Processor / Router Service selects only the messages it wants. Everything else is left for another service to collect, or to be dropped on the floor.
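The "can't parse as JSON, drop it" rule is small enough to sketch. The example below is in JavaScript rather than the receiver's Java. The function name `parseBeacon` and the extra check that the value is a plain object are assumptions.

```javascript
// Returns the parsed beacon, or null when the line is not a JSON object.
function parseBeacon(line) {
  try {
    const value = JSON.parse(line);
    const isObject =
      value !== null && typeof value === "object" && !Array.isArray(value);
    return isObject ? value : null;
  } catch (err) {
    return null; // malformed input is simply dropped
  }
}

const lines = ['{"channel":"aol.us"}', '{"channel":', "42"];
const valid = lines.map(parseBeacon).filter((beacon) => beacon !== null);
console.log(valid.length); // 1
```

Since the payload has to be parsed at some point anyway, rejecting bad lines at that step adds almost no work.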

WITHOUT CONSUMERS, A PIPELINE IS USELESS. PLEASE STOP QUOTING YOURSELF
SAMANTHA, IT’S PATHETIC

LETS DO MATH AT IT! REAL-TIME ANALYTICS SERVICE

REAL-TIME ANALYTICS SERVICE GOALS ▸ Provide (near) real-time statistics, metrics,
and analytics for editorial staff ▸ Allow statistical evaluation of arbitrary variables ▸ Provide a simple interface for developers working in the publishing stack (PHP)

REAL-TIME ANALYTICS SERVICE WHAT IS ELASTICSEARCH ▸ A full-text
search database ▸ A high-performance NoSQL document store that features ▸ High-availability via clustering ▸ Rack/Datacentre-aware sharding ▸ Expressive & dynamic query DSL ▸ Some powerful full-text search, I guess, whatever?

AOL US East Datacentre AOL France Datacentre AWS us-east-1 Region
AOL US West Datacentre ELASTICSEARCH MASTER ELASTICSEARCH NODE ELASTICSEARCH NODE ELASTICSEARCH NODE

ELASTICSEARCH CLUSTERING ONE INDEX, TWO REPLICAS MASTER NODE NODE NODE
R0 P1 P2 P0 R1 R2 R3 R2 R3

{ "query": { "filtered": { "query": { "multi_match": { "query":
"miley cyrus", "fields": [ "byline", "title", "contents" ], "type": "cross_fields" } }, "filter": { "terms": { "site_id": [ 698 ] } } } }, "size": 25 }

{ "size": 0, "query": { "filtered": { "query": { "terms":
{ "content.source.cms.post_id": [ 12347, 22314, 242123, 342414 ] } }, "filter": { "bool": { "must": [ { "term": { "click_type": "ping" } }, { "range": { "timestamp": { "gte": 1445854380000, "lte": 1445940780000 } } } ] } } } }, "aggregations": { "post_id": { "terms": { "field": "content.source.cms.post_id", "size": 4, "order": { "_count": "desc" } }, "aggregations": { "search_terms": { "terms": { "field": "referer.search_term.raw" } }, "source": { "terms": { "field": "referer.medium" }, "aggregations": { "referer": { "terms": { "field": "referer.referer" }, "aggregations": { "search_terms": { "terms": { "field": "referer.search_term.raw" } } } } } } } } } }

<?php $params = []; $params['type'] = 'stat'; $params['index'] = isset($args->search_index)
? $args->search_index : $elasticSearch->getDatedIndexList($start, $end); $params['ignore_unavailable'] = true; $params['body'] = $this->getQuery($args->post_ids, $start, $end); $results = $client->search($params);

CASE STUDY MULTIVARIATE TESTING - REQUIREMENTS ▸ Allow editors to
test the performance of any discrete content element ▸ Content elements being: headlines, deks, ledes, subledes, hero images, river images, etc. ▸ Editors should be able to create, start, stop, and evaluate tests without spending developer time.

CASE STUDY MULTIVARIATE TESTING - IMPLEMENTATION ▸ Assign new visitors
to a test group via cookie ▸ Inject test markers into the beacon payload ▸ Compare CTR for PVs with test markers to calculate performance

{ "mv_stats": { "type": "nested", "include_in_parent": true, "properties": { "hash":
{ "type": "string" }, "test_id": { "type": "integer" } } } } TEST POPULATION IDENTIFIER TEST ID

{ "size": 0, "query": { "filtered": { "query": { "terms":
{ "mv_stats.test_id": [ 42 ] } }, "filter": { "bool": { "must": [ { "term": { "click_type": "ping" } }, { "range": { "timestamp": { "gte": 1445854380000, "lte": 1445940780000 } } } ] } } } }, "aggs": { "event_type": { "terms": { "field": "click_type" }, "aggs": { "multivariate": { "nested": { "path": "mv_stats" }, "aggs": { "test_ids": { "terms": { "field": "mv_stats.test_id" }, "aggs": { "hashes": { "terms": { "field": "mv_stats.hash" }, "aggs": { "event_types": { "terms": { "field": "click_type" } } } } } } } } } } } } REGULAR PAGEVIEW AGGREGATIONS NEST AND TAKE A CONTEXT OF THE PARENT

<?php $results = $this->analytics()->multivariate()->get([ 'test_id' => $id, 'event_type' => 'all',
'start' => $test['started'] ])->data(); if (!empty($results['hashes'])) { foreach (array_keys($test['items']) as $hash) { $clicks = 0; if (!empty($results['hashes'][$hash]['clicks'])) { $clicks = $results['hashes'][$hash]['clicks']; } $pings = 0; if (!empty($results['hashes'][$hash]['pings'])) { $pings = $results['hashes'][$hash]['pings']; } $test['items'][$hash]['clicks'] = $clicks; $test['items'][$hash]['pings'] = $pings; /* Guard against division by zero when a variant has no pings yet. */ $test['items'][$hash]['percent'] = $pings > 0 ? ($clicks / $pings) * 100 : 0; } }

MEANINGLESS SHINIES TO OOH AND AHH AT WALL MAPS

RABBITMQ INPUT OUT TO ANALYTICS SERVICE OUT TO VISUALIZATION SERVICE

EMBEDDABLE VISUALIZATIONS IN-DEVELOPMENT

var views = 0; var socket = io(); socket.on('pageview', function(point)
{ views++; }); function tick() { data.push(views); views = 0; path .attr("d", line) .attr("transform", null) .transition() .duration(500) .ease("linear") .attr("transform", "translate(" + x(0) + ",0)") .each("end", tick); data.shift(); } tick();

LIVE PROFILING DEVELOPERS NEED LOVE TOO

LIVE PROFILING DEVELOPING ON THE AOL MEDIA PLATFORM ▸ Use
our API and build what you like on servers you manage. ▸ Use our managed hosting platform which handles scaling, caching, etc. ▸ But… requires you to work in a custom DSL

HOLY CRAP THE GUY WHO BUILT ALL OF THIS IS
A GENIUS DEVELOPING FOR THE AOL MEDIA PLATFORM ▸ Create a repository in your source control system of choice ▸ Write code in our twig-based language (CodeBlocks) ▸ Code on your local machine is synced to a live sandbox with access to test data and resources that mirror production ▸ Promote sandboxes to live production ▸ This was seriously all built by a guy named Ralph.

{% set posts = api.posts.query({ page: req.params.page|default(1), limit: 3, categories:
req.params.category ? [{parent:req.params.category}, req.params.category] : null, categories_match: 'any', tags: req.params.tag ? [req.params.tag] : null }) %}

DEV STARTS A PROFILER SESSION DEV VISITS PRODUCTION SITE WITH
QUERY PARAM RENDER SERVER ACTIVATES PROFILING EVENT MESSAGES ARE TAGGED WITH SESSION ID RABBITMQ ROUTES TAGGED MESSAGES TO PROFILER SERVICE DEV’S PROFILER CONSOLE CONNECTS TO PROFILER SERVICE PROFILER SERVICE WAITS FOR MESSAGES MESSAGES ARE RECEIVED AND RENDERED IN THE CONSOLE

CROSS-PLATFORM EVENTING WHEN A PACKET HITS A SOCKET ON A
POCKET ON A PORT

CROSS-PLATFORM EVENTING A LITTLE SOMETHING FOR THAT NICE ENGINEER OVER
THERE ▸ Allow devs to dispatch “native” events in one stack and observe them in another ▸ The PHP CMS uses the Symfony EventDispatcher to trigger an event in Node.js ▸ Distributed event handling without PHP workers ▸ Event-driven search indexing (no rivers or crons)

RABBITMQ INPUT OUT TO ANALYTICS SERVICE OUT TO VISUALIZATION SERVICE
RABBITMQ OUT TO EVENT HANDLER SERVICE

<?php public function dispatch($event_name, Event $event = null) { $dispatchedEvent
= parent::dispatch($event_name, $event); if ($dispatchedEvent instanceof ForwardableEvent) { $data = $dispatchedEvent->getEventData(); try { $this->amqp->publish( self::AMQP_CONNECTION, self::AMQP_EXCHANGE, self::AMQP_ROUTING_KEY, json_encode(['name' => $event_name, 'data' => $data]) ); } catch (\Exception $exc) { $this->logger->error( self::class . ' failed to publish event to AMQP.', [ 'exception' => $exc ] ); } } return $dispatchedEvent; } OVERRIDING THE DEFAULT BEHAVIOR OF THE PHP EVENT DISPATCHER DEVS MARK EVENTS AS ‘FORWARDABLE’ BY IMPLEMENTING AN INTERFACE EVENTS ARE FORWARDED ON TO AN AMQP EXCHANGE

module.exports = { register: function (config) { client = new
es.Client({ hosts: config.hosts, log: Logger }); logger.info('AMP Elasticsearch Indexer module loaded!'); }, listeners: { 'amp.post.save': function (event, callback) { var index = 'posts'; var type = 'post'; var id = event['id']; if (!id) { return callback('Invalid post object received'); } indexRecord(index, type, id, event, callback); } } }; JS FUNCTION EXECUTED WHEN PHP DISPATCHES EVENTS

WRAPPING IT UP AOL’S DATA PIPELINE - BY THE NUMBERS
▸ 1.3 billion events per day ▸ Routed by RabbitMQ to microservice consumers ▸ Driving real-time analytics over 250 GB of raw data per day ▸ Visualizing 1.3 million events per day ▸ Generating live profiles for developers of ~50 properties ▸ Handling 10,000 Elasticsearch search index updates per day

WRAPPING IT UP AOL’S DATA PIPELINE - STACKS AND TECH
▸ Programming Languages: Java, Node.js, PHP, Python (HA load-balancing and routing) ▸ Hadoop, RabbitMQ, Elasticsearch, Vertica

WRAPPING IT UP AOL’S DATA PIPELINE - 2016 & BEYOND
▸ Embeddable visualizations ▸ On-demand stream filters with Redis time-series bucketing ▸ Real-time predictive performance analysis ▸ Real-time social sentiment analysis ▸ Moving all of this infrastructure to AWS (oy!) ▸ Integrating Apache Spark

PIPELINE MAP AOL DATA LAYER RABBITMQ REAL-TIME ANALYTICS SERVICE VISUALIZATIONS
SERVICE AOL MEDIA PLATFORM PROFILER SERVICE CROSS-PLATFORM EVENT PROPAGATION SERVICE RELEGENCE
