
Building Real-Time Metrics Pipelines


Presented at Velocity Europe in Amsterdam, 28-Oct-2015

Samantha Quiñones

October 28, 2015


Transcript

  1. SAMANTHA QUIÑONES ABOUT ME ▸ Software Engineer since 1997 ▸

    Doing “media stuff” since 2012 ▸ Principal @ AOL since 2014 ▸ @ieatkillerbees ▸ http://samanthaquinones.com
  2. IMAGE CREDITS

    ▸ Clock on the roof of Our Lady of Dormition Melkite Greek Catholic Patriarchal Cathedral, Damascus, Syria ▸ https://commons.wikimedia.org/wiki/Category:Church_clocks#/media/File:Clock_of_the_Melkite_Greek_Catholic_Church,_Damascus.jpg ▸ Bernard Gagnon ▸ CC BY-SA 3.0
    ▸ The SJ train 58/637 with an Rc locomotive from Stockholm passes by Etterstad on its way on Hovedbanen to Oslo Central Station, about eight minutes late. The Loenga–Alnabru freight line is seen to the right. ▸ https://upload.wikimedia.org/wikipedia/commons/d/d7/Swedish_train_in_Norway.jpg ▸ Peter Krefting ▸ CC BY-SA 2.0
  3. “HOW WOULD YOU LET EDITORS TEST HOW WELL DIFFERENT HEADLINES

    PERFORM FOR THE SAME PIECE OF CONTENT?” Shashi Reddy, Senior Engineer, AOL
  4. MEASURE ONCE CUT TWICE TRADITIONAL METRICS ▸ Request response time

    - are we responding fast enough? ▸ Cache hit rate - are we making our backend work too hard? ▸ Resource utilization - do we have enough “hardware”?
  5. Delay        Perception

    <100ms       Instantaneous
    <300ms       Perceptible Delay
    <1000ms      System “Working”
    <10000ms     System “Slow”
    >10000ms     System “Down”
    Source: O’Reilly Media
  6. AMAZON SALES DATA EFFECTS OF LATENCY Amazon Sales: -1% sales

    per 100ms increased latency [Chart: Sales (USD), 0 to 2000, against Seconds of Latency, 0 to 15] S = S - (msL*1%) Linden, G. (2006, December 3). Make Data Useful. Data Mining (CS345). Lecture conducted from Stanford University, Stanford, CA.
  7. GOOGLE SEARCH EXPERIMENT EFFECTS OF LATENCY Google Search Experiment (4-6

    Weeks) [Chart: % Fewer Searches per Day, 0 to -1, against Milliseconds of Additional Latency: 50, 100, 200, 400] Schurman, E., Brutlag, J. (2009, June 23). The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. O'Reilly Velocity. Lecture conducted from O'Reilly Media, San Jose, CA.
  8. BEHAVIORAL METRICS MULTIVARIATE (A/B) TESTING ▸ Sort all users

    into groups ▸ 1 control group receives unaltered content ▸ 1 or more groups receive altered content ▸ Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
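
One way to sort users into stable groups is to hash an anonymous identifier, so the same visitor always lands in the same group. The sketch below is an illustration only, not the method used in the talk: the function name `assignGroup`, the choice of SHA-256, and the test name are all assumptions. Group 0 plays the role of the control group.

```javascript
const crypto = require('crypto');

// Hypothetical helper: deterministically map an anonymous ID to a group.
// Group 0 is treated as the control; groups 1..groupCount-1 get altered content.
function assignGroup(anonymousId, testName, groupCount) {
  const digest = crypto
    .createHash('sha256')
    .update(`${testName}:${anonymousId}`)
    .digest();
  // Read the first 4 bytes as an unsigned integer and reduce modulo groupCount.
  return digest.readUInt32BE(0) % groupCount;
}

// Example call with a made-up test name and three groups.
const group = assignGroup('e33d53be-7b7e-11e5-8bcf-feff819cdc9f', 'headline-test', 3);
console.log(group === 0 ? 'control' : `variant ${group}`);
```

Including the test name in the hashed string means a visitor's group in one test does not determine their group in another.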
  9. OTHER METRICS STATE MONITORING ▸ Exception logging ▸ Load monitoring

    ▸ System performance ▸ Application performance ▸ Cache performance
  10. BEHAVIORAL METRICS MEASURING USER BEHAVIOR & EXPERIENCE ▸ Application path

    - What does the user click on? ▸ Usage patterns - When does the user visit? Where do they come from? ▸ Mouse & attention tracking - What draws the user’s attention? ▸ RUM
  11. TRAFFIC METRICS DEMOGRAPHIC INFORMATION COLLECTION ▸ Geographic location and region

    ▸ ISP ▸ Device information ▸ Anonymized user identification
  12. CASE STUDY AOL MEDIA PLATFORM ▸ Content management ▸ Distributed

    rendering farm ▸ Integrated development environment using custom DSL ▸ Content aggregation platform ▸ Machine learning platform ▸ Multi-tenant system
  13. CASE STUDY METRICS & ANALYTICS ▸ Omniture (revenue analytics) ▸

    New Relic (APM) ▸ ELK (APM) ▸ AOL proprietary data platform (RUM & Demographics)
  14. CASE STUDY AOL DATA LAYER ▸ Massively distributed data collection

    ▸ Hadoop ▸ Access via Hive & Pig ▸ Time-shared ▸ Cassandra ▸ Vertica (ingested Omniture data) ▸ Streaming Interface (raw data)
  15. [Architecture diagram. Labels: “Beacon” server (×5), RabbitMQ farm, data layer services (Cassandra, Hadoop), RabbitMQ streamer farm, data layer streamer (×3)]
  16. BEACON PAYLOAD

    {
      "anonymous_id": "e33d53be-7b7e-11e5-8bcf-feff819cdc9f",
      "channel": "aol.us",
      "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36",
      "referer": "www.aol.com",
      "location": "country=us,region=va,city=alexandria,latitude=38.819940,longitude=-77.145418",
      "mv_tests": "mv_test1:mv_test_pop_id;mv_test_metadata"
    }
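
The `location` field in the payload packs several values into one comma-separated string. A consumer would need to unpack it before indexing. The following is a minimal sketch of such a step; the function name `parseLocation` is hypothetical, and treating `latitude` and `longitude` as numbers is an assumption about what downstream consumers want.

```javascript
// Hypothetical helper: split "key=value,key=value" into an object.
// Assumes values never contain commas, which holds for the sample payload.
function parseLocation(location) {
  const result = {};
  for (const pair of location.split(',')) {
    const idx = pair.indexOf('=');
    if (idx === -1) continue; // skip malformed fragments
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    // Coordinates become numbers; everything else stays a string.
    result[key] = key === 'latitude' || key === 'longitude' ? Number(value) : value;
  }
  return result;
}

const parsed = parseLocation(
  'country=us,region=va,city=alexandria,latitude=38.819940,longitude=-77.145418'
);
console.log(parsed.city, parsed.latitude, parsed.longitude);
```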
  17. CASE STUDY CONTENT CREATORS WANT TO KNOW ▸ Today’s traffic

    by author & vertical ▸ Top performing articles for the past hour ▸ Recent social engagement trends
  18. CASE STUDY CONTENT SITE DEVELOPERS WANT TO KNOW ▸ API

    Query Performance ▸ Details of handled exceptions ▸ How to maximize cache hit rate
  19. “HOW WOULD YOU LET EDITORS TEST HOW WELL DIFFERENT HEADLINES

    PERFORM FOR THE SAME PIECE OF CONTENT?” Shashi Reddy, Senior Engineer, AOL
  20. THE PROOF OF CONCEPT [Architecture diagram. Labels: Streamer, Collector (×9), Receiver, StatsD cluster, Elasticsearch]
  21. TINY, ENCAPSULATED NANOSERVICES

    // Enrich a record with parsed user-agent details, if it carries a user agent.
    this.visit = function(record) {
      if (record.userAgent) {
        var parser = new UAParser();
        parser.setUA(record.userAgent);
        var user_agent = parser.getResult();
        return { user_agent: user_agent }
      }
      return {};
    };
  22. CASE STUDY PROOF OF CONCEPT PERFORMANCE & RESULTS ▸ Message

    Rate: 300 per second ▸ Receivers Needed: ~70+ ▸ StatsD imposes a number of limitations ▸ Breaks rich payloads down into discrete metrics ▸ Anything but in-flight aggregation means querying Elasticsearch
  23. An efficient real-time data pathway consists of a network of

    transits and terminals, where no single node acts as both a transit and a terminal at the same time. CASE STUDY
  24. CASE STUDY TRANSITS ▸ Short-term ▸ In-memory ▸ Volatile storage

    ▸ Data with life-spans up to a few seconds
  25. TOOL EVALUATION - KAFKA VS STORM APACHE KAFKA ▸ Pub/Sub

    Message Broker ▸ Born @LinkedIn around 2011 ▸ Apache project since 2014 ▸ Key focuses ▸ Message integrity (persistence-first model) ▸ Message order ▸ Fault tolerance
  26. TOOL EVALUATION RABBITMQ ▸ AMQP implementation ▸ Born in 2007

    ▸ Acquired by Pivotal Software in 2013 ▸ Key focuses: ▸ General-purpose messaging ▸ Routing ▸ HA through Federation
  27. TOOL EVALUATION REQUIREMENTS & CONSIDERATIONS ▸ Payloads may arrive in

    any order. ▸ Some data loss is acceptable. ▸ Consumers may only want small subsets of data ▸ Need to route data to consumers in multiple datacentres / in AWS ▸ Broad support for languages
  28. TOOL EVALUATION TRANSIT: RABBITMQ ▸ RabbitMQ’s priorities are similar to

    ours ▸ Federation over at-least-once delivery ▸ Supports complex routing ▸ Allows federation over network boundaries (even when it’s dumb) ▸ Mature clients for our Big Three Stacks (Java, Node.js, PHP) ▸ Big enterprises like stuff with companies behind it
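
The requirement that consumers may want only small subsets of the data maps naturally onto pattern-based routing. The slide does not say which exchange type AOL used; the sketch below illustrates the matching rule of AMQP topic exchanges, where `*` stands for exactly one dot-separated word and `#` for zero or more. The function name and the routing keys are made up for illustration, and this is a standalone model of the rule, not RabbitMQ's implementation.

```javascript
// Hypothetical model of AMQP topic-pattern matching.
// '*' matches exactly one word; '#' matches zero or more words.
function topicMatches(pattern, routingKey) {
  const p = pattern.split('.');
  const k = routingKey === '' ? [] : routingKey.split('.');

  function match(pi, ki) {
    if (pi === p.length) return ki === k.length;
    if (p[pi] === '#') {
      // Let '#' absorb zero words first, then one more word per iteration.
      for (let next = ki; next <= k.length; next++) {
        if (match(pi + 1, next)) return true;
      }
      return false;
    }
    if (ki === k.length) return false;
    if (p[pi] === '*' || p[pi] === k[ki]) return match(pi + 1, ki + 1);
    return false;
  }

  return match(0, 0);
}

// A consumer interested only in click events, whatever the channel:
console.log(topicMatches('beacon.#.click', 'beacon.aol.us.click'));
```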
  29. CASE STUDY MORE THAN JUST RABBITMQ ▸ Moved away from

    Observer Pattern for data processing to a single in and a single out event. ▸ Node.js event handling is VERY fast, but the sheer number of events being created caused memory problems. ▸ Rather than tuning within the app or engine, let a back-pressure mechanism regulate the input rate.
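
The slide does not name the back-pressure mechanism that was used. As a rough illustration of the idea, the sketch below follows the convention of Node.js writable streams, where a write returns `false` once a buffer reaches its high-water mark and the producer waits for a drain signal before sending more. The class name `BoundedQueue` and its `onDrain` hook are hypothetical.

```javascript
// Hypothetical bounded queue that signals the producer to pause.
class BoundedQueue {
  constructor(highWaterMark) {
    this.highWaterMark = highWaterMark;
    this.items = [];
    this.onDrain = null; // one-shot callback, fired when the queue empties
  }

  // Returns false once the queue has reached its high-water mark,
  // telling the producer to stop until onDrain fires.
  push(item) {
    this.items.push(item);
    return this.items.length < this.highWaterMark;
  }

  shift() {
    const item = this.items.shift();
    if (this.items.length === 0 && this.onDrain) {
      const callback = this.onDrain;
      this.onDrain = null;
      callback();
    }
    return item;
  }
}

const queue = new BoundedQueue(2);
queue.push('event-1');
if (!queue.push('event-2')) {
  // Producer pauses here and resumes only after the consumer catches up.
  queue.onDrain = () => console.log('resume reading input');
}
queue.shift();
queue.shift();
```

The point of the design is that the slow side sets the pace: nothing is tuned inside the application, and input simply stops arriving while the consumer is behind.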
  30. while (buffer.length > 0) {
        var char = buffer.shift();
        if ('\n' === char) {
          queue.push(new Buffer(outbuf.join('')));
          continue;
        }
        outbuf.push(char);
      }

      var i = 0;
      var tBuf = buffer.slice();
      while (i < buffer.length) {
        var char = tBuf[i++];
        if ('\n' === char) {
          queue.push(new Buffer(outbuf.join('')));
        }
        outbuf.push(char);
      }
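
Both loops on this slide split an incoming character buffer into newline-delimited messages one character at a time. For comparison, here is a sketch of the same framing step done per chunk, carrying any incomplete trailing line over to the next chunk. It is not the code from the talk; `createLineSplitter` and its callback are hypothetical names.

```javascript
// Hypothetical line framer: feed it chunks, it calls onLine per complete line.
function createLineSplitter(onLine) {
  let remainder = '';
  return function feed(chunk) {
    const parts = (remainder + chunk).split('\n');
    // The last piece has no terminating newline yet, so hold it back.
    remainder = parts.pop();
    for (const line of parts) {
      if (line.length > 0) onLine(line);
    }
  };
}

const received = [];
const feed = createLineSplitter((line) => received.push(line));
feed('{"a":1}\n{"b"');
feed(':2}\n');
console.log(received.length);
```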
  31. CASE STUDY VERSION 1 PERFORMANCE & RESULTS ▸ Message Rate:

    600/s ▸ Receivers Needed: ~35+ ▸ Adding code to handle weird edge cases in data degrades performance. ▸ Micro-optimization of code leads to hard-to-fix crashes and memory leaks.
  32. CASE STUDY GETTING SERIOUS ▸ Receiving data, editing it, and

    routing it in the same step violates my transit/terminal separation policy. ▸ Receiver needs to be a simple transit that consumes and pushes data on to RabbitMQ ▸ Nice-to-haves: ▸ Static & dynamic optimization ▸ Clean multithreading/multiprocessing ▸ Good memory management for large, volatile in-memory data sets
  33. TOOL EVALUATION PICKING A STACK FOR THE DATA RECEIVER -

    THE PROS ▸ Node.js - Simple, easy-to-distribute, fast. ▸ Go - Native concurrency & memory management, fast compiler. ▸ Rust - C++ with modern tooling. ▸ Java - Static & dynamic optimization, good memory management & multithreading. ▸ C/C++ - Speed, good libraries for handling concurrency & memory.
  34. TOOL EVALUATION PICKING A STACK FOR THE DATA RECEIVER -

    THE CONS ▸ Node.js - Too many instances needed to manage production flow. ▸ Go - No one on my team has any familiarity. ▸ Rust - No one on my team has any desire to have any familiarity. ▸ Java - All the cool kids will pick on me. ▸ C/C++ - I like myself too much.
  35. VERSION 2 (JAVA BOOGALOO) [Architecture diagram. Labels: Streamer, Collector (×9), Receiver, RabbitMQ, Processor/Router, Elasticsearch]
  36. public class StreamReader {
        private static final Logger logger = Logger.getLogger(StreamReader.class.getName());
        private StreamerQueue queue = new StreamerQueue();
        private StreamProcessor processor;
        private List<StreamReader.BeaconWorkerThread> workerThreads = new ArrayList();
        private RtStreamerClient client;

        public StreamReader(String streamerURI, AmqpClient amqpClient, String appID,
                            String tpcFltrs, String rfFltrs, String bt) {
          ArrayList queueList = new ArrayList();
          this.processor = new StreamProcessor(amqpClient);
          byte numThreads = 8;

          for (int i = 0; i < numThreads; ++i) {
            StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread();
            this.workerThreads.add(worker);
            worker.start();
          }

          queueList.add(this.queue);
          this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList);
        }
      }

      Callout: CREATING MULTIPLE THREADS WITH STANDALONE CONNECTIONS TO RABBITMQ
      Callout: SIMPLE WRAPPER AROUND NATIVE JAVA LINE STREAMER
  37. public class StreamProcessor {
        private static final Logger logger = Logger.getLogger(StreamProcessor.class.getName());
        private AmqpClient amqpClient;

        public StreamProcessor(AmqpClient amqpClient) {
          this.amqpClient = amqpClient;
        }

        public void send(String data) throws Exception {
          this.amqpClient.send(data.getBytes());
          logger.debug("Sent event " + data + " to AMQP");
        }
      }

      Callout: SIMPLE PASS-THRU
  38. [Diagram: Linked List Queues. Labels: Network Input, Queue (×11), Network Output]
  39. CASE STUDY VERSION 2 PERFORMANCE & RESULTS ▸ Message Rate:

    2600/s ▸ Receivers Needed: ~10 ▸ Validity filtering is almost free in the Java receiver (can’t parse as JSON, drop it) ▸ Processor / Router Service selects only the messages it wants. Everything else is left for another service to collect, or to be dropped on the floor.
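
The validity filter described above (if a message cannot be parsed as JSON, drop it) lived in the Java receiver, whose filtering code is not shown in the deck. The same rule is sketched here in JavaScript for illustration only. The function name `parseOrDrop` is hypothetical, and also rejecting non-object JSON values such as bare numbers is an added assumption.

```javascript
// Hypothetical validity filter: return the parsed payload, or null to drop it.
function parseOrDrop(line) {
  try {
    const value = JSON.parse(line);
    // Assumption: a beacon payload is always a JSON object, never a bare scalar.
    return value !== null && typeof value === 'object' ? value : null;
  } catch (err) {
    return null; // not valid JSON: drop it
  }
}

const kept = ['{"channel":"aol.us"}', '{broken', '42']
  .map(parseOrDrop)
  .filter((payload) => payload !== null);
console.log(kept.length);
```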
  40. REAL-TIME ANALYTICS SERVICE GOALS ▸ Provide (near) real-time statistics, metrics,

    and analytics for editorial staff ▸ Allow statistical evaluation of arbitrary variables ▸ Provide a simple interface for developers working in the publishing stack (PHP)
  41. REAL-TIME ANALYTICS SERVICE WHAT IS ELASTIC SEARCH ▸ A full-text

    search database ▸ A high performance NOSQL document store that features ▸ High-availability via clustering ▸ Rack/Datacentre-aware sharding ▸ Expressive & dynamic query DSL ▸ Some powerful full-text search, I guess, whatever?
  42. [Cluster diagram. Labels: AOL US East Datacentre, AOL US West Datacentre, AOL France Datacentre, AWS us-east-1 Region, Elasticsearch master, Elasticsearch node (×3)]
  43. {
        "query": {
          "filtered": {
            "query": {
              "multi_match": {
                "query": "miley cyrus",
                "fields": [ "byline", "title", "contents" ],
                "type": "cross_fields"
              }
            },
            "filter": {
              "terms": { "site_id": [ 698 ] }
            }
          }
        },
        "size": 25
      }
  44. { "size": 0, "query": { "filtered": { "query": { "terms": { "content.source.cms.post_id": [ 12347, 22314, 242123, 342414 ] } }, "filter": { "bool": { "must": [ { "term": { "click_type": "ping" } }, { "range": { "timestamp": { "gte": 1445854380000, "lte": 1445940780000 } } } ] } } } }, "aggregations": { "post_id": { "terms": { "field": "content.source.cms.post_id", "size": 4, "order": { "_count": "desc" } }, "aggregations": { "search_terms": { "terms": { "field": "referer.search_term.raw" } }, "source": { "terms": { "field": "referer.medium" }, "aggregations": { "referer": { "terms": { "field": "referer.referer" }, "aggregations": { "search_terms": { "terms": { "field": "referer.search_term.raw" } } } } } } } } } }
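  45. var elasticsearch = require('elasticsearch');
      var client = new elasticsearch.Client({hosts: ['http://localhost:9200']});

      // A bulk body alternates an action line with the document it applies to.
      // forEach (assuming documents is an array) yields the documents themselves;
      // for...in would yield only their indexes.
      var buffer = [];
      documents.forEach(function (document) {
        buffer.push({ index: { _index: "some_index", _type: "some_type" }});
        buffer.push(document);
      });
      client.bulk({body: buffer});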
  46. <?php
      $params = [];
      $params['type'] = 'stat';
      $params['index'] = isset($args->search_index)
          ? $args->search_index
          : $elasticSearch->getDatedIndexList($start, $end);
      $params['ignore_unavailable'] = true;
      $params['body'] = $this->getQuery($args->post_ids, $start, $end);
      $results = $client->search($params);
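
The implementation of `getDatedIndexList($start, $end)` is not shown in the deck. A plausible reading is that it expands a time range into one index name per day, which pairs well with `ignore_unavailable` because some days may have no index. The sketch below works under that assumption; the function name, the `stat-YYYY.MM.DD` naming pattern, and the use of UTC days are all guesses made for illustration.

```javascript
// Hypothetical helper: list one index name per UTC day in [startMs, endMs].
function datedIndexList(prefix, startMs, endMs) {
  const dayMs = 24 * 60 * 60 * 1000;
  const names = [];
  // Start at the UTC midnight of the first day and step one day at a time.
  for (let t = Math.floor(startMs / dayMs) * dayMs; t <= endMs; t += dayMs) {
    const d = new Date(t);
    const yyyy = d.getUTCFullYear();
    const mm = String(d.getUTCMonth() + 1).padStart(2, '0');
    const dd = String(d.getUTCDate()).padStart(2, '0');
    names.push(`${prefix}-${yyyy}.${mm}.${dd}`);
  }
  return names;
}

// Using the timestamp range that appears in the query slides:
console.log(datedIndexList('stat', 1445854380000, 1445940780000).join(','));
```

Elasticsearch accepts a comma-separated list of index names in a search request, so the joined string could be passed as the `index` parameter.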
  47. CASE STUDY MULTIVARIATE TESTING - REQUIREMENTS ▸ Allow editors to

    test the performance of any discrete content element ▸ Content elements being: headlines, deks, ledes, subledes, hero images, river images, etc. ▸ Editors should be able to create, start, stop, and evaluate tests without spending developer time.
  48. CASE STUDY MULTIVARIATE TESTING - IMPLEMENTATION ▸ Assign new visitors

    to a test group via cookie ▸ Inject test markers into beacon payload ▸ Compare CTR for PVs with test markers to calculate performance
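
The comparison step can be pictured as counting pageview pings and clicks per test variant and dividing one by the other. The sketch below is a minimal in-memory version, not the Elasticsearch aggregation the talk uses. The event shape (`hash` for the variant, `type` of `ping` or `click`) borrows names from the later slides, but the function `ctrByVariant` is hypothetical.

```javascript
// Hypothetical CTR calculation over a list of tagged events.
function ctrByVariant(events) {
  const stats = new Map();
  for (const { hash, type } of events) {
    if (!stats.has(hash)) stats.set(hash, { pings: 0, clicks: 0 });
    const entry = stats.get(hash);
    if (type === 'ping') entry.pings += 1;
    else if (type === 'click') entry.clicks += 1;
  }
  const result = {};
  for (const [hash, entry] of stats) {
    // Guard against variants that have clicks recorded but no pageviews yet.
    result[hash] = { ...entry, ctr: entry.pings > 0 ? entry.clicks / entry.pings : 0 };
  }
  return result;
}

const sample = [
  { hash: 'headline-a', type: 'ping' },
  { hash: 'headline-a', type: 'ping' },
  { hash: 'headline-a', type: 'click' },
  { hash: 'headline-b', type: 'ping' },
];
console.log(ctrByVariant(sample));
```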
  49. {
        "mv_stats": {
          "type": "nested",
          "include_in_parent": true,
          "properties": {
            "hash": { "type": "string" },
            "test_id": { "type": "integer" }
          }
        }
      }

      Callout (hash): TEST POPULATION IDENTIFIER
      Callout (test_id): TEST ID
  50. { "size": 0, "query": { "filtered": { "query": { "terms": { "mv_stats.test_id": [ 42 ] } }, "filter": { "bool": { "must": [ { "term": { "click_type": "ping" } }, { "range": { "timestamp": { "gte": 1445854380000, "lte": 1445940780000 } } } ] } } } }, "aggs": { "event_type": { "terms": { "field": "click_type" }, "aggs": { "multivariate": { "nested": { "path": "mv_stats" }, "aggs": { "test_ids": { "terms": { "field": "mv_stats.test_id" }, "aggs": { "hashes": { "terms": { "field": "mv_stats.hash" }, "aggs": { "event_types": { "terms": { "field": "click_type" } } } } } } } } } } } }

      Callout: REGULAR PAGEVIEW AGGREGATIONS NEST AND TAKE A CONTEXT OF THE PARENT
  51. <?php
      $results = $this->analytics()->multivariate()->get([
          'test_id' => $id,
          'event_type' => 'all',
          'start' => $test['started']
      ])->data();

      if (!empty($results['hashes'])) {
          foreach (array_keys($test['items']) as $hash) {
              $clicks = 0;
              if (!empty($results['hashes'][$hash]['clicks'])) {
                  $clicks = $results['hashes'][$hash]['clicks'];
              }
              $pings = 0;
              if (!empty($results['hashes'][$hash]['pings'])) {
                  $pings = $results['hashes'][$hash]['pings'];
              }
              $test['items'][$hash]['clicks'] = $clicks;
              $test['items'][$hash]['pings'] = $pings;
              // Avoid dividing by zero when a variant has no pageviews yet.
              $test['items'][$hash]['percent'] = $pings > 0 ? ($clicks / $pings) * 100 : 0;
          }
      }
  52. function plot(point) {
        var points = svg.selectAll("circle")
          .data([point], function(d) { return d.id; });

        // parseFloat keeps fractional degrees; parseInt would truncate them.
        points.enter()
          .append("circle")
          .attr("cx", function (d) { return projection([parseFloat(d.location.geopoint.lon), parseFloat(d.location.geopoint.lat)])[0] })
          .attr("cy", function (d) { return projection([parseFloat(d.location.geopoint.lon), parseFloat(d.location.geopoint.lat)])[1] })
          .attr("r", function (d) { return 1; })
          .style('fill', 'red')
          .style('fill-opacity', 1)
          .style('stroke', 'red')
          .style('stroke-width', '0.5px')
          .style('stroke-opacity', 1)
          .transition()
          .duration(10000)
          .style('fill-opacity', 0)
          .style('stroke-opacity', 0)
          .attr('r', '32px').remove();
      }

      var buffer = [];
      var socket = io();
      socket.on('geopoint', function(point) {
        if (point.location.geopoint) {
          plot(point);
        }
      });
  53. var views = 0;
      var socket = io();
      socket.on('pageview', function(point) {
        views++;
      });

      function tick() {
        data.push(views);
        views = 0;
        path
          .attr("d", line)
          .attr("transform", null)
          .transition()
          .duration(500)
          .ease("linear")
          .attr("transform", "translate(" + x(0) + ",0)")
          .each("end", tick);
        data.shift();
      }
      tick();
  54. LIVE PROFILING DEVELOPING ON THE AOL MEDIA PLATFORM ▸ Use

    our API and build what you like on servers you manage. ▸ Use our managed hosting platform which handles scaling, caching, etc. ▸ But… requires you to work in a custom DSL
  55. HOLY CRAP THE GUY WHO BUILT ALL OF THIS IS

    A GENIUS DEVELOPING FOR THE AOL MEDIA PLATFORM ▸ Create a repository in your source control system of choice ▸ Write code in our twig-based language (CodeBlocks) ▸ Code on your local machine is synced to a live sandbox with access to test data and resources that mirror production ▸ Promote sandboxes to live production ▸ This was seriously all built by a guy named Ralph.
  56. {% set posts = api.posts.query({
        page: req.params.page|default(1),
        limit: 3,
        categories: req.params.category ? [{parent:req.params.category}, req.params.category] : null,
        categories_match: 'any',
        tags: req.params.tag ? [req.params.tag] : null
      }) %}
  57. [Flow diagram]

    ▸ DEV STARTS A PROFILER SESSION
    ▸ DEV VISITS PRODUCTION SITE WITH QUERY PARAM
    ▸ RENDER SERVER ACTIVATES PROFILING
    ▸ EVENT MESSAGES ARE TAGGED WITH SESSION ID
    ▸ RABBITMQ ROUTES TAGGED MESSAGES TO PROFILER SERVICE
    ▸ DEV’S PROFILER CONSOLE CONNECTS TO PROFILER SERVICE
    ▸ PROFILER SERVICE WAITS FOR MESSAGES
    ▸ MESSAGES ARE RECEIVED AND RENDERED IN THE CONSOLE
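
The last few steps of this flow amount to delivering each tagged message to the console that owns the matching session and ignoring everything else. The sketch below models only that dispatch step in memory. In the real system RabbitMQ does the routing, and the class name `ProfilerHub` and the `profilerSession` field are hypothetical.

```javascript
// Hypothetical in-memory stand-in for the profiler service's dispatch step.
class ProfilerHub {
  constructor() {
    this.sessions = new Map(); // session ID -> console callback
  }

  // A developer's console connects and waits for messages.
  connect(sessionId, onMessage) {
    this.sessions.set(sessionId, onMessage);
  }

  disconnect(sessionId) {
    this.sessions.delete(sessionId);
  }

  // Deliver a message to its session's console; returns false if nobody is listening.
  route(message) {
    const handler = this.sessions.get(message.profilerSession);
    if (!handler) return false;
    handler(message);
    return true;
  }
}

const hub = new ProfilerHub();
hub.connect('session-123', (message) => console.log('render in console:', message.event));
hub.route({ profilerSession: 'session-123', event: 'api.posts.query' });
hub.route({ profilerSession: 'someone-else', event: 'ignored' });
```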
  58. CROSS-PLATFORM EVENTING A LITTLE SOMETHING FOR THAT NICE ENGINEER OVER

    THERE ▸ Allow devs to dispatch “native” events in one stack and observe them in another ▸ The PHP CMS uses the Symfony EventDispatcher to trigger an event in Node.js ▸ Distributed event handling without PHP workers ▸ Event-driven search indexing (no rivers or crons)
  59. <?php
      public function dispatch($event_name, Event $event = null)
      {
          $dispatchedEvent = parent::dispatch($event_name, $event);

          if ($dispatchedEvent instanceof ForwardableEvent) {
              $data = $dispatchedEvent->getEventData();
              try {
                  $this->amqp->publish(
                      self::AMQP_CONNECTION,
                      self::AMQP_EXCHANGE,
                      self::AMQP_ROUTING_KEY,
                      json_encode(['name' => $event_name, 'data' => $data])
                  );
              } catch (\Exception $exc) {
                  $this->logger->error(
                      self::class . ' failed to publish event to AMQP.',
                      [ 'exception' => $exc ]
                  );
              }
          }

          return $dispatchedEvent;
      }

      Callout: OVERRIDING THE DEFAULT BEHAVIOR OF THE PHP EVENT DISPATCHER
      Callout: DEVS MARK EVENTS AS ‘FORWARDABLE’ BY IMPLEMENTING AN INTERFACE
      Callout: EVENTS ARE FORWARDED ON TO AN AMQP EXCHANGE
  60. module.exports = {
        register: function (config) {
          client = new es.Client({ hosts: config.hosts, log: Logger });
          logger.info('AMP Elasticsearch Indexer module loaded!');
        },
        listeners: {
          'amp.post.save': function (event, callback) {
            var index = 'posts';
            var type = 'post';
            var id = event['id'];
            if (!id) {
              return callback('Invalid post object received');
            }
            indexRecord(index, type, id, event, callback);
          }
        }
      };

      Callout: JS FUNCTION EXECUTED WHEN PHP DISPATCHES EVENTS
  61. WRAPPING IT UP AOL’S DATA PIPELINE - BY THE NUMBERS

    ▸ 1.3 billion events per day ▸ Routed by RabbitMQ to microservice consumers ▸ Driving real-time analytics over 250 GB of raw data per day ▸ Visualizing 1.3 million events per day ▸ Generating live profiles for developers of ~50 properties ▸ Handling 10,000 Elasticsearch search index updates per day
  62. WRAPPING IT UP AOL’S DATA PIPELINE - STACKS AND TECH

    ▸ Programming Languages: Java, Node.js, PHP, Python (HA load-balancing and routing) ▸ Hadoop, RabbitMQ, Elasticsearch, Vertica
  63. WRAPPING IT UP AOL’S DATA PIPELINE - 2016 & BEYOND

    ▸ Embeddable visualizations ▸ On-demand stream filters with Redis time-series bucketing ▸ Real-time predictive performance analysis ▸ Real-time social sentiment analysis ▸ Moving all of this infrastructure to AWS (oy!) ▸ Integrating Apache Spark
  64. PIPELINE MAP [Diagram. Labels: AOL Data Layer, RabbitMQ, Real-Time Analytics Service, Visualizations Service, AOL Media Platform, Profiler Service, Cross-Platform Event Propagation Service, Relegence]