SAMANTHA QUIÑONES ABOUT ME ▸ Software Engineer since 1997 ▸ Doing “media stuff” since 2012 ▸ Principal @ AOL since 2014 ▸ @ieatkillerbees ▸ http://samanthaquinones.com
IMAGE CREDITS
▸ Clock on the roof of Our Lady of Dormition Melkite Greek Catholic Patriarchal Cathedral, Damascus, Syria
▸ https://commons.wikimedia.org/wiki/Category:Church_clocks#/media/File:Clock_of_the_Melkite_Greek_Catholic_Church,_Damascus.jpg
▸ Bernard Gagnon ▸ CC BY-SA 3.0
▸ The SJ train 58/637 with an Rc locomotive from Stockholm passes by Etterstad on its way on Hovedbanen to Oslo Central Station, about eight minutes late. The Loenga–Alnabru freight line is seen to the right.
▸ https://upload.wikimedia.org/wikipedia/commons/d/d7/Swedish_train_in_Norway.jpg
▸ Peter Krefting ▸ CC BY-SA 2.0
MEASURE ONCE CUT TWICE TRADITIONAL METRICS ▸ Request response time - are we responding fast enough? ▸ Cache hit rate - are we making our backend work too hard? ▸ Resource utilization - do we have enough “hardware”?
DELAY PERCEPTION
▸ <100ms: Instantaneous
▸ <300ms: Perceptible delay
▸ <1000ms: System “working”
▸ <10000ms: System “slow”
▸ >10000ms: System “down”
Source: O’Reilly Media
AMAZON SALES DATA EFFECTS OF LATENCY
▸ Amazon: -1% sales for every 100ms of added latency
[Chart: Sales (USD) vs. seconds of latency]
Linden, G. (2006, December 3). Make Data Useful. Data Mining (CS345). Lecture conducted from Stanford University, Stanford, CA.
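The “-1% sales per 100ms” rule of thumb can be sketched as a tiny projection function. This is purely illustrative arithmetic on the slide’s figure, not Amazon’s actual model:

```javascript
// Rough projection of the "-1% sales per 100ms of latency" rule of thumb.
// Illustrative only -- not Amazon's actual model.
function projectedSales(baselineSales, addedLatencyMs) {
  const lossFraction = 0.01 * (addedLatencyMs / 100); // 1% per 100ms
  return baselineSales * (1 - lossFraction);
}

// 500ms of extra latency on a $1000 baseline -> roughly $950
console.log(projectedSales(1000, 500));
```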
GOOGLE SEARCH EXPERIMENT EFFECTS OF LATENCY
▸ Google search experiment (4-6 weeks): adding 50-400ms of latency measurably reduced the number of searches per day
[Chart: % fewer searches per day vs. milliseconds of additional latency (50, 100, 200, 400)]
Schurman, E., Brutlag, J. (2009, June 23). The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. O'Reilly Velocity. Lecture conducted from O'Reilly Media, San Jose, CA.
BEHAVIORAL METRICS MULTIVARIATE (A/B) TESTING ▸ Sort all users into groups ▸ 1 control group receives unaltered content ▸ 1 or more groups receive altered content ▸ Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
BEHAVIORAL METRICS MEASURING USER BEHAVIOR & EXPERIENCE ▸ Application path - What does the user click on? ▸ Usage patterns - When does the user visit? Where do they come from? ▸ Mouse & attention tracking - What draws the user’s attention? ▸ RUM (Real User Monitoring)
CASE STUDY AOL MEDIA PLATFORM ▸ Content management ▸ Distributed rendering farm ▸ Integrated development environment using custom DSL ▸ Content aggregation platform ▸ Machine learning platform ▸ Multi-tenant system
[Architecture diagram: “beacon” servers → RabbitMQ → streamers & data layer services (Cassandra, Hadoop)]
CASE STUDY CONTENT CREATORS WANT TO KNOW ▸ Today’s traffic by author & vertical ▸ Top performing articles for the past hour ▸ Recent social engagement trends
CASE STUDY PROOF OF CONCEPT PERFORMANCE & RESULTS ▸ Message Rate: 300 per second ▸ Receivers Needed: ~70+ ▸ StatsD imposes a number of limitations ▸ Breaks rich payloads down into discrete metrics ▸ Anything but in-flight aggregation means querying Elasticsearch
CASE STUDY
An efficient real-time data pathway consists of a network of transits and terminals, where no single node acts as both a transit and a terminal at the same time.
TOOL EVALUATION RABBITMQ ▸ AMQP implementation ▸ Born in 2007 ▸ Acquired by Pivotal Software in 2013 ▸ Key focuses: ▸ General-purpose messaging ▸ Routing ▸ HA through Federation
TOOL EVALUATION REQUIREMENTS & CONSIDERATIONS ▸ Payloads may arrive in any order. ▸ Some data loss is acceptable. ▸ Consumers may only want small subsets of data ▸ Need to route data to consumers in multiple datacentres / in AWS ▸ Broad support for languages
TOOL EVALUATION TRANSIT: RABBITMQ ▸ RabbitMQ’s priorities are similar to ours ▸ Federation over at-least-once delivery ▸ Supports complex routing ▸ Allows federation over network boundaries (even when it’s dumb) ▸ Mature clients for our Big Three Stacks (Java, Node.js, PHP) ▸ Big enterprises like stuff with companies behind it
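The “complex routing” here is RabbitMQ’s topic exchange: consumers bind queues with patterns like `beacon.*.pageview`, and the broker matches routing keys word by word (`*` matches exactly one word, `#` matches zero or more). A simplified re-implementation of that matching, for illustration only (the routing keys are made up):

```javascript
// Simplified AMQP topic matching: '.'-separated words,
// '*' matches exactly one word, '#' matches zero or more words.
function topicMatch(pattern, routingKey) {
  const p = pattern.split('.');
  const k = routingKey.split('.');
  function match(i, j) {
    if (i === p.length) return j === k.length;
    if (p[i] === '#') {
      // '#' can absorb any number of remaining words.
      for (let skip = j; skip <= k.length; skip++) {
        if (match(i + 1, skip)) return true;
      }
      return false;
    }
    if (j === k.length) return false;
    return (p[i] === '*' || p[i] === k[j]) && match(i + 1, j + 1);
  }
  return match(0, 0);
}
```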
CASE STUDY MORE THAN JUST RABBITMQ ▸ Moved data processing away from the Observer Pattern to a single input event and a single output event. ▸ Node.js event handling is VERY fast, but the sheer number of events being created caused memory problems. ▸ Rather than tuning within the app or engine, let the back pressure mechanism regulate the input rate.
// Before: Array.shift() re-indexes the whole buffer on every call.
while (buffer.length > 0) {
  var char = buffer.shift();
  if ('\n' === char) {
    queue.push(new Buffer(outbuf.join('')));
    outbuf = [];
    continue;
  }
  outbuf.push(char);
}

// After: walk a copy by index instead -- no re-indexing per character.
var i = 0;
var tBuf = buffer.slice();
while (i < tBuf.length) {
  var char = tBuf[i++];
  if ('\n' === char) {
    queue.push(new Buffer(outbuf.join('')));
    outbuf = [];
    continue;
  }
  outbuf.push(char);
}
CASE STUDY VERSION 1 PERFORMANCE & RESULTS ▸ Message Rate: 600/s ▸ Receivers Needed: ~35+ ▸ Adding code to handle weird edge cases in data degrades performance. ▸ Micro-optimization of code leads to hard-to-fix crashes and memory leaks.
CASE STUDY GETTING SERIOUS ▸ Receiving data, editing it, and routing it in the same step violates my transit/terminal separation policy. ▸ Receiver needs to be a simple transit that consumes and pushes data onto RabbitMQ ▸ Nice-to-haves: ▸ Static & dynamic optimization ▸ Clean multithreading/multiprocessing ▸ Good memory management for large, volatile in-memory data sets
TOOL EVALUATION PICKING A STACK FOR THE DATA RECEIVER - THE PROS ▸ Node.js - Simple, easy-to-distribute, fast. ▸ Go - Native concurrency & memory management, fast compiler. ▸ Rust - C++ with modern tooling. ▸ Java - Static & dynamic optimization, good memory management & multi- threading. ▸ C/C++ - Speed, good libraries for handling concurrency & memory.
TOOL EVALUATION PICKING A STACK FOR THE DATA RECEIVER - THE CONS ▸ Node.js - Too many instances needed to manage production flow. ▸ Go - No one on my team has any familiarity. ▸ Rust - No one on my team has any desire to have any familiarity. ▸ Java - All the cool kids will pick on me. ▸ C/C++ - I like myself too much.
CASE STUDY VERSION 2 PERFORMANCE & RESULTS ▸ Message Rate: 2600/s ▸ Receivers Needed: ~10 ▸ Validity filtering is almost free in the Java receiver (can’t parse as JSON, drop it) ▸ Processor / Router Service selects only the messages it wants. Everything else is left for another service to collect, or to be dropped on the floor.
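The “can’t parse as JSON, drop it” filter is essentially try/parse/drop. The actual receiver is written in Java; the idea sketched in JavaScript for consistency with the rest of the deck:

```javascript
// Drop any payload that does not parse as JSON -- the cheapest possible
// validity filter. (The real receiver does the equivalent in Java.)
function filterValid(rawMessages) {
  const valid = [];
  for (const raw of rawMessages) {
    try {
      valid.push(JSON.parse(raw));
    } catch (err) {
      // Malformed payload: dropped on the floor, as the slide says.
    }
  }
  return valid;
}
```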
REAL-TIME ANALYTICS SERVICE GOALS ▸ Provide (near) real-time statistics, metrics, and analytics for editorial staff ▸ Allow statistical evaluation of arbitrary variables ▸ Provide a simple interface for developers working in the publishing stack (PHP)
REAL-TIME ANALYTICS SERVICE WHAT IS ELASTICSEARCH ▸ A full-text search database ▸ A high-performance NoSQL document store that features ▸ High availability via clustering ▸ Rack/datacentre-aware sharding ▸ Expressive & dynamic query DSL ▸ Some powerful full-text search, I guess, whatever?
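As a taste of the query DSL, here is the kind of query the analytics service might issue: today’s pageviews for one author, bucketed by hour. The field names (`author`, `timestamp`), index layout, and aggregation are hypothetical, not AOL’s actual schema:

```javascript
// Build a hypothetical Elasticsearch query body: filter events by author
// and time window, then bucket matching documents into hourly counts.
function authorTrafficQuery(author, sinceIso) {
  return {
    query: {
      bool: {
        filter: [
          { term: { author: author } },
          { range: { timestamp: { gte: sinceIso } } },
        ],
      },
    },
    aggs: {
      views_per_hour: {
        date_histogram: { field: 'timestamp', interval: 'hour' },
      },
    },
  };
}
```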
[Diagram: Elasticsearch cluster spanning the AOL US East, US West, and France datacentres plus the AWS us-east-1 region: one master, multiple nodes]
CASE STUDY MULTIVARIATE TESTING - REQUIREMENTS ▸ Allow editors to test the performance of any discrete content element ▸ Content elements being: headlines, deks, ledes, subledes, hero images, river images, etc. ▸ Editors should be able to create, start, stop, and evaluate tests without spending developer time.
CASE STUDY MULTIVARIATE TESTING - IMPLEMENTATION ▸ Assign new visitors to a test group via cookie ▸ Inject test markers into beacon payload ▸ Compare CTR for PVs with test markers to calculate performance
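Evaluating a test then reduces to comparing click-through rates between groups. A minimal sketch of that arithmetic; the numbers are made up for illustration:

```javascript
// Click-through rate: clicks per pageview.
function ctr(clicks, views) {
  return views === 0 ? 0 : clicks / views;
}

// Relative lift of a variant's CTR over the control's CTR.
function relativeLift(variant, control) {
  const c = ctr(control.clicks, control.views);
  const v = ctr(variant.clicks, variant.views);
  return c === 0 ? 0 : (v - c) / c;
}

// Hypothetical numbers: the variant headline gets 60 clicks per 1000 views
// vs 50 per 1000 for control -> a 20% relative lift.
const lift = relativeLift(
  { clicks: 60, views: 1000 },
  { clicks: 50, views: 1000 }
);
```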
LIVE PROFILING DEVELOPING ON THE AOL MEDIA PLATFORM ▸ Use our API and build what you like on servers you manage. ▸ Use our managed hosting platform which handles scaling, caching, etc. ▸ But… requires you to work in a custom DSL
HOLY CRAP THE GUY WHO BUILT ALL OF THIS IS A GENIUS DEVELOPING FOR THE AOL MEDIA PLATFORM ▸ Create a repository in your source control system of choice ▸ Write code in our Twig-based language (CodeBlocks) ▸ Code on your local machine is synced to a live sandbox with access to test data and resources that mirror production ▸ Promote sandboxes to live production ▸ This was seriously all built by a guy named Ralph.
1. Dev starts a profiler session
2. Dev visits the production site with a query param
3. Render server activates profiling
4. Event messages are tagged with the session ID
5. RabbitMQ routes tagged messages to the profiler service
6. Dev’s profiler console connects to the profiler service
7. Profiler service waits for messages
8. Messages are received and rendered in the console
CROSS-PLATFORM EVENTING A LITTLE SOMETHING FOR THAT NICE ENGINEER OVER THERE ▸ Allow devs to dispatch “native” events in one stack and observe them in another ▸ The PHP CMS uses the Symfony EventDispatcher to trigger an event in Node.js ▸ Distributed event handling without PHP workers ▸ Event-driven search indexing (no rivers or crons)
var es = require('elasticsearch');
var logger = require('./logger'); // hypothetical shared logger module

var client;

module.exports = {
  register: function (config) {
    client = new es.Client({ hosts: config.hosts, log: logger });
    logger.info('AMP Elasticsearch Indexer module loaded!');
  },
  listeners: {
    // Invoked when the PHP CMS dispatches its amp.post.save event
    'amp.post.save': function (event, callback) {
      var index = 'posts';
      var type = 'post';
      var id = event.id;
      if (!id) {
        return callback('Invalid post object received');
      }
      // indexRecord() wraps client.index(); defined elsewhere in the module
      indexRecord(index, type, id, event, callback);
    }
  }
};

JS FUNCTION EXECUTED WHEN PHP DISPATCHES EVENTS
WRAPPING IT UP AOL’S DATA PIPELINE - BY THE NUMBERS ▸ 1.3 billion events per day ▸ Routed by RabbitMQ to microservice consumers ▸ Driving real-time analytics over 250 GB of raw data per day ▸ Visualizing 1.3 million events per day ▸ Generating live profiles for developers of ~50 properties ▸ Handling 10,000 Elasticsearch search index updates per day
WRAPPING IT UP AOL’S DATA PIPELINE - 2016 & BEYOND ▸ Embeddable visualizations ▸ On-demand stream filters with Redis time-series bucketing ▸ Real-time predictive performance analysis ▸ Real-time social sentiment analysis ▸ Moving all of this infrastructure to AWS (oy!) ▸ Integrating Apache Spark
PIPELINE MAP
[Diagram: AOL data layer → RabbitMQ → real-time analytics service, visualizations service, profiler service, cross-platform event propagation service; AOL Media Platform; Relegence]