Drinking from the Firehose

Drinking from the Firehose Real-Time Metrics Samantha Quiñones

@ieatkillerbees http://samanthaquinones.com

“How would you let editors test how well different headlines
perform for the same piece of content?”

Areas of Interest
• Quantified Application Performance
• Perceived Application Performance (User Experience)
• User Behavior

Quantified Application Performance

• Are requests being handled efficiently?
• Are adequate resources available at each layer of the stack?
• Is the cache being used efficiently?

CPU Time Per Request

Stack Performance

Perceived User Performance

• How long does it take for a page to be “ready” for the user?
• Is the page responsive to user input?
• Does the page require an excessive number of requests to complete?

Delay Perception
Delay       Perception
<100ms      Instantaneous
<300ms      Perceptible Delay
<1000ms     System “Working”
<10000ms    System “Slow”
>10000ms    System “Down”
Source: O’Reilly Media
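The thresholds above can be read as a simple lookup. The sketch below is illustrative only: the function name `perceivedDelay` is made up for this example, and treating exactly 10000ms as “Down” is an assumption, since the table leaves that boundary undefined.

```javascript
// Thresholds in milliseconds, copied from the Delay Perception table.
const DELAY_BANDS = [
  { under: 100, label: "Instantaneous" },
  { under: 300, label: "Perceptible Delay" },
  { under: 1000, label: 'System "Working"' },
  { under: 10000, label: 'System "Slow"' },
];

// Hypothetical helper: map a measured delay to the table's wording.
function perceivedDelay(ms) {
  for (const band of DELAY_BANDS) {
    if (ms < band.under) return band.label;
  }
  // Assumption: anything at or above 10000ms counts as "Down".
  return 'System "Down"';
}

console.log(perceivedDelay(250));
```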

Effects of Latency: Amazon Sales
-1% sales per 100ms of increased latency
[Chart: Sales (USD), 0 to 2000, against Seconds of Latency, 0 to 15]
S = S - (msL*1%)
Linden, G. (2006, December 3). Make Data Useful. Data Mining (CS345). Lecture conducted from Stanford University, Stanford, CA.
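One way to read the "-1% per 100ms" figure is as a linear model. The sketch below assumes exactly that reading. The function name `projectedSales` and the floor at zero are illustrative choices, not something taken from the cited lecture.

```javascript
// Assumed linear reading of "-1% sales per 100ms of added latency".
function projectedSales(baseSales, addedLatencyMs) {
  const lossFraction = 0.01 * (addedLatencyMs / 100);
  // Sales cannot go negative, so clamp at zero (an assumption of this sketch).
  return Math.max(0, baseSales * (1 - lossFraction));
}

// Half a second of added latency on 2000 USD of baseline sales.
console.log(projectedSales(2000, 500).toFixed(2));
```

Under this reading the curve reaches zero at 10 seconds of added latency, so the model is only plausible for small delays.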

Effects of Latency: Google Search Experiment (4-6 Weeks)
[Chart: % Fewer Searches per Day (0 to -1) against Milliseconds of Additional Latency (50, 100, 200, 400)]
Schurman, E., Brutlag, J. (2009, June 23). The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. O'Reilly Velocity. Lecture conducted from O'Reilly Media, San Jose, CA.

Interpreting Web Profiler Information

User Behavior

Measuring User Behavior
• Application path
• Use patterns
• Mouse & attention tracking

Multivariate Testing
• Sort all users into groups
• 1 control group receives unaltered content
• 1 or more groups receive altered content
• Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
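The grouping and measurement steps above could be sketched as follows. This is an illustration under assumptions, not the system described in the talk: hashing the user ID to pick a group is one common way to keep assignments stable, and the names `assignGroup`, `clickThroughRates` and the event shape `{ group, clicked }` are invented for this example.

```javascript
const crypto = require("crypto");

// Deterministically map a user ID to one of the test groups, so the same
// user always sees the same variant (hash-based bucketing is an assumption).
function assignGroup(userId, groups) {
  const digest = crypto.createHash("sha256").update(String(userId)).digest();
  return groups[digest.readUInt32BE(0) % groups.length];
}

// Compute click-through rate per group from hypothetical impression events.
function clickThroughRates(events) {
  const stats = {};
  for (const { group, clicked } of events) {
    stats[group] = stats[group] || { impressions: 0, clicks: 0 };
    stats[group].impressions += 1;
    if (clicked) stats[group].clicks += 1;
  }
  for (const name of Object.keys(stats)) {
    stats[name].ctr = stats[name].clicks / stats[name].impressions;
  }
  return stats;
}

const GROUPS = ["control", "headline-b", "headline-c"];
console.log(assignGroup("user-42", GROUPS));
```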

Managing Big Data

How big is big?

~1,300,000,000 events per DAY

~40 datapoints per EVENT

~15,000 records per SECOND

Containing ~600,000 datapoints

At a rate of up to 25 megabytes / second
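A quick arithmetic check ties these figures together. The sketch assumes that one record is one event and that a megabyte is 1,000,000 bytes. The per-record size is only a rough upper bound, because 25 megabytes per second is quoted as a peak rate.

```javascript
// Figures quoted on the slides.
const recordsPerSecond = 15000;
const datapointsPerEvent = 40;
const peakBytesPerSecond = 25 * 1000 * 1000; // assumes 1 MB = 1,000,000 bytes

// Derived values (assuming one record per event).
const secondsPerDay = 24 * 60 * 60;
const eventsPerDay = recordsPerSecond * secondsPerDay; // about 1.3 billion
const datapointsPerSecond = recordsPerSecond * datapointsPerEvent;
const peakBytesPerRecord = peakBytesPerSecond / recordsPerSecond;

console.log(eventsPerDay, datapointsPerSecond, Math.round(peakBytesPerRecord));
```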

[Diagram: 12 Collectors and a Rabbit MQ Farm]

[Diagram: Rabbit MQ Farm and Hadoop]

Hadoop
• Framework for distributed storage and processing of data
• Designed to make managing very large datasets simple with…
  • Well-documented, open-source, common libraries
  • Optimization for commodity hardware

Hadoop Distributed File System
• Modeled after Google File System
• Stores logical files across multiple systems
• Rack-aware
• No read-write concurrency

MapReduce • Framework for massively parallel data processing tasks

Map
<?php
$document = "I'm a little teapot short and stout here is my handle here is my spout";

/**
 * Outputs: [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
 */
function map($target_word, $document) {
    return array_map(
        function ($word) use ($target_word) {
            if ($word === $target_word) {
                return 1;
            }
            return 0;
        },
        preg_split('/\s+/', $document)
    );
}

echo json_encode(map("is", $document)) . PHP_EOL;

Reduce
<?php
$data = [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0];

/**
 * Outputs: 2
 */
function reduce($data) {
    return array_reduce(
        $data,
        function ($count, $value) {
            return $count + $value;
        },
        0 // explicit initial value, so an empty input yields 0 rather than null
    );
}

echo reduce($data) . PHP_EOL;

Hadoop Limitations
• Hadoop jobs are batched and take significant time to run
• Data may not be available for 1+ hours after collection

“How would you let editors test how well different headlines
perform for the same piece of content?”

Consider Shelf-life
• Most articles are relevant for < 24 hours
• Interest peaks < 3 hours

Real-Time Pipelines

[Diagram: Collectors, Rabbit MQ Farm, and three Streamers]

Version 1 (PoC)
[Diagram: Collectors, Streamer, Receiver, StatsD Cluster, ElasticSearch]

this.visit = function(record) {
  if (record.userAgent) {
    var parser = new UAParser();
    parser.setUA(record.userAgent);
    var user_agent = parser.getResult();
    return { user_agent: user_agent };
  }
  return {};
};

Findings
• Max throughput per collector: 300 events/second
• ~70 receivers needed for prod
• StatsD key format creates data redundancy and reduced data richness


Transits & Terminals
• Transits: short-term, in-memory, volatile storage for data with a life-span of up to a few seconds
• Terminals: destinations for data that either store, abandon, or transmit it

An efficient real-time data pathway consists of a network of
transits and terminals, where no single node acts as both a transit and a terminal at the same time.

StatsD
• Acts as a transit, taking data and passing it along…
• BUT
• Acts as a terminal, aggregating keys in memory and becoming a transit only after a time or buffer threshold
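The dual role described above can be illustrated with a toy aggregator. This is not StatsD's implementation. The class `KeyAggregator` and its methods are invented for the example, and a real deployment would call `flush` on a timer or when a buffer fills.

```javascript
// Toy model of "aggregate keys in memory, then pass them along".
class KeyAggregator {
  constructor(onFlush) {
    this.counts = new Map(); // in-memory state: the "terminal" behaviour
    this.onFlush = onFlush;
  }

  // Individual records are collapsed into per-key counters here, which is
  // where the per-record detail is lost.
  increment(key) {
    this.counts.set(key, (this.counts.get(key) || 0) + 1);
  }

  // Only at flush time does the node behave like a transit again.
  flush() {
    const snapshot = Object.fromEntries(this.counts);
    this.counts.clear();
    this.onFlush(snapshot);
    return snapshot;
  }
}

const aggregator = new KeyAggregator((snapshot) => console.log(snapshot));
aggregator.increment("pageview.home");
aggregator.increment("pageview.home");
aggregator.increment("pageview.article");
const flushed = aggregator.flush();
```

Between flushes the node is holding data rather than moving it, which is the mix of roles the transit and terminal rule warns against.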

Version 2
[Diagram: Collectors, Streamer, Receiver, ElasticSearch, RabbitMQ]

Version 2
• Eliminated eventing and improved performance
• Replaced StatsD with RabbitMQ
• Data records are kept together
• No longer works with Kibana (sadface)

RabbitMQ
• Lightweight message broker
• Allows complex message routing without application-level logic
• Can buffer 90-120 seconds of traffic
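As a rough sense of scale for that buffer, the sketch below multiplies the 90-120 second window by the peak data rate of 25 megabytes per second quoted earlier in the deck. Combining the two figures is an assumption of this example. Actual broker memory use depends on message overhead and configuration, which the slides do not describe.

```javascript
// Assumed peak ingest rate from the "How big is big?" slides.
const peakMegabytesPerSecond = 25;

// Hypothetical helper: payload volume held for a given buffering window.
function bufferedMegabytes(seconds) {
  return peakMegabytesPerSecond * seconds;
}

const lowEstimate = bufferedMegabytes(90);   // 2250 MB
const highEstimate = bufferedMegabytes(120); // 3000 MB
console.log(lowEstimate, highEstimate);
```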

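// Approach 1: drain the buffer one character at a time with shift()
while (buffer.length > 0) {
  var char = buffer.shift();
  if ('\n' === char) {
    queue.push(Buffer.from(outbuf.join('')));
    outbuf.length = 0; // start a new record
    continue;
  }
  outbuf.push(char);
}

// Approach 2: walk a copy of the buffer by index instead of shifting
var i = 0;
var tBuf = buffer.slice();
while (i < tBuf.length) {
  var ch = tBuf[i++];
  if ('\n' === ch) {
    queue.push(Buffer.from(outbuf.join('')));
    outbuf.length = 0; // start a new record
    continue; // do not copy the newline into the next record
  }
  outbuf.push(ch);
}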
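For comparison, newline-delimited framing can also be done without a per-character loop. The sketch below is an alternative technique, not the code from the talk. It assumes records arrive as string chunks that may split a record across chunk boundaries, and the name `createLineSplitter` is invented for the example.

```javascript
// Returns a feed(chunk) function that emits one complete line per record.
function createLineSplitter(onLine) {
  let pending = ""; // holds the incomplete tail of the previous chunk
  return function feed(chunk) {
    pending += chunk;
    let newlineAt;
    while ((newlineAt = pending.indexOf("\n")) !== -1) {
      onLine(pending.slice(0, newlineAt));
      pending = pending.slice(newlineAt + 1);
    }
  };
}

const lines = [];
const feed = createLineSplitter((line) => lines.push(line));
feed('{"a":1}\n{"b"'); // second record is split across two chunks
feed(':2}\n');
console.log(lines);
```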

Findings
• … receivers needed for prod
• Micro-optimized code became increasingly brittle and hard to maintain as custom logic was needed for every edge case

ElasticSearch RabbitMQ

Need to Get Serious
• Very high throughput
• Multi-threaded worker pool with large memory buffers
• Static & dynamic optimization
• Efficient memory management for extremely volatile in-memory data
• Eliminate any processing overhead. Receiver must be a Transit

And also…
• Not GoLang (because no one on the team is familiar with it)
• Not Rust (because no one on the team wants to be familiar with it)
• Not C (because C)

mfw java :(

Why Java?
• Solid static & dynamic analysis and optimizations in the S2BC & JIT compilers
• Clients for the stuff I needed to talk to
• Well-supported within AOL & within my team

[Diagram: Collectors, Processor/Router, RabbitMQ, ElasticSearch]

public class StreamReader {
    private static final Logger logger = Logger.getLogger(StreamReader.class.getName());
    private StreamerQueue queue = new StreamerQueue();
    private StreamProcessor processor;
    private List<StreamReader.BeaconWorkerThread> workerThreads = new ArrayList<>();
    private RtStreamerClient client;

    public StreamReader(String streamerURI, AmqpClient amqpClient, String appID,
                        String tpcFltrs, String rfFltrs, String bt) {
        ArrayList queueList = new ArrayList();
        this.processor = new StreamProcessor(amqpClient);
        byte numThreads = 8;

        // Start a fixed pool of worker threads
        for (int i = 0; i < numThreads; ++i) {
            StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread();
            this.workerThreads.add(worker);
            worker.start();
        }

        queueList.add(this.queue);
        this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList);
    }
}

public class StreamProcessor {
    private static final Logger logger = Logger.getLogger(StreamProcessor.class.getName());
    private AmqpClient amqpClient;

    public StreamProcessor(AmqpClient amqpClient) {
        this.amqpClient = amqpClient;
    }

    public void send(String data) throws Exception {
        this.amqpClient.send(data.getBytes());
        logger.debug("Sent event " + data + " to AMQP");
    }
}

[Diagram: Network Input, Linked List Queues (11 queues), Network Output]

receivers needed for prod

Why ElasticSearch
• Open-source search engine built on Lucene
• Highly-distributed storage engine
• Clusters nicely
• Built-in aggregations like whoa

Aggregations
• Geographic Boxing & Radius Grouping
• Time-Series
• Histograms
• Min/Max/Avg Statistical Evaluation
• MapReduce (coming soon!)

• How many users viewed my post on an Android tablet in portrait mode within 10 miles of Denton, TX?
• What is the average time from start of page-load to first click for readers on Linux desktops between 3am and 5am?
• Given two sets of link texts, which has the higher CTR for a randomized sample of readers on tablet devices?
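The first question above might translate into a search request body along these lines. This is a sketch under assumptions: every field name (`post_id`, `device.os`, `device.type`, `device.orientation`, `geo.location`, `user_id`) and the post ID are hypothetical, the Denton coordinates are approximate, and the talk does not show the actual index mapping. The `bool` filter, `geo_distance` query and `cardinality` aggregation are standard Elasticsearch query DSL constructs.

```javascript
// Hypothetical request body; all field names are assumptions.
const searchBody = {
  size: 0, // only the aggregation result is needed, not the matching documents
  query: {
    bool: {
      filter: [
        { term: { post_id: "example-post" } },
        { term: { "device.os": "android" } },
        { term: { "device.type": "tablet" } },
        { term: { "device.orientation": "portrait" } },
        {
          geo_distance: {
            distance: "10mi",
            // Approximate coordinates for Denton, TX
            "geo.location": { lat: 33.21, lon: -97.13 },
          },
        },
      ],
    },
  },
  aggs: {
    // cardinality gives an approximate count of distinct users
    unique_users: { cardinality: { field: "user_id" } },
  },
};

console.log(JSON.stringify(searchBody, null, 2));
```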

Browser to Browser in < 5 seconds

But wait… Is that “real-time”?

Real-Time for Real
• Live analysis of data as it is collected
• Active visualization of very short-term trends in data

Potential Problems
• Small sample sizes for new datasets / small analysis windows
• Data volumes too high for end-user comprehension
• Data volumes too high for end-user hardware/network connections
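The small-sample problem matters most when comparing headline variants minutes after publication. One common way to judge whether a CTR difference is more than noise is a two-proportion z-test. The sketch below uses that standard statistic as an illustration and does not claim it is what the talk's system does. The function name `ctrZScore` and the input shape `{ clicks, views }` are invented for the example.

```javascript
// Two-proportion z-score for the difference in click-through rate.
function ctrZScore(a, b) {
  const rateA = a.clicks / a.views;
  const rateB = b.clicks / b.views;
  const pooled = (a.clicks + b.clicks) / (a.views + b.views);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / a.views + 1 / b.views)
  );
  // With no variation at all there is nothing to distinguish.
  return standardError === 0 ? 0 : (rateA - rateB) / standardError;
}

// 30 clicks in 100 views versus 10 clicks in 100 views.
console.log(ctrZScore({ clicks: 30, views: 100 }, { clicks: 10, views: 100 }).toFixed(2));
```

A conventional rule of thumb treats an absolute z-score above about 1.96 as significant at the 5% level, and with only a handful of views per variant the score rarely gets that far.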

[Diagram: Collectors, Processor/Router, RabbitMQ, ElasticSearch, Websocket Server]

D3JS • Open-source data visualization library written in JavaScript

By the way…
$x_n = x + r \cos(2\pi n / v)$
$y_n = y + r \sin(2\pi n / v)$
where n = ordinal of the vertex, v = number of vertices, r = radius (center to vertex), and (x, y) = center of the polygon
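The vertex formula can be sketched in code as follows. It uses cosine for the x coordinate and sine for the y coordinate, which is the standard parametrization of points on a circle. The function name `polygonVertices` is invented for the example, and r is taken to be the distance from the center to each vertex.

```javascript
// Vertices of a regular polygon with v vertices, centered at (cx, cy).
function polygonVertices(cx, cy, r, v) {
  const points = [];
  for (let n = 0; n < v; n++) {
    const angle = (2 * Math.PI * n) / v;
    points.push([cx + r * Math.cos(angle), cy + r * Math.sin(angle)]);
  }
  return points;
}

// A hexagon of radius 50 centered at (100, 100).
const hexagon = polygonVertices(100, 100, 50, 6);
console.log(hexagon.map(([px, py]) => px.toFixed(1) + "," + py.toFixed(1)).join(" "));
```

The joined string is in the form an SVG `points` attribute expects, which is one way such vertices could be handed to a drawing library.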

var views = 0;
var socket = io();
socket.on('pageview', function(point) {
  views++;
});

function tick() {
  data.push(views);
  views = 0;
  path
    .attr("d", line)
    .attr("transform", null)
    .transition()
    .duration(500)
    .ease("linear")
    .attr("transform", "translate(" + x(0) + ",0)")
    .each("end", tick);
  data.shift();
}

tick();

Pageview Heartbeat

[Diagram: Receiver Layer, Receiver, Buffer/Transit, Processing & Routing Layer, Processing & Routing, Transit, Storage Engine, End-User Consumable Queues]
Layers are
• Geographically decoupled
• Capable of independent scaling
• Fully encapsulated with no cross-layer dependencies

Interfaces
• Input Stream (Java)
• Routing (node.js)
• Filtering (node.js)
• Aggregation (PHP)
• Visualization (D3JS)
• MV Testing (PHP)

Languages & Tools
• Rabbit MQ
• Hadoop
• Elastic Search
• PHP
• JS (node)
• JS (D3)
• Java
• MySQL

Where are we Now?
• It took 8 months to build a rock-solid data pipeline
• Entry points from:
  • User data collectors
  • Application code

That was the easy part.

What’s next?

• Live debugging & runtime profiling
• Embeddable visualizations
• On-demand stream filters

