Drinking from the Firehose

Samantha Quiñones

May 15, 2015

Transcript

  1. “How would you let editors test how well different headlines perform for the same piece of content?”
  2. • Are requests being handled efficiently?
    • Are adequate resources available at each layer of the stack?
    • Is cache being utilized in an efficient manner?
  3. • How long does it take for a page to be “ready” for the user?
    • Is the page responsive to user input?
    • Does the page require an excessive number of requests to complete?
  4. Delay       Perception
    <100ms      Instantaneous
    <300ms      Perceptible Delay
    <1000ms     System “Working”
    <10000ms    System “Slow”
    >10000ms    System “Down”
    Source: O’Reilly Media
  5. Effects of Latency
    Amazon Sales: -1% sales per 100ms increased latency
    [Chart: Sales (USD), 0 to 2000, plotted against Seconds of Latency, 0 to 15]
    S = S - (msL*1%)
    Linden, G. (2006, December 3). Make Data Useful. Data Mining (CS345). Lecture conducted from Stanford University, Stanford, CA.
  6. Effects of Latency
    Google Search Experiment (4-6 Weeks)
    [Chart: % Fewer Searches per Day, 0 to -1, plotted against Milliseconds of Additional Latency, 50 to 400]
    Schurman, E., Brutlag, J. (2009, June 23). The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. O'Reilly Velocity. Lecture conducted from O'Reilly Media, San Jose, CA.
  7. Multivariate Testing
    • Sort all users into groups
    • 1 control group receives unaltered content
    • 1 or more groups receive altered content
    • Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
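
The grouping and measurement steps above can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the function names, the event shape (`userId`, `type`), and the FNV-1a-style hash are all assumptions chosen to keep the example self-contained.

```javascript
// Hypothetical sketch: none of these names come from the talk.

// FNV-1a-style 32-bit hash, used so that a given user always lands in the same group.
function hashUserId(userId) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Deterministically assign a user to one of groupCount groups.
// By convention in this sketch, group 0 is the control group.
function assignGroup(userId, groupCount) {
  return hashUserId(String(userId)) % groupCount;
}

// Compute click-through rate (clicks / impressions) for each group.
function ctrByGroup(events, groupCount) {
  const stats = Array.from({ length: groupCount }, () => ({ impressions: 0, clicks: 0 }));
  for (const event of events) {
    const group = stats[assignGroup(event.userId, groupCount)];
    if (event.type === "impression") group.impressions++;
    if (event.type === "click") group.clicks++;
  }
  return stats.map((s) => (s.impressions === 0 ? 0 : s.clicks / s.impressions));
}
```

Hashing the user ID, rather than picking a group at random on each request, keeps a returning user in the same group for the whole test.
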
  8. Hadoop
    • Framework for distributed storage and processing of data
    • Designed to make managing very large datasets simple with…
        • Well-documented, open-source, common libraries
        • Optimizing for commodity hardware
  9. Hadoop Distributed File System
    • Modeled after Google File System
    • Stores logical files across multiple systems
    • Rack-aware
    • No read-write concurrency
  10. Map
      <?php
      $document = "I'm a little teapot short and stout here is my handle here is my spout";

      /**
       * Outputs: [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
       */
      function map($target_word, $document) {
          return array_map(
              function ($word) use ($target_word) {
                  if ($word === $target_word) {
                      return 1;
                  }
                  return 0;
              },
              preg_split('/\s+/', $document)
          );
      }

      echo json_encode(map("is", $document)) . PHP_EOL;
  11. Reduce
      <?php
      $data = [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0];

      /**
       * Outputs: 2
       */
      function reduce($data) {
          return array_reduce(
              $data,
              function ($count, $value) {
                  return $count + $value;
              }
          );
      }

      echo reduce($data) . PHP_EOL;
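
The two slides above run map and reduce over a single string. As a rough sketch of why the split matters on a cluster, the version below maps each chunk of input separately and then reduces the partial counts. It runs in one process, and the chunk array only stands in for blocks stored on different nodes; the function names are illustrative and not taken from the talk or from Hadoop.

```javascript
// Map side: emit 1 for every word in the chunk that matches the target, else 0.
function mapChunk(targetWord, chunk) {
  return chunk
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => (word === targetWord ? 1 : 0));
}

// Reduce side: sum a list of numbers.
function reduceCounts(counts) {
  return counts.reduce((total, value) => total + value, 0);
}

// Each chunk is mapped and reduced on its own, then the partial sums are reduced again.
function countWord(targetWord, chunks) {
  const partials = chunks.map((chunk) => reduceCounts(mapChunk(targetWord, chunk)));
  return reduceCounts(partials);
}

// Two chunks standing in for blocks held by two different nodes.
const chunks = [
  "I'm a little teapot short and stout",
  "here is my handle here is my spout",
];

console.log(countWord("is", chunks)); // 2
```

Because addition is associative, the partial sums can be computed wherever the data lives and combined afterwards, which is the property that lets this kind of job be distributed.
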
  12. Hadoop Limitations
    • Hadoop jobs are batched and take significant time to run
    • Data may not be available for 1+ hours after collection
  13. “How would you let editors test how well different headlines perform for the same piece of content?”
  14. [Architecture diagram. Components shown: Collectors (many), Rabbit MQ Farm, Streamers (3)]
  15. Version 1 (PoC)
    [Architecture diagram. Components shown: Collectors (9), Streamer, Receiver, StatsD Cluster, ElasticSearch]
  16. this.visit = function(record) {
          if (record.userAgent) {
              var parser = new UAParser();
              parser.setUA(record.userAgent);
              var user_agent = parser.getResult();
              return { user_agent: user_agent };
          }
          return {};
      };
  17. Findings
    • Max throughput per collector: 300 events/second
    • ~70 receivers needed for prod
    • StatsD key format creates data redundancy and reduces data richness
  18. Version 1 (PoC)
    [Architecture diagram. Components shown: Collectors (9), Streamer, Receiver, StatsD Cluster, ElasticSearch]
  19. Transits & Terminals
    • Transits - Short-term, in-memory, volatile storage for data with a life-span up to a few seconds
    • Terminals - Destinations for data that either store, abandon, or transmit
  20. An efficient real-time data pathway consists of a network of transits and terminals, where no single node acts as both a transit and a terminal at the same time.
  21. StatsD
    • Acts as a transit, taking data and passing it along…
    • BUT
    • Acts as a terminal, aggregating keys in memory and becoming a transit after a time or buffer threshold.
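
The dual role described above can be illustrated with a small aggregator. This is a hypothetical sketch and not StatsD's actual implementation: the class name, the key-count threshold, and the `onFlush` callback are assumptions made for the example.

```javascript
// Hypothetical aggregator, written only to illustrate the transit/terminal mix.
class CountingAggregator {
  constructor(flushThreshold, onFlush) {
    this.flushThreshold = flushThreshold; // number of distinct keys held before flushing
    this.onFlush = onFlush;               // downstream consumer of aggregated totals
    this.counts = new Map();
  }

  // Below the threshold, data stops here: the node behaves like a terminal.
  increment(key) {
    this.counts.set(key, (this.counts.get(key) || 0) + 1);
    if (this.counts.size >= this.flushThreshold) {
      this.flush();
    }
  }

  // On flush, the node behaves like a transit and passes totals downstream.
  flush() {
    const snapshot = Object.fromEntries(this.counts);
    this.counts.clear();
    this.onFlush(snapshot);
  }
}

const flushed = [];
const aggregator = new CountingAggregator(2, (snapshot) => flushed.push(snapshot));
aggregator.increment("pageview");
aggregator.increment("pageview"); // still held in memory, nothing sent yet
aggregator.increment("click");    // second distinct key triggers a flush
console.log(flushed);
```

Note that what leaves the node is a set of per-key totals, not the original events, so any detail that was not encoded in the key is gone by the time the data moves on.
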
  22. Version 2
    • Eliminated eventing and improved performance
    • Replaced StatsD with RabbitMQ
    • Data records are kept together
    • No longer works with Kibana (sadface)
  23. RabbitMQ
    • Lightweight message broker
    • Allows complex message routing without application-level logic
    • Can buffer 90-120 seconds of traffic
  24. // First loop: consume the buffer one element at a time with shift()
      while (buffer.length > 0) {
          var char = buffer.shift();
          if ('\n' === char) {
              queue.push(new Buffer(outbuf.join('')));
              continue;
          }
          outbuf.push(char);
      }

      // Second loop: walk a copy of the buffer by index instead of shifting
      var i = 0;
      var tBuf = buffer.slice();
      while (i < buffer.length) {
          var char = tBuf[i++];
          if ('\n' === char) {
              queue.push(new Buffer(outbuf.join('')));
          }
          outbuf.push(char);
      }
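
Both loops on this slide split an incoming stream into newline-delimited records one character at a time. For comparison, here is a sketch of the same framing done with `Buffer.indexOf`, which searches for the delimiter without touching each byte in JavaScript. It is not the code from the talk: `createLineSplitter` and `onRecord` are hypothetical names, and the sketch assumes the input arrives as Node.js `Buffer` chunks.

```javascript
// Hypothetical helper: splits a byte stream into newline-delimited records.
function createLineSplitter(onRecord) {
  let pending = Buffer.alloc(0); // bytes received since the last newline

  return function push(chunk) {
    let data = Buffer.concat([pending, chunk]);
    let newlineAt = data.indexOf(0x0a); // 0x0a is "\n"
    while (newlineAt !== -1) {
      onRecord(data.subarray(0, newlineAt)); // one complete record, newline excluded
      data = data.subarray(newlineAt + 1);
      newlineAt = data.indexOf(0x0a);
    }
    pending = data; // keep the partial record until the next chunk arrives
  };
}

const records = [];
const push = createLineSplitter((record) => records.push(record.toString()));
push(Buffer.from("a\nbc")); // "a" is complete, "bc" is held back
push(Buffer.from("d\n"));   // completes "bcd"
console.log(records);       // [ 'a', 'bcd' ]
```
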
  25. Findings
    • Max throughput per collector: 600 events/second
    • ~35 receivers needed for prod
    • Micro-optimized code became increasingly brittle and hard to maintain as custom logic was needed for every edge case
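
The two sets of findings are consistent with each other: 300 events/second across ~70 receivers and 600 events/second across ~35 receivers both work out to about 21,000 events/second. That total is an inference from the slide figures, not a number stated in the talk. The arithmetic, as a small sketch with a hypothetical helper name:

```javascript
// Number of receivers needed to absorb a target load, rounding up.
function receiversNeeded(totalEventsPerSecond, eventsPerSecondPerReceiver) {
  return Math.ceil(totalEventsPerSecond / eventsPerSecondPerReceiver);
}

// Load implied by the Version 1 figures (inferred, not stated in the talk).
const impliedLoad = 300 * 70; // 21000 events/second

console.log(receiversNeeded(impliedLoad, 300)); // 70
console.log(receiversNeeded(impliedLoad, 600)); // 35
```
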
  26. Need to Get Serious
    • Very high throughput
    • Multi-threaded worker pool with large memory buffers
    • Static & dynamic optimization
    • Efficient memory management for extremely volatile in-memory data
    • Eliminate any processing overhead. Receiver must be a Transit
  27. And also…
    • Not GoLang (because no one on the team is familiar with it)
    • Not Rust (because no one on the team wants to be familiar with it)
    • Not C (because C)
  28. Why Java?
    • Solid static & dynamic analysis and optimizations in the S2BC & JIT compilers
    • Clients for the stuff I needed to talk to
    • Well-supported within AOL & within my team
  29. Version 3
    [Architecture diagram. Components shown: Collectors (9), Streamer, Receiver, ElasticSearch, RabbitMQ, Processor/Router]
  30. public class StreamReader {
          private static final Logger logger = Logger.getLogger(StreamReader.class.getName());
          private StreamerQueue queue = new StreamerQueue();
          private StreamProcessor processor;
          private List<StreamReader.BeaconWorkerThread> workerThreads = new ArrayList();
          private RtStreamerClient client;

          public StreamReader(String streamerURI, AmqpClient amqpClient, String appID,
                              String tpcFltrs, String rfFltrs, String bt) {
              ArrayList queueList = new ArrayList();
              this.processor = new StreamProcessor(amqpClient);
              byte numThreads = 8;

              for(int i = 0; i < numThreads; ++i) {
                  StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread();
                  this.workerThreads.add(worker);
                  worker.start();
              }

              queueList.add(this.queue);
              this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList);
          }
      }
  31. public class StreamProcessor {
          private static final Logger logger = Logger.getLogger(StreamProcessor.class.getName());
          private AmqpClient amqpClient;

          public StreamProcessor(AmqpClient amqpClient) {
              this.amqpClient = amqpClient;
          }

          public void send(String data) throws Exception {
              this.amqpClient.send(data.getBytes());
              logger.debug("Sent event " + data + " to AMQP");
          }
      }
  32. Linked List Queues
    [Diagram. Components shown: Network Input, Queues (11), Network Output]
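
The slide only labels the structure as linked list queues and shows no code. As a generic illustration of why a linked list suits a buffer between network input and network output, here is a minimal FIFO queue with constant-time enqueue and dequeue. The class name is hypothetical and this is not the `StreamerQueue` from the talk.

```javascript
// Minimal singly linked FIFO queue with O(1) enqueue and dequeue.
class LinkedQueue {
  constructor() {
    this.head = null; // next node to be dequeued
    this.tail = null; // most recently enqueued node
    this.size = 0;
  }

  // Append a value at the tail.
  enqueue(value) {
    const node = { value, next: null };
    if (this.tail) {
      this.tail.next = node;
    } else {
      this.head = node;
    }
    this.tail = node;
    this.size++;
  }

  // Remove and return the value at the head, or undefined when empty.
  dequeue() {
    if (!this.head) return undefined;
    const { value } = this.head;
    this.head = this.head.next;
    if (!this.head) this.tail = null;
    this.size--;
    return value;
  }
}
```

Unlike the array `shift()` approach used in the earlier receiver loop, removing the head of a linked list does not require moving the remaining elements.
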
  33. Why ElasticSearch
    • Open-source Lucene search engine
    • Highly-distributed storage engine
    • Clusters nicely
    • Built-in aggregations like whoa
  34. Aggregations
    • Geographic Boxing & Radius Grouping
    • Time-Series
    • Histograms
    • Min/Max/Avg Statistical Evaluation
    • MapReduce (coming soon!)
  35. • How many users viewed my post on an Android tablet in portrait mode within 10 miles of Denton, TX?
    • What is the average time from start of page-load to first click for readers on Linux desktops between 3am and 5am?
    • Given two sets of link texts, which has the higher CTR for a randomized sample of readers on tablet devices?
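
As a sketch of how the first question might be expressed, the function below builds an Elasticsearch query body that combines term filters with a `geo_distance` filter. The `location.geopoint` field appears elsewhere in the deck, but every other field name (`postId`, `device.os`, `device.type`, `device.orientation`), the post ID, and the approximate coordinates for Denton, TX are assumptions. The `bool`/`filter` form shown here is the syntax of later Elasticsearch releases and may differ from what was in use at the time of the talk.

```javascript
// Build a query body for "views of a post on Android tablets in portrait mode near a point".
// Field names are hypothetical and would need to match the real index mapping.
function buildNearbyViewsQuery(postId, center, distance) {
  return {
    query: {
      bool: {
        filter: [
          { term: { postId } },
          { term: { "device.os": "android" } },
          { term: { "device.type": "tablet" } },
          { term: { "device.orientation": "portrait" } },
          { geo_distance: { distance, "location.geopoint": center } },
        ],
      },
    },
  };
}

// Approximate coordinates for Denton, TX.
const query = buildNearbyViewsQuery("post-123", { lat: 33.21, lon: -97.13 }, "10mi");
console.log(JSON.stringify(query, null, 2));
```

Counting distinct users rather than matching events would additionally need an aggregation on a user identifier field, which is left out of this sketch.
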
  36. Real-Time for Real
    • Live analysis of data as it is collected
    • Active visualization of very short-term trends in data
  37. Potential Problems
    • Small sample sizes for new datasets / small analysis windows
    • Data volumes too high for end-user comprehension
    • Data volumes too high for end-user hardware/network connections
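
The talk does not say how the volume problems were handled. One possible mitigation, shown here purely as an assumption, is to cap how many events are forwarded to each end user per time window and drop the rest. The function name and parameters are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```javascript
// Hypothetical sampler: forwards at most maxPerWindow events per window and drops the rest.
function createSampler(maxPerWindow, windowMs, forward, now = Date.now) {
  let windowStart = now();
  let sent = 0;

  return function offer(event) {
    const currentTime = now();
    if (currentTime - windowStart >= windowMs) {
      windowStart = currentTime; // start a new window
      sent = 0;
    }
    if (sent >= maxPerWindow) {
      return false; // over budget for this window: drop the event
    }
    sent++;
    forward(event);
    return true;
  };
}

// Usage with a fake clock: two events per 1000 ms window.
let fakeTime = 0;
const delivered = [];
const offer = createSampler(2, 1000, (event) => delivered.push(event), () => fakeTime);
offer("a");
offer("b");
offer("c");      // dropped, the window budget is used up
fakeTime = 1000; // a new window begins
offer("d");
console.log(delivered); // [ 'a', 'b', 'd' ]
```
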
  38. Version 4
    [Architecture diagram. Components shown: Collectors (9), Streamer, Receiver, ElasticSearch, RabbitMQ, Processor/Router, Websocket Server]
  39. function plot(point) {
          var points = svg.selectAll("circle")
              .data([point], function(d) {
                  return d.id;
              });
          points.enter()
              .append("circle")
              .attr("cx", function (d) {
                  return projection([parseInt(d.location.geopoint.lon), parseInt(d.location.geopoint.lat)])[0]
              })
              .attr("cy", function (d) {
                  return projection([parseInt(d.location.geopoint.lon), parseInt(d.location.geopoint.lat)])[1]
              })
              .attr("r", function (d) {
                  return 1;
              })
              .style('fill', 'red')
              .style('fill-opacity', 1)
              .style('stroke', 'red')
              .style('stroke-width', '0.5px')
              .style('stroke-opacity', 1)
              .transition()
              .duration(10000)
              .style('fill-opacity', 0)
              .style('stroke-opacity', 0)
              .attr('r', '32px').remove();
      }

      var buffer = [];
      var socket = io();
      socket.on('geopoint', function(point) {
          if (point.location.geopoint) {
              plot(point);
          }
      });
  40. By the way…
    xn = x + (r * COS(2π * n / v))
    yn = y + (r * SIN(2π * n / v))
    where n = ordinal of vertex, v = number of vertices, r = radius, and x,y = center of the polygon
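
The vertex formula can be turned into a short function. The standard parametrization of a regular polygon uses cosine for the x coordinate and sine for the y coordinate, and that is what this sketch implements. The function name is hypothetical, and numbering the vertices from 0 is an assumption.

```javascript
// Vertices of a regular polygon centered at (x, y) with radius r and v vertices.
function polygonVertices(x, y, r, v) {
  const vertices = [];
  for (let n = 0; n < v; n++) {
    const angle = (2 * Math.PI * n) / v;
    vertices.push([x + r * Math.cos(angle), y + r * Math.sin(angle)]);
  }
  return vertices;
}

// A square of radius 1 around the origin: roughly (1,0), (0,1), (-1,0), (0,-1).
console.log(polygonVertices(0, 0, 1, 4));
```

With a large enough `v`, the same function yields a polygon that approximates a circle of radius `r`.
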
  41. var views = 0;
      var socket = io();
      socket.on('pageview', function(point) {
          views++;
      });

      function tick() {
          data.push(views);
          views = 0;
          path
              .attr("d", line)
              .attr("transform", null)
              .transition()
              .duration(500)
              .ease("linear")
              .attr("transform", "translate(" + x(0) + ",0)")
              .each("end", tick);
          data.shift();
      }

      tick();
  42. [Diagram labels: Receiver Layer, Receiver, Buffer/Transit, Processing & Routing Layer, Processing & Routing, Transit, Storage Engine, End-User Consumable Queues]
    Layers are
    • Geographically decoupled
    • Capable of independent scaling
    • Fully encapsulated with no cross-layer dependencies
  43. Where are we Now?
    • It took 8 months to build a rock-solid data pipeline
    • Entry points from:
        • User data collectors
        • Application code
  44. ???