Drinking from the Firehose

Samantha Quiñones

May 15, 2015

Transcript

  1. Drinking from the Firehose
    Real-Time Metrics
    Samantha Quiñones

  2. @ieatkillerbees
    http://samanthaquinones.com

  3. “How would you let editors test how well
    different headlines perform for the same
    piece of content?”

  4. Areas of Interest
    • Quantified Application Performance
    • Perceived Application Performance (User Experience)
    • User Behavior

  5. Quantified Application
    Performance

  6. • Are requests being handled efficiently?
    • Are adequate resources available at each layer of the stack?
    • Is the cache being used efficiently?

  7. CPU Time Per Request

  8. Stack Performance

  9. Perceived User Performance

  10. • How long does it take for a page to be “ready” for the user?
    • Is the page responsive to user input?
    • Does the page require an excessive number of requests to complete?

  11. Delay Perception
    <100ms Instantaneous
    <300ms Perceptible Delay
    <1000ms System “Working”
    <10000ms System “Slow”
    >10000ms System “Down”
    Source: O’Reilly Media
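
The bands on this slide can be read as a lookup from a measured delay to a perceived state. A minimal sketch, assuming each "<N ms" band excludes N itself and that exactly 10000 ms belongs to the last band (the slide does not say how boundary values are classified); the function name `perceivedDelay` is illustrative only:

```javascript
// Maps a delay in milliseconds to the perception band listed on the slide.
// Assumption: bands are half-open, so 100 ms is already "Perceptible Delay".
function perceivedDelay(ms) {
  if (ms < 100) return "Instantaneous";
  if (ms < 300) return "Perceptible Delay";
  if (ms < 1000) return 'System "Working"';
  if (ms < 10000) return 'System "Slow"';
  return 'System "Down"';
}
```

For example, a page that becomes ready after 250 ms would land in the "Perceptible Delay" band under this reading.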

  12. Effects of Latency
    Amazon Sales: -1% sales per 100ms increased latency
    [Chart: Sales (USD), 0 to 2000, plotted against Seconds of Latency, 0 to 15]
    S = S - (msL*1%)
    Linden, G. (2006, December 3). Make Data Useful. Data Mining (CS345). Lecture conducted from Stanford University, Stanford, CA.
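
One way to read the rule of thumb on this slide is as a linear projection: each additional 100 ms removes 1% of baseline sales. The sketch below assumes that linear reading; the function name `projectedSales`, the clamp at zero, and any baseline passed in are illustrative assumptions, not figures from the cited lecture.

```javascript
// Projects sales under added latency, assuming a linear loss of
// 1% of the baseline per 100 ms (an interpretation of the slide's rule).
function projectedSales(baselineUsd, addedLatencyMs) {
  const lossFraction = 0.01 * (addedLatencyMs / 100);
  // Clamp at zero so very large latencies do not produce negative sales.
  return Math.max(0, baselineUsd * (1 - lossFraction));
}
```

Under this reading, a baseline of 2000 USD reaches zero at 10 seconds of added latency, since 100 steps of 100 ms each remove 1%.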

  13. Effects of Latency
    Google Search Experiment (4-6 Weeks)
    [Chart: % Fewer Searches per Day, 0 to -1, plotted against Milliseconds of Additional Latency: 50, 100, 200, 400]
    Schurman, E., Brutlag, J. (2009, June 23). The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. O'Reilly Velocity. Lecture conducted from O'Reilly Media, San Jose, CA.

  14. Interpreting Web
    Profiler Information

  15. User Behavior

  16. Measuring User Behavior
    • Application path
    • Use patterns
    • Mouse & attention tracking

  17. Multivariate Testing
    • Sort all users into groups
    • 1 control group receives unaltered content
    • 1 or more groups receive altered content
    • Measure behavioral statistics (CTR, abandon rate, time on page, scroll depth) for each group
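
The steps above can be sketched as a stable group assignment plus a per-group click-through rate. This is a minimal illustration, not the system described in the talk: the function names, the FNV-1a-style hash, the convention that group 0 is the control, and the shape of the impression records are all assumptions.

```javascript
// Deterministically assigns a user to one of `groupCount` groups, so the
// same user always sees the same variant. Group 0 is treated as the control.
function assignGroup(userId, groupCount) {
  // FNV-1a-style 32-bit hash over the UTF-16 code units of the id.
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % groupCount;
}

// Computes click-through rate per group from { group, clicked } impressions.
function ctrByGroup(impressions, groupCount) {
  const shown = new Array(groupCount).fill(0);
  const clicks = new Array(groupCount).fill(0);
  for (const { group, clicked } of impressions) {
    shown[group] += 1;
    if (clicked) clicks[group] += 1;
  }
  // A group with no impressions reports a CTR of 0 rather than NaN.
  return shown.map((n, g) => (n === 0 ? 0 : clicks[g] / n));
}
```

Hashing the user id, rather than drawing a random number per request, keeps a returning reader in the same group without storing any assignment.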

  18. Managing Big Data

  19. How big is big?

  20. 1,300,000,000,000
    events per
    DAY

  21. ~40 datapoints
    per
    EVENT

  22. ~15,000 records
    per
    SECOND

  23. Containing ~600,000 datapoints

  24. At a rate up to 25 megabytes / second
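
The per-second figures on these slides can be cross-checked with a few multiplications. The sketch below uses only the numbers quoted in the deck; the variable names are illustrative, and decimal megabytes are assumed. Note that 15,000 records per second extrapolates to roughly 1.3 billion records per day, so the daily figure on slide 20 lines up only if an "event" there is counted differently from a "record", which the slides do not state.

```javascript
const recordsPerSecond = 15000;          // slide 22
const datapointsPerRecord = 40;          // slide 21 (quoted per event)
const bytesPerSecond = 25 * 1000 * 1000; // slide 24, assuming decimal megabytes

// 15,000 records x 40 datapoints = 600,000 datapoints per second (slide 23)
const datapointsPerSecond = recordsPerSecond * datapointsPerRecord;

// 15,000 records x 86,400 seconds = 1,296,000,000 records per day
const recordsPerDay = recordsPerSecond * 60 * 60 * 24;

// 25 MB/s spread over 15,000 records is roughly 1.7 KB per record
const avgBytesPerRecord = bytesPerSecond / recordsPerSecond;
```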

  25. [Diagram: twelve Collector nodes connected to a Rabbit MQ Farm]

  26. [Diagram: Rabbit MQ Farm, Hadoop]

  27. Hadoop
    • Framework for distributed storage and processing of data
    • Designed to make managing very large datasets simple with…
    • Well-documented, open-source, common libraries
    • Optimization for commodity hardware

  28. Hadoop Distributed File System
    • Modeled after Google File System
    • Stores logical files across multiple systems
    • Rack-aware
    • No read-write concurrency

  29. MapReduce
    • Framework for massively parallel data processing tasks

  30. Map
    $document = "I'm a little teapot short and stout here is my handle here is my spout";

    /**
     * Outputs: [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
     */
    function map($target_word, $document) {
        return array_map(
            function ($word) use ($target_word) {
                if ($word === $target_word) {
                    return 1;
                }
                return 0;
            },
            preg_split('/\s+/', $document)
        );
    }

    echo json_encode(map("is", $document)) . PHP_EOL;

  31. Reduce
    $data = [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0];

    /**
     * Outputs: 2
     */
    function reduce($data) {
        return array_reduce(
            $data,
            function ($count, $value) {
                return $count + $value;
            },
            0 // initial count, so an empty array yields 0 rather than null
        );
    }

    echo reduce($data) . PHP_EOL;
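
The reason this split parallelizes is that each map call depends only on its own input, so separate pieces of the data can be processed on separate machines before a single reduce combines the results. A minimal sketch in JavaScript, assuming the teapot sentence is split into two hypothetical shards; the function names are illustrative, and the map step here counts within its shard instead of emitting one flag per word:

```javascript
// Map step: turn one shard of text into a partial count for the target word.
function mapShard(targetWord, shardText) {
  return shardText.split(/\s+/).filter((word) => word === targetWord).length;
}

// Reduce step: combine the partial counts from all shards into one total.
function reducePartials(partialCounts) {
  return partialCounts.reduce((total, n) => total + n, 0);
}

// The teapot sentence, split into two hypothetical shards.
const shards = [
  "I'm a little teapot short and stout",
  "here is my handle here is my spout",
];

// Each mapShard call is independent of the others, so they could run anywhere.
const partials = shards.map((shard) => mapShard("is", shard));
const total = reducePartials(partials);
```

Here the first shard contributes 0 and the second contributes 2, giving the same total of 2 as the single-document version.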

  32. Hadoop Limitations
    • Hadoop jobs are batched and take significant time to run
    • Data may not be available for 1+ hours after collection

  33. “How would you let editors test how well
    different headlines perform for the same
    piece of content?”

  34. Consider Shelf-life
    • Most articles are relevant for < 24 hours
    • Interest peaks < 3 hours

  35. Real-Time Pipelines

  36. [Diagram: twelve Collectors and a Rabbit MQ Farm, plus three Streamers, each shown with three Collectors]

  37. Version 1 (PoC)
    [Diagram: Collectors, Streamer, Receiver, StatsD Cluster, ElasticSearch]

  38. this.visit = function(record) {
        if (record.userAgent) {
            var parser = new UAParser();
            parser.setUA(record.userAgent);
            var user_agent = parser.getResult();
            return { user_agent: user_agent }
        }
        return {};
    };

  39. Findings
    • Max throughput per collector: 300 events/second
    • ~70 receivers needed for prod
    • StatsD key format creates data redundancy and reduced data richness
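
The last finding can be illustrated by flattening one event into StatsD counters. StatsD's line protocol carries a single `name:value|type` metric per line, so each dimension of a record ends up as its own key. The key naming scheme and the sample record fields below are hypothetical, chosen only to show the effect, and are not taken from the talk.

```javascript
// Flattens one event record into StatsD counter lines ("<key>:1|c").
// Assumption: one counter per field, keyed as "<prefix>.<field>.<value>".
function toStatsdLines(prefix, record) {
  return Object.keys(record)
    .sort()
    .map((field) => {
      // "." separates key namespaces and ":" / "|" delimit the line,
      // so those characters (and whitespace) are replaced inside values.
      const value = String(record[field]).replace(/[.:|\s]/g, "_");
      return `${prefix}.${field}.${value}:1|c`;
    });
}

const lines = toStatsdLines("pageview", {
  device: "tablet",
  os: "android",
  orientation: "portrait",
});
```

Once the three counters are aggregated separately, the fact that they came from the same pageview is gone. Answering a question about Android tablets in portrait mode would then need a dedicated combined key for every combination of interest, which is one plausible reading of "data redundancy and reduced data richness".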

  40. Version 1 (PoC)
    [Diagram: Collectors, Streamer, Receiver, StatsD Cluster, ElasticSearch]

  41. Transits & Terminals
    • Transits - Short-term, in-memory, volatile storage for data with a life-span of up to a few seconds
    • Terminals - Destinations for data that either store, abandon, or transmit it

  42. An efficient real-time data pathway consists
    of a network of transits and terminals, where
    no single node acts as both a transit and a
    terminal at the same time.

  43. StatsD
    • Acts as a transit, taking data and passing it along…
    • BUT
    • Acts as a terminal, aggregating keys in memory and becoming a transit
    after a time or buffer threshold.
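
The dual role described on this slide can be sketched with a small in-memory aggregator. This is not StatsD's actual implementation; the class name, the key-count threshold, and the `forward` callback are assumptions made for illustration.

```javascript
// A StatsD-like aggregator: it holds counts in memory (terminal behavior)
// and only passes them on once a buffer threshold is reached (transit behavior).
class ThresholdAggregator {
  constructor(maxKeys, forward) {
    this.maxKeys = maxKeys; // flush once this many distinct keys are buffered
    this.forward = forward; // callback that receives each flushed batch
    this.counts = new Map();
  }

  increment(key) {
    this.counts.set(key, (this.counts.get(key) || 0) + 1);
    if (this.counts.size >= this.maxKeys) this.flush();
  }

  flush() {
    if (this.counts.size === 0) return;
    this.forward(Object.fromEntries(this.counts));
    this.counts.clear();
  }
}

const flushed = [];
const agg = new ThresholdAggregator(2, (batch) => flushed.push(batch));
agg.increment("pageview");
agg.increment("pageview");
agg.increment("click"); // the second distinct key triggers a flush
```

Between flushes the node is the only holder of the data, so it behaves as a terminal; at the flush it behaves as a transit. That is the mixed role the preceding rule advises against.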

  44. Version 2
    [Diagram: Collectors, Streamer, Receiver, RabbitMQ, ElasticSearch]

  45. Version 2
    • Eliminated eventing and improved performance
    • Replaced StatsD with RabbitMQ
    • Data records are kept together
    • No longer works with Kibana (sadface)

  46. RabbitMQ
    • Lightweight message broker
    • Allows complex message routing without application-level logic
    • Can buffer 90-120 seconds of traffic
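
Routing "without application-level logic" means the broker, not the consumer, decides which queues receive a message. With an AMQP topic exchange, a queue is bound with a pattern in which `*` stands for exactly one dot-separated word and `#` for zero or more words. The sketch below models that matching rule in plain JavaScript to show the idea; it is not RabbitMQ code, and the routing keys used with it are hypothetical.

```javascript
// Returns true if an AMQP-style topic pattern matches a routing key.
// "*" matches exactly one word; "#" matches zero or more words.
function topicMatches(pattern, routingKey) {
  const p = pattern.split(".");
  const k = routingKey.split(".");

  function match(pi, ki) {
    if (pi === p.length) return ki === k.length;
    if (p[pi] === "#") {
      // Try letting "#" absorb 0, 1, 2, ... of the remaining words.
      for (let next = ki; next <= k.length; next++) {
        if (match(pi + 1, next)) return true;
      }
      return false;
    }
    if (ki === k.length) return false;
    if (p[pi] === "*" || p[pi] === k[ki]) return match(pi + 1, ki + 1);
    return false;
  }

  return match(0, 0);
}
```

A consumer interested only in tablet pageviews could bind with a pattern such as `events.tablet.pageview`, while a storage consumer binds with `events.#` and receives everything, with no filtering code in either application.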

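  47. // Variant 1: consume the input with shift(), mutating the array on every iteration
    while (buffer.length > 0) {
        var char = buffer.shift();
        if ('\n' === char) {
            queue.push(Buffer.from(outbuf.join('')));
            outbuf = []; // start collecting the next record
            continue;
        }
        outbuf.push(char);
    }

    // Variant 2: walk a copy of the input by index and leave the original untouched
    var i = 0;
    var tBuf = buffer.slice();
    while (i < buffer.length) {
        var char = tBuf[i++];
        if ('\n' === char) {
            queue.push(Buffer.from(outbuf.join('')));
            outbuf = []; // start collecting the next record
            continue; // keep the newline out of the next record
        }
        outbuf.push(char);
    }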

  48. Findings
    • Max throughput per collector: 600 events/second
    • ~35 receivers needed for prod
    • Micro-optimized code became increasingly brittle and hard to maintain as
    custom logic was needed for every edge case

  49. Version 2
    [Diagram: Collectors, Streamer, Receiver, RabbitMQ, ElasticSearch]

  50. Need to Get Serious
    • Very high throughput
    • Multi-threaded worker pool with large memory buffers
    • Static & dynamic optimization
    • Efficient memory management for extremely volatile in-memory data
    • Eliminate any processing overhead. Receiver must be a Transit

  51. And also…
    • Not GoLang (because no one on the team is familiar with it)
    • Not Rust (because no one on the team wants to be familiar with it)
    • Not C (because C)

  52. Why Java?
    • Solid static & dynamic analysis and optimizations in the S2BC & JIT
    compilers
    • Clients for the stuff I needed to talk to
    • Well-supported within AOL & within my team

  53. Version 3
    [Diagram: Collectors, Streamer, Receiver, RabbitMQ, Processor/Router, ElasticSearch]

  54. public class StreamReader {
        private static final Logger logger = Logger.getLogger(StreamReader.class.getName());
        private StreamerQueue queue = new StreamerQueue();
        private StreamProcessor processor;
        private List workerThreads = new ArrayList();
        private RtStreamerClient client;

        public StreamReader(String streamerURI, AmqpClient amqpClient, String appID, String tpcFltrs,
                            String rfFltrs, String bt) {
            ArrayList queueList = new ArrayList();
            this.processor = new StreamProcessor(amqpClient);
            byte numThreads = 8;
            // Start a fixed pool of worker threads
            for (int i = 0; i < numThreads; ++i) {
                StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread();
                this.workerThreads.add(worker);
                worker.start();
            }
            queueList.add(this.queue);
            this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList);
        }
    }

  55. public class StreamProcessor {
        private static final Logger logger = Logger.getLogger(StreamProcessor.class.getName());
        private AmqpClient amqpClient;

        public StreamProcessor(AmqpClient amqpClient) {
            this.amqpClient = amqpClient;
        }

        public void send(String data) throws Exception {
            this.amqpClient.send(data.getBytes());
            logger.debug("Sent event " + data + " to AMQP");
        }
    }

  56. [Diagram: Network Input, eleven Linked List Queues, Network Output]
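
A linked-list queue suits this kind of buffering because adding at the tail and removing at the head are both constant-time, with no re-indexing of the remaining items. The sketch below shows the data structure only. It is written in JavaScript for illustration, whereas the Version 3 receiver is Java, and it makes no claim about how the `StreamerQueue` class in the talk is implemented.

```javascript
// A singly linked FIFO queue with O(1) enqueue and dequeue.
class LinkedQueue {
  constructor() {
    this.head = null; // next node to dequeue
    this.tail = null; // most recently enqueued node
    this.size = 0;
  }

  enqueue(value) {
    const node = { value, next: null };
    if (this.tail) this.tail.next = node;
    else this.head = node;
    this.tail = node;
    this.size += 1;
  }

  dequeue() {
    if (!this.head) return undefined; // empty queue
    const { value } = this.head;
    this.head = this.head.next;
    if (!this.head) this.tail = null; // queue became empty
    this.size -= 1;
    return value;
  }
}
```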

  57. Findings
    • Max throughput per collector: 2600 events/second
    • ~10 receivers needed for prod
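
The receiver counts in the three Findings slides follow from dividing the peak event rate by the throughput of one node. A minimal sketch, assuming the "per collector" throughput applies to each receiver and using the 15,000 records per second quoted earlier in the deck; the function name and the `headroom` parameter are illustrative.

```javascript
// Receivers needed to sustain a peak event rate, with optional headroom.
function receiversNeeded(peakEventsPerSecond, perReceiverEventsPerSecond, headroom = 1) {
  return Math.ceil((peakEventsPerSecond * headroom) / perReceiverEventsPerSecond);
}

const rate = 15000; // records per second, from the "How big is big?" slides
// Throughput per node for Versions 1, 2 and 3, with no headroom.
const bare = [300, 600, 2600].map((t) => receiversNeeded(rate, t));
```

With no headroom this gives 50, 25 and 6 receivers. The slides quote about 70, 35 and 10, which suggests the production estimate allowed for headroom or a higher peak rate; the deck does not say which.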

  58. Why ElasticSearch
    • Open-source Lucene search engine
    • Highly-distributed storage engine
    • Clusters nicely
    • Built-in aggregations like whoa

  59. Aggregations
    • Geographic Boxing & Radius Grouping
    • Time-Series
    • Histograms
    • Min/Max/Avg Statistical Evaluation
    • MapReduce (coming soon!)

  60. • How many users viewed my post on an Android tablet in portrait mode within 10 miles of Denton, TX?
    • What is the average time from start of page-load to first click for readers on Linux desktops between 3am and 5am?
    • Given two sets of link texts, which has the higher CTR for a randomized sample of readers on tablet devices?
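
The first question above maps fairly directly onto an Elasticsearch search body that combines term filters with a geo-distance filter. The sketch below builds such a body as a plain object. Apart from `location.geopoint`, which appears in the plotting code later in the deck, every field name is hypothetical. The `bool`/`filter` form is the current query DSL, not necessarily the syntax available when the talk was given, and the query counts matching events, not distinct users.

```javascript
// Builds a search body counting Android tablet views in portrait mode
// within `distance` of a point. Field names under "device" are hypothetical.
function tabletViewsNear(lat, lon, distance) {
  return {
    size: 0, // only the hit count is wanted, not the documents themselves
    query: {
      bool: {
        filter: [
          { term: { "device.os": "android" } },
          { term: { "device.type": "tablet" } },
          { term: { "device.orientation": "portrait" } },
          { geo_distance: { distance, "location.geopoint": { lat, lon } } },
        ],
      },
    },
  };
}

// Approximate coordinates for Denton, TX.
const body = tabletViewsNear(33.21, -97.13, "10mi");
```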

  61. Browser to Browser in < 5 seconds

  62. But wait…
    Is that “real-time”?

  63. Real-Time for Real
    • Live analysis of data as it is collected
    • Active visualization of very short-term trends in data

  64. Potential Problems
    • Small sample sizes for new datasets / small analysis windows
    • Data volumes too high for end-user comprehension
    • Data volumes too high for end-user hardware/network connections
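
The deck does not say how the last two problems were handled. One common mitigation is to sample the stream before it reaches the browser, so the client sees a representative fraction of events. A minimal sketch of 1-in-N sampling; the function name, the sampling rate, and the event shape are assumptions.

```javascript
// Returns an event handler that forwards every Nth event to `emit`
// and drops the rest.
function makeSampler(n, emit) {
  let seen = 0;
  return function onEvent(event) {
    seen += 1;
    if (seen % n === 0) emit(event);
  };
}

const delivered = [];
const onEvent = makeSampler(100, (e) => delivered.push(e));
for (let i = 1; i <= 1000; i++) onEvent({ id: i });
```

At a rate of 1 in 100, the 1,000 simulated events above result in 10 deliveries. The trade-off is the first problem on the slide: sampling shrinks the sample size further, so it fits high-volume streams better than new or sparse ones.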

  65. Version 4
    [Diagram: Collectors, Streamer, Receiver, RabbitMQ, Processor/Router, ElasticSearch, Websocket Server]

  66. D3JS
    • Open-source data visualization library written in JavaScript

  67. function plot(point) {
        var points = svg.selectAll("circle")
            .data([point], function(d) {
                return d.id;
            });
        // parseFloat keeps the fractional degrees that parseInt would truncate
        points.enter()
            .append("circle")
            .attr("cx", function (d) { return projection([parseFloat(d.location.geopoint.lon), parseFloat(d.location.geopoint.lat)])[0] })
            .attr("cy", function (d) { return projection([parseFloat(d.location.geopoint.lon), parseFloat(d.location.geopoint.lat)])[1] })
            .attr("r", function (d) { return 1; })
            .style('fill', 'red')
            .style('fill-opacity', 1)
            .style('stroke', 'red')
            .style('stroke-width', '0.5px')
            .style('stroke-opacity', 1)
            // Grow and fade each point over 10 seconds, then remove it
            .transition()
            .duration(10000)
            .style('fill-opacity', 0)
            .style('stroke-opacity', 0)
            .attr('r', '32px').remove();
    }

    var buffer = [];
    var socket = io();
    socket.on('geopoint', function(point) {
        if (point.location.geopoint) {
            plot(point);
        }
    });

  68. By the way…
    xn = x + (r * COS(2π * n / v))
    yn = y + (r * SIN(2π * n / v))
    where n = ordinal of vertex and
    where v = number of vertices and
    r = radius (distance from the center to each vertex) and
    x,y = center of the polygon
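
The vertices of a regular polygon lie on a circle around its center, with the x coordinate taken from the cosine of the vertex angle and the y coordinate from the sine. A minimal sketch using the slide's symbols as parameters; the function name is illustrative, and `n` is assumed to start at 0.

```javascript
// Returns the v vertices of a regular polygon centered at (x, y)
// with radius r, as [xn, yn] pairs for n = 0 .. v - 1.
function polygonVertices(x, y, r, v) {
  const vertices = [];
  for (let n = 0; n < v; n++) {
    const angle = (2 * Math.PI * n) / v;
    vertices.push([x + r * Math.cos(angle), y + r * Math.sin(angle)]);
  }
  return vertices;
}
```

For `v = 4` and `r = 1` around the origin, this yields the four points of a square at approximately (1, 0), (0, 1), (-1, 0) and (0, -1), subject to floating-point rounding.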

  69. var views = 0;
    var socket = io();
    socket.on('pageview', function(point) {
        views++;
    });

    function tick() {
        data.push(views);
        views = 0;
        path
            .attr("d", line)
            .attr("transform", null)
            .transition()
            .duration(500)
            .ease("linear")
            .attr("transform", "translate(" + x(0) + ",0)")
            .each("end", tick);
        data.shift();
    }

    tick();

  70. Pageview Heartbeat

  71. [Diagram: Receiver Layer, Receiver Buffer/Transit, Processing & Routing Layer, Processing & Routing Transit, Storage Engine, End-User Consumable Queues]
    Layers are
    • Geographically decoupled
    • Capable of independent scaling
    • Fully encapsulated with no cross-layer dependencies

  72. Interfaces
    Input Stream (Java)
    Routing (node.js)
    Filtering (node.js)
    Aggregation (PHP)
    Visualization (D3JS)
    MV Testing (PHP)

  73. Languages & Tools
    • RabbitMQ
    • Hadoop
    • ElasticSearch
    • PHP
    • JS (node)
    • JS (D3)
    • Java
    • MySQL

  74. Where are we Now?
    • It took 8 months to build a rock-solid data pipeline
    • Entry points from:
    • User data collectors
    • Application code

  75. That was the easy part.

  76. What’s next?

  77. • Live debugging & runtime profiling
    • Embeddable visualizations
    • On-demand stream filters

  78. @ieatkillerbees
    http://samanthaquinones.com
