Drinking from the Firehose

Samantha Quiñones

May 15, 2015
Transcript

  1. Drinking from the Firehose
    Real-Time Metrics
    Samantha Quiñones

  2. @ieatkillerbees
    http://samanthaquinones.com

  3. (image slide)

  4. “How would you let editors test how well
    different headlines perform for the same
    piece of content?”

  5. Areas of Interest
    • Quantified Application Performance
    • Perceived Application Performance (User Experience)
    • User Behavior

  6. Quantified Application Performance

  7. • Are requests being handled efficiently?
    • Are adequate resources available at each layer of the stack?
    • Is the cache being used efficiently?

  8. CPU Time Per Request

  9. Stack Performance

  10. Perceived Application Performance (User Experience)

  11. • How long does it take for a page to be “ready” for the user?
    • Is the page responsive to user input?
    • Does the page require an excessive number of requests to complete?

  12. (image slide)

  13. Delay Perception
    <100ms      Instantaneous
    <300ms      Perceptible Delay
    <1,000ms    System “Working”
    <10,000ms   System “Slow”
    >10,000ms   System “Down”
    Source: O’Reilly Media

  14. Effects of Latency
    Amazon: roughly 1% of sales lost per 100ms of added latency
    [Chart: Sales (USD) vs. seconds of latency]
    Linden, G. (2006, December 3). Make Data Useful. Data Mining (CS345). Lecture conducted from Stanford University, Stanford, CA.

  15. Effects of Latency
    Google search experiment (4-6 weeks): adding 50-400ms of latency measurably reduced searches per day, with the drop growing as latency increased
    [Chart: % fewer searches per day vs. milliseconds of additional latency]
    Schurman, E., Brutlag, J. (2009, June 23). The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search. O'Reilly Velocity. Lecture conducted from O'Reilly Media, San Jose, CA.

  16. Interpreting Web Profiler Information

  17. (image slide)

  18. User Behavior

  19. Measuring User Behavior
    • Application path
    • Use patterns
    • Mouse & attention tracking

  20. Multivariate Testing
    • Sort all users into groups
    • 1 control group receives unaltered content
    • 1 or more groups receive altered content
    • Measure behavioral statistics (CTR, abandon rate, time on page, scroll
    depth) for each group
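The sorting step has to be deterministic, so a reader lands in the same group on every visit and keeps seeing the same headline. A minimal sketch of hash-based bucketing (illustrative only; `assignGroup` and the hash are not from the deck):

```javascript
// Deterministically assign a user to one of numGroups buckets.
// Group 0 can serve as the control; the rest receive altered content.
// (Hypothetical sketch -- the real assignment logic isn't shown here.)
function assignGroup(userId, numGroups) {
    var hash = 0;
    for (var i = 0; i < userId.length; i++) {
        // Simple 32-bit rolling string hash; any stable hash works.
        hash = ((hash << 5) - hash + userId.charCodeAt(i)) | 0;
    }
    return Math.abs(hash) % numGroups;
}

// The same user always lands in the same group:
var group = assignGroup("user-42", 3);
```

Because assignment depends only on the user ID, no per-user state needs to be stored to keep the experiment consistent.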

  21. Managing Big Data

  22. How big is big?

  23. 1,300,000,000,000 events per DAY

  24. ~40 datapoints per EVENT

  25. ~15,000 records per SECOND

  26. ~600,000
    datapoints
    Containing

    View Slide

  27. At a rate up to 25 megabytes / second

  28. [Diagram: twelve collectors feeding a RabbitMQ farm]

  29. [Diagram: the RabbitMQ farm feeding Hadoop]

  30. Hadoop
    • Framework for distributed storage and processing of data
    • Designed to make managing very large datasets simple with…
    • Well-documented, open-source, common libraries
    • Optimization for commodity hardware

  31. Hadoop Distributed File System
    • Modeled after Google File System
    • Stores logical files across multiple systems
    • Rack-aware
    • No read-write concurrency

  32. MapReduce
    • Framework for massively parallel data processing tasks

  33. Map
    $document = "I'm a little teapot short and stout here is my handle here is my spout";

    /**
     * Outputs: [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0]
     */
    function map($target_word, $document) {
        return array_map(
            function ($word) use ($target_word) {
                if ($word === $target_word) {
                    return 1;
                }
                return 0;
            },
            preg_split('/\s+/', $document)
        );
    }

    echo json_encode(map("is", $document)) . PHP_EOL;

  34. Reduce
    $data = [0,0,0,0,0,0,0,0,1,0,0,0,1,0,0];

    /**
     * Outputs: 2
     */
    function reduce($data) {
        return array_reduce(
            $data,
            function ($count, $value) {
                return $count + $value;
            },
            0
        );
    }

    echo reduce($data) . PHP_EOL;

  35. Hadoop Limitations
    • Hadoop jobs are batched and take significant time to run
    • Data may not be available for 1+ hours after collection

  36. “How would you let editors test how well
    different headlines perform for the same
    piece of content?”

  37. Consider Shelf-life
    • Most articles are relevant for < 24 hours
    • Interest peaks < 3 hours

  38. Real-Time Pipelines

  39. [Diagram: collectors feeding the RabbitMQ farm, with groups of collectors also feeding dedicated streamers]

  40. (image slide)

  41. Version 1 (PoC)
    [Diagram: collectors → Streamer → Receiver → StatsD Cluster → ElasticSearch]

  42. (image slide)

  43. this.visit = function (record) {
        if (record.userAgent) {
            var parser = new UAParser();
            parser.setUA(record.userAgent);
            var user_agent = parser.getResult();
            return { user_agent: user_agent };
        }
        return {};
    };

  44. (image slide)

  45. Findings
    • Max throughput per collector: 300 events/second
    • ~70 receivers needed for prod
    • StatsD key format creates data redundancy and reduces data richness

  46. Version 1 (PoC)
    [Diagram: collectors → Streamer → Receiver → StatsD Cluster → ElasticSearch]

  47. Transits & Terminals
    • Transits - Short-term, in-memory, volatile storage for data with a life-span of up to a few seconds
    • Terminals - Destinations that store, abandon, or transmit the data
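The split can be pictured as two tiny roles (an illustrative sketch, not pipeline code): a transit only forwards, while a terminal consumes.

```javascript
// Illustrative sketch of the transit/terminal distinction (not pipeline code).
// A transit holds data only momentarily and forwards it downstream.
function makeTransit(downstream) {
    return { receive: function (record) { downstream.receive(record); } };
}

// A terminal is a destination: here it stores, but it could drop or re-transmit.
function makeTerminal(store) {
    return { receive: function (record) { store.push(record); } };
}

var store = [];
var pipeline = makeTransit(makeTransit(makeTerminal(store)));
pipeline.receive({ event: "pageview" });
// store now holds the record; neither transit retained anything.
```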

  48. An efficient real-time data pathway consists
    of a network of transits and terminals, where
    no single node acts as both a transit and a
    terminal at the same time.

  49. StatsD
    • Acts as a transit, taking data and passing it along…
    • BUT
    • Also acts as a terminal, aggregating keys in memory until a time or buffer threshold triggers a flush
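That dual behavior can be modeled in a few lines (a simplification, not StatsD's actual source): counters aggregate in memory (terminal role) until a flush pushes them downstream (transit role).

```javascript
// Simplified model of StatsD's dual role (not actual StatsD source).
function StatsdLikeAggregator(flushFn) {
    this.counters = {};            // terminal: aggregates keys in memory...
    this.increment = function (key) {
        this.counters[key] = (this.counters[key] || 0) + 1;
    };
    this.flush = function () {     // ...transit: until a threshold flushes them
        flushFn(this.counters);
        this.counters = {};
    };
}

var flushed = [];
var agg = new StatsdLikeAggregator(function (c) { flushed.push(c); });
agg.increment("pageview");
agg.increment("pageview");
agg.flush();  // flushed[0] is { pageview: 2 }
```

Until `flush()` fires, individual events no longer exist, which is exactly the loss of data richness noted in the Version 1 findings.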

  50. Version 2
    [Diagram: collectors → Streamer → Receiver → RabbitMQ → ElasticSearch]

  51. (image slide)

  52. Version 2
    • Eliminated eventing and improved performance
    • Replaced StatsD with RabbitMQ
    • Data records are kept together
    • No longer works with Kibana (sadface)

  53. RabbitMQ
    • Lightweight message broker
    • Allows complex message routing without application-level logic
    • Can buffer 90-120 seconds of traffic
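The application-level-logic-free routing comes from RabbitMQ's topic exchanges, which match dot-separated routing keys against patterns using `*` (exactly one word) and `#` (zero or more words). A plain-JS illustration of those matching semantics (not RabbitMQ's implementation):

```javascript
// Illustration of RabbitMQ topic-exchange matching semantics:
// '*' matches exactly one dot-separated word, '#' matches zero or more.
// (Not RabbitMQ source -- just the matching rules, for intuition.)
function topicMatches(pattern, key) {
    return match(pattern.split('.'), key.split('.'));

    function match(p, k) {
        if (p.length === 0) return k.length === 0;
        if (p[0] === '#') {
            // '#' may consume zero words, or one word and try again.
            return match(p.slice(1), k) || (k.length > 0 && match(p, k.slice(1)));
        }
        if (k.length === 0) return false;
        if (p[0] === '*' || p[0] === k[0]) return match(p.slice(1), k.slice(1));
        return false;
    }
}

// e.g. a queue bound with "metrics.*.pageview" receives "metrics.us.pageview"
var delivered = topicMatches('metrics.*.pageview', 'metrics.us.pageview');
```

Binding queues with patterns like these is what lets the broker fan events out to storage, processing, and live consumers without any routing code in the application.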

  54. // Version 1: shift() re-indexes the whole array on every call
    while (buffer.length > 0) {
        var char = buffer.shift();
        if ('\n' === char) {
            queue.push(new Buffer(outbuf.join('')));
            outbuf = [];
            continue;
        }
        outbuf.push(char);
    }

    // Version 2: walk a copy by index instead of shifting
    var i = 0;
    var tBuf = buffer.slice();
    while (i < buffer.length) {
        var char = tBuf[i++];
        if ('\n' === char) {
            queue.push(new Buffer(outbuf.join('')));
            outbuf = [];
            continue;
        }
        outbuf.push(char);
    }

  55. Findings
    • Max throughput per collector: 600 events/second
    • ~35 receivers needed for prod
    • Micro-optimized code became increasingly brittle and hard to maintain as
    custom logic was needed for every edge case

  56. Version 2
    [Diagram: collectors → Streamer → Receiver → RabbitMQ → ElasticSearch]

  57. Need to Get Serious
    • Very high throughput
    • Multi-threaded worker pool with large memory buffers
    • Static & dynamic optimization
    • Efficient memory management for extremely volatile in-memory data
    • Eliminate any processing overhead. Receiver must be a Transit

  58. And also…
    • Not GoLang (because no one on the team is familiar with it)
    • Not Rust (because no one on the team wants to be familiar with it)
    • Not C (because C)

  59. (image slide)

  60. mfw java :(

  61. Why Java?
    • Solid static & dynamic analysis and optimizations in the S2BC & JIT compilers
    • Clients for the stuff I needed to talk to
    • Well-supported within AOL & within my team

  62. Version 3
    [Diagram: collectors → Streamer → Receiver → RabbitMQ → Processor/Router → ElasticSearch]

  63. (image slide)

  64. public class StreamReader {
        private static final Logger logger = Logger.getLogger(StreamReader.class.getName());
        private StreamerQueue queue = new StreamerQueue();
        private StreamProcessor processor;
        private List<BeaconWorkerThread> workerThreads = new ArrayList<>();
        private RtStreamerClient client;

        public StreamReader(String streamerURI, AmqpClient amqpClient, String appID,
                            String tpcFltrs, String rfFltrs, String bt) {
            ArrayList<StreamerQueue> queueList = new ArrayList<>();
            this.processor = new StreamProcessor(amqpClient);
            byte numThreads = 8;
            for (int i = 0; i < numThreads; ++i) {
                StreamReader.BeaconWorkerThread worker = new StreamReader.BeaconWorkerThread();
                this.workerThreads.add(worker);
                worker.start();
            }
            queueList.add(this.queue);
            this.client = new RtStreamerClient(streamerURI, appID, tpcFltrs, rfFltrs, bt, queueList);
        }
    }

  65. public class StreamProcessor {
        private static final Logger logger = Logger.getLogger(StreamProcessor.class.getName());
        private AmqpClient amqpClient;

        public StreamProcessor(AmqpClient amqpClient) {
            this.amqpClient = amqpClient;
        }

        public void send(String data) throws Exception {
            this.amqpClient.send(data.getBytes());
            logger.debug("Sent event " + data + " to AMQP");
        }
    }

  66. Linked List Queues
    [Diagram: network input fanned out across many linked-list queues to network output]
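A linked-list queue gives O(1) enqueue and dequeue, avoiding the re-indexing cost that made the earlier `buffer.shift()` loop slow. A minimal sketch of the idea in JS (the pipeline's actual queues are Java):

```javascript
// Minimal linked-list queue: O(1) enqueue and dequeue, unlike Array.shift(),
// which re-indexes the whole array. (Sketch; the pipeline's version is Java.)
function LinkedQueue() {
    this.head = null;   // next node to dequeue
    this.tail = null;   // last node enqueued
    this.length = 0;
}
LinkedQueue.prototype.enqueue = function (value) {
    var node = { value: value, next: null };
    if (this.tail) this.tail.next = node;
    else this.head = node;
    this.tail = node;
    this.length++;
};
LinkedQueue.prototype.dequeue = function () {
    if (!this.head) return undefined;
    var value = this.head.value;
    this.head = this.head.next;
    if (!this.head) this.tail = null;
    this.length--;
    return value;
};

var q = new LinkedQueue();
q.enqueue("a");
q.enqueue("b");
q.dequeue();  // "a"
```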

  67. Findings
    • Max throughput per collector: 2600 events/second
    • ~10 receivers needed for prod

  68. (image slide)

  69. Why ElasticSearch
    • Open-source search engine built on Lucene
    • Highly-distributed storage engine
    • Clusters nicely
    • Built-in aggregations like whoa

  70. Aggregations
    • Geographic Boxing & Radius Grouping
    • Time-Series
    • Histograms
    • Min/Max/Avg Statistical Evaluation
    • MapReduce (coming soon!)
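A typical request nests these aggregations, e.g. a time-series of average page-load times. A sketch of the request body (the field names `timestamp` and `load_time` are hypothetical; the DSL shape is ElasticSearch's own):

```javascript
// Example ElasticSearch aggregation request body. The field names
// 'timestamp' and 'load_time' are hypothetical; the structure follows
// the ES aggregations DSL.
var query = {
    size: 0,  // aggregations only, skip the raw hits
    aggs: {
        views_over_time: {
            date_histogram: { field: "timestamp", interval: "1m" },
            aggs: {
                avg_load: { avg: { field: "load_time" } }
            }
        }
    }
};
```

Because the cluster computes the buckets, the client receives a few dozen numbers instead of thousands of raw events.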

  71. • How many users viewed my post on an android tablet in portrait mode
    within 10 miles of Denton, TX?
    • What is the average time from start of page-load to first click for readers
    on linux desktops between 3am and 5am?
    • Given two sets of link texts, which has the higher CTR for a randomized
    sample of readers on tablet devices?

  72. Browser to Browser in < 5 seconds

  73. But wait…
    Is that “real-time”?

  74. Real-Time for Real
    • Live analysis of data as it is collected
    • Active visualization of very short-term trends in data

  75. Potential Problems
    • Small sample sizes for new datasets / small analysis windows
    • Data volumes too high for end-user comprehension
    • Data volumes too high for end-user hardware/network connections

  76. Version 4
    [Diagram: collectors → Streamer → Receiver → RabbitMQ → Processor/Router → ElasticSearch & Websocket Server]

  77. (image slide)

  78. D3JS
    • Open-source data visualization library written in JavaScript

  79. function plot(point) {
        var points = svg.selectAll("circle")
            .data([point], function (d) { return d.id; });
        points.enter()
            .append("circle")
            // parseFloat, not parseInt: parseInt would truncate coordinates to whole degrees
            .attr("cx", function (d) { return projection([parseFloat(d.location.geopoint.lon), parseFloat(d.location.geopoint.lat)])[0]; })
            .attr("cy", function (d) { return projection([parseFloat(d.location.geopoint.lon), parseFloat(d.location.geopoint.lat)])[1]; })
            .attr("r", function (d) { return 1; })
            .style('fill', 'red')
            .style('fill-opacity', 1)
            .style('stroke', 'red')
            .style('stroke-width', '0.5px')
            .style('stroke-opacity', 1)
            .transition()
            .duration(10000)
            .style('fill-opacity', 0)
            .style('stroke-opacity', 0)
            .attr('r', '32px')
            .remove();
    }

    var socket = io();
    socket.on('geopoint', function (point) {
        if (point.location.geopoint) {
            plot(point);
        }
    });

  80. (image slide)

  81. (image slide)

  82. By the way…
    xn = x + (r * COS(2π * n / v))
    yn = y + (r * SIN(2π * n / v))
    where n = ordinal of vertex and
    where v = number of vertices and
    x,y = center of the polygon
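Transcribed into code (a sketch; note the sine in the y term, which places the vertices of a regular polygon of radius r around the center):

```javascript
// Vertices of a regular polygon: vertex n of v, radius r, centered at (x, y).
function polygonVertices(x, y, r, v) {
    var vertices = [];
    for (var n = 0; n < v; n++) {
        vertices.push({
            x: x + r * Math.cos(2 * Math.PI * n / v),
            y: y + r * Math.sin(2 * Math.PI * n / v)
        });
    }
    return vertices;
}

// A square of radius 1 centered at the origin:
var square = polygonVertices(0, 0, 1, 4);
```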

  83. var views = 0;
    var socket = io();
    socket.on('pageview', function (point) {
        views++;
    });

    function tick() {
        data.push(views);
        views = 0;
        path
            .attr("d", line)
            .attr("transform", null)
            .transition()
            .duration(500)
            .ease("linear")
            .attr("transform", "translate(" + x(0) + ",0)")
            .each("end", tick);
        data.shift();
    }
    tick();

  84. Pageview Heartbeat

  85. (image slide)

  86. [Diagram: Receiver Layer → Receiver Buffer/Transit → Processing & Routing Layer → Processing & Routing Transit → Storage Engine & End-User Consumable Queues]
    Layers are
    • Geographically decoupled
    • Capable of independent scaling
    • Fully encapsulated with no cross-layer dependencies

  87. Interfaces
    Input Stream (Java)
    Routing (node.js)
    Filtering (node.js)
    Aggregation (PHP)
    Visualization (D3JS)
    MV Testing (PHP)

  88. Languages & Tools
    RabbitMQ, Hadoop, ElasticSearch, PHP, JS (node), JS (D3), Java, MySQL

  89. Where are we Now?
    • It took 8 months to build a rock-solid data pipeline
    • Entry points from:
    • User data collectors
    • Application code

  90. That was the easy part.

  91. What’s next?

  92. • Live debugging & runtime profiling
    • Embeddable visualizations
    • On-demand stream filters

  93. ???

  94. @ieatkillerbees
    http://samanthaquinones.com
