
Streaming data pipelines @ Slack

Slack's data platform is evolving from a batch system to near real-time. I will also touch on how Samza helps us build low-latency data pipelines and an experimentation framework.

Ananth Packkildurai

December 04, 2017

Transcript

  1. Ananth Packkildurai
    Streaming data pipeline @ Slack

  2. About Slack
    Public launch: 2014
    800+ employees across 7 countries worldwide
    HQ in San Francisco
    Diverse set of industries, including software/technology, retail, media, telecom and professional services.

  3. An unprecedented adoption rate

  4. Agenda
    1. A bit of history
    2. NRT infrastructure & Use cases
    3. Challenges

  5. A bit of history

  6. March 2016
    5 Data Engineers
    350+ Slack employees
    2M Active users

  7. October 2017
    10 Data Engineers
    800+ Slack employees
    6M Active users

  8. Data usage
    1 in 3 access the data warehouse per week
    500+ tables
    400k events per sec at peak

  9. It is all about Slogs

  10. Well, not exactly

  11. Slog

  12. Slog

  13. NRT infrastructure & use cases

  14. What can go wrong?

  15. We want more...

  16. Performance & Experimentation
    ● Engineering & CE teams should be able to detect performance bottlenecks proactively.
    ● Engineers should be able to see their experiment performance in near real-time.

  17. Near Real-Time Pipeline

  18. Why Analytics Kafka
    ● Keeps the load on DW Kafka predictable.
    ● Easier to upgrade to and verify newer Kafka versions.
    ● A smaller Kafka cluster is more straightforward to operate.
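
    To make the split concrete, below is a minimal sketch of how a Samza job could be pointed at the separate analytics cluster, assuming Samza's Kafka system configuration keys; the "analytics-kafka" system name, the host names, and the "slog-router"/"slog" job and topic names are hypothetical, not Slack's actual setup.

      // Sketch only: register the small analytics Kafka cluster as its own Samza
      // "system" so streaming jobs never consume from or produce to the main DW
      // Kafka brokers directly. Host names and system/job names are made up.
      import java.util.Map;
      import org.apache.samza.config.Config;
      import org.apache.samza.config.MapConfig;

      public class AnalyticsKafkaConfig {
        public static Config build() {
          return new MapConfig(Map.of(
              "job.name", "slog-router",
              // Consume the raw slog topic from the analytics cluster only.
              "task.inputs", "analytics-kafka.slog",
              "systems.analytics-kafka.samza.factory",
                  "org.apache.samza.system.kafka.KafkaSystemFactory",
              "systems.analytics-kafka.consumer.zookeeper.connect",
                  "analytics-zk1:2181,analytics-zk2:2181",
              "systems.analytics-kafka.producer.bootstrap.servers",
                  "analytics-kafka1:9092,analytics-kafka2:9092"));
        }
      }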

  19. Samza pipeline design

  20. Samza pipeline design
    ● Content-based router.
    ● Router: deserializes Kafka events and adds instrumentation.
    ● Processor: the abstraction for a streaming job; adds the sink operation and instrumentation.
    ● Converter: implements the business logic (join, filter, projection, etc.); see the sketch below.
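
    Below is a minimal sketch of that router/processor/converter split, assuming Samza's low-level StreamTask API; the class name, event fields, and the "perf-metrics" output stream are hypothetical, and the converter is reduced to a filter plus projection.

      // Router, converter, and processor stages in one task, for illustration only.
      import java.util.HashMap;
      import java.util.Map;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.system.OutgoingMessageEnvelope;
      import org.apache.samza.system.SystemStream;
      import org.apache.samza.task.MessageCollector;
      import org.apache.samza.task.StreamTask;
      import org.apache.samza.task.TaskCoordinator;

      public class SlogRouterTask implements StreamTask {
        // Downstream stream for performance events; other converters route elsewhere.
        private static final SystemStream PERF_STREAM =
            new SystemStream("analytics-kafka", "perf-metrics");

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          // Router: the message has already been deserialized by the serde
          // configured for the input stream; routing instrumentation goes here.
          @SuppressWarnings("unchecked")
          Map<String, Object> event = (Map<String, Object>) envelope.getMessage();

          // Converter: business logic -- filter to perf events and project fields.
          if ("perf".equals(event.get("type"))) {
            Map<String, Object> projected = new HashMap<>();
            projected.put("team_id", event.get("team_id"));
            projected.put("duration_ms", event.get("duration_ms"));

            // Processor: sink the converted event to the downstream stream.
            collector.send(new OutgoingMessageEnvelope(PERF_STREAM, projected));
          }
        }
      }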

  21. Performance monitoring

  22. Performance monitoring
    ● Approximate percentiles using the Druid Histogram extension [http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf]
    ● Unique users based on the Druid HyperLogLog implementation
    ● Slack bot integration to alert based on performance metrics
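
    The linked paper is the Ben-Haim & Tom-Tov streaming histogram that the Druid extension builds on. The toy sketch below only illustrates the underlying idea (keep a bounded set of centroids, merge the closest pair on overflow, read percentiles off the counts); it is not the extension's code, and Druid's implementation interpolates far more carefully.

      import java.util.Map;
      import java.util.TreeMap;

      // Toy approximate-percentile histogram in the spirit of Ben-Haim & Tom-Tov.
      // Assumes maxBins >= 2.
      public class StreamingHistogram {
        private final int maxBins;
        // Centroid value -> count, kept sorted so adjacent entries are neighbours.
        private final TreeMap<Double, Long> bins = new TreeMap<>();

        public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

        public void add(double value) {
          bins.merge(value, 1L, Long::sum);
          if (bins.size() > maxBins) {
            // Find the two adjacent centroids that are closest together...
            double prev = Double.NaN, left = 0, bestGap = Double.MAX_VALUE;
            for (double v : bins.keySet()) {
              if (!Double.isNaN(prev) && v - prev < bestGap) {
                bestGap = v - prev;
                left = prev;
              }
              prev = v;
            }
            // ...and replace them with a single count-weighted centroid.
            double right = bins.higherKey(left);
            long cl = bins.remove(left), cr = bins.remove(right);
            bins.merge((left * cl + right * cr) / (cl + cr), cl + cr, Long::sum);
          }
        }

        // Approximate p-th percentile (0 < p < 1): walk the centroids until the
        // target rank is covered.
        public double percentile(double p) {
          long total = 0, seen = 0;
          for (long c : bins.values()) total += c;
          long target = (long) Math.ceil(p * total);
          for (Map.Entry<Double, Long> e : bins.entrySet()) {
            seen += e.getValue();
            if (seen >= target) return e.getKey();
          }
          return bins.lastKey();
        }
      }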

  23. Experimentation framework

  24. Experimentation Framework
    ● Both streams are hash partitioned by team & user.
    ● A RocksDB store holds the exposure table (team_users_experimentation mapping).
    ● Metrics events are range joined with the exposure table; a simplified sketch follows below.
    ● A periodic snapshot of RocksDB is quality-checked against the batch system.
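
    A minimal sketch of how that join could look with Samza's low-level API and a RocksDB-backed KeyValueStore; the store, stream, and field names are hypothetical, and the range join is simplified to a point lookup on the team/user key.

      import java.util.Map;
      import org.apache.samza.config.Config;
      import org.apache.samza.storage.kv.KeyValueStore;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.system.OutgoingMessageEnvelope;
      import org.apache.samza.system.SystemStream;
      import org.apache.samza.task.InitableTask;
      import org.apache.samza.task.MessageCollector;
      import org.apache.samza.task.StreamTask;
      import org.apache.samza.task.TaskContext;
      import org.apache.samza.task.TaskCoordinator;

      public class ExperimentJoinTask implements StreamTask, InitableTask {
        private static final SystemStream OUTPUT =
            new SystemStream("analytics-kafka", "experiment-metrics");

        // RocksDB-backed exposure table: "team_id:user_id" -> exposure event.
        private KeyValueStore<String, Map<String, Object>> exposures;

        @Override
        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
          exposures = (KeyValueStore<String, Map<String, Object>>) context.getStore("exposure-store");
        }

        @Override
        @SuppressWarnings("unchecked")
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
          // Both inputs are hash partitioned by team & user, so the exposure row
          // for this key always lives in this task's local store.
          String key = event.get("team_id") + ":" + event.get("user_id");

          if ("exposures".equals(envelope.getSystemStreamPartition().getStream())) {
            exposures.put(key, event);           // build / refresh the exposure table
          } else {
            Map<String, Object> exposure = exposures.get(key);
            if (exposure != null) {              // join the metric event to its exposure
              event.put("experiment", exposure.get("experiment"));
              event.put("variant", exposure.get("variant"));
              collector.send(new OutgoingMessageEnvelope(OUTPUT, key, event));
            }
          }
        }
      }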

  25. Challenges

  26. Cascading failures

  27. Version mismatches among Samza, Kafka, Scala & the Pants build

  28. Streaming Metrics Adoption

  29. Multi-instance Kafka clusters?

  30. Bridge the gap between batch and real-time tables.

  31. Thank You!
