Streaming data pipelines @ Slack

The Slack data platform is evolving from a batch system to near real-time. I will also touch on how Samza helps us build low-latency data pipelines and an experimentation framework.

Ananth Packkildurai

December 04, 2017

Transcript

  1. About Slack

    Public launch: 2014. 800+ employees across 7 countries worldwide, with HQ in San Francisco. A diverse set of industries including software/technology, retail, media, telecom, and professional services.
  2. Data usage

    • 1 in 3 access the data warehouse per week
    • 500+ tables
    • 400k events per second at peak
  3. Performance & Experimentation

    • Engineering and CE teams should be able to detect performance bottlenecks proactively.
    • Engineers should be able to see their experiment performance in near real-time.
  4. Why Analytics Kafka

    • Keeps the load on the DW Kafka cluster predictable.
    • Easier to upgrade and verify newer Kafka versions.
    • A smaller Kafka cluster is relatively more straightforward to operate.
  5. Samza pipeline design

    • Content-based router pattern (see the sketch after this list).
    • Router: deserializes Kafka events and adds instrumentation.
    • Processor: an abstraction for a streaming job; adds the sink operation and instrumentation.
    • Converter: implements the business logic (join, filter, projection, etc.).
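    The talk shows no code, but a router stage in Samza's low-level API might look roughly like the minimal sketch below. The RouterTask name, the "type" field, and the topic naming are illustrative assumptions, not Slack's actual implementation:

    import java.util.Map;

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Hypothetical content-based router: deserialized events are fanned
    // out to a per-event-type topic on the analytics Kafka cluster.
    public class RouterTask implements StreamTask {
      @Override
      @SuppressWarnings("unchecked")
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        // Assumes a JSON serde is configured for the input stream, so the
        // message arrives as a Map; "type" is an assumed routing field.
        Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
        String eventType = (String) event.get("type");

        // Route the event to its type-specific analytics topic.
        SystemStream out = new SystemStream("kafka", "analytics-" + eventType);
        collector.send(new OutgoingMessageEnvelope(out, event));

        // Instrumentation (counters, timers) would be registered via
        // Samza's MetricsRegistry and updated here.
      }
    }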
  6. Performance monitoring

    • Approximate percentiles using the Druid Histogram extension [http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf].
    • Unique users based on Druid's HyperLogLog implementation.
    • Slack bot integration to alert based on performance metrics.
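    As a concrete shape, a Druid timeseries query combining both aggregators might look like the following JSON payload. This is a hypothetical example: the datasource, column names, and interval are invented; "approxHistogramFold" and the "quantile" post-aggregator come from the druid-histogram extension, and "hyperUnique" is Druid's HyperLogLog-based aggregator:

    {
      "queryType": "timeseries",
      "dataSource": "perf_metrics",
      "granularity": "minute",
      "intervals": ["2017-12-04T00:00/2017-12-05T00:00"],
      "aggregations": [
        {"type": "approxHistogramFold", "name": "latency_hist", "fieldName": "latency_ms_hist"},
        {"type": "hyperUnique", "name": "unique_users", "fieldName": "user_id"}
      ],
      "postAggregations": [
        {"type": "quantile", "name": "latency_p99", "fieldName": "latency_hist", "probability": 0.99}
      ]
    }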
  7. Experimentation Framework

    • Both streams are hash-partitioned by team and user.
    • A RocksDB store holds the exposure table (the team_users_experimentation mapping).
    • Metrics events are range-joined with the exposure table (see the sketch below).
    • A periodic snapshot of RocksDB is quality-checked against the batch system.
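    A stream-table join of this kind in Samza's low-level API might look roughly like the sketch below. It is a simplified equi-join on the team/user key rather than the range join the slide describes, and all names (store, topics, key format) are assumptions. The "exposure-table" store would be declared in the job config with RocksDbKeyValueStorageEngineFactory so that Samza backs it with local RocksDB and a changelog topic:

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    // Hypothetical join task: because both input streams are hash-partitioned
    // by team and user, the exposure row for any metrics event is guaranteed
    // to live in this task's local RocksDB store.
    public class ExposureJoinTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> exposures;

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        exposures = (KeyValueStore<String, String>) context.getStore("exposure-table");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        String key = (String) envelope.getKey(); // assumed "team:user" key
        String stream = envelope.getSystemStreamPartition().getStream();

        if ("exposures".equals(stream)) {
          // Exposure stream: upsert the team/user -> experiment mapping.
          exposures.put(key, (String) envelope.getMessage());
        } else {
          // Metrics stream: enrich the event with its experiment assignment.
          String experiment = exposures.get(key);
          if (experiment != null) {
            collector.send(new OutgoingMessageEnvelope(
                new SystemStream("kafka", "experiment-metrics"),
                experiment + "|" + envelope.getMessage()));
          }
        }
      }
    }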