
Streaming data pipelines @ Slack

Slack's data platform is evolving from a batch system to near real-time. I will also touch on how Samza helps us build low-latency data pipelines and an experimentation framework.

Ananth Packkildurai

December 04, 2017

Transcript

  1. About Slack
    Public launch: 2014. 800+ employees across 7 countries worldwide. HQ in San Francisco. Diverse set of industries, including software/technology, retail, media, telecom, and professional services.
  2. Data usage
    1 in 3 access the data warehouse per week. 500+ tables. 400k events per sec at peak.
  3. Performance & Experimentation
    • Engineering & CE teams should be able to detect performance bottlenecks proactively.
    • Engineers should be able to see the performance of their experiments in near real-time.
  4. Why Analytics Kafka
    Keep the load in DW Kafka predictable. Easier to upgrade and verify newer Kafka versions. A smaller Kafka cluster is relatively more straightforward to operate.
  5. Samza pipeline design
    • Content-based router.
    • Router: deserializes Kafka events and adds instrumentation.
    • Processor: an abstraction for a streaming job; adds the sink operation and instrumentation.
    • Converter: implements business logic (join, filter, projection, etc.).
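
The Router / Processor / Converter split on this slide can be sketched in plain Python. This is only an illustration of the roles, not Samza's API and not Slack's code: the class names mirror the slide, while the `type` routing field, the `slow_api_call` converter, the 500 ms threshold, and the list used as a sink are all hypothetical.

```python
import json
from collections import Counter
from typing import Callable, Dict, List, Optional

Event = Dict[str, object]
# A converter holds business logic: it returns a new event, or None to drop it.
Converter = Callable[[Event], Optional[Event]]


class Processor:
    """One streaming job: runs a converter, writes to a sink, counts what happened."""

    def __init__(self, name: str, converter: Converter, sink: List[Event]):
        self.name = name
        self.converter = converter
        self.sink = sink          # stand-in for a real sink (assumption)
        self.metrics = Counter()  # instrumentation

    def process(self, event: Event) -> None:
        self.metrics["received"] += 1
        out = self.converter(event)
        if out is None:
            self.metrics["filtered"] += 1
            return
        self.sink.append(out)
        self.metrics["emitted"] += 1


class Router:
    """Content-based router: deserializes raw messages and dispatches on one field."""

    def __init__(self, route_key: str = "type"):
        self.route_key = route_key
        self.routes: Dict[str, List[Processor]] = {}
        self.metrics = Counter()

    def register(self, value: str, processor: Processor) -> None:
        self.routes.setdefault(value, []).append(processor)

    def handle(self, raw: bytes) -> None:
        self.metrics["received"] += 1
        try:
            event = json.loads(raw)
        except ValueError:  # covers JSON and UTF-8 decoding errors
            self.metrics["deserialize_errors"] += 1
            return
        if not isinstance(event, dict):
            self.metrics["deserialize_errors"] += 1
            return
        processors = self.routes.get(str(event.get(self.route_key)), [])
        if not processors:
            self.metrics["unrouted"] += 1
        for processor in processors:
            processor.process(event)


def slow_api_call(event: Event) -> Optional[Event]:
    """Hypothetical converter: keep calls of 500 ms or more, project three fields."""
    if event.get("duration_ms", 0) < 500:
        return None
    return {k: event.get(k) for k in ("team_id", "method", "duration_ms")}
```

Keeping deserialization and counters in the router and processor leaves each converter as a small pure function, which is one plausible reason for a split like this.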
  6. Performance monitoring
    • Approximate percentiles using the Druid histogram extension [http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf].
    • Unique users based on the Druid HyperLogLog implementation.
    • Slack bot integration to alert based on performance metrics.
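
To give a feel for how approximate percentiles can be computed in bounded memory, here is a simplified Python sketch of the streaming-histogram idea from the paper linked on this slide: keep a fixed number of (centroid, count) bins and merge the two closest bins whenever the limit is exceeded. It is not Druid's implementation, and the `quantile` method uses plain linear interpolation between centroids, which is a simplification of the paper's procedure. The class name and `max_bins` default are assumptions.

```python
import bisect
from typing import List


class StreamingHistogram:
    """Fixed-size histogram holding at most max_bins (centroid, count) pairs."""

    def __init__(self, max_bins: int = 50):
        self.max_bins = max_bins
        self.centroids: List[float] = []  # kept sorted ascending
        self.counts: List[int] = []

    def add(self, value: float) -> None:
        i = bisect.bisect_left(self.centroids, value)
        if i < len(self.centroids) and self.centroids[i] == value:
            self.counts[i] += 1
            return
        self.centroids.insert(i, value)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self) -> None:
        # Replace the adjacent pair with the smallest gap by its weighted mean.
        gaps = [self.centroids[j + 1] - self.centroids[j]
                for j in range(len(self.centroids) - 1)]
        j = gaps.index(min(gaps))
        total = self.counts[j] + self.counts[j + 1]
        merged = (self.centroids[j] * self.counts[j]
                  + self.centroids[j + 1] * self.counts[j + 1]) / total
        self.centroids[j:j + 2] = [merged]
        self.counts[j:j + 2] = [total]

    def quantile(self, q: float) -> float:
        """Estimate the q-quantile (0 <= q <= 1) by interpolating between centroids."""
        if not self.centroids:
            raise ValueError("empty histogram")
        target = q * sum(self.counts)
        # Treat each bin's mass as centered on its centroid.
        positions = []
        seen = 0.0
        for c in self.counts:
            positions.append(seen + c / 2.0)
            seen += c
        if target <= positions[0]:
            return self.centroids[0]
        if target >= positions[-1]:
            return self.centroids[-1]
        k = bisect.bisect_left(positions, target)
        lo, hi = positions[k - 1], positions[k]
        frac = (target - lo) / (hi - lo)
        return self.centroids[k - 1] + frac * (self.centroids[k] - self.centroids[k - 1])
```

Memory stays proportional to `max_bins` regardless of how many latency samples are added, at the cost of an estimate rather than an exact percentile.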
  7. Experimentation Framework
    • Both streams are hash partitioned by team & user.
    • RocksDB store for the exposure table (team_users_experimentation mapping).
    • Metrics events range join with the exposure table.
    • A periodic snapshot of RocksDB for quality checks against the batch system.
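
The exposure join on this slide can be sketched roughly as follows. Everything here is an assumption made for illustration: the `team:user:experiment` key layout, the field names, the choice to keep the first exposure, the rule that a metric only counts for exposures at or before its timestamp, and the use of a sorted in-memory list in place of RocksDB. "Range join" is interpreted as a prefix range scan over ordered keys. IDs are assumed not to contain `:`.

```python
import bisect
import zlib
from typing import Dict, Iterator, List


def partition_for(team_id: str, user_id: str, num_partitions: int) -> int:
    """Stable hash so exposure and metric events for one user share a partition."""
    key = f"{team_id}:{user_id}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


class ExposureStore:
    """In-memory stand-in for an ordered key-value store such as RocksDB."""

    def __init__(self) -> None:
        self._keys: List[str] = []  # kept sorted, like keys in an ordered store
        self._values: Dict[str, dict] = {}

    def put(self, team_id: str, user_id: str, experiment: str,
            variant: str, exposed_at: int) -> None:
        key = f"{team_id}:{user_id}:{experiment}"
        if key in self._values:
            return  # assumption: keep the first exposure
        bisect.insort(self._keys, key)
        self._values[key] = {"team_id": team_id, "user_id": user_id,
                             "experiment": experiment, "variant": variant,
                             "exposed_at": exposed_at}

    def scan(self, team_id: str, user_id: str) -> Iterator[dict]:
        """Range scan over every key that starts with 'team:user:'."""
        prefix = f"{team_id}:{user_id}:"
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._values[self._keys[i]]
            i += 1

    def snapshot(self) -> List[dict]:
        """Full dump in key order, e.g. to compare against a batch-computed table."""
        return [self._values[k] for k in self._keys]


def join_metric(store: ExposureStore, metric: dict) -> List[dict]:
    """Emit one row per experiment the user was exposed to by the metric's time."""
    rows = []
    for exposure in store.scan(metric["team_id"], metric["user_id"]):
        if exposure["exposed_at"] <= metric["ts"]:
            rows.append({**metric,
                         "experiment": exposure["experiment"],
                         "variant": exposure["variant"]})
    return rows
```

Because both streams would use the same `partition_for` key, the lookup stays local to the task that holds that user's exposure rows.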