
Streaming data pipelines @ Slack

Slack's data platform is evolving from a batch system to near real-time. I will also touch on how Samza helps us build low-latency data pipelines and an experimentation framework.

Ananth Packkildurai

December 04, 2017

Transcript

  1. Ananth Packkildurai
    Streaming data pipeline @ Slack

  2. About Slack
    Public launch: 2014
    800+ employees across 7 countries worldwide
    HQ in San Francisco
    Diverse set of industries, including software/technology, retail, media, telecom and professional services.

  3. An unprecedented adoption rate

  4. Agenda
    1. A bit of history
    2. NRT infrastructure & Use cases
    3. Challenges

  5. A bit of history

  6. March 2016
    5 Data Engineers
    350+ Slack employees
    2M Active users

  7. October 2017
    10 Data Engineers
    800+ Slack employees
    6M Active users

  8. Data usage
    1 in 3 access the data warehouse per week
    500+ tables
    400k events per sec at peak

  9. It is all about Slogs

  10. Well, not exactly

  11. Slog

  12. Slog

  13. NRT infrastructure & use cases

  14. What can go wrong?

  15. We want more...

  16. Performance & Experimentation
    ● Engineering & CE teams should be able to detect performance bottlenecks proactively.
    ● Engineers should be able to see their experiment performance in near real-time.

  17. Near Real-Time Pipeline

  18. Why Analytics Kafka
    ● Keeps the load on DW Kafka predictable.
    ● Easier to upgrade to and verify newer Kafka versions.
    ● A smaller Kafka cluster is more straightforward to operate.
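
    To make the split concrete, below is a minimal sketch of how a Samza job could be pointed at the separate analytics cluster, assuming Samza's Kafka system configuration keys; the "analytics-kafka" system name, the host names, and the "slog-router"/"slog" job and topic names are hypothetical, not Slack's actual setup.

      // Sketch only: register the small analytics Kafka cluster as its own Samza
      // "system" so streaming jobs never consume from or produce to the main DW
      // Kafka brokers directly. Host names and system/job names are made up.
      import java.util.Map;
      import org.apache.samza.config.Config;
      import org.apache.samza.config.MapConfig;

      public class AnalyticsKafkaConfig {
        public static Config build() {
          return new MapConfig(Map.of(
              "job.name", "slog-router",
              // Consume the raw slog topic from the analytics cluster only.
              "task.inputs", "analytics-kafka.slog",
              "systems.analytics-kafka.samza.factory",
                  "org.apache.samza.system.kafka.KafkaSystemFactory",
              "systems.analytics-kafka.consumer.zookeeper.connect",
                  "analytics-zk1:2181,analytics-zk2:2181",
              "systems.analytics-kafka.producer.bootstrap.servers",
                  "analytics-kafka1:9092,analytics-kafka2:9092"));
        }
      }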

  19. Samza pipeline design

  20. Samza pipeline design
    ● Content-based router.
    ● Router: deserializes Kafka events and adds instrumentation.
    ● Processor: the abstraction for a streaming job; adds the sink operation and instrumentation.
    ● Converter: implements the business logic (join, filter, projection, etc.); see the sketch below.
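
    Below is a minimal sketch of that router/processor/converter split, assuming Samza's low-level StreamTask API; the class name, event fields, and the "perf-metrics" output stream are hypothetical, and the converter is reduced to a filter plus projection.

      // Router, converter, and processor stages in one task, for illustration only.
      import java.util.HashMap;
      import java.util.Map;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.system.OutgoingMessageEnvelope;
      import org.apache.samza.system.SystemStream;
      import org.apache.samza.task.MessageCollector;
      import org.apache.samza.task.StreamTask;
      import org.apache.samza.task.TaskCoordinator;

      public class SlogRouterTask implements StreamTask {
        // Downstream stream for performance events; other converters route elsewhere.
        private static final SystemStream PERF_STREAM =
            new SystemStream("analytics-kafka", "perf-metrics");

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          // Router: the message has already been deserialized by the serde
          // configured for the input stream; routing instrumentation goes here.
          @SuppressWarnings("unchecked")
          Map<String, Object> event = (Map<String, Object>) envelope.getMessage();

          // Converter: business logic -- filter to perf events and project fields.
          if ("perf".equals(event.get("type"))) {
            Map<String, Object> projected = new HashMap<>();
            projected.put("team_id", event.get("team_id"));
            projected.put("duration_ms", event.get("duration_ms"));

            // Processor: sink the converted event to the downstream stream.
            collector.send(new OutgoingMessageEnvelope(PERF_STREAM, projected));
          }
        }
      }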

  21. Performance monitoring

  22. Performance monitoring
    ● Approximate percentiles using the Druid Histogram extension [http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf]
    ● Unique users based on the Druid HyperLogLog implementation
    ● Slack bot integration to alert based on performance metrics
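
    The linked paper is the Ben-Haim & Tom-Tov streaming histogram that the Druid extension builds on. The toy sketch below only illustrates the underlying idea (keep a bounded set of centroids, merge the closest pair on overflow, read percentiles off the counts); it is not the extension's code, and Druid's implementation interpolates far more carefully.

      import java.util.Map;
      import java.util.TreeMap;

      // Toy approximate-percentile histogram in the spirit of Ben-Haim & Tom-Tov.
      // Assumes maxBins >= 2.
      public class StreamingHistogram {
        private final int maxBins;
        // Centroid value -> count, kept sorted so adjacent entries are neighbours.
        private final TreeMap<Double, Long> bins = new TreeMap<>();

        public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

        public void add(double value) {
          bins.merge(value, 1L, Long::sum);
          if (bins.size() > maxBins) {
            // Find the two adjacent centroids that are closest together...
            double prev = Double.NaN, left = 0, bestGap = Double.MAX_VALUE;
            for (double v : bins.keySet()) {
              if (!Double.isNaN(prev) && v - prev < bestGap) {
                bestGap = v - prev;
                left = prev;
              }
              prev = v;
            }
            // ...and replace them with a single count-weighted centroid.
            double right = bins.higherKey(left);
            long cl = bins.remove(left), cr = bins.remove(right);
            bins.merge((left * cl + right * cr) / (cl + cr), cl + cr, Long::sum);
          }
        }

        // Approximate p-th percentile (0 < p < 1): walk the centroids until the
        // target rank is covered.
        public double percentile(double p) {
          long total = 0, seen = 0;
          for (long c : bins.values()) total += c;
          long target = (long) Math.ceil(p * total);
          for (Map.Entry<Double, Long> e : bins.entrySet()) {
            seen += e.getValue();
            if (seen >= target) return e.getKey();
          }
          return bins.lastKey();
        }
      }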

  23. Experimentation framework

  24. Experimentation Framework
    ● Both streams are hash partitioned by team & user.
    ● A RocksDB store holds the exposure table (team_users_experimentation mapping).
    ● Metrics events are range joined with the exposure table; a simplified sketch follows below.
    ● A periodic snapshot of RocksDB is quality-checked against the batch system.
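
    A minimal sketch of how that join could look with Samza's low-level API and a RocksDB-backed KeyValueStore; the store, stream, and field names are hypothetical, and the range join is simplified to a point lookup on the team/user key.

      import java.util.Map;
      import org.apache.samza.config.Config;
      import org.apache.samza.storage.kv.KeyValueStore;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.system.OutgoingMessageEnvelope;
      import org.apache.samza.system.SystemStream;
      import org.apache.samza.task.InitableTask;
      import org.apache.samza.task.MessageCollector;
      import org.apache.samza.task.StreamTask;
      import org.apache.samza.task.TaskContext;
      import org.apache.samza.task.TaskCoordinator;

      public class ExperimentJoinTask implements StreamTask, InitableTask {
        private static final SystemStream OUTPUT =
            new SystemStream("analytics-kafka", "experiment-metrics");

        // RocksDB-backed exposure table: "team_id:user_id" -> exposure event.
        private KeyValueStore<String, Map<String, Object>> exposures;

        @Override
        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
          exposures = (KeyValueStore<String, Map<String, Object>>) context.getStore("exposure-store");
        }

        @Override
        @SuppressWarnings("unchecked")
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
          Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
          // Both inputs are hash partitioned by team & user, so the exposure row
          // for this key always lives in this task's local store.
          String key = event.get("team_id") + ":" + event.get("user_id");

          if ("exposures".equals(envelope.getSystemStreamPartition().getStream())) {
            exposures.put(key, event);           // build / refresh the exposure table
          } else {
            Map<String, Object> exposure = exposures.get(key);
            if (exposure != null) {              // join the metric event to its exposure
              event.put("experiment", exposure.get("experiment"));
              event.put("variant", exposure.get("variant"));
              collector.send(new OutgoingMessageEnvelope(OUTPUT, key, event));
            }
          }
        }
      }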

  25. Challenges

  26. Cascading failures

  27. Version mismatches among Samza, Kafka, Scala & the Pants build

  28. Streaming Metrics Adoption

  29. Multi-instance Kafka clusters?

  30. Bridge the gap between batch and real-time tables.

  31. Thank You!
