Ananth Packkildurai
Streaming data pipeline @ Slack
Slide 2
Slide 2 text
● Public launch: 2014
● 800+ employees across 7 countries worldwide
● HQ in San Francisco
● Diverse set of industries, including software/technology, retail, media, telecom and professional services
About Slack
Slide 3
Slide 3 text
An unprecedented adoption rate
Slide 4
Slide 4 text
Agenda
1. A bit of history
2. NRT infrastructure & Use cases
3. Challenges
Slide 5
Slide 5 text
A bit of history
Slide 6
Slide 6 text
March 2016
5 Data Engineers
350+ Slack employees
2M Active users
Slide 7
Slide 7 text
October 2017
10 Data Engineers
800+ Slack employees
6M Active users
Slide 8
Slide 8 text
Data usage
1 in 3 access the data warehouse per week
500+ tables
400k events per sec at peak
Slide 9
Slide 9 text
It is all about Slogs
Slide 10
Slide 10 text
Well, not exactly
Slide 11
Slide 11 text
Slog
Slide 12
Slide 12 text
Slog
Slide 13
Slide 13 text
NRT infrastructure & use cases
Slide 14
Slide 14 text
What can go wrong?
Slide 15
Slide 15 text
We want more...
Slide 16
Slide 16 text
Performance & Experimentation
● Engineering & CE teams should be able to detect performance bottlenecks proactively.
● Engineers should be able to see the performance of their experiments in near real-time.
Slide 17
Slide 17 text
Near Real-time Pipeline
Slide 18
Slide 18 text
● Keeps the load on the DW Kafka cluster predictable.
● Easier to upgrade and verify newer Kafka versions.
● A smaller Kafka cluster is more straightforward to operate.
Why Analytics Kafka
Slide 19
Slide 19 text
Samza pipeline design
Slide 20
Slide 20 text
● Content-based Router pattern
● Router: deserializes Kafka events and adds instrumentation.
● Processor: an abstraction for a streaming job; adds the sink operation and instrumentation.
● Converter: implements the business logic (join, filter, projection, etc.); see the sketch after this list.
Samza pipeline design
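A minimal sketch of the router stage, assuming the classic Samza low-level StreamTask API and a JSON serde on the input stream; the class name, output stream names, and the "type" field are illustrative, not Slack's actual code.

```java
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ContentBasedRouterTask implements StreamTask {
  // Downstream topics consumed by dedicated processor jobs (illustrative names).
  private static final SystemStream PERF_STREAM = new SystemStream("kafka", "perf-events");
  private static final SystemStream EXPERIMENT_STREAM = new SystemStream("kafka", "experiment-events");

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // The configured JSON serde has already deserialized the raw Kafka event into a Map.
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();

    // Content-based routing: inspect the event and pick the downstream stream.
    // A real router would also emit instrumentation (counters, lag metrics) here.
    SystemStream target = "perf".equals(event.get("type")) ? PERF_STREAM : EXPERIMENT_STREAM;
    collector.send(new OutgoingMessageEnvelope(target, envelope.getKey(), event));
  }
}
```

Processors and converters follow the same StreamTask shape, with the converter carrying the join/filter/projection logic and the processor adding the sink and instrumentation.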
Slide 21
Slide 21 text
Performance monitoring
Slide 22
Slide 22 text
● Approximate percentiles using the Druid Approximate Histogram extension
[http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf]
● Unique users based on Druid's HyperLogLog implementation
● Slack bot integration to alert on performance metrics
Performance monitoring
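A hedged example of what such a query could look like: a Druid native timeseries query combining the approximate-histogram extension (approxHistogramFold aggregator plus a quantile post-aggregator) with a hyperUnique aggregator for unique users, posted to the broker from Java. The datasource, column names, interval, and broker URL are assumptions, and the druid-histogram extension must be loaded on the cluster.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidPercentileQuery {
  public static void main(String[] args) throws Exception {
    // p99 latency from the approximate histogram, plus unique users from HyperLogLog.
    String query = """
        {
          "queryType": "timeseries",
          "dataSource": "perf_metrics",
          "granularity": "minute",
          "intervals": ["2017-10-01T00:00Z/2017-10-02T00:00Z"],
          "aggregations": [
            {"type": "approxHistogramFold", "name": "latency_hist", "fieldName": "latency_hist"},
            {"type": "hyperUnique", "name": "unique_users", "fieldName": "user_id_hll"}
          ],
          "postAggregations": [
            {"type": "quantile", "name": "latency_p99", "fieldName": "latency_hist", "probability": 0.99}
          ]
        }
        """;

    // POST the native query to the Druid broker; an alerting bot could run this on a schedule
    // and page when latency_p99 crosses a threshold.
    HttpRequest request = HttpRequest.newBuilder(URI.create("http://druid-broker:8082/druid/v2"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(query))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```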
Slide 23
Slide 23 text
Experimentation framework
Slide 24
Slide 24 text
● Both streams are hash-partitioned by Team & User.
● A RocksDB store holds the exposure table (team_users_experimentation mapping).
● Metric events are range-joined with the exposure table.
● A periodic snapshot of RocksDB is quality-checked against the batch system.
Experimentation Framework
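A minimal sketch of the stream-table join under the classic Samza low-level API: exposure events populate a RocksDB-backed KeyValueStore, and metric events look up their exposure by the co-partitioned team:user key. Stream, store, and field names are illustrative, and the range join from the slide is simplified here to a plain key lookup.

```java
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class ExperimentJoinTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "experiment-metrics-joined");

  // RocksDB-backed exposure table keyed by "team:user" (store name is illustrative
  // and must match the store declared in the job config).
  private KeyValueStore<String, String> exposureStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    exposureStore = (KeyValueStore<String, String>) context.getStore("exposure-store");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
    String key = event.get("team_id") + ":" + event.get("user_id");

    String stream = envelope.getSystemStreamPartition().getStream();
    if ("experiment-exposures".equals(stream)) {
      // Exposure event: remember which experiment variant this team+user saw.
      exposureStore.put(key, (String) event.get("experiment_variant"));
    } else {
      // Metric event: enrich with the exposure; the join works because both
      // streams are hash-partitioned by team & user.
      String variant = exposureStore.get(key);
      if (variant != null) {
        event.put("experiment_variant", variant);
        collector.send(new OutgoingMessageEnvelope(OUTPUT, key, event));
      }
    }
  }
}
```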
Slide 25
Slide 25 text
Challenges
Slide 26
Slide 26 text
Cascading failures
Slide 27
Slide 27 text
Version mismatches among Samza, Kafka, Scala & the Pants build