Ananth Packkildurai
Streaming data pipeline @ Slack
Slide 2
Slide 2 text
● Public launch: 2014
● 800+ employees across 7 countries worldwide
● HQ in San Francisco
● Diverse set of industries, including software/technology, retail, media, telecom and professional services
About Slack
Slide 3
Slide 3 text
An unprecedented adoption rate
Slide 4
Slide 4 text
Agenda
1. A bit of history
2. NRT infrastructure & Use cases
3. Challenges
Slide 5
Slide 5 text
A bit of history
Slide 6
Slide 6 text
March 2016
5 Data Engineers
350+ Slack employees
2M Active users
Slide 7
Slide 7 text
October 2017
10 Data Engineers
800+ Slack employees
6M Active users
Slide 8
Slide 8 text
Data usage
1 in 3 access the data warehouse per week
500+ tables
400k events per sec at peak
Slide 9
Slide 9 text
It is all about Slogs
Slide 10
Slide 10 text
Well, not exactly
Slide 11
Slide 11 text
Slog
Slide 12
Slide 12 text
Slog
Slide 13
Slide 13 text
NRT infrastructure & use cases
Slide 14
Slide 14 text
What can go wrong?
Slide 15
Slide 15 text
We want more...
Slide 16
Slide 16 text
Performance & Experimentation
● Engineering & CE teams should be able to detect performance bottlenecks proactively.
● Engineers should be able to see the performance of their experiments in near real-time.
Slide 17
Slide 17 text
Near Real-time Pipeline
Slide 18
Slide 18 text
● Keeps the load on the DW Kafka cluster predictable.
● Easier to upgrade and verify newer Kafka versions.
● A smaller Kafka cluster is more straightforward to operate.
Why Analytics Kafka
Slide 19
Slide 19 text
Samza pipeline design
Slide 20
Slide 20 text
● Content-based Router pattern
● Router: deserializes Kafka events and adds instrumentation.
● Processor: an abstraction for a streaming job; adds the sink operation and instrumentation.
● Converter: implements the business logic (join, filter, projection, etc.); see the sketch after this list.
Samza pipeline design
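A minimal sketch of the router stage, assuming the classic Samza low-level StreamTask API and a JSON serde on the input stream; the class name, output stream names, and the "type" field are illustrative, not Slack's actual code.

```java
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ContentBasedRouterTask implements StreamTask {
  // Downstream topics consumed by dedicated processor jobs (illustrative names).
  private static final SystemStream PERF_STREAM = new SystemStream("kafka", "perf-events");
  private static final SystemStream EXPERIMENT_STREAM = new SystemStream("kafka", "experiment-events");

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // The configured JSON serde has already deserialized the raw Kafka event into a Map.
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();

    // Content-based routing: inspect the event and pick the downstream stream.
    // A real router would also emit instrumentation (counters, lag metrics) here.
    SystemStream target = "perf".equals(event.get("type")) ? PERF_STREAM : EXPERIMENT_STREAM;
    collector.send(new OutgoingMessageEnvelope(target, envelope.getKey(), event));
  }
}
```

Processors and converters follow the same StreamTask shape, with the converter carrying the join/filter/projection logic and the processor adding the sink and instrumentation.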
Slide 21
Slide 21 text
Performance monitoring
Slide 22
Slide 22 text
● Approximate percentiles using the Druid Approximate Histogram extension
[http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf]
● Unique users based on Druid's HyperLogLog implementation
● Slack bot integration to alert on performance metrics
Performance monitoring
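A hedged example of what such a query could look like: a Druid native timeseries query combining the approximate-histogram extension (approxHistogramFold aggregator plus a quantile post-aggregator) with a hyperUnique aggregator for unique users, posted to the broker from Java. The datasource, column names, interval, and broker URL are assumptions, and the druid-histogram extension must be loaded on the cluster.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidPercentileQuery {
  public static void main(String[] args) throws Exception {
    // p99 latency from the approximate histogram, plus unique users from HyperLogLog.
    String query = """
        {
          "queryType": "timeseries",
          "dataSource": "perf_metrics",
          "granularity": "minute",
          "intervals": ["2017-10-01T00:00Z/2017-10-02T00:00Z"],
          "aggregations": [
            {"type": "approxHistogramFold", "name": "latency_hist", "fieldName": "latency_hist"},
            {"type": "hyperUnique", "name": "unique_users", "fieldName": "user_id_hll"}
          ],
          "postAggregations": [
            {"type": "quantile", "name": "latency_p99", "fieldName": "latency_hist", "probability": 0.99}
          ]
        }
        """;

    // POST the native query to the Druid broker; an alerting bot could run this on a schedule
    // and page when latency_p99 crosses a threshold.
    HttpRequest request = HttpRequest.newBuilder(URI.create("http://druid-broker:8082/druid/v2"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(query))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}
```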
Slide 23
Slide 23 text
Experimentation framework
Slide 24
Slide 24 text
● Both streams are hash-partitioned by Team & User.
● A RocksDB store holds the exposure table (team_users_experimentation mapping).
● Metric events are range-joined with the exposure table.
● A periodic snapshot of RocksDB is quality-checked against the batch system.
Experimentation Framework
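A minimal sketch of the stream-table join under the classic Samza low-level API: exposure events populate a RocksDB-backed KeyValueStore, and metric events look up their exposure by the co-partitioned team:user key. Stream, store, and field names are illustrative, and the range join from the slide is simplified here to a plain key lookup.

```java
import java.util.Map;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class ExperimentJoinTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "experiment-metrics-joined");

  // RocksDB-backed exposure table keyed by "team:user" (store name is illustrative
  // and must match the store declared in the job config).
  private KeyValueStore<String, String> exposureStore;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    exposureStore = (KeyValueStore<String, String>) context.getStore("exposure-store");
  }

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
    String key = event.get("team_id") + ":" + event.get("user_id");

    String stream = envelope.getSystemStreamPartition().getStream();
    if ("experiment-exposures".equals(stream)) {
      // Exposure event: remember which experiment variant this team+user saw.
      exposureStore.put(key, (String) event.get("experiment_variant"));
    } else {
      // Metric event: enrich with the exposure; the join works because both
      // streams are hash-partitioned by team & user.
      String variant = exposureStore.get(key);
      if (variant != null) {
        event.put("experiment_variant", variant);
        collector.send(new OutgoingMessageEnvelope(OUTPUT, key, event));
      }
    }
  }
}
```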
Slide 25
Slide 25 text
Challenges
Slide 26
Slide 26 text
Cascading failures
Slide 27
Slide 27 text
Version mismatches among Samza, Kafka, Scala & the Pants build