Slide 1

Slide 1 text

Building an SLA tracking tool
Aish Raj Dahal, PagerDuty

Slide 2

Slide 2 text

Aish

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

“Never build your own monitoring.” Someone on Hacker News

Slide 5

Slide 5 text

“But… why?” A future comment on Hacker News

Slide 6

Slide 6 text

…because sometimes there are a few “special” cases

Slide 7

Slide 7 text

Chapter I: The problem

Slide 8

Slide 8 text

State of the world, circa 2014
[Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Cassandra, Poll, REST API]

Slide 9

Slide 9 text

How many of these entities were actually processed?

Slide 10

Slide 10 text

How do we know if processing is not fast enough?

Slide 11

Slide 11 text

Some? None? All of them?

Slide 12

Slide 12 text

Chapter II: The original solution

Slide 13

Slide 13 text

Pre metrics, circa 2014
[Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Cassandra, Poll, REST API]

Slide 14

Slide 14 text

Post metrics, circa 2015
[Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Cassandra, Poll, REST API, Metrics Processor, Cassandra, Poll, REST API, Poll]

Slide 15

Slide 15 text

What was it?

Slide 16

Slide 16 text

It was a handcrafted custom script

Slide 17

Slide 17 text

What did it do?

Slide 18

Slide 18 text

The insides, or how we polled all the time
[Diagram labels: Cassandra, Metrics Processor, Processor, RPC, REST API, Poll]

Slide 19

Slide 19 text

What did this mean?

Slide 20

Slide 20 text

We were polling our Cassandra queue

Slide 21

Slide 21 text

We were polling our processor’s database

Slide 22

Slide 22 text

Additional load made things worse

Slide 23

Slide 23 text

Our polling exacerbated the problem
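
The cost of this kind of polling can be sketched in a few lines. The following is a minimal Python illustration, not the original script: `queue_store` and `processor_store` are hypothetical in-memory stand-ins for the Cassandra queue and the processor's database, and `poll_once` is an invented name. It counts one read for each queued record and one lookup against the processor's store, so the read load of a single pass grows with the backlog.

```python
def poll_once(queue_store, processor_store):
    """One polling pass over hypothetical in-memory stand-ins.

    queue_store: set of record ids currently known to the queue.
    processor_store: set of record ids the processor has finished.
    Returns (processed_count, total_count, reads_issued).
    """
    reads = 0
    processed = 0
    for record_id in queue_store:
        reads += 1  # reading the record from the queue
        reads += 1  # looking the record up in the processor's store
        if record_id in processor_store:
            processed += 1
    return processed, len(queue_store), reads


if __name__ == "__main__":
    done, total, reads = poll_once({"a", "b", "c"}, {"a"})
    print(f"{done}/{total} processed, {reads} reads issued")
```

When processing slows down, the backlog grows, and so does the number of reads each pass issues against stores that are already struggling.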

Slide 24

Slide 24 text

Chapter III: Enter Elixir

Slide 25

Slide 25 text

We replaced our Cassandra-based in-house queue with Kafka

Slide 26

Slide 26 text

“Complexity is the root cause of the vast majority of problems with software today.” Out of the Tar Pit (2006)

Slide 27

Slide 27 text

State of the world, circa 2017
[Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Kafka, Poll, REST API]

Slide 28

Slide 28 text

This meant…

Slide 29

Slide 29 text

RPC-based lookups were off the table

Slide 30

Slide 30 text

Polling the processing database was off the table

Slide 31

Slide 31 text

We needed a new way of measuring the processing status

Slide 32

Slide 32 text

Approach I: Let’s store everything in ElasticSearch

Slide 33

Slide 33 text

Rather than querying state in two places, we queried state in one place.

Slide 34

Slide 34 text

We built a pipeline to send the records and their state to a downstream store, ElasticSearch
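
The shape of this approach can be sketched as follows. This is an assumption-laden Python illustration: a plain dict stands in for the downstream store, and `record_state` and `collect_metrics` are hypothetical names, not the ElasticSearch API or the original pipeline code. The pipeline upserts each record's latest state, and the collector answers "how many are processed?" with a single query against that one store.

```python
from collections import Counter


def record_state(store, record_id, state):
    """Pipeline side: upsert the latest known state of a record."""
    store[record_id] = state


def collect_metrics(store):
    """Collector side: one aggregate query, here a full scan of the store."""
    return Counter(store.values())


if __name__ == "__main__":
    store = {}  # hypothetical stand-in for the downstream store
    record_state(store, "e1", "incoming")
    record_state(store, "e2", "incoming")
    record_state(store, "e1", "processed")
    print(collect_metrics(store))
```

The collector still has to query the store repeatedly to keep its numbers fresh, so the polling has only moved to a different system.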

Slide 35

Slide 35 text

Approach I: ElasticSearch Metrics Collector
[Diagram labels: Incoming Records, Phoenix App, Kafka, Processed Records, Elastic Search]

Slide 36

Slide 36 text

It was okay until we had one small problem

Slide 37

Slide 37 text

The system crumbled when load increased.

Slide 38

Slide 38 text

We had just moved from polling one datastore to polling another

Slide 39

Slide 39 text

Approach II: ETS all Streams

Slide 40

Slide 40 text

All our records were events

Slide 41

Slide 41 text

A processed event record was the incoming event with some additional metadata

Slide 42

Slide 42 text

All events were streams

Slide 43

Slide 43 text

Solution: zipWith two Streams
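
One way to read "zipWith two Streams" is as a join of the incoming and processed streams on a shared event ID; that reading is an assumption, as are all names in this Python sketch. A dict plays the role that an ETS table would play in Elixir: incoming events are remembered in memory, and each processed event is matched against them to produce a latency, with no datastore to poll.

```python
class StreamJoiner:
    """Joins incoming and processed events by id, entirely in memory."""

    def __init__(self):
        # event_id -> arrival timestamp, for events seen but not yet processed
        self.pending = {}
        # seconds between arrival and processing, one entry per matched event
        self.latencies = []

    def on_incoming(self, event_id, ts):
        """Called for every event on the incoming stream."""
        self.pending[event_id] = ts

    def on_processed(self, event_id, ts):
        """Called for every event on the processed stream.

        Returns the processing latency, or None if no matching
        incoming event is known.
        """
        arrived = self.pending.pop(event_id, None)
        if arrived is None:
            return None
        latency = ts - arrived
        self.latencies.append(latency)
        return latency


if __name__ == "__main__":
    joiner = StreamJoiner()
    joiner.on_incoming("e1", 10.0)
    joiner.on_incoming("e2", 11.0)
    print(joiner.on_processed("e1", 12.5))  # latency for e1
    print(sorted(joiner.pending))           # events still unprocessed
```

Whatever remains in `pending` is the set of events that have arrived but not yet been processed, which is the question the earlier polling approaches were trying to answer.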

Slide 44

Slide 44 text

Approach II: ETS Metrics Collector
[Diagram labels: Incoming Records, ETS, Kafka, Processed Records]

Slide 45

Slide 45 text

Why did we do it?

Slide 46

Slide 46 text

Memory access is thousands of times faster than random disk access

Slide 47

Slide 47 text

Modeling the data in its natural form as events was more intuitive

Slide 48

Slide 48 text

We only wanted the processing status for events within a window
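
Restricting attention to a window also keeps the in-memory table bounded. As a hedged Python sketch (the `expire` function, the dict layout, and the 60-second window are illustrative assumptions, not values from the talk), entries older than the window can be dropped and reported as overdue:

```python
def expire(pending, now, window_seconds):
    """Remove entries older than the window and return their ids.

    pending: dict of event_id -> arrival timestamp (mutated in place).
    """
    overdue = [eid for eid, ts in pending.items() if now - ts > window_seconds]
    for eid in overdue:
        del pending[eid]
    return overdue


if __name__ == "__main__":
    pending = {"a": 0.0, "b": 50.0}
    print(expire(pending, now=100.0, window_seconds=60.0))
    print(pending)
```

An event that leaves the window without having been processed is a natural candidate for counting against the SLA.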

Slide 49

Slide 49 text

We had fewer moving parts. Simple is beautiful indeed.

Slide 50

Slide 50 text

Epilogue: Current State

Slide 51

Slide 51 text

Dealing with crashes

Slide 52

Slide 52 text

Recreate state from Kafka
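
Because the events live in a durable log, the in-memory state can be rebuilt by replaying them after a crash. The Python sketch below assumes a hypothetical log of `(kind, event_id, timestamp)` tuples standing in for messages read back from Kafka; it uses no Kafka client, and the tuple layout is invented for illustration.

```python
def rebuild_pending(log):
    """Replay a log of events and return the unprocessed ones.

    log: iterable of (kind, event_id, timestamp) tuples in log order,
         where kind is "incoming" or "processed".
    Returns a dict of event_id -> arrival timestamp.
    """
    pending = {}
    for kind, event_id, ts in log:
        if kind == "incoming":
            pending[event_id] = ts
        elif kind == "processed":
            pending.pop(event_id, None)
    return pending


if __name__ == "__main__":
    log = [
        ("incoming", "e1", 1.0),
        ("incoming", "e2", 2.0),
        ("processed", "e1", 3.0),
    ]
    print(rebuild_pending(log))
```

Combined with windowing, only the portion of the log that covers the current window needs to be replayed.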

Slide 53

Slide 53 text

Reconcile if required with a secondary datastore.

Slide 54

Slide 54 text

And if stuff still breaks… get paged

Slide 55

Slide 55 text

aishrajdahal
PS: PagerDuty’s hiring