Slide 1
Slide 1 text
Building an SLA tracking tool. Aish Raj Dahal, PagerDuty.
Slide 2
Slide 2 text
Aish
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
“Never build your own monitoring.” (Someone on Hacker News)
Slide 5
Slide 5 text
“But… why?” (A future comment on Hacker News)
Slide 6
Slide 6 text
…because sometimes there are a few “special” cases
Slide 7
Slide 7 text
Chapter I: The problem
Slide 8
Slide 8 text
State of the world, circa 2014. [Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Cassandra, Poll, REST API]
Slide 9
Slide 9 text
How many of these entities were actually processed?
Slide 10
Slide 10 text
How do we know when processing is not fast enough?
Slide 11
Slide 11 text
Some? None? All of them?
Slide 12
Slide 12 text
Chapter II: The original solution
Slide 13
Slide 13 text
Pre metrics, circa 2014. [Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Cassandra, Poll, REST API]
Slide 14
Slide 14 text
Post metrics, circa 2015. [Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Cassandra, Poll, REST API, Metrics Processor, Cassandra, Poll, REST API, Poll]
Slide 15
Slide 15 text
What was it?
Slide 16
Slide 16 text
It was a handcrafted, custom script
Slide 17
Slide 17 text
What did it do?
Slide 18
Slide 18 text
The insides, or how we polled all the time. [Diagram labels: Cassandra, Metrics Processor, Processor, RPC, REST API, Poll]
Slide 19
Slide 19 text
What did this mean?
Slide 20
Slide 20 text
We were polling our Cassandra queue
Slide 21
Slide 21 text
We were polling our processor’s database
Slide 22
Slide 22 text
Additional load made things worse
Slide 23
Slide 23 text
Our polling exacerbated the problem
Slide 24
Slide 24 text
Chapter III: Enter Elixir
Slide 25
Slide 25 text
We replaced our Cassandra-based in-house queue with Kafka
Slide 26
Slide 26 text
“Complexity is the root cause of the vast majority of problems with software today.” (Out of the Tar Pit, 2006)
Slide 27
Slide 27 text
State of the world, circa 2017. [Diagram labels: Enqueuer, Pre-processor, Processor, Incoming Record, Kafka, Poll, REST API]
Slide 28
Slide 28 text
This meant…
Slide 29
Slide 29 text
RPC-based lookups were off the table
Slide 30
Slide 30 text
Polling the processing database was off the table
Slide 31
Slide 31 text
We needed a new way of measuring the processing status
Slide 32
Slide 32 text
Approach I: Let’s store everything in ElasticSearch
Slide 33
Slide 33 text
Rather than querying state in two places, we queried state in one place.
Slide 34
Slide 34 text
We built a pipeline to send the records and their state to a downstream store, ElasticSearch
Slide 35
Slide 35 text
Approach I: ElasticSearch Metrics Collector. [Diagram labels: Incoming Records, Phoenix App, Kafka, Processed Records, Elastic Search]
Slide 36
Slide 36 text
It was okay until we had one small problem
Slide 37
Slide 37 text
The system crumbled when load increased.
Slide 38
Slide 38 text
We had just moved from polling one datastore to polling another
Slide 39
Slide 39 text
Approach II: ETS all Streams
Slide 40
Slide 40 text
All our records were events
Slide 41
Slide 41 text
A processed event record was the incoming event with some additional metadata
Slide 42
Slide 42 text
All events were streams
Slide 43
Slide 43 text
Solution: zipWith two Streams
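One way to picture the zip: two consumers write into a shared in-memory table keyed by event ID, and a match between an incoming and a processed event yields the processing latency. The sketch below is illustrative only and is not the code from the talk. It uses Python with a plain dict standing in for an ETS table, and the names `SlaTracker`, `on_incoming`, and `on_processed`, as well as the ID and timestamp fields, are assumptions.

```python
from typing import Dict, Optional


class SlaTracker:
    """Joins an incoming-event stream with a processed-event stream by event ID."""

    def __init__(self) -> None:
        # event_id -> time the event was received (a dict stands in for ETS)
        self.pending: Dict[str, float] = {}
        # event_id -> seconds between receipt and processing
        self.latencies: Dict[str, float] = {}

    def on_incoming(self, event_id: str, received_at: float) -> None:
        """Called by the consumer of the incoming-records stream."""
        self.pending[event_id] = received_at

    def on_processed(self, event_id: str, processed_at: float) -> Optional[float]:
        """Called by the consumer of the processed-records stream.

        Returns the latency, or None if no matching incoming event is known.
        """
        received_at = self.pending.pop(event_id, None)
        if received_at is None:
            return None
        latency = processed_at - received_at
        self.latencies[event_id] = latency
        return latency
```

Anything still in `pending` has been received but not yet processed, which is the question the earlier slides were asking.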
Slide 44
Slide 44 text
Approach II: ETS Metrics Collector. [Diagram labels: Incoming Records, ETS, Kafka, Processed Records]
Slide 45
Slide 45 text
Why did we do it?
Slide 46
Slide 46 text
Memory access is thousands of times faster than random disk access
Slide 47
Slide 47 text
Modeling the data in its natural form as events was more intuitive
Slide 48
Slide 48 text
We only wanted the processing status for events within a window
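Keeping only a window of events is what lets the table stay in memory. A minimal sketch of that idea, assuming a table that maps event ID to receipt time; the function name `expire_outside_window` and the window length are hypothetical, and the talk does not say how expiry was actually implemented.

```python
from typing import Dict, List


def expire_outside_window(
    pending: Dict[str, float], now: float, window_seconds: float
) -> List[str]:
    """Remove entries received before the window start and return their IDs.

    The returned IDs are events that were never matched inside the window,
    so a caller could report them as missed rather than silently drop them.
    """
    cutoff = now - window_seconds
    expired = [eid for eid, received_at in pending.items() if received_at < cutoff]
    for eid in expired:
        del pending[eid]
    return expired
```

Running this periodically bounds memory use by the event rate times the window length.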
Slide 49
Slide 49 text
We had fewer moving parts. Simple is beautiful indeed.
Slide 50
Slide 50 text
Epilogue: Current State
Slide 51
Slide 51 text
Dealing with crashes
Slide 52
Slide 52 text
Recreate state from Kafka
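Because the in-memory table is derived entirely from the event log, a restarted collector can rebuild it by replaying that log. The sketch below shows the replay idea with a list of dicts standing in for records read back from Kafka; the record fields `kind`, `id`, and `ts` are assumptions, and no real Kafka client is used.

```python
from typing import Dict, Iterable, Mapping


def rebuild_pending(log: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Replay records in order and return the events still awaiting processing."""
    pending: Dict[str, float] = {}
    for record in log:
        event_id = str(record["id"])
        if record["kind"] == "incoming":
            pending[event_id] = float(record["ts"])  # seen, not yet processed
        elif record["kind"] == "processed":
            pending.pop(event_id, None)  # matched, so no longer pending
    return pending
```

Replaying only the records that fall inside the tracking window would keep recovery time bounded.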
Slide 53
Slide 53 text
Reconcile with a secondary datastore if required.
Slide 54
Slide 54 text
And if stuff still breaks… get paged
Slide 55
Slide 55 text
aishrajdahal
PS: PagerDuty’s hiring