Lessons learned from building serverless, distributed architecture

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Introduction
I am Jalem Raj Rohit.
- Work on DevOps and Machine Learning full-time
- Moderate the DevOps and Data Science sites of Stack Overflow
- Contribute to random OSS projects

Slide 3

Slide 3 text

Setting the context
- Serverless, distributed system for processing ML workloads
- Up to 900 servers every run
- Batch architecture

Slide 4

Slide 4 text

LESSONS LEARNED

Slide 5

Slide 5 text

LESSON #1 Always return from your Lambda functions

Slide 6

Slide 6 text

Always return from your Lambda functions
- The cost of Lambda functions can go from ‘meh’ to ‘OMFG’ really quickly
- A function that has not returned is considered a failure by Lambda, which keeps retrying it [5 times]
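
A minimal sketch of what "always return" can look like in a Python handler. The `process` helper and the shape of `event` are hypothetical, and turning errors into a returned failure (instead of raising) is an assumption about what suits a batch workload, not a rule.

```python
import json


def process(record):
    """Hypothetical unit of work; stands in for the real ML workload step."""
    return {"id": record["id"], "status": "done"}


def handler(event, context):
    """Lambda entry point that returns a value on every code path."""
    try:
        results = [process(r) for r in event.get("records", [])]
        # Explicit return on success so the invocation finishes cleanly.
        return {"statusCode": 200, "body": json.dumps(results)}
    except Exception as exc:
        # Assumption: a reported failure is preferable here to a raised
        # exception, which Lambda would count as a failed invocation.
        return {"statusCode": 500, "body": json.dumps({"error": str(exc)})}
```

The trade-off: catching everything avoids repeated retries (and their cost), but the caller then has to inspect `statusCode` to notice failures.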

Slide 7

Slide 7 text

LESSON #2 Monitoring and Logging is still an unconquered beast

Slide 8

Slide 8 text

Monitoring and logging
- Monitoring a serverless system is very tricky
- Adding the distributed systems paradigm to it doesn’t really help
- Having a hosted server for monitoring serverless systems?

Slide 10

Slide 10 text

Monitoring and logging (cont...)
- Monitor the orchestration rather than trying to monitor all the servers
- Use the cloud provider’s dashboard as much as possible
- For logging, the closest thing to a best practice is to zip the log file and send it to a data store before the server termination task runs
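
The logging step above could be sketched roughly as follows. This uses gzip from the Python standard library in place of "zip", and `upload` is a hypothetical callable standing in for whatever data store client is in use; `copy_to_directory` is only a local stand-in for demonstration.

```python
import gzip
import shutil
from pathlib import Path


def compress_log(log_path: Path) -> Path:
    """Gzip the log file next to the original and return the archive path."""
    archive = log_path.with_name(log_path.name + ".gz")
    with open(log_path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return archive


def ship_log(log_path: Path, upload) -> Path:
    """Compress the log, then hand the archive to `upload`.

    `upload` is a hypothetical callable (for example a thin wrapper around
    an object-store client); it must finish before termination starts.
    """
    archive = compress_log(log_path)
    upload(archive)
    return archive


def copy_to_directory(target_dir: Path):
    """Stand-in uploader that copies into a local directory (demo only)."""
    def upload(archive: Path) -> None:
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(archive, target_dir / archive.name)
    return upload
```

Calling `ship_log` as the last step before the termination task keeps the log available after the server is gone.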

Slide 11

Slide 11 text

LESSON #3 Super-high scalability with relative ease

Slide 12

Slide 12 text

Super high scalability
- Super high scalability at a fraction of the cost
- Can be made to scale seamlessly with demand

Slide 13

Slide 13 text

LESSON #4 If it is a distributed serverless system, it needs to be self-healing

Slide 14

Slide 14 text

Self-healing
- Debugging a lost or faulty file in a distributed system is like finding a needle in a haystack
- Thus, the system should detect and repair such failures on its own: self-healing
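
One way to picture self-healing for lost or faulty files, as a sketch only: check every expected output and regenerate the broken ones, for a bounded number of rounds. The `is_valid` and `reprocess` callables are hypothetical; the talk does not describe the actual mechanism.

```python
def heal(expected, is_valid, reprocess, max_rounds=3):
    """Re-run work for missing or faulty outputs until all pass or rounds run out.

    expected   -- iterable of output names the run should have produced
    is_valid   -- hypothetical check: True if the output exists and is intact
    reprocess  -- hypothetical callable that regenerates one output
    Returns the set of outputs that are still broken after the last round.
    """
    broken = {name for name in expected if not is_valid(name)}
    for _ in range(max_rounds):
        if not broken:
            break
        for name in sorted(broken):
            reprocess(name)
        # Re-check only the outputs that were just regenerated.
        broken = {name for name in broken if not is_valid(name)}
    return broken
```

Bounding the rounds matters: an output that can never be repaired should surface as a reported failure rather than loop forever.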

Slide 15

Slide 15 text

LESSON #5 Having a distributed system doesn’t necessarily mean the load is distributed equally

Slide 16

Slide 16 text

Load Balancing
- Improper or poorly done load balancing defeats the whole purpose of having a distributed system
- Have proper load balancing techniques or algorithms in place wherever data is getting ingested
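
The talk does not name a specific algorithm. As one illustration, a greedy "largest job first, least-loaded worker" heuristic can be sketched as below; the job names and cost estimates are hypothetical.

```python
import heapq


def assign_least_loaded(job_sizes, worker_count):
    """Assign each job to the worker with the smallest load so far.

    job_sizes    -- dict of job name -> estimated cost (hypothetical units)
    worker_count -- number of workers available
    Returns a list with one list of job names per worker.
    """
    # Heap of (current_load, worker_index); the lightest worker is on top.
    heap = [(0, i) for i in range(worker_count)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(worker_count)]
    # Placing the largest jobs first keeps the final loads closer together.
    for name, size in sorted(job_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        assignment[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return assignment
```

Compared with splitting jobs by count alone, weighting by estimated cost avoids the case where one worker receives all the heavy inputs.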

Slide 17

Slide 17 text

LESSON #6 Compliance automation is good. Let’s do more of it

Slide 18

Slide 18 text

Compliance Automation
- A boon for teams with very strict compliance requirements
- No need to worry about the number of systems in production
- Tag-based and boundary-based detection
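
A rough sketch of the tag-based half of that detection: flag every resource that lacks a required tag. The policy in `REQUIRED_TAGS` and the resource dictionaries are hypothetical; in practice the inventory would come from the cloud provider's API.

```python
REQUIRED_TAGS = {"owner", "environment", "cost-center"}  # hypothetical policy


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on one resource."""
    tags = resource.get("tags", {})
    return {key for key in required if not tags.get(key)}


def non_compliant(resources, required=REQUIRED_TAGS):
    """Map resource id -> missing tags, for every resource that fails the check."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource, required)
        if gaps:
            report[resource["id"]] = gaps
    return report
```

Because the check runs over whatever inventory it is given, it behaves the same for 9 servers or 900.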

Slide 19

Slide 19 text

LESSON #7 Debugging and fixing serverless distributed systems is extremely difficult

Slide 20

Slide 20 text

Horrors of debugging/fixing serverless distributed systems
- These systems run in a nohup mode
- All the servers get terminated once the orchestration is completed
- So, if you are late in killing the process, you need to start all over again from the beginning

Slide 21

Slide 21 text

Horrors of debugging/fixing serverless distributed systems
- Watching the tail of the log file saves a lot of headaches
- The more distributed the workload is, the bigger the hell it is for the developer
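
Watching the tail of a log can also be done programmatically. The helper below is a small sketch (the function name and offset-based interface are assumptions, not from the talk) that returns only the complete lines appended since the last call.

```python
def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for whatever was appended after `offset`.

    Only complete lines are consumed; a partially written last line is
    left in place and picked up by the next call.
    """
    with open(path, "rb") as fh:
        fh.seek(offset)
        chunk = fh.read()
    # rfind returns -1 when no newline is present, which makes `end` zero.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Calling it in a loop with a short sleep, and passing the returned offset back in, behaves much like `tail -f` and makes it possible to react to an error line before the orchestration runs to completion.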