Lessons learned from building serverless, distributed architecture

Presented at DevOps Days India 2017

Jalem Raj Rohit

September 15, 2017

  1. Introduction
    - I am Jalem Raj Rohit
    - Work on DevOps and Machine Learning full-time
    - Moderate the DevOps and Data Science sites of Stack Overflow
    - Contribute to random OSS projects
  2. Setting the context
    - Serverless, distributed system for processing ML workloads
    - Up to 900 servers every run
    - Batch architecture
  3. Always return from your Lambda functions
    - The cost of Lambda functions can go from ‘meh’ to ‘OMFG’ really quickly
    - A function that has not returned is considered a failure by Lambda, which keeps retrying it [5 times]
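
A minimal sketch of a handler that returns on every code path, assuming a Python Lambda function. The `process` helper and the shape of the returned dictionary are hypothetical and only stand in for the real workload.

```python
import logging

logger = logging.getLogger(__name__)


def process(event):
    """Hypothetical unit of work; stands in for the real ML task."""
    if "payload" not in event:
        raise ValueError("event has no 'payload' key")
    return {"items": len(event["payload"])}


def handler(event, context):
    """Lambda entry point that returns a value on every code path."""
    try:
        result = process(event)
        return {"status": "ok", "result": result}
    except Exception as exc:
        # Log and return instead of letting the exception escape, so the
        # invocation is not reported as failed and retried.
        logger.exception("processing failed")
        return {"status": "error", "reason": str(exc)}
```

The trade-off is that a swallowed error no longer triggers a retry, so the caller or the orchestration has to check the returned `status` itself.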
  4. Monitoring and logging
    - Monitoring a serverless system is very tricky
    - Adding the distributed systems paradigm to it doesn’t really help
    - Having a hosted server for monitoring serverless systems?
  5. Monitoring and logging (cont.)
    - Monitor the orchestration rather than trying to monitor all the servers
    - Use the cloud provider’s dashboard as much as possible
    - For logging, the closest thing to a best practice is to zip the log file and send it to a data store before the server termination task
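
The logging step above could look roughly like this. It is a sketch, not the setup used in the talk: it uses gzip rather than a zip archive, and `upload` is a hypothetical callable that sends a file to whatever data store is in use.

```python
import gzip
import shutil
from pathlib import Path
from typing import Callable


def compress_log(log_path: Path) -> Path:
    """Gzip the log file next to the original and return the archive path."""
    archive = log_path.with_name(log_path.name + ".gz")
    with open(log_path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return archive


def ship_log_before_shutdown(log_path: Path, upload: Callable[[Path], None]) -> Path:
    """Compress the log and pass it to `upload` (hypothetical) before termination."""
    archive = compress_log(log_path)
    upload(archive)
    return archive
```

Calling `ship_log_before_shutdown` as the last step before the termination task keeps the log available after the server is gone.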
  6. Super high scalability
    - Super high scalability at a fraction of the cost
    - Can be made to scale seamlessly with demand
  7. Self-healing
    - Debugging a lost or faulty file in a distributed system is like finding a needle in a haystack
    - Thus, build self-healing into the system
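
One possible shape for self-healing, sketched under assumptions: the talk does not describe a mechanism, so the idea here is simply to check every expected output and re-run the tasks whose output is missing or faulty. `is_valid` and `rerun` are hypothetical callables supplied by the orchestration.

```python
from typing import Callable, Iterable, List


def heal_missing_outputs(
    expected: Iterable[str],
    is_valid: Callable[[str], bool],
    rerun: Callable[[str], None],
    max_attempts: int = 3,
) -> List[str]:
    """Re-run tasks whose output is missing or invalid; return those still broken."""
    pending = [name for name in expected if not is_valid(name)]
    for _ in range(max_attempts):
        if not pending:
            break
        for name in pending:
            rerun(name)
        # Keep only the outputs that are still missing or faulty.
        pending = [name for name in pending if not is_valid(name)]
    return pending
```

Anything still returned after `max_attempts` needs a human, but the list is already narrowed down to the files that matter.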
  8. Load balancing
    - Improper or poorly done load balancing defeats the whole purpose of having a distributed system
    - Have proper load balancing techniques or algorithms in place wherever data is ingested
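
As an illustration of a load balancing algorithm at the ingestion point, here is a sketch of greedy least-loaded assignment: items are taken largest first and each goes to the worker with the smallest total so far. This is one common technique, not necessarily the one used in the talk, and the names are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_least_loaded(sizes: Dict[str, int], workers: int) -> List[List[str]]:
    """Assign each item to the currently least-loaded worker, largest items first."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    heap = [(0, w) for w in range(workers)]  # (total load, worker index)
    heapq.heapify(heap)
    buckets: List[List[str]] = [[] for _ in range(workers)]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, w = heapq.heappop(heap)
        buckets[w].append(name)
        heapq.heappush(heap, (load + size, w))
    return buckets
```

With file sizes as the weights, no worker ends up with most of the data while the others sit idle.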
  9. Compliance automation
    - A boon for teams with very strict compliance requirements
    - No need to worry about the number of systems in production
    - Tag-based and boundary-based detection
  10. Horrors of debugging/fixing serverless distributed systems
    - These systems run in a nohup mode
    - All the servers get terminated once the orchestration is completed
    - So, if you are late in killing the process, you need to start all over again from the beginning
  11. Horrors of debugging/fixing serverless distributed systems (cont.)
    - Watching the tail of the log file saves a lot of headaches
    - The more distributed the workload is, the bigger the hell it is for the developer
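
Watching the tail of a log can be as simple as the sketch below, which returns the last few lines of a file. The function name and defaults are illustrative; on a server, `tail -f` does the same job interactively.

```python
from collections import deque
from typing import List


def tail(path: str, n: int = 10) -> List[str]:
    """Return the last `n` lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the final `n` lines while streaming the file.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

Polling this during a run shows a stuck or failing process early, before the whole orchestration has to be restarted.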