With Great Scalability Comes Great Responsibility

This is the story of how I took down one of our vendors’ services with an innocent serverless application. I wanted to retrieve data from one of our monitoring platforms to analyze SPS Commerce’s software performance. Initially, I wrote a script to collect the data using Python multiprocessing. To gather the data in a faster, more scalable, and more efficient way, I pivoted to a serverless architecture. Unfortunately, my solution ended up spawning requests faster than the REST API could handle. In this talk, we will cover the contextual pros and cons of several architectural patterns given real-world scalability constraints: orchestrating Lambdas with AWS Step Functions, multiprocessing with S3 triggers, and rate limiting with queues like SQS.

Dana Engebretson

November 15, 2017

Transcript

  1. @BigDana Building a Data Pipeline. Attempt 0: Python Multiprocessing on an EC2 Instance. Attempt 0.5: Python Multiprocessing on a Spot Instance. Attempt 1: Lambdas Orchestrated by Step Functions. Attempt 2: Using Event Triggers with S3. Attempt 3: Rate Limiting with Queues and CloudWatch-Triggered Lambdas.
  7. “FETCH ME ALL MY DATA FOR ALL MY AMAZON WARRIORS PLEASE!” Diagram labels: @BigDana; Hermes, the messenger god (our vendor’s API); The Oracle (one of our monitoring vendors).
  8. “HERE IS ALL YOUR DATA!” Diagram labels: @BigDana; Hermes; The Oracle.
  9. “FETCH ME ALL MY AMAZON WARRIOR TROOPS* PLEASE!” (*grouped by micro-service). Diagram labels: @BigDana; Hermes; The Oracle.
  10. Diagram labels: @BigDana; Hermes; The Oracle.
  11. Diagram labels: @BigDana; Hermes; The Oracle.
  12. “HERE ARE YOUR AMAZON WARRIOR TROOPS!” Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3.
  13. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3.
  14. “FOR TROOP 1, FETCH ME MY AMAZON WARRIORS PLEASE!” Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3.
  15. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3.
  16. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3.
  17. “HERE ARE YOUR AMAZON WARRIORS FOR TROOP 1!” Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2.
  18. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2.
  19. “FOR AMAZON WARRIOR 1, FETCH ME MY DATA GROUPS PLEASE!” Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2.
  20. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2.
  21. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2.
  22. “HERE ARE YOUR DATA GROUPS FOR AMAZON WARRIOR 1!” Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2; data group 1; data group 2; data group 3.
  23. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2; data group 1; data group 2; data group 3.
  24. “FOR AMAZON WARRIOR 1, AND DATA GROUP 1, FETCH ME THE DATA PLEASE!” Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2; data group 1; data group 2; data group 3.
  25. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2; data group 1; data group 2; data group 3.
  26. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2; data group 1; data group 2; data group 3.
  27. “HERE IS YOUR DATA FOR AMAZON WARRIOR 1 AND DATA GROUP 1!” Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2; data group 1; data group 2; data group 3; data.
  28. Diagram labels: @BigDana; Hermes; The Oracle; troop 1; troop 2; troop 3; amazon warrior 1; amazon warrior 2; data group 1; data group 2; data group 3; data.
  30. Initial Solution: Python Multiprocessing on an EC2 instance, continuously running. Easy for me to implement; not cheap; horizontal scaling seemed overkill; ongoing maintenance and management.
  31. Initial Solution: Python Multiprocessing on an EC2 instance, intermittently running. Easy for me to implement; cheaper, but still not cheap enough; horizontal scaling seemed overkill; ongoing maintenance and management.
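
The initial solution can be pictured as a level-by-level walk of the hierarchy from the earlier slides (troops, then amazon warriors, then data groups, then data), with each level spread across worker processes. The following is only a minimal sketch under stated assumptions: `FAKE_API` and every function name are hypothetical stand-ins for the vendor's REST API, which the slides do not show.

```python
import multiprocessing

# Hypothetical stand-in for the vendor's REST API: a nested dict of
# troops -> amazon warriors -> data groups -> data.
FAKE_API = {
    "troop 1": {
        "amazon warrior 1": {"data group 1": [1, 2], "data group 2": [3]},
        "amazon warrior 2": {"data group 1": [4]},
    },
    "troop 2": {"amazon warrior 3": {"data group 1": [5, 6]}},
}


def list_troops():
    return sorted(FAKE_API)


def list_warriors(troop):
    return [(troop, warrior) for warrior in sorted(FAKE_API[troop])]


def list_data_groups(path):
    troop, warrior = path
    return [(troop, warrior, group) for group in sorted(FAKE_API[troop][warrior])]


def fetch_data(path):
    troop, warrior, group = path
    return path, FAKE_API[troop][warrior][group]


def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]


def collect(map_fn):
    """Walk the hierarchy level by level; map_fn parallelizes each level."""
    warriors = flatten(map_fn(list_warriors, list_troops()))
    groups = flatten(map_fn(list_data_groups, warriors))
    return dict(map_fn(fetch_data, groups))


if __name__ == "__main__":
    # Each level is fanned out over a fixed pool of worker processes.
    with multiprocessing.Pool(processes=4) as pool:
        results = collect(pool.map)
    print(len(results), "data groups fetched")
```

Passing the built-in `map` instead of `pool.map` runs the same walk serially, which is convenient for debugging. The pool size is the only throttle on how fast requests reach the API.
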
  34. Revision: Python Multiprocessing on a Spot instance instead of an EC2 instance. Significantly cheaper; you need to configure how to restart where the previous process left off.
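
Restarting where the previous process left off usually comes down to recording which tasks are finished. A minimal sketch, assuming tasks have string ids and that a JSON checkpoint file is acceptable; the function names are hypothetical, and a local file stands in for whatever durable storage would survive the Spot instance being reclaimed.

```python
import json
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of task ids recorded by a previous run, if any."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return set(json.load(f))


def save_done(checkpoint_path, done):
    """Write the checkpoint atomically so an interruption cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, checkpoint_path)


def run(tasks, process, checkpoint_path):
    """Process every task not yet checkpointed; return the ids handled now."""
    done = load_done(checkpoint_path)
    handled = []
    for task in tasks:
        if task in done:
            continue  # finished by an earlier run
        process(task)
        done.add(task)
        save_done(checkpoint_path, done)
        handled.append(task)
    return handled
```

If the instance disappears mid-run, calling `run` again with the same task list and checkpoint path skips everything already recorded and resumes with the first unfinished task.
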
  35. The Cloud: leasing a car. Owning your own server: owning your own car. Spot Instances: renting a car.
  36. The Cloud: leasing a car. Owning your own server: owning your own car. Spot Instances: renting a car. Lambdas: car2go.
  39. Lambda: a Function as a Service. Compute: proportional to memory. Time: max of 5 minutes. Memory: max of 1536 MB.
  40. Lambda: Function as a Service. Compute; Time; Memory; Dependency Zip File size: 250 MB uncompressed code/dependencies.
  41. Lambda: Function as a Service. Compute; Time; Memory; Dependency Zip File size: 250 MB uncompressed code/dependencies (fastparquet didn’t fit).
  42. Lambda: Function as a Service. Compute; Time; Memory; Dependency Zip File size; Ephemeral Disk Capacity: max of 512 MB.
  43. Lambda: Function as a Service. Compute; Time; Memory; Ephemeral Disk Capacity; Dependency Zip File size; concurrent executions: default max of 1000 per account.
  44. Lambda: Function as a Service. Compute; Time; Memory; Ephemeral Disk Capacity; Dependency Zip File size; concurrent executions; # of file descriptors; # of processes and threads; Invoke request body payload size.
  45. Lambda: Function as a Service. Compute; Time; Memory; Ephemeral Disk Capacity; Dependency Zip File size; concurrent executions; # of file descriptors; # of processes and threads; Invoke request body payload size; Linux.
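
One common way to live within the execution time limit listed above is to watch the remaining budget and stop early. A minimal sketch, assuming the event carries a hypothetical `items` list and that unfinished items are simply returned to the caller; `get_remaining_time_in_millis()` is the method the Lambda Python runtime provides on the context object, while `process` and `safety_margin_ms` are illustrative parameters.

```python
def handler(event, context, process=print, safety_margin_ms=10_000):
    """Process items until the Lambda's time budget is nearly exhausted."""
    items = list(event.get("items", []))  # "items" is a hypothetical event field
    processed = 0
    # Stop while there is still a safety margin left before the timeout.
    while items and context.get_remaining_time_in_millis() > safety_margin_ms:
        process(items.pop(0))
        processed += 1
    # Whatever is left over must be rescheduled by the caller.
    return {"processed": processed, "remaining": items}
```
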
  46. Parallel Processing Step Functions can do vs Dynamic Fan Out Parallel Processing. Diagram labels: start; end; start; …; end.
  47. Defined Fan Out Parallel Processing, or send to an EC2 instance. Diagram labels: start; …; EC2.
  48. Step Functions: easy-to-implement workflow management for Lambdas and EC2 instances; built-in back off policy for retrying Lambdas; doesn’t yet natively support a dynamic one-to-many fan out architecture of Lambdas.
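
A defined fan out means the number of parallel branches is written into the state machine definition ahead of time. The sketch below builds such a definition in the Amazon States Language, with a `Parallel` state and a `Retry` block for the built-in back off. It is an illustration under assumptions rather than the speaker's actual state machine: the function name, state names, retry values, and the ARNs in the usage note are all hypothetical.

```python
import json


def parallel_state_machine(lambda_arns):
    """Build an Amazon States Language definition with one branch per ARN."""
    branches = []
    for i, arn in enumerate(lambda_arns):
        name = f"Worker{i}"  # hypothetical state name
        branches.append({
            "StartAt": name,
            "States": {
                name: {
                    "Type": "Task",
                    "Resource": arn,
                    # Built-in retry with exponential back off.
                    "Retry": [{
                        "ErrorEquals": ["States.ALL"],
                        "IntervalSeconds": 2,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }],
                    "End": True,
                }
            },
        })
    return {
        "StartAt": "FanOut",
        "States": {"FanOut": {"Type": "Parallel", "Branches": branches, "End": True}},
    }
```

For example, `json.dumps(parallel_state_machine(["arn:aws:lambda:us-east-1:123456789012:function:fetch-troop-1"]), indent=2)` produces a definition with a single branch. The branch count is fixed when the definition is generated, so it cannot follow however many troops the API returns at run time, which is the limitation the slide describes.
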
  51. Using Event Triggers. Diagram labels: start; S3 Bucket (Simple Storage Service) with prefixes troops/, amazon_warriors/, data_groups/, data/; Circular dependency.
  52. Using Event Triggers with S3: S3 is highly scalable and durable; nested event triggers create super fast multiprocessing; you can’t control the rate at which the Lambdas get spawned; it will only retry a Lambda 2 times (you can extend this manually); you can lose in-flight tasks if there is a system outage.
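
The nested trigger pattern can be sketched as a Lambda handler that, for every object created under one prefix, writes one object per child under the next prefix, so that each write fires the next round of Lambdas. This is a minimal sketch under assumptions: the key layout below the prefixes is invented, `list_children` is a hypothetical callable standing in for the vendor API lookup, and `s3` is expected to behave like a boto3 S3 client.

```python
from urllib.parse import unquote_plus

# Prefix chain taken from the slide; each level's objects trigger the next.
NEXT_PREFIX = {
    "troops/": "amazon_warriors/",
    "amazon_warriors/": "data_groups/",
    "data_groups/": "data/",
}


def make_handler(s3, list_children):
    """Return a Lambda handler that writes one child object per S3 event record."""
    def handler(event, context):
        written = []
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            prefix, _, name = key.partition("/")
            next_prefix = NEXT_PREFIX.get(prefix + "/")
            if next_prefix is None:
                continue  # data/ is the last level; nothing more to spawn
            for child in list_children(key):
                child_key = f"{next_prefix}{name}/{child}"
                # Every put is a new event, so the fan out has no rate limit.
                s3.put_object(Bucket=bucket, Key=child_key, Body=b"")
                written.append(child_key)
        return written
    return handler
```

Nothing in this handler bounds how quickly the next level is spawned, which matches the drawback listed on the slide.
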
  55. Rate limiting with Queues and CloudWatch-triggered Lambdas. Diagram labels: Troops; Amazon Warriors; Data Groups; raw_data/; processed/; data/.
  56. Rate limiting with Queues and CloudWatch-triggered Lambdas. Diagram labels: Troops; Amazon Warriors; Data Groups; raw_data/; processed/; raw_data/; processed/.
  57. Chart: Queue Depth Over Time. Y-axis: 0 to 30000; x-axis: 12 AM to 10 PM; series: Amazon Warrior Troops, Amazon Warriors, Data Groups.
  58. Rate limiting with SQS Queues and CloudWatch-triggered Lambdas, with Dead Letter Queues. Diagram labels: Troops; Amazon Warriors; Data Groups; raw_data/; processed/.
  59. Chart: Queue Depth Over Time. Y-axis: 0 to 30000; x-axis: 12 AM to 10 PM; series: Amazon Warrior Troops, Amazon Warriors, Data Groups.
  60. Chart: Queue Depth Over Time. Y-axis: 0 to 30000; x-axis: 12 AM to 10 PM; series: Amazon Warrior Troops, Amazon Warriors, Data Groups.
  61. SQS Queues: highly scalable and resilient; can store failed tasks indefinitely; give you control to rate limit the processing of the tasks in the queue; a small task can wait in line on a queue that already has many tasks.
  62. Lambdas triggered by CloudWatch Rules: allow for quite specific schedules; fastest schedule is every 1 minute; don’t scale responsively to your queue depth.
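
The rate limiting described above can be sketched as a function that a scheduled Lambda runs on each tick: it takes a bounded number of messages off the queue and then stops, so throughput is capped at that bound per schedule interval. This is a minimal sketch, not the speaker's code. `sqs` is assumed to behave like a boto3 SQS client (`receive_message` and `delete_message` are the calls used), while `drain`, `process`, and `max_messages` are hypothetical names.

```python
def drain(sqs, queue_url, process, max_messages=50):
    """Process at most max_messages from the queue, then stop."""
    handled = 0
    while handled < max_messages:
        # A single receive call returns at most 10 messages.
        batch_size = min(10, max_messages - handled)
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=batch_size
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # nothing currently visible on the queue
        for message in messages:
            handled += 1
            try:
                process(message["Body"])
            except Exception:
                # Leave the message on the queue; it becomes visible again
                # after the visibility timeout, and a redrive policy can
                # move repeated failures to a dead letter queue.
                continue
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
    return handled
```

With a one-minute schedule and `max_messages=50`, the vendor API would see at most 50 tasks per minute from this queue no matter how deep the queue grows. The trade-off is the one the slide names: the drain rate does not respond to queue depth.
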
  63. “IT TAKES REAL CHARACTER TO ADMIT ONE’S FAILURES, AND NOT A LITTLE WISDOM TO TAKE YOUR PROFITS FROM DEFEAT.”
  64. Takeaways: Lambdas are not yet suitable for quick tasks if they require large dependencies; Step Functions does not yet provide dynamic fan out parallel processing; event triggers don’t yet allow you to rate limit; we can’t yet trigger a Lambda based on an SQS queue’s depth.
  65. Lessons Learned: expect limits; expect to exceed those limits; expect you will have to handle failures.
  66. Reflections for Managers: invest in training new hires; decouple learning your build/deploy pipeline from learning to build serverless architecture.
  67. Victoria Pierce (@thesecondshade). June 5th 2017: joined SPS Commerce as a Site Reliability Engineer Intern. July 12th 2017: deployed her first Lambda. As of November 15th 2017: maintains 6 serverless applications.