Take it to the LIMIT! Lessons Learned After 2(!) Years of Serverless in Production

There are a number of limitation-based considerations to make when deploying and maintaining an AWS Lambda-based architecture at scale. In this talk, we will discuss a wide variety of constraints we considered or discovered, and our workarounds or concessions for handling them. Specifically, we will talk about our experiences with: circumventing API Gateway timeouts and concurrent invocation limits experienced during load testing, contending with dependency artifact max size limits during deployment, minimizing cold start times (plus the maximum execution duration limits that followed from them), and handling Glue ETL issues experienced when attempting to point Glue Crawlers to S3 artifacts generated from Kinesis streams. In many of these scenarios, we did not completely solve the problem but instead came up with a solution that works “well enough”, so we will highlight the tradeoffs and pros/cons of the approaches we’ve taken.

Mottaqui Karim

March 08, 2020

Transcript

  1. Taq Karim (@taqkarim). Principal SWE, Oracle Data Cloud, Moat. Tech Lead, Top Tier Programmatic Ad Exchange.
  2. Game Plan. Here's what we will discuss today. Background: What did we build? Why? Architecture: What did the thing actually look like? Constraints: The fun stuff - what broke? How?
  3. (1) 2+ billion ad requests served / day. (2) 40,000+ "ad units" or screens in play. (3) Top 5 Demand Side Platforms partnered. (5) PROFIT! Grow programmatic OOH in leaps and bounds.
  6. Constraints around AWS Limitations. Here are the main points we will discuss: (1) Concurrent Invocation Limitations, (2) API Gateway Limitations, (3) Function Duration Limitations, (4) boto/Kinesis Limitations, (5) λ Dependency Size Limit.
  7. What happens? AWS begins to "throttle" functions; whether the invocation fails or is retried depends on how the function was invoked. Examples: 429 status_code error (API Gateway); retry X times, then fail (SNS, S3).
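
From the caller's side, a throttled API Gateway request shows up as a 429. A minimal sketch of how a client might retry it with exponential backoff follows. The `request` callable, the attempt count, and the delays are illustrative assumptions, not values from the talk, and production clients usually add random jitter on top.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    request: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `request` and retry while it reports HTTP 429 (throttled).

    `request` is a hypothetical callable returning (status_code, body).
    The wait doubles after every throttled attempt.
    """
    status, body = request()
    for attempt in range(1, max_attempts):
        if status != 429:
            break  # success or a non-throttling error: stop retrying
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = request()
    return status, body
```

If every attempt is throttled, the last 429 response is returned so the caller can decide what to do next.
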
  8. Increase memory for CPU-bound tasks: this should decrease duration and alleviate concurrency issues. Decrease memory for IO-bound tasks: this will optimize cost.
  9. Constraints around AWS Limitations. Here are the main points we will discuss: (1) Concurrent Invocation Limitations, (2) API Gateway Limitations, (3) Function Duration Limitations, (4) boto/Kinesis Limitations, (5) λ Dependency Size Limit.
  10. Definition: Lambdas invoked from API Gateway must contend with API Gateway limitations. Constraint(s): 30s max timeout per request; 10 MB max payload size.
  11. Mitigation Strategies. Make API Gateway λs dumb: ingest request ➡ validate ➡ drop to queue ➡ respond. Apply API design best practices: paginate or write to an S3 bucket.
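
The "ingest, validate, drop to queue, respond" flow can be sketched as a small handler. This is only an illustration under assumptions: the required fields are a hypothetical schema, and `enqueue` stands in for whatever queue client is used (for example a thin wrapper around an SQS or Kinesis send call). The event and response shapes follow the API Gateway Lambda proxy integration format.

```python
import json
from typing import Any, Callable, Dict

REQUIRED_FIELDS = ("user_id", "payload")  # hypothetical request schema


def _response(status: int, message: Dict[str, Any]) -> Dict[str, Any]:
    """Build an API Gateway proxy-style response."""
    return {"statusCode": status, "body": json.dumps(message)}


def make_handler(enqueue: Callable[[str], None]):
    """Return a Lambda handler that only validates and enqueues.

    `enqueue` is an injected, hypothetical function that pushes one
    serialized message onto a queue; the slow work happens downstream.
    """

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # Ingest
        try:
            body = json.loads(event.get("body") or "")
        except json.JSONDecodeError:
            return _response(400, {"error": "invalid JSON"})
        # Validate
        if not isinstance(body, dict):
            return _response(400, {"error": "expected a JSON object"})
        missing = [field for field in REQUIRED_FIELDS if field not in body]
        if missing:
            return _response(400, {"error": "missing fields", "fields": missing})
        # Drop to queue, then respond immediately
        enqueue(json.dumps(body))
        return _response(202, {"status": "queued"})

    return handler
```

Because the handler never waits on the real work, its duration stays well below the 30s API Gateway timeout regardless of how long downstream processing takes.
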
  12. Pros: ensures faster lambda duration; better API UX. Cons: cases where longer computation times are needed (file transfer, etc.); may introduce the need for maintaining state.
  13. You might not need λ! More often than not, instrumenting a λ may NOT be the way to go: (1) potentially long-lived API calls, (2) large payloads, (3) maintaining state.
  14. Constraints around AWS Limitations. Here are the main points we will discuss: (1) Concurrent Invocation Limitations, (2) API Gateway Limitations, (3) Function Duration Limitations, (4) boto/Kinesis Limitations, (5) λ Dependency Size Limit.
  15. Definition: a λ function can only run for a limited amount of time. Constraint: 15 minute max duration.
  16. First: CRON λ ➡ query ElasticSearch. Then: CRON λ ➡ push records to Kinesis. Finally: CRON λ ➡ update ElasticSearch.
  17. What if there are too many records? The CRON λ times out: CRON λ ➡ query ElasticSearch ➡ too many records. Failure mode: retry. It continues to retry the same invocation ➡ ES runs out of disk (infinite loop!).
  18. First: CRON λ ➡ query ElasticSearch. TIMEOUT: CRON λ ➡ retries. Finally: Kinesis ends up with dupes; ES is not updated correctly.
  19. Vertically Scale. Limit / decrease batch size (λ ➡ ElasticSearch): fewer records, so each invocation will complete in time. But the "actual work" takes longer. This is OK! Technically it is a "long lived" process.
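
One way to keep a batch-processing λ inside its duration limit is to work in small batches and stop while there is still time left. The sketch below assumes the records are already in memory and that `process_batch` is a hypothetical callable doing the real work; `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object. The batch size and safety margin are placeholder values.

```python
from typing import Any, Callable, List, Sequence


def process_within_budget(
    records: Sequence[Any],
    process_batch: Callable[[List[Any]], None],
    context: Any,
    batch_size: int = 100,
    safety_margin_ms: int = 30_000,
) -> int:
    """Process `records` in batches, stopping before the λ would time out.

    Returns the number of records processed, so the caller can hand the
    remainder to the next scheduled run instead of being retried blindly.
    """
    processed = 0
    for start in range(0, len(records), batch_size):
        # Bail out early rather than be killed mid-batch by the runtime.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break
        batch = list(records[start:start + batch_size])
        process_batch(batch)
        processed += len(batch)
    return processed
```

Returning a count instead of raising keeps the invocation "successful", which avoids the retry loop on the same oversized workload.
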
  20. Horizontally Scale. Currently the λ CRON is a single process. Provision multiple CRON lambdas. Allocate a strategy for distributing load. For ElasticSearch, use Slices.
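
Elasticsearch's sliced scroll lets several consumers read disjoint parts of the same query by adding a `slice` object with an `id` and a `max` to the search body. A minimal sketch of handing one slice to each CRON λ follows. The function name and the idea of precomputing one body per worker are assumptions for illustration, not the talk's implementation.

```python
from typing import Any, Dict, List


def sliced_queries(query: Dict[str, Any], num_workers: int) -> List[Dict[str, Any]]:
    """Build one scroll request body per worker, each covering one slice.

    Worker i would send body i to the search endpoint with a scroll
    timeout; together the slices cover every matching document once.
    """
    if num_workers < 2:
        # Elasticsearch requires `max` to be greater than 1.
        raise ValueError("sliced scroll needs at least 2 slices")
    return [
        {"slice": {"id": worker_id, "max": num_workers}, "query": query}
        for worker_id in range(num_workers)
    ]
```

Each provisioned λ only needs to know its own slice id and the total slice count, for example through environment variables, so no shared coordinator is required.
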
  21. This is a neat solution! ...That just happened to work out since the dependency was ElasticSearch.
  22. Constraints around AWS Limitations. Here are the main points we will discuss: (1) Concurrent Invocation Limitations, (2) API Gateway Limitations, (3) Function Duration Limitations, (4) boto/Kinesis Limitations, (5) λ Dependency Size Limit.
  23. Mitigation Strategies. Leverage KPL: it queues up messages and sends them in 1 MB batches, or as large as possible. For Python, leverage the Kinesis Producer Library package.
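
The batching idea itself (queue messages, then send them in groups that stay under a size cap) can be sketched without any library. This is not the KPL's implementation or API, only an illustration; the function name and the 1,000,000-byte default are assumptions.

```python
from typing import Iterable, Iterator, List


def batch_by_size(messages: Iterable[bytes], max_bytes: int = 1_000_000) -> Iterator[List[bytes]]:
    """Group messages into batches whose total size stays within `max_bytes`.

    Batches are filled greedily in arrival order, so each one is as large
    as possible without exceeding the cap.
    """
    batch: List[bytes] = []
    size = 0
    for message in messages:
        if len(message) > max_bytes:
            raise ValueError("single message exceeds the batch size limit")
        if batch and size + len(message) > max_bytes:
            yield batch  # current batch is full: emit it and start a new one
            batch, size = [], 0
        batch.append(message)
        size += len(message)
    if batch:
        yield batch  # flush whatever is left
```

Fewer, larger sends mean fewer PUT calls against the stream, which is the efficiency the slide attributes to the KPL.
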
  24. At high load, odd behavior: an unusually high # of Kinesis send failures. Failures are expected if another λ is interacting with the stream. Issue? KPL default setting execute_on_new_thread=True !!! This is the culprit. KPL tries to send large batches in new threads. For failure mode, multiple nearly simultaneous large sends push - all fail.
  25. Pros: ensures efficient PUTs to Kinesis. Cons: at or near the limit, fine tuning is required; generally, AWS tends to "stealth" update libraries - somewhat unreliable.
  26. Constraints around AWS Limitations. Here are the main points we will discuss: (1) Concurrent Invocation Limitations, (2) API Gateway Limitations, (3) Function Duration Limitations, (4) boto/Kinesis Limitations, (5) λ Dependency Size Limit.
  27. Definition: there is a max size limit for all λ dependency artifacts. Constraint: 50 MB zipped, including layers.
  28. All λs come with temp storage: the /tmp folder, max storage size 512 MB. Zip 2x ➡ move to /tmp ➡ unzip.
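
The runtime half of this workaround (unzip a bundled archive into /tmp and make it importable) might look roughly like the sketch below. The archive name, the target directory, and the function name are hypothetical; the talk does not show its actual packaging code.

```python
import sys
import zipfile
from pathlib import Path


def load_dependencies(archive: str, target: str = "/tmp/deps") -> str:
    """Extract a zipped dependency bundle into `target` and add it to sys.path.

    The extraction only happens when `target` does not exist yet, so warm
    invocations that reuse the same /tmp pay the cost once per container.
    """
    target_dir = Path(target)
    if not target_dir.exists():
        target_dir.mkdir(parents=True)
        with zipfile.ZipFile(archive) as bundle:
            bundle.extractall(target_dir)
    if str(target_dir) not in sys.path:
        sys.path.insert(0, str(target_dir))
    return str(target_dir)
```

This would be called at module import time, before importing anything that lives in the bundle, which is also why it adds directly to cold start duration.
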
  29. Pros: a lot more space available! Cons: runs up cold start times; horrendous for short-lived λs such as API Gateway invocations.
  30. Pros: a fair middle ground; might get away without having to do the double zip hack. Cons: only really works for Python apps; doesn't solve the problem, only delays it.
  31. Constraints around AWS Limitations. Here are the main points we will discuss: (1) Concurrent Invocation Limitations, (2) API Gateway Limitations, (3) Function Duration Limitations, (4) boto/Kinesis Limitations, (5) λ Dependency Size Limit.
  32. λs are not long lived; StatsD / DogStatsd are long-lived processes. Emitting metrics incurs additional duration time. Lacking a UID, metrics may overwrite each other.
  33. Pros: operate on batches, similar to a Kinesis consumer or CRON λ; no need for blocking HTTP calls in the λ invocation. Cons: single threaded - only ONE Subscription Filter allowed per log group.
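
Assuming the approach here is to write metrics as log lines and forward them through a CloudWatch Logs subscription filter to a separate λ, that consumer receives each batch as base64-encoded, gzip-compressed JSON under `event["awslogs"]["data"]`. A minimal sketch of unpacking such a batch is shown below; the function name is hypothetical and the metric line format is left open.

```python
import base64
import gzip
import json
from typing import Any, Dict, List


def extract_messages(event: Dict[str, Any]) -> List[str]:
    """Decode a CloudWatch Logs subscription event into raw log messages."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # Each log event carries the original log line in its "message" field.
    return [log_event["message"] for log_event in payload.get("logEvents", [])]
```

The consumer can then parse the metric lines and ship them in one batch, so the emitting λ never blocks on an HTTP call.
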
  34. Recap: some key ideas to take away from today. Constraints are GOOD! They enforce responsible use of resources. Not all use cases suit λs. Stateful? Long lived? Not λ. Trade-offs, trade-offs, trade-offs! At scale, compromise!
  35. Have Qs about λs? I'd be happy to expound as a professional courtesy. Architecture: I helped design and implement critical serverless subsystems. Packaging and Deployment Best Practices: I established best practices for CI/CD pipeline integration and automated testing. Telemetry and Alerting: I worked with key stakeholders to establish metric schemas and alerting practices.
  36. Time for Coffee. What are your next steps and goals? How much support do you need from investors and what will it get you?