Take it to the LIMIT! Lessons Learned After 2(!) Years of Serverless in Production

by Mottaqui Karim

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Taq Karim (@taqkarim): Principal SWE, Oracle Data Cloud, Moat. Tech Lead, Top Tier Programmatic Ad Exchange.

Slide 3

Slide 3 text

Game Plan: Here's what we will discuss today.
Background: What did we build? Why?
Architecture: What did the thing actually look like?
Constraints: The fun stuff. What broke? How?

Slide 4

Slide 4 text

A Dollar and a Dream Or, an introduction to the oldest form of advertising

Slide 5

Slide 5 text

OOH MECCA

Slide 6

Slide 6 text

Programmatic Mobile/Web Ad.

Slide 7

Slide 7 text

OUT OF HOME + PROGRAMMATIC ADS == RIVERSIDE HIGH SCHOOL COMPUTER CLUB

Slide 8

Slide 8 text

1. Find Canvas: Any screen with electricity and an internet connection will do.

Slide 9

Slide 9 text

2. Transform Ad: We want to plug in existing programmatic resources and transform them for OOH screens.

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

3. Add Content! Useful content, like current weather or sports statistics, is great.

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

4. ??? Wait for the industry to catch up and adopt this tech.

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

5. PROFIT! Grow programmatic OOH in leaps and bounds.
1. 2+ billion ad requests served per day
2. 40,000+ "Adunits" or screens in play
3. Top 5 Demand Side Platforms partnered

Slide 16

Slide 16 text

Game Plan: Here's what we will discuss today.
Background: What did we build? Why?
Architecture: What did the thing actually look like?
Constraints: The fun stuff. What broke? How?

Slide 17

Slide 17 text

Guts of an Adrequest

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Architecture: AWS Lambda, AWS Kinesis, AWS Athena, AWS Glue

Slide 20

Slide 20 text

The whole shebang SYSTEM ARCHITECTURE

Slide 21

Slide 21 text

Data Collection API ENDPOINTS

Slide 22

Slide 22 text

The whole shebang SYSTEM ARCHITECTURE

Slide 23

Slide 23 text

Data Processing and Reporting ETL JOBS AND AD-HOC BIG DATA QUERYING

Slide 24

Slide 24 text

The whole shebang SYSTEM ARCHITECTURE

Slide 25

Slide 25 text

Logging + Metrics MONITORING / TELEMETRY

Slide 26

Slide 26 text

Game Plan: Here's what we will discuss today.
Background: What did we build? Why?
Architecture: What did the thing actually look like?
Constraints: The fun stuff. What broke? How?

Slide 27

Slide 27 text

Everything is always just a little broken Or, constraints at scale and how we managed them

Slide 28

Slide 28 text

Constraints around AWS Limitations: Here are the main points we will discuss.
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit

Slide 29

Slide 29 text

Definition: How many simultaneous function invocations are available per region. Constraint: 1,000 concurrent executions (the default limit).

Slide 30

Slide 30 text

What happens? AWS begins to "throttle" functions; the failure mode (fail or retry) depends on what invoked the function. Examples: API Gateway gets a 429 status_code error; SNS and S3 retry X times, then fail.

Slide 31

Slide 31 text

Mitigation Strategy: Leverage "reserved concurrency". Allow certain functions to "reserve" a portion of the available concurrency.

Slide 32

Slide 32 text

At least 100 concurrent executions must remain unreserved per account.
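A minimal sketch of how a reservation plan could be checked against the 100-unreserved rule before it is applied. The function names and the `PLAN` values are hypothetical; `put_function_concurrency` is the boto3 Lambda client call for setting reserved concurrency, and the client is passed in by the caller so nothing here touches AWS on its own.

```python
ACCOUNT_LIMIT = 1000   # default regional concurrency limit cited in the talk
MIN_UNRESERVED = 100   # amount that must stay unreserved per account


def check_reservations(reservations, account_limit=ACCOUNT_LIMIT):
    """Return the size of the unreserved pool, or raise if too much is reserved."""
    unreserved = account_limit - sum(reservations.values())
    if unreserved < MIN_UNRESERVED:
        raise ValueError(
            f"plan leaves {unreserved} unreserved; need at least {MIN_UNRESERVED}"
        )
    return unreserved


def apply_reservations(lambda_client, reservations, account_limit=ACCOUNT_LIMIT):
    """Validate the plan, then set reserved concurrency for each function."""
    check_reservations(reservations, account_limit)
    for function_name, reserved in reservations.items():
        lambda_client.put_function_concurrency(
            FunctionName=function_name,
            ReservedConcurrentExecutions=reserved,
        )


# Hypothetical plan: function names and numbers are illustrative only.
PLAN = {"ad-request-api": 600, "kinesis-worker": 200}
```

In a real deployment this would presumably be called as `apply_reservations(boto3.client("lambda"), PLAN)`. Each reservation acts as a bulkhead: a burst in one function cannot consume the capacity set aside for another.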

Slide 33

Slide 33 text

Bulkhead Method Reserved Concurrency as fault tolerance

Slide 34

Slide 34 text

Measurement UnreservedConcurrentExecutions vs Invocations/function metric

Slide 35

Slide 35 text

Divide and conquer API Gateway functions? Mitigate. Kinesis workers? Let fail + retry

Slide 36

Slide 36 text

Pros: Prevents runaway invocations from burst activity. Cons: Account-wide limit; cannot access the "rest" of the concurrency.

Slide 37

Slide 37 text

Mitigation Strategy: Increase the concurrency limit. Ask AWS to increase the concurrency limit via a support ticket.

Slide 38

Slide 38 text

Pros: Easy! For some use cases, a legitimate need. Cons: Not really solving the problem.

Slide 39

Slide 39 text

Mitigation Strategy Measure + optimize perf characteristics Using monitoring, find and solve for perf bottlenecks

Slide 40

Slide 40 text

Decrease Function Duration CPU Allocation increases with memory

Slide 41

Slide 41 text

Fine-tune per function: CloudWatch Logs publishes a REPORT line for each invocation.
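A small sketch of how the REPORT line could be parsed to compare duration and memory use per function. It assumes the usual layout of that line (RequestId, Duration, Billed Duration, Memory Size, Max Memory Used, separated by tabs); the function name and the returned keys are hypothetical.

```python
import re

# Assumed layout of the REPORT line that Lambda writes to CloudWatch Logs.
REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_used_mb>\d+) MB"
)


def parse_report(line):
    """Return the REPORT fields as a dict, or None if the line is not a REPORT line."""
    match = REPORT_RE.search(line)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "billed_ms": float(match["billed_ms"]),
        "memory_mb": int(match["memory_mb"]),
        "max_used_mb": int(match["max_used_mb"]),
    }
```

A function whose `max_used_mb` sits far below `memory_mb` while `duration_ms` stays flat is a candidate for a smaller memory setting; one whose duration drops as memory goes up is likely CPU bound.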

Slide 42

Slide 42 text

Increase memory for CPU-bound tasks: this should decrease duration and alleviate concurrency issues. Decrease memory for IO-bound tasks: this will optimize cost.

Slide 43

Slide 43 text

Pros Cost efficiencies Best effort, "lazy" optimizing

Slide 44

Slide 44 text

Slide 45

Slide 45 text

Definition: Lambdas invoked from API Gateway must contend with API Gateway limitations. Constraints: 30 s max timeout per request (the default integration timeout is 29 s); 10 MB max payload size.

Slide 46

Slide 46 text

Mitigation Strategies: Make API Gateway λs dumb (ingest request ➡ validate ➡ drop to queue ➡ respond). Apply API design best practices (paginate or write to an S3 bucket).
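The "dumb λ" flow above can be sketched roughly as follows. The field names in `REQUIRED_FIELDS`, the stream name, and `make_handler` are hypothetical; the handler assumes an API Gateway proxy event with a JSON string in `body`, and the Kinesis client (for example `boto3.client("kinesis")`) is passed in by the caller.

```python
import json

REQUIRED_FIELDS = ("adunit_id", "timestamp")  # hypothetical request schema


def _response(status, body):
    return {"statusCode": status, "body": json.dumps(body)}


def make_handler(kinesis_client, stream_name):
    """Build a handler that only validates, enqueues, and responds."""

    def handler(event, context):
        # Ingest: the proxy integration delivers the request body as a string.
        try:
            payload = json.loads(event.get("body") or "")
        except json.JSONDecodeError:
            return _response(400, {"error": "invalid JSON"})

        # Validate: cheap structural checks only, no heavy computation.
        if not isinstance(payload, dict):
            return _response(400, {"error": "expected a JSON object"})
        missing = [field for field in REQUIRED_FIELDS if field not in payload]
        if missing:
            return _response(400, {"error": f"missing fields: {missing}"})

        # Drop to queue: downstream workers do the real work.
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(payload).encode("utf-8"),
            PartitionKey=str(payload["adunit_id"]),
        )

        # Respond right away, well inside the API Gateway timeout.
        return _response(202, {"status": "queued"})

    return handler
```

Keeping the handler this thin means its duration is dominated by a single queue write, which also eases pressure on the concurrency limit.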

Slide 47

Slide 47 text

Pros: Ensures faster lambda duration; better API UX. Cons: Cases where longer computation times are needed (file transfer, etc.); may introduce the need for maintaining state.

Slide 48

Slide 48 text

You might not need λ! More often than not, instrumenting a λ may NOT be the way to go.
1. Potentially long lived API calls
2. Large payloads
3. Maintaining state

Slide 49

Slide 49 text

Slide 50

Slide 50 text

Definition: A λ function can only run for a limited amount of time. Constraint: 15-minute max duration.

Slide 51

Slide 51 text

First: CRON λ ➡ Query ElasticSearch. Then: CRON λ ➡ Push records to Kinesis. Finally: CRON λ ➡ Update ElasticSearch.

Slide 52

Slide 52 text

What if there are too many records? The CRON λ times out: CRON λ ➡ Query ElasticSearch ➡ Too many records. Failure mode: retry. It continues to retry the same invocation ➡ ES runs out of disk (infinite loop!).

Slide 53

Slide 53 text

First: CRON λ ➡ Query ElasticSearch. TIMEOUT: CRON λ ➡ Retries. Finally: Kinesis ends up with dupes; ES is not updated correctly.

Slide 54

Slide 54 text

Mitigation Strategies: It is impossible to bump the λ timeout. Instead, tweak around the λ, either by Vertical Scaling or Horizontal Scaling.

Slide 55

Slide 55 text

Vertically Scale: Limit / decrease the batch size for λ ➡ ElasticSearch. Fewer records, so each run will complete in time. But the "actual work" takes longer overall. This is OK! It is technically a "long lived" process.
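One way the bounded-batch idea could look, as a sketch only. `fetch_pending`, `push`, and `mark_done` are hypothetical callables standing in for the ElasticSearch query, the Kinesis push, and the ElasticSearch update. The time check is an added safeguard not described in the talk; inside a Lambda it could be wired to `context.get_remaining_time_in_millis`.

```python
def process_batch(fetch_pending, push, mark_done,
                  batch_size=500, remaining_ms=None, safety_ms=30_000):
    """Process at most batch_size records; stop early if time is running out.

    Records left over are simply picked up by the next scheduled run.
    """
    processed = 0
    for record in fetch_pending(batch_size):
        # Bail out before the hard timeout so the invocation ends cleanly
        # instead of being killed and retried from scratch.
        if remaining_ms is not None and remaining_ms() < safety_ms:
            break
        push(record)
        # Mark each record right after pushing it, so a later retry does not
        # push the same record again.
        mark_done(record)
        processed += 1
    return processed
```

Marking records one at a time costs extra round trips, so in practice the update would probably be done in small groups; the sketch favors clarity over throughput.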

Slide 56

Slide 56 text

Horizontally Scale: Currently the λ CRON is a single process. Provision multiple CRON lambdas. Choose a strategy for distributing load. For ElasticSearch, use slices.
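A sketch of how work might be split across several CRON lambdas using ElasticSearch's sliced scroll, where each worker adds a `slice` clause with its own `id` and the shared `max`. The helper names are hypothetical, and how each worker learns its slice id (environment variable, event payload, and so on) is left open.

```python
def sliced_query(base_query, slice_id, total_slices):
    """Return a search body restricted to one slice of the result set."""
    if total_slices < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    if not 0 <= slice_id < total_slices:
        raise ValueError("slice_id must be in [0, total_slices)")
    return {
        "slice": {"id": slice_id, "max": total_slices},
        "query": base_query,
    }


def plan_workers(base_query, total_slices):
    """Build one search body per worker; the slices together cover every hit once."""
    return [sliced_query(base_query, i, total_slices) for i in range(total_slices)]
```

Each body would then be sent as a scroll search by a separate lambda, so no single invocation has to walk the whole result set.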

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

This is a neat solution! ...That just happened to work out since the dependency was ElasticSearch.

Slide 59

Slide 59 text

Generally, two options are available: Parallelize / scale λ dependencies, or... you might not need λ!

Slide 60

Slide 60 text

Slide 61

Slide 61 text

Definition: Kinesis enforces max limits on the volume of records added per second. Constraint: 1 MB/s max (per shard).

Slide 62

Slide 62 text

Mitigation Strategies: Leverage KPL. It queues up messages and sends them in 1 MB batches, or as large as possible. For Python, leverage the Kinesis Producer Library package.
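To illustrate the batching idea (not the KPL's actual implementation), here is a minimal sketch that groups encoded records into batches that stay under a byte budget. The function name is hypothetical, and the budget of 1,000,000 bytes is a conservative reading of the "1 MB" figure.

```python
MAX_BATCH_BYTES = 1_000_000  # conservative stand-in for the 1 MB figure


def batch_by_size(records, max_bytes=MAX_BATCH_BYTES):
    """Yield lists of byte strings whose combined size stays within max_bytes."""
    batch, size = [], 0
    for record in records:
        if len(record) > max_bytes:
            raise ValueError("a single record exceeds the batch limit")
        # Flush the current batch if adding this record would overflow it.
        if batch and size + len(record) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(record)
        size += len(record)
    if batch:
        yield batch
```

Sending fewer, fuller batches keeps the number of PUT calls down, but it also means each send sits closer to the throughput limit, which is where the tuning issues described next come from.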

Slide 63

Slide 63 text

At high load, odd behavior: an unusually high # of Kinesis send failures. Failures are expected if another λ is interacting with the stream. The issue? The KPL default setting execute_on_new_thread=True. This is the culprit!!! KPL tries to send large batches in new threads. In the failure mode, multiple nearly simultaneous large sends are pushed, and all fail.

Slide 64

Slide 64 text

Fix is simple, set to `False`

Slide 65

Slide 65 text

Pros: Ensures efficient PUTs to Kinesis. Cons: At or near the limit, fine tuning is required. Generally, AWS tends to "stealth" update libraries, which makes them somewhat unreliable.

Slide 66

Slide 66 text

Constraints around AWS Limitations: Here are the main points we will discuss.
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit

Slide 67

Slide 67 text

Definition: There is a max size limit for all λ dependency artifacts. Constraint: 50 MB, zipped, including layers.

Slide 68

Slide 68 text

All λs come with temp storage: the /tmp folder, with a max storage size of 512 MB. Zip 2x (zip the dependencies, then zip that archive into the deployment package) ➡ move to /tmp ➡ unzip.
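A rough sketch of the unzip-to-/tmp step, assuming the dependencies ship as an inner zip archive inside the deployment package. The function name, the archive name, and the `/tmp/deps` target are hypothetical.

```python
import os
import sys
import zipfile


def load_bundled_deps(archive_path, target_dir="/tmp/deps"):
    """Unpack the inner dependency archive and make it importable.

    The directory check means a warm container skips the unzip; only a
    cold start pays for it.
    """
    if not os.path.isdir(target_dir):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    return target_dir
```

This would be called at module import time, before any of the bundled packages are imported, for example `load_bundled_deps("deps.zip")` at the top of the handler module.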

Slide 69

Slide 69 text

Pros: A lot more space available! Cons: Runs up cold start times; horrendous for short lived λs such as API Gateway invocations.

Slide 70

Slide 70 text

Mitigation Strategies Remove all unnecessary files Technically, only .pyc files are needed for dependencies to run...

Slide 71

Slide 71 text

Script for removal: Walk through the dependencies and remove docs, tests, metadata, and .py files.
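A minimal sketch of what such a removal script might look like; the talk does not show the actual script. The directory names in `JUNK_DIRS` and the metadata suffixes are assumptions about what counts as removable. Sources are compiled to legacy-layout `.pyc` files (next to the `.py` file rather than in `__pycache__`) so imports keep working once the sources are gone.

```python
import compileall
import os
import shutil

JUNK_DIRS = {"tests", "test", "docs", "doc", "__pycache__"}  # assumed removable
METADATA_SUFFIXES = (".dist-info", ".egg-info")


def strip_dependencies(root):
    """Shrink a dependency tree down to compiled .pyc files."""
    # Pass 1: drop docs, tests, caches, and packaging metadata.
    for current, dirs, _files in os.walk(root):
        for name in list(dirs):
            if name in JUNK_DIRS or name.endswith(METADATA_SUFFIXES):
                shutil.rmtree(os.path.join(current, name))
                dirs.remove(name)  # do not descend into a removed directory

    # Pass 2: compile what is left, writing mod.pyc beside mod.py.
    compileall.compile_dir(root, quiet=1, legacy=True)

    # Pass 3: delete a source file only if its compiled twin exists.
    for current, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(current, name)
            if name.endswith(".py") and os.path.exists(path + "c"):
                os.remove(path)
```

Two caveats: `.pyc` files are tied to the Python version that produced them, so the script has to run under the same version as the λ runtime, and some packages read their own metadata at import time and may break once it is removed.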

Slide 72

Slide 72 text

Pros: A fair middle ground; might get away without having to do the double zip hack. Cons: Only really works for Python apps; doesn't solve the problem, only delays it.

Slide 73

Slide 73 text

Slide 74

Slide 74 text

Collecting Metrics from λ Functions

Slide 75

Slide 75 text

Cannot leverage StatsD for monitoring: λs lack "hostname"s or any other unique ID per function.

Slide 76

Slide 76 text

λs are not long lived, while StatsD / DogStatsD are long lived processes. Emitting metrics incurs additional duration time. Lacking a UID, metrics may overwrite each other.

Slide 77

Slide 77 text

Mitigation Strategies: Use more λs! CloudWatch Subscription Filter λ: push metrics downstream from CW logs.
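A sketch of the receiving side of a subscription filter λ. CloudWatch Logs delivers the batch as base64-encoded, gzip-compressed JSON under `event["awslogs"]["data"]`, with the individual entries in `logEvents`. The `METRIC name=value` log convention and the function names are hypothetical; the talk does not specify how metrics were encoded in the logs.

```python
import base64
import gzip
import json

METRIC_PREFIX = "METRIC "  # hypothetical convention for metric log lines


def decode_log_events(event):
    """Unpack the log events from a CloudWatch Logs subscription payload."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return payload.get("logEvents", [])


def extract_metrics(event):
    """Return (name, value, timestamp) tuples for every metric line in the batch."""
    metrics = []
    for log_event in decode_log_events(event):
        message = log_event["message"].strip()
        if message.startswith(METRIC_PREFIX):
            name, value = message[len(METRIC_PREFIX):].split("=", 1)
            metrics.append((name, float(value), log_event["timestamp"]))
    return metrics
```

The application λs then only need to print a log line, and the blocking call to the metrics backend happens once per batch in this separate function.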

Slide 78

Slide 78 text

Logging + Metrics MONITORING / TELEMETRY

Slide 79

Slide 79 text

Pros: Operates on batches, similar to a Kinesis consumer or CRON λ; no need for blocking HTTP calls in the λ invocation. Cons: Single threaded, since only ONE Subscription Filter is allowed per log group.

Slide 80

Slide 80 text

Recap: Some key ideas to take away from today. Constraints are GOOD! They enforce responsible use of resources. Not all use cases suit λs: Stateful? Long lived? Not λ. Trade-offs, trade-offs, trade-offs! At scale, compromise!

Slide 81

Slide 81 text

Have Qs about λs? I'd be happy to expound as a professional courtesy. Architecture: I helped design and implement critical serverless subsystems. Packaging and Deployment Best Practices: I established best practices for CI/CD pipeline integration and automated testing. Telemetry and Alerting: I worked with key stakeholders to establish metric schemas and alerting practices.

Slide 82

Slide 82 text

Social Media, etc. Let's connect! LinkedIn: http://taq.website/in Twitter: @taqkarim Slides: http://taq.website/pytn

Slide 83

Slide 83 text

Time for Coffee