Take it to the LIMIT! Lessons Learned After 2(!) Years of Serverless in Production


Taq Karim (@taqkarim)
Principal SWE, Oracle Data Cloud, Moat
Tech Lead, Top Tier Programmatic Ad Exchange

Game Plan
Here's what we will discuss today.
Background: What did we build? Why?
Architecture: What did the thing actually look like?
Constraints: The fun stuff. What broke? How?

A Dollar and a Dream
Or, an introduction to the oldest form of advertising

OOH MECCA

Programmatic Mobile/Web Ad.

OUT OF HOME + PROGRAMMATIC ADS == RIVERSIDE HIGH SCHOOL COMPUTER CLUB

1. Find Canvas
Any screen with electricity and an internet connection will do.

2. Transform Ad
We want to plug in existing programmatic resources and transform them for the OOH screen.

3. Add Content!
Useful content, like current weather or sports statistics, is great.

4. ???
Wait for the industry to catch up and adopt this tech.

5. PROFIT!
Grow programmatic OOH in leaps and bounds.
2+ billion ad requests served per day
40,000+ "Adunits" (screens) in play
Top 5 Demand Side Platforms partnered

Guts of an Adrequest

Architecture
AWS Lambda, AWS Kinesis, AWS Athena, AWS Glue

The whole shebang SYSTEM ARCHITECTURE

Data Collection API ENDPOINTS

Data Processing and Reporting ETL JOBS AND AD-HOC BIG DATA QUERYING

Logging + Metrics MONITORING / TELEMETRY

Everything is always just a little broken
Or, constraints at scale and how we managed them

Constraints around AWS Limitations
Here are the main points we will discuss:
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit

Definition: How many simultaneous function invocations are available per region.
Constraint: 1,000 concurrent invocations (the default limit).

What happens?
AWS begins to "throttle" functions; whether the caller fails or retries depends on what invoked the function.
Examples:
429 status_code error: API Gateway
Retry X times, then fail: SNS, S3

Mitigation Strategy: Leverage "reserved concurrency"
Allow certain functions to "reserve" a portion of the available concurrency.

At least 100 concurrent invocations must remain unreserved per account.
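
As a rough illustration of the reserved-concurrency idea, the sketch below checks a reservation plan against the 100-invocation unreserved floor before applying it. The function names (`ingest-api`, `kinesis-worker`) and the helpers `plan_reservations` and `apply_reservations` are hypothetical; the only AWS call used is `put_function_concurrency` on a boto3 Lambda client, which is passed in rather than created here.

```python
UNRESERVED_FLOOR = 100  # from the slide: at least 100 must stay unreserved


def plan_reservations(account_limit, requested):
    """Return the requested reservations if they leave the unreserved floor intact.

    `requested` maps a (hypothetical) function name to the concurrency it reserves.
    """
    total = sum(requested.values())
    if account_limit - total < UNRESERVED_FLOOR:
        raise ValueError(
            f"reserving {total} of {account_limit} leaves fewer than "
            f"{UNRESERVED_FLOOR} unreserved invocations"
        )
    return dict(requested)


def apply_reservations(lambda_client, plan):
    """Apply a plan using a boto3 Lambda client, e.g. boto3.client("lambda")."""
    for function_name, reserved in plan.items():
        lambda_client.put_function_concurrency(
            FunctionName=function_name,
            ReservedConcurrentExecutions=reserved,
        )
```

Validating the plan separately from applying it keeps the arithmetic testable without touching an AWS account.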

Bulkhead Method
Reserved concurrency as fault tolerance

Measurement
UnreservedConcurrentExecutions vs. the per-function Invocations metric

Divide and conquer
API Gateway functions? Mitigate.
Kinesis workers? Let them fail + retry.

Pros: Prevents runaway invocations from burst activity
Cons: Account-wide limit; cannot access the "rest" of the concurrency

Mitigation Strategy: Increase Concurrency Limit
Ask AWS to increase the concurrency limit via a support ticket.

Pros: Easy! For some use cases, a legitimate need
Cons: Not really solving the problem

Mitigation Strategy: Measure + optimize perf characteristics
Using monitoring, find and solve for perf bottlenecks.

Decrease Function Duration
CPU allocation increases with memory.

Fine-tune per function
CloudWatch Logs publishes a REPORT line.

Increase memory for CPU-bound tasks: this should decrease duration and alleviate concurrency issues.
Decrease memory for IO-bound tasks: this will optimize cost.
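
To make the REPORT-line tuning concrete, here is a minimal sketch that pulls duration and memory figures out of a REPORT line. The regular expression assumes the common field order (Duration, Billed Duration, Memory Size, Max Memory Used); real lines may carry extra fields, so treat the pattern and the helper names `parse_report` and `memory_headroom` as assumptions.

```python
import re

# Assumed layout of a Lambda REPORT log line; fields are tab-separated in practice.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_used_mb>\d+) MB"
)


def parse_report(line):
    """Return the REPORT fields as a dict, or None if the line does not match."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "billed_ms": float(match["billed_ms"]),
        "memory_mb": int(match["memory_mb"]),
        "max_used_mb": int(match["max_used_mb"]),
    }


def memory_headroom(report):
    """Fraction of the allocated memory that the invocation never used."""
    return 1 - report["max_used_mb"] / report["memory_mb"]
```

A function with consistently high headroom and IO-bound work is a candidate for less memory; one with long durations and CPU-bound work is a candidate for more.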

Pros: Cost efficiencies; best effort, "lazy" optimizing

Definition: Lambdas invoked from API Gateway must contend with API Gateway limitations.
Constraint(s): 30 s max timeout per request; 10 MB max payload size

Mitigation Strategies
Make API Gateway λs dumb: ingest request ➡ validate ➡ drop to queue ➡ respond
Apply API design best practices: paginate or write to an S3 bucket
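
The ingest ➡ validate ➡ drop to queue ➡ respond flow could look roughly like the sketch below. It assumes an API Gateway proxy-style event with a JSON `body`, a Kinesis stream as the queue, and a made-up schema (`adunit_id`, `timestamp`); `make_handler` is a hypothetical factory so that the Kinesis client (for example `boto3.client("kinesis")`) is injected rather than created inside the handler.

```python
import json

REQUIRED_FIELDS = ("adunit_id", "timestamp")  # hypothetical request schema


def make_handler(kinesis_client, stream_name):
    """Build a Lambda handler that only validates and enqueues."""

    def handler(event, context):
        # Ingest: the proxy integration delivers the request body as a string.
        try:
            payload = json.loads(event.get("body") or "")
        except ValueError:
            return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

        # Validate: reject anything that is not an object with the required fields.
        if not isinstance(payload, dict):
            return {"statusCode": 400, "body": json.dumps({"error": "expected an object"})}
        missing = [field for field in REQUIRED_FIELDS if field not in payload]
        if missing:
            return {"statusCode": 400, "body": json.dumps({"missing": missing})}

        # Drop to queue: the slow work happens in a downstream consumer.
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(payload).encode("utf-8"),
            PartitionKey=str(payload["adunit_id"]),
        )

        # Respond: acknowledge receipt without waiting for processing.
        return {"statusCode": 202, "body": json.dumps({"status": "queued"})}

    return handler
```

Returning 202 rather than 200 signals that the request was accepted but not yet processed, which fits the queue-first design.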

Pros: Ensures faster lambda duration; better API UX
Cons: Some cases need longer computation times (file transfer, etc.); may introduce the need for maintaining state

You might not need λ!
More often than not, instrumenting a λ may NOT be the way to go.
1. Potentially long-lived API calls
2. Large payloads
3. Maintaining state

Definition: A λ function can only run for a maximum amount of time.
Constraint: 15 minute max duration

First: CRON λ ➡ Query ElasticSearch
Then: CRON λ ➡ Push Records to Kinesis
Finally: CRON λ ➡ Update ElasticSearch

What if too many records? CRON λ times out
CRON λ ➡ Query ElasticSearch ➡ Too many records
Failure Mode: retry
Continues to retry same invocation ➡ ES runs out of disk (infinite loop!)

First: CRON λ ➡ Query ElasticSearch
TIMEOUT: CRON λ ➡ Retries
Finally: Kinesis ends up with dupes; ES not updated correctly

Mitigation Strategies
Impossible to bump λ timeout. Instead, tweak around the λ, either by:
Vertical Scaling
Horizontal Scaling

Vertically Scale
Limit / decrease batch size: λ ➡ ElasticSearch
Fewer records, will complete in time
But, "actual work" takes longer
This is ok! Technically a "long lived" process
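
One way to keep a batch-limited CRON λ inside its time budget is to stop pulling batches once the remaining time drops below a safety margin. The sketch below assumes hypothetical `fetch_batch` and `handle_batch` callables (for instance, a query against ElasticSearch and a push to Kinesis); the only Lambda-specific call is `context.get_remaining_time_in_millis()`.

```python
def process_in_batches(fetch_batch, handle_batch, context,
                       batch_size=500, safety_margin_ms=30_000):
    """Process small batches until the work runs out or the deadline is near.

    Returns the number of records handled in this invocation; whatever is left
    waits for the next scheduled run instead of triggering a timeout and retry.
    """
    processed = 0
    while context.get_remaining_time_in_millis() > safety_margin_ms:
        batch = fetch_batch(batch_size)
        if not batch:
            break  # nothing left to do
        handle_batch(batch)
        processed += len(batch)
    return processed
```

The batch size and safety margin here are placeholders; the margin should exceed the worst-case time to handle one batch.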

Horizontally Scale
Currently λ CRON is a single process
Provision multiple CRON lambdas
Allocate strategy for distributing load
For ElasticSearch, use Slices
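
For the slices approach, each CRON λ can be handed its own slice of the same scroll query. The sketch below only builds the request bodies, using the `slice` object (`id`, `max`) from Elasticsearch's sliced scroll; the helper name `slice_bodies` and the way a worker id reaches each λ are assumptions.

```python
def slice_bodies(query, worker_count):
    """Build one sliced-scroll request body per worker.

    A single slice is just a plain scroll, so at least two workers are required.
    """
    if worker_count < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    return [
        {"slice": {"id": worker_id, "max": worker_count}, "query": query}
        for worker_id in range(worker_count)
    ]
```

Each λ would then run its scroll with its own body, so the slices partition the result set without the workers coordinating with each other.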

This is a neat solution!
...That just happened to work out since the dependency was ElasticSearch

Generally, two options available
Parallelize / scale λ dependencies
You might not need λ!

Definition: Kinesis asserts max size limits for # of records added/second
Constraint: 1 MB/s max

Mitigation Strategies
Leverage KPL: queues up messages and sends in 1 MB batches, or as large as possible
For Python, leverage the Kinesis Producer Library package
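
The batching idea behind the KPL can be sketched without the library itself. The code below is not the KPL API; it is a plain illustration that groups already-serialized records into batches under a byte budget and a record-count cap (500 is the per-call record limit of the Kinesis `PutRecords` API). The 1,000,000-byte budget is a conservative stand-in for the 1 MB figure, and it ignores partition-key bytes.

```python
MAX_BATCH_BYTES = 1_000_000  # conservative stand-in for the ~1 MB budget
MAX_BATCH_RECORDS = 500      # PutRecords accepts at most 500 records per call


def batch_records(records, max_bytes=MAX_BATCH_BYTES, max_records=MAX_BATCH_RECORDS):
    """Yield lists of bytes records, each list as large as the limits allow."""
    batch, batch_bytes = [], 0
    for record in records:
        if len(record) > max_bytes:
            raise ValueError("a single record exceeds the batch budget")
        over_bytes = batch_bytes + len(record) > max_bytes
        over_count = len(batch) >= max_records
        if batch and (over_bytes or over_count):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += len(record)
    if batch:
        yield batch  # flush the final partial batch
```

Each yielded batch could then be sent with one call, keeping the number of PUTs low while staying under the size limit.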

At high load, odd behavior
Unusually high # of Kinesis send failures
Failures expected if another λ is interacting with stream
Issue? KPL default setting execute_on_new_thread=True !!!
This is the culprit: KPL tries to send large batches in new threads
For failure mode, multiple nearly simultaneous large sends push, and all fail

Fix is simple, set to `False`

Pros: Ensures efficient PUTs to Kinesis
Cons: At or near limit, fine tuning required; generally, AWS tends to "stealth" update libraries, so somewhat unreliable


Definition: There is a max size limit for all λ dependency artifacts
Constraint: 50 MB zipped, including layers

All λs come with temp storage
/tmp folder, max storage size 512 MB
Zip 2x ➡ move to /tmp ➡ unzip
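
The "zip 2x ➡ move to /tmp ➡ unzip" trick might look like the sketch below at cold start: an inner archive of dependencies ships inside the deployment package, gets extracted into /tmp once per container, and is added to the import path. The archive name, the target directory, and the helper name `load_bundled_dependencies` are assumptions.

```python
import os
import sys
import zipfile


def load_bundled_dependencies(archive_path, target_dir="/tmp/deps"):
    """Unzip the inner archive into target_dir once and make it importable."""
    if not os.path.isdir(target_dir):
        # Only the first invocation in a container pays the extraction cost.
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    return target_dir
```

This would be called at module import time, before importing anything that lives in the inner archive, which is exactly why it adds to cold start time.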

Pros: A lot more space available!
Cons: Runs up cold start times; horrendous for short-lived λs such as API Gateway invocations

Mitigation Strategies
Remove all unnecessary files
Technically, only .pyc files are needed for dependencies to run...

Script for removal
Walk through dependencies and remove doc, tests, metadata, .py files
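
A removal script along these lines could be sketched as follows. This is not the script from the talk: the directory names treated as removable are guesses, and the approach relies on `compileall` writing legacy-style `.pyc` files next to the sources so that imports still work once the `.py` files are gone. Bytecode is specific to the Python version that produced it, so this should run under the same version as the λ runtime, and packages that read their own metadata at runtime may break.

```python
import compileall
import os
import shutil

REMOVABLE_DIRS = {"tests", "test", "docs", "doc", "__pycache__"}  # assumed names
REMOVABLE_DIR_SUFFIXES = (".dist-info", ".egg-info")              # package metadata


def slim_dependencies(root):
    """Compile dependencies to .pyc, then drop docs, tests, metadata and sources."""
    # legacy=True writes foo.pyc beside foo.py, where sourceless imports look for it.
    compileall.compile_dir(root, legacy=True, quiet=1)
    for current, dirnames, filenames in os.walk(root):
        for name in list(dirnames):
            if name in REMOVABLE_DIRS or name.endswith(REMOVABLE_DIR_SUFFIXES):
                shutil.rmtree(os.path.join(current, name))
                dirnames.remove(name)  # do not descend into a removed directory
        for name in filenames:
            path = os.path.join(current, name)
            # Keep the source if compilation failed and no bytecode exists for it.
            if name.endswith(".py") and os.path.exists(path + "c"):
                os.remove(path)
```

Running it against a copy of the dependency directory first makes it easy to compare sizes and confirm that imports still succeed.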

Pros: A fair middle ground; might get away without having to do the double zip hack
Cons: Only really works for Python apps; doesn't solve the problem, only delays it


Collecting Metrics from λ Functions

Cannot leverage StatsD for monitoring
λs lack "hostname"s / other unique ID per function

λs are not long lived
StatsD / DogStatsD are long-lived processes
Emitting metrics incurs additional duration time
Lacking UID, metrics may overwrite each other

Mitigation Strategies
Use more λs!
CloudWatch Subscription Filter λ
Push metrics downstream from CW logs
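
A subscription filter λ receives log events as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. The sketch below decodes that payload and forwards REPORT lines through an `emit` callable; `emit` is a hypothetical hook standing in for whatever pushes metrics downstream, and filtering on the REPORT prefix is just one possible choice.

```python
import base64
import gzip
import json


def decode_log_events(event):
    """Return (log group, log events) from a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return payload["logGroup"], payload["logEvents"]


def handler(event, context, emit=print):
    """Forward REPORT lines from one batch of log events; return how many were sent."""
    log_group, log_events = decode_log_events(event)
    forwarded = 0
    for log_event in log_events:
        if log_event["message"].startswith("REPORT"):
            emit(log_group, log_event["timestamp"], log_event["message"])
            forwarded += 1
    return forwarded
```

Because the function works on a whole batch of log events, the monitored λs never make a blocking metrics call themselves.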

Logging + Metrics MONITORING / TELEMETRY

Pros: Operate on batches, similar to Kinesis consumer or CRON λ; no need for blocking HTTP calls in λ invocation
Cons: Single threaded; only ONE Subscription Filter allowed per log group

Recap
Some key ideas to take away from today
Constraints are GOOD! Enforces responsible use of resources
Not all use cases suit λs. Stateful? Long lived? Not λ.
Trade offs, Trade offs, Trade offs! At scale, compromise!

Have Qs about λs?
I'd be happy to expound as a professional courtesy.
Architecture: I helped design and implement critical serverless subsystems
Packaging and Deployment Best Practices: I established best practices for CI/CD pipeline integration and automated testing
Telemetry and Alerting: I worked with key stakeholders to establish metric schemas and alerting practices

Social Media, etc.
Let's connect!
LinkedIn: http://taq.website/in
Twitter: @taqkarim
Slides: http://taq.website/pytn

