Take it to
the LIMIT!
Lessons Learned After 2(!) Years of Serverless
in Production
Slide 2
Slide 2 text
Principal SWE
Oracle Data Cloud, Moat
Tech Lead
Top Tier Programmatic Ad Exchange
Taq Karim
@taqkarim
Slide 3
Slide 3 text
Game Plan
Here's what we will discuss today.
Background
What did we build? Why?
Architecture
What did the thing actually look like?
Constraints
The fun stuff - what broke? How?
Slide 4
Slide 4 text
A Dollar
and a
Dream
Or, an introduction to the oldest form
of advertising
Slide 5
Slide 5 text
O O H
M E C C A
Slide 6
Slide 6 text
Programmatic
Mobile/Web
Ad.
Slide 7
Slide 7 text
OUT OF
HOME +
PROGRAMMATIC ADS ==
RIVERSIDE HIGH SCHOOL
COMPUTER CLUB
Slide 8
Slide 8 text
1. Find Canvas
Any screen with electricity and internet
connection will do
SALE
Wild
UP TO 50% OFF!
Slide 9
Slide 9 text
2. Transform Ad
We want to plug in existing programmatic
resources and
transform them for OOH screens.
Slide 10
Slide 10 text
No content
Slide 11
Slide 11 text
3. Add Content!
Useful content, like current weather or
sports statistics, is great.
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
4. ???
Wait for industry to catch up and adopt this
tech.
Slide 14
Slide 14 text
No content
Slide 15
Slide 15 text
5. PROFIT!
Grow programmatic OOH in
leaps and bounds
1. 2+ billion ad requests served / day
2. 40,000+ "Adunits" or screens in play
3. Top 5 Demand Side Platforms partnered
Slide 16
Slide 16 text
Game Plan
Here's what we will discuss today.
Background
What did we build? Why?
Architecture
What did the thing actually look like?
Constraints
The fun stuff - what broke? How?
Data Processing
and Reporting
ETL JOBS AND AD-HOC BIG DATA QUERYING
Slide 24
Slide 24 text
The whole
shebang
SYSTEM ARCHITECTURE
Slide 25
Slide 25 text
Logging +
Metrics
MONITORING / TELEMETRY
Slide 26
Slide 26 text
Game Plan
Here's what we will discuss today.
Background
What did we build? Why?
Architecture
What did the thing actually look like?
Constraints
The fun stuff - what broke? How?
Slide 27
Slide 27 text
Everything is
always just a
little broken
Or, constraints at scale and how we
managed them
Slide 28
Slide 28 text
Constraints around
AWS Limitations
Here are the main points we will discuss
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit
Slide 29
Slide 29 text
Definition
How many simultaneous function
invocations are available/region.
Constraint
1,000 concurrent invocations (default).
Slide 30
Slide 30 text
What happens?
AWS begins to "throttle" functions; whether an invocation
fails or is retried depends on the event source
Examples
API Gateway: 429 status_code error
SNS, S3: retry X times, then fail
Slide 31
Slide 31 text
Mitigation Strategy
Leverage "reserved
concurrency"
Allow certain functions to "reserve" a
portion of the available concurrency
Slide 32
Slide 32 text
At least 100 invocations must remain
unreserved/account
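
A minimal sketch of planning reserved concurrency under the two numbers above (1,000 total, at least 100 left unreserved). The function names in the usage note are hypothetical. `put_function_concurrency` is the boto3 Lambda client call for setting a reservation; the client is passed in so the planning logic can be checked without touching AWS.

```python
ACCOUNT_LIMIT = 1000   # default concurrency limit quoted in the slides
MIN_UNRESERVED = 100   # at least this much must stay unreserved


def validate_plan(plan, account_limit=ACCOUNT_LIMIT, min_unreserved=MIN_UNRESERVED):
    """Return the unreserved concurrency left over, or raise if the plan over-reserves."""
    reserved = sum(plan.values())
    unreserved = account_limit - reserved
    if unreserved < min_unreserved:
        raise ValueError(
            f"plan reserves {reserved}, leaving {unreserved} unreserved "
            f"(minimum is {min_unreserved})"
        )
    return unreserved


def apply_plan(lambda_client, plan):
    """Apply each reservation through the Lambda PutFunctionConcurrency API."""
    validate_plan(plan)
    for function_name, reserved in plan.items():
        lambda_client.put_function_concurrency(
            FunctionName=function_name,
            ReservedConcurrentExecutions=reserved,
        )
```

Usage, assuming boto3 is installed and the (hypothetical) functions exist: `apply_plan(boto3.client("lambda"), {"ad-request-api": 600, "kinesis-worker": 200})`.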
Slide 33
Slide 33 text
Bulkhead Method
Reserved Concurrency as fault tolerance
Slide 34
Slide 34 text
Measurement
UnreservedConcurrentExecutions vs
Invocations/function metric
Slide 35
Slide 35 text
Divide and conquer
API Gateway functions? Mitigate.
Kinesis workers? Let fail + retry
Slide 36
Slide 36 text
Pros
Prevent runaway invocations from burst
activity
Cons
Account wide limit
Cannot access "rest" of concurrency
Slide 37
Slide 37 text
Mitigation Strategy
Increase Concurrency Limit
Ask AWS to increase concurrency limit via
support ticket
Slide 38
Slide 38 text
Pros
Easy!
For some use cases, legitimate need
Cons
Not really solving the problem
Slide 39
Slide 39 text
Mitigation Strategy
Measure + optimize perf
characteristics
Using monitoring, find and solve for perf
bottlenecks
Slide 40
Slide 40 text
Decrease Function Duration
CPU Allocation increases with memory
Slide 41
Slide 41 text
Fine tune per function
CloudWatch Logs publishes a REPORT line for every invocation
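
A small sketch of pulling the tuning numbers out of a REPORT line. The regular expression assumes the standard field order (Duration, Billed Duration, Memory Size, Max Memory Used); any extra trailing fields, such as Init Duration on cold starts, are ignored. The function name is illustrative.

```python
import re

REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_mb>\d+) MB"
)


def parse_report(line):
    """Return the REPORT fields as a dict, or None if the line is not a REPORT line."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "billed_ms": float(match["billed_ms"]),
        "memory_mb": int(match["memory_mb"]),
        "max_memory_mb": int(match["max_memory_mb"]),
    }
```

Comparing `max_memory_mb` against `memory_mb` per function shows where memory is over-provisioned; comparing `duration_ms` across memory settings shows whether a task is CPU bound.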
Slide 42
Slide 42 text
Increase Memory for CPU
bound tasks
This should decrease duration and alleviate
concurrency issues
Decrease Memory for IO
bound tasks
This will optimize cost
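
The trade-off can be put into rough numbers. This sketch assumes billing is proportional to memory times duration (GB-seconds) and uses Little's law for concurrency; the helper names and the example figures in the note below are made up for illustration.

```python
def gb_seconds(memory_mb, duration_ms):
    """Billable compute for one invocation, in GB-seconds."""
    return (memory_mb / 1024) * (duration_ms / 1000)


def steady_state_concurrency(requests_per_second, duration_ms):
    """Little's law: average in-flight invocations = arrival rate x duration."""
    return requests_per_second * (duration_ms / 1000)
```

If a CPU bound function drops from 800 ms at 512 MB to 400 ms at 1024 MB, `gb_seconds` is unchanged, but at a hypothetical 2,000 requests/second the concurrency needed falls from about 1,600 to about 800. For an IO bound function whose duration does not move, cutting memory cuts `gb_seconds` proportionally.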
Slide 43
Slide 43 text
Pros
Cost efficiencies
Best effort, "lazy" optimizing
Slide 44
Slide 44 text
Constraints around
AWS Limitations
Here are the main points we will discuss
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit
Slide 45
Slide 45 text
Definition
Lambdas invoked from API gateway must
contend with API gateway limitations
Constraint(s)
29s max integration timeout/request
10 MB max payload size limit
Slide 46
Slide 46 text
Mitigation Strategies
Make API Gateway λs dumb
Ingest request ➡ validate ➡ drop to queue ➡ respond
Apply API design best practices
Paginate or write to S3 bucket
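
A sketch of a "dumb" API Gateway λ following the ingest, validate, drop to queue, respond steps. The `enqueue` callable stands in for whatever queue is used (Kinesis, SQS, ...), and the required `adunit_id` field is a hypothetical validation rule.

```python
import json


def make_handler(enqueue):
    """Build an API Gateway proxy handler that only validates and enqueues."""

    def handler(event, context):
        # Ingest: the proxy integration delivers the request body as a string.
        try:
            payload = json.loads(event.get("body") or "")
        except json.JSONDecodeError:
            return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

        # Validate (hypothetical rule: an "adunit_id" field is required).
        if not isinstance(payload, dict) or "adunit_id" not in payload:
            return {"statusCode": 400, "body": json.dumps({"error": "adunit_id is required"})}

        # Drop to queue: the heavy lifting happens in a downstream worker.
        enqueue(json.dumps(payload))

        # Respond right away so the λ stays short lived.
        return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}

    return handler
```

Keeping the handler free of downstream work is what keeps its duration, and therefore its share of concurrency, small.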
Slide 47
Slide 47 text
Pros
Ensures faster lambda duration
Better API UX
Cons
Cases where longer computation times
are needed (file transfer, etc)
May introduce the need for maintaining
state
Slide 48
Slide 48 text
You might not
need λ !
More often than not, instrumenting a λ
may NOT be the way to go.
1. Potentially long lived API calls
2. Large Payloads
3. Maintaining state
Slide 49
Slide 49 text
Constraints around
AWS Limitations
Here are the main points we will discuss
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit
Slide 50
Slide 50 text
Definition
A λ function can only run for a limited
amount of time
Constraint
15 minute max duration
Slide 51
Slide 51 text
First
CRON λ ➡ Query Elasticsearch
Then
CRON λ ➡ Push Records to Kinesis
Finally
CRON λ ➡ Update Elasticsearch
Slide 52
Slide 52 text
What if too many records?
CRON λ times out
CRON λ ➡ Query Elasticsearch ➡ Too many records
Failure Mode: retry
Continues to retry same invocation ➡ ES runs out of disk (infinite loop!)
Slide 53
Slide 53 text
First
CRON λ ➡ Query Elasticsearch
TIMEOUT
CRON λ ➡ Retries
Finally
Kinesis ends up with dupes
ES not updated correctly
Slide 54
Slide 54 text
Mitigation Strategies
Impossible to bump λ timeout
Instead, tweak around the λ, either by:
Vertical Scaling
Horizontal Scaling
Slide 55
Slide 55 text
Vertically Scale
Limit / decrease batch size
λ
➡ ElasticSearch
Fewer records, will complete in time
But, "actual work" takes longer
This is ok! Technically a "long lived"
process
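
One way to sketch the smaller-batch approach: keep pulling limited batches and stop while there is still time on the clock, leaving the rest for the next CRON run. `get_remaining_time_in_millis()` is the method on the Lambda context object; `fetch_batch`, `process_batch`, and the default numbers are hypothetical, and `fetch_batch` is assumed to return only records that have not been processed yet.

```python
def run_batches(fetch_batch, process_batch, context, batch_size=500, safety_margin_ms=30_000):
    """Process small batches until no work is left or the time budget is nearly spent."""
    processed = 0
    while context.get_remaining_time_in_millis() > safety_margin_ms:
        batch = fetch_batch(batch_size)
        if not batch:
            break  # nothing left to do
        process_batch(batch)
        processed += len(batch)
    return processed
```

Because the function returns before the timeout instead of being killed by it, the invocation is not retried wholesale, which is what produced the duplicates described earlier.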
Slide 56
Slide 56 text
Horizontally Scale
Currently λ CRON is a single
process
Provision multiple CRON lambdas
Allocate strategy for distributing load
For ElasticSearch, use Slices
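
A sketch of the load-distribution step using Elasticsearch sliced scroll, where each search body carries a `slice` object with an `id` and a `max`. The helper name is illustrative; each CRON λ would take one body and run its own scroll over that slice.

```python
def sliced_queries(query, total_slices):
    """Build one search body per worker; together the slices cover the full result set."""
    if total_slices < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    return [
        {"slice": {"id": slice_id, "max": total_slices}, "query": query}
        for slice_id in range(total_slices)
    ]
```

For example, `sliced_queries({"match_all": {}}, 4)` yields four bodies, one per CRON λ, that do not overlap.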
Slide 57
Slide 57 text
No content
Slide 58
Slide 58 text
This is a neat solution!
...That just happened to work out since
the dependency was Elasticsearch
Slide 59
Slide 59 text
Generally, two options available
Parallelize / scale λ dependencies
You might not need λ!
Slide 60
Slide 60 text
Constraints around
AWS Limitations
Here are the main points we will discuss
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit
Slide 61
Slide 61 text
Definition
Kinesis enforces limits on how much data can be
written to a stream per second
Constraint
1 MB/s max write throughput per shard
Slide 62
Slide 62 text
Mitigation Strategies
Leverage KPL
Queues up messages and sends them in 1 MB
batches,
or as large as possible
For python, leverage the Kinesis Producer
Library package
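
To illustrate the batching idea only (this is not the KPL's actual implementation), here is a sketch that groups records into batches that stay under a size cap. The function name and the default cap are assumptions for the example.

```python
def batch_by_size(records, max_batch_bytes=1_000_000):
    """Group byte strings into batches whose total size stays within max_batch_bytes."""
    batch, batch_bytes = [], 0
    for record in records:
        size = len(record)
        if size > max_batch_bytes:
            raise ValueError(f"record of {size} bytes exceeds the batch limit")
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch  # current batch is as large as it can get
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Sending one batch at a time, rather than several at once, is what keeps a producer under the per-second limit.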
Slide 63
Slide 63 text
At high load, odd behavior
Unusually high # of Kinesis send failures
Failures expected if another λ is
interacting with stream
Issue? KPL default setting
execute_on_new_thread=True
!!! This is the culprit
KPL tries to send large batches in new threads
Failure mode: multiple large sends go out nearly
simultaneously and all fail
Slide 64
Slide 64 text
Fix is simple, set to `False`
Slide 65
Slide 65 text
Pros
Ensures efficient PUTs to Kinesis
Cons
At or near limit, fine tuning required
Generally, AWS tends to "stealth" update
libraries - somewhat unreliable
Slide 66
Slide 66 text
Constraints around
AWS Limitations
Here are the main points we will discuss
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit
Slide 67
Slide 67 text
Definition
There is a max size limit for all λ dependency
artifacts
Constraint
50 MB, zipped
Including layers
Slide 68
Slide 68 text
All λs come with temp storage
/tmp folder, max storage size 512 MB
Zip 2x ➡ move to /tmp ➡ unzip
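
A sketch of the runtime half of the double-zip trick: the deployment package carries an inner zip of dependencies, which is unpacked into /tmp and put on the import path. The function name, the inner archive name, and the target directory are hypothetical.

```python
import os
import sys
import zipfile


def load_bundled_dependencies(inner_zip_path, target_dir="/tmp/deps"):
    """Unzip the inner dependency archive once per container and make it importable."""
    if not os.path.isdir(target_dir):  # warm invocations skip the unzip
        with zipfile.ZipFile(inner_zip_path) as archive:
            archive.extractall(target_dir)
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    return target_dir
```

This would be called at module import time, before importing the heavy dependencies, e.g. `load_bundled_dependencies("deps.zip")`. The unzip runs on every cold start.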
Slide 69
Slide 69 text
Pros
A lot more space available!
Cons
Runs up cold start times
Horrendous for short lived λs such as API
Gateway invocations
Slide 70
Slide 70 text
Mitigation Strategies
Remove all unnecessary files
Technically, only .pyc files are needed for
dependencies to run...
Slide 71
Slide 71 text
Script for removal
Walk through dependencies and remove
doc, tests, metadata, .py files
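
The removal script itself is not shown in the slides; the following is a rough sketch of what one could look like. It assumes the build runs on the same Python version as the λ runtime (.pyc files are version specific), and the directory names and suffixes to prune are guesses. Deleting metadata can break packages that look up their own version at runtime.

```python
import compileall
import os
import shutil

REMOVABLE_DIRS = {"tests", "test", "docs", "doc", "__pycache__"}
REMOVABLE_SUFFIXES = (".dist-info", ".egg-info")


def prune_dependencies(root):
    """Compile to sourceless .pyc files, then delete docs, tests, metadata and .py files."""
    # legacy=True writes module.pyc next to module.py, which is where the
    # import system looks for bytecode when no source file is present.
    compileall.compile_dir(root, legacy=True, quiet=1)
    for current, dirs, files in os.walk(root, topdown=True):
        for name in list(dirs):
            if name in REMOVABLE_DIRS or name.endswith(REMOVABLE_SUFFIXES):
                shutil.rmtree(os.path.join(current, name))
                dirs.remove(name)  # do not descend into a deleted directory
        for name in files:
            path = os.path.join(current, name)
            # Only drop the source once its compiled twin exists.
            if name.endswith(".py") and os.path.exists(path + "c"):
                os.remove(path)
```

Run it against the directory the dependencies were installed into, before zipping the deployment package.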
Slide 72
Slide 72 text
Pros
A fair middle ground
Might get away without having to do the
double zip hack
Cons
Only really works for python apps
Doesn't solve problem, only delays it
Slide 73
Slide 73 text
Constraints around
AWS Limitations
Here are the main points we will discuss
1. Concurrent Invocation Limitations
2. API Gateway Limitations
3. Function Duration Limitations
4. boto/Kinesis Limitations
5. λ Dependency Size Limit
Slide 74
Slide 74 text
Collecting
Metrics from
λ Functions
Slide 75
Slide 75 text
Cannot leverage StatsD for
monitoring
λs lack "hostname"s / other unique ID per
function
Slide 76
Slide 76 text
λs are not long lived
StatsD / DogStatsd are long lived
processes
Emitting metrics incurs additional duration
time
Lacking UID, metrics may overwrite each
other
Slide 77
Slide 77 text
Mitigation Strategies
Use more λs!
CloudWatch Subscription Filter λ
Push metrics downstream from CW logs
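
A sketch of a subscription filter λ. CloudWatch Logs delivers the batch as base64-encoded, gzip-compressed JSON under `event["awslogs"]["data"]`. The `METRIC name value` log convention and the `emit_metric` callable are assumptions made for this example, not the format used in the talk.

```python
import base64
import gzip
import json


def decode_log_events(event):
    """Decode the base64 + gzip payload CloudWatch Logs sends to a subscribed λ."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return payload["logGroup"], payload["logEvents"]


def make_handler(emit_metric):
    """Build a handler that forwards metric lines found in a batch of log events."""

    def handler(event, context):
        log_group, log_events = decode_log_events(event)
        for log_event in log_events:
            parts = log_event["message"].split()
            # Hypothetical convention: functions log "METRIC <name> <value>".
            if len(parts) == 3 and parts[0] == "METRIC":
                emit_metric(log_group, parts[1], float(parts[2]), log_event["timestamp"])
        return len(log_events)

    return handler
```

The functions being monitored only write a log line; the network call to the metrics backend happens here, outside their invocation time.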
Slide 78
Slide 78 text
Logging +
Metrics
MONITORING / TELEMETRY
Slide 79
Slide 79 text
Pros
Operate on batches, similar to Kinesis
consumer or CRON λ
No need for blocking HTTP calls in λ
invocation
Cons
Single threaded - only ONE Subscription
Filter allowed / log group
Slide 80
Slide 80 text
Recap
Some key ideas to take away
from today
Constraints are GOOD!
Enforces responsible use of resources
Not all use cases suit λs
Stateful? Long lived? Not λ.
Trade offs, Trade offs, Trade offs!
At scale, compromise!
Slide 81
Slide 81 text
Have Qs about λs?
I'd be happy to expound as a professional courtesy.
Architecture
I helped design and implement critical
serverless subsystems
Packaging and Deployment
Best Practices
I established best practices for CI/CD pipeline
integration and automated testing
Telemetry and Alerting
I worked with key stakeholders to establish
metric schemas and alerting practices