
Serverless quantified: Development issues & The great 100ms barrier


We are probably past the hype curve on serverless technologies. All major cloud providers offer development support beyond just runtimes, and success stories of large applications designed and deployed according to serverless principles are arriving regularly. This talk conveys numbers based on interviews, surveys, web data and experiments: What are the painful limitations developers still have to work around? Which patterns are commonly implemented? And are 100ms microbilling periods the end of the game?

Transcript

  1. Zürcher Fachhochschule | Serverless quantified: Development issues & The great 100ms barrier
     Josef Spillner <[email protected]>, Service Prototyping Lab (blog.zhaw.ch/splab)
     Sep 04, 2019 | 2nd Tampere Serverless Meetup
  2. Developers having issues - news @ 11
     Conventional issues & pains: serverless now solves most of them ... so why bother?
     [https://blog.grio.com/2016/04/the-importance-of-good-posture-for-software-developers.html]
     (young audience: that kind of pain awaits you too...)
     Pain points:
     • SAD instead of RAD
     • high cost for just trying
     • auto-scaling logic
     • stale/leaky behaviour
     • too much boilerplate
     • manual resource config
     • intermediate images
     • functionality-cost unclear
     • monolithergence
  3. Specific serverless / FaaS issues
     Mixed-method study conducted in 2017/18
     https://peerj.com/preprints/27005/
     https://doi.org/10.1016/j.jss.2018.12.013
  4. FaaS numbers & patterns: 5 prevalent patterns
     • function pinging: periodically pinging functions with artificial payloads to keep containers warm (FaaS constraint: scheduling priorities) - a minimal sketch follows after this list
     • function chain: chaining functions to circumvent the per-function maximum execution time limit, effectively extending the total timeout (FaaS constraint: few-minutes timeouts)
     • routing function: a central function is configured to receive all requests and dispatch them (FaaS constraint: API gateway pricing per registered function)
     • externalized state: all state is stored in an external database (FaaS constraint: statelessness)
     • oversized function: excessive memory allocation for higher speed (FaaS constraint: no profiles)
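
A minimal sketch of the function-pinging pattern, assuming AWS Lambda and boto3; the function name and the `warmup` payload marker are hypothetical:

```python
import json
import boto3

# Client for invoking the target function; assumes AWS credentials are configured.
lambda_client = boto3.client("lambda")

def ping_handler(event, context):
    """Scheduled (e.g. every few minutes) warm-up trigger."""
    lambda_client.invoke(
        FunctionName="my-target-function",   # hypothetical function name
        InvocationType="Event",              # asynchronous, fire-and-forget
        Payload=json.dumps({"warmup": True}),
    )

def target_handler(event, context):
    """Target function: recognise the artificial payload and return early."""
    if isinstance(event, dict) and event.get("warmup"):
        return {"warmed": True}   # container stays warm, minimal billed time
    # ... real request handling would go here ...
```
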
  5. Utility computing → utility billing
     Utility computing [Yeo et al. 2010]:
     • provide computing services on-demand
     • charge based on usage
     • (charge based on service quality)
  6. Are 100ms intervals an issue? Three possible stances:
     • Acknowledge: support the hypothesis
     • Challenge: advance towards finer granularity
     • Exploit: make the best out of it
  7. Are 100ms intervals an issue? Details.
     Major use cases for serverless [https://iot.do/ngd-openfog-fog-computing-2016-10]:
     sensor data ingestion, mobile app notification, cloud service glue code
     Loss at 100ms billing granularity:
       load     billing   loss   occurrence
       55ms     100ms     45%    ||||||||||
       155ms    100ms     23%    |||||
       255ms    100ms     15%    ||||||
  8. Are 100ms intervals an issue? Details.
     Example bill: 10.89 USD for actual load + 8.91 USD idle "penalty"
     (i.e. 45% of the 19.80 USD total pays for idle time; worked out below)
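
A small arithmetic sketch of the table and the bill above. The round-up-per-interval billing model is the standard one; reading the USD figures as a 55%/45% useful/idle split is an inference from the slide's numbers:

```python
import math

def billed_ms(duration_ms, granularity_ms=100):
    """Execution time rounded up to the next billing interval."""
    return math.ceil(duration_ms / granularity_ms) * granularity_ms

def idle_loss(duration_ms, granularity_ms=100):
    """Fraction of the bill that pays for idle time."""
    billed = billed_ms(duration_ms, granularity_ms)
    return (billed - duration_ms) / billed

for load in (55, 155, 255):
    print(f"{load}ms load -> billed {billed_ms(load)}ms, loss {idle_loss(load):.1%}")
# 55ms  -> billed 100ms, loss 45.0%
# 155ms -> billed 200ms, loss 22.5% (the slide rounds to 23%)
# 255ms -> billed 300ms, loss 15.0%

# The USD figures appear to match the 55ms case:
# 10.89 + 8.91 = 19.80 USD total, and 8.91 / 19.80 = 45% idle "penalty".
```
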
  9. Some data from SAM experiments
     • SAM = Serverless Application Model = "deployment for Lambda-based applications"
     • experiment: generic invocation of >300 SAMs from SAR (the Serverless Application Repository)
     • result: "While failed functions (often due to timeouts) often take longer than 100ms, all successful functions have an average execution time of less, often even less than 50ms. Moreover, the used memory is only about a fourth of the minimum allocation of 128 MB."
  10. Some data from Binaris "FaaS-SO"
      • Stack Overflow emulation, as if served over FaaS with an HTTP trigger (a measurement sketch follows below)
      • ca. 80 concurrent HTTP requests
      • result: median response time 70 ms, minimum 50 ms
      ☑ Acknowledge: support the hypothesis
      [https://blog.binaris.com/serverless-at-scale/]
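
A sketch of such a measurement setup, not Binaris's actual harness; the endpoint URL is a placeholder:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

ENDPOINT = "https://example.com/faas-so/question"  # placeholder URL

def timed_request(_):
    """Fetch the endpoint once and return the latency in milliseconds."""
    start = time.perf_counter()
    urlopen(ENDPOINT).read()
    return (time.perf_counter() - start) * 1000

# ca. 80 concurrent requests, mirroring the load level quoted above
with ThreadPoolExecutor(max_workers=80) as pool:
    latencies = list(pool.map(timed_request, range(80)))

print(f"median: {statistics.median(latencies):.0f} ms, min: {min(latencies):.0f} ms")
```
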
  11. Exploit: Problem statement
      Any interval billing period will lead to monetary losses (for the consumer) or gains (for the provider).
      Can the consumer offset the losses by clever scheduling, i.e. by reducing idle periods?
      Aggravated problem: a predictive solution is impossible in real deployments. [Malawski '16] [Malawski et al. '18]
  12. Exploit 1: Memory-duration reshaping
      Cost := duration * memory
      Duration :=~ memory (e.g. in AWS, where CPU share scales with memory, more memory typically means shorter duration)
      Idea: change the duration/memory rectangle until the "idle loss" is minimised (sketch below)
      Limitations: coarse-grained memory stepping; static memory allocation (but dynamic input data)
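
A sketch of the reshaping idea under the simplifying assumption duration = k/memory (AWS allocates CPU share proportional to memory); the memory steps and the workload constant k are illustrative:

```python
import math

GRANULARITY_MS = 100
# Coarse-grained memory steps, as offered by e.g. AWS Lambda
MEMORY_STEPS_MB = [128, 192, 256, 320, 384, 448, 512]

def cost(duration_ms, memory_mb):
    """Relative cost: billed duration (rounded up to 100ms) times memory."""
    billed = math.ceil(duration_ms / GRANULARITY_MS) * GRANULARITY_MS
    return billed * memory_mb

def best_allocation(k):
    """Assumed model: duration = k / memory (more memory -> more CPU)."""
    return min((cost(k / m, m), m, k / m) for m in MEMORY_STEPS_MB)

c, mem, dur = best_allocation(k=20000.0)
# Ties are possible: several rectangles can cover the same billed area.
print(f"pick {mem} MB: ~{dur:.0f} ms run, relative cost {c:.0f}")
```
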
  13. Algorithmic-economic considerations
      Scenario: a "bag of tasks" to be processed
      • sequentially
      • in parallel (a distributed-systems mental-model challenge)
      • combined
      Loss: for durations around n*100ms, the average loss is 1/(2n), since a uniformly distributed remainder leaves ~50ms idle per call
      Aim: reduce the idle time, converge against the x*100ms barriers, minimise calls
      100ms simulation (a small sketch follows below) → no loss if a task can start within the current billing period
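
A small simulation along these lines, assuming synthetic uniform task durations; this is not the talk's simulator, only an illustration of how batching tasks into fewer calls pushes the loss towards the x*100ms barriers:

```python
import math
import random

GRANULARITY = 100  # ms billing interval

def simulate(num_tasks=1000, calls=1000):
    """Idle-loss fraction when num_tasks tasks are spread over `calls`
    sequential function invocations (synthetic 10-90 ms task durations)."""
    tasks = [random.uniform(10, 90) for _ in range(num_tasks)]
    per_call = num_tasks // calls
    busy = billed = 0.0
    for i in range(calls):
        duration = sum(tasks[i * per_call:(i + 1) * per_call])
        busy += duration
        billed += math.ceil(duration / GRANULARITY) * GRANULARITY
    return 1 - busy / billed

# One task per call loses ~50 ms on average; batching converges towards 0.
for calls in (1000, 100, 10, 1):
    print(f"{calls:4d} calls -> idle loss {simulate(calls=calls):.1%}")
```
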
  14. Algorithmic-economic considerations
      Simulation results, analysis:
      • greater parallelism (beyond the 4-core simulation) would be beneficial
      • idle times offset the gains and must be reduced significantly
      Two ways out (open applied research question):
      • prediction: know in advance how many tasks to schedule per function instance (FI)
      • cooperation: the FI fetches tasks on its own
  15. Algorithmic-economic considerations
      Implementation ideas (a pull-based sketch follows after this list):
      • function instances decide on the number of tasks (i.e. active pull instead of parameter push)
      • implication: leftover tasks → function instances can skip tasks
      • implication: avoid empty invocations → filtering in the FaaS runtime or in a proxy function (3 conditions: small overhead cost, fast forwarding, small memory allocation) (double-billing issue: a filtering rate of 1:m adds 1/m extra invocation cost)
      • rich context awareness: overall time limit, time already executed, time remaining (e.g. Lambda only reports the last - calculate the second, manually keep track of the first)
      • double-heuristic calling - two unknowns: task execution time, invocations needed to empty the queue
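
A sketch of the active-pull idea; fetch_task(), estimate_task_ms() and process() are hypothetical helpers, while get_remaining_time_in_millis() is the documented AWS Lambda context method:

```python
SAFETY_MARGIN_MS = 200  # reserve for fetch and teardown overhead

def handler(event, context):
    """Active pull: keep fetching tasks while another one still fits into
    the remaining execution time reported by the Lambda context."""
    processed = 0
    while True:
        estimate = estimate_task_ms()   # hypothetical: expected task duration
        # get_remaining_time_in_millis() is the real Lambda context method
        if context.get_remaining_time_in_millis() < estimate + SAFETY_MARGIN_MS:
            break                       # leave leftovers to the next instance
        task = fetch_task()             # hypothetical: pull from external queue
        if task is None:
            break                       # empty queue: avoid empty invocations
        process(task)                   # hypothetical worker
        processed += 1
    return {"processed": processed}
```
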
  16. Preliminary results
      Data:
      • small savings possible with a "greedy" threshold
      • however, at the expense of parallelism (performance)
      Practical output:
      • simulation
      • emulation using a Lambda cloud function
      (☑) Exploit: make the best out of it
      https://github.com/serviceprototypinglab/faas-timesharing
  17. Preliminary results double-check
      Uncertainties remain...
      • somewhat convincing only for highly-parallel workloads (at the expense of duration)
      • even with warm containers: startup times of language environments
      • low-latency fetch (e.g. Alluxio instead of S3) → better results expected
  18. Challenge: Sub-ms FaaS offerings
      OS-level timers (a quick resolution check follows below):
      • Linux timer resolution: 100 Hz → 10ms intervals (common); 1000 Hz → 1ms intervals (Jan '01); hrtimers (Nov '07)
      • "tickless" kernels + preemptible scheduling
      • real-time patch merged to mainline in Jul '19; LF Real-Time Linux project
      Container-level timers:
      • Docker fair scheduler & real-time scheduler
      • per-container limits & priorities (cgroups-based)
      • no real-time metering → side-car container / auxiliary process needed
      Alternative isolation mechanisms:
      • Singularity container engine
      • unikernels for faster startup times
      (☑) Challenge: advance towards finer granularity - ongoing research, interested?
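
A quick way to inspect the OS-level timer granularity mentioned above, using Python's time.clock_getres() (Unix only):

```python
import time

# Report the resolution of the POSIX clocks backing OS-level timers.
# On an hrtimer-enabled Linux this typically prints 1e-09 (1 ns),
# far below the 100ms billing granularity discussed in this deck.
for name in ("CLOCK_MONOTONIC", "CLOCK_REALTIME"):
    clock = getattr(time, name)
    print(f"{name}: {time.clock_getres(clock)} s resolution")
```
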