
Serverless quantified: Development issues & The great 100ms barrier


We are probably past the hype curve on serverless technologies. All major cloud providers offer development support beyond just runtimes, and success stories of large applications designed and deployed according to serverless principles are arriving regularly. This talk conveys numbers based on interviews, surveys, web data and experiments: What are the painful limitations developers still have to work around? Which patterns are commonly implemented? And are 100ms microbilling periods the end of the game?

Transcript

  1. Zürcher Fachhochschule | Serverless quantified: Development issues & The great 100ms barrier
     Josef Spillner <[email protected]>, Service Prototyping Lab (blog.zhaw.ch/splab)
     Sep 04, 2019 | 2nd Tampere Serverless Meetup
  2. Developers having issues - news @ 11
     Conventional issues & pains: serverless now solves most of them ... so why bother?
     [https://blog.grio.com/2016/04/the-importance-of-good-posture-for-software-developers.html]
     (young audience: that kind of pain awaits you too...)
     Pain points:
     • SAD instead of RAD
     • high cost for just trying
     • auto-scaling logic
     • stale/leaky behaviour
     • too much boilerplate
     • manual resource config
     • intermediate images
     • functionality-cost unclear
     • monolithergence
  3. Specific serverless / FaaS issues
     Mixed-method study conducted in 2017/18
     https://peerj.com/preprints/27005/
     https://doi.org/10.1016/j.jss.2018.12.013
  4. FaaS numbers & patterns: 5 prevalent patterns
     • function pinging: periodically pinging functions with artificial payloads to keep containers warm (FaaS constraint: scheduling priorities) - a minimal sketch follows after this list
     • function chain: chaining functions to circumvent the per-function maximum execution time limit, effectively extending the total timeout (FaaS constraint: few-minutes timeouts)
     • routing function: a central function is configured to receive all requests and dispatch them (FaaS constraint: API gateway pricing per registered function)
     • externalized state: all state is stored in an external database (FaaS constraint: statelessness)
     • oversized function: excessive memory allocation for higher speed (FaaS constraint: no profiles)
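
A minimal sketch of the function-pinging pattern, assuming AWS Lambda and boto3; the function name and the `warmup` payload marker are hypothetical:

```python
import json
import boto3

# Client for invoking the target function; assumes AWS credentials are configured.
lambda_client = boto3.client("lambda")

def ping_handler(event, context):
    """Scheduled (e.g. every few minutes) warm-up trigger."""
    lambda_client.invoke(
        FunctionName="my-target-function",   # hypothetical function name
        InvocationType="Event",              # asynchronous, fire-and-forget
        Payload=json.dumps({"warmup": True}),
    )

def target_handler(event, context):
    """Target function: recognise the artificial payload and return early."""
    if isinstance(event, dict) and event.get("warmup"):
        return {"warmed": True}   # container stays warm, minimal billed time
    # ... real request handling would go here ...
```
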
  5. Utility computing → utility billing
     Utility computing [Yeo et al. 2010]:
     • provide computing services on-demand
     • charge based on usage
     • (charge based on service quality)
  6. Are 100ms intervals an issue? Three possible stances:
     • Acknowledge: support the hypothesis
     • Challenge: advance towards finer granularity
     • Exploit: make the best out of it
  7. Are 100ms intervals an issue? Details.
     Major use cases for serverless [https://iot.do/ngd-openfog-fog-computing-2016-10]:
     sensor data ingestion, mobile app notification, cloud service glue code
     Loss at 100ms billing granularity:
       load     billing   loss   occurrence
       55ms     100ms     45%    ||||||||||
       155ms    100ms     23%    |||||
       255ms    100ms     15%    ||||||
  8. Are 100ms intervals an issue? Details.
     Example bill: 10.89 USD for actual load + 8.91 USD idle "penalty"
     (i.e. 45% of the 19.80 USD total pays for idle time; worked out below)
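
A small arithmetic sketch of the table and the bill above. The round-up-per-interval billing model is the standard one; reading the USD figures as a 55%/45% useful/idle split is an inference from the slide's numbers:

```python
import math

def billed_ms(duration_ms, granularity_ms=100):
    """Execution time rounded up to the next billing interval."""
    return math.ceil(duration_ms / granularity_ms) * granularity_ms

def idle_loss(duration_ms, granularity_ms=100):
    """Fraction of the bill that pays for idle time."""
    billed = billed_ms(duration_ms, granularity_ms)
    return (billed - duration_ms) / billed

for load in (55, 155, 255):
    print(f"{load}ms load -> billed {billed_ms(load)}ms, loss {idle_loss(load):.1%}")
# 55ms  -> billed 100ms, loss 45.0%
# 155ms -> billed 200ms, loss 22.5% (the slide rounds to 23%)
# 255ms -> billed 300ms, loss 15.0%

# The USD figures appear to match the 55ms case:
# 10.89 + 8.91 = 19.80 USD total, and 8.91 / 19.80 = 45% idle "penalty".
```
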
  9. Some data from SAM experiments
     • SAM = Serverless Application Model = "deployment for Lambda-based applications"
     • experiment: generic invocation of >300 SAMs from SAR (the Serverless Application Repository)
     • result: "While failed functions (often due to timeouts) often take longer than 100ms, all successful functions have an average execution time of less, often even less than 50ms. Moreover, the used memory is only about a fourth of the minimum allocation of 128 MB."
  10. Some data from Binaris "FaaS-SO"
      • Stack Overflow emulation, as if served over FaaS with an HTTP trigger (a measurement sketch follows below)
      • ca. 80 concurrent HTTP requests
      • result: median response time 70 ms, minimum 50 ms
      ☑ Acknowledge: support the hypothesis
      [https://blog.binaris.com/serverless-at-scale/]
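
A sketch of such a measurement setup, not Binaris's actual harness; the endpoint URL is a placeholder:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

ENDPOINT = "https://example.com/faas-so/question"  # placeholder URL

def timed_request(_):
    """Fetch the endpoint once and return the latency in milliseconds."""
    start = time.perf_counter()
    urlopen(ENDPOINT).read()
    return (time.perf_counter() - start) * 1000

# ca. 80 concurrent requests, mirroring the load level quoted above
with ThreadPoolExecutor(max_workers=80) as pool:
    latencies = list(pool.map(timed_request, range(80)))

print(f"median: {statistics.median(latencies):.0f} ms, min: {min(latencies):.0f} ms")
```
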
  11. Exploit: Problem statement
      Any interval billing period will lead to monetary losses (for the consumer) or gains (for the provider).
      Can the consumer offset the losses by clever scheduling, i.e. by reducing idle periods?
      Aggravated problem: a predictive solution is impossible in real deployments. [Malawski '16] [Malawski et al. '18]
  12. Exploit 1: Memory-duration reshaping
      Cost := duration * memory
      Duration :=~ memory (e.g. in AWS, where CPU share scales with memory, more memory typically means shorter duration)
      Idea: change the duration/memory rectangle until the "idle loss" is minimised (sketch below)
      Limitations: coarse-grained memory stepping; static memory allocation (but dynamic input data)
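
A sketch of the reshaping idea under the simplifying assumption duration = k/memory (AWS allocates CPU share proportional to memory); the memory steps and the workload constant k are illustrative:

```python
import math

GRANULARITY_MS = 100
# Coarse-grained memory steps, as offered by e.g. AWS Lambda
MEMORY_STEPS_MB = [128, 192, 256, 320, 384, 448, 512]

def cost(duration_ms, memory_mb):
    """Relative cost: billed duration (rounded up to 100ms) times memory."""
    billed = math.ceil(duration_ms / GRANULARITY_MS) * GRANULARITY_MS
    return billed * memory_mb

def best_allocation(k):
    """Assumed model: duration = k / memory (more memory -> more CPU)."""
    return min((cost(k / m, m), m, k / m) for m in MEMORY_STEPS_MB)

c, mem, dur = best_allocation(k=20000.0)
# Ties are possible: several rectangles can cover the same billed area.
print(f"pick {mem} MB: ~{dur:.0f} ms run, relative cost {c:.0f}")
```
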
  13. Algorithmic-economic considerations
      Scenario: a "bag of tasks" to be processed
      • sequentially
      • in parallel (a distributed-systems mental-model challenge)
      • combined
      Loss: for durations around n*100ms, the average loss is 1/(2n), since a uniformly distributed remainder leaves ~50ms idle per call
      Aim: reduce the idle time, converge against the x*100ms barriers, minimise calls
      100ms simulation (a small sketch follows below) → no loss if a task can start within the current billing period
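
A small simulation along these lines, assuming synthetic uniform task durations; this is not the talk's simulator, only an illustration of how batching tasks into fewer calls pushes the loss towards the x*100ms barriers:

```python
import math
import random

GRANULARITY = 100  # ms billing interval

def simulate(num_tasks=1000, calls=1000):
    """Idle-loss fraction when num_tasks tasks are spread over `calls`
    sequential function invocations (synthetic 10-90 ms task durations)."""
    tasks = [random.uniform(10, 90) for _ in range(num_tasks)]
    per_call = num_tasks // calls
    busy = billed = 0.0
    for i in range(calls):
        duration = sum(tasks[i * per_call:(i + 1) * per_call])
        busy += duration
        billed += math.ceil(duration / GRANULARITY) * GRANULARITY
    return 1 - busy / billed

# One task per call loses ~50 ms on average; batching converges towards 0.
for calls in (1000, 100, 10, 1):
    print(f"{calls:4d} calls -> idle loss {simulate(calls=calls):.1%}")
```
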
  14. Algorithmic-economic considerations
      Simulation results, analysis:
      • greater parallelism (beyond the 4-core simulation) would be beneficial
      • idle times offset the gains and must be reduced significantly
      Two ways out (open applied research question):
      • prediction: know in advance how many tasks to schedule per function instance (FI)
      • cooperation: the FI fetches tasks on its own
  15. Algorithmic-economic considerations
      Implementation ideas (a pull-based sketch follows after this list):
      • function instances decide on the number of tasks (i.e. active pull instead of parameter push)
      • implication: leftover tasks → function instances can skip tasks
      • implication: avoid empty invocations → filtering in the FaaS runtime or in a proxy function (3 conditions: small overhead cost, fast forwarding, small memory allocation) (double-billing issue: a filtering rate of 1:m adds 1/m extra invocation cost)
      • rich context awareness: overall time limit, time already executed, time remaining (e.g. Lambda only reports the last - calculate the second, manually keep track of the first)
      • double-heuristic calling - two unknowns: task execution time, invocations needed to empty the queue
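
A sketch of the active-pull idea; fetch_task(), estimate_task_ms() and process() are hypothetical helpers, while get_remaining_time_in_millis() is the documented AWS Lambda context method:

```python
SAFETY_MARGIN_MS = 200  # reserve for fetch and teardown overhead

def handler(event, context):
    """Active pull: keep fetching tasks while another one still fits into
    the remaining execution time reported by the Lambda context."""
    processed = 0
    while True:
        estimate = estimate_task_ms()   # hypothetical: expected task duration
        # get_remaining_time_in_millis() is the real Lambda context method
        if context.get_remaining_time_in_millis() < estimate + SAFETY_MARGIN_MS:
            break                       # leave leftovers to the next instance
        task = fetch_task()             # hypothetical: pull from external queue
        if task is None:
            break                       # empty queue: avoid empty invocations
        process(task)                   # hypothetical worker
        processed += 1
    return {"processed": processed}
```
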
  16. Preliminary results
      Data:
      • small savings possible with a "greedy" threshold
      • however, at the expense of parallelism (performance)
      Practical output:
      • simulation
      • emulation using a Lambda cloud function
      (☑) Exploit: make the best out of it
      https://github.com/serviceprototypinglab/faas-timesharing
  17. Preliminary results double-check
      Uncertainties remain...
      • somewhat convincing only for highly-parallel workloads (at the expense of duration)
      • even with warm containers: startup times of language environments
      • low-latency fetch (e.g. Alluxio instead of S3) → better results expected
  18. Challenge: Sub-ms FaaS offerings
      OS-level timers (a quick resolution check follows below):
      • Linux timer resolution: 100 Hz → 10ms intervals (common); 1000 Hz → 1ms intervals (Jan '01); hrtimers (Nov '07)
      • "tickless" kernels + preemptible scheduling
      • real-time patch merged to mainline in Jul '19; LF Real-Time Linux project
      Container-level timers:
      • Docker fair scheduler & real-time scheduler
      • per-container limits & priorities (cgroups-based)
      • no real-time metering → side-car container / auxiliary process needed
      Alternative isolation mechanisms:
      • Singularity container engine
      • unikernels for faster startup times
      (☑) Challenge: advance towards finer granularity - ongoing research, interested?
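
A quick way to inspect the OS-level timer granularity mentioned above, using Python's time.clock_getres() (Unix only):

```python
import time

# Report the resolution of the POSIX clocks backing OS-level timers.
# On an hrtimer-enabled Linux this typically prints 1e-09 (1 ns),
# far below the 100ms billing granularity discussed in this deck.
for name in ("CLOCK_MONOTONIC", "CLOCK_REALTIME"):
    clock = getattr(time, name)
    print(f"{name}: {time.clock_getres(clock)} s resolution")
```
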