Serverless quantified: Development issues & The great 100ms barrier

We are probably past the hype curve on serverless technologies. All major cloud providers offer development support beyond just runtimes, and success stories of large applications designed and deployed according to serverless principles are arriving regularly. This talk conveys numbers based on interviews, surveys, web data and experiments: What are the painful limitations developers still have to work around? Which patterns are commonly implemented? And are 100ms microbilling periods the end of the game?

Transcript

  1. Zürcher Fachhochschule
    Serverless quantified: Development issues & The great 100ms barrier
    Josef Spillner
    Service Prototyping Lab (blog.zhaw.ch/splab)
    Sep 04, 2019 | 2nd Tampere Serverless Meetup

  2. Developers having issues - news @ 11
    Conventional issues & pains
    Now serverless solves most of them ... why bother?
    [https://blog.grio.com/2016/04/the-importance-of-good-posture-for-software-developers.html]
    young audience: also that kind of pain awaits you...
    SAD instead of RAD
    high cost for just trying
    auto-scaling logic
    stale/leaky behaviour
    too much boilerplate
    manual resource config
    intermediate images
    functionality-cost unclear
    monolithergence

  3. Specific serverless / FaaS issues
    Mixed-method study conducted in 2017/18
    https://peerj.com/preprints/27005/
    https://doi.org/10.1016/j.jss.2018.12.013

  4. FaaS numbers & patterns
    5 prevalent patterns

    function pinging: periodically pinging functions with artificial payloads to keep containers warm
    FaaS constraint: scheduling priorities

    function chain: chaining functions to circumvent maximum execution time limits, thereby effectively increasing timeouts
    FaaS constraint: few-minutes timeouts

    routing function: a central function is configured to receive all requests and dispatch them
    FaaS constraint: API gateway pricing per registered function

    externalized state: all state is stored in an external database
    FaaS constraint: statelessness

    oversized function: excessive memory for higher speed
    FaaS constraint: no profiles
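
    The routing function pattern from the list above can be illustrated with a minimal sketch. The event shape (a dict with a "path" key), the handler names and the response format are assumptions made for illustration; they are not taken from the study or from any provider's API.

```python
# Hypothetical handlers; a real application would hold its business logic here.
def list_items(event):
    return {"statusCode": 200, "body": "items"}

def create_item(event):
    return {"statusCode": 201, "body": "created"}

# One registered function receives every request and dispatches by path,
# so only a single function needs to be registered with the API gateway.
ROUTES = {
    "/items": list_items,
    "/items/new": create_item,
}

def router(event, context=None):
    handler = ROUTES.get(event.get("path"))
    if handler is None:
        return {"statusCode": 404, "body": "unknown route"}
    return handler(event)
```

    The trade-off is that the single deployed function now contains the code of all handlers, which works against fine-grained functions.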

  5. FaaS issues
    granularity
    → is current granularity (esp. timing) still a good fit?

  6. Accounting & billing periods timeline

  7. Utility computing → utility billing
    Utility computing [Yeo et al. 2010]:

    provide computing services on-demand

    charge based on usage

    (charge based on service quality)

  8. Are 100ms intervals an issue?
    Acknowledge: Support the hypothesis
    Challenge: Advance towards finer granularity
    Exploit: Make the best out of it

  9. Are 100ms intervals an issue? Details.
    Major use cases for serverless
    [https://iot.do/ngd-openfog-fog-computing-2016-10]
    sensor data ingestion
    mobile app notification
    cloud service glue code
    load     billing period   loss   occurrence
    55ms     100ms            45%    ||||||||||
    155ms    100ms            23%    |||||
    255ms    100ms            15%    ||||||
    ?
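
    The loss column in the table above can be reproduced with a few lines, assuming that execution time is rounded up to the next full 100ms billing period and that "loss" means the idle share of the billed time. The function name is made up for this sketch.

```python
import math

def billing_loss(load_ms, period_ms=100):
    """Return (billed_ms, loss) where loss is the idle share of the billed time."""
    billed_ms = math.ceil(load_ms / period_ms) * period_ms
    idle_ms = billed_ms - load_ms
    return billed_ms, idle_ms / billed_ms

for load_ms in (55, 155, 255):
    billed_ms, loss = billing_loss(load_ms)
    print(f"{load_ms}ms load, billed {billed_ms}ms, loss {loss:.1%}")
```

    For the three loads in the table this gives 45%, 22.5% (shown rounded as 23%) and 15%.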

  10. Are 100ms intervals an issue? Details.
    = 10.89 USD load + 8.91 USD idle “penalty“
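
    The split above is consistent with a 55ms load billed as a full 100ms period: 55% of the bill pays for load and 45% for idle time. The sketch below checks that arithmetic. The total of 19.80 USD is simply the sum of the two amounts on the slide, and the 55/100 ratio is an assumption inferred from those amounts, not something the slide states.

```python
def split_bill(total_usd, load_ms, billed_ms):
    """Split a bill into the part paying for load and the idle 'penalty'."""
    load_cost = total_usd * load_ms / billed_ms
    return round(load_cost, 2), round(total_usd - load_cost, 2)

load_usd, idle_usd = split_bill(19.80, load_ms=55, billed_ms=100)
print(load_usd, idle_usd)
```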

  11. Some data from SAM experiments

    SAM = Serverless Application Model = "deployment for Lambda-based applications"

    experiment: generic invocation of >300 SAMs from SAR (AWS Serverless Application Repository)

    result: «While failed functions (often due to timeouts) often take longer than 100ms, all successful functions have an average execution time of less, often even less than 50ms. Moreover, the used memory is only about a fourth of the minimum allocation of 128 MB.»

  12. Some data from Binaris “FaaS-SO“

    Stack Overflow emulation as if served over FaaS with HTTP trigger

    ca. 80 concurrent HTTP requests
    ● result: median response time 70 ms, minimum 50 ms
    ☑ Acknowledge
    Support the hypothesis
    [https://blog.binaris.com/serverless-at-scale/]

  13. Exploit: Problem statement
    Any interval billing period will lead to monetary losses (for the consumer) or gains (for the provider). Can the consumer offset the losses by clever scheduling, i.e., reducing idle periods?
    Aggravated problem: Predictive solution impossible in real deployments.
    [Malawski‘16]
    [Malawski et al.‘18]
    100ms

  14. Exploit 1: Memory-duration reshaping
    Cost := duration * memory
    Duration depends on memory (e.g. in AWS, where CPU share scales with the memory allocation)
    Idea: change duration/memory rectangle until “idle loss“ minimised
    Limitations: coarse-grained memory stepping; static memory allocation (but dynamic input data)
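
    The reshaping idea can be sketched as a search over the available memory steps. The sketch assumes an idealised model in which duration is inversely proportional to memory (a fixed amount of "work" in MB*ms) and billing rounds up to 100ms periods. Real functions rarely scale this cleanly, and the 64MB steps listed are illustrative.

```python
import math

def billed_cost(work_mb_ms, memory_mb, period_ms=100):
    """Return (cost in MB*ms, idle_ms) under the idealised duration model."""
    duration_ms = work_mb_ms / memory_mb          # assumption: duration ~ 1/memory
    billed_ms = math.ceil(duration_ms / period_ms) * period_ms
    return billed_ms * memory_mb, billed_ms - duration_ms

def best_memory(work_mb_ms, memory_steps, period_ms=100):
    """Pick the memory step with the lowest billed cost (ties: less memory)."""
    return min(memory_steps,
               key=lambda m: (billed_cost(work_mb_ms, m, period_ms)[0], m))

steps = [128, 192, 256, 320, 384, 448, 512]   # illustrative memory steps
work = 128 * 230                              # a task taking 230ms at 128MB
choice = best_memory(work, steps)
print(choice, billed_cost(work, choice))
```

    In this example 320MB wins: the task then takes 92ms, fits into one billing period and leaves only 8ms idle, whereas 128MB is billed for 300ms with 70ms idle.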

  15. Exploit 2: Solution approach

  16. Algorithmic-economic considerations
    Scenario: “Bag of tasks“ to be processed

    sequential

    parallel (distributed systems mental model challenge)

    combined
    Loss: billed duration n*100ms → avg. loss 1/(2n)
    Aim: reduce the idle time, converge against x*100ms barriers, minimise calls
    100ms
    simulation → no loss if task can start within billing period
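
    The trade-off between the sequential, parallel and combined scenarios can be illustrated with a toy calculation. This is not the simulation from the talk: the task durations are invented, tasks are assigned round-robin to function invocations, and each invocation is billed in full 100ms periods.

```python
import math

PERIOD_MS = 100

def billed(ms):
    """Round an invocation's run time up to full billing periods."""
    return math.ceil(ms / PERIOD_MS) * PERIOD_MS

def schedule(tasks_ms, invocations):
    """Return (billed_ms, idle_ms, makespan_ms) for a round-robin assignment."""
    loads = [sum(tasks_ms[i::invocations]) for i in range(invocations)]
    total_billed = sum(billed(load) for load in loads)
    return total_billed, total_billed - sum(tasks_ms), max(loads)

tasks = [55, 70, 30, 45, 60, 40, 55, 45]   # hypothetical task durations in ms

for n in (len(tasks), 4, 1):               # parallel, combined, sequential
    print(n, schedule(tasks, n))
```

    With these numbers the fully parallel run is billed 800ms for 400ms of work, the combined run 600ms, and the sequential run exactly 400ms, at the price of a makespan that grows from 70ms to 400ms.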

  17. Algorithmic-economic considerations
    Simulation results:
    Analysis:

    greater parallelism (beyond 4-core simulation) would be beneficial

    idle times offset the gains, must be reduced significantly
    Two ways out (open applied research question):

    prediction: know in advance how many tasks to schedule per function instance (FI)

    cooperation: FI fetches tasks on its own

  18. Algorithmic-economic considerations
    Implementation ideas:

    function instances decide on number of tasks (i.e. active pull instead of parameter push)

    implication: leftover tasks → function instances can skip tasks

    implication: avoid empty invocations → filtering in FaaS runtime or proxy function
    (3 conditions: small overhead cost, fast forwarding, small memory allocation)
    (double billing issue: filtering rate of 1:m = 1/m extra invocation cost)

    rich context awareness: overall time limit, time already executed, time remaining
    (e.g. Lambda only reports the last; calculate the second, manually keep track of the first)

    double-heuristic calling - two unknowns: task execution time, invocations needed to empty queue
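
    The active-pull idea with time awareness can be sketched as follows. This is not the code of faasproxy.py or faasconsumer.py. FakeContext is a stand-in that mimics only the get_remaining_time_in_millis() method of the Lambda context object, the queue is a local deque rather than a remote one, and the per-task time estimate is assumed to be known.

```python
from collections import deque

class FakeContext:
    """Stand-in for a FaaS context; tracks the time budget manually."""
    def __init__(self, limit_ms):
        self.limit_ms = limit_ms      # overall time limit (kept track of manually)
        self.executed_ms = 0          # time already executed (calculated)

    def get_remaining_time_in_millis(self):
        return self.limit_ms - self.executed_ms

def consumer(queue, context, est_task_ms):
    """Pull tasks while the estimated next task still fits into the budget."""
    done = []
    while queue and context.get_remaining_time_in_millis() >= est_task_ms:
        task_ms = queue.popleft()
        context.executed_ms += task_ms    # simulate running the task
        done.append(task_ms)
    return done

queue = deque([40, 40, 40, 40])           # hypothetical task durations in ms
context = FakeContext(limit_ms=100)
processed = consumer(queue, context, est_task_ms=40)
print(processed, list(queue))
```

    Here the instance processes two tasks and leaves two in the queue for another invocation, which is the "leftover tasks" implication noted above.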

  19. Algorithmic-economic considerations
    Implementation: faasproxy.py and faasconsumer.py

  20. Algorithmic-economic considerations
    Implementation: faasproxy.py and faasconsumer.py

  21. Preliminary results
    Data:

    small savings possible with “greedy“ threshold

    however, at the expense of parallelism (performance)
    Practical output:
    ● simulation
    ● emulation using Lambda cloud function
    ( ) Exploit

    Make the best out of it
    https://github.com/serviceprototypinglab/faas-timesharing

  22. Preliminary results double-check
    Uncertainties remain...

    somewhat convincing only for highly-parallel workloads

    (at expense of duration)

    even with warm containers - startup times of language environments
    ● low-latency fetch (e.g. Alluxio instead of S3) → better results expected

  23. Challenge: Sub-ms FaaS offerings
    OS-level timers

    Linux hi-res timer: 100 Hz → 10ms intervals (common)
    1000 Hz → 1ms intervals (Jan‘01, hrtimers: Nov‘07)

    “tickless“ kernels + preemptible scheduling
    ● real-time patch merged to mainline in Jul‘19
    ● LF Real-Time Linux project
    Container-level timers

    Docker fair scheduler & real-time scheduler

    per-container limits & priorities (cgroups-based)

    no real-time metering

    side-car container / auxiliary process needed
    Alternative isolation mechanisms

    Singularity container engine
    ● unikernels for faster startup times
    ( ) Challenge

    Advance towards finer granularity
    ongoing research - interested?
