Lessons Learned from Running Serverless Workloads on Mesos - Diana Arroyo & Alek Slominski, IBM Research (MesosCon EU 2016 slides)

aslom
September 01, 2016

Serverless computing platforms promise new capabilities that make writing scalable micro-services easier and more cost-effective. These platforms provide a distributed compute service that executes application logic in response to events. The workload demands of serverless platforms can require thousands of concurrent short-lived containers to be created and destroyed in milliseconds. In this talk we share the lessons learned when running these workloads in a Mesos environment to meet the performance demands of OpenWhisk, an open-source serverless computing platform. We present our experience running workload experiments in Mesos, and share Mesos tuning tips and workload generation code to make Mesos an ideal platform for serverless workloads.


Transcript

  1. Before we start …
     • How many people have heard of serverless computing?
     • How many people have tried or are planning to try a serverless platform?
     • Other than AWS Lambda?
     • How many people are using a serverless platform in production?? With Mesos???
  2. Serverless: Quo Vadis?
     • We hope our serverless workload generator is a useful tool for anybody looking at serverless services who wants to compare them in more depth.
     • We plan to write a blog post about the experience of serverless workload generation and benchmarking, and to open-source the benchmark.
     Questions, feedback, or want to tell me where I am wrong? Aleksander Slominski @aslom
  3. Serverless Workload Characteristics
     • Serverless workloads can require thousands of concurrent short-lived containers to be created and destroyed in milliseconds
     • Container aka Action aka Function aka …
       – Depends on the serverless service, framework, ...
     • Required operations:
       – Start a lot of actions (short-lived containers)
       – Generate work: send request, generate response, and repeat
       – Actions run for some time to allow for reuse (cold vs. hot)
  4. Serverless Workload Benchmark Goals
     • Simulate the lifecycle of a serverless action as it takes part in a serverless workload
     • Minimal scenario:
       – Test serverless action start time
       – Send N requests and validate the responses
       – Pause / resume the action as needed
       – Stop (kill) actions
     • Scenario parameters: how many actions are started, when, for how long, etc.
     • A workload runs multiple scenarios (in sequence, in parallel, etc.)
     • Gather statistics about workload execution
       – Enough to learn how well test environments handle such scenarios
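The scenario parameters and statistics listed in slide 4 could be represented along the following lines. This is only a minimal sketch: `ScenarioParams`, its field names, and `summarize_start_times` are hypothetical names chosen for illustration and are not taken from the authors' workload scripts.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class ScenarioParams:
    """Hypothetical scenario parameters (field names are illustrative)."""
    actions: int              # how many actions are started
    start_delay_s: float      # when they are started, relative to scenario start
    requests_per_action: int  # N requests sent to each action
    run_for_s: float          # how long actions are kept alive

    @property
    def total_requests(self) -> int:
        # Each action receives the same number of requests.
        return self.actions * self.requests_per_action


def summarize_start_times(start_times_ms):
    """Summarize measured action start times, given in milliseconds."""
    if not start_times_ms:
        raise ValueError("need at least one start-time sample")
    ordered = sorted(start_times_ms)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "min_ms": ordered[0],
        "mean_ms": mean(ordered),
        "median_ms": median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }
```

For example, `summarize_start_times([120, 80, 100, 400, 90])` reports a median of 100 ms but a 95th percentile of 400 ms, which is the kind of tail a single slow container start produces.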
  5. Simple Scenario
     [Diagram: the test driver (overall workload) connects over WebSocket to Scenario Instances 1, 2, …, S; each scenario instance connects over WebSocket to its actions.]
  6. Extended Workload Scenario
     [Diagram labels: Test driver (overall workload); Test driver (overall aggregate scenario 1); Aggregate Scenario 1-1; Scenario Instance 1-1-2; Scenario Instance S; Action …]
  7. Simple Setup Scenario
     Setup: Docker
     • Start the test driver container; once running, it opens listening sockets and starts S scenario containers
       – docker run driver -e setup_for_scenario_containers
     • Each scenario container connects to the driver over a websocket and starts A action containers
       – docker run scenario -e setup_for_action_containers -e WS_CALLBACK=ws://test_driver:port
     • Each action container, once started, connects back to its scenario container over a websocket to ask for requests
       – docker run hello-action WS_CALLBACK=ws://scenario:port
  8. Simple Scenario Execution
     Execution:
     • After starting S scenario containers, the test driver container waits on a websocket for results from the scenario containers
     • After starting A action containers, each scenario container waits for a websocket connection from each action container, then sends N requests and waits for the responses
     • After starting, each action container sends “ready” over the websocket, then waits for requests, processes each request (sleeps for M milliseconds), and sends the response back
     End result:
     • 1 + S + S * A containers running (driver container + scenario containers + action containers)
     • S * A * N requests processed
     • Test duration: ideal time (with zero startup time): N * M milliseconds
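The execution flow and end-result arithmetic of slide 8 can be sketched in a single process. In this sketch, `asyncio` queues stand in for websockets and coroutines stand in for containers, so it illustrates the message flow only and does not measure container startup. All function names are hypothetical and not taken from the authors' scripts.

```python
import asyncio


async def action(requests, responses, m_ms):
    """Stand-in for an action container: send "ready", then serve requests."""
    await responses.put("ready")
    while True:
        request = await requests.get()
        if request is None:               # stop signal from the scenario
            return
        await asyncio.sleep(m_ms / 1000)  # "process" the request: sleep M ms
        await responses.put(("done", request))


async def scenario(a, n, m_ms):
    """Stand-in for a scenario container: start A actions, send N requests each."""
    async def drive_one():
        requests, responses = asyncio.Queue(), asyncio.Queue()
        task = asyncio.create_task(action(requests, responses, m_ms))
        assert await responses.get() == "ready"
        validated = 0
        for i in range(n):
            await requests.put(i)
            if await responses.get() == ("done", i):  # validate the response
                validated += 1
        await requests.put(None)
        await task
        return validated

    return sum(await asyncio.gather(*(drive_one() for _ in range(a))))


async def driver(s, a, n, m_ms):
    """Stand-in for the test driver: start S scenarios and collect results."""
    loop = asyncio.get_running_loop()
    started = loop.time()
    processed = sum(await asyncio.gather(*(scenario(a, n, m_ms) for _ in range(s))))
    return {
        "containers": 1 + s + s * a,   # driver + scenarios + actions
        "requests_processed": processed,
        "ideal_ms": n * m_ms,          # ideal duration with zero startup time
        "elapsed_ms": (loop.time() - started) * 1000,
    }


def run_workload(s, a, n, m_ms):
    return asyncio.run(driver(s, a, n, m_ms))
```

Calling `run_workload(2, 3, 4, 1)` models S = 2, A = 3, N = 4, M = 1: 9 containers, 24 requests, and an ideal duration of 4 ms. In a real run, the gap between the elapsed and ideal times is what exposes startup and scheduling overhead.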
  9. Environment Configuration
     • Swarm
       – Version 1.2.4**
     • Mesos
       – 0.27**
     • Docker
       – 1.10.2
     [Diagram labels: Swarm; Mesos Master; Mesos Agent; Mesos Docker Executor(s); Docker Engine; Compute Host]
  10. Current Results
     • Swarm sync issues and deadlock
       – PR2412: Fix double RLock in Mesos cluster
     • Tuning
       – Mesos Master: decrease --allocation_interval
       – Swarm Framework: decrease mesos.offerrefusetimeout
     • Custom Executor
       – One executor per node vs. one executor per container to minimize startup costs
  11. Current Results
     • Results
       – Preliminary testing shows improved performance over Mesos Executor.
     [Chart: "Whisk Requests per Second (Swarm + Mesos)"; series: Mesos Executor, Custom Executor; axis labels: Time (seconds), Number of Containers]
     [Chart: "Whisk Requests per Second (Swarm Only)"; series: Swarm Only; axis labels: Time (seconds), Number of Containers]
  12. Lessons learned
     • Scaling becomes harder as size increases
       – We can easily run 100s of containers but run into issues when running 1000
     • Locking in Swarm
       – Only shows with this workload (different timing of some operations in Swarm-Mesos leads to deadlocks …)
     • Limitations in Docker engine
       – It seems we hit some limits on how many processes can be started per second
       – The limits differ between versions of Docker
  13. Reproducing results and other workloads
     • We are making the workload scripts available:
       – https://github.com/aslom/serverless-workload-scripts
       – Additional measurements are available to track individual action startup, along with scripts to visualize results with pyplot
     • The results are meaningful only in your environment and when you compare them to your workloads
       – Scripts are easy to modify and we will accept PRs
  14. Future work
     • Optimized Docker executor for Mesos
     • Other changes to Mesos to better handle serverless workloads?
     • Test and compare other serverless options:
       – AWS Lambda, Azure Functions, Google Cloud Functions, IBM OpenWhisk, …
       – How would Kubernetes handle a workload like that?
     • Also look for blog posts with more results
       – We will tweet it etc. when it is posted