Capability-based scheduling of scientific workflows in the cloud

This is a presentation I held at the DATA conference 2020. The talk is about my research paper entitled "Capability-based scheduling of scientific workflows in the cloud".

I presented a distributed task scheduling algorithm and a software architecture for a system that executes scientific workflows in the cloud. The main challenges I addressed were (i) capability-based scheduling, which means that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines in the cloud, (ii) a dynamic environment where resources can be added and removed on demand, (iii) scalability in terms of scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance, because faults can happen at any time in the cloud. My software architecture consists of loosely coupled components that communicate with each other through an event bus and a shared database. Workflow graphs are converted to process chains that can be scheduled independently. My scheduling algorithm collects the distinct required capability sets of the process chains, asks the agents which of these sets they can manage, and then assigns process chains accordingly. I presented the results of four experiments I conducted to evaluate whether my approach meets these challenges. An implementation of my algorithm and software architecture is publicly available in the open-source workflow management system “Steep”.
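
The three scheduling steps described above (collect distinct capability sets, ask the agents, assign process chains) can be sketched roughly as follows. This is a minimal in-memory illustration and not the Steep implementation: the names `ProcessChain`, `Agent`, `pick`, and `schedule` are hypothetical, a plain dictionary stands in for the shared database, and a direct method call stands in for the event bus. The capability names in the example are taken from the slides.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ProcessChain:
    id: str
    required: frozenset  # capabilities this chain needs (hypothetical model)


@dataclass
class Agent:
    id: str
    capabilities: frozenset
    busy: bool = False

    def pick(self, capability_sets):
        """Return the first offered set this agent can handle, or None."""
        if self.busy:
            return None
        for required in capability_sets:
            if required <= self.capabilities:  # subset test
                return required
        return None


def schedule(chains, agents):
    """Run one scheduling pass; return (assignments, unserved capability sets)."""
    # Step 1: group chains by required capability set, so the number of
    # distinct sets (not the number of chains) determines what agents are asked.
    queues = {}
    for chain in chains:
        queues.setdefault(chain.required, deque()).append(chain)

    assignments = {}
    for agent in agents:
        # Step 2: offer the agent only the sets that still have waiting chains.
        open_sets = [s for s, q in queues.items() if q]
        chosen = agent.pick(open_sets)
        if chosen is None:
            continue
        # Step 3: fetch one chain with the chosen set and assign it.
        chain = queues[chosen].popleft()
        agent.busy = True
        assignments[chain.id] = agent.id

    unserved = [s for s, q in queues.items() if q]
    return assignments, unserved


chains = [
    ProcessChain("PC1", frozenset({"Docker"})),
    ProcessChain("PC2", frozenset({"Python", "TensorFlow"})),
    ProcessChain("PC3", frozenset({"GPU"})),
    ProcessChain("PC4", frozenset({"In-memory"})),
]
agents = [
    Agent("A1", frozenset({"C++"})),
    Agent("A2", frozenset({"Docker"}), busy=True),
    Agent("A3", frozenset({"Docker"})),
    Agent("A4", frozenset({"Python", "TensorFlow"})),
    Agent("A5", frozenset({"GPU", "Docker"})),
]
assignments, unserved = schedule(chains, agents)
print(assignments)  # {'PC1': 'A3', 'PC2': 'A4', 'PC3': 'A5'}
print(unserved)     # [frozenset({'In-memory'})]
```

In this run no agent offers the In-memory capability, so that set stays unserved. According to the slides, this is the situation in which the scheduler asks the cloud manager for additional agents before repeating the pass.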

Michel Krämer

July 07, 2020

Transcript

  1. Cloud-based Scientific Workflows

    Dynamic environment, scalability, fault tolerance. Deelman et al. (2018). The future of scientific workflows. Scientific workflows: automated data processing, independent services, distributed environments. (Diagram: workflow graph with tasks A, B, C, D, E.)
  8. Challenges of cloud-based workflow management

    Dynamic environment, scalability, fault tolerance. Deelman et al. (2018). The future of scientific workflows
  12. Naïve scheduling

    Service instances with required capabilities: A (Docker), A (Docker), B (GPU), C (In-memory), C (In-memory), D (Docker, GPU), ... Virtual machines: VM1, VM2, VM3, VM4, ..., VMn
  14. Naïve scheduling

    Service instances with required capabilities: A (Docker), A (Docker), B (GPU), C (In-memory), C (In-memory), D (Docker, GPU), ... Virtual machines: VM1, VM2, VM3, VM4, ..., VMn, each labeled with the capabilities Docker, GPU, In-memory, ...
  15. Overview

    Components: HTTP server (H), Controller (C), Scheduler (S), Agent (A), Cloud manager (M). Instance 1 ... Instance n. Database. Event bus.
  23. Controller

    Read workflow from database. Generate new process chains. Save process chains into database. Wait for results. (Diagram: workflow graph with tasks A, B, C, D, E.)
  32. Scheduler: 500 process chains, 2 agents

    Process chains with required capabilities: PC1 (Docker), PC2 (Docker), ..., PCi (Python, TensorFlow), PCi+1 (GPU), ..., PCj (GPU), PCj+1 (Docker), ... Agents: A1 (C++), A2 (busy, Docker)
  33. Scheduler: distinct required capability sets

  34. Scheduler: capability sets found (slides 34 to 35)

    Docker; Python, TensorFlow; GPU
  36. Scheduler: agent replies

    busy, ???
  37. Scheduler and Cloud Manager: request agents

  38. Cloud Manager: create

  39. Scheduler: 500 process chains, 5 agents

    Agents: A1 (C++), A2 (busy, Docker), A3 (Docker), A4 (Python, TensorFlow), A5 (GPU, Docker)
  40. Scheduler: distinct required capability sets

  41. Scheduler: capability sets found (slides 41 to 42)

    Docker; Python, TensorFlow; GPU
  43. Scheduler: agent replies

    busy, OK, OK, OK, ???
  44. Scheduler: agents shown next to the capability sets (slides 44 to 45)

    A3 (Docker), A4 (Python, TensorFlow), A5 (GPU, Docker)
  46. Scheduler: fetch process chain

    Capability set: Docker
  47. Scheduler: process chains assigned, agents become busy (slides 47 to 50)

    A3 (busy, Docker), A4 (busy, Python, TensorFlow), A5 (busy, GPU, Docker)
  51. Scheduler: repeat

    Agents: A1 (C++), A2 (busy, Docker), A3 (busy, Docker), A4 (busy, Python, TensorFlow), A5 (busy, GPU, Docker). Remaining process chains: PC1 (Docker), ..., PCj (GPU), PCj+1 (Docker), ...
  52. Experiment 1: Capability-based scheduling

    100 process chains, 4 distinct capability sets. Correct allocation. (Chart legend: R1, R2, R3, R4, R3+R4; process chain start and end; agent killed; fault.)
  53. Experiment 2: Dynamic environment

    1000 process chains, 1 agent at the beginning, 8 agents at the end. (Chart legend: R1, R2, R3, R4, R3+R4; process chain start and end; agent killed; fault.)
  54. Experiment 3: Scalability (process chains)

    150,000 process chains, up to 8 agents. Load managed well. (Chart legend: R1, R2, R3, R4, R3+R4; process chain start and end; agent killed; fault.)