Capability-based scheduling of scientific workflows in the cloud

This is a presentation I gave at the DATA 2020 conference. The talk is based on my research paper "Capability-based scheduling of scientific workflows in the cloud".

I presented a distributed task scheduling algorithm and a software architecture for a system that executes scientific workflows in the Cloud. The main challenges I addressed were (i) capability-based scheduling, which means that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines in the Cloud, (ii) a dynamic environment in which resources can be added and removed on demand, (iii) scalability to scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance, because faults can happen in the Cloud at any time. My software architecture consists of loosely coupled components that communicate with each other through an event bus and a shared database. Workflow graphs are converted to process chains that can be scheduled independently. My scheduling algorithm collects the distinct required capability sets of the process chains, asks the agents which of these sets they can manage, and then assigns process chains accordingly. I presented the results of four experiments I conducted to evaluate whether my approach meets these challenges. An implementation of my algorithm and software architecture is publicly available as part of the open-source workflow management system "Steep".
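The three scheduling steps described above (collect the distinct required capability sets, ask the agents which sets they can manage, assign process chains) can be illustrated with a minimal Python sketch. It is not the Steep implementation: the names `ProcessChain`, `Agent`, `supported_sets`, and `schedule_once` are hypothetical, agents are plain in-memory objects rather than remote components reached over an event bus, and picking the first free agent per set is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class ProcessChain:
    id: str
    required: frozenset  # capabilities this process chain needs


@dataclass
class Agent:
    id: str
    capabilities: frozenset  # capabilities this agent offers
    busy: bool = False

    def supported_sets(self, required_sets):
        """Return the capability sets this agent could serve ([] if busy)."""
        if self.busy:
            return []
        return [rs for rs in required_sets if rs <= self.capabilities]


def schedule_once(process_chains, agents):
    """One scheduling pass (hypothetical helper).

    Returns (assignments, unserved): a dict mapping process chain id to
    agent id, and the capability sets for which no free agent was found.
    """
    # 1. Collect the distinct required capability sets, keeping first-seen order.
    required_sets = list(dict.fromkeys(pc.required for pc in process_chains))

    # 2. Ask the agents which sets they can manage; pick one free agent per set.
    #    Assumption: the first matching agent wins (greedy, order-dependent).
    chosen = {}
    taken = set()
    for rs in required_sets:
        for agent in agents:
            if agent.id not in taken and rs in agent.supported_sets([rs]):
                chosen[rs] = agent
                taken.add(agent.id)
                break

    # 3. Assign one pending process chain to each chosen agent.
    assignments = {}
    for rs, agent in chosen.items():
        pc = next(p for p in process_chains if p.required == rs)
        agent.busy = True
        assignments[pc.id] = agent.id

    unserved = [rs for rs in required_sets if rs not in chosen]
    return assignments, unserved


# Example data loosely modelled on the slides (identifiers are illustrative).
chains = [
    ProcessChain("PC1", frozenset({"Docker"})),
    ProcessChain("PC2", frozenset({"Docker"})),
    ProcessChain("PC3", frozenset({"Python", "TensorFlow"})),
    ProcessChain("PC4", frozenset({"GPU"})),
]
agents = [
    Agent("A1", frozenset({"C++"})),
    Agent("A2", frozenset({"Docker"}), busy=True),
    Agent("A3", frozenset({"Docker"})),
    Agent("A4", frozenset({"Python", "TensorFlow"})),
    Agent("A5", frozenset({"GPU", "Docker"})),
]
print(schedule_once(chains, agents))
```

In this sketch, the sets returned as `unserved` stand in for the point where a scheduler would hand the missing capability sets to a cloud manager so that new agents can be created. Because the agent choice is greedy, an agent with several capabilities may be used for a set that a more specialised agent could also have served.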

Michel Krämer

July 07, 2020

Transcript

  1. CAPABILITY-BASED SCHEDULING OF SCIENTIFIC WORKFLOWS IN THE CLOUD MICHEL KRÄMER

  2. THAT’S ME

  3. Data processing requirements: very large data sets, heterogeneous services, automated data processing

  7. Scientific workflows (graph of tasks A to E): automated data processing, independent services, distributed environments. Cloud-based scientific workflows: dynamic environment, scalability, fault tolerance. Source: Deelman et al. (2018), The future of scientific workflows

  14. Challenges of cloud-based workflow management: dynamic environment, scalability, fault tolerance. Source: Deelman et al. (2018), The future of scientific workflows

  18. How to schedule heterogeneous processing services?

  19. Naïve scheduling (diagram: service instances A to D, labelled with the capabilities Docker, GPU, and In-memory, next to the virtual machines VM1, VM2, VM3, VM4, ..., VMn)

  21. Naïve scheduling (diagram: each of the virtual machines VM1, VM2, VM3, VM4, ..., VMn is labelled with Docker, GPU, In-memory, ...; the service instances A to D are labelled with the capabilities Docker, GPU, and In-memory)

  22. Contribution: dynamic environment, scalability, fault tolerance

  23. Contribution: dynamic environment, capability-based scheduling, scalability, fault tolerance

  24. Contribution: dynamic environment, capability-based scheduling, scalability, fault tolerance. Software architecture + algorithm

  25. Software architecture

  26. Overview (architecture diagram): Instance 1 ... Instance n with the components HTTP server (H), controller (C), scheduler (S), agent (A), and cloud manager (M), connected through an event bus and a database

  35. Controller (workflow graph with tasks A to E): read workflow from database, generate new process chains, save process chains into database, wait for results

  44. Scheduling algorithm

  45. Scheduling example (diagram): a scheduler, 500 process chains (PC1: Docker; PC2: Docker; PCi: Python, TensorFlow; PCi+1: GPU; PCj: GPU; PCj+1: Docker; ...), and 2 agents (A1: C++; A2: Docker, busy)

  46. The scheduler collects the distinct required capability sets of the process chains

  47. Distinct required capability sets: {Docker}, {Python, TensorFlow}, {GPU}

  49. Agent responses to the sets: "busy" and "???"

  50. Scheduler and cloud manager: request agents for the sets {Docker}, {Python, TensorFlow}, {GPU}

  51. The cloud manager creates agents

  52. 5 agents: A3 (Docker), A4 (Python, TensorFlow), and A5 (GPU, Docker) have been created

  53. With 5 agents, the scheduler again collects the distinct required capability sets

  54. Distinct required capability sets: {Docker}, {Python, TensorFlow}, {GPU}

  56. Agent responses to the sets: busy, OK, OK, OK, ???

  57. Candidate agents shown next to the sets: A3 (Docker), A4 (Python, TensorFlow), and A5 (GPU, Docker), with A5 shown twice

  58. Candidate agents shown next to the sets: A3 (Docker), A4 (Python, TensorFlow), and A5 (GPU, Docker), each shown once

  59. For the set {Docker}: fetch process chain

  61. A3 is busy

  62. A4 is busy; PC2 and the set {Docker} are no longer shown

  63. A5 is busy; PCi is no longer shown and only the set {GPU} remains

  64. Repeat; A2, A3, A4, and A5 are busy

  65. Evaluation

  66. Contribution: dynamic environment, capability-based scheduling, scalability, fault tolerance. Software architecture + algorithm

  67. Experiment 1: capability-based scheduling. 100 process chains, 4 distinct capability sets. Result: correct allocation. (Chart legend: R1, R2, R3, R4, R3+R4, process chain, start, end, agent killed, fault)

  68. Experiment 2: dynamic environment. 1000 process chains, 1 agent at the beginning, 8 agents at the end. (Chart legend: R1, R2, R3, R4, R3+R4, process chain, start, end, agent killed, fault)

  69. Experiment 3: scalability (process chains). 150,000 process chains, up to 8 agents. Result: load managed well. (Chart legend: R1, R2, R3, R4, R3+R4, process chain, start, end, agent killed, fault)

  70. Experiment 4: fault tolerance. 1000 process chains, agents randomly killed. Result: successful recovery

  71. Steep: open-source implementation of the software architecture and algorithm, https://steep-wms.github.io/

  72. Thanks for listening! MICHEL KRÄMER, Fraunhofer IGD, Germany, michel.kraemer@igd.fraunhofer.de, github.com/michel-kraemer, steep-wms.github.io. Icons by Freepik from www.flaticon.com