Capability-based scheduling of scientific workflows in the cloud

This is a presentation I gave at the DATA 2020 conference about my research paper, "Capability-based scheduling of scientific workflows in the cloud".

I presented a distributed task scheduling algorithm and a software architecture for a system that executes scientific workflows in the cloud. The main challenges I addressed were (i) capability-based scheduling, which means that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines in the cloud, (ii) a dynamic environment where resources can be added and removed on demand, (iii) scalability to scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance, because faults can happen at any time in the cloud.

My software architecture consists of loosely coupled components that communicate with each other through an event bus and a shared database. Workflow graphs are converted to process chains that can be scheduled independently. My scheduling algorithm collects the distinct required capability sets of the process chains, asks the agents which of these sets they can manage, and then assigns process chains accordingly.

I presented the results of four experiments I conducted to evaluate whether my approach meets these challenges. An implementation of my algorithm and software architecture is publicly available in the open-source workflow management system “Steep”.
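The scheduling step described above (collect the distinct required capability sets, ask the agents which sets they can manage, assign process chains accordingly) can be illustrated with a minimal, single-process sketch. This is not the Steep implementation: the names `ProcessChain`, `Agent`, `can_manage`, and `schedule_once` are hypothetical, and the sketch replaces the event bus and the shared database with plain in-memory lists.

```python
from dataclasses import dataclass


@dataclass
class ProcessChain:
    # Hypothetical model of a process chain (not Steep's data model).
    id: str
    required: frozenset  # capabilities this chain needs, e.g. {"Docker"}


@dataclass
class Agent:
    # Hypothetical model of an agent (not Steep's data model).
    id: str
    capabilities: frozenset  # capabilities this agent offers
    busy: bool = False

    def can_manage(self, required_sets):
        """Return the required capability sets this agent could take right now."""
        if self.busy:
            return []
        return [s for s in required_sets if s <= self.capabilities]


def schedule_once(pending, agents):
    """Run one scheduling pass over the pending process chains.

    Returns (assignments, unmatched): a dict mapping process chain id to
    agent id, and the capability sets no free agent could manage.
    """
    # 1. Collect the distinct required capability sets (first-seen order).
    distinct_sets = list(dict.fromkeys(pc.required for pc in pending))

    # 2. Ask every agent which of these sets it can manage.
    offers = {a.id: a.can_manage(distinct_sets) for a in agents}

    # 3. For each set, pick a free agent and assign one matching chain.
    assignments = {}
    unmatched = []
    for required in distinct_sets:
        agent = next(
            (a for a in agents if not a.busy and required in offers[a.id]),
            None,
        )
        if agent is None:
            unmatched.append(required)
            continue
        chain = next(
            pc for pc in pending
            if pc.required == required and pc.id not in assignments
        )
        assignments[chain.id] = agent.id
        agent.busy = True  # the agent is occupied until the chain finishes
    return assignments, unmatched


if __name__ == "__main__":
    chains = [
        ProcessChain("PC1", frozenset({"Docker"})),
        ProcessChain("PC2", frozenset({"Docker"})),
        ProcessChain("PC3", frozenset({"Python", "TensorFlow"})),
        ProcessChain("PC4", frozenset({"GPU"})),
    ]
    agents = [
        Agent("A1", frozenset({"C++"})),
        Agent("A2", frozenset({"Docker"}), busy=True),
        Agent("A3", frozenset({"Docker"})),
        Agent("A4", frozenset({"Python", "TensorFlow"})),
        Agent("A5", frozenset({"GPU", "Docker"})),
    ]
    assignments, unmatched = schedule_once(chains, agents)
    print(assignments)  # {'PC1': 'A3', 'PC3': 'A4', 'PC4': 'A5'}
    print(unmatched)    # []
```

Working on distinct capability sets rather than on individual process chains keeps the number of questions sent to the agents small even when many chains are pending. In this sketch the `unmatched` list is simply returned to the caller; the slides suggest that in such a case the scheduler asks the cloud manager for additional agents, which is left out here.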

Michel Krämer

July 07, 2020

Transcript

  1. CAPABILITY-BASED SCHEDULING
    OF SCIENTIFIC WORKFLOWS
    IN THE CLOUD
    MICHEL KRÄMER

  2. Data processing requirements
    Very large data sets
    Heterogeneous services
    Automated data processing

  6. Cloud-based Scientific Workflows
    [Figure: workflow graph with nodes A, B, C, D, E]
    Dynamic environment
    Scalability
    Fault tolerance
    Deelman et al. (2018). The future of scientific workflows
    Scientific Workflows
    [Figure: workflow graph with nodes A, B, C, D, E]
    Automated data processing
    Independent services
    Distributed environments

  13. Challenges of cloud-based workflow management
    Dynamic environment
    Scalability
    Fault tolerance
    Deelman et al. (2018). The future of scientific workflows

  17. How to schedule heterogeneous processing services?

  18. Naïve scheduling
    [Diagram: service instances and their required capabilities: A (Docker), A (Docker), B (GPU), C (In-memory), C (In-memory), D (Docker, GPU); virtual machines: VM1, VM2, VM3, VM4, ..., VMn]

  20. Naïve scheduling
    [Diagram: service instances and virtual machines as on slide 18; each virtual machine is now labeled Docker, GPU, In-memory, ...]

  21. Contribution
    Dynamic environment
    Scalability
    Fault tolerance

  22. Contribution
    Dynamic environment
    Capability-based scheduling
    Scalability
    Fault tolerance

  23. Contribution
    Dynamic environment
    Capability-based scheduling
    Scalability
    Fault tolerance
    Software architecture + Algorithm

  24. Software architecture

  25. Overview
    [Architecture diagram: instances 1 to n with the components HTTP server (H), Controller (C), Scheduler (S), Agent (A), and Cloud manager (M), together with a Database and an Event bus]

  34. Controller
    [Figure: workflow graph with nodes A, B, C, D, E]
    Read workflow from database
    Generate new process chains
    Save process chains into database
    Wait for results

  43. Scheduling algorithm

  44. 500 process chains, 2 agents
    [Diagram: Scheduler; process chains and their required capabilities: PC1 (Docker), PC2 (Docker), ..., PCi (Python, TensorFlow), PCi+1 (GPU), ..., PCj (GPU), PCj+1 (Docker), ...; agents and their capabilities: A1 (C++), A2 (busy; Docker)]

  45. 500 process chains, 2 agents
    distinct required capability sets
    [Diagram: Scheduler, process chains, and agents as on slide 44]

  46. 500 process chains, 2 agents
    [Diagram: Scheduler, process chains, and agents as on slide 44; distinct required capability sets shown: (Docker), (Python, TensorFlow), (GPU)]

  48. 500 process chains, 2 agents
    [Diagram: as on slide 46; replies shown: busy, ???]

  49. 500 process chains, 2 agents
    request agents
    [Diagram: as on slide 46, with a Cloud Manager next to the Scheduler]

  50. 500 process chains, 2 agents
    create
    [Diagram: as on slide 49]

  51. 500 process chains, 5 agents
    create
    [Diagram: as on slide 49; agents and their capabilities are now: A1 (C++), A2 (busy; Docker), A3 (Docker), A4 (Python, TensorFlow), A5 (GPU, Docker)]

  52. 500 process chains, 5 agents
    distinct required capability sets
    [Diagram: Scheduler and process chains as on slide 44; agents and their capabilities: A1 (C++), A2 (busy; Docker), A3 (Docker), A4 (Python, TensorFlow), A5 (GPU, Docker)]

  53. 500 process chains, 5 agents
    [Diagram: as on slide 52; distinct required capability sets shown: (Docker), (Python, TensorFlow), (GPU)]

  55. 500 process chains, 5 agents
    [Diagram: as on slide 53; replies shown: busy, OK, OK, OK, ???]

  56. 500 process chains, 5 agents
    [Diagram: as on slide 53; additional copies of agents shown: A3 (Docker) once, A4 (Python, TensorFlow) once, A5 (GPU, Docker) twice]

  57. 500 process chains, 5 agents
    [Diagram: as on slide 53; one additional copy each of A3 (Docker), A4 (Python, TensorFlow), and A5 (GPU, Docker)]

  58. 500 process chains, 5 agents
    fetch process chain
    [Diagram: as on slide 57, with an additional Docker label]

  59. 500 process chains, 5 agents
    [Diagram: as on slide 57]

  60. 500 process chains, 5 agents
    [Diagram: as on slide 57; A3 is now marked busy]

  61. 500 process chains, 5 agents
    [Diagram: as on slide 60; PC2, the (Docker) capability set, and the copy of A3 are no longer shown; A4 is now marked busy]

  62. 500 process chains, 5 agents
    [Diagram: as on slide 61; PCi, the (Python, TensorFlow) capability set, and the copy of A4 are no longer shown; A5 is now marked busy]

  63. 500 process chains, 5 agents
    Repeat
    [Diagram: as on slide 62; PCi+1, the (GPU) capability set, and the copy of A5 are no longer shown; A2, A3, A4, and A5 are marked busy]

  64. Contribution
    Dynamic environment
    Capability-based scheduling
    Scalability
    Fault tolerance
    Software architecture + Algorithm

  65. Experiment 1
    Capability-based scheduling
    100 process chains
    4 distinct capability sets
    Correct allocation
    [Chart legend: R1, R2, R3, R4, R3+R4; process chain start and end; agent killed; fault]

  66. Experiment 2
    Dynamic environment
    1000 process chains
    1 agent at the beginning
    8 agents at the end
    [Chart legend: R1, R2, R3, R4, R3+R4; process chain start and end; agent killed; fault]

  67. Experiment 3
    Scalability (process chains)
    150,000 process chains
    up to 8 agents
    Load managed well
    [Chart legend: R1, R2, R3, R4, R3+R4; process chain start and end; agent killed; fault]

  68. Experiment 4
    Fault tolerance
    1000 process chains
    Agents randomly killed
    Successful recovery

  69. Implementation of the software architecture and algorithm
    Open Source
    https://steep-wms.github.io/
    Steep

  70. Thanks for listening!
    MICHEL KRÄMER
    Fraunhofer IGD, Germany
    [email protected]
    github.com/michel-kraemer
    steep-wms.github.io
    Icons by Freepik from www.flaticon.com