Capability-based scheduling of scientific workflows in the cloud

Transcript

Data processing requirements: Very large data sets, Heterogeneous services, Automated data processing

Scientific Workflows (workflow graph with services A, B, C, D, E): Automated data processing, Independent services, Distributed environments

Cloud-based Scientific Workflows

Challenges of cloud-based workflow management: Dynamic environment, Scalability, Fault tolerance

Deelman et al. (2018). The future of scientific workflows

How to schedule heterogeneous processing services?

Naïve scheduling

Service instances: A (Docker), B (GPU), C (In-memory), C (In-memory), D (Docker, GPU)

Virtual Machines: VM1, VM2, VM3, VM4, ..., VMn, each provisioned with Docker, GPU, In-memory, ...

Contribution: Capability-based scheduling, Dynamic environment, Scalability, Fault tolerance

Software architecture + Algorithm

Software architecture

Overview

Components: HTTP server (H), Controller (C), Scheduler (S), Agent (A), Cloud manager (M), Database, Event bus

Instances: Instance 1, ..., Instance n

Controller (workflow graph with services A, B, C, D, E)

Read workflow from database

Generate new process chains

Save process chains into database

Wait for results
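The four controller steps above can be sketched as a small loop. This is a minimal illustration only, not the implementation from the talk: the `Database` class, the helper names, the edges of the example workflow, and the rule that every ready service becomes its own single-service process chain are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Hypothetical in-memory stand-in for the database on the slide."""
    workflow: dict  # service name -> set of services it depends on
    process_chains: list = field(default_factory=list)
    finished: set = field(default_factory=set)


def generate_process_chains(workflow, finished, submitted):
    """Return a single-service chain for every service whose inputs are ready.

    Assumption: one service per chain, which keeps the sketch short.
    """
    return [
        [service]
        for service, deps in sorted(workflow.items())
        if service not in submitted and deps <= finished
    ]


def run_controller(db, execute):
    """Repeat the controller steps until no new process chain can be generated."""
    submitted = set()
    while True:
        workflow = db.workflow  # read workflow from database
        chains = generate_process_chains(workflow, db.finished, submitted)
        if not chains:
            break
        db.process_chains.extend(chains)  # save process chains into database
        for chain in chains:  # wait for results (executed inline here)
            submitted.update(chain)
            execute(chain)
            db.finished.update(chain)
    return db.process_chains


# Assumed example graph: A feeds B and C, both feed D, D feeds E.
db = Database(workflow={
    "A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": {"D"},
})
executed = []
chains = run_controller(db, executed.append)
print(chains)
```

In this sketch the controller executes chains inline. In the architecture on the slides, the saved process chains would instead be picked up by the scheduler and run on agents.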

Scheduling algorithm

500 process chains, 2 agents

Process chains (required capabilities): PC1 (Docker), PC2 (Docker), ..., PCi (Python, TensorFlow), PCi+1 (GPU), ..., PCj (GPU), PCj+1 (Docker), ...

Agents (capabilities): A1 (C++), A2 (Docker, busy)

Scheduler: determine the distinct required capability sets: [Docker], [Python, TensorFlow], [GPU]

Scheduler: ask the agents. Responses: busy, ???

Scheduler to Cloud Manager: request agents for [Docker], [Python, TensorFlow], [GPU]

Cloud Manager: create A3 (Docker), A4 (Python, TensorFlow), A5 (GPU, Docker)

500 process chains, 5 agents

Scheduler: determine the distinct required capability sets: [Docker], [Python, TensorFlow], [GPU]

Scheduler: ask the agents. Responses: busy, OK, OK, OK, ???

Candidate agents: A3 for [Docker], A4 for [Python, TensorFlow], A5 for [GPU]

Fetch process chain for [Docker] and assign it to A3: A3 (busy)

Fetch process chain for [Python, TensorFlow] and assign it to A4: A4 (busy)

Fetch process chain for [GPU] and assign it to A5: A5 (busy)

Repeat
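One pass of the walkthrough above might look like the following sketch. It is an illustration under assumptions, not the algorithm as implemented in Steep: the `Agent` and `ProcessChain` classes, `schedule_once`, the `request_agents` callback, and the greedy first-match agent selection are all hypothetical. An agent is treated as suitable when it is idle and offers at least the required capabilities.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    capabilities: frozenset
    busy: bool = False


@dataclass
class ProcessChain:
    name: str
    required: frozenset


def schedule_once(chains, agents, request_agents):
    """Run one scheduling pass and return the (chain, agent) assignments made."""
    # Distinct required capability sets, in first-seen order
    required_sets = list(dict.fromkeys(pc.required for pc in chains))

    # Find one idle, not yet selected agent per set (greedy first match)
    candidates, unmatched, taken = {}, [], set()
    for rcs in required_sets:
        agent = next(
            (a for a in agents
             if not a.busy and a.name not in taken and rcs <= a.capabilities),
            None,
        )
        if agent is None:
            unmatched.append(rcs)
        else:
            candidates[rcs] = agent
            taken.add(agent.name)

    # Ask the cloud manager for agents that can serve the unmatched sets
    if unmatched:
        request_agents(unmatched)

    # Fetch one process chain per matched set and assign it to the candidate
    assignments = []
    for rcs, agent in candidates.items():
        chain = next(pc for pc in chains if pc.required == rcs)
        chains.remove(chain)
        agent.busy = True
        assignments.append((chain.name, agent.name))
    return assignments


docker = frozenset({"Docker"})
py_tf = frozenset({"Python", "TensorFlow"})
gpu = frozenset({"GPU"})

chains = [ProcessChain("PC1", docker), ProcessChain("PC2", docker),
          ProcessChain("PCi", py_tf), ProcessChain("PCi+1", gpu)]
agents = [Agent("A1", frozenset({"C++"})), Agent("A2", docker, busy=True)]

requested = []
first = schedule_once(chains, agents, requested.extend)

# Emulate the cloud manager creating the three new agents from the slides
agents += [Agent("A3", docker), Agent("A4", py_tf),
           Agent("A5", frozenset({"GPU", "Docker"}))]
second = schedule_once(chains, agents, lambda sets: None)
print(first, second)
```

Calling `schedule_once` repeatedly corresponds to the "Repeat" step. Note that the greedy selection depends on agent order: if A5 were listed before A3, it would be picked for [Docker] and [GPU] would be left unmatched, so a real scheduler would need a more careful choice.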

Evaluation


Experiment 1: Capability-based scheduling

100 process chains, 4 distinct capability sets. Correct allocation.

Experiment 2: Dynamic environment

1000 process chains, 1 agent at the beginning, 8 agents at the end.

Experiment 3: Scalability (process chains)

150,000 process chains, up to 8 agents. Load managed well.

Experiment 4: Fault tolerance

1000 process chains, agents randomly killed. Successful recovery.

[Figure legend for the experiment plots: R1, R2, R3, R4, R3+R4; Process chain; Start; End; Agent killed; Fault]

Steep

Implementation of the software architecture and algorithm. Open Source: https://steep-wms.github.io/

Thanks for listening!

Michel Krämer, Fraunhofer IGD, Germany

github.com/michel-kraemer

steep-wms.github.io

Icons by Freepik from www.flaticon.com
