Slide 1

Slide 1 text

Workflows that run everywhere and where to run them
Runtime metrics analysis for workflow deployment
Tazro Ohta, Database Center for Life Science (DBCLS)

Slide 2

Slide 2 text

Workflows are now portable
  Tools are packaged in containers
  Workflows are written in Common Workflow Language
  Goodbye to git clone, make, stackoverflow

Slide 3

Slide 3 text

EVERYWHERE means options
Where should I run my workflow?
  Laptop/Desktop
  Shared computing cluster
  Cloud platforms
    General instance
    Compute optimized
    Memory optimized
    Storage optimized

Slide 4

Slide 4 text

Know your workflows
To run them at the best performance, you should know:
  Runtime metrics (resource usage)
    Processing time
    CPU/Memory usage
    Block I/O
    Network I/O
  Performance in relation to inputs
    data size / file size
    parameters / arguments
    environment / hardware
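
One way to look at performance in relation to input size is a simple least-squares line through past runs. The sketch below is only an illustration: the measurements are made-up placeholder values, the function names are hypothetical, and a straight line is an assumption that may not hold for every tool.

```python
# Minimal sketch: relate processing time to input size with a least-squares line.
# The numbers below are placeholders, not measurements from any real workflow.

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_minutes(size_gb, slope, intercept):
    """Estimate processing time (minutes) for an input of the given size (GB)."""
    return slope * size_gb + intercept


# Hypothetical history: input size in GB and processing time in minutes.
sizes_gb = [1.0, 2.0, 4.0, 8.0]
minutes = [12.0, 22.0, 42.0, 82.0]

slope, intercept = fit_line(sizes_gb, minutes)
print(f"estimated time for 16 GB: {predict_minutes(16.0, slope, intercept):.1f} min")
```

With real data, the same fit could be repeated per tool and per parameter set, since each combination may scale differently.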

Slide 5

Slide 5 text

CWL-metrics: Runtime metrics analysis
  A system to capture runtime metrics via the Docker API
  Analyze metrics together with workflow metadata such as inputs
  github.com/inutano/cwl-metrics or google 'cwl-metrics'

Slide 6

Slide 6 text

How to use
1. Wrap your tools in Docker containers
2. Write CWL for your tools/workflow
3. Install CWL-metrics and run it
   curl -L "https://tinyurl.com/cwl-metrics" | bash
   This installs CWL-metrics and starts the daemon process
4. Run cwltool to execute your workflow with the specified options
5. Run cwl-metrics fetch to get summarized runtime metrics

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

How it works

Slide 9

Slide 9 text

The data it captures
Full list at influxdata/telegraf
  docker daemon info: assigned cpus, mems, #containers, etc.
  docker container info: pid, exitcode, started/ended at, etc.
  mem: max usage, total usage, cache, etc.
  cpu: total usage, percent usage, user/kernel, etc.
  network: receive/transmit bytes, packets, errors, etc.
  block I/O: read, write, total, etc.
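
To make the "percent usage" and memory figures concrete, here is a sketch of how they can be derived from one container stats sample. The formulas follow the Docker Engine API documentation for the container stats endpoint; whether the collector used by CWL-metrics computes them exactly this way is an assumption. The sample values are invented, and the "cache" field assumes a cgroup v1 layout.

```python
# Sketch: derive CPU percent and used memory from one Docker stats sample.
# Field names follow the Docker Engine API stats response; values are made up.

def cpu_percent(stats):
    """CPU usage in percent between the previous and the current sample."""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if system_delta <= 0 or cpu_delta < 0:
        return 0.0  # first sample or counter reset: no meaningful delta
    return cpu_delta / system_delta * cpu["online_cpus"] * 100.0


def used_memory_bytes(stats):
    """Memory usage without page cache (cgroup v1 style 'cache' field assumed)."""
    mem = stats["memory_stats"]
    return mem["usage"] - mem.get("stats", {}).get("cache", 0)


# Invented sample with only the fields the two functions read.
sample = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 400_000_000},
        "system_cpu_usage": 10_000_000_000,
        "online_cpus": 4,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 200_000_000},
        "system_cpu_usage": 9_000_000_000,
    },
    "memory_stats": {"usage": 600_000_000, "stats": {"cache": 100_000_000}},
}

print(f"cpu: {cpu_percent(sample):.1f} %, mem: {used_memory_bytes(sample)} bytes")
```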

Slide 10

Slide 10 text

Analysis of runtime metrics
  cwl-metrics fetch
    client for Elasticsearch
    outputs summarized JSON or TSV data
  Use Kibana to visualize raw data
  Use the Elasticsearch API directly from the command line
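
Once a TSV summary is available, it can be aggregated per workflow step with a few lines of code. The sketch below assumes a hypothetical column layout (step, container_id, elapsed_sec, max_mem_bytes); the real column names in the cwl-metrics output may differ and should be checked against an actual file. The rows are placeholders.

```python
# Sketch: aggregate a TSV summary per workflow step.
# Column names and rows are hypothetical, not the actual cwl-metrics output format.
import csv
import io

SAMPLE_TSV = "\n".join([
    "\t".join(["step", "container_id", "elapsed_sec", "max_mem_bytes"]),
    "\t".join(["hisat2", "c1", "120", "4000000000"]),
    "\t".join(["hisat2", "c2", "150", "4200000000"]),
    "\t".join(["stringtie", "c3", "60", "1500000000"]),
])


def summarize(tsv_text):
    """Return {step: {"runs", "total_sec", "peak_mem_bytes"}} from TSV text."""
    summary = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        entry = summary.setdefault(
            row["step"], {"runs": 0, "total_sec": 0.0, "peak_mem_bytes": 0}
        )
        entry["runs"] += 1
        entry["total_sec"] += float(row["elapsed_sec"])
        entry["peak_mem_bytes"] = max(entry["peak_mem_bytes"], int(row["max_mem_bytes"]))
    return summary


for step, entry in summarize(SAMPLE_TSV).items():
    print(step, entry)
```

To read a real file instead of the inline sample, pass the file contents (for example `open(path).read()`) to `summarize`.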

Slide 11

Slide 11 text

RNA-Seq workflow comparison
doi.org/10.1101/456756 or search 'cwl-metrics' on bioRxiv
Materials
  7 workflows at pitagora-galaxy/cwl
  9 samples from SRA with different numbers of reads and read lengths
  6 different AWS instances: m5/c5/r5, 2xlarge and 4xlarge

Slide 12

Slide 12 text

HISAT2-StringTie workflow (Time, SE/PE)

Slide 13

Slide 13 text

Comparison of workflows (Time/Mem)

Slide 14

Slide 14 text

Comparison of metrics and cost per run
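
Cost per run is the processing time multiplied by the hourly price of the instance, so the fastest instance is not necessarily the cheapest. The sketch below shows that arithmetic; the instance names, run times, and prices are placeholders, not real AWS prices or results from this comparison, and billing granularity is ignored.

```python
# Sketch: cost per run = elapsed hours * hourly price.
# All names and numbers are placeholders for illustration only.

def cost_per_run(elapsed_sec, price_per_hour):
    """Cost of a single run, assuming billing proportional to elapsed time."""
    return elapsed_sec / 3600.0 * price_per_hour


def cheapest(runs):
    """Run with the lowest cost."""
    return min(runs, key=lambda r: cost_per_run(r["elapsed_sec"], r["price_per_hour"]))


def fastest(runs):
    """Run with the shortest elapsed time."""
    return min(runs, key=lambda r: r["elapsed_sec"])


runs = [
    {"name": "instance-a", "elapsed_sec": 3600, "price_per_hour": 0.40},
    {"name": "instance-b", "elapsed_sec": 2000, "price_per_hour": 0.80},
]

for r in runs:
    print(r["name"], round(cost_per_run(r["elapsed_sec"], r["price_per_hour"]), 3))
print("cheapest:", cheapest(runs)["name"], "fastest:", fastest(runs)["name"])
```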

Slide 15

Slide 15 text

Future plan
  Resource prediction using stored data
  Improve implementation
    fewer dependencies
    work with other containers
  Integrate with provenance
    put metrics information in the provenance object

Slide 16

Slide 16 text

Share your workflows in CWL!
We will help you collect the metrics!