Slide 1

Improve portability of bioinformatics software across HPC and cloud infrastructures
6th International IBM Cloud Academy Conference 2018, 25 May 2018 @ The Institute of Statistical Mathematics
Tazro Ohta, Database Center for Life Science
Tomoya Tanjo, National Institute of Informatics
Osamu Ogasawara, National Institute of Genetics

Slide 2

Acknowledgements
Intercloud CREST group
Staff of DDBJ and DBCLS
Galaxy Community Japan
Community members of Galaxy, CWL, and OBF

Slide 3

Background "Genomics as a big data science"

Slide 4

Many samples for many purposes
Unlike other fields, bioinformatics data analysis comes with:
- Data explosion: 100 MB-100 GB per sample; 10-100,000 samples per research project; large-scale international collaborations and many individual projects
- Thousands of software tools: open-source command-line tools from individual developers; "workflows" connect tools and iterate over samples
"Routinely unique"

Slide 5

#tools > 2,000

Slide 6

Routinely unique
46 data analysis projects in 18 months required 34 different types of analysis
"Many projects required customized techniques that were used only once"
Chang J. (2015) Core services: Reward bioinformaticians. Nature

Slide 7

Routines breakdown
- Set up servers
- Install and configure tools and workflows
- Transfer data
- Fetch reference data from the public databases
- Manage preprocessing jobs
- Set up an interactive environment for statistics (e.g. Jupyter)
- Collect and keep result data
- Repeat on reviewers' demand
Can clouds help us reduce the cost of these routines?

Slide 8

Packaging bioinformatics tools and workflows: getting out of dependency hell

Slide 9

Packaging tools
Years of effort to containerize tools: containers are now provided for most popular tools (BioContainers, Bioboxes)
Ongoing trials of container runtimes for HPC: udocker, Singularity

Slide 10

Benchmark: native vs Docker
A benchmark comparing native and Docker execution of genomic pipelines found that "the observed standard deviation is smaller when running with Docker"
Di Tommaso P, et al. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ

Slide 11

Packaging workflows
Tools are often used as components of a workflow
Everyone has their favorite workflow/job management software: Galaxy, Taverna, Airflow, Nextflow, Cromwell, ... and shell scripts
Sharing workflows across different environments is hard

Slide 12

The age of the Common Workflow Language
"A specification for describing analysis workflows and tools"
An open-source community standard since 2014, for all data analysis tasks, not only bioinformatics
Describes the structure of tools and workflows in YAML: base container image, base command, inputs/outputs

Slide 13

CWL: how it works
Requirements:
- tool and workflow definition files (.cwl)
- a job configuration file (.yaml or .json)
- a workflow engine that supports CWL execution

Slide 14

CWL in action - tool definition

cwlVersion: v1.0
class: CommandLineTool
hints:
  DockerRequirement:
    dockerPull: inutano/rsem:0.1.0
baseCommand: ["rsem-calculate-expression"]
inputs:
  threads:
    type: int
    inputBinding:
      prefix: -p
  fastq:
    type: File
    inputBinding:
      position: 1
outputs:
  readsPerGene:
    type: File
    outputBinding:
      glob: "*ReadsPerGene.out.tab"

Slide 15

CWL in action - workflow definition

cwlVersion: v1.0
class: Workflow
inputs:
  runThreadN: int
  dataURL: string
outputs:
  readsPerGene:
    type: File
    outputSource: rsem/readsPerGene
steps:
  download_data:
    run: download_data.cwl
    in:
      dataURL: dataURL
    out: [dataFiles]
  rsem:
    run: rsem.cwl
    in:
      threads: runThreadN
      data: download_data/dataFiles
    out: [readsPerGene]

Slide 16

CWL in action - job configuration

runThreadN: 8
dataURL: ftp.ddbj.nig.ac.jp/pathto/example.fastq

Slide 17

CWL in action - execution
Using the CWL reference implementation (cwltool):

$ cwltool rsem_workflow.cwl rsem_workflow_jobconf.yml

Basic idea: CWL focuses on the "what" of a workflow, i.e. its structure, including inputs, actions, and outputs. The "how" is left to the execution environment: job scheduling and management depend on the engine.

Slide 18

Implementations supporting CWL

Software     | Platform support
cwltool      | Linux, OS X, Windows; local execution only
Arvados      | AWS, GCP, Azure, Slurm
Toil         | AWS, Azure, GCP, Grid Engine, LSF, Mesos, OpenStack, Slurm, PBS/Torque
Rabix Bunny  | Linux, OS X, GA4GH TES (experimental)
CWL-Airflow  | Linux, OS X
REANA        | Kubernetes, CERN OpenStack (OpenStack Magnum)
Cromwell     | local, HPC, Google, HTCondor
CWLEXEC      | IBM Spectrum LSF 10.1.0.3+

Slide 19

OK, now everything's portable... so where should I run my workflows?

Slide 20

Select the best instance for a given workflow
An ideal system for optimizing cloud instance selection requires resource usage data from past workflow executions
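The idea above can be sketched minimally: given the peak memory observed in past runs of the same workflow, pick the cheapest instance type that still fits, with some headroom. All instance names, sizes, and prices below are hypothetical placeholders, not real cloud offerings.

```python
# Hypothetical sketch: choose the cheapest instance type whose memory
# covers the peak usage observed in past runs, plus a safety margin.
# Instance names and prices are illustrative only.

PAST_RUNS_MAX_MEM_GB = [12.4, 14.1, 13.0]  # peak memory of past executions

INSTANCES = [
    {"name": "small",  "mem_gb": 8,  "usd_per_hour": 0.10},
    {"name": "medium", "mem_gb": 16, "usd_per_hour": 0.20},
    {"name": "large",  "mem_gb": 32, "usd_per_hour": 0.40},
]

def pick_instance(past_peaks, instances, headroom=1.2):
    """Return the cheapest instance with enough memory (peak * headroom)."""
    required = max(past_peaks) * headroom
    candidates = [i for i in instances if i["mem_gb"] >= required]
    if not candidates:
        raise ValueError("no instance type is large enough")
    return min(candidates, key=lambda i: i["usd_per_hour"])

print(pick_instance(PAST_RUNS_MAX_MEM_GB, INSTANCES)["name"])  # -> large
```

With a 1.2x headroom, 14.1 GB peak requires ~16.9 GB, so the 16 GB type is skipped and the 32 GB type is chosen; this is exactly why real per-run metrics matter more than guesses.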

Slide 21

Collecting resource usage of data analysis workflows

Slide 22

CWL-metrics
github.com/inutano/cwl-metrics
Collects container resource usage, including total CPU usage, max memory usage, total disk I/O, and execution time
Collects metadata of tools and workflows via cwltool
Easy to install; runs (almost) everywhere

Slide 23

CWL-metrics: how it works
Collects metrics data via influxdata/telegraf
Collects workflow metadata via cwltool
Stores both in Elasticsearch and outputs summary data
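The summarization step above can be sketched as a simple reduction: a collector such as telegraf records a time series of per-container samples, which are then collapsed into the per-run summary values. The sample field names here are illustrative assumptions, not the actual CWL-metrics schema.

```python
# Minimal sketch: reduce a time series of per-container samples
# (as a metrics collector would record them) into per-run summary metrics.
# Field names are illustrative, not the real CWL-metrics schema.

samples = [
    {"ts": 0,  "cpu_percent": 180.0, "mem_bytes": 2_000_000_000},
    {"ts": 10, "cpu_percent": 390.5, "mem_bytes": 9_500_000_000},
    {"ts": 20, "cpu_percent": 310.2, "mem_bytes": 13_400_000_000},
]

def summarize(samples):
    """Collapse raw container samples into summary metrics for one run."""
    return {
        "max_memory_usage_bytes": max(s["mem_bytes"] for s in samples),
        "avg_cpu_percent": sum(s["cpu_percent"] for s in samples) / len(samples),
        "elapsed_time_seconds": samples[-1]["ts"] - samples[0]["ts"],
    }

print(summarize(samples)["max_memory_usage_bytes"])  # -> 13400000000
```

In the real system these summaries would be computed per workflow step and stored in Elasticsearch alongside the workflow metadata from cwltool.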

Slide 24

Example: workflows on different instance types

Slide 25

Future work
- A compact summary file format to accompany CWL files
- Support more workflow engines
- Support multi-host environments
- Support containers other than Docker

Summary
- Genomics needs more machines and easier-to-use clouds
- Packaging tools and workflows enables easy migration to the clouds
- Collecting data enables optimization of environment selection
- We need more metrics data from a wider variety of workflows