Improve portability of bioinformatics software across HPC and cloud infrastructures

6th International IBM Cloud Academy Conference 2018, 25 May 2018 @ The Institute of Statistical Mathematics

Tazro Inutano Ohta

May 25, 2018

Transcript

  1. Improve portability of bioinformatics software across HPC and cloud infrastructures

    6th International IBM Cloud Academy Conference 2018
    25 May 2018 @ The Institute of Statistical Mathematics
    Tazro Ohta, Database Center for Life Science
    Tomoya Tanjo, National Institute of Informatics
    Osamu Ogasawara, National Institute of Genetics
  2. Acknowledgement

    Intercloud CREST group
    Staff of DDBJ and DBCLS
    Galaxy Community Japan
    Community members of Galaxy, CWL, and OBF
  3. Background "Genomics as a big data science"

  4. Many samples for many purposes

    Unlike other areas, bioinformatics data analysis comes with:
    Data explosion
    100 MB to 100 GB per sample
    10 to 100,000 samples per research project
    Large-scale international collaborations / many individual projects
    Thousands of software tools
    Open source command line tools from individual developers
    "Workflow" to connect tools and iterate over samples
    "Routinely unique"
  5. #tools > 2,000

  6. Routinely unique

    46 data analysis projects in 18 months required 34 different types of analysis
    "Many projects required customized techniques that used only once"
    Chang J. (2015) Core services: Reward bioinformaticians. Nature
  7. Routines breakdown

    Set up servers
    Install and configure tools and workflows
    Transfer data
    Fetch reference data from the public databases
    Manage preprocessing jobs
    Set up interactive environment for statistics (e.g. Jupyter)
    Collect and keep result data
    Repeat on reviewer's demand
    Can clouds help us to reduce the cost of the routines?
  8. Packaging bioinformatics tools and workflows

    Getting out of dependency hell
  9. Packaging tools

    Years of effort to containerize tools
    Containers are now provided for most popular tools: biocontainers, bioboxes
    Ongoing trials of other container runtimes for HPC: udocker, singularity
  10. Benchmark: native vs docker

    Native vs Docker execution benchmark comparison
    "the observed standard deviation is smaller when running with Docker"
    Di Tommaso P, et al. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ
  11. Packaging workflows

    Tools are often used as components of a workflow
    Everyone has their favorite workflow job management software: Galaxy, Taverna, Airflow, Nextflow, Cromwell, and... shell
    Sharing workflows among different environments is hard
  12. The age of Common Workflow Language

    "A specification for describing analysis workflows and tools"
    A new open-source community standard since 2014
    For all data analysis tasks, not only bioinformatics
    Describes the structure of tools and workflows in YAML: base container image, base command, input/output
  13. CWL: how it works

    Requirements:
    tool and workflow definition files (.cwl)
    job configuration file (.yaml or .json)
    workflow engine that supports CWL execution
  14. CWL in action - tool definition

    cwlVersion: v1.0
    class: CommandLineTool
    hints:
      DockerRequirement:
        dockerPull: inutano/rsem:0.1.0
    baseCommand: ["rsem-calculate-expression"]
    inputs:
      threads:
        type: int
        inputBinding:
          prefix: -p
      fastq:
        type: File
        inputBinding:
          position: 1
    outputs:
      readsPerGene:
        type: File
        outputBinding:
          glob: "*ReadsPerGene.out.tab"
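To illustrate what the `inputBinding` entries in a tool definition are for, here is a minimal Python sketch of how an engine might turn them into a command line. It is a simplification written for this transcript, not cwltool's actual code: it handles only `prefix` and `position`, orders arguments by position (defaulting to 0) and then by input name, and the job values passed at the bottom are hypothetical.

```python
# Simplified sketch: turn inputBinding entries into an argv list.
# Only `prefix` and `position` are handled; real engines support much more.

TOOL = {
    "baseCommand": ["rsem-calculate-expression"],
    "inputs": {
        "threads": {"type": "int", "inputBinding": {"prefix": "-p"}},
        "fastq": {"type": "File", "inputBinding": {"position": 1}},
    },
}


def build_command(tool, job):
    """Return the argv list for `tool` given the input values in `job`."""
    bound = []
    for name, spec in tool["inputs"].items():
        binding = spec.get("inputBinding")
        if binding is None or name not in job:
            continue
        # Sort key: position first (default 0), then input name as tie-breaker.
        key = (binding.get("position", 0), name)
        args = []
        if "prefix" in binding:
            args.append(binding["prefix"])
        args.append(str(job[name]))
        bound.append((key, args))

    argv = list(tool["baseCommand"])
    for _, args in sorted(bound, key=lambda item: item[0]):
        argv.extend(args)
    return argv


if __name__ == "__main__":
    # Hypothetical job values, for illustration only.
    print(build_command(TOOL, {"threads": 8, "fastq": "example.fastq"}))
```

With these assumptions, the `-p` option is emitted before the positional FASTQ argument because `threads` has no explicit position and therefore sorts first.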
  15. CWL in action - workflow definition

    cwlVersion: v1.0
    class: Workflow
    inputs:
      runThreadN: int
      dataURL: string
    outputs:
      readsPerGene:
        type: File
        outputSource: rsem/readsPerGene
    steps:
      download_data:
        run: download_data.cwl
        in:
          dataURL: dataURL
        out: [dataFiles]
      rsem:
        run: rsem.cwl
        in:
          threads: runThreadN
          fastq: download_data/dataFiles
        out: [readsPerGene]
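The workflow definition only declares which step consumes which output; the engine derives the execution order. The Python sketch below shows one way to do that with a topological sort. It assumes each step-output source is written as `step_name/output_name` and refers to an existing step; the `STEPS` dictionary is a hand-written stand-in for the parsed workflow, not something produced by a CWL parser.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hand-written stand-in for the parsed `steps` section of the workflow.
STEPS = {
    "download_data": {"in": {"dataURL": "dataURL"}, "out": ["dataFiles"]},
    "rsem": {
        "in": {"threads": "runThreadN", "fastq": "download_data/dataFiles"},
        "out": ["readsPerGene"],
    },
}


def execution_order(steps):
    """Return step names in an order that respects data dependencies."""
    graph = {}
    for name, step in steps.items():
        deps = set()
        for source in step["in"].values():
            # "step/output" means the value comes from another step;
            # a bare name refers to a workflow-level input.
            if "/" in source:
                upstream = source.split("/", 1)[0]
                if upstream not in steps:
                    raise ValueError(f"{name}: unknown step {upstream!r}")
                deps.add(upstream)
        graph[name] = deps
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STEPS))
```

A source that names a step missing from the workflow is rejected early, which is the kind of error a dangling reference would cause at validation time.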
  16. CWL in action - job configuration

    runThreadN: 8
    dataURL: ftp.ddbj.nig.ac.jp/pathto/example.fastq
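The job configuration has to supply a value of the declared type for every workflow input. As a rough illustration, the Python sketch below checks a job dictionary against the two inputs declared in the workflow (`runThreadN: int`, `dataURL: string`). The helper names are made up for this example, and only the `int` and `string` types are covered; a real engine performs far more thorough validation.

```python
# Workflow inputs as declared in the workflow definition.
WORKFLOW_INPUTS = {"runThreadN": "int", "dataURL": "string"}


def matches(value, cwl_type):
    """Check a Python value against a (very small) subset of CWL types."""
    if cwl_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if cwl_type == "string":
        return isinstance(value, str)
    raise ValueError(f"type not covered by this sketch: {cwl_type}")


def validate_job(inputs, job):
    """Return a list of human-readable problems; empty means the job is OK."""
    problems = []
    for name, cwl_type in inputs.items():
        if name not in job:
            problems.append(f"missing input: {name}")
        elif not matches(job[name], cwl_type):
            problems.append(f"{name}: expected {cwl_type}")
    for name in job:
        if name not in inputs:
            problems.append(f"unknown input: {name}")
    return problems


if __name__ == "__main__":
    job = {"runThreadN": 8, "dataURL": "ftp.ddbj.nig.ac.jp/pathto/example.fastq"}
    print(validate_job(WORKFLOW_INPUTS, job))
```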
  17. CWL in action - execution

    using CWL reference implementation (cwltool):
    $ cwltool rsem_workflow.cwl rsem_workflow_jobconf.yml
    Basic idea
    CWL focuses on the "What" of a workflow: its structure, including input, action, and output
    The "How" should be determined by the execution environment: job scheduling and management depend on the engine
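When the same workflow is run repeatedly, the cwltool invocation can be wrapped in a script. The Python sketch below builds the same command line as on the slide and runs it with `subprocess`. The function names are illustrative, the optional `--outdir` argument is cwltool's option for choosing the output directory, and the sketch assumes `cwltool` is installed and on `PATH`.

```python
import shutil
import subprocess


def cwltool_command(workflow, job, outdir=None):
    """Build the argv list for a cwltool run."""
    cmd = ["cwltool"]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    cmd += [workflow, job]
    return cmd


def run_workflow(workflow, job, outdir=None):
    """Run the workflow; raises CalledProcessError if cwltool fails."""
    if shutil.which("cwltool") is None:
        raise RuntimeError("cwltool not found on PATH")
    return subprocess.run(cwltool_command(workflow, job, outdir), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_workflow(...) to actually execute it.
    print(cwltool_command("rsem_workflow.cwl", "rsem_workflow_jobconf.yml"))
```

Because the definition files carry the "What", only this wrapper would need to change when switching to a different engine.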
  18. Implementations supporting CWL

    Software       Platform support
    cwltool        Linux, OS X, Windows, local execution only
    Arvados        AWS, GCP, Azure, Slurm
    Toil           AWS, Azure, GCP, Grid Engine, LSF, Mesos, OpenStack, Slurm, PBS/Torque
    Rabix Bunny    Linux, OS X, GA4GH TES (experimental)
    CWL-Airflow    Linux, OS X
    REANA          Kubernetes, CERN OpenStack (OpenStack Magnum)
    Cromwell       local, HPC, Google, HTCondor
    CWLEXEC        IBM Spectrum LSF 10.1.0.3+
  19. OK now everything's portable...

    So where should I run my workflows?
  20. Select the best instance for given WF

    An ideal system to optimize cloud instance selection will require resource usage data of past WF executions
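As a rough illustration of why past resource usage matters for instance selection, the Python sketch below picks the cheapest instance type that satisfies the thread count and peak memory observed in an earlier run. Everything here is hypothetical: the catalog entries, prices, and the 20% memory headroom are invented for the example and do not describe any real cloud provider or the system proposed in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gb: float
    hourly_cost: float


# Hypothetical catalog; names, sizes, and prices are made up.
CATALOG = [
    InstanceType("small", 2, 8, 0.10),
    InstanceType("medium", 8, 32, 0.40),
    InstanceType("large", 16, 64, 0.80),
]


def pick_instance(catalog, threads, peak_memory_gb, headroom=1.2):
    """Return the cheapest instance that fits, or None if nothing fits."""
    needed_memory = peak_memory_gb * headroom
    candidates = [
        inst
        for inst in catalog
        if inst.vcpus >= threads and inst.memory_gb >= needed_memory
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda inst: inst.hourly_cost)


if __name__ == "__main__":
    # Suppose a past run used 8 threads and peaked at 20 GB of memory.
    print(pick_instance(CATALOG, threads=8, peak_memory_gb=20))
```

The lowest hourly price is not necessarily the lowest total cost, since run time differs between instance types; accounting for that would need the execution time metric as well.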
  21. Collecting resource usage of data analysis workflows

  22. CWL-metrics

    github.com/inutano/cwl-metrics
    collects container resource usage including: total CPU usage, max memory usage, total disk IO, exec time
    collects metadata of tools and workflows via cwltool
    easy to install, run (almost) everywhere
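To make the four metrics concrete, the Python sketch below reduces a series of periodic container samples to total CPU time, maximum memory, total disk IO, and execution time. The `Sample` structure and its field names are assumptions made for this example, not the actual CWL-metrics data format; CPU and disk counters are assumed to be cumulative, as container runtimes typically report them.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float      # seconds since some reference point
    cpu_seconds: float    # cumulative CPU time used by the container
    memory_bytes: int     # memory in use at sampling time
    disk_io_bytes: int    # cumulative bytes read and written


def summarize(samples):
    """Reduce periodic samples to one summary record per container."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples, key=lambda s: s.timestamp)
    first, last = ordered[0], ordered[-1]
    return {
        # Cumulative counters: the total is last minus first.
        "total_cpu_seconds": last.cpu_seconds - first.cpu_seconds,
        "total_disk_io_bytes": last.disk_io_bytes - first.disk_io_bytes,
        # Memory is a gauge: take the maximum over all samples.
        "max_memory_bytes": max(s.memory_bytes for s in ordered),
        "elapsed_seconds": last.timestamp - first.timestamp,
    }


if __name__ == "__main__":
    # Made-up samples for illustration.
    samples = [
        Sample(0, 0.0, 100, 0),
        Sample(10, 8.0, 300, 1000),
        Sample(20, 15.0, 200, 1500),
    ]
    print(summarize(samples))
```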
  23. CWL-metrics: How it works

    collect metrics data via influxdata/telegraf
    collect WF metadata via cwltool
    store in elasticsearch, output summary data
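The two data sources only become useful once they are combined: metrics describe containers, while workflow metadata describes steps and tools. The Python sketch below joins them on a container ID to produce one summary record per step. The record layout, the `container_id` join key, and the sample values are assumptions for illustration and are not taken from the CWL-metrics implementation.

```python
def join_by_container(metrics, steps):
    """Attach per-container metrics to the workflow step that ran in it.

    metrics: dict mapping container ID to a metrics summary dict
    steps:   list of dicts with "step", "tool", and "container_id" keys
    """
    records = []
    for step in steps:
        summary = metrics.get(step["container_id"])
        if summary is None:
            # No metrics were collected for this step's container.
            continue
        records.append({"step": step["step"], "tool": step["tool"], **summary})
    return records


if __name__ == "__main__":
    # Made-up values for illustration.
    metrics = {"abc123": {"max_memory_bytes": 300, "elapsed_seconds": 20}}
    steps = [
        {"step": "rsem", "tool": "rsem.cwl", "container_id": "abc123"},
        {"step": "download_data", "tool": "download_data.cwl", "container_id": "zzz999"},
    ]
    for record in join_by_container(metrics, steps):
        print(record)
```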
  24. Example: WFs on different instance type

  25. Future work

    A compact summary file format taken with CWL file
    Support more workflow engines
    Support multihost environment
    Support containers other than docker

    Summary
    Genomics needs more machines and easier-to-use clouds
    Packaging tools and workflows for easy migration to the clouds
    Collecting data for environment selection optimization
    Need more metrics data from a wider variety of workflows