Reproducible computational pipelines with Docker and Nextflow

Paolo Di Tommaso
November 09, 2015

Bio In Docker Symposium, 9 November 2015, London.



  1. Reproducible computational pipelines with Docker and Nextflow
    Paolo Di Tommaso - Notredame Lab, Center for Genomic Regulation (CRG)
    Bio in Docker Symposium - 9 Nov 2015, London

  2. [...] CARRY OUT BIOINFORMATICS ANALYSIS?*
    • Ability to compile/run the average software suite.
    • Diversity and complexity of software deployments.
    • Poor and/or incomplete documentation.
    • Installation of different software, packages, etc., and getting them to work on different platforms.
    • Lack of truly standard file formats.
    • Time installing software.
    • Lack of computing resources (CPUs, memory, storage, etc.).
    * Available at https://goo.gl/TF9TMj
  3. COMPLEXITY
    • Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
    • The experimental nature of academic software makes it difficult to install, configure and deploy
    • Heterogeneous execution platforms and system architectures (laptop → supercomputer)
  4. BENEFITS
    • Smaller images (~100 MB)
    • Fast instantiation time (~1 sec)
    • Almost native performance
    • Easy to build, publish, share and deploy
    • Transparent build process
  5. PACKAGING A WORKFLOW
    [Diagram labels: Host; Docker image; Binary tools; Workflow scripts; Config file; Compilers; Libraries; Environment]
  6. PROS
    • Dead easy deployment procedure
    • Self-contained and precisely controlled runtime
    • Rapidly reproduce any former configuration
    • Consistent results over time and across different platforms
  7. BENCHMARK*
    * Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273
  8. • DSL (domain-specific language) on top of the JVM
    • High-level parallelisation model
    • Configurable executors targeting multiple platforms
  9. RATIONALE
    • Make workflows portable across different computational environments
    • Simplify deployment and enable reproducibility
    • Reuse any existing piece of software (tools, scripts, etc.)
  10. PROCESS DEFINITION

    process foo {

      input:
      val str from 'Hello'

      output:
      file 'my_file' into result

      script:
      """
      echo $str world! > my_file
      """
    }
  11. WHAT A SCRIPT LOOKS LIKE

    sequences = Channel.fromPath("/data/sample.fasta")

    process blast {
      input:
      file 'in.fasta' from sequences
      output:
      file 'out.txt' into blast_result

      """
      blastp -query in.fasta -outfmt 6 | cut -f 2 | \
      blastdbcmd -entry_batch - > out.txt
      """
    }

    process align {
      input:
      file all_seqs from blast_result
      output:
      file align_result

      """
      t_coffee $all_seqs 2>&- | tee align_result
      """
    }

    blast_result.collectFile(name: 'final_alignment')
  12. IMPLICIT PARALLELISM

    sequences = Channel.fromPath("/data/*.fasta")

    process blast {
      input:
      file 'in.fasta' from sequences
      output:
      file 'out.txt' into blast_result

      """
      blastp -query in.fasta -outfmt 6 | cut -f 2 | \
      blastdbcmd -entry_batch - > out.txt
      """
    }

    process align {
      input:
      file all_seqs from blast_result
      output:
      file align_result

      """
      t_coffee $all_seqs 2>&- | tee align_result
      """
    }

    blast_result.collectFile(name: 'final_alignment')
  13. BENEFITS
    • High-level declarative parallelisation model
    • Portable across different platforms
    • Isolates task dependencies with Docker containers
  14. CONFIGURATION FILE

    process {
      container = 'your/image:latest'
      executor = 'sge'
      queue = 'cn-el6'
      memory = '10GB'
      cpus = 8
      time = '2h'
    }
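
The `container` setting in the configuration file only names the image; Docker execution itself also has to be switched on. A minimal sketch of how that might look in the same `nextflow.config`, assuming the standard `docker` configuration scope and the slide's placeholder image name (not a real image):

```groovy
// nextflow.config (sketch): extends the slide's configuration.
// Assumption: process.container is already set, e.g. to the
// placeholder 'your/image:latest' used on the slide.
docker {
    enabled = true   // run each task inside the image named by process.container
}
```

Alternatively, Docker can be enabled for a single run from the command line with `nextflow run <your-pipeline> -with-docker`.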
  15. DEPLOYMENT MODES
    • Local execution
    • Grid engine / batch scheduler
    • Distributed execution (embedded cluster)
    • Managed cloud (ClusterK, DNAnexus)
    • AWS cloud platform
  16. GRID ENGINE
    [Diagram: nextflow on the login node submits tasks to the batch scheduler, which runs them on the cluster nodes; storage is shared over NFS]
  17. DISTRIBUTED MODE

    Job wrapper:

    #!/bin/bash
    #$ -q <queue>
    #$ -pe ompi <nodes>
    #$ -l virtual_free=<mem>
    mpirun nextflow run <your-pipeline> -with-mpi

    [Diagram labels: Login node; NFS/Lustre; Job request; HPC cluster; cluster nodes; nextflow cluster; nextflow driver; nextflow workers; Apache Ignite]
  18. AWS CLOUD
    [Diagram labels: Nextflow driver; EC2 node; S3 / EFS; Nextflow workers; Elastic Load Balancer; submit tasks; Docker registry; AWS ECR; EC2 spot instances; Apache Ignite]
  19. MANAGED CLOUD
    [Diagram labels: Nextflow; Pipeline script; Submit tasks; Cirrus scheduler (ClusterK); Configure instances; Execute tasks; EC2 spot instances; Amazon S3]
  20. WHO IS USING NEXTFLOW?
    • Campagne Lab, Weill Medical College of Cornell University
    • Center for Biotechnology, Bielefeld University
    • Genetic Cancer group, International Agency Cancer Research
    • Guigo Lab, Center for Genomic Regulation
    • Medical genetics diagnostic, Oslo University Hospital
    • National Marrow Donor Program
    • Parasite Genomics, Sanger Institute
    • Sabeti Lab, Broad Institute
    • Veracyte Inc
  21. FUTURE WORK
    Short term
    • Investigate support for Bioboxes
    • Support for Git LFS (large file storage)
    • Version 1.0 (first half of 2016)
    Long term
    • Enhance processing capabilities with distributed caching and data affinity
    • Interoperability with YARN / Spark clusters / Common Workflow Language (CWL)
  22. CONCLUSION
    • Docker is a game-changer for workflow packaging and deployment.
    • Nextflow is a streaming-oriented framework for computational workflows.
    • Docker + Nextflow = reproducible, self-contained pipelines.
  23. LINKS
    • Project home: http://nextflow.io
    • Docker benchmark: https://peerj.com/articles/1273/
    • Univa-CRG white paper: http://goo.gl/lEPSe2
    • This presentation: https://speakerdeck.com/pditommaso