
Reproducible computational pipelines with Docker and Nextflow

Paolo Di Tommaso
November 09, 2015

Bio In Docker Symposium, 9 November 2015, London.

http://core.brc.iop.kcl.ac.uk/wp-content/uploads/2015/02/BioInDockerAgenda.pdf


Transcript

  1. Paolo Di Tommaso - Notredame Lab, Center for Genomic Regulation (CRG). Bio in Docker Symposium - 9 Nov 2015, London. Reproducible computational pipelines with Docker and Nextflow
  2. @PaoloDiTommaso - Research software engineer, Comparative Bioinformatics, Notredame Lab, Center for Genomic Regulation (CRG)
  3. None
  4. WHAT THINGS MOST FRUSTRATE YOU OR LIMIT YOUR ABILITY TO CARRY OUT BIOINFORMATICS ANALYSIS?*
     • Ability to compile/run the average software suite.
     • Diversity and complexity of software deployments.
     • Poor and/or incomplete documentation.
     • Installation of different software, packages,... and getting them to work on different platforms.
     • Lack of truly standard file formats.
     • Time installing software.
     • Lack of computing resources (CPUs, memory, storage, etc).
     * Available at https://goo.gl/TF9TMj
  5. To replicate the result of a typical computational biology paper requires 280 hours!
  6. WHAT'S WRONG WITH COMPUTATIONAL WORKFLOWS?

  7. COMPLEXITY
     • Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
     • The experimental nature of academic software makes it difficult to install, configure and deploy
     • Heterogeneous execution platforms and system architectures (laptop → supercomputer)
  8. None
  9. CONTAINERS ARE THE THIRD BIG WAVE IN VIRTUALISATION TECHNOLOGY

  10. VM VS CONTAINER

  11. BENEFITS
      • Smaller images (~100 MB)
      • Fast instantiation time (~1 sec)
      • Almost native performance
      • Easy to build, publish, share and deploy
      • Transparent build process
  12. PACKAGING A WORKFLOW (diagram: a Docker image on the host bundling binary tools, workflow scripts, config files, compilers, libraries and the runtime environment)
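      As a minimal sketch of how such a packaged image is consumed (the image name is hypothetical), a Nextflow process can declare the Docker image that bundles its tools with the container directive:

      process blast {
          container 'mylab/blast:latest'   // hypothetical image bundling the BLAST binaries

          """
          blastp -version
          """
      }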
  13. SCALING OUT (diagram)

  14. CONTAINER ORCHESTRATION
      • Swarm
      • Fleet
      • Kubernetes
      • Mesos

  15. NOT THE RIGHT ANSWER FOR COMPUTATIONAL PIPELINES

  16. SERVICE ORCHESTRATION ≠ TASK SCHEDULING

  17. OUR SOLUTION: NEXTFLOW (diagram: Nextflow, the host file system and a Docker registry)

  18. DOCKER AT CRG (diagram: Nextflow with a config file and pipeline script running on the head node, pulling images from a Docker registry and submitting tasks through Univa Grid Engine)
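      A minimal configuration sketch for a setup along these lines (the image name is hypothetical) enables Docker and routes task submission through the grid engine executor:

      docker {
          enabled = true                      // run every task inside its container
      }

      process {
          executor  = 'sge'                   // submit tasks via (Univa) Grid Engine
          container = 'crg/pipeline:latest'   // hypothetical image pulled from the registry
      }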
  19. PROS
      • Dead-easy deployment procedure
      • Self-contained and precisely controlled runtime
      • Rapidly reproduce any former configuration
      • Consistent results over time and across different platforms
  20. CONS
      • Requires a modern Linux kernel (≥ 3.10)
      • Security concerns
      • Container/image cleanup
  21. WHAT ABOUT PERFORMANCE?

  22. BENCHMARK*
      * Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273. https://dx.doi.org/10.7717/peerj.1273
  23. • A DSL (domain-specific language) on top of the JVM
      • High-level parallelisation model
      • Configurable executors target multiple platforms
  24. RATIONALE
      • Make workflows portable across different computational environments
      • Simplify deployment and enable reproducibility
      • Reuse any existing piece of software (tools, scripts, etc.)
  25. DATAFLOW PROGRAMMING (diagram of processes connected by channels)
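      A minimal sketch of the dataflow idea (a hypothetical toy example): a process fires as soon as a value arrives on its input channel, and its outputs flow into a downstream channel:

      nums = Channel.from(1, 2, 3)

      process square {
          input:
          val x from nums

          output:
          val y into squares

          exec:
          y = x * x                  // native task, executed once per input value
      }

      squares.subscribe { println it }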

  26. PROCESS DEFINITION

      process foo {
          input:
          val str from 'Hello'

          output:
          file 'my_file' into result

          script:
          """
          echo $str world! > my_file
          """
      }
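      Assuming the definition above is saved as hello.nf (a hypothetical file name), it can be launched with:

      nextflow run hello.nf

      Nextflow binds 'Hello' to str, runs the task, and emits my_file on the result channel.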
  27. WHAT A SCRIPT LOOKS LIKE

      sequences = Channel.fromPath("/data/sample.fasta")

      process blast {
          input:
          file 'in.fasta' from sequences

          output:
          file 'out.txt' into blast_result

          """
          blastp -query in.fasta -outfmt 6 | cut -f 2 | \
          blastdbcmd -entry_batch - > out.txt
          """
      }

      process align {
          input:
          file all_seqs from blast_result

          output:
          file align_result

          """
          t_coffee $all_seqs 2>&- | tee align_result
          """
      }

      align_result.collectFile(name: 'final_alignment')
  28. IMPLICIT PARALLELISM

      sequences = Channel.fromPath("/data/*.fasta")

      process blast {
          input:
          file 'in.fasta' from sequences

          output:
          file 'out.txt' into blast_result

          """
          blastp -query in.fasta -outfmt 6 | cut -f 2 | \
          blastdbcmd -entry_batch - > out.txt
          """
      }

      process align {
          input:
          file all_seqs from blast_result

          output:
          file align_result

          """
          t_coffee $all_seqs 2>&- | tee align_result
          """
      }

      align_result.collectFile(name: 'final_alignment')
  29. IMPLICIT PARALLELISM (diagram: each input fasta file spawns its own BLAST task followed by a T-COFFEE task; the results are merged into a single alignment)
  30. BENEFITS
      • High-level declarative parallelisation model
      • Portable across different platforms
      • Isolates task dependencies with Docker containers
  31. CONFIGURATION FILE

      process {
          container = 'your/image:latest'
          executor = 'sge'
          queue = 'cn-el6'
          memory = '10GB'
          cpus = 8
          time = '2h'
      }
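      Assuming this file is saved as nextflow.config next to the pipeline script (the script name below is hypothetical), the pipeline can then be launched unchanged, with every task running inside the configured container:

      nextflow run my-pipeline.nf -with-docker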
  32. DEPLOYMENT MODES
      • Local execution
      • Grid engine / batch scheduler
      • Distributed execution (embedded cluster)
      • Managed cloud (ClusterK, DNAnexus)
      • AWS cloud platform
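      One possible way to switch between these modes without touching the pipeline script is to define one configuration profile per target platform; a sketch with hypothetical profile and queue names:

      profiles {
          standard {
              process.executor = 'local'   // run on the local machine
          }
          cluster {
              process.executor = 'sge'     // run through the batch scheduler
              process.queue    = 'long'    // hypothetical queue name
          }
      }

      The profile is then selected at launch time, e.g. nextflow run my-pipeline.nf -profile cluster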
  33. LOCAL EXECUTION (diagram: the nextflow driver spawns task processes on the host, communicating through the local POSIX file system)
  34. GRID ENGINE (diagram: nextflow runs on a login node and submits tasks through the batch scheduler to cluster nodes sharing an NFS file system)
  35. SUPPORTED PLATFORMS

  36. DISTRIBUTED MODE (diagram: a job request from the login node allocates cluster nodes sharing NFS/Lustre storage; the nextflow driver and nextflow workers form an embedded Apache Ignite cluster on the HPC cluster)

      Job wrapper:

      #!/bin/bash
      #$ -q <queue>
      #$ -pe ompi <nodes>
      #$ -l virtual_free=<mem>
      mpirun nextflow run <your-pipeline> -with-mpi
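      For example, if the job wrapper above is saved as launcher.sh (a hypothetical name), submitting it starts the whole embedded cluster:

      qsub launcher.sh

      The scheduler allocates the requested nodes, mpirun launches one Nextflow instance per node, and the first instance acts as the driver while the others run as workers.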
  37. AWS CLOUD (diagram: the Nextflow driver on an EC2 node submits tasks through an elastic load balancer to Nextflow workers on EC2 spot instances, clustered with Apache Ignite, using S3 / EFS storage and Docker images from the AWS ECR registry)
  38. MANAGED CLOUD (CLUSTERK) (diagram: Nextflow submits tasks to the Cirrus scheduler, which configures EC2 spot instances and executes the tasks, with data held in Amazon S3)
  39. DEMO

  40. WHO IS USING NEXTFLOW?
      • Campagne Lab, Weill Medical College of Cornell University
      • Center for Biotechnology, Bielefeld University
      • Genetic Cancer group, International Agency for Research on Cancer
      • Guigo Lab, Center for Genomic Regulation
      • Medical genetics diagnostics, Oslo University Hospital
      • National Marrow Donor Program
      • Parasite Genomics, Sanger Institute
      • Sabeti Lab, Broad Institute
      • Veracyte Inc.
  41. FUTURE WORK
      Short term
      • Investigate support for Bioboxes
      • Support for Git LFS (large file storage)
      • Version 1.0 (first half of 2016)
      Long term
      • Enhance processing capabilities with distributed caching and data affinity
      • Interoperability with YARN / Spark clusters / the Common Workflow Language
  42. CONCLUSION
      • Docker is a game-changer for workflow packaging and deployment
      • Nextflow is a streaming-oriented framework for computational workflows
      • Docker + Nextflow = reproducible, self-contained pipelines
  43. THANKS

  44. LINKS
      • Project home: http://nextflow.io
      • Docker benchmark: https://peerj.com/articles/1273/
      • Univa-CRG white paper: http://goo.gl/lEPSe2
      • This presentation: https://speakerdeck.com/pditommaso