
Reproducible computational pipelines with Docker and Nextflow

Paolo Di Tommaso
November 09, 2015

Bio In Docker Symposium, 9 November 2015, London.

http://core.brc.iop.kcl.ac.uk/wp-content/uploads/2015/02/BioInDockerAgenda.pdf


Transcript

  1. Paolo Di Tommaso - Notredame Lab
    Center for Genomic Regulation (CRG)
    Bio in Docker Symposium - 9 Nov 2015, London
    Reproducible computational pipelines with Docker and Nextflow

  2. @PaoloDiTommaso
    Research software engineer
    Comparative Bioinformatics, Notredame Lab
    Center for Genomic Regulation (CRG)

  3. (image-only slide)

  4. WHAT THINGS MOST FRUSTRATE YOU OR
    LIMIT YOUR ABILITY TO CARRY OUT
    BIOINFORMATICS ANALYSIS?*
    • Ability to compile/run the average software suite.

    • Diversity and complexity of software deployments.

    • Poor and/or incomplete documentation.

    • Installation of different software, packages,... and getting them to
    work on different platforms.

    • Lack of truly standard file formats.

    • Time installing software.

    • Lack of computing resources (cpus, memory, storage, etc).
    * available at https://goo.gl/TF9TMj


  5. Replicating the results of a typical computational biology paper requires 280 hours!

  6. WHAT'S WRONG WITH COMPUTATIONAL WORKFLOWS?

  7. COMPLEXITY
    • Dozens of dependencies (binary tools, compilers,
    libraries, system tools, etc)

    • Academic software is experimental by nature and tends to be
    difficult to install, configure and deploy

    • Heterogeneous execution platforms and system
    architectures (laptop→supercomputer)

  8. (image-only slide)

  9. CONTAINERS ARE THE THIRD BIG WAVE IN VIRTUALISATION TECHNOLOGY

  10. VM VS CONTAINER

  11. BENEFITS
    • Smaller images (~100MB)

    • Fast instantiation time (~1sec)

    • Almost native performance

    • Easy to build, publish, share and deploy

    • Transparent build process


  12. PACKAGING A WORKFLOW

    (diagram: a Docker image on the host bundling binary tools, workflow scripts, config files, compilers, libraries and the environment)
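The packaged image on this slide can be sketched as a Dockerfile. This is a hypothetical example, not the actual CRG image; the base image, package names and paths are illustrative:

```
# Illustrative Dockerfile bundling a workflow's dependencies into one image
FROM debian:jessie

# Compilers, libraries and binary tools (package names are examples)
RUN apt-get update && apt-get install -y \
        build-essential \
        ncbi-blast+ \
    && rm -rf /var/lib/apt/lists/*

# Workflow scripts and configuration
COPY scripts/ /usr/local/bin/
COPY nextflow.config /workflow/
```

Once the image is built and pushed to a registry, every execution platform pulls the exact same bytes, which is what makes the runtime reproducible.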

  13. SCALING OUT

    (diagram: many identical containers running in parallel)

  14. CONTAINER ORCHESTRATION
    • Swarm

    • Fleet

    • Kubernetes

    • Mesos


  15. NOT THE RIGHT ANSWER FOR COMPUTATIONAL PIPELINES

  16. SERVICES ORCHESTRATION ≠ TASK SCHEDULING

  17. OUR SOLUTION: NEXTFLOW

    (diagram: Nextflow orchestrating Docker containers over the host file system and an image registry)

  18. DOCKER AT CRG

    (diagram: Nextflow, driven by a config file and a pipeline script, submits jobs from the head node to Univa Grid Engine; cluster nodes pull images from a Docker registry)

  19. PROS
    • Dead easy deployment procedure

    • Self-contained and precisely controlled runtime

    • Rapidly reproduce any former configuration

    • Consistent results over time and across different
    platforms


  20. CONS
    • Requires a modern Linux kernel (≥3.10)

    • Security concerns

    • Containers/images cleanup


  21. WHAT ABOUT PERFORMANCE?

  22. BENCHMARK*
    * Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) 

    The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273 

    https://dx.doi.org/10.7717/peerj.1273


  23. • DSL (domain-specific language) on top of the JVM

    • High-level parallelisation model

    • Configurable executors target multiple platforms


  24. RATIONALE
    • Make workflows portable across different
    computational environments

    • Simplify deployment and enable reproducibility

    • Reuse any existing piece of SW (tools, scripts, etc)


  25. DATAFLOW PROGRAMMING

    (diagram: a network of processes A, B, C and D connected by channels X and Y)
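The dataflow model can be illustrated outside Nextflow with a small Python sketch (my own example, not Nextflow code): each process is a worker that fires as soon as data arrives on its input channel, and channels are plain queues.

```python
# Minimal dataflow sketch: a "process" consumes an input channel (queue)
# and emits results on an output channel, running concurrently.
from queue import Queue
from threading import Thread

def process(fn, inputs, outputs):
    """Apply fn to every item arriving on the input channel."""
    def worker():
        for item in iter(inputs.get, None):  # None marks end-of-stream
            outputs.put(fn(item))
        outputs.put(None)                    # propagate end-of-stream
    Thread(target=worker).start()

x, y = Queue(), Queue()       # two channels
process(str.upper, x, y)      # a process reading from x, writing to y

for s in ["hello", "world"]:  # feed the input channel
    x.put(s)
x.put(None)

result = list(iter(y.get, None))
print(result)                 # -> ['HELLO', 'WORLD']
```

Processes never call each other directly; wiring them only through channels is what lets the runtime execute independent tasks in parallel, as the following slides show with `blast` and `align`.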

  26. PROCESS DEFINITION

    process foo {

        input:
        val str from 'Hello'

        output:
        file 'my_file' into result

        script:
        """
        echo $str world! > my_file
        """
    }

  27. WHAT A SCRIPT LOOKS LIKE

    sequences = Channel.fromPath("/data/sample.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
        blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file align_result

        """
        t_coffee $all_seqs 2>&- | tee align_result
        """
    }

    align_result.collectFile(name: 'final_alignment')

  28. IMPLICIT PARALLELISM

    sequences = Channel.fromPath("/data/*.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
        blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file align_result

        """
        t_coffee $all_seqs 2>&- | tee align_result
        """
    }

    align_result.collectFile(name: 'final_alignment')

  29. IMPLICIT PARALLELISM

    (diagram: each input file gets its own BLAST task followed by a T-COFFEE task; the results are merged into a single alignment)

  30. BENEFITS
    • High-level declarative parallelisation model

    • Portable across different platforms

    • Isolates task dependencies with Docker containers


  31. CONFIGURATION FILE

    process {
        container = 'your/image:latest'
        executor = 'sge'
        queue = 'cn-el6'
        memory = '10GB'
        cpus = 8
        time = '2h'
    }
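Settings like these can also be scoped to individual processes. Below is a sketch using the `withName` process selector from later Nextflow releases (the syntax current in 2015 differed); the container names are illustrative:

```
process {
    executor = 'sge'

    // per-process overrides (image names are illustrative)
    withName: blast {
        container = 'biocontainers/blast:latest'
        cpus = 4
    }
    withName: align {
        container = 'biocontainers/t-coffee:latest'
        memory = '8GB'
    }
}
```

Keeping resource requests and container versions in the config file, outside the pipeline script, is what lets the same script run unchanged on a laptop or on a cluster.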

  32. DEPLOYMENT MODES
    • Local execution

    • Grid engine / batch scheduler

    • Distributed execution (embedded cluster)

    • Managed cloud (ClusterK, DNAnexus)

    • AWS cloud platform


  33. LOCAL EXECUTION

    (diagram: nextflow spawns task processes directly on the host, sharing data through the POSIX file system)

  34. GRID ENGINE

    (diagram: nextflow runs on the login node and submits tasks to the batch scheduler; cluster nodes share work files over NFS)

  35. SUPPORTED PLATFORMS

  36. DISTRIBUTED MODE

    (diagram: a job request from the login node launches a self-contained nextflow cluster on the HPC nodes: one nextflow driver plus nextflow workers, coordinated by Apache Ignite over NFS/Lustre shared storage)

    Job wrapper:

    #!/bin/bash
    # values omitted on the slide are shown as <...> placeholders
    #$ -q <queue>
    #$ -pe ompi <slots>
    #$ -l virtual_free=<mem>
    mpirun nextflow run <pipeline> -with-mpi

  37. AWS CLOUD

    (diagram: a Nextflow driver on an EC2 node submits tasks through an elastic load balancer to nextflow workers on EC2 spot instances, clustered with Apache Ignite; data lives on S3 / EFS and images are pulled from a Docker registry on AWS ECR)

  38. MANAGED CLOUD (CLUSTERK)

    (diagram: the pipeline script is submitted to the Cirrus scheduler, which configures EC2 spot instances and executes tasks, with data on Amazon S3)

  39. DEMO


  40. WHO IS USING NEXTFLOW?
    • Campagne Lab, Weill Medical College of Cornell University

    • Center for Biotechnology, Bielefeld University

    • Genetic Cancer group, International Agency for Research on Cancer

    • Guigo Lab, Center for Genomic Regulation

    • Medical genetics diagnostic, Oslo University Hospital

    • National Marrow Donor Program

    • Parasite Genomics, Sanger Institute

    • Sabeti Lab, Broad Institute

    • Veracyte Inc


  41. FUTURE WORK
    Short term

    • Investigate support for Bioboxes

    • Support for Git LFS (large file storage)

    • Version 1.0 (first half 2016)

    Long term

    • Enhance processing capabilities with distributed caching and
    data affinity.

    • Interoperability with YARN / Spark clusters / Common Workflow Language


  42. CONCLUSION
    • Docker is a game-changer for workflow
    packaging and deployment

    • Nextflow is a streaming-oriented framework for
    computational workflows

    • Docker + Nextflow = reproducible, self-contained pipelines


  43. THANKS


  44. LINKS
    project home: http://nextflow.io
    Docker benchmark: https://peerj.com/articles/1273/
    Univa-CRG white paper: http://goo.gl/lEPSe2
    this presentation: https://speakerdeck.com/pditommaso