COMPLEXITY
• Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
• Experimental academic software tends to be difficult to install, configure and deploy
• Heterogeneous execution platforms and system architectures (laptop → supercomputer)
• A pipeline script is written by composing several processes
• A process can execute any script or tool
• This lets you reuse any existing piece of code
DATAFLOW
• Declarative computational model for concurrent processes
• Processes wait for data; when an input set is ready, the process is executed
• They communicate through dataflow variables, i.e. asynchronous streams of data called channels
• Parallelisation and task dependencies are implicitly defined by the processes' input/output declarations
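The dataflow model above can be sketched in a few lines of Python. This is an illustration of the idea only, not Nextflow syntax: the queues stand in for channels, and the names `process`, `run_pipeline` and `DONE` are hypothetical. Each process sleeps until an item arrives on its input channel, so the execution order follows from how the channels are wired rather than from an explicit schedule.

```python
import asyncio

DONE = object()  # hypothetical sentinel marking the end of a channel


async def process(fn, inbox, outbox):
    """Apply fn to each item arriving on inbox and emit the result on outbox."""
    while True:
        item = await inbox.get()  # the process waits until an input is ready
        if item is DONE:
            await outbox.put(DONE)  # propagate end-of-stream downstream
            return
        await outbox.put(fn(item))


async def run_pipeline(inputs):
    """Wire two processes together through channels and collect the output."""
    raw, upper, tagged = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    # Dependencies are implied by the wiring: raw -> upper -> tagged.
    tasks = [
        asyncio.create_task(process(str.upper, raw, upper)),
        asyncio.create_task(process(lambda s: s + "!", upper, tagged)),
    ]
    for item in inputs:
        await raw.put(item)
    await raw.put(DONE)

    results = []
    while True:
        item = await tagged.get()
        if item is DONE:
            break
        results.append(item)
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["acgt", "ttga"])))
```

Both processes run concurrently: the second starts working on the first result while the first is still consuming its input.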
• The executor abstraction layer allows you to run the same script on different platforms:
  • Local (default)
  • Cluster (SGE, LSF, SLURM, Torque/PBS)
  • HPC (beta)
  • Cloud (beta)
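As a rough illustration of what an executor abstraction layer does, the Python sketch below keeps the task script fixed and changes only the command used to launch it. It is a simplification under stated assumptions and not how Nextflow is implemented: `SUBMIT_COMMANDS` and `build_command` are hypothetical names, and real submissions also carry scheduler options (queue, memory, CPUs) that are omitted here.

```python
# Hypothetical mapping from executor name to the command that launches a script.
# qsub, bsub and sbatch are the standard submit commands of the listed schedulers.
SUBMIT_COMMANDS = {
    "local": ["bash"],    # run directly on the current machine
    "sge": ["qsub"],      # Sun Grid Engine
    "lsf": ["bsub"],      # Platform LSF
    "slurm": ["sbatch"],  # SLURM
    "pbs": ["qsub"],      # Torque/PBS
}


def build_command(script, executor="local"):
    """Return the argument list that would launch `script` on `executor`."""
    try:
        prefix = SUBMIT_COMMANDS[executor]
    except KeyError:
        raise ValueError("unknown executor: %r" % (executor,))
    return prefix + [script]


if __name__ == "__main__":
    # The task script is identical in every case; only the launcher differs.
    for name in ("local", "slurm", "lsf"):
        print(name, build_command("task.sh", name))
```

The point of the design is that the pipeline author never writes these launcher commands; the platform is selected by configuration.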
BENEFITS
• Smaller images (~100 MB)
• Fast instantiation time (<1 s)
• Almost native performance
• Easy to build, publish, share and deploy
• Enables tool versioning and archiving
PROS
• Dead easy deployment procedure
• Self-contained and precisely controlled runtime
• Rapidly reproduce any former configuration
• Consistent results over time and across different platforms
SHIFTER
• Alternative implementation developed by NERSC (Berkeley Lab)
• HPC friendly, does not require special permissions
• Compatible with Docker images
• Integrated with the SLURM scheduler
WHO IS USING NEXTFLOW?
• Campagne Lab, Weill Medical College of Cornell University
• Center for Biotechnology, Bielefeld University
• Genetic Cancer group, International Agency for Research on Cancer
• Guigo Lab, Centre for Genomic Regulation
• Medical genetics diagnostic, Oslo University Hospital
• National Marrow Donor Program
• Joint Genome Institute
• Parasite Genomics, Sanger Institute
FUTURE WORK
Short term
• Built-in support for Shifter
• Enhance scheduling capability of HPC execution mode
• Version 1.0 (second half 2016)
Long term
• Web user interface
• Enhance support for cloud (Google Compute Engine)
CONCLUSION
• Nextflow is a streaming-oriented framework for computational workflows
• It is not supposed to replace your favourite tools
• It provides a parallel and scalable environment for your scripts
• It enables reproducible pipeline deployment