Introducing Nextflow

Paolo Di Tommaso
February 04, 2016

Introduction to Nextflow pipeline framework given at CNAG, Barcelona

Transcript

  1. PIPELINE FRAMEWORK
    Paolo Di Tommaso - Notredame Lab, CRG

  2. CHALLENGES
    • Optimise computation by taking advantage of distributed cluster / cloud resources
    • Simplify deployment of complex pipelines

  3. To replicate the result of a typical computational biology paper requires 280 hours!

  4. COMPLEXITY
    • Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
    • The experimental nature of academic software makes it difficult to install, configure and deploy
    • Heterogeneous execution platforms and system architectures (laptop → supercomputer)

  5. DO NOT REINVENT THE WHEEL

  6. UNIX PIPE MODEL
    cat seqs | blastp -query - | head -n 10 | t_coffee > result

  7. WHAT WE NEED
    Compose Linux commands and scripts as usual
    +
    Handle multiple inputs/outputs
    Portable across multiple platforms
    Fault tolerance

  8. NEXTFLOW
    • Fast application prototypes
    • High-level parallelisation model
    • Portable across multiple execution platforms
    • Enable pipeline reproducibility

  9. LIGHTWEIGHT

  10. Just download it:
    curl -fsSL get.nextflow.io | bash
    nextflow
    Dependencies: Unix-like OS (Linux, OSX, etc.) and Java 7/8

  11. FAST PROTOTYPING

  12. • A pipeline script is written by composing several processes
    • A process can execute any script or tool
    • It makes it possible to reuse any existing piece of code

  13. PROCESS DEFINITION
    process foo {
        input:
        val str from 'Hello'

        output:
        file 'my_file' into result

        script:
        """
        echo $str world! > my_file
        """
    }
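    A minimal usage sketch, assuming the process above is saved in a file named hello.nf (the file name is a placeholder):

    # launch the script; each task runs in its own working directory under ./work
    nextflow run hello.nf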

  14. WHAT A SCRIPT LOOKS LIKE
    sequences = Channel.fromPath("/data/sample.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
            blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file 'align.txt' into align_result

        """
        t_coffee $all_seqs 2>&- | tee align.txt
        """
    }

    align_result.collectFile(name: 'final_alignment')

  15. IMPLICIT PARALLELISM
    sequences = Channel.fromPath("/data/*.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
            blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file 'align.txt' into align_result

        """
        t_coffee $all_seqs 2>&- | tee align.txt
        """
    }

    align_result.collectFile(name: 'final_alignment')

  16. IMPLICIT PARALLELISM
    (diagram) Each input fasta file flows through its own BLAST task followed by a T-COFFEE task; the results are merged into a single alignment.

  17. DATAFLOW

  18. DATAFLOW
    • Declarative computational model for concurrent processes
    • Processes wait for data; when an input set is ready the process is executed
    • They communicate through dataflow variables, i.e. asynchronous streams of data called channels (see the sketch below)
    • Parallelisation and task dependencies are implicitly defined by the process input/output declarations
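    A minimal sketch of this model, in the same DSL syntax used above (process and channel names are placeholders):

    // an asynchronous stream (channel) of three values
    nums = Channel.from(1, 2, 3)

    process square {
        input:
        val x from nums        // the process waits for a value on the channel

        output:
        stdout into squares    // results are emitted on a downstream channel

        """
        echo \$(( $x * $x ))
        """
    }

    // each value triggers an independent task; results arrive as tasks complete
    squares.subscribe { println it.trim() }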

  19. REACTIVE NETWORK

  20. PORTABLE SCRIPTS

  21. • The executor abstraction layer allows you to run the same script on different platforms (see the sketch below)
    • Local (default)
    • Cluster (SGE, LSF, SLURM, Torque/PBS)
    • HPC (beta)
    • Cloud (beta)
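    The executor can also be selected per process with a directive; a minimal sketch, where the process and queue names are placeholders:

    process heavy_step {
        executor 'sge'    // dispatch this process through SGE instead of running it locally
        queue 'long'      // placeholder queue name

        """
        echo running on the cluster
        """
    }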

  22. (diagram) Nextflow internals: the DSL interpreter feeds the dataflow engine; a task dispatcher hands tasks to the executors, which run them as local POSIX processes or submit them with qsub and similar commands.

  23. LOCAL EXECUTOR
    (diagram) nextflow and the task POSIX processes run on the same host and share the local file system.

  24. CLUSTER EXECUTOR
    (diagram) nextflow runs on a login node and submits tasks through the batch scheduler to the cluster nodes, which share an NFS/GPFS file system.

  25. CONFIGURATION FILE
    process {
        executor = 'sge'
        queue    = 'cn-el6'
        memory   = '10GB'
        cpus     = 8
        time     = '2h'
    }
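    Nextflow reads a file named nextflow.config from the launch directory by default; a minimal usage sketch (main.nf and cluster.config are placeholder names):

    # settings in ./nextflow.config are applied automatically
    nextflow run main.nf

    # or point the launcher at an alternative configuration file
    nextflow -c cluster.config run main.nf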

  26. HPC EXECUTOR
    (diagram) A single job request, submitted from the login node, starts a Nextflow cluster on the HPC nodes: one nextflow driver plus several nextflow workers, all sharing the NFS/GPFS file system. The job wrapper looks like:
    #!/bin/bash
    #$ -q
    #$ -pe ompi
    #$ -l virtual_free=
    mpirun nextflow run -with-mpi

  27. (image-only slide, no transcript text)

  28. CONTAINERS ALLOW YOU TO ISOLATE TASK DEPENDENCIES

  29. VM VS CONTAINER (comparison diagram)

  30. BENEFITS
    • Smaller images (~100MB)
    • Fast instantiation time (<1sec)
    • Almost native performance
    • Easy to build, publish, share and deploy
    • Enable tool versioning and archiving

  31. BASIC CONTAINERISATION
    (diagram) A Docker image packages the binary tools, compilers, libraries and environment; the workflow scripts and configuration file stay on the host.
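    A minimal sketch of how this is typically wired up in Nextflow (the image name is a placeholder):

    // nextflow.config
    docker.enabled    = true
    process.container = 'my-org/pipeline-tools'   // placeholder Docker image

    // the same can be requested at launch time:
    //   nextflow run main.nf -with-docker my-org/pipeline-tools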

  32. SCALING OUT

  33. OUR SOLUTION
    (diagram) NEXTFLOW, the host file system and a Docker registry.

  34. DOCKER AT CRG
    (diagram) Nextflow, driven by a pipeline script and a config file, submits jobs from the head node through Univa Grid Engine; container images are pulled from a Docker registry.

  35. PROS
    • Dead easy deployment procedure
    • Self-contained and precisely controlled runtime
    • Rapidly reproduce any former configuration
    • Consistent results over time and across different
    platforms

  36. CONS
    • Requires a modern Linux kernel (≥3.10)
    • Security concerns
    • Containers/images cleanup

  37. SHIFTER
    • Alternative implementation developed by NERSC
    (Berkeley lab)
    • HPC friendly, does not require special permission
    • Compatible with Docker images
    • Integrated with SLURM scheduler

  38. ERROR RECOVERY

  39. • Stop on failure / fix / resume the execution
    • Automatically re-execute failing tasks, increasing the requested resources (memory, disk, etc.), as sketched below
    • Ignore task errors
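    A minimal sketch of these recovery options (process and tool names are placeholders):

    process fragile_step {
        errorStrategy 'retry'            // re-run the task on failure ('ignore' would skip it instead)
        maxRetries 3
        memory { 1.GB * task.attempt }   // request more memory on each new attempt

        output:
        file 'out.txt' into done

        """
        my_tool > out.txt    # my_tool stands for any command that may fail
        """
    }

    The stop / fix / resume pattern maps to re-launching the pipeline with the -resume option, e.g. nextflow run main.nf -resume.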

  40. DEMO

  41. WHO IS USING NEXTFLOW?
    • Campagne Lab, Weill Medical College of Cornell University
    • Center for Biotechnology, Bielefeld University
    • Genetic Cancer group, International Agency for Research on Cancer
    • Guigo Lab, Center for Genomic Regulation
    • Medical genetics diagnostics, Oslo University Hospital
    • National Marrow Donor Program
    • Joint Genome Institute
    • Parasite Genomics, Sanger Institute

  42. FUTURE WORK
    Short term
    • Built-in support for Shifter
    • Enhance scheduling capability of HPC execution mode
    • Version 1.0 (second half 2016)
    Long term
    • Web user interface
    • Enhance support for cloud (Google Compute Engine)

  43. CONCLUSION
    • Nextflow is a streaming-oriented framework for computational workflows
    • It is not meant to replace your favourite tools
    • It provides a parallel and scalable environment for your scripts
    • It enables reproducible pipeline deployment

  44. THANKS

  45. LINKS
    project home
    http://nextflow.io
    GitHub repository
    http://github.com/nextflow-io/nextflow
    this presentation
    https://speakerdeck.com/pditommaso
