Nextflow: a tool for deploying reproducible computational pipelines

Bioinformatics Open Source Conference, 11 July 2015, Dublin, Ireland

Paolo Di Tommaso
Research software engineer, ~5 years at Notredame Lab, Comparative Bioinformatics, Center for Genomic Regulation (CRG)

WHAT IS NEXTFLOW
A framework and fluent DSL modelled around the UNIX pipe concept that simplifies writing parallel pipelines in a portable manner

MOTIVATION
• Fast workflow prototyping
• Reuse any existing scripts/tools
• High-level parallelisation model
• Portable across platforms
• Enables reproducibility

HOW IT WORKS
• The pipeline flow is defined in a declarative manner
• A script is composed of several processes
• A process is defined by a set of inputs/outputs and a script snippet to be executed

PROCESS DEFINITION

    process foo {

      input:
      val str from 'Hello'

      output:
      file 'my_file' into result

      script:
      """
      echo $str world! > my_file
      """
    }

DATAFLOW
• Declarative computational model for concurrent processes
• Processes wait for data; when an input set is ready, the process is executed
• They communicate by using dataflow variables, i.e. asynchronous streams of data called channels
• Parallelisation and task dependencies are implicitly defined by process in/out declarations
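The dataflow points above can be illustrated with a small, language-neutral sketch. It is written in Python, not Nextflow, and it is not how Nextflow is implemented: channels are approximated with `queue.Queue`, each process with one worker thread, and the names `process`, `run_pipeline` and `DONE` are hypothetical. Unlike Nextflow, this sketch runs one task at a time per process, so stages overlap but items within a stage do not run in parallel.

```python
import queue
import threading

DONE = object()  # hypothetical sentinel marking the end of a channel


def process(task, inbox, outbox):
    """Start a worker that runs `task` once for every item arriving on `inbox`."""
    def worker():
        while True:
            item = inbox.get()        # block until an input is ready
            if item is DONE:
                outbox.put(DONE)      # propagate end-of-stream downstream
                return
            outbox.put(task(item))    # emit the result on the output channel

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def run_pipeline(inputs):
    """Wire two processes together through channels and collect the results."""
    words, shouted, result = queue.Queue(), queue.Queue(), queue.Queue()

    # The execution order is never stated explicitly: it follows from which
    # channel each process reads from and writes to.
    threads = [
        process(str.upper, words, shouted),
        process(lambda s: s + " world!", shouted, result),
    ]

    for item in inputs:
        words.put(item)
    words.put(DONE)

    collected = []
    while True:
        item = result.get()
        if item is DONE:
            break
        collected.append(item)

    for thread in threads:
        thread.join()
    return collected
```

Calling `run_pipeline(["hello", "ciao"])` pushes two items through both stages; the second stage starts working as soon as the first item is available, without waiting for the first stage to finish all of its inputs.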

REACTIVE NETWORK

WHAT A SCRIPT LOOKS LIKE

    params.query = "$baseDir/data/sample.fa"

    seq = Channel.fromPath(params.query)

    process blast {
      input:
      file 'seq.fa' from seq

      output:
      file 'out.txt' into result

      script:
      """
      blastp -db NR -query seq.fa -outfmt 6 > out.txt
      """
    }

    result.print { it.text }

    $ nextflow run blast-test.nf

    N E X T F L O W ~ version 0.12.0
    [3d/ec5c2e] Submitted process > blast (1)

    1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
    1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
    1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
    1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
    1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
    1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
    1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
    1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5

    $ nextflow run blast-test.nf --query 'data/*.fa'

    N E X T F L O W ~ version 0.12.0
    [3d/ec5c2e] Submitted process > blast (2)
    [1f/277042] Submitted process > blast (3)
    [9d/b49472] Submitted process > blast (4)
    [4a/3c2d5e] Submitted process > blast (1)
    [61/7dc8f0] Submitted process > blast (5)

    1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
    1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
    1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
    1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
    1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
    1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
    1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
    1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5
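The second run differs from the first only in its `--query` value: the glob `data/*.fa` matches several files, and the log shows one `blast` task submitted per file, in no particular order. A rough Python sketch of that "one task per matching file" behaviour follows. It is an illustration under assumptions, not Nextflow code: `count_lines` is a hypothetical stand-in for the real `blastp` call, and `run_for_each` and `demo` are made-up helper names.

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Hypothetical stand-in for one task: return the file name and its line count."""
    with open(path) as handle:
        return os.path.basename(path), sum(1 for _ in handle)


def run_for_each(pattern, task):
    """Expand a glob pattern and run `task` once per matching file, concurrently."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor() as pool:
        # Tasks may finish in any order; map() returns results in input order.
        return list(pool.map(task, paths))


def demo():
    """Create a throwaway data directory and run one task per *.fa file in it."""
    with tempfile.TemporaryDirectory() as data:
        for name in ("a.fa", "b.fa", "notes.txt"):
            with open(os.path.join(data, name), "w") as handle:
                handle.write(">seq\nACGT\n")
        return run_for_each(os.path.join(data, "*.fa"), count_lines)
```

The script itself is unchanged between the two runs; only the input pattern decides how many tasks are created, which is why `notes.txt` is ignored in `demo()`.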

PLATFORM AGNOSTIC
[Diagram labels: Nextflow, local executor, *nix OS; Nextflow, grid executor, cluster engine, NFS]

SUPPORTED PLATFORMS

HOW TO MAKE REPRODUCIBILITY EASIER?

WHY IS IT SO HARD TO DEPLOY A PIPELINE?

COMMON PROBLEMS
• Many dependencies (scripts, tools, DB, etc.)
• Environment configuration
• Frequent updates
• Track changes and versions
• The tragedy of absolute paths ...

OUR SOLUTION
• Manage a pipeline as a self-contained GitHub repository
• It includes external scripts and environment configuration
• Binary dependencies can be deployed with Docker containers

A GITHUB REPO CAN BE USED TO DEPLOY A SELF-CONTAINED PIPELINE PROJECT

    $ nextflow run nextflow-io/rnatoy -with-docker

    N E X T F L O W ~ version 0.14.3
    Pulling nextflow-io/rnatoy ...
    downloaded from https://github.com/nextflow-io/rnatoy.git

    Launching 'nextflow-io/rnatoy' - revision: 9c61bf5ac5 [master]
    R N A T O Y   P I P E L I N E
    =================================
    genome  : /User/../data/ggal_1_4885000_49020000.Ggal71.500bp.fa
    annotat : /User/../data/ggal_1_4885000_49020000.bed.gff
    pair1   : /User/../data/*_1.fq
    pair2   : /User/../data/*_2.fq
    [warm up] executor > local
    [02/b08c28] Submitted process > buildIndex (ggal_1_4885000_49020000.Ggal71)
    [ea/97d004] Submitted process > mapping (ggal_gut)
    [98/16c9e5] Submitted process > mapping (ggal_liver)
    [b5/38a0c7] Submitted process > makeTranscript (ggal_gut)
    [00/e5efd6] Submitted process > makeTranscript (ggal_liver)
    Saving: transcript_ggal_gut.gtf
    Saving: transcript_ggal_liver.gtf

DISTRIBUTED MODEL
[Diagram labels: Nextflow, pipeline script, config file]

VERSIONING
• Any Git tag, branch or commit ID can be used as a revision identifier
• This allows any previous version to be executed in a consistent manner

    $ nextflow run nextflow-io/rnatoy -revision v1.0

    N E X T F L O W ~ version 0.14.3
    Launching 'nextflow-io/rnatoy' - revision: 0d0443d8f7 [v1.0]
    R N A T O Y   P I P E L I N E
    =================================
    [35/cb611b] Submitted process > prepareTranscriptome (1)
    [cd/239926] Submitted process > buildIndex (1)
    [c6/f6488d] Submitted process > mapping (2)
    [bc/b3ea76] Submitted process > mapping (1)
    [f4/8d4628] Submitted process > makeTranscript (1)
    [eb/92db7f] Submitted process > makeTranscript (2)
    Saving: transcript_ggal_alpha.gtf
    Saving: transcript_ggal_beta.gtf

WHO IS USING NEXTFLOW?
• Andrew Stewart, Veracyte Inc
• Emilio Palumbo, Center for Genomic Regulation
• Georgios Pappas, University of Brasilia
• Lukas Jelonek, Justus-Liebig-Universität Gießen
• Matthieu Foll, International Agency for Research on Cancer
• Michael L Heuer, National Marrow Donor Program
• Rémi Planel, University Claude Bernard Lyon 1
• Rob Syme, CCDM, Curtin University
• Sascha Steinbiss, Sanger Institute
• Simon Ye, Broad Institute
• Tobias Sargeant, The Walter and Eliza Hall Institute

FUTURE WORK
Short term
• Built-in low-latency cluster
Medium term
• Support for Bioboxes project
• Support for Git Large File Storage
• Support for Rkt containers
Long term
• Graphical editor
• Support for CWL / WDL (?)
• Support for YARN / Spark clusters

KEEP IN TOUCH
Home page: http://nextflow.io
GitHub: https://github.com/nextflow-io/nextflow
Chat: https://gitter.im/nextflow-io/nextflow
