Slide 1

Slide 1 text

Bioinformatics Open Source Conference, 11 July 2015, Dublin, Ireland
A tool for deploying reproducible computational pipelines

Slide 2

Slide 2 text

Paolo Di Tommaso
Research software engineer, ~5 years at Notredame Lab
Comparative Bioinformatics, Center for Genomic Regulation (CRG)

Slide 3

Slide 3 text

WHAT IS NEXTFLOW
A framework and fluent DSL, modelled around the UNIX pipe concept, that simplifies writing parallel pipelines in a portable manner

Slide 4

Slide 4 text

MOTIVATION
• Fast workflow prototyping
• Reuse any existing scripts/tools
• High-level parallelisation model
• Portable across platforms
• Enables reproducibility

Slide 5

Slide 5 text

HOW IT WORKS
• The pipeline flow is defined in a declarative manner
• A script is composed of several processes
• A process is defined by a set of inputs/outputs and a script snippet to be executed

Slide 6

Slide 6 text

PROCESS DEFINITION

process foo {

    input:
    val str from 'Hello'

    output:
    file 'my_file' into result

    script:
    """
    echo $str world! > my_file
    """
}
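The three parts of a process (inputs, outputs, script snippet) can be mimicked outside Nextflow. The following is a minimal Python sketch, not Nextflow code: the `Process` class and its `render` method are hypothetical names introduced here only to show how a declared input is substituted into the script template before execution.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Process:
    """Hypothetical stand-in for a process: declared inputs/outputs plus a script."""
    name: str
    inputs: tuple
    outputs: tuple
    script: str

    def render(self, **values):
        # Refuse to build the command until every declared input has a value.
        missing = [n for n in self.inputs if n not in values]
        if missing:
            raise KeyError(f"missing inputs for {self.name}: {missing}")
        # $name placeholders in the script are replaced by the input values.
        return Template(self.script).substitute(values)


# Mirrors the `foo` process from the slide.
foo = Process(
    name="foo",
    inputs=("str",),
    outputs=("my_file",),
    script="echo $str world! > my_file",
)

if __name__ == "__main__":
    print(foo.render(str="Hello"))
```

The sketch only renders the command line; actually running it and collecting `my_file` is what the workflow engine does.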

Slide 7

Slide 7 text

DATAFLOW
• Declarative computational model for concurrent processes
• Processes wait for data; when an input set is ready, the process is executed
• They communicate through dataflow variables, i.e. asynchronous streams of data called channels
• Parallelisation and task dependencies are implicitly defined by the process in/out declarations
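The dataflow idea can be illustrated with plain queues and threads. This is a conceptual Python sketch, not how Nextflow is implemented: the `process` and `run_pipeline` helpers and the `DONE` sentinel are assumptions made for the example. Each stage blocks until data arrives on its input channel, and the only place dependencies are stated is the wiring of the queues. Unlike Nextflow, which can run one task per input item in parallel, each stage here handles its items one at a time.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel (an assumption of this sketch)


def process(task, inputs, outputs):
    """Apply `task` to every item arriving on `inputs`; emit results on `outputs`."""
    while True:
        item = inputs.get()        # block until upstream data is ready
        if item is DONE:
            outputs.put(DONE)      # propagate end-of-stream downstream
            return
        outputs.put(task(item))


def run_pipeline(items):
    seqs, upper, result = queue.Queue(), queue.Queue(), queue.Queue()
    # Stage order is implied by which queue each stage reads and writes.
    stages = [
        threading.Thread(target=process, args=(str.upper, seqs, upper)),
        threading.Thread(target=process, args=(lambda s: s + " world!", upper, result)),
    ]
    for t in stages:
        t.start()
    for item in items:
        seqs.put(item)
    seqs.put(DONE)

    collected = []
    while True:
        out = result.get()
        if out is DONE:
            break
        collected.append(out)
    for t in stages:
        t.join()
    return collected


if __name__ == "__main__":
    print(run_pipeline(["hello", "ciao"]))
```

The second stage starts working as soon as the first one emits its first item, without any explicit scheduling code.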

Slide 8

Slide 8 text

REACTIVE NETWORK

Slide 9

Slide 9 text

WHAT A SCRIPT LOOKS LIKE

params.query = "$baseDir/data/sample.fa"

seq = Channel.fromPath(params.query)

process blast {
    input:
    file 'seq.fa' from seq

    output:
    file 'out.txt' into result

    script:
    """
    blastp -db NR -query seq.fa -outfmt 6 > out.txt
    """
}

result.print { it.text }

Slide 10

Slide 10 text

$ nextflow run blast-test.nf

N E X T F L O W ~ version 0.12.0
[3d/ec5c2e] Submitted process > blast (1)

1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5

Slide 11

Slide 11 text

$ nextflow run blast-test.nf --query 'data/*.fa'

N E X T F L O W ~ version 0.12.0
[3d/ec5c2e] Submitted process > blast (2)
[1f/277042] Submitted process > blast (3)
[9d/b49472] Submitted process > blast (4)
[4a/3c2d5e] Submitted process > blast (1)
[61/7dc8f0] Submitted process > blast (5)

1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5
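Passing a glob instead of a single file turns one task into one task per matching file, submitted in no particular order. A rough Python analogy, under stated assumptions: `expand`, `fake_blast` and `run_all` are hypothetical helpers, the file names are made up, and `fake_blast` is a stand-in that does not call BLAST.

```python
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase


def expand(pattern, available):
    """Return the sorted names in `available` that match a shell-style pattern."""
    return sorted(name for name in available if fnmatchcase(name, pattern))


def fake_blast(path):
    # Stand-in for the real command; it only reports which file it received.
    return f"hits for {path}"


def run_all(pattern, available, workers=4):
    inputs = expand(pattern, available)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per matching file; map() returns results in input order
        # even when the tasks themselves finish out of order.
        return list(pool.map(fake_blast, inputs))


if __name__ == "__main__":
    files = ["data/a.fa", "data/b.fa", "data/readme.txt"]
    for line in run_all("data/*.fa", files):
        print(line)
```

The task body is identical in the single-file and multi-file cases; only the set of inputs changes, which is the point the two runs on the slides make.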

Slide 12

Slide 12 text

PLATFORM AGNOSTIC
[Diagram labels: local executor, Nextflow, *nix OS; grid executor, Nextflow, cluster engine, NFS]

Slide 13

Slide 13 text

SUPPORTED PLATFORMS

Slide 14

Slide 14 text

HOW TO MAKE REPRODUCIBILITY EASIER?

Slide 15

Slide 15 text

WHY IS IT SO HARD TO DEPLOY A PIPELINE?

Slide 16

Slide 16 text

COMMON PROBLEMS
• Many dependencies (scripts, tools, DB, etc)
• Environment configuration
• Frequent updates
• Track changes and versions
• The tragedy of absolute paths ...

Slide 17

Slide 17 text

OUR SOLUTION
• Manage a pipeline as a *self-contained* GitHub repository
• It includes external scripts and environment configuration
• Binary dependencies can be deployed with Docker containers

Slide 18

Slide 18 text

A GITHUB REPO CAN BE USED TO DEPLOY A SELF-CONTAINED PIPELINE PROJECT

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

$ nextflow run nextflow-io/rnatoy -with-docker

N E X T F L O W ~ version 0.14.3
Pulling nextflow-io/rnatoy ...
downloaded from https://github.com/nextflow-io/rnatoy.git

Launching 'nextflow-io/rnatoy' - revision: 9c61bf5ac5 [master]

R N A T O Y P I P E L I N E
=================================
genome : /User/../data/ggal_1_4885000_49020000.Ggal71.500bp.fa
annotat : /User/../data/ggal_1_4885000_49020000.bed.gff
pair1 : /User/../data/*_1.fq
pair2 : /User/../data/*_2.fq

[warm up] executor > local
[02/b08c28] Submitted process > buildIndex (ggal_1_4885000_49020000.Ggal71)
[ea/97d004] Submitted process > mapping (ggal_gut)
[98/16c9e5] Submitted process > mapping (ggal_liver)
[b5/38a0c7] Submitted process > makeTranscript (ggal_gut)
[00/e5efd6] Submitted process > makeTranscript (ggal_liver)
Saving: transcript_ggal_gut.gtf
Saving: transcript_ggal_liver.gtf
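The log shows the short project name `nextflow-io/rnatoy` being fetched from `https://github.com/nextflow-io/rnatoy.git`. The Python sketch below only reproduces that visible mapping; `clone_url` is a hypothetical helper, and how Nextflow resolves names internally is not shown in the slides.

```python
def clone_url(project, host="https://github.com"):
    """Map an 'owner/repo' short name to a Git clone URL (illustrative only)."""
    owner, sep, repo = project.partition("/")
    # Accept exactly one slash with a non-empty part on each side.
    if not sep or not owner or not repo or "/" in repo:
        raise ValueError(f"expected 'owner/repo', got {project!r}")
    return f"{host}/{owner}/{repo}.git"


if __name__ == "__main__":
    print(clone_url("nextflow-io/rnatoy"))
```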

Slide 21

Slide 21 text

DISTRIBUTED MODEL
[Diagram labels: Nextflow, pipeline script, config file]

Slide 22

Slide 22 text

VERSIONING
• Any Git tag, branch or commit ID can be used as a revision identifier
• This allows any previous version to be executed in a consistent manner
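What "any tag, branch or commit ID" means in practice can be sketched as a lookup that maps a revision identifier to one commit. This Python example is an illustration, not Nextflow's or Git's actual resolution logic: `resolve_revision` and the lookup order (tag, then branch, then unique commit prefix) are assumptions. The two commit IDs are the ones printed in the run logs on the slides.

```python
# Revision data taken from the run logs on the slides.
TAGS = {"v1.0": "0d0443d8f7"}
BRANCHES = {"master": "9c61bf5ac5"}
COMMITS = ["0d0443d8f7", "9c61bf5ac5"]


def resolve_revision(revision, tags, branches, commits):
    """Return the commit ID a revision identifier points to.

    Assumed lookup order: tag name, branch name, then unique commit-ID prefix.
    """
    if revision in tags:
        return tags[revision]
    if revision in branches:
        return branches[revision]
    matches = [c for c in commits if c.startswith(revision)]
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"unknown or ambiguous revision: {revision!r}")


if __name__ == "__main__":
    for rev in ("v1.0", "master", "9c61"):
        print(rev, "->", resolve_revision(rev, TAGS, BRANCHES, COMMITS))
```

Because every identifier ends up as a fixed commit, rerunning with the same tag or commit ID executes the same pipeline code; a branch name can move as new commits are added.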

Slide 23

Slide 23 text

$ nextflow run nextflow-io/rnatoy -revision v1.0

N E X T F L O W ~ version 0.14.3
Launching 'nextflow-io/rnatoy' - revision: 0d0443d8f7 [v1.0]

R N A T O Y P I P E L I N E
=================================
[35/cb611b] Submitted process > prepareTranscriptome (1)
[cd/239926] Submitted process > buildIndex (1)
[c6/f6488d] Submitted process > mapping (2)
[bc/b3ea76] Submitted process > mapping (1)
[f4/8d4628] Submitted process > makeTranscript (1)
[eb/92db7f] Submitted process > makeTranscript (2)
Saving: transcript_ggal_alpha.gtf
Saving: transcript_ggal_beta.gtf

Slide 24

Slide 24 text

WHO IS USING NEXTFLOW?
• Andrew Stewart, Veracyte Inc
• Emilio Palumbo, Center for Genomic Regulation
• Georgios Pappas, University of Brasilia
• Lukas Jelonek, Justus-Liebig-Universität Gießen
• Matthieu Foll, International Agency for Research on Cancer
• Michael L Heuer, National Marrow Donor Program
• Rémi Planel, University Claude Bernard Lyon 1
• Rob Syme, CCDM, Curtin University
• Sascha Steinbiss, Sanger Institute
• Simon Ye, Broad Institute
• Tobias Sargeant, The Walter and Eliza Hall Institute

Slide 25

Slide 25 text

FUTURE WORK
Short term
• Built-in low-latency cluster
Medium term
• Support for Bioboxes project
• Support for Git Large File Storage
• Support for Rkt containers
Long term
• Graphical editor
• Support for CWL / WDL (?)
• Support for YARN / Spark clusters

Slide 26

Slide 26 text

KEEP IN TOUCH
Home page: http://nextflow.io
GitHub: https://github.com/nextflow-io/nextflow
Chat: https://gitter.im/nextflow-io/nextflow