Nextflow: a tool for deploying reproducible computational pipelines

Paolo Di Tommaso
July 11, 2015

Nextflow is a framework and a DSL modelled around the UNIX pipe concept that simplifies writing parallel pipelines in a portable manner.

Transcript

  1. Bioinformatics Open Source Conference, 11 July 2015, Dublin, Ireland
     A tool for deploying reproducible computational pipelines
  2. Paolo Di Tommaso
     Research software engineer
     ~ 5 years at Notredame Lab, Comparative Bioinformatics
     Center for Genomic Regulation (CRG)
  3. WHAT IS NEXTFLOW
     A framework and fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel pipelines in a portable manner
  4. MOTIVATION
     • Fast workflow prototyping
     • Reuse any existing scripts/tools
     • High-level parallelisation model
     • Portable across platforms
     • Enables reproducibility
  5. HOW IT WORKS
     • The pipeline flow is defined in a declarative manner
     • A script is composed of several processes
     • A process is defined by a set of inputs/outputs and a script snippet to be executed
  6. PROCESS DEFINITION

     process foo {

         input:
         val str from 'Hello'

         output:
         file 'my_file' into result

         script:
         """
         echo $str world! > my_file
         """
     }
  7. DATAFLOW
     • Declarative computational model for concurrent processes
     • Processes wait for data; when an input set is ready, the process is executed
     • They communicate through dataflow variables, i.e. asynchronous streams of data called channels
     • Parallelisation and task dependencies are implicitly defined by process in/out declarations
  8. WHAT A SCRIPT LOOKS LIKE

     params.query = "$baseDir/data/sample.fa"

     seq = Channel.fromPath(params.query)

     process blast {
         input:
         file 'seq.fa' from seq

         output:
         file 'out.txt' into result

         script:
         """
         blastp -db NR -query seq.fa -outfmt 6 > out.txt
         """
     }

     result.print { it.text }
  9. $ nextflow run blast-test.nf

     N E X T F L O W ~ version 0.12.0
     [3d/ec5c2e] Submitted process > blast (1)

     1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
     1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
     1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
     1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
     1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
     1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
     1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
     1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
     1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
     1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5
  10. $ nextflow run blast-test.nf --query 'data/*.fa'

      N E X T F L O W ~ version 0.12.0
      [3d/ec5c2e] Submitted process > blast (2)
      [1f/277042] Submitted process > blast (3)
      [9d/b49472] Submitted process > blast (4)
      [4a/3c2d5e] Submitted process > blast (1)
      [61/7dc8f0] Submitted process > blast (5)

      1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
      1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
      1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
      1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
      1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
      1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
      1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
      1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
      1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
      1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5
  11. COMMON PROBLEMS
      • Many dependencies (scripts, tools, DB, etc)
      • Environment configuration
      • Frequent updates
      • Track changes and versions
      • The tragedy of absolute paths ...
  12. OUR SOLUTION
      • Manage a pipeline as a *self-contained* GitHub repository
      • It includes external scripts and environment configuration
      • Binary dependencies can be deployed with Docker containers
  13. $ nextflow run nextflow-io/rnatoy -with-docker

      N E X T F L O W ~ version 0.14.3
      Pulling nextflow-io/rnatoy ...
      downloaded from https://github.com/nextflow-io/rnatoy.git

      Launching 'nextflow-io/rnatoy' - revision: 9c61bf5ac5 [master]
      R N A T O Y P I P E L I N E
      =================================
      genome : /User/../data/ggal_1_4885000_49020000.Ggal71.500bp.fa
      annotat : /User/../data/ggal_1_4885000_49020000.bed.gff
      pair1 : /User/../data/*_1.fq
      pair2 : /User/../data/*_2.fq
      [warm up] executor > local
      [02/b08c28] Submitted process > buildIndex (ggal_1_4885000_49020000.Ggal71)
      [ea/97d004] Submitted process > mapping (ggal_gut)
      [98/16c9e5] Submitted process > mapping (ggal_liver)
      [b5/38a0c7] Submitted process > makeTranscript (ggal_gut)
      [00/e5efd6] Submitted process > makeTranscript (ggal_liver)
      Saving: transcript_ggal_gut.gtf
      Saving: transcript_ggal_liver.gtf
  14. VERSIONING
      • Any Git tag, branch or commit ID can be used as a revision identifier
      • This makes it possible to execute any previous version in a consistent manner
  15. $ nextflow run nextflow-io/rnatoy -revision v1.0

      N E X T F L O W ~ version 0.14.3
      Launching 'nextflow-io/rnatoy' - revision: 0d0443d8f7 [v1.0]
      R N A T O Y P I P E L I N E
      =================================
      [35/cb611b] Submitted process > prepareTranscriptome (1)
      [cd/239926] Submitted process > buildIndex (1)
      [c6/f6488d] Submitted process > mapping (2)
      [bc/b3ea76] Submitted process > mapping (1)
      [f4/8d4628] Submitted process > makeTranscript (1)
      [eb/92db7f] Submitted process > makeTranscript (2)
      Saving: transcript_ggal_alpha.gtf
      Saving: transcript_ggal_beta.gtf
  16. WHO IS USING NEXTFLOW?
      • Andrew Stewart, Veracyte Inc
      • Emilio Palumbo, Center for Genomic Regulation
      • Georgios Pappas, University of Brasilia
      • Lukas Jelonek, Justus-Liebig-Universität Gießen
      • Matthieu Foll, International Agency for Research on Cancer
      • Michael L Heuer, National Marrow Donor Program
      • Rémi Planel, University Claude Bernard Lyon 1
      • Rob Syme, CCDM, Curtin University
      • Sascha Steinbiss, Sanger Institute
      • Simon Ye, Broad Institute
      • Tobias Sargeant, The Walter and Eliza Hall Institute
  17. FUTURE WORK
      Short term
      • Built-in low-latency cluster
      Medium term
      • Support for Bioboxes project
      • Support for Git Large File Storage
      • Support for Rkt containers
      Long term
      • Graphical editor
      • Support for CWL / WDL (?)
      • Support for YARN / Spark clusters
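
As a rough illustration of the dataflow model described on the DATAFLOW slide (and not of how Nextflow itself is implemented), the Python sketch below wires two hypothetical processes together, with queues standing in for channels. The names `channel_from`, `run_process`, `drain`, `run_pipeline` and the `END` sentinel are all assumptions made for this example; none of them come from the talk or from Nextflow.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

END = object()  # hypothetical sentinel: the channel will emit nothing more


def channel_from(items):
    """Build a channel (a plain Queue here) pre-loaded with the given items."""
    channel = Queue()
    for item in items:
        channel.put(item)
    channel.put(END)
    return channel


def run_process(task, source, sink, pool):
    """Wait on `source`, start one task per item, forward results to `sink`."""
    pending = []
    while True:
        item = source.get()  # blocks until an input is available
        if item is END:
            break
        pending.append(pool.submit(task, item))  # tasks may run in parallel
    for future in pending:
        sink.put(future.result())  # results are forwarded in input order
    sink.put(END)


def drain(channel):
    """Collect everything emitted on a channel into a list."""
    items = []
    while True:
        item = channel.get()
        if item is END:
            return items
        items.append(item)


def run_pipeline(names):
    """Two stages connected only by channels; neither calls the other."""
    seqs, hits, report = channel_from(names), Queue(), Queue()
    with ThreadPoolExecutor(max_workers=4) as pool:
        stages = [
            threading.Thread(target=run_process,
                             args=(str.upper, seqs, hits, pool)),
            threading.Thread(target=run_process,
                             args=(lambda hit: hit + ": done", hits, report, pool)),
        ]
        for stage in stages:
            stage.start()
        for stage in stages:
            stage.join()
    return drain(report)


if __name__ == "__main__":
    print(run_pipeline(["a.fa", "b.fa"]))
```

The two stages never reference each other: the second one starts work only because items appear on `hits`, which loosely mirrors how the slides describe dependencies being implied by input/output declarations. Passing more file names to `run_pipeline` creates more tasks without any change to the stage definitions.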