Nextflow: a tool for deploying reproducible computational pipelines

Paolo Di Tommaso
July 11, 2015

Nextflow is a framework and a DSL modelled around the UNIX pipe concept that simplifies writing parallel
pipelines in a portable manner.

Transcript

  1. Bioinformatics Open Source Conference

    11 July 2015 - Dublin, Ireland
    A tool for deploying reproducible
    computational pipelines

  2. Paolo Di Tommaso

    Research software engineer

    ~ 5 years at Notredame Lab

    Comparative Bioinformatics

    Center for Genomic Regulation (CRG)

  3. WHAT IS NEXTFLOW
    A framework and fluent DSL modelled
    around the UNIX pipe concept,

    that simplifies writing parallel

    pipelines in a portable manner

  4. MOTIVATION
    • Fast workflow prototyping

    • Reuse any existing scripts/tools

    • High-level parallelisation model

    • Portable across platforms

    • Enables reproducibility

  5. HOW IT WORKS
    • The pipeline flow is defined in a declarative
    manner

    • A script is composed of several processes

    • A process is defined by a set of inputs/outputs
    and a script snippet to be executed

  6. PROCESS DEFINITION
    process foo {

        input:
        val str from 'Hello'

        output:
        file 'my_file' into result

        script:
        """
        echo $str world! > my_file
        """
    }

  7. DATAFLOW
    • Declarative computational model for concurrent processes

    • Processes wait for data; when an input set is ready, the
    process is executed

    • They communicate through dataflow variables, i.e.
    asynchronous streams of data called channels

    • Parallelisation and task dependencies are implicitly defined
    by the process input/output declarations
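
    A minimal sketch of the model described above, written in the same syntax as the process on slide 6. The process names `greet` and `shout`, the channel names and the file names are illustrative assumptions, not part of the talk. Neither process is told when to run: `shout` starts a task each time a file arrives on `greetings`, so the three inputs can be handled in parallel and the ordering follows from the channel declarations alone.

    ```groovy
    // Source channel emitting three values (illustrative data)
    names = Channel.from('alpha', 'beta', 'gamma')

    // Runs once per value received from 'names'
    process greet {
        input:
        val name from names

        output:
        file 'greeting.txt' into greetings

        script:
        """
        echo Hello $name > greeting.txt
        """
    }

    // Runs once per file received from 'greetings'; the dependency on
    // 'greet' is implied by the shared channel, not declared explicitly
    process shout {
        input:
        file 'greeting.txt' from greetings

        output:
        file 'shout.txt' into shouts

        script:
        """
        tr '[a-z]' '[A-Z]' < greeting.txt > shout.txt
        """
    }

    // Print the content of each result file as it becomes available
    shouts.print { it.text }
    ```

    Because tasks run as their inputs become ready, the order of the printed lines is not guaranteed.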

  8. REACTIVE NETWORK

  9. WHAT A SCRIPT LOOKS LIKE
    params.query = "$baseDir/data/sample.fa"

    seq = Channel.fromPath(params.query)

    process blast {
        input:
        file 'seq.fa' from seq

        output:
        file 'out.txt' into result

        script:
        """
        blastp -db NR -query seq.fa -outfmt 6 > out.txt
        """
    }

    result.print { it.text }

  10. $ nextflow run blast-test.nf

    N E X T F L O W ~ version 0.12.0
    [3d/ec5c2e] Submitted process > blast (1)

    1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
    1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
    1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
    1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
    1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
    1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
    1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
    1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5

  11. $ nextflow run blast-test.nf --query 'data/*.fa'

    N E X T F L O W ~ version 0.12.0
    [3d/ec5c2e] Submitted process > blast (2)
    [1f/277042] Submitted process > blast (3)
    [9d/b49472] Submitted process > blast (4)
    [4a/3c2d5e] Submitted process > blast (1)
    [61/7dc8f0] Submitted process > blast (5)

    1ycsB 1YCS:B 100.00 60 0 0 1 60 170 229 3e-42 131
    1ycsB 1ABO:B 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1ABO:A 24.07 54 39 1 3 56 6 57 4e-05 28.5
    1ycsB 1PHT:A 30.43 23 16 0 6 28 10 32 0.013 22.3
    1vie 1VIE:A 100.00 51 0 0 1 51 12 62 1e-35 108
    1pht 1PHT:A 100.00 80 0 0 1 80 5 84 1e-56 164
    1pht 1YCS:B 30.43 23 16 0 6 28 175 197 0.015 23.5
    1pht 1IHF:B 33.33 21 14 0 53 73 60 80 0.75 18.1
    1pht 1IHT:H 32.00 25 17 0 40 64 175 199 4.0 16.5
    1pht 1IHS:H 32.00 25 17 0 40 64 175 199 4.0 16.5

  12. PLATFORM AGNOSTIC
    [Diagram labels: NFS, cluster engine, local executor, grid executor, Nextflow, *nix OS]

  13. SUPPORTED PLATFORMS

  14. HOW

    TO MAKE

    REPRODUCIBILITY

    EASIER?

  15. WHY IS IT

    SO HARD

    TO DEPLOY

    A PIPELINE?

  16. COMMON PROBLEMS
    • Many dependencies (scripts, tools, DB, etc.)

    • Environment configuration

    • Frequent updates

    • Track changes and versions

    • The tragedy of absolute paths ...

  17. OUR SOLUTION
    • Manage a pipeline as a *self-contained* GitHub
    repository

    • It includes external scripts and environment
    configuration

    • Binary dependencies can be deployed with
    Docker containers
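
    As a hedged illustration of the last two points, a repository might carry a `nextflow.config` file next to the pipeline script. The image name below is a hypothetical placeholder, and the `process.container` setting is taken from the Nextflow configuration documentation rather than from the talk.

    ```groovy
    // nextflow.config: environment configuration versioned with the pipeline.
    // Hypothetical image that bundles the binary dependencies (placeholder name).
    process.container = 'example/pipeline-tools:1.0'
    ```

    With a setting like this in place, launching the pipeline with `-with-docker`, as on slide 20, should run each process inside that container.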

  18. A GITHUB REPO

    CAN BE USED TO DEPLOY

    A SELF-CONTAINED

    PIPELINE PROJECT

  19. [No transcript text for this slide]

  20. $ nextflow run nextflow-io/rnatoy -with-docker

    N E X T F L O W ~ version 0.14.3
    Pulling nextflow-io/rnatoy ...
    downloaded from https://github.com/nextflow-io/rnatoy.git

    Launching 'nextflow-io/rnatoy' - revision: 9c61bf5ac5 [master]
    R N A T O Y P I P E L I N E
    =================================
    genome : /User/../data/ggal_1_4885000_49020000.Ggal71.500bp.fa
    annotat : /User/../data/ggal_1_4885000_49020000.bed.gff
    pair1 : /User/../data/*_1.fq
    pair2 : /User/../data/*_2.fq
    [warm up] executor > local
    [02/b08c28] Submitted process > buildIndex (ggal_1_4885000_49020000.Ggal71)
    [ea/97d004] Submitted process > mapping (ggal_gut)
    [98/16c9e5] Submitted process > mapping (ggal_liver)
    [b5/38a0c7] Submitted process > makeTranscript (ggal_gut)
    [00/e5efd6] Submitted process > makeTranscript (ggal_liver)
    Saving: transcript_ggal_gut.gtf
    Saving: transcript_ggal_liver.gtf

  21. DISTRIBUTED MODEL
    Nextflow
    Pipeline script
    Config file

  22. VERSIONING
    • Any Git tag, branch or commit ID can be used as
    a revision identifier

    • This allows any previous version to be executed in a
    consistent manner

  23. $ nextflow run nextflow-io/rnatoy -revision v1.0

    N E X T F L O W ~ version 0.14.3
    Launching 'nextflow-io/rnatoy' - revision: 0d0443d8f7 [v1.0]
    R N A T O Y P I P E L I N E
    =================================
    [35/cb611b] Submitted process > prepareTranscriptome (1)
    [cd/239926] Submitted process > buildIndex (1)
    [c6/f6488d] Submitted process > mapping (2)
    [bc/b3ea76] Submitted process > mapping (1)
    [f4/8d4628] Submitted process > makeTranscript (1)
    [eb/92db7f] Submitted process > makeTranscript (2)
    Saving: transcript_ggal_alpha.gtf
    Saving: transcript_ggal_beta.gtf

  24. WHO IS USING NEXTFLOW?
    • Andrew Stewart, Veracyte Inc

    • Emilio Palumbo, Center for Genomic Regulation

    • Georgios Pappas, University of Brasilia

    • Lukas Jelonek, Justus-Liebig-Universität Gießen

    • Matthieu Foll, International Agency for Research on Cancer

    • Michael L Heuer, National Marrow Donor Program

    • Rémi Planel, University Claude Bernard Lyon 1

    • Rob Syme, CCDM, Curtin University

    • Sascha Steinbiss, Sanger Institute

    • Simon Ye, Broad Institute

    • Tobias Sargeant, The Walter and Eliza Hall Institute

  25. FUTURE WORK
    Short term

    • Built-in low-latency cluster

    Medium term

    • Support for Bioboxes project

    • Support for Git Large File Storage

    • Support for Rkt containers

    Long term

    • Graphical editor

    • Support for CWL / WDL (?)

    • Support for YARN / Spark clusters

  26. KEEP IN TOUCH
    Home page

    http://nextflow.io

    GitHub

    https://github.com/nextflow-io/nextflow

    Chat

    https://gitter.im/nextflow-io/nextflow
