Introducing Nextflow

Paolo Di Tommaso
February 04, 2016

Introduction to Nextflow pipeline framework given at CNAG, Barcelona

Transcript

  1. PIPELINE FRAMEWORK
    Paolo Di Tommaso - Notredame Lab, CRG

  2. CHALLENGES
    • Optimise computation by taking advantage of distributed cluster / cloud resources
    • Simplify deployment of complex pipelines

  3. To replicate the result of a typical computational biology paper requires 280 hours!

  4. COMPLEXITY
    • Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
    • The experimental nature of academic software makes it difficult to install, configure and deploy
    • Heterogeneous execution platforms and system architectures (laptop → supercomputer)

  5. DO NOT REINVENT THE WHEEL

  6. UNIX PIPE MODEL
    cat seqs | blastp -query - | head -n 10 | t_coffee > result

  7. WHAT WE NEED
    Compose Linux commands and scripts as usual
    +
    Handle multiple inputs/outputs
    Portable across multiple platforms
    Fault tolerance

  8. NEXTFLOW
    • Fast application prototypes
    • High-level parallelisation model
    • Portable across multiple execution platforms
    • Enable pipeline reproducibility

  9. LIGHTWEIGHT

  10. Just download it:
    curl -fsSL get.nextflow.io | bash
    nextflow
    Dependencies: Unix-like OS (Linux, OSX, etc.) and Java 7/8

  11. FAST PROTOTYPING

  12. • A pipeline script is written by composing several processes
    • A process can execute any script or tool
    • It makes it possible to reuse any existing piece of code

  13. PROCESS DEFINITION
    process foo {
        input:
        val str from 'Hello'

        output:
        file 'my_file' into result

        script:
        """
        echo $str world! > my_file
        """
    }
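    A minimal usage sketch, assuming the process above is saved in a file named hello.nf (the file name is a placeholder):

    # launch the script; each task runs in its own working directory under ./work
    nextflow run hello.nf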

  14. WHAT A SCRIPT LOOKS LIKE
    sequences = Channel.fromPath("/data/sample.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
            blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file 'align.txt' into align_result

        """
        t_coffee $all_seqs 2>&- | tee align.txt
        """
    }

    align_result.collectFile(name: 'final_alignment')

  15. IMPLICIT PARALLELISM
    sequences = Channel.fromPath("/data/*.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
            blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file 'align.txt' into align_result

        """
        t_coffee $all_seqs 2>&- | tee align.txt
        """
    }

    align_result.collectFile(name: 'final_alignment')

  16. IMPLICIT PARALLELISM
    (diagram) Each input fasta file flows through its own BLAST task followed by a T-COFFEE task; the results are merged into a single alignment.

  17. DATAFLOW

  18. DATAFLOW
    • Declarative computational model for concurrent processes
    • Processes wait for data; when an input set is ready the process is executed
    • They communicate through dataflow variables, i.e. asynchronous streams of data called channels (see the sketch below)
    • Parallelisation and task dependencies are implicitly defined by the process input/output declarations
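    A minimal sketch of this model, in the same DSL syntax used above (process and channel names are placeholders):

    // an asynchronous stream (channel) of three values
    nums = Channel.from(1, 2, 3)

    process square {
        input:
        val x from nums        // the process waits for a value on the channel

        output:
        stdout into squares    // results are emitted on a downstream channel

        """
        echo \$(( $x * $x ))
        """
    }

    // each value triggers an independent task; results arrive as tasks complete
    squares.subscribe { println it.trim() }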

  19. REACTIVE NETWORK

  20. PORTABLE SCRIPTS

  21. • The executor abstraction layer allows you to run the same script on different platforms (see the sketch below)
    • Local (default)
    • Cluster (SGE, LSF, SLURM, Torque/PBS)
    • HPC (beta)
    • Cloud (beta)
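    The executor can also be selected per process with a directive; a minimal sketch, where the process and queue names are placeholders:

    process heavy_step {
        executor 'sge'    // dispatch this process through SGE instead of running it locally
        queue 'long'      // placeholder queue name

        """
        echo running on the cluster
        """
    }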

  22. (diagram) Nextflow internals: the DSL interpreter feeds the dataflow engine; a task dispatcher hands tasks to the executors, which run them as local POSIX processes or submit them with qsub and similar commands.

  23. LOCAL EXECUTOR
    (diagram) nextflow and the task POSIX processes run on the same host and share the local file system.

  24. CLUSTER EXECUTOR
    (diagram) nextflow runs on a login node and submits tasks through the batch scheduler to the cluster nodes, which share an NFS/GPFS file system.

  25. CONFIGURATION FILE
    process {
        executor = 'sge'
        queue    = 'cn-el6'
        memory   = '10GB'
        cpus     = 8
        time     = '2h'
    }
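    Nextflow reads a file named nextflow.config from the launch directory by default; a minimal usage sketch (main.nf and cluster.config are placeholder names):

    # settings in ./nextflow.config are applied automatically
    nextflow run main.nf

    # or point the launcher at an alternative configuration file
    nextflow -c cluster.config run main.nf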

  26. HPC EXECUTOR
    (diagram) A single job request, submitted from the login node, starts a Nextflow cluster on the HPC nodes: one nextflow driver plus several nextflow workers, all sharing the NFS/GPFS file system. The job wrapper looks like:
    #!/bin/bash
    #$ -q
    #$ -pe ompi
    #$ -l virtual_free=
    mpirun nextflow run -with-mpi

  27. (image-only slide, no transcript text)

  28. CONTAINERS ALLOW YOU TO ISOLATE TASK DEPENDENCIES

  29. VM VS CONTAINER (comparison diagram)

  30. BENEFITS
    • Smaller images (~100MB)
    • Fast instantiation time (<1sec)
    • Almost native performance
    • Easy to build, publish, share and deploy
    • Enable tool versioning and archiving

  31. BASIC CONTAINERISATION
    (diagram) A Docker image packages the binary tools, compilers, libraries and environment; the workflow scripts and configuration file stay on the host.
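    A minimal sketch of how this is typically wired up in Nextflow (the image name is a placeholder):

    // nextflow.config
    docker.enabled    = true
    process.container = 'my-org/pipeline-tools'   // placeholder Docker image

    // the same can be requested at launch time:
    //   nextflow run main.nf -with-docker my-org/pipeline-tools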

  32. SCALING OUT

  33. OUR SOLUTION
    (diagram) NEXTFLOW, the host file system and a Docker registry.

  34. DOCKER AT CRG
    (diagram) Nextflow, driven by a pipeline script and a config file, submits jobs from the head node through Univa Grid Engine; container images are pulled from a Docker registry.

  35. PROS
    • Dead easy deployment procedure
    • Self-contained and precisely controlled runtime
    • Rapidly reproduce any former configuration
    • Consistent results over time and across different
    platforms

  36. CONS
    • Requires a modern Linux kernel (≥3.10)
    • Security concerns
    • Containers/images cleanup

  37. SHIFTER
    • Alternative implementation developed by NERSC
    (Berkeley lab)
    • HPC friendly, does not require special permission
    • Compatible with Docker images
    • Integrated with SLURM scheduler

  38. ERROR RECOVERY

  39. • Stop on failure / fix / resume the execution
    • Automatically re-execute failing tasks, increasing the requested resources (memory, disk, etc.), as sketched below
    • Ignore task errors
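    A minimal sketch of these recovery options (process and tool names are placeholders):

    process fragile_step {
        errorStrategy 'retry'            // re-run the task on failure ('ignore' would skip it instead)
        maxRetries 3
        memory { 1.GB * task.attempt }   // request more memory on each new attempt

        output:
        file 'out.txt' into done

        """
        my_tool > out.txt    # my_tool stands for any command that may fail
        """
    }

    The stop / fix / resume pattern maps to re-launching the pipeline with the -resume option, e.g. nextflow run main.nf -resume.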

  40. DEMO

  41. WHO IS USING NEXTFLOW?
    • Campagne Lab, Weill Medical College of Cornell University
    • Center for Biotechnology, Bielefeld University
    • Genetic Cancer group, International Agency for Research on Cancer
    • Guigo Lab, Center for Genomic Regulation
    • Medical genetics diagnostics, Oslo University Hospital
    • National Marrow Donor Program
    • Joint Genome Institute
    • Parasite Genomics, Sanger Institute

  42. FUTURE WORK
    Short term
    • Built-in support for Shifter
    • Enhance scheduling capability of HPC execution mode
    • Version 1.0 (second half 2016)
    Long term
    • Web user interface
    • Enhance support for cloud (Google Compute Engine)

  43. CONCLUSION
    • Nextflow is a streaming-oriented framework for computational workflows
    • It is not meant to replace your favourite tools
    • It provides a parallel and scalable environment for your scripts
    • It enables reproducible pipeline deployment

  44. THANKS

  45. LINKS
    project home
    http://nextflow.io
    GitHub repository
    http://github.com/nextflow-io/nextflow
    this presentation
    https://speakerdeck.com/pditommaso
