
Reproducible computational pipelines with Docker and Nextflow

Paolo Di Tommaso
November 09, 2015

Bio In Docker Symposium, 9 November 2015, London.

http://core.brc.iop.kcl.ac.uk/wp-content/uploads/2015/02/BioInDockerAgenda.pdf


Transcript

  1. Paolo Di Tommaso - Notredame Lab
    Center for Genomic Regulation (CRG)
    Bio in Docker Symposium - 9 Nov 2015, London
    Reproducible computational pipelines with Docker and Nextflow

  2. @PaoloDiTommaso
    Research software engineer
    Comparative Bioinformatics, Notredame Lab
    Center for Genomic Regulation (CRG)

  3. (image-only slide)

  4. WHAT THINGS MOST FRUSTRATE YOU OR
    LIMIT YOUR ABILITY TO CARRY OUT
    BIOINFORMATICS ANALYSIS?*
    • Ability to compile/run the average software suite.

    • Diversity and complexity of software deployments.

    • Poor and/or incomplete documentation.

    • Installation of different software, packages,... and getting them to
    work on different platforms.

    • Lack of truly standard file formats.

    • Time installing software.

    • Lack of computing resources (cpus, memory, storage, etc).
    * available at https://goo.gl/TF9TMj


  5. Replicating the results of a typical computational biology paper requires 280 hours!

  6. WHAT'S WRONG WITH COMPUTATIONAL WORKFLOWS?

  7. COMPLEXITY
    • Dozens of dependencies (binary tools, compilers,
    libraries, system tools, etc)

    • Academic software is experimental by nature and tends to be
    difficult to install, configure and deploy

    • Heterogeneous execution platforms and system
    architectures (laptop→supercomputer)

  8. (image-only slide)

  9. CONTAINERS ARE THE THIRD BIG WAVE IN VIRTUALISATION TECHNOLOGY

  10. VM VS CONTAINER

  11. BENEFITS
    • Smaller images (~100MB)

    • Fast instantiation time (~1sec)

    • Almost native performance

    • Easy to build, publish, share and deploy

    • Transparent build process


  12. PACKAGING A WORKFLOW

    (diagram: a Docker image on the host bundling binary tools, workflow scripts, config files, compilers, libraries and the environment)
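The packaged image on this slide can be sketched as a Dockerfile. This is a hypothetical example, not the actual CRG image; the base image, package names and paths are illustrative:

```
# Illustrative Dockerfile bundling a workflow's dependencies into one image
FROM debian:jessie

# Compilers, libraries and binary tools (package names are examples)
RUN apt-get update && apt-get install -y \
        build-essential \
        ncbi-blast+ \
    && rm -rf /var/lib/apt/lists/*

# Workflow scripts and configuration
COPY scripts/ /usr/local/bin/
COPY nextflow.config /workflow/
```

Once the image is built and pushed to a registry, every execution platform pulls the exact same bytes, which is what makes the runtime reproducible.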

  13. SCALING OUT

    (diagram: many identical containers running in parallel)

  14. CONTAINER ORCHESTRATION
    • Swarm

    • Fleet

    • Kubernetes

    • Mesos


  15. NOT THE RIGHT ANSWER FOR COMPUTATIONAL PIPELINES

  16. SERVICES ORCHESTRATION ≠ TASK SCHEDULING

  17. OUR SOLUTION: NEXTFLOW

    (diagram: Nextflow orchestrating Docker containers over the host file system and an image registry)

  18. DOCKER AT CRG

    (diagram: Nextflow, driven by a config file and a pipeline script, submits jobs from the head node to Univa Grid Engine; cluster nodes pull images from a Docker registry)

  19. PROS
    • Dead easy deployment procedure

    • Self-contained and precisely controlled runtime

    • Rapidly reproduce any former configuration

    • Consistent results over time and across different
    platforms


  20. CONS
    • Requires a modern Linux kernel (≥3.10)

    • Security concerns

    • Containers/images cleanup


  21. WHAT ABOUT PERFORMANCE?

  22. BENCHMARK*
    * Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) 

    The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273 

    https://dx.doi.org/10.7717/peerj.1273


  23. • DSL (domain-specific language) on top of the JVM

    • High-level parallelisation model

    • Configurable executors target multiple platforms


  24. RATIONALE
    • Make workflows portable across different
    computational environments

    • Simplify deployment and enable reproducibility

    • Reuse any existing piece of SW (tools, scripts, etc)


  25. DATAFLOW PROGRAMMING

    (diagram: a network of processes A, B, C and D connected by channels X and Y)
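The dataflow model can be illustrated outside Nextflow with a small Python sketch (my own example, not Nextflow code): each process is a worker that fires as soon as data arrives on its input channel, and channels are plain queues.

```python
# Minimal dataflow sketch: a "process" consumes an input channel (queue)
# and emits results on an output channel, running concurrently.
from queue import Queue
from threading import Thread

def process(fn, inputs, outputs):
    """Apply fn to every item arriving on the input channel."""
    def worker():
        for item in iter(inputs.get, None):  # None marks end-of-stream
            outputs.put(fn(item))
        outputs.put(None)                    # propagate end-of-stream
    Thread(target=worker).start()

x, y = Queue(), Queue()       # two channels
process(str.upper, x, y)      # a process reading from x, writing to y

for s in ["hello", "world"]:  # feed the input channel
    x.put(s)
x.put(None)

result = list(iter(y.get, None))
print(result)                 # -> ['HELLO', 'WORLD']
```

Processes never call each other directly; wiring them only through channels is what lets the runtime execute independent tasks in parallel, as the following slides show with `blast` and `align`.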

  26. PROCESS DEFINITION

    process foo {

        input:
        val str from 'Hello'

        output:
        file 'my_file' into result

        script:
        """
        echo $str world! > my_file
        """
    }

  27. WHAT A SCRIPT LOOKS LIKE

    sequences = Channel.fromPath("/data/sample.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
        blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file align_result

        """
        t_coffee $all_seqs 2>&- | tee align_result
        """
    }

    align_result.collectFile(name: 'final_alignment')

  28. IMPLICIT PARALLELISM

    sequences = Channel.fromPath("/data/*.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
        blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file align_result

        """
        t_coffee $all_seqs 2>&- | tee align_result
        """
    }

    align_result.collectFile(name: 'final_alignment')

  29. IMPLICIT PARALLELISM

    (diagram: each input file gets its own BLAST task followed by a T-COFFEE task; the results are merged into a single alignment)

  30. BENEFITS
    • High-level declarative parallelisation model

    • Portable across different platforms

    • Isolates task dependencies with Docker containers


  31. CONFIGURATION FILE

    process {
        container = 'your/image:latest'
        executor = 'sge'
        queue = 'cn-el6'
        memory = '10GB'
        cpus = 8
        time = '2h'
    }
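Settings like these can also be scoped to individual processes. Below is a sketch using the `withName` process selector from later Nextflow releases (the syntax current in 2015 differed); the container names are illustrative:

```
process {
    executor = 'sge'

    // per-process overrides (image names are illustrative)
    withName: blast {
        container = 'biocontainers/blast:latest'
        cpus = 4
    }
    withName: align {
        container = 'biocontainers/t-coffee:latest'
        memory = '8GB'
    }
}
```

Keeping resource requests and container versions in the config file, outside the pipeline script, is what lets the same script run unchanged on a laptop or on a cluster.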

  32. DEPLOYMENT MODES
    • Local execution

    • Grid engine / batch scheduler

    • Distributed execution (embedded cluster)

    • Managed cloud (ClusterK, DNAnexus)

    • AWS cloud platform


  33. LOCAL EXECUTION

    (diagram: nextflow spawns task processes directly on the host, sharing data through the POSIX file system)

  34. GRID ENGINE

    (diagram: nextflow runs on the login node and submits tasks to the batch scheduler; cluster nodes share work files over NFS)

  35. SUPPORTED PLATFORMS

  36. DISTRIBUTED MODE

    (diagram: a job request from the login node launches a self-contained nextflow cluster on the HPC nodes: one nextflow driver plus nextflow workers, coordinated by Apache Ignite over NFS/Lustre shared storage)

    Job wrapper:

    #!/bin/bash
    # values omitted on the slide are shown as <...> placeholders
    #$ -q <queue>
    #$ -pe ompi <slots>
    #$ -l virtual_free=<mem>
    mpirun nextflow run <pipeline> -with-mpi

  37. AWS CLOUD

    (diagram: a Nextflow driver on an EC2 node submits tasks through an elastic load balancer to nextflow workers on EC2 spot instances, clustered with Apache Ignite; data lives on S3 / EFS and images are pulled from a Docker registry on AWS ECR)

  38. MANAGED CLOUD (CLUSTERK)

    (diagram: the pipeline script is submitted to the Cirrus scheduler, which configures EC2 spot instances and executes tasks, with data on Amazon S3)

  39. DEMO


  40. WHO IS USING NEXTFLOW?
    • Campagne Lab, Weill Medical College of Cornell University

    • Center for Biotechnology, Bielefeld University

    • Genetic Cancer group, International Agency for Research on Cancer

    • Guigo Lab, Center for Genomic Regulation

    • Medical genetics diagnostic, Oslo University Hospital

    • National Marrow Donor Program

    • Parasite Genomics, Sanger Institute

    • Sabeti Lab, Broad Institute

    • Veracyte Inc


  41. FUTURE WORK
    Short term

    • Investigate support for Bioboxes

    • Support for Git LFS (large file storage)

    • Version 1.0 (first half 2016)

    Long term

    • Enhance processing capabilities with distributed caching and
    data affinity.

    • Interoperability with YARN / Spark clusters / Common Workflow Language


  42. CONCLUSION
    • Docker is a game-changer for workflow
    packaging and deployment

    • Nextflow is a streaming-oriented framework for
    computational workflows

    • Docker + Nextflow = reproducible, self-contained pipelines


  43. THANKS


  44. LINKS
    project home: http://nextflow.io
    Docker benchmark: https://peerj.com/articles/1273/
    Univa-CRG white paper: http://goo.gl/lEPSe2
    this presentation: https://speakerdeck.com/pditommaso