Introducing Nextflow

Paolo Di Tommaso
February 04, 2016

Introduction to the Nextflow pipeline framework, given at CNAG, Barcelona

Transcript

  1. PIPELINE FRAMEWORK Paolo Di Tommaso - Notredame Lab, CRG

  2. CHALLENGES
    • Optimise computation by taking advantage of distributed clusters and the cloud
    • Simplify deployment of complex pipelines
  3. Replicating the results of a typical computational biology paper requires 280 hours!
  4. COMPLEXITY
    • Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
    • Academic software is experimental in nature and tends to be difficult to install, configure and deploy
    • Heterogeneous execution platforms and system architectures (laptop→supercomputer)
  5. DO NOT REINVENT THE WHEEL

  6. UNIX PIPE MODEL
    cat seqs | blastp -query - | head -n 10 | t_coffee > result
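
The pipe model can be tried without the bioinformatics tools named on the slide. The following is a minimal sketch that assumes only standard POSIX utilities; the `top_ids` function and the sample records are hypothetical stand-ins for the real tools and their output.

```shell
# Sketch only: each stage reads stdin and writes stdout, so stages
# compose with '|' and run concurrently, as in the slide's pipeline.
top_ids() {
    # $1 = number of records to keep
    sort -t "$(printf '\t')" -k2,2nr |   # rank records by score (column 2), highest first
        head -n "$1" |                   # keep the best N records
        cut -f 1                         # keep only the identifier column
}

# Made-up tab-separated records: <id> <score>
printf 'seqA\t12\nseqB\t87\nseqC\t45\n' | top_ids 2
```

Run as written, this prints `seqB` and then `seqC`. The point is the shape of the composition, not the specific tools.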
  7. WHAT WE NEED
    Compose Linux commands and scripts as usual +
    • Handle multiple inputs/outputs
    • Portable across multiple platforms
    • Fault tolerance
  8. NEXTFLOW
    • Fast application prototypes
    • High-level parallelisation model
    • Portable across multiple execution platforms
    • Enable pipeline reproducibility
  9. LIGHTWEIGHT

  10. Just download it:
    curl -fsSL get.nextflow.io | bash
    nextflow
    Dependencies: Unix-like OS (Linux, OSX, etc.) and Java 7/8
  11. FAST PROTOTYPING

  12. • A pipeline script is written by composing several processes
    • A process can execute any script or tool
    • This lets you reuse any existing piece of code
  13. PROCESS DEFINITION
    process foo {
        input:
        val str from 'Hello'

        output:
        file 'my_file' into result

        script:
        """
        echo $str world! > my_file
        """
    }
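
Conceptually, running this process amounts to substituting the input value into the script and executing it in a task-specific working directory. The following is a rough shell sketch of that idea; `run_foo` and the temporary directory are illustrative assumptions, not what Nextflow literally generates.

```shell
# Sketch only: mimics one execution of process 'foo' by hand.
run_foo() {
    # $1 = the input value bound to 'str'
    workdir=$(mktemp -d)                 # each task gets its own scratch directory
    (
        cd "$workdir" || exit 1
        str=$1
        echo "$str world!" > my_file     # the process script, with $str substituted
    )
    cat "$workdir/my_file"               # 'my_file' is the declared output
    rm -rf "$workdir"
}

run_foo Hello   # prints: Hello world!
```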
  14. WHAT A SCRIPT LOOKS LIKE
    sequences = Channel.fromPath("/data/sample.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
        blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file 'align.txt' into align_result

        """
        t_coffee $all_seqs 2>&- | tee align.txt
        """
    }

    align_result.collectFile(name: 'final_alignment')
  15. IMPLICIT PARALLELISM
    sequences = Channel.fromPath("/data/*.fasta")

    process blast {
        input:
        file 'in.fasta' from sequences

        output:
        file 'out.txt' into blast_result

        """
        blastp -query in.fasta -outfmt 6 | cut -f 2 | \
        blastdbcmd -entry_batch - > out.txt
        """
    }

    process align {
        input:
        file all_seqs from blast_result

        output:
        file 'align.txt' into align_result

        """
        t_coffee $all_seqs 2>&- | tee align.txt
        """
    }

    align_result.collectFile(name: 'final_alignment')
  16. IMPLICIT PARALLELISM
    [Diagram labels: sample.fasta ×3, BLAST ×3, T-COFFEE ×3, alignment]
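
The effect of switching the input to a glob can be imitated in plain shell: one independent task per matching file, followed by a collection step. This is a minimal sketch under stated assumptions; `count_records` is a hypothetical stand-in for a real tool such as blastp, and the input files are made up. Nextflow handles this scheduling itself, so none of it has to be written by hand.

```shell
# Sketch only: imitates "one task per input file" with background jobs.
count_records() {
    # Stand-in for a real tool: count FASTA header lines in file $1.
    grep -c '^>' "$1"
}

run_all() {
    # $1 = directory holding *.fasta inputs; results are written next to each input.
    for f in "$1"/*.fasta; do
        count_records "$f" > "$f.out" &   # each input becomes its own task
    done
    wait                                  # block until every task has finished
    cat "$1"/*.fasta.out                  # gather the results, in the spirit of collectFile
}

# Made-up inputs in a throwaway directory.
work=$(mktemp -d)
printf '>a\nAC\n>b\nGT\n' > "$work/s1.fasta"
printf '>c\nTT\n' > "$work/s2.fasta"
run_all "$work"
rm -rf "$work"
```

Adding a third file to the directory adds a third task without any change to `run_all`, which is the property the slide calls implicit parallelism.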
  17. DATAFLOW

  18. DATAFLOW
    • Declarative computational model for concurrent processes
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate through dataflow variables, i.e. asynchronous streams of data called channels
    • Parallelisation and task dependencies are implicitly defined by the process input/output declarations
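
The "wait for data, then run" behaviour can be felt with a named pipe standing in for a channel. This is a minimal shell sketch, assuming a POSIX system with `mkfifo`; `channel_demo` is a hypothetical name, and a FIFO is only a loose analogy for a Nextflow channel.

```shell
# Sketch only: a named pipe (FIFO) plays the role of a channel.
channel_demo() {
    dir=$(mktemp -d)
    mkfifo "$dir/chan"
    # Consumer: started first, it blocks until data arrives and then
    # handles the items one by one.
    while read -r item; do
        echo "processed $item"
    done < "$dir/chan" &
    # Producer: emits one item per argument; closing the pipe ends the consumer loop.
    printf '%s\n' "$@" > "$dir/chan"
    wait
    rm -rf "$dir"
}

channel_demo s1 s2 s3
```

The consumer never polls or sleeps: its execution is driven entirely by data becoming available, which is the essence of the dataflow model described above.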
  19. REACTIVE NETWORK

  20. PORTABLE SCRIPTS

  21. • The executor abstraction layer allows you to run the same script on different platforms
    • Local (default)
    • Cluster (SGE, LSF, SLURM, Torque/PBS)
    • HPC (beta)
    • Cloud (beta)
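
The idea behind the executor layer is that the task stays the same and only the way it is launched changes. This is a minimal sketch, assuming a hypothetical `dispatch` helper; the `sge` branch only prints a plausible `qsub` command line (a dry run) so the example works without a cluster. Nextflow's real executors do considerably more, such as generating job scripts and tracking job status.

```shell
# Sketch only: the same task is launched differently depending on the executor.
dispatch() {
    # $1 = executor name, remaining arguments = the task command
    executor=$1; shift
    case "$executor" in
        local) "$@" ;;                        # run as a plain POSIX process
        sge)   echo "qsub -cwd -b y $*" ;;    # dry run: print the submit command
        *)     echo "unknown executor: $executor" >&2; return 1 ;;
    esac
}

dispatch local echo hello
dispatch sge echo hello
```

The first call prints `hello`; the second prints `qsub -cwd -b y echo hello`. The task command itself is untouched in both cases.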
  22. [Architecture diagram labels: nextflow, DSL interpreter, Dataflow, Task dispatcher, Executors, tasks, POSIX processes, qsub ..]
  23. LOCAL EXECUTOR
    [Diagram labels: host, nextflow, POSIX procs, procs ×2, file system]

  24. CLUSTER EXECUTOR
    [Diagram labels: login node, nextflow, submit tasks, batch scheduler, cluster node ×5, NFS/GPFS]
  25. CONFIGURATION FILE
    process {
        executor = 'sge'
        queue = 'cn-el6'
        memory = '10GB'
        cpus = 8
        time = '2h'
    }
  26. HPC EXECUTOR
    [Diagram labels: Login node, Job request, NFS/GPFS, HPC cluster, cluster node ×2, nextflow cluster, nextflow driver, nextflow worker ×3]
    Job wrapper:
        #!/bin/bash
        #$ -q <queue>
        #$ -pe ompi <nodes>
        #$ -l virtual_free=<mem>
        mpirun nextflow run <your-pipeline> -with-mpi
  27. None
  28. CONTAINERS ALLOW YOU TO ISOLATE TASK DEPENDENCIES

  29. VM VS CONTAINER

  30. BENEFITS
    • Smaller images (~100MB)
    • Fast instantiation time (<1sec)
    • Almost native performance
    • Easy to build, publish, share and deploy
    • Enable tool versioning and archiving
  31. BASIC CONTAINERISATION
    [Diagram labels: Host, Docker image, Binary tools, Workflow scripts, Config file, Compilers, Libraries, Environment]
  32. SCALING OUT

  33. OUR SOLUTION
    [Diagram labels: NEXTFLOW, Host file system, Registry]

  34. DOCKER AT CRG
    [Diagram labels: Nextflow, Config file, Pipeline script, docker registry, head node, Univa grid engine]
  35. PROS
    • Dead easy deployment procedure
    • Self-contained and precisely controlled runtime
    • Rapidly reproduce any former configuration
    • Consistent results over time and across different platforms
  36. CONS
    • Requires a modern Linux kernel (≥3.10)
    • Security concerns
    • Containers/images cleanup
  37. SHIFTER
    • Alternative implementation developed by NERSC (Berkeley Lab)
    • HPC friendly, does not require special permissions
    • Compatible with Docker images
    • Integrated with the SLURM scheduler
  38. ERROR RECOVERY

  39. • Stop on failure / fix / resume execution
    • Automatically re-execute failing tasks, increasing the requested resources (memory, disk, etc.)
    • Ignore task errors
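
The "re-execute with more resources" strategy can be sketched as a retry loop whose resource request grows with the attempt number. This is a minimal shell sketch; `run_with_retry`, `needs_6gb` and the `MEM_GB` variable are hypothetical names chosen for illustration and are not part of Nextflow, which exposes this behaviour through its own process settings.

```shell
# Sketch only: re-run a failing task, raising the requested memory on each attempt.
run_with_retry() {
    # $1 = max attempts, $2 = base memory in GB, remaining arguments = task command
    max=$1; base=$2; shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        mem=$((base * attempt))              # scale the request with the attempt number
        if MEM_GB=$mem "$@"; then
            echo "succeeded on attempt $attempt with ${mem}GB"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $max attempts" >&2
    return 1
}

# Hypothetical task that only succeeds once it is given at least 6 GB.
needs_6gb() { [ "$MEM_GB" -ge 6 ]; }

run_with_retry 5 2 needs_6gb   # prints: succeeded on attempt 3 with 6GB
```

Capping the number of attempts matters: a task that fails for a reason unrelated to resources should eventually stop the run (or be ignored) rather than be retried forever.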
  40. DEMO

  41. WHO IS USING NEXTFLOW?
    • Campagne Lab, Weill Medical College of Cornell University
    • Center for Biotechnology, Bielefeld University
    • Genetic Cancer group, International Agency for Research on Cancer
    • Guigo Lab, Center for Genomic Regulation
    • Medical genetics diagnostic, Oslo University Hospital
    • National Marrow Donor Program
    • Joint Genome Institute
    • Parasite Genomics, Sanger Institute
  42. FUTURE WORK
    Short term
    • Built-in support for Shifter
    • Enhance scheduling capability of HPC execution mode
    • Version 1.0 (second half 2016)
    Long term
    • Web user interface
    • Enhance support for cloud (Google Compute Engine)
  43. CONCLUSION
    • Nextflow is a streaming-oriented framework for computational workflows
    • It is not supposed to replace your favourite tools
    • It provides a parallel and scalable environment for your scripts
    • It enables reproducible pipeline deployment
  44. THANKS

  45. LINKS
    project home: http://nextflow.io
    GitHub repository: http://github.com/nextflow-io/nextflow
    this presentation: https://speakerdeck.com/pditommaso