Slide 1

PIPELINE FRAMEWORK
Paolo Di Tommaso - Notredame Lab, CRG

Slide 2

CHALLENGES
• Optimise computation by taking advantage of distributed clusters and clouds
• Simplify the deployment of complex pipelines

Slide 3

To replicate the results of a typical computational biology paper requires 280 hours!

Slide 4

COMPLEXITY
• Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
• The experimental nature of academic software makes it difficult to install, configure and deploy
• Heterogeneous execution platforms and system architectures (laptop → supercomputer)

Slide 5

DO NOT REINVENT THE WHEEL

Slide 6

UNIX PIPE MODEL

cat seqs | blastp -query - | head -n 10 | t_coffee > result
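blastp and t_coffee must be installed to run the command above, but the composition pattern itself can be tried with nothing more than standard Unix tools (here sort and tr stand in for the bioinformatics steps):

```shell
# Three fake sequence IDs as input
printf 'seq3\nseq1\nseq2\n' > seqs.txt

# Each stage consumes the previous stage's stdout, exactly as in the slide
cat seqs.txt | sort | head -n 2 | tr 'a-z' 'A-Z' > result.txt

cat result.txt
# SEQ1
# SEQ2
```

Swapping any stage for a real tool (blastp, t_coffee) changes the computation but not the composition model.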

Slide 7

WHAT WE NEED
Compose Linux commands and scripts as usual, plus:
• Handling of multiple inputs/outputs
• Portability across multiple platforms
• Fault tolerance

Slide 8

NEXTFLOW
• Fast application prototyping
• High-level parallelisation model
• Portable across multiple execution platforms
• Enables pipeline reproducibility

Slide 9


LIGHTWEIGHT

Slide 10

Just download it:

    curl -fsSL get.nextflow.io | bash
    nextflow

Dependencies: a Unix-like OS (Linux, OSX, etc.) and Java 7/8

Slide 11


FAST PROTOTYPING

Slide 12

• A pipeline script is written by composing several processes
• A process can execute any script or tool
• This allows you to reuse any existing piece of code

Slide 13

PROCESS DEFINITION

process foo {
    input:
    val str from 'Hello'

    output:
    file 'my_file' into result

    script:
    """
    echo $str world! > my_file
    """
}

Slide 14

WHAT A SCRIPT LOOKS LIKE

sequences = Channel.fromPath("/data/sample.fasta")

process blast {
    input:
    file 'in.fasta' from sequences

    output:
    file 'out.txt' into blast_result

    """
    blastp -query in.fasta -outfmt 6 | cut -f 2 | \
    blastdbcmd -entry_batch - > out.txt
    """
}

process align {
    input:
    file all_seqs from blast_result

    output:
    file 'align.txt' into align_result

    """
    t_coffee $all_seqs 2>&- | tee align.txt
    """
}

align_result.collectFile(name: 'final_alignment')

Slide 15

IMPLICIT PARALLELISM

sequences = Channel.fromPath("/data/*.fasta")

process blast {
    input:
    file 'in.fasta' from sequences

    output:
    file 'out.txt' into blast_result

    """
    blastp -query in.fasta -outfmt 6 | cut -f 2 | \
    blastdbcmd -entry_batch - > out.txt
    """
}

process align {
    input:
    file all_seqs from blast_result

    output:
    file 'align.txt' into align_result

    """
    t_coffee $all_seqs 2>&- | tee align.txt
    """
}

align_result.collectFile(name: 'final_alignment')

Slide 16

IMPLICIT PARALLELISM
(diagram: each sample.fasta flows through its own BLAST → T-COFFEE tasks in parallel, and the results merge into a single alignment)

Slide 17


DATAFLOW

Slide 18

DATAFLOW
• Declarative computational model for concurrent processes
• Processes wait for data; when an input set is ready, the process is executed
• Processes communicate through dataflow variables, i.e. asynchronous streams of data called channels
• Parallelisation and task dependencies are implicitly defined by the process input/output declarations
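As a toy illustration of the idea in plain shell (not Nextflow itself): with a POSIX named pipe playing the role of a channel, the downstream "process" blocks until its input is ready, then fires.

```shell
rm -f chan out.txt
mkfifo chan                          # the 'channel'

# downstream process: waits on the channel, runs when data arrives
( tr 'a-z' 'A-Z' < chan > out.txt ) &

# upstream process: emits a value into the channel
echo 'hello world' > chan

wait                                 # let the downstream task finish
cat out.txt
# HELLO WORLD
```

Nextflow generalises this: many processes, many channels, and the runtime fires each process as soon as a complete input set is available.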

Slide 19


REACTIVE NETWORK

Slide 20


PORTABLE SCRIPTS

Slide 21

• The executor abstraction layer allows you to run the same script on different platforms
• Local (default)
• Cluster (SGE, LSF, SLURM, Torque/PBS)
• HPC (beta)
• Cloud (beta)

Slide 22

(diagram: the nextflow runtime stack: DSL interpreter → dataflow engine → task dispatcher → executors, which run tasks as local POSIX processes or submit them, e.g. via qsub)

Slide 23

LOCAL EXECUTOR
(diagram: nextflow on a single host spawns POSIX processes that share the local file system)

Slide 24

CLUSTER EXECUTOR
(diagram: nextflow runs on the login node and submits tasks through the batch scheduler to cluster nodes, which share an NFS/GPFS file system)

Slide 25

CONFIGURATION FILE

process {
    executor = 'sge'
    queue = 'cn-el6'
    memory = '10GB'
    cpus = 8
    time = '2h'
}

Slide 26

HPC EXECUTOR
(diagram: a single job request from the login node launches a nextflow cluster across the HPC cluster nodes: one nextflow driver plus several nextflow workers, sharing NFS/GPFS storage)

Job wrapper:
#!/bin/bash
#$ -q
#$ -pe ompi
#$ -l virtual_free=
mpirun nextflow run -with-mpi

Slide 27


No content

Slide 28

CONTAINERS ALLOW YOU TO ISOLATE TASK DEPENDENCIES

Slide 29


VM VS CONTAINER

Slide 30

BENEFITS
• Smaller images (~100 MB)
• Fast instantiation time (<1 sec)
• Almost native performance
• Easy to build, publish, share and deploy
• Enables tool versioning and archiving

Slide 31

BASIC CONTAINERISATION
(diagram: a Docker image bundling binary tools, compilers, libraries and the environment runs on the host, alongside the workflow scripts and config file)
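As a sketch of such an image (the base image and package names below are assumptions of the era, not the slide's actual recipe; ncbi-blast+ and t-coffee are the Debian package names for the tools used earlier):

```dockerfile
# Hypothetical image bundling the pipeline's tool dependencies
FROM debian:jessie
RUN apt-get update \
 && apt-get install -y --no-install-recommends ncbi-blast+ t-coffee \
 && rm -rf /var/lib/apt/lists/*
```

Every task then runs against this frozen set of tools, regardless of what is installed on the host.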

Slide 32

SCALING OUT

Slide 33

OUR SOLUTION
(diagram: NEXTFLOW running containers on the host, with access to the host file system and a container registry)

Slide 34

DOCKER AT CRG
(diagram: Nextflow, driven by a config file and pipeline script, submits jobs from the head node through Univa Grid Engine; container images are pulled from a Docker registry)

Slide 35

PROS
• Dead-easy deployment procedure
• Self-contained and precisely controlled runtime
• Rapidly reproduce any former configuration
• Consistent results over time and across different platforms

Slide 36

CONS
• Requires a modern Linux kernel (≥3.10)
• Security concerns
• Container/image cleanup

Slide 37

SHIFTER
• Alternative implementation developed by NERSC (Berkeley Lab)
• HPC friendly; does not require special permissions
• Compatible with Docker images
• Integrated with the SLURM scheduler

Slide 38


ERROR RECOVERY

Slide 39

• Stop on failure / fix / resume execution
• Automatically re-execute failing tasks with increased resource requests (memory, disk, etc.)
• Ignore task errors
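In Nextflow these policies map to per-process settings; a sketch using directives from the Nextflow documentation (errorStrategy, maxRetries, and dynamic resources via task.attempt; exact syntax may vary by version):

```groovy
process {
    // resubmit a failed task instead of stopping the whole run
    errorStrategy = 'retry'      // or 'ignore' to skip over failures
    maxRetries = 3

    // ask for more memory on each new attempt
    memory = { 2.GB * task.attempt }
}
```

A stopped run can then be fixed and continued from the last successful task with `nextflow run <script> -resume`.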

Slide 40


DEMO

Slide 41

WHO IS USING NEXTFLOW?
• Campagne Lab, Weill Medical College of Cornell University
• Center for Biotechnology, Bielefeld University
• Genetic Cancer group, International Agency for Research on Cancer
• Guigo Lab, Center for Genomic Regulation
• Medical genetics diagnostics, Oslo University Hospital
• National Marrow Donor Program
• Joint Genome Institute
• Parasite Genomics, Sanger Institute

Slide 42

FUTURE WORK
Short term
• Built-in support for Shifter
• Enhance the scheduling capability of the HPC execution mode
• Version 1.0 (second half of 2016)
Long term
• Web user interface
• Enhanced support for cloud (Google Compute Engine)

Slide 43

CONCLUSION
• Nextflow is a streaming-oriented framework for computational workflows
• It is not meant to replace your favourite tools
• It provides a parallel and scalable environment for your scripts
• It enables reproducible pipeline deployment

Slide 44


THANKS

Slide 45

LINKS
project home: http://nextflow.io
GitHub repository: http://github.com/nextflow-io/nextflow
this presentation: https://speakerdeck.com/pditommaso