Slide 1

Dataflow-oriented bioinformatics pipelines
Paolo Di Tommaso, Notredame’s lab
CRG, 20th June ’13

Slide 2

No content

Slide 3

The computer scientist approach
1. Set up a database server
2. Parse the proteins
3. Populate the DB
4. Write a client app to access the DB
5. select count(*) from table.proteins

Slide 4

The bioinformaticians’ way

    cat ~/proteins.fa | grep '>' | wc -l

Slide 5

The good
• Allows quick and easy data extraction/manipulation
• No dependencies on external servers
• Enables fast prototyping; multiple alternatives can be tried easily
• Linux is the integration layer of the bioinformatics domain

Slide 6

The bad
• Non-standard tools need to be compiled on the target platform
• As scripts get bigger they become fragile, unreadable and hard to modify
• In large scripts, task inputs/outputs tend not to be clearly defined

Slide 7

The ugly
• It’s very hard to make it scale properly with big data
• Very different parallelization strategies and implementations:
  • Shared memory (processes, threads)
  • Message passing (MPI, actors)
  • Distributed computation
  • Hardware parallelization (GPU)
• In general these provide too low-level an abstraction and require specific API skills
• They introduce hard dependencies on a specific platform/framework

Slide 8

What we need
Compose Linux commands and scripts as usual
+
a high-level parallelization model (ideally platform agnostic)

Slide 9

Dataflow
• A declarative computational model for concurrent task execution¹
• Originates in research on reactive systems for monitoring and controlling industrial processes
• All tasks are parallel and form a process network
• Tasks communicate through channels (non-blocking unidirectional FIFO queues)
• A task is executed when all its inputs are bound
• Synchronization is implicitly defined by the tasks’ input/output declarations

1. G. Kahn, “The Semantics of a Simple Language for Parallel Programming,” Proc. of the IFIP Congress 74, North-Holland Publishing Co., 1974

Slide 10

Basic example

    a = new Dataflow()
    b = new Dataflow()
    c = new Dataflow()

    task { a << b + ' ' + c }
    task { b << 'Hello' }
    task { c << 'World!' }

    print a
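The snippet above is pseudocode, but it maps almost directly onto the GPars dataflow library linked on the last slide. A minimal runnable sketch, assuming GPars is on the classpath (reads of dataflow variables go through .val):

    import groovyx.gpars.dataflow.DataflowVariable
    import static groovyx.gpars.dataflow.Dataflow.task

    final a = new DataflowVariable()   // write-once dataflow variables
    final b = new DataflowVariable()
    final c = new DataflowVariable()

    task { a << b.val + ' ' + c.val }  // blocks until b and c are bound
    task { b << 'Hello' }
    task { c << 'World!' }

    println a.val                      // prints "Hello World!"

Note that the declaration order of the three tasks does not matter: the first one simply waits on its inputs, which is exactly the implicit synchronization described on the previous slide.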

Slide 11

Nextflow
• Each task declares its inputs/outputs and a Linux executable script
• A task is executed as soon as its declared inputs are available
• Execution is inherently parallel
• Each task is executed in its own private directory
• Produced outputs trigger the execution of downstream tasks

Slide 12

Nextflow task

    task ('optional name') {
        input file_in
        output file_out
        output '*.fa': channel

        """
        your BASH script
            :
        """
    }
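As a concrete illustration of the skeleton above (not on the slide; the names extract-headers, sequences and headers are made up), a task that extracts the FASTA headers from its input file, in the same syntax the deck uses:

    task ('extract-headers') {
        input sequences
        output headers

        """
        grep '>' $sequences > headers
        """
    }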

Slide 13

Parallel BLAST example

    query = file(args[0])
    DB = "$HOME/blast-db/pdb/pdb"
    seq = channel()
    query.chunkFasta { seq << it }

    task {
        input seq
        output blast_result
        "echo '$seq' | blastp -db $DB -query $seq -outfmt 6 > blast_result"
    }

    task {
        input blast_result
        "cat $blast_result"
    }
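Each chunk that chunkFasta emits into the seq channel fires an independent copy of the first task, which is where the parallelism comes from. Assuming the script is saved as blast.nf (a hypothetical name) and a current Nextflow launcher, it would be started with the query file as a positional argument, available to the script as args[0]:

    nextflow run blast.nf ~/proteins.fa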

Slide 14

Mixing Languages

    task {
        "your BASH script .. "
    }

    task {
        """
        #!/usr/bin/env perl
        your glorious Perl script ..
        """
    }

    task {
        """
        #!/usr/bin/env python
        your Python code ..
        """
    }
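The shebang on the first line of the script block is what selects the interpreter, so the same pattern extends to any scripting language. For example (not on the slide), an R task would look like:

    task {
        """
        #!/usr/bin/env Rscript
        your R code ..
        """
    }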

Slide 15

Configurable execution layer
• The same pipeline can run on different platforms through a simple setting in the configuration file (sketched below):
• Local processes (Java threads)
• Resource managers (SGE, LSF, SLURM, etc.)
• Cloud (Amazon EC2, Google, etc.)
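For illustration, in today’s Nextflow the execution layer is selected in the nextflow.config file; this syntax postdates the talk, and the queue name is a hypothetical value:

    // nextflow.config
    process {
        executor = 'sge'    // or 'local', 'lsf', 'slurm', ...
        queue    = 'long'   // hypothetical cluster queue
    }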

Slide 16

Resume execution
• When a task crashes, the pipeline stops gracefully and reports the cause of the error
• Easy debugging: each task can be executed separately
• Once fixed, the execution can be resumed from the failure point (example below)
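In current Nextflow this behaviour is exposed through the -resume command line option; cached results of completed tasks are reused, so only the failed step onwards is re-executed (pipeline.nf is a placeholder name):

    nextflow run pipeline.nf            # stops at the failing task
    # fix the offending task, then:
    nextflow run pipeline.nf -resume    # restarts from the failure point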

Slide 17

A case study: Pipe-R
• A pipeline for the detection and mapping of long non-coding RNAs
• ~ 10'000 lines of Perl code
• Runs on a single computer (single process) or on a cluster through SGE
• ~ 70% of the code deals with parameter handling, file splitting, job parallelization, synchronization, etc.
• Very inefficient parallelization

Slide 18

[Figure: the Pipe-R workflow. Genes A and B from the query species are mapped onto the target species via anchor regions (Anchor 1, 2, 3) through BLAST and alignment steps, yielding homologs A1, A2, A3 and B1.]

Slide 19

Problems
• The slowest job stalls the whole computation
• Parallelization depends on the number of genomes (with a single genome there is no parallel execution)
• A specific resource manager technology (qsub) is hard-coded into the pipeline
• If it crashes, you lose days of computation

Slide 20

Piper-NF
• A Pipe-R implementation based on Nextflow
• 350 lines of code vs. the 10'000 of the legacy version
• Much easier to write, test and maintain
• The implementation is platform agnostic
• Parallelized splitting by query and by genome (sketched below)
• Fine-grained control over parallelization
• Greatly improved task “interleaving”
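A minimal sketch of the query/genome splitting in today’s Nextflow channel syntax (which postdates these slides; the chunk size and file paths are hypothetical):

    // every (query chunk, genome) pair becomes an independent parallel task
    Channel.fromPath('query.fa')
           .splitFasta(by: 100, file: true)            // chunks of 100 sequences
           .combine(Channel.fromPath('genomes/*.fa'))  // pair each chunk with each genome
           .set { chunk_genome_pairs }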

Slide 21

[Figure: “Piper-NF: Let it flow…”, the slide 18 workflow diagram revisited: genes A and B of the query species, anchors 1, 2 and 3 on the target species, the BLAST and alignment steps, and homologs A1, A2, A3 and B1.]

Slide 22

Piper-NF benchmark
• Query with 665 sequences
• Mapping against 22 L.mania genomes
[Chart: wall-clock time (30 to 180 mins) against the number of cluster nodes (50 to 150) for the Legacy* pipeline and for Piper-NF with no split and with chunk sizes of 200, 100, 50 and 25]
* Partial (not including the T-Coffee alignment step)
6 x faster!

Slide 23

What’s next
• Support more data formats (FASTQ, BAM, SAM)
• Enhance the syntax to make it more expressive
• Advanced profiling (Paraver)
• Improve grid support (SLURM, LSF, DRMAA, etc.)
• Integrate the cloud (Amazon, DNAnexus)
• Add health monitoring and automatic fail-over

Slide 24

Noteworthy
• Deployed as a single executable package, i.e. download and run
• The pipeline’s functional logic is decoupled from the actual execution layer (local/grid/cloud/?)
• Reuses existing code/scripts
• Parallelism is defined implicitly by the tasks’ input/output declarations
• On a task error it stops gracefully, reports the error, and can resume from the failure point

Slide 25

Links
• Nextflow http://nextflow-project.org
• Dataflow http://www.gpars.org/1.0.0/guide/guide/dataflow.html
• Piper-NF http://github.com/cbcrg/piper-nf