Slide 1

Dataflow-oriented bioinformatics pipelines
Paolo Di Tommaso, Notredame’s lab
CRG, 20th June ’13

Slide 2

No content

Slide 3

The computer scientist approach
1. Set up a database server
2. Parse the proteins
3. Populate the DB
4. Write a client app to access the DB
5. select count(*) from table.proteins

Slide 4

The bioinformaticians’ way

    cat ~/proteins.fa | grep '>' | wc -l

Slide 5

The good
• Allows quick and easy data extraction/manipulation
• No dependencies on external servers
• Enables fast prototyping; multiple alternatives can be tried easily
• Linux is the integration layer of the bioinformatics domain

Slide 6

The bad
• Non-standard tools need to be compiled on the target platform
• As scripts get bigger they become fragile, unreadable and hard to modify
• In large scripts, task inputs/outputs tend not to be clearly defined

Slide 7

The ugly
• It’s very hard to make it scale properly with big data
• Very different parallelization strategies and implementations:
  • Shared memory (processes, threads)
  • Message passing (MPI, actors)
  • Distributed computation
  • Hardware parallelization (GPU)
• In general these provide too low-level an abstraction and require specific API skills
• They introduce hard dependencies on a specific platform/framework

Slide 8

What we need
Compose Linux commands and scripts as usual
+
a high-level parallelization model (ideally platform agnostic)

Slide 9

Dataflow
• A declarative computational model for concurrent task execution¹
• Originates in research on reactive systems for monitoring and controlling industrial processes
• All tasks are parallel and form a process network
• Tasks communicate through channels (non-blocking unidirectional FIFO queues)
• A task is executed when all its inputs are bound
• Synchronization is implicitly defined by the tasks’ input/output declarations

1. G. Kahn, “The Semantics of a Simple Language for Parallel Programming,” Proc. of the IFIP Congress 74, North-Holland Publishing Co., 1974

Slide 10

Basic example

    a = new Dataflow()
    b = new Dataflow()
    c = new Dataflow()

    task { a << b + ' ' + c }
    task { b << 'Hello' }
    task { c << 'World!' }

    print a
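The snippet above is pseudocode, but it maps almost directly onto the GPars dataflow library linked on the last slide. A minimal runnable sketch, assuming GPars is on the classpath (reads of dataflow variables go through .val):

    import groovyx.gpars.dataflow.DataflowVariable
    import static groovyx.gpars.dataflow.Dataflow.task

    final a = new DataflowVariable()   // write-once dataflow variables
    final b = new DataflowVariable()
    final c = new DataflowVariable()

    task { a << b.val + ' ' + c.val }  // blocks until b and c are bound
    task { b << 'Hello' }
    task { c << 'World!' }

    println a.val                      // prints "Hello World!"

Note that the declaration order of the three tasks does not matter: the first one simply waits on its inputs, which is exactly the implicit synchronization described on the previous slide.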

Slide 11

Nextflow
• Each task declares its inputs/outputs and a Linux executable script
• A task is executed as soon as its declared inputs are available
• Execution is inherently parallel
• Each task is executed in its own private directory
• Produced outputs trigger the execution of downstream tasks

Slide 12

Nextflow task

    task ('optional name') {
        input file_in
        output file_out
        output '*.fa': channel

        """
        your BASH script
            :
        """
    }
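As a concrete illustration of the skeleton above (not on the slide; the names extract-headers, sequences and headers are made up), a task that extracts the FASTA headers from its input file, in the same syntax the deck uses:

    task ('extract-headers') {
        input sequences
        output headers

        """
        grep '>' $sequences > headers
        """
    }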

Slide 13

Parallel BLAST example

    query = file(args[0])
    DB = "$HOME/blast-db/pdb/pdb"
    seq = channel()
    query.chunkFasta { seq << it }

    task {
        input seq
        output blast_result
        "echo '$seq' | blastp -db $DB -query $seq -outfmt 6 > blast_result"
    }

    task {
        input blast_result
        "cat $blast_result"
    }
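Each chunk that chunkFasta emits into the seq channel fires an independent copy of the first task, which is where the parallelism comes from. Assuming the script is saved as blast.nf (a hypothetical name) and a current Nextflow launcher, it would be started with the query file as a positional argument, available to the script as args[0]:

    nextflow run blast.nf ~/proteins.fa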

Slide 14

Mixing Languages

    task {
        "your BASH script .. "
    }

    task {
        """
        #!/usr/bin/env perl
        your glorious Perl script ..
        """
    }

    task {
        """
        #!/usr/bin/env python
        your Python code ..
        """
    }
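The shebang on the first line of the script block is what selects the interpreter, so the same pattern extends to any scripting language. For example (not on the slide), an R task would look like:

    task {
        """
        #!/usr/bin/env Rscript
        your R code ..
        """
    }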

Slide 15

Configurable execution layer
• The same pipeline can run on different platforms through a simple setting in the configuration file (sketched below):
• Local processes (Java threads)
• Resource managers (SGE, LSF, SLURM, etc.)
• Cloud (Amazon EC2, Google, etc.)
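For illustration, in today’s Nextflow the execution layer is selected in the nextflow.config file; this syntax postdates the talk, and the queue name is a hypothetical value:

    // nextflow.config
    process {
        executor = 'sge'    // or 'local', 'lsf', 'slurm', ...
        queue    = 'long'   // hypothetical cluster queue
    }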

Slide 16

Resume execution
• When a task crashes, the pipeline stops gracefully and reports the cause of the error
• Easy debugging: each task can be executed separately
• Once fixed, the execution can be resumed from the failure point (example below)
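In current Nextflow this behaviour is exposed through the -resume command line option; cached results of completed tasks are reused, so only the failed step onwards is re-executed (pipeline.nf is a placeholder name):

    nextflow run pipeline.nf            # stops at the failing task
    # fix the offending task, then:
    nextflow run pipeline.nf -resume    # restarts from the failure point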

Slide 17

A case study: Pipe-R
• A pipeline for the detection and mapping of long non-coding RNAs
• ~ 10'000 lines of Perl code
• Runs on a single computer (single process) or on a cluster through SGE
• ~ 70% of the code deals with parameter handling, file splitting, job parallelization, synchronization, etc.
• Very inefficient parallelization

Slide 18

[Figure: the Pipe-R workflow. Genes A and B from the query species are mapped onto the target species via anchor regions (Anchor 1, 2, 3) through BLAST and alignment steps, yielding homologs A1, A2, A3 and B1.]

Slide 19

Problems
• The slowest job stalls the whole computation
• Parallelization depends on the number of genomes (with a single genome there is no parallel execution)
• A specific resource manager technology (qsub) is hard-coded into the pipeline
• If it crashes, you lose days of computation

Slide 20

Piper-NF
• A Pipe-R implementation based on Nextflow
• 350 lines of code vs. the 10'000 of the legacy version
• Much easier to write, test and maintain
• The implementation is platform agnostic
• Parallelized splitting by query and by genome (sketched below)
• Fine-grained control over parallelization
• Greatly improved task “interleaving”
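A minimal sketch of the query/genome splitting in today’s Nextflow channel syntax (which postdates these slides; the chunk size and file paths are hypothetical):

    // every (query chunk, genome) pair becomes an independent parallel task
    Channel.fromPath('query.fa')
           .splitFasta(by: 100, file: true)            // chunks of 100 sequences
           .combine(Channel.fromPath('genomes/*.fa'))  // pair each chunk with each genome
           .set { chunk_genome_pairs }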

Slide 21

[Figure: “Piper-NF: Let it flow…”, the slide 18 workflow diagram revisited: genes A and B of the query species, anchors 1, 2 and 3 on the target species, the BLAST and alignment steps, and homologs A1, A2, A3 and B1.]

Slide 22

Piper-NF benchmark
• Query with 665 sequences
• Mapping against 22 L.mania genomes
[Chart: wall-clock time (30 to 180 mins) against the number of cluster nodes (50 to 150) for the Legacy* pipeline and for Piper-NF with no split and with chunk sizes of 200, 100, 50 and 25]
* Partial (not including the T-Coffee alignment step)
6 x faster!

Slide 23

What’s next
• Support more data formats (FASTQ, BAM, SAM)
• Enhance the syntax to make it more expressive
• Advanced profiling (Paraver)
• Improve grid support (SLURM, LSF, DRMAA, etc.)
• Integrate the cloud (Amazon, DNAnexus)
• Add health monitoring and automatic fail-over

Slide 24

Noteworthy
• Deployed as a single executable package, i.e. download and run
• The pipeline’s functional logic is decoupled from the actual execution layer (local/grid/cloud/?)
• Reuses existing code/scripts
• Parallelism is defined implicitly by the tasks’ input/output declarations
• On a task error it stops gracefully, reports the error, and can resume from the failure point

Slide 25

Links
• Nextflow http://nextflow-project.org
• Dataflow http://www.gpars.org/1.0.0/guide/guide/dataflow.html
• Piper-NF http://github.com/cbcrg/piper-nf