Dataflow oriented bioinformatics pipelines with Nextflow

I introduce the Dataflow programming model: how it can be used to deal with the increasing complexity of bioinformatics pipelines, and how the Nextflow framework addresses many problems common to other approaches.

Paolo Di Tommaso

June 20, 2013


  1. Dataflow oriented bioinformatics pipelines. Paolo Di Tommaso, Notredame’s lab, CRG, 20th June ’13
  2. (image slide, no text)
  3. The computer scientist approach:
     1. Set up a database server
     2. Parse the proteins
     3. Populate the DB
     4. You need a client app to access the DB
     5. select count(*) from table.proteins
  4. The bioinformaticians’ way: cat ~/proteins.fa | grep '>' | wc -l

  5. The good
     • Allows quick and easy data extraction/manipulation
     • No dependencies on external servers
     • Enables fast prototyping and easy experimentation with multiple alternatives
     • Linux is the integration layer in the bioinformatics domain
  6. The bad
     • Non-standard tools need to be compiled on the target platform
     • As scripts get bigger, they become fragile, unreadable and hard to modify
     • In large scripts, task inputs/outputs tend not to be clearly defined
  7. The ugly
     • It’s very hard to get it to scale properly with big data
     • Very different parallelization strategies and implementations:
       • Shared memory (process, thread)
       • Message passing (MPI, actors)
       • Distributed computation
       • Hardware parallelization (GPU)
     • In general these provide too low-level an abstraction and require specific API skills
     • They introduce hard dependencies on a specific platform/framework
  8. What we need: compose Linux commands and scripts as usual + a high-level parallelization model (ideally platform agnostic)
  9. Dataflow
     • Declarative computational model for concurrent task execution [1]
     • Originates in research on reactive systems for monitoring and controlling industrial processes
     • All tasks are parallel and form a process network
     • Tasks communicate through channels (non-blocking unidirectional FIFO queues)
     • A task is executed when all its inputs are bound
     • Synchronization is implicitly defined by the tasks’ input/output declarations
     [1] G. Kahn, “The Semantics of a Simple Language for Parallel Programming,” Proc. of the IFIP Congress 74, North-Holland Publishing Co., 1974
  10. Basic example

      a = new Dataflow()
      b = new Dataflow()
      c = new Dataflow()

      task { a << b + ' ' + c }
      task { b << 'Hello' }
      task { c << 'World!' }

      print a
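The snippet above uses single-assignment dataflow variables: any read blocks until a value is bound, so declaration order does not matter. The same behaviour can be sketched in plain Python; the DataflowVariable class and task helper below are invented for this illustration, not part of any library:

```python
import threading

class DataflowVariable:
    """Single-assignment variable: get() blocks until bind() is called."""
    def __init__(self):
        self._ready = threading.Event()
        self._value = None

    def bind(self, value):
        self._value = value
        self._ready.set()

    def get(self):
        self._ready.wait()
        return self._value

def task(fn):
    """Run fn concurrently, like the task { ... } blocks above."""
    t = threading.Thread(target=fn)
    t.start()
    return t

a, b, c = DataflowVariable(), DataflowVariable(), DataflowVariable()
task(lambda: a.bind(b.get() + ' ' + c.get()))  # blocks until b and c are bound
task(lambda: b.bind('Hello'))
task(lambda: c.bind('World!'))
print(a.get())  # Hello World!
```

The task computing a simply waits for b and c; no explicit locks or joins are written, which is exactly the implicit synchronization the dataflow model provides.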
  11. Nextflow
      • A task declares its inputs/outputs and a Linux executable script
      • It is executed as soon as the declared inputs are available
      • It is inherently parallel
      • Each task is executed in its own private directory
      • Produced outputs trigger the execution of downstream tasks
  12. Nextflow task

      task ('optional name') {
          input file_in
          output file_out
          output '*.fa': channel

          """
          your BASH script
             :
          """
      }
  13. Parallel BLAST example

      query = file(args[0])
      DB = "$HOME/blast-db/pdb/pdb"
      seq = channel()

      query.chunkFasta { seq << it }

      task {
          input seq
          output blast_result
          "echo '$seq' | blastp -db $DB -query $seq -outfmt 6 > blast_result"
      }

      task {
          input blast_result
          "cat $blast_result"
      }
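In the example above, chunkFasta splits the query file into groups of sequences that are emitted one by one into the seq channel, so each BLAST task runs on its own chunk. A minimal Python sketch of that splitting step (the chunk_fasta helper is hypothetical, not Nextflow's implementation):

```python
def chunk_fasta(text, n):
    """Split multi-FASTA text into chunks of at most n records each."""
    records = []
    for line in text.splitlines():
        if line.startswith('>'):
            records.append([line])       # a '>' header starts a new record
        elif records:
            records[-1].append(line)     # sequence lines join the last record
    return ['\n'.join('\n'.join(r) for r in records[i:i + n])
            for i in range(0, len(records), n)]

fasta = ">seq1\nMKV\n>seq2\nGGA\n>seq3\nTTL\n"
chunks = chunk_fasta(fasta, 2)  # two chunks: seq1+seq2, then seq3
```

Each chunk is itself valid FASTA text, so it can be fed straight to a blastp invocation running in parallel with the others.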
  14. Mixing Languages

      task {
          "your BASH script .."
      }

      task {
          """
          #!/usr/bin/env perl
          your glory PERL script ..
          """
      }

      task {
          """
          #!/usr/bin/env python
          your Python code ..
          """
      }
  15. Configurable execution layer
      • The same pipeline can run on different platforms via a simple setting in the configuration file:
      • Local processes (Java threads)
      • Resource managers (SGE, LSF, SLURM, etc.)
      • Cloud (Amazon EC2, Google, etc.)
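As a sketch, in present-day Nextflow such a setting lives in the nextflow.config file; the 2013 syntax may have differed, and the queue name is hypothetical:

```groovy
// nextflow.config: choose the execution layer without touching the pipeline
process {
    executor = 'sge'     // or 'local', 'lsf', 'slurm', ...
    queue    = 'long'    // hypothetical queue name
}
```

Switching from a local run to a cluster run is then a one-line change in configuration, with the pipeline script left untouched.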
  16. Resume execution
      • When a task crashes, the pipeline stops gracefully and reports the cause of the error
      • Easy debugging: each task can be executed separately
      • Once fixed, the execution can be resumed from the failure point
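The resume behaviour can be pictured as a cache keyed on the task definition: tasks that already completed are skipped on re-run, so only the failed step and everything downstream execute again. A toy Python sketch of the idea, with an invented directory layout and hash key (not Nextflow's actual work-dir format):

```python
import hashlib
import os
import subprocess
import tempfile

def run_cached(name, script, workdir):
    """Run a bash task in its own directory; skip it if it already succeeded."""
    key = hashlib.sha1((name + script).encode()).hexdigest()
    taskdir = os.path.join(workdir, key)
    marker = os.path.join(taskdir, '.exitcode')
    if os.path.exists(marker):                 # completed earlier: resume past it
        return int(open(marker).read())
    os.makedirs(taskdir, exist_ok=True)
    rc = subprocess.run(['bash', '-c', script], cwd=taskdir).returncode
    if rc == 0:                                # record success only
        with open(marker, 'w') as f:
            f.write(str(rc))
    return rc

work = tempfile.mkdtemp()
run_cached('hello', 'echo hi > out.txt', work)   # executes the script
run_cached('hello', 'echo hi > out.txt', work)   # skipped: cached result
```

Because each task runs in its own private directory, a crash leaves earlier results intact, and re-running the pipeline picks up from the failure point.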
  17. A case study: Pipe-R
      • Pipeline for the detection and mapping of long non-coding RNAs
      • ~10'000 lines of Perl code
      • Runs on a single computer (single process) or on a cluster through SGE
      • ~70% of the code deals with parameter handling, file splitting, job parallelization, synchronization, etc.
      • Very inefficient parallelization
  18. (diagram: Pipe-R maps query-species genes A and B, via anchors 1–3, BLAST and alignment steps, to homologs A1–A3 and B1 in the target species)
  19. Problems
      • The slowest job stalls the whole computation
      • Parallelization depends on the number of genomes (one genome means no parallel execution)
      • A specific resource manager technology (qsub) is hard-coded into the pipeline
      • If it crashes, you lose days of computation
  20. Piper-NF
      • Pipe-R implementation based on Nextflow
      • 350 lines of code vs. 10'000 in the legacy version
      • Much easier to write, test and maintain
      • Platform-agnostic implementation
      • Parallelized splitting by query and by genome
      • Fine-grained control of parallelization
      • Greatly improved task “interleaving”
  21. (diagram: the same gene-mapping schema re-drawn for Piper-NF, “Let it flow…”)
  22. Piper-NF benchmark
      • Query with 665 sequences, mapped against 22 L.mania genomes
      • (chart: runtime, 30–180 mins, vs. number of cluster nodes, 50–200, for the legacy version and for Piper-NF with no split and chunk sizes 200/100/50/25; legacy figure is partial, not including the T-Coffee alignment step)
      • 6x faster!
  23. What’s next
      • Support more data formats (FASTQ, BAM, SAM)
      • Enhance the syntax to make it more expressive
      • Advanced profiling (Paraver)
      • Improve grid support (SLURM, LSF, DRMAA, etc.)
      • Integrate the cloud (Amazon, DNAnexus)
      • Add health monitoring and automatic fail-over
  24. Noteworthy
      • Deployed as a single executable package, i.e. download and run
      • The pipeline’s functional logic is decoupled from the actual execution layer (local/grid/cloud/?)
      • Reuses existing code/scripts
      • Parallelism is defined implicitly by the tasks’ input/output declarations
      • On a task error it stops gracefully, reports the cause, and can resume from the failure point
  25. Links
      • Nextflow http://nextflow-project.org
      • Dataflow http://www.gpars.org/1.0.0/guide/guide/dataflow.html
      • Piper-NF