Dataflow-oriented bioinformatics pipelines with Nextflow
I introduce the dataflow programming model, show how it can be used to deal with the increasing complexity of bioinformatics pipelines, and describe how the Nextflow framework addresses problems common to other approaches.
Current approaches do not scale properly with big data • Very different parallelization strategies and implementations: shared memory (processes, threads), message passing (MPI, actors), distributed computation, hardware parallelization (GPU) • In general they provide too low-level an abstraction and require specific API skills • They introduce hard dependencies on a specific platform/framework
Dataflow programming originates in research on reactive systems for monitoring and controlling industrial processes • All tasks run in parallel and form a process network • Tasks communicate through channels (non-blocking unidirectional FIFO queues) • A task is executed when all its inputs are bound • Synchronization is defined implicitly by the tasks' input/output declarations 1. G. Kahn, "The Semantics of a Simple Language for Parallel Programming," Proc. of the IFIP Congress 74, North-Holland Publishing Co., 1974
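This model maps directly onto Nextflow: channels are the FIFO queues and each process is a task in the network, fired as soon as its inputs are available. A minimal sketch of that idea, with hypothetical process and channel names (syntax follows the classic Nextflow DSL and may vary across versions):

```nextflow
// values flow through a channel into a task; the task runs once per
// input value, with no explicit synchronization code anywhere
nums = Channel.from(1, 2, 3)

process square {
    input:
    val x from nums

    output:
    val result into squares

    exec:
    result = x * x
}

// downstream consumers are triggered as soon as results become available
squares.subscribe { println it }
```

Note that nothing here says *when* `square` runs: the runtime schedules it whenever a value is bound to `x`, which is exactly the implicit synchronization described above.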
A pipeline for the mapping of long non-coding RNAs • ~10'000 lines of Perl code • runs on a single computer (single process) or on a cluster through SGE • ~70% of the code deals with parameter handling, file splitting, job parallelization, synchronization, etc. • very inefficient parallelization
Parallelization depends on the number of genomes (with 1 genome, no parallel execution) • A specific resource-manager technology (qsub) is hard-coded into the pipeline • If it crashes, you lose days of computation
Far less code than the 10'000-line legacy version • Much easier to write, test and maintain • Implementation is platform agnostic • Parallelized splitting by query and by genome • Fine-grained control of parallelization • Greatly improved task "interleaving"
[Benchmark chart: 22 genomes, run time (30–180 mins) vs. cluster nodes (50–200) for the legacy pipeline* and Nextflow with no split and chunk sizes 200, 100, 50 and 25 — up to 6x faster. *Legacy timing is partial, not including the T-Coffee alignment step]
• Enhance the syntax to make it more expressive • Advanced profiling (Paraver) • Improve grid support (SLURM, LSF, DRMAA, etc.) • Integrate with the cloud (Amazon, DNAnexus) • Add health monitoring and automatic fail-over
• The pipeline's functional logic is decoupled from the actual execution layer (local/grid/cloud/?) • Reuse existing code/scripts • Parallelism is defined implicitly by the tasks' input/output declarations • On a task error, it stops gracefully, reports the error, and can resume from the point of failure
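These points can be illustrated with a sketch of a split-and-merge pipeline in this style. The query file, chunk size, database path and output names below are hypothetical, not taken from the actual lncRNA pipeline; the task body is plain Bash, which is how existing scripts are reused unchanged:

```nextflow
params.query = 'sequences.fa'
params.chunkSize = 25

// split the query into chunks; each chunk implicitly becomes a parallel task
chunks = Channel.fromPath(params.query).splitFasta(by: params.chunkSize)

process blast {
    input:
    file 'chunk.fa' from chunks

    output:
    file 'hits.txt' into hits

    // an existing command-line tool, reused as-is
    """
    blastp -query chunk.fa -db /data/nr -outfmt 6 > hits.txt
    """
}

// merge the partial results; note that the executor (local, SGE, SLURM, ...)
// is selected in the configuration file, not in the pipeline logic
hits.collectFile(name: 'all_hits.txt').subscribe { println "Done: $it" }
```

Rerunning with the `-resume` option picks up from the last successfully completed task instead of recomputing everything, which is what makes a crash cost minutes rather than days.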