This presentation gives a short introduction to our experience deploying large-scale genomic pipelines with Nextflow and the AWS Batch cloud service.
WHO IS THIS CHAP? @PaoloDiTommaso • Research software engineer, Comparative Bioinformatics, Notredame Lab, Centre for Genomic Regulation (CRG) • Author of the Nextflow project
GENOMIC WORKFLOWS • Data analysis applications to extract information from (large) genomic datasets • Mash-up of many different tools and scripts • Embarrassingly parallel: can spawn 100s to 100k jobs over a distributed cluster • Complex dependency trees and configurations → very fragile ecosystem
AWS BATCH • Runs compute jobs in the cloud in a batch fashion (i.e. asynchronously) • It manages the provisioning and scaling of the cluster • It provides the concept of a job queue • A job is a container (cool!)
WHAT DOES A TASK LOOK LIKE? • Create a Docker image including the job script • Upload the Docker image to a public registry • Create a job definition referencing the uploaded image • Submit the job for execution with the AWS command line tool
BOTTLENECKS • The need to handle input downloads and output uploads reduces workflow portability • Custom container images are required, while ideally we would like to use community images, e.g. BioContainers (see the sketch after this list) • Orchestrating big real-world workflows can be challenging
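For illustration, a minimal sketch of how a Nextflow process can reference a community image directly; the image tag, file paths and channel name below are hypothetical:

    // The task declares a community image by name; Nextflow pulls it at
    // run time, so no custom image build or registry upload is needed
    reads_ch = Channel.fromPath('reads/*.fastq.gz')

    process fastqc {
        container 'biocontainers/fastqc:v0.11.5'   // hypothetical image tag

        input:
        file reads from reads_ch

        script:
        """
        fastqc $reads
        """
    }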
REACTIVE NETWORK • Declarative computational model for parallel process executions • Processes wait for data; when an input set is ready the process is executed • They communicate by using dataflow variables, i.e. asynchronous FIFO queues called channels • Parallelisation and task dependencies are implicitly defined by the process in/out declarations (see the sketch below)
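A minimal sketch of this model (file names and commands are hypothetical): one countSeqs task is spawned per input file, and the downstream total task starts as soon as all of its inputs are ready, purely from the in/out declarations; no explicit loop or synchronisation code is needed:

    sequences = Channel.fromPath('data/*.fasta')

    // one parallel task per FASTA file emitted by the channel
    process countSeqs {
        input:
        file fasta from sequences

        output:
        file 'count.txt' into counts

        script:
        """
        grep -c '>' $fasta > count.txt
        """
    }

    // fires once all the upstream results have been collected
    process total {
        input:
        file 'count*.txt' from counts.collect()

        output:
        stdout into result

        script:
        """
        cat count*.txt | awk '{ n += \$1 } END { print n }'
        """
    }

    result.subscribe { println it }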
TAKE HOME MESSAGE • Batch provides a truly scalable, elastic computing environment for containerised workloads • Delegating cluster provisioning to the service is a big plus • Choose the size of the EBS storage carefully • Nextflow enables the seamless deployment of scalable and portable computational workflows (a minimal configuration sketch follows)
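As a closing sketch, the configuration needed to point an existing Nextflow pipeline at AWS Batch is small; the queue name, bucket, image and region below are hypothetical placeholders:

    // nextflow.config
    process.executor  = 'awsbatch'                       // run each task as a Batch job
    process.queue     = 'my-batch-queue'                 // an existing Batch job queue
    process.container = 'biocontainers/samtools:v1.3.1'  // hypothetical community image
    workDir           = 's3://my-bucket/work'            // task inputs/outputs staged via S3
    aws.region        = 'eu-west-1'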