Large scale genomics with Nextflow and AWS Batch

LARGE SCALE GENOMICS WITH NEXTFLOW AND AWS BATCH
Paolo Di Tommaso, CRG (Barcelona)
RCUK Cloud Workshop, 8 Jan 2018

WHO IS THIS CHAP?
@PaoloDiTommaso
Research software engineer
Comparative Bioinformatics, Notredame Lab
Center for Genomic Regulation (CRG)
Author of the Nextflow project

GENOMIC WORKFLOWS
• Data analysis applications to extract information from (large) genomic datasets
• Mash-up of many different tools and scripts
• Embarrassingly parallel: can spawn 100s to 100k jobs over a distributed cluster
• Complex dependency trees and configuration → very fragile ecosystem

AWS BATCH
• Runs compute jobs in the cloud in a batch fashion (i.e. asynchronously)
• It manages the provisioning and scaling of the cluster
• It provides the concept of a job queue
• A job is a container (cool!)

WHAT DOES A TASK LOOK LIKE?
aws s3 cp s3://{bucket}/{sample}_r1.fq .
aws s3 cp s3://{bucket}/{sample}_r2.fq .
aws s3 sync s3://{bucket}/assets/{reference} .
bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \
  | samtools sort -o {sample}.bam
samtools index {sample}.bam
aws s3 cp {sample}.bam s3://{bucket}/
aws s3 cp {sample}.bam.bai s3://{bucket}/
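The curly-brace placeholders in the task script have to be filled in for every sample before a job can run. A minimal sketch of that step, assuming Python's `str.format` does the substitution; the helper name `render_task` and the example bucket, sample and reference values are hypothetical and not part of the talk:

```python
# Job script from the slide, kept as a template with {placeholders}.
TASK_TEMPLATE = """\
aws s3 cp s3://{bucket}/{sample}_r1.fq .
aws s3 cp s3://{bucket}/{sample}_r2.fq .
aws s3 sync s3://{bucket}/assets/{reference} .
bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \\
  | samtools sort -o {sample}.bam
samtools index {sample}.bam
aws s3 cp {sample}.bam s3://{bucket}/
aws s3 cp {sample}.bam.bai s3://{bucket}/
"""


def render_task(bucket: str, sample: str, reference: str, cpus: int) -> str:
    """Return the job script for one sample (hypothetical helper)."""
    return TASK_TEMPLATE.format(
        bucket=bucket, sample=sample, reference=reference, cpus=cpus
    )


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(render_task("my-bucket", "sampleA", "hg38", 4))
```

Note how much of the script is data staging: five of the eight commands only move files to and from S3.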

HOW IS A TASK RUN?
• Create a Docker image including the job script
• Upload the Docker image to a public registry
• Create a job template referencing the uploaded image
• Submit the job with the AWS command-line tool
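The last two steps can be sketched as the two request payloads AWS Batch expects. The field names follow the Batch RegisterJobDefinition and SubmitJob API actions; the job names, image, queue, command and resource values are hypothetical, and the payloads are only printed here, not sent to AWS:

```python
import json


def job_definition(name: str, image: str, vcpus: int, memory_mib: int) -> dict:
    """Build a RegisterJobDefinition payload for a container job."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,        # Docker image holding the job script
            "vcpus": vcpus,
            "memory": memory_mib,  # MiB
        },
    }


def job_submission(job_name: str, queue: str, definition: str, command: list) -> dict:
    """Build a SubmitJob payload that runs `command` in the container."""
    return {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "containerOverrides": {"command": command},
    }


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    definition = job_definition("align-sample", "user/image", 4, 8192)
    submission = job_submission(
        "align-sampleA", "my-queue", "align-sample", ["run_task.sh", "sampleA"]
    )
    print(json.dumps(definition, indent=2))
    print(json.dumps(submission, indent=2))
```

Saved to files (say, the hypothetical `definition.json` and `submission.json`), these payloads could be passed to `aws batch register-job-definition` and `aws batch submit-job` through the CLI's `--cli-input-json` option. Newer versions of the Batch API prefer `resourceRequirements` over the `vcpus` and `memory` fields.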

Companion parasite genome annotation pipeline, Steinbiss et al., DOI: 10.1093/nar/gkw292

BOTTLENECKS
• The need to handle input downloads and output uploads reduces workflow portability
• Custom container images are required (ideally we would like to use community container images, e.g. BioContainers)
• Orchestrating big real-world workflows can be challenging

Orchestration & Parallelisation
Scalability & Portability
Deployment & Reproducibility
containers, Git, GitHub

TASK EXAMPLE
bwa mem reference.fa sample.fq \
  | samtools sort -o sample.bam

TASK EXAMPLE
process align_sample {
    input:
    file 'reference.fa' from genome_ch
    file 'sample.fq' from reads_ch

    output:
    file 'sample.bam' into bam_ch

    script:
    """
    bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam
    """
}

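TASK COMPOSITION
// Aligns the reads and emits the sorted BAM on bam_ch
process align_sample {
    input:
    file 'reference.fa' from genome_ch
    file 'sample.fq' from reads_ch

    output:
    file 'sample.bam' into bam_ch

    script:
    """
    bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam
    """
}

// Consumes bam_ch, so it runs as soon as align_sample produces a BAM
process index_sample {
    input:
    file 'sample.bam' from bam_ch

    output:
    // samtools index names the index after its input: sample.bam.bai
    file 'sample.bam.bai' into bai_ch

    script:
    """
    samtools index sample.bam
    """
}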

REACTIVE NETWORK
• Declarative computational model for parallel process execution
• Processes wait for data; when an input set is ready, the process is executed
• They communicate through dataflow variables, i.e. async FIFO queues called channels
• Parallelisation and task dependencies are implicitly defined by the process in/out declarations
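The dataflow idea can be illustrated outside Nextflow. The sketch below is not how Nextflow is implemented; it is a small Python analogy, assuming two processes connected by FIFO queues that stand in for channels. The process names mirror the slides, while `run_process`, `run_pipeline`, the `DONE` sentinel and the string transformations are hypothetical. Each worker here handles one item at a time, whereas Nextflow can run many tasks of the same process in parallel.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel


def run_process(func, in_ch: queue.Queue, out_ch: queue.Queue) -> threading.Thread:
    """Start a worker that applies func to each item as soon as it arrives."""
    def worker():
        while True:
            item = in_ch.get()      # blocks until an input is ready
            if item is DONE:
                out_ch.put(DONE)    # propagate end-of-stream downstream
                break
            out_ch.put(func(item))
    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def align_sample(reads: str) -> str:
    return reads.replace(".fq", ".bam")   # stand-in for bwa mem | samtools sort


def index_sample(bam: str) -> str:
    return bam + ".bai"                   # stand-in for samtools index


def run_pipeline(samples):
    reads_ch, bam_ch, bai_ch = queue.Queue(), queue.Queue(), queue.Queue()
    # The channel wiring is the only place where the dependency is declared.
    threads = [
        run_process(align_sample, reads_ch, bam_ch),
        run_process(index_sample, bam_ch, bai_ch),
    ]
    for sample in samples:
        reads_ch.put(sample)
    reads_ch.put(DONE)
    for thread in threads:
        thread.join()
    results = []
    while True:
        item = bai_ch.get()
        if item is DONE:
            return results
        results.append(item)


if __name__ == "__main__":
    print(run_pipeline(["a.fq", "b.fq"]))
```

Nothing in the code says "run the indexing after the alignment"; the order follows from `index_sample` reading the queue that `align_sample` writes to.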

PORTABILITY
process {
    executor = 'slurm'
    queue = 'my-queue'
    memory = '8 GB'
    cpus = 4
    container = 'user/image'
}

PORTABILITY
process {
    executor = 'awsbatch'
    queue = 'my-queue'
    memory = '8 GB'
    cpus = 4
    container = 'user/image'
}

SHOW ME THE CODE!

AWS BATCH BENCHMARK
• RNA-Seq quantification pipeline *
• 375 samples taken from the ENCODE project
• 753 jobs
• ~65 i3.xlarge spot instances
• ~23 h wall time, ~4,850 CPU-hours
* https://github.com/nextflow-io/rnaseq-encode-nf

THE BILL
• $310 (~$1.2 per sample)
• $90 EC2 spot instances (1,400 h × i3.xlarge)
• $220 EBS storage (1 TB × ~1,400 h)
• Choosing a better-sized EBS volume (~200 GB) ⤇ $135 ⤇ ~$0.36 per sample
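The saving from a smaller EBS volume can be checked with a few lines of arithmetic. This is a sketch that uses only the slide's figures, assuming EBS cost scales linearly with volume size, 1 TB = 1,000 GB, and the 375 benchmark samples as the denominator; the function names are hypothetical.

```python
EC2_COST = 90.0        # USD, spot instances (from the slide)
EBS_COST_1TB = 220.0   # USD, 1 TB volumes over ~1,400 instance-hours (from the slide)
SAMPLES = 375          # samples in the benchmark


def total_cost(ebs_gb: float) -> float:
    """Total bill if EBS cost scales linearly with volume size (assumption)."""
    return EC2_COST + EBS_COST_1TB * ebs_gb / 1000.0


def per_sample(cost: float) -> float:
    """Cost per sample over the 375 benchmark samples."""
    return cost / SAMPLES


if __name__ == "__main__":
    for size_gb in (1000, 200):
        cost = total_cost(size_gb)
        print(f"{size_gb} GB: ${cost:.0f} total, ${per_sample(cost):.2f} per sample")
```

With 200 GB volumes this gives about $134 in total and about $0.36 per sample, in line with the slide. Dividing the $310 bill by 375 samples gives about $0.83, so the slide's ~$1.2 figure is not reproduced by this simple division.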

WHO IS USING NEXTFLOW?

TAKE HOME MESSAGE
• Batch provides a truly scalable, elastic computing environment for containerised workloads
• Delegating the cluster provisioning is a big plus
• Choose the size of EBS storage carefully
• Nextflow enables the seamless deployment of scalable and portable computational workflows

ACKNOWLEDGMENT
Evan Floden
Emilio Palumbo
Cedric Notredame
Notredame Lab, CRG
Phil Ewels
Brendan Bouffler
