Slide 1

Slide 1 text

LARGE SCALE GENOMICS WITH NEXTFLOW AND AWS BATCH
Paolo Di Tommaso, CRG (Barcelona)
RCUK Cloud Workshop, 8 Jan 2018

Slide 2

Slide 2 text

WHO IS THIS CHAP?
@PaoloDiTommaso
Research software engineer
Comparative Bioinformatics, Notredame Lab
Center for Genomic Regulation (CRG)
Author of the Nextflow project

Slide 3

Slide 3 text

GENOMIC WORKFLOWS
• Data analysis applications that extract information from (large) genomic datasets
• Mash-up of many different tools and scripts
• Embarrassingly parallel: can spawn 100s to 100k jobs over a distributed cluster
• Complex dependency trees and configuration → very fragile ecosystem

Slide 4

Slide 4 text

AWS BATCH
• Runs compute jobs in the cloud in a batch fashion (i.e. asynchronously)
• Manages the provisioning and scaling of the cluster
• Provides the concept of a queue
• A job is a container (cool!)

Slide 5

Slide 5 text

WHAT DOES A TASK LOOK LIKE?

    aws s3 cp s3://{bucket}/{sample}_r1.fq .
    aws s3 cp s3://{bucket}/{sample}_r2.fq .
    aws s3 sync s3://{bucket}/assets/{reference} .
    bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \
      | samtools sort -o {sample}.bam
    samtools index {sample}.bam
    aws s3 cp {sample}.bam s3://{bucket}/
    aws s3 cp {sample}.bam.bai s3://{bucket}/
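
The task script on this slide is a template: the `{bucket}`, `{sample}`, `{reference}` and `{cpus}` placeholders have to be filled in for every sample before a job can run. A minimal Python sketch of that substitution step follows. The helper name `render_task` and the values `my-bucket`, `sampleA` and `hg38` are hypothetical, chosen only for illustration.

```python
# The task script from the slide, kept as a template with named placeholders.
TASK_TEMPLATE = """\
aws s3 cp s3://{bucket}/{sample}_r1.fq .
aws s3 cp s3://{bucket}/{sample}_r2.fq .
aws s3 sync s3://{bucket}/assets/{reference} .
bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \\
  | samtools sort -o {sample}.bam
samtools index {sample}.bam
aws s3 cp {sample}.bam s3://{bucket}/
aws s3 cp {sample}.bam.bai s3://{bucket}/
"""


def render_task(bucket, sample, reference, cpus):
    """Fill the placeholders for one sample (hypothetical helper)."""
    return TASK_TEMPLATE.format(
        bucket=bucket, sample=sample, reference=reference, cpus=cpus
    )


# Illustrative values only; none of them come from the talk.
script = render_task("my-bucket", "sampleA", "hg38", 4)
print(script)
```

Rendering one script per sample is easy; the hard part, as the following slides argue, is managing the transfers, images and dependencies around it.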

Slide 6

Slide 6 text

WHAT DOES A TASK LOOK LIKE?
• Create a Docker image that includes the job script
• Upload the Docker image to a public registry
• Create a job template referencing the uploaded image
• Submit the job for execution with the AWS command line tool
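
The last two steps above can be sketched in Python. Note that this uses the boto3 SDK rather than the AWS command line tool named on the slide, so it is an equivalent route and not the speaker's own procedure. The job definition name `align-sample`, the script path `/opt/task.sh` and the `sample` parameter are hypothetical; `user/image` and `my-queue` are borrowed from the configuration slides later in the deck. The image is assumed to be built and pushed already.

```python
def job_definition(name, image, cpus, memory_mib):
    """Build the request body for Batch's register_job_definition call."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            # "Ref::sample" is replaced by the job's `sample` parameter at submit time.
            "command": ["bash", "/opt/task.sh", "Ref::sample"],  # hypothetical path
            "resourceRequirements": [
                {"type": "VCPU", "value": str(cpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }


def job_submission(sample, queue, definition_name):
    """Build the request body for Batch's submit_job call."""
    return {
        "jobName": f"align-{sample}",
        "jobQueue": queue,
        "jobDefinition": definition_name,
        "parameters": {"sample": sample},
    }


def submit(sample, queue="my-queue"):
    """Register the definition and submit one job (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builders above work without it

    batch = boto3.client("batch")
    definition = job_definition("align-sample", "user/image", cpus=4, memory_mib=8192)
    batch.register_job_definition(**definition)
    request = job_submission(sample, queue, definition["jobDefinitionName"])
    return batch.submit_job(**request)["jobId"]
```

Calling `submit("sampleA")` would queue a single job; a real workflow repeats this for every sample and every step, which is the bookkeeping the rest of the talk aims to remove.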

Slide 7

Slide 7 text

Companion parasite genome annotation pipeline, Steinbiss et al., DOI: 10.1093/nar/gkw292

Slide 8

Slide 8 text

BOTTLENECKS
• Having to handle input downloads and output uploads reduces workflow portability
• Custom container images are required (ideally we would use community container images, e.g. BioContainers)
• Orchestrating big real-world workflows can be challenging

Slide 9

Slide 9 text

Orchestration & Parallelisation
Scalability & Portability
Deployment & Reproducibility
containers, Git, GitHub

Slide 10

Slide 10 text

TASK EXAMPLE

    bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam

Slide 11

Slide 11 text

TASK EXAMPLE

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }

Slide 12

Slide 12 text

TASK COMPOSITION

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }

    process index_sample {
        input:
        file 'sample.bam' from bam_ch

        output:
        // samtools index writes the index next to its input as sample.bam.bai
        file 'sample.bam.bai' into bai_ch

        script:
        """
        samtools index sample.bam
        """
    }

Slide 13

Slide 13 text

REACTIVE NETWORK
• Declarative computational model for parallel process execution
• Processes wait for data; when an input set is ready, the process is executed
• They communicate through dataflow variables, i.e. asynchronous FIFO queues called channels
• Parallelisation and task dependencies are implicitly defined by the process input/output declarations
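
The dataflow idea above can be illustrated with a toy model in Python. This is not how Nextflow is implemented; it is a sketch that assumes one worker thread per process and uses `queue.Queue` objects as stand-ins for channels. The `align` and `index` functions only rename files to mimic the two processes from the previous slides.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel


def process(task, inbox, outbox):
    """Start a worker that runs `task` on each item arriving on `inbox`."""
    def loop():
        while True:
            item = inbox.get()        # block until an input is ready
            if item is DONE:
                outbox.put(DONE)      # propagate end-of-stream downstream
                return
            outbox.put(task(item))    # emit the result on the output channel

    worker = threading.Thread(target=loop)
    worker.start()
    return worker


def align(fastq):
    return fastq.replace(".fq", ".bam")   # stand-in for bwa mem | samtools sort


def index(bam):
    return bam + ".bai"                   # stand-in for samtools index


def run_pipeline(samples):
    reads_ch, bam_ch, bai_ch = queue.Queue(), queue.Queue(), queue.Queue()
    # The dependency align -> index exists only because both share bam_ch.
    workers = [process(align, reads_ch, bam_ch), process(index, bam_ch, bai_ch)]
    for sample in samples:
        reads_ch.put(sample)
    reads_ch.put(DONE)

    results = []
    while True:
        item = bai_ch.get()
        if item is DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


print(run_pipeline(["a.fq", "b.fq"]))
```

Neither process refers to the other: the execution order follows from which channel each one reads and writes, and indexing of the first sample can start while the second is still being aligned.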

Slide 14

Slide 14 text

PORTABILITY

Slide 15

Slide 15 text

PORTABILITY

    process {
        executor = 'slurm'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }

Slide 16

Slide 16 text

PORTABILITY

    process {
        executor = 'awsbatch'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }

Slide 17

Slide 17 text

SHOW ME THE CODE!

Slide 18

Slide 18 text

AWS BATCH BENCHMARK
• RNA-Seq quantification pipeline *
• 375 samples taken from the ENCODE project
• 753 jobs
• ~65 i3.xlarge spot instances
• ~23 h wall-time
• ~4,850 CPU-hours
* https://github.com/nextflow-io/rnaseq-encode-nf
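
As a rough sanity check on these figures, the reported CPU-hours can be compared with the capacity of the fleet. The sketch below rests on two assumptions not stated on the slide: that all ~65 instances ran for the full ~23 h (an upper bound on capacity), and that "CPU-hours" means CPU time consumed by the tasks. The 4 vCPUs per i3.xlarge come from the published instance specification.

```python
VCPUS_PER_I3_XLARGE = 4  # published spec of the i3.xlarge instance type


def utilisation(cpu_hours, instances, wall_hours, vcpus_per_instance):
    """Fraction of the available vCPU-hours used, assuming a constant fleet size."""
    capacity = instances * wall_hours * vcpus_per_instance
    return cpu_hours / capacity


# Figures from the slide; all of them are approximate.
ratio = utilisation(
    cpu_hours=4850,
    instances=65,
    wall_hours=23,
    vcpus_per_instance=VCPUS_PER_I3_XLARGE,
)
print(f"{ratio:.0%} of the available vCPU-hours")
```

Under those assumptions the fleet was busy for roughly four fifths of the time. Since the inputs are rounded, this is an order-of-magnitude check rather than a measurement.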

Slide 21

Slide 21 text

THE BILL
• $310 in total (~$0.83 per sample)
• $90 EC2 spot instances (1,400 h × i3.xlarge)
• $220 EBS storage (1 TB × ~1,400 h)
• Choosing a better-sized EBS volume (~200 GB) ⤇ $135 ⤇ ~$0.36 per sample
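
The resized estimate on this slide can be reproduced with a few lines of arithmetic. The sketch assumes that EBS cost scales linearly with volume size and that 1 TB means 1,000 GB; the sample count of 375 is taken from the benchmark slide.

```python
SAMPLES = 375          # from the benchmark slide
EC2_COST = 90.0        # USD, spot instances
EBS_COST_1TB = 220.0   # USD, 1 TB volumes over ~1,400 h


def resized_bill(ebs_gb, baseline_gb=1000):
    """Total bill if EBS cost scales linearly with volume size (assumption)."""
    ebs_cost = EBS_COST_1TB * ebs_gb / baseline_gb
    return EC2_COST + ebs_cost


total = resized_bill(200)
per_sample = total / SAMPLES
print(f"total ${total:.0f}, per sample ${per_sample:.2f}")
```

The result lands within a dollar of the ~$135 quoted on the slide and matches the ~$0.36 per sample. Storage accounts for most of the original bill, which is why the volume size matters more than the instance price here.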

Slide 22

Slide 22 text

WHO IS USING NEXTFLOW?

Slide 26

Slide 26 text

TAKE HOME MESSAGE
• Batch provides a truly scalable, elastic computing environment for containerised workloads
• Delegating cluster provisioning is a big plus
• Choose the size of EBS storage carefully
• Nextflow enables the seamless deployment of scalable and portable computational workflows

Slide 27

Slide 27 text

ACKNOWLEDGMENTS
Evan Floden
Emilio Palumbo
Cedric Notredame
Notredame Lab, CRG
Phil Ewels
Brendan Bouffler