Large scale genomics with Nextflow and AWS Batch

Paolo Di Tommaso
January 08, 2018

This presentation gives a short introduction to our experience deploying large-scale genomic pipelines with Nextflow and the AWS Batch cloud service.

  1. Paolo Di Tommaso, CRG (Barcelona)
    RCUK Cloud Workshop, 8 Jan 2018
  2. WHO IS THIS CHAP?
    @PaoloDiTommaso
    Research software engineer
    Comparative Bioinformatics, Notredame Lab
    Center for Genomic Regulation (CRG)
    Author of the Nextflow project
  3. GENOMIC WORKFLOWS
    • Data analysis applications to extract information from (large) genomic datasets
    • Mash-up of many different tools and scripts
    • Embarrassingly parallel; can spawn 100s to 100k jobs over a distributed cluster
    • Complex dependency trees and configuration → very fragile ecosystem
  4. AWS BATCH
    • Compute jobs in the cloud in a batch fashion (i.e. asynchronously)
    • It manages the provisioning and scaling of the cluster
    • It provides the concept of a queue
    • A job is a container (cool!)
  5. WHAT DOES A TASK LOOK LIKE?
    aws s3 cp s3://{bucket}/{sample}_r1.fq .
    aws s3 cp s3://{bucket}/{sample}_r2.fq .
    aws s3 sync s3://{bucket}/assets/{reference} .
    bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \
      | samtools sort -o {sample}.bam
    samtools index {sample}.bam
    aws s3 cp {sample}.bam s3://{bucket}/
    aws s3 cp {sample}.bam.bai s3://{bucket}/
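The placeholders in the task script ({bucket}, {sample}, {reference}, {cpus}) happen to match Python's `str.format` syntax, so one way to picture how a concrete script is produced per sample is the minimal sketch below. The bucket, sample and reference names are hypothetical values chosen for illustration; they do not come from the talk.

```python
# Template mirroring the task script on the slide; placeholders are filled per sample.
TASK_TEMPLATE = """\
aws s3 cp s3://{bucket}/{sample}_r1.fq .
aws s3 cp s3://{bucket}/{sample}_r2.fq .
aws s3 sync s3://{bucket}/assets/{reference} .
bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \\
  | samtools sort -o {sample}.bam
samtools index {sample}.bam
aws s3 cp {sample}.bam s3://{bucket}/
aws s3 cp {sample}.bam.bai s3://{bucket}/
"""


def render_task(bucket: str, sample: str, reference: str, cpus: int) -> str:
    """Return the shell script for one sample with all placeholders filled in."""
    return TASK_TEMPLATE.format(
        bucket=bucket, sample=sample, reference=reference, cpus=cpus
    )


# Hypothetical values, for illustration only.
script = render_task(bucket="my-bucket", sample="sampleA", reference="hg38", cpus=4)
print(script)
```

Note that five of the eight commands only move data between S3 and the local disk, which is the overhead the BOTTLENECKS slide refers to.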
  6. WHAT DOES A TASK LOOK LIKE?
    • Create a Docker image including the job script
    • Upload the Docker image to a public registry
    • Create a job template (job definition) referencing the uploaded image
    • Submit the job execution with the AWS command line tool
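To make the last two steps more concrete, the sketch below builds the two request bodies involved: one registering a job definition that points at the uploaded image, and one submitting a job to a queue. The field names follow the shape of the AWS Batch RegisterJobDefinition and SubmitJob requests as I understand them, so check them against the current API reference. The definition name `align-sample`, the script path `/opt/task.sh` and the `SAMPLE` environment variable are hypothetical; `user/image` and `my-queue` are reused from the configuration slides. Nothing is sent to AWS here.

```python
import json


def job_definition(name: str, image: str, vcpus: int, memory_mib: int) -> dict:
    """Request body registering a container job definition (the 'job template')."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            # Older-style resource fields; newer definitions may prefer
            # 'resourceRequirements' (assumption: verify for your setup).
            "vcpus": vcpus,
            "memory": memory_mib,  # MiB
            # Hypothetical location of the job script baked into the image.
            "command": ["bash", "/opt/task.sh"],
        },
    }


def submit_request(job_name: str, queue: str, definition: str, sample: str) -> dict:
    """Request body submitting one job; the sample ID is passed via the environment."""
    return {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "containerOverrides": {
            "environment": [{"name": "SAMPLE", "value": sample}],
        },
    }


definition = job_definition("align-sample", "user/image", vcpus=4, memory_mib=8192)
request = submit_request("align-sampleA", "my-queue", "align-sample", "sampleA")
print(json.dumps(definition, indent=2))
print(json.dumps(request, indent=2))
```

Written to files, such bodies could be passed to `aws batch register-job-definition --cli-input-json` and `aws batch submit-job --cli-input-json`. One request per sample is needed, which is where the orchestration burden starts.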
  7. BOTTLENECKS
    • The need to handle input downloads and output uploads reduces workflow portability
    • Custom container images are required (ideally we would like to use community container images, e.g. BioContainers)
    • Orchestrating big real-world workflows can be challenging
  8. TASK EXAMPLE
    bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }
  9. TASKS COMPOSITION
    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }

    process index_sample {
        input:
        file 'sample.bam' from bam_ch

        output:
        // samtools index writes sample.bam.bai by default
        file 'sample.bam.bai' into bai_ch

        script:
        """
        samtools index sample.bam
        """
    }
  10. REACTIVE NETWORK
    • Declarative computational model for parallel process executions
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate by using dataflow variables, i.e. async FIFO queues called channels
    • Parallelisation and task dependencies are implicitly defined by process in/out declarations
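The dataflow idea can be illustrated with a toy model in which channels are `asyncio.Queue` objects and each process starts one task per item it receives. This is only a sketch of the concept, not how Nextflow is implemented; the `align` and `index` coroutines are stand-ins that sleep instead of running `bwa` or `samtools`, and the `DONE` sentinel is an assumption of this sketch.

```python
import asyncio

DONE = object()  # sentinel that closes a channel (assumption of this sketch)


async def process(inbox, outbox, task):
    """Start one task per item arriving on `inbox`; emit each result on `outbox`."""

    async def run(item):
        await outbox.put(await task(item))

    running = []
    while (item := await inbox.get()) is not DONE:
        running.append(asyncio.create_task(run(item)))  # tasks run concurrently
    await asyncio.gather(*running)
    await outbox.put(DONE)  # close the output channel once all tasks finished


async def align(fastq):
    await asyncio.sleep(0.01)  # stand-in for `bwa mem ... | samtools sort`
    return fastq.replace(".fq", ".bam")


async def index(bam):
    await asyncio.sleep(0.01)  # stand-in for `samtools index`
    return bam + ".bai"


async def main(samples):
    reads_ch, bam_ch, bai_ch = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    # The channel wiring is the only place where the align -> index
    # dependency is expressed.
    stages = [
        asyncio.create_task(process(reads_ch, bam_ch, align)),
        asyncio.create_task(process(bam_ch, bai_ch, index)),
    ]
    for sample in samples:
        await reads_ch.put(sample)
    await reads_ch.put(DONE)

    results = []
    while (item := await bai_ch.get()) is not DONE:
        results.append(item)
    await asyncio.gather(*stages)
    return sorted(results)


indexes = asyncio.run(main(["a.fq", "b.fq", "c.fq"]))
print(indexes)
```

Neither stage mentions the other or the number of samples. Execution order and parallelism follow from which queue each stage reads from and writes to.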
  11. PORTABILITY
    process {
        executor = 'slurm'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }
  12. PORTABILITY
    process {
        executor = 'awsbatch'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }
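The point of the two PORTABILITY slides is that only the executor line changes between a Slurm cluster and AWS Batch. The short sketch below renders both configuration scopes from the same settings and lists the lines that differ. The helper `process_config` is hypothetical and exists only for this illustration.

```python
# Settings shared by both slides.
SETTINGS = {"queue": "my-queue", "memory": "8 GB", "cpus": 4, "container": "user/image"}


def process_config(executor: str) -> str:
    """Render a `process { ... }` configuration scope for the given executor."""
    lines = [f"    executor = '{executor}'"]
    for key, value in SETTINGS.items():
        # Integers are written bare, strings are single-quoted.
        rendered = value if isinstance(value, int) else f"'{value}'"
        lines.append(f"    {key} = {rendered}")
    return "process {\n" + "\n".join(lines) + "\n}"


slurm = process_config("slurm")
batch = process_config("awsbatch")

# Collect the lines that differ between the two rendered configurations.
changed = [(a, b) for a, b in zip(slurm.splitlines(), batch.splitlines()) if a != b]
print(batch)
print(changed)
```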
  13. AWS BATCH BENCHMARK
    • RNA-Seq quantification pipeline *
    • 375 samples taken from the ENCODE project
    • 753 jobs
    • ~65 i3.xlarge spot instances
    • ~23 h wall-time
    • ~4'850 CPU-hours
    * https://github.com/nextflow-io/rnaseq-encode-nf
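A few derived figures help put the benchmark in proportion. The sketch below uses only the approximate numbers from the slide plus the fact that an i3.xlarge instance has 4 vCPUs. It assumes all ~65 instances were up for the whole ~23 h, which overstates the available capacity, so the utilisation figure is a rough lower bound rather than a measurement.

```python
samples = 375
jobs = 753
instances = 65            # approximate fleet size from the slide
wall_time_h = 23          # approximate wall-time from the slide
cpu_hours_used = 4850     # approximate CPU-hours from the slide
VCPUS_PER_I3_XLARGE = 4   # size of an i3.xlarge instance

jobs_per_sample = jobs / samples
cpu_hours_per_job = cpu_hours_used / jobs

# Upper bound on capacity: assumes every instance ran for the full wall-time.
vcpu_hours_available = instances * wall_time_h * VCPUS_PER_I3_XLARGE
utilisation = cpu_hours_used / vcpu_hours_available

print(f"jobs per sample:      {jobs_per_sample:.2f}")
print(f"CPU-hours per job:    {cpu_hours_per_job:.1f}")
print(f"vCPU-hours available: {vcpu_hours_available}")
print(f"utilisation (approx): {utilisation:.0%}")
```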
  14. THE BILL
    • $310 (~$0.83 per sample)
    • $90 EC2 spot instances (1'400 hrs x i3.xlarge)
    • $220 EBS storage (1 TB x ~1'400 hrs)
    • Choosing a better sized EBS volume (~200 GB) ⤇ $135 ⤇ ~$0.36 per sample
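The effect of the EBS volume size can be checked with simple arithmetic. The sketch below derives effective hourly rates from the bill itself (these are not official AWS prices) and assumes that EBS cost scales linearly with volume size and that 1 TB is 1000 GB.

```python
samples = 375
instance_hours = 1400     # i3.xlarge instance-hours on the bill
ec2_cost = 90.0           # USD, spot instances
ebs_cost_1tb = 220.0      # USD, one 1 TB volume per instance

# Effective rates implied by the bill (assumption: not official AWS prices).
spot_rate = ec2_cost / instance_hours               # USD per instance-hour
ebs_rate = ebs_cost_1tb / (1000 * instance_hours)   # USD per GB-hour


def total_cost(volume_gb: float) -> float:
    """Total bill if each instance carried an EBS volume of `volume_gb` GB."""
    return ec2_cost + ebs_rate * volume_gb * instance_hours


for gb in (1000, 200):
    cost = total_cost(gb)
    print(f"{gb:>4} GB volume: ${cost:.0f} total, ${cost / samples:.2f} per sample")
print(f"implied spot rate: ${spot_rate:.3f} per instance-hour")
```

Under these assumptions the 200 GB scenario comes to about $134 in total, in line with the ~$135 quoted on the slide. Storage, not compute, dominates the original bill.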
  15. TAKE HOME MESSAGE
    • Batch provides a truly scalable elastic computing environment for containerised workloads
    • Delegating the cluster provisioning is a big plus
    • Choose the size of EBS storage carefully
    • Nextflow enables the seamless deployment of scalable and portable computational workflows