Large scale genomics with Nextflow and AWS Batch

Paolo Di Tommaso
January 08, 2018

This presentation gives a short introduction to our experience deploying large-scale genomic pipelines with Nextflow and the AWS Batch cloud service.

Transcript

  1. LARGE SCALE GENOMICS WITH NEXTFLOW AND AWS BATCH
    Paolo Di Tommaso, CRG (Barcelona)
    RCUK Cloud Workshop, 8 Jan 2018
  2. WHO IS THIS CHAP?
    @PaoloDiTommaso
    Research software engineer
    Comparative Bioinformatics, Notredame Lab
    Center for Genomic Regulation (CRG)
    Author of Nextflow project
  3. GENOMIC WORKFLOWS
    • Data analysis applications to extract information from (large) genomic datasets
    • Mash-up of many different tools and scripts
    • Embarrassingly parallel: can spawn 100s-100k jobs over a distributed cluster
    • Complex dependency trees and configuration → very fragile ecosystem
  4. AWS BATCH
    • Compute jobs in the cloud in a batch fashion (i.e. asynchronously)
    • It manages the provisioning and scaling of the cluster
    • It provides the concept of a job queue
    • A job is a container (cool!)
  5. WHAT DOES A TASK LOOK LIKE?
        aws s3 cp s3://{bucket}/{sample}_r1.fq .
        aws s3 cp s3://{bucket}/{sample}_r2.fq .
        aws s3 sync s3://{bucket}/assets/{reference} .
        bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \
          | samtools sort -o {sample}.bam
        samtools index {sample}.bam
        aws s3 cp {sample}.bam s3://{bucket}/
        aws s3 cp {sample}.bam.bai s3://{bucket}/
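The placeholders in the script above suggest a template that is filled in once per sample. A minimal Python sketch of that idea follows. The template text mirrors the slide; the function name `render_task` and the example values (`my-bucket`, `sampleA`, `genome`) are hypothetical and not taken from the talk.

```python
# Illustrative only: the template mirrors the slide, while the function
# name and the example values below are assumptions.
TASK_TEMPLATE = """\
aws s3 cp s3://{bucket}/{sample}_r1.fq .
aws s3 cp s3://{bucket}/{sample}_r2.fq .
aws s3 sync s3://{bucket}/assets/{reference} .
bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \\
  | samtools sort -o {sample}.bam
samtools index {sample}.bam
aws s3 cp {sample}.bam s3://{bucket}/
aws s3 cp {sample}.bam.bai s3://{bucket}/
"""


def render_task(bucket: str, sample: str, reference: str, cpus: int) -> str:
    """Fill the per-sample placeholders and return the shell script as text."""
    return TASK_TEMPLATE.format(
        bucket=bucket, sample=sample, reference=reference, cpus=cpus
    )


if __name__ == "__main__":
    # Hypothetical values, printed for inspection only.
    print(render_task("my-bucket", "sampleA", "genome", 4))
```

Every sample needs its own rendered script plus explicit download and upload steps, which is the manual bookkeeping the later slides aim to remove.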
  6. WHAT DOES A TASK LOOK LIKE?
    • Create a Docker image including the job script
    • Upload the Docker image to a public registry
    • Create a job template referencing the uploaded image
    • Submit the job execution with the AWS command line tool
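The last two steps above correspond to two AWS Batch API requests: registering a job definition and submitting a job. The sketch below only builds the request payloads as Python dictionaries and prints them as JSON, so it runs without AWS credentials. The image name `user/image`, the job and queue names, the `run_task.sh` entry point and the resource values are hypothetical. With the `boto3` library installed, the same dictionaries could be passed as keyword arguments to the Batch client's `register_job_definition` and `submit_job` methods.

```python
import json


def job_definition(name: str, image: str, vcpus: int, memory_mib: int) -> dict:
    """Build a RegisterJobDefinition request for a container job."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            "vcpus": vcpus,
            "memory": memory_mib,  # expressed in MiB
            # "Ref::sample" is substituted with the "sample" parameter
            # supplied at submission time. run_task.sh is a hypothetical
            # script baked into the image.
            "command": ["run_task.sh", "Ref::sample"],
        },
    }


def job_submission(job_name: str, queue: str, definition: str, sample: str) -> dict:
    """Build a SubmitJob request that runs the definition for one sample."""
    return {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "parameters": {"sample": sample},
    }


if __name__ == "__main__":
    definition = job_definition("align", "user/image", vcpus=4, memory_mib=8192)
    submission = job_submission("align-sampleA", "my-queue", "align", "sampleA")
    print(json.dumps(definition, indent=2))
    print(json.dumps(submission, indent=2))
```

The `vcpus` and `memory` fields match the API as it stood around the time of the talk; more recent versions of the API favour `resourceRequirements` for the same purpose.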
  7. Companion parasite genome annotation pipeline, Steinbiss et al., DOI: 10.1093/nar/gkw292

  8. BOTTLENECKS
    • The need to handle input downloads and output uploads reduces workflow portability
    • Custom container images (ideally we would like to use community container images, e.g. BioContainers)
    • Orchestrating big real-world workflows can be challenging
  9. Orchestration & Parallelisation
    Scalability & Portability
    Deployment & Reproducibility
    containers, Git, GitHub
  10. TASK EXAMPLE
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
  11. TASK EXAMPLE
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam

        process align_sample {
          input:
          file 'reference.fa' from genome_ch
          file 'sample.fq' from reads_ch

          output:
          file 'sample.bam' into bam_ch

          script:
          """
          bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam
          """
        }
  12. TASK COMPOSITION
        process align_sample {
          input:
          file 'reference.fa' from genome_ch
          file 'sample.fq' from reads_ch

          output:
          file 'sample.bam' into bam_ch

          script:
          """
          bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam
          """
        }

        process index_sample {
          input:
          file 'sample.bam' from bam_ch

          output:
          // samtools index writes sample.bam.bai by default
          file 'sample.bam.bai' into bai_ch

          script:
          """
          samtools index sample.bam
          """
        }
  13. REACTIVE NETWORK
    • Declarative computational model for parallel process executions
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate through dataflow variables, i.e. async FIFO queues called channels
    • Parallelisation and task dependencies are implicitly defined by the process in/out declarations
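The dataflow model described above can be imitated in a few lines of Python using threads and FIFO queues. This is a toy sketch of the idea, not Nextflow's implementation: the `process` and `run_pipeline` helpers, the `DONE` sentinel and the string-renaming "tasks" are all hypothetical stand-ins, and each process here handles its inputs one at a time, whereas Nextflow can run many task instances in parallel.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel


def process(task, source: queue.Queue, sink: queue.Queue) -> threading.Thread:
    """Apply `task` to every item read from `source` and emit results to `sink`."""

    def worker():
        while True:
            item = source.get()  # blocks until an input is ready
            if item is DONE:
                sink.put(DONE)   # propagate end-of-stream downstream
                return
            sink.put(task(item))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def run_pipeline(samples):
    """Wire align -> index through channels and collect the final outputs."""
    reads_ch, bam_ch, bai_ch = queue.Queue(), queue.Queue(), queue.Queue()
    # No explicit ordering is declared: index depends on align only
    # because it reads the channel that align writes to.
    workers = [
        process(lambda fq: fq.replace(".fq", ".bam"), reads_ch, bam_ch),
        process(lambda bam: bam + ".bai", bam_ch, bai_ch),
    ]
    for sample in samples:
        reads_ch.put(sample)
    reads_ch.put(DONE)
    for w in workers:
        w.join()
    results = []
    while True:
        item = bai_ch.get()
        if item is DONE:
            return results
        results.append(item)


if __name__ == "__main__":
    print(run_pipeline(["a.fq", "b.fq"]))
```

The point of the design is that the pipeline author never writes a scheduler: connecting outputs to inputs is enough to define both the execution order and the available parallelism.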
  14. PORTABILITY

  15. PORTABILITY
        process {
          executor = 'slurm'
          queue = 'my-queue'
          memory = '8 GB'
          cpus = 4
          container = 'user/image'
        }
  16. PORTABILITY
        process {
          executor = 'awsbatch'
          queue = 'my-queue'
          memory = '8 GB'
          cpus = 4
          container = 'user/image'
        }
  17. SHOW ME THE CODE!

  18. AWS BATCH BENCHMARK
    • RNA-Seq quantification pipeline *
    • 375 samples taken from the ENCODE project
    • 753 jobs
    • ~65 i3.xlarge spot instances
    • ~23 h wall-time
    • ~4'850 CPU-hours
    * https://github.com/nextflow-io/rnaseq-encode-nf
  19. None
  20. None
  21. THE BILL
    • $310 (~$1.2 per sample)
    • $90 EC2 spot instances (1'400 hrs x i3.xlarge)
    • $220 EBS storage (1 TB x ~1'400 hrs)
    • Choosing a better-sized EBS volume (~200 GB) ⤇ $135 ⤇ ~$0.36 per sample
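The figures above can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the slides do not state: that EBS cost scales linearly with volume size, and that the per-sample figure is the total divided by the 375 samples of the benchmark. The function names are hypothetical.

```python
SAMPLES = 375          # samples in the benchmark
EC2_COST = 90.0        # USD, spot instances
EBS_COST_1TB = 220.0   # USD, with 1 TB volumes


def bill(ebs_gb: float) -> float:
    """Total cost in USD, assuming EBS cost is proportional to volume size."""
    return EC2_COST + EBS_COST_1TB * ebs_gb / 1000.0


def per_sample(total: float) -> float:
    """Cost per sample, assuming all 375 samples share the bill equally."""
    return total / SAMPLES


if __name__ == "__main__":
    for size_gb in (1000, 200):
        total = bill(size_gb)
        print(f"{size_gb} GB EBS: ${total:.0f} total, ${per_sample(total):.2f} per sample")
```

Under these assumptions the resized bill comes to about $134, or roughly $0.36 per sample, in line with the slide. The same division applied to the $310 total gives about $0.83 per sample rather than the quoted ~$1.2, so that first per-sample figure may rest on a different basis.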
  22. WHO IS USING NEXTFLOW?

  23. None
  24. None
  25. None
  26. TAKE HOME MESSAGE
    • Batch provides a truly scalable, elastic computing environment for containerised workloads
    • Delegating the cluster provisioning is a big plus
    • Choose the size of EBS storage carefully
    • Nextflow enables the seamless deployment of scalable and portable computational workflows
  27. ACKNOWLEDGMENT
    Evan Floden
    Emilio Palumbo
    Cedric Notredame
    Notredame Lab, CRG
    Phil Ewels
    Brendan Bouffler