
Large scale genomics with Nextflow and AWS Batch

Paolo Di Tommaso
January 08, 2018

This presentation gives a short introduction to our experience deploying large-scale genomic pipelines with Nextflow and the AWS Batch cloud service.

Transcript

  1. LARGE SCALE GENOMICS
    WITH NEXTFLOW
    AND AWS BATCH
    Paolo Di Tommaso, CRG (Barcelona)
    RCUK Cloud Workshop, 8 Jan 2018


  2. WHO IS THIS CHAP?
    @PaoloDiTommaso
    Research software engineer
    Comparative Bioinformatics, Notredame Lab
    Center for Genomic Regulation (CRG)
    Author of Nextflow project


  3. GENOMIC WORKFLOWS
    • Data analysis applications to extract information from
    (large) genomic datasets
    • Mash-up of many different tools and scripts
    • Embarrassingly parallel: can spawn 100s-100k
    jobs over a distributed cluster
    • Complex dependency trees and configuration → very
    fragile ecosystem


  4. AWS BATCH
    • Compute jobs in the cloud in a batch fashion (i.e.
    asynchronously)
    • It manages the provisioning and scaling of the cluster
    • It provides the concept of job queues
    • A job is a container (cool!)


  5. WHAT DOES A TASK LOOK LIKE?
    aws s3 cp s3://{bucket}/{sample}_r1.fq .
    aws s3 cp s3://{bucket}/{sample}_r2.fq .
    aws s3 sync s3://{bucket}/assets/{reference} .
    bwa mem -t {cpus} {reference}.fa {sample}_r1.fq {sample}_r2.fq \
    | samtools sort -o {sample}.bam
    samtools index {sample}.bam
    aws s3 cp {sample}.bam s3://{bucket}/
    aws s3 cp {sample}.bam.bai s3://{bucket}/


  6. WHAT DOES A TASK LOOK LIKE?
    • Create a Docker image including the job script
    • Upload the Docker image to a public registry
    • Create a job definition referencing the uploaded image
    • Submit the job with the AWS command-line tool

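The steps above can be sketched with the AWS CLI. This is a minimal illustration, not the exact commands from the talk: the names `my-job-def`, `my-queue`, `align-sample-1`, the image `user/image`, and the `run.sh` entry point are all placeholders.

    # Register a job definition pointing at the uploaded container image
    aws batch register-job-definition \
        --job-definition-name my-job-def \
        --type container \
        --container-properties '{"image": "user/image", "vcpus": 4, "memory": 8000, "command": ["run.sh"]}'

    # Submit an execution of that definition to a job queue
    aws batch submit-job \
        --job-name align-sample-1 \
        --job-queue my-queue \
        --job-definition my-job-def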

  7. Companion parasite genome annotation pipeline, Steinbiss et al., DOI: 10.1093/nar/gkw292


  8. BOTTLENECKS
    • The need to handle input downloads and
    output uploads reduces workflow portability
    • Custom container images are required (ideally we would
    like to use community container images, e.g. BioContainers)
    • Orchestrating big real-world workflows can be
    challenging


  9. Orchestration & Parallelisation
    Scalability & Portability
    Deployment & Reproducibility
    (containers, Git, GitHub)


  10. TASK EXAMPLE
    bwa mem reference.fa sample.fq \
    | samtools sort -o sample.bam


  11. TASK EXAMPLE
    bwa mem reference.fa sample.fq \
    | samtools sort -o sample.bam

    process align_sample {
      input:
      file 'reference.fa' from genome_ch
      file 'sample.fq' from reads_ch
      output:
      file 'sample.bam' into bam_ch
      script:
      """
      bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam
      """
    }


  12. TASKS COMPOSITION
    process align_sample {
      input:
      file 'reference.fa' from genome_ch
      file 'sample.fq' from reads_ch
      output:
      file 'sample.bam' into bam_ch
      script:
      """
      bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam
      """
    }

    process index_sample {
      input:
      file 'sample.bam' from bam_ch
      output:
      file 'sample.bai' into bai_ch
      script:
      """
      samtools index sample.bam
      """
    }


  13. REACTIVE NETWORK
    • Declarative computational model for parallel
    process execution
    • Processes wait for data; when an input set is
    ready, the process is executed
    • Processes communicate through dataflow variables,
    i.e. async FIFO queues called channels
    • Parallelisation and task dependencies are
    implicitly defined by process in/out declarations

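A minimal sketch of this model (the file glob and channel/process names below are illustrative, not from the talk): a channel is populated with files from disk, and any process declaring it as input is executed once per value it receives, in parallel.

    // one channel item per FASTQ file matching the glob
    reads_ch = Channel.fromPath('data/*.fq')

    // runs once for each file emitted by reads_ch
    process count_reads {
      input:
      file 'sample.fq' from reads_ch
      output:
      stdout into counts_ch
      script:
      """
      wc -l < sample.fq
      """
    }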

  14. PORTABILITY


  15. PORTABILITY
    process {
      executor = 'slurm'
      queue = 'my-queue'
      memory = '8 GB'
      cpus = 4
      container = 'user/image'
    }


  16. PORTABILITY
    process {
      executor = 'awsbatch'
      queue = 'my-queue'
      memory = '8 GB'
      cpus = 4
      container = 'user/image'
    }

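The two configurations above can also live side by side in a single nextflow.config using Nextflow's configuration profiles; the profile names here are arbitrary, and the desired one is selected at launch time with `-profile`.

    profiles {
      cluster {
        process.executor = 'slurm'
        process.queue = 'my-queue'
      }
      cloud {
        process.executor = 'awsbatch'
        process.queue = 'my-queue'
      }
    }

    // launch with: nextflow run main.nf -profile cloud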

  17. SHOW ME THE CODE !


  18. AWS BATCH BENCHMARK
    • RNA-Seq quantification pipeline *
    • 375 samples taken from the ENCODE project
    • 753 jobs
    • ~65 i3.xlarge spot instances
    • ~23 h wall-time, ~4'850 CPU-hours
    * https://github.com/nextflow-io/rnaseq-encode-nf


  21. THE BILL
    • $310 (~$0.83 per sample)
    • $90 EC2 spot instances (1'400 hrs x i3.xlarge)
    • $220 EBS storage (1 TB x ~1'400 hrs)
    • Choosing a better-sized EBS volume (~200 GB)

    ⤇ $135 ⤇ ~$0.36 per sample

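The resized-volume figure can be checked with a quick back-of-envelope calculation, assuming the EBS cost scales linearly with provisioned capacity:

```python
# Figures from the slide
ec2_spot = 90.0    # EC2 spot: 1'400 instance-hours of i3.xlarge
ebs_1tb = 220.0    # 1 TB of EBS kept for ~1'400 hours
samples = 375

# Scale the EBS cost down to a ~200 GB volume (linear in capacity)
ebs_resized = ebs_1tb * 200 / 1000

total = ec2_spot + ebs_resized
per_sample = total / samples
print(round(total), round(per_sample, 2))  # 134 0.36
```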

  22. WHO IS USING NEXTFLOW?


  26. TAKE HOME MESSAGE
    • AWS Batch provides a truly scalable, elastic computing
    environment for containerised workloads
    • Delegating the cluster provisioning is a big plus
    • Choose the size of the EBS storage carefully
    • Nextflow enables the seamless deployment of
    scalable and portable computational workflows


  27. ACKNOWLEDGMENT
    Evan Floden
    Emilio Palumbo
    Cedric Notredame
    Notredame Lab, CRG
    Phil Ewels
    Brendan Bouffler
