Nextflow: Reproducible computational workflows across clouds and clusters

A seminar for the National Cancer Institute "Containers and Workflows Interest Group".

10th August 2018

Evan Floden

Transcript

  1. NEXTFLOW: REPRODUCIBLE COMPUTATIONAL WORKFLOWS ACROSS CLOUDS AND CLUSTERS
     NCI Containers and Workflows Interest Group Seminar
     Evan Floden, 10 August 2018
  2. AGENDA
     • The challenges with computational workflows
     • Nextflow main principles
     • Handling parallelisation and portability
     • Deployment scenarios
     • Comparison with other tools
     • Future plans
  3. GENOMIC WORKFLOWS
     • Data analysis applications perform computation to generate information from (large) genomic datasets
     • Embarrassingly parallel: can spawn 100s-100k jobs over a distributed cluster
     • Mash-up of many different tools and scripts (dependencies!)
     • Complex dependency trees and configuration → very fragile ecosystem
  4. A LOT OF MOVING PARTS
     • 70 tasks
     • 55 external scripts
     • 39 software tools & libraries
  5. To reproduce the result of a typical computational biology paper requires 280 hours (≈1.7 months!)
  6. Comparison of the Companion pipeline annotation of the Leishmania infantum genome executed across different platforms*

     Platform                              Amazon Linux   Debian Linux   Mac OSX
     Number of chromosomes                 36             36             36
     Overall length (bp)                   32,032,223     32,032,223     32,032,223
     Number of genes                       7,781          7,783          7,771
     Gene density                          236.64         236.64         236.32
     Number of coding genes                7,580          7,580          7,570
     Average coding length (bp)            1,764          1,764          1,762
     Number of genes with multiple CDS     113            113            111
     Number of genes with known function   4,147          4,147          4,142
     Number of t-RNAs                      88             90             88

     * Di Tommaso P, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology, 2017
  7. CHALLENGES
     • Reproducibility: replicate results over time
     • Portability: run across different platforms
     • Scalability: i.e. deploy big distributed workloads
     • Usability: streamline execution and deployment of complex workflows
     • Consistency: track changes and revisions consistently for code, config files and binary dependencies
  8. HOW?
     • Fast prototyping ⇒ custom DSL that enables task composition and simplifies most use cases, plus a general purpose programming language for corner cases
     • Easy parallelisation ⇒ declarative reactive programming model based on the dataflow paradigm; implicit, portable parallelism
     • Self-contained ⇒ functional approach: a task execution is idempotent, i.e. it cannot modify the state of other tasks; dependencies are isolated with containers
     • Portable deployments ⇒ executor abstraction layer, separating the deployment configuration from the implementation logic
  9. TASK EXAMPLE

     bwa mem reference.fa sample.fq | samtools sort -o sample.bam

     process align_sample {
       input:
         file 'reference.fa' from genome_ch
         file 'sample.fq' from reads_ch
       output:
         file 'sample.bam' into bam_ch
       script:
         """
         bwa mem reference.fa sample.fq | samtools sort -o sample.bam
         """
     }
  10. TASKS COMPOSITION

      process align_sample {
        input:
          file 'reference.fa' from genome_ch
          file 'sample.fq' from reads_ch
        output:
          file 'sample.bam' into bam_ch
        script:
          """
          bwa mem reference.fa sample.fq | samtools sort -o sample.bam
          """
      }

      process index_sample {
        input:
          file 'sample.bam' from bam_ch
        output:
          file 'sample.bai' into bai_ch
        script:
          """
          samtools index sample.bam
          """
      }
  11. DATAFLOW
      • Declarative computational model for parallel process executions
      • Processes wait for data; when an input set is ready the process is executed
      • Processes communicate using dataflow variables, i.e. async FIFO queues called channels
      • Parallelisation and task dependencies are implicitly defined by the process input/output declarations
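      For illustration, a minimal channel sketch (not from the slides; the values and operators are chosen arbitrarily) showing how values flow through an async queue and are consumed as they arrive:

      num_ch = Channel.from(1, 2, 3)      // async FIFO queue (dataflow channel)
      num_ch
          .map { it * it }                // fires for each value as it arrives
          .subscribe { println it }       // consumes the results asynchronously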
  12. HOW PARALLELISATION WORKS
      [Diagram: a channel feeds data x, y and z into a process, which spawns tasks 1, 2 and 3 and emits out x, y and z]
  13. HOW PARALLELISATION WORKS

      samples_ch = Channel.fromPath('data/sample.fastq')

      process FASTQC {
        input:
          file reads from samples_ch
        output:
          file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
      }
  14. HOW PARALLELISATION WORKS

      samples_ch = Channel.fromPath('data/*.fastq')

      process FASTQC {
        input:
          file reads from samples_ch
        output:
          file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
      }
  15. HANDLING FILE PAIRS

      gut_1.fq  gut_2.fq  liver_1.fq  liver_2.fq  lung_1.fq  lung_2.fq

      Channel.fromFilePairs("*_{1,2}.fq")

      ( gut,   [gut_1.fq, gut_2.fq] )
      ( lung,  [lung_1.fq, lung_2.fq] )
      ( liver, [liver_1.fq, liver_2.fq] )
  16. process FASTQC {
        input:
          set pair_id, file(reads) from samples_ch
        output:
          file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
      }

      ( gut,   [gut_1.fq, gut_2.fq] )
      ( lung,  [lung_1.fq, lung_2.fq] )
      ( liver, [liver_1.fq, liver_2.fq] )
  17. LOCAL EXECUTION
      • Common development scenario
      • Dependencies can be managed using a container runtime (docker/singularity)
      • Parallelisation is managed by spawning POSIX processes
      • Can scale vertically using a fat server / shared-memory machine
      [Diagram: nextflow running on a laptop / workstation, on top of the OS, local storage and docker/singularity]
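      As a hedged example, a typical local development run with containerised dependencies (the script and image names are placeholders):

      nextflow run main.nf -with-docker my/image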
  18. CENTRALISED ORCHESTRATION
      • Nextflow orchestrates the workflow execution, submitting jobs to a compute cluster, e.g. SLURM
      • It can run on the head node or a compute node
      • Requires a shared storage (NFS/Lustre) to exchange data between tasks
      • Ideal for coarse-grained parallelism
      [Diagram: nextflow submits jobs to the cluster nodes, which share NFS/Lustre storage]
  19. DISTRIBUTED ORCHESTRATION
      • A single job request allocates the desired compute nodes
      • Nextflow deploys its own embedded compute cluster
      • The main instance orchestrates the workflow execution
      • The worker instances execute the workflow jobs (work-stealing approach)
      [Diagram: from the login node, a launcher wrapper submits a job request; a nextflow driver and several nextflow workers run on the allocated HPC cluster nodes over NFS/Lustre]
  20. CLOUD DEPLOYMENT
      [Diagram: the NF driver on a master node and NF daemons on worker nodes (each a JVM in Docker) form the cloud cluster, sharing AWS EFS storage; the application code is pulled from a GitHub repo and the dependencies are deployed from Docker Hub]
  21. KUBERNETES
      • Next generation native cloud clustering for containerised workloads
      • Workflow orchestration is still needed
      • The latest NF version includes a new command that streamlines workflow deployment to K8s
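      A minimal sketch of that deployment path, assuming a kubectl context is already configured (the pipeline shown is Nextflow's demo repository):

      nextflow kuberun nextflow-io/hello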
  22. PORTABILITY

      process {
        executor  = 'slurm'
        queue     = 'my-queue'
        memory    = '8 GB'
        cpus      = 4
        container = 'user/image'
      }
  23. PORTABILITY

      process {
        executor  = 'awsbatch'
        queue     = 'my-queue'
        memory    = '8 GB'
        cpus      = 4
        container = 'user/image'
      }
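      The two configurations above differ only in the executor, so they can live side by side as named configuration profiles; a sketch with assumed profile names 'cluster' and 'cloud':

      profiles {
        cluster {
          process.executor = 'slurm'       // submit jobs to the SLURM cluster
          process.queue    = 'my-queue'
        }
        cloud {
          process.executor = 'awsbatch'    // submit the same jobs to AWS Batch
          process.queue    = 'my-queue'
        }
      }

      The target platform is then picked at launch time, e.g. nextflow run main.nf -profile cloud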
  24. GALAXY vs. NEXTFLOW

      Nextflow:
      • Command line oriented tool
      • Can incorporate any tool w/o any extra adapter
      • Fine control over task parallelisation
      • Scalability: 100 ⇒ 1M jobs
      • One-liner installer
      • Suited for production workflows + bioinformaticians

      Galaxy:
      • Web based platform
      • Built-in integration with many tools and datasets
      • Little control over task parallelisation
      • Scalability: 10 ⇒ 1K jobs
      • Complex installation and maintenance
      • Suited for training + less experienced bioinformaticians
  25. SNAKEMAKE vs. NEXTFLOW

      Nextflow:
      • Command line oriented tool
      • Push model
      • Can manage any data structure
      • Computes the DAG at runtime
      • Built-in support for clusters and cloud
      • No support (yet) for sub-workflows
      • Built-in support for Git/GitHub, etc. to manage pipeline revisions
      • Groovy/JVM based

      Snakemake:
      • Command line oriented tool
      • Pull model
      • Rules defined using file name patterns
      • Computes the DAG ahead of execution
      • Custom scripts for cluster deployments
      • Support for sub-workflows
      • No support for source code management systems
      • Python based
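      For example, Nextflow can fetch and run a pipeline directly from GitHub at a given revision (the tag shown is illustrative):

      nextflow run nextflow-io/hello -r v1.1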
  26. CWL vs. NEXTFLOW

      Nextflow:
      • Language + application runtime
      • DSL on top of a general purpose programming language
      • Concise, fluent (at least it tries to be!)
      • Single implementation, quick iterations

      CWL:
      • Language specification
      • Declarative meta-language (YAML/JSON)
      • Verbose
      • Many vendors/implementations (and specification versions)
  27. CONTAINER vs. VM
      • Lighter: MB vs GB
      • Faster startup: ms/secs vs minutes
      • Virtualise a process/application instead of an OS/hardware
      • Immutable: they don't change over time, thus guaranteeing replicability across executions
      • Composable: the output of one container is directly consumable as input by another container
      • Transparent: they are created with a well defined, automated procedure
  28. CONTAINERISATION
      • Nextflow envisioned the use of software containers to fix computational reproducibility
      • Mar 2014 (ver 0.7): support for Docker
      • Dec 2016 (ver 0.23): support for Singularity
      [Diagram: Nextflow runs each job in its own container]
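      In practice either runtime can be enabled with a command line flag, assuming a container image is declared in the pipeline configuration (the script name is a placeholder):

      nextflow run main.nf -with-docker        # Docker (since ver 0.7)
      nextflow run main.nf -with-singularity   # Singularity (since ver 0.23)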
  29. BENCHMARK*
      Container execution can have an impact on short running tasks, i.e. < 1 min

      * Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273. https://dx.doi.org/10.7717/peerj.1273
  30. BEST PRACTICES
      • Helps to isolate dependencies from the dev or local deployment environment
      • Provides a reproducible sandbox for third party users
      • Binary images protect against software decay
      • Make it transparent, i.e. always include the Dockerfile
      • The Docker image format is the de-facto standard; it can be executed by different runtimes, e.g. Singularity, Shifter, uDocker, etc.
  31. NF-CORE
      • Community effort to collect production ready analysis pipelines built with Nextflow
      • Initially supported by SciLifeLab, QBiC and A*STAR Genome Institute Singapore
      • https://nf-co.re
      Alexander Peltzer, Phil Ewels, Andreas Wilm, Maxime Garcia + others
  32. IMPROVEMENTS
      • Better meta-data and provenance handling
      • Nextflow server for real-time monitoring
      • Workflow composition aka sub-workflows
  33. DATA INTEGRATION
      • Allow direct queries over SQL/NoSQL data sources
      • Uniform access to local and remote big-data repositories such as Athena, DynamoDB, BigQuery, etc.
      • Query results are mapped to the Nextflow channel structure, triggering process executions
  34. APACHE SPARK
      • Native support for Apache Spark clusters and execution model
      • Allow hybrid Nextflow and Spark applications
      • Mix the best of the two worlds: Nextflow for legacy tools and coarse-grained parallelisation, Spark for fine-grained/distributed execution, e.g. GATK4
  35. NOTEBOOKS
      • Combine interactive computing with the heavy CPU work
      • Jupyter implementation (iNextflow)
      • RMarkdown
  36. GA4GH
      • Participate in the Cloud Work Stream working group
      • TES: Task Execution API (working prototype)
      • WES: Workflow Execution API
      • Enable interoperability with GA4GH compliant platforms, e.g. Cancer Genomics Cloud and Broad FireCloud
  37. CONCLUSION
      • Data analysis reproducibility is hard and it's often underestimated.
      • Nextflow does not provide a magic solution, but it enables best practices and provides support for community and industry standards.
      • It strictly separates the application logic from the configuration and deployment logic, enabling self-contained workflows.
      • Applications can be deployed across different environments in a reproducible manner with a single command.
      • The functional/reactive model allows applications to scale to millions of jobs with ease.