
Enabling reproducible in-silico data analyses with Nextflow

Paolo Di Tommaso

May 01, 2018




  1. ENABLING REPRODUCIBLE IN-SILICO DATA ANALYSES WITH NEXTFLOW
     Paolo Di Tommaso, CRG
     Wellcome Trust Sanger Institute, 1 May 2018, Cambridge


  2. WHO IS THIS CHAP?
    @PaoloDiTommaso
    Research software engineer
    Comparative Bioinformatics, Notredame Lab
    Center for Genomic Regulation (CRG)
    Author of the Nextflow project


  3. AGENDA
    • The challenges with computational workflows
    • Nextflow main principles
    • Handling parallelisation and portability
    • Deployment scenarios
    • Comparison with other tools
    • Future plans


  4. GENOMIC WORKFLOWS
    • Data analysis applications to extract information from (large) genomic datasets
    • Embarrassingly parallel: can spawn 100s-100k jobs over a distributed cluster
    • Mash-up of many different tools and scripts
    • Complex dependency trees and configuration → very fragile ecosystem


  5. Steinbiss et al., Companion parasite genome annotation pipeline, DOI: 10.1093/nar/gkw292


  6. To reproduce the results of a typical computational biology paper requires ≈280 hours, i.e. ≈1.7 months!


  7. [image-only slide]

  8. THE SAME APPLICATION
    DEPLOYED IN
    DIFFERENT ENVIRONMENTS
    PRODUCES
    DIFFERENT RESULTS (!)


  9. Comparison of the Companion pipeline annotation of the Leishmania infantum genome executed across different platforms *

    Platform                               Amazon Linux   Debian Linux    Mac OSX
    Number of chromosomes                            36             36         36
    Overall length (bp)                      32,032,223     32,032,223 32,032,223
    Number of genes                               7,781          7,783      7,771
    Gene density                                 236.64         236.64     236.32
    Number of coding genes                        7,580          7,580      7,570
    Average coding length (bp)                    1,764          1,764      1,762
    Number of genes with multiple CDS               113            113        111
    Number of genes with known function           4,147          4,147      4,142
    Number of t-RNAs                                 88             90         88

    * Di Tommaso P, et al., Nextflow enables reproducible computational workflows, Nature Biotechnology, 2017


  10. CHALLENGES
    • Reproducibility: replicate results over time
    • Portability: run across different platforms
    • Scalability, i.e. deploy big distributed workloads
    • Usability: streamline execution and deployment of complex workloads, i.e. remove complexity instead of adding new complexity
    • Consistency, i.e. track changes and revisions consistently for code, config files and binary dependencies


  11. PUSH-THE-BUTTON PIPELINES


  12. HOW?
    • Fast prototyping ⇒ custom DSL that enables task composition and simplifies most use cases, plus a general-purpose programming language for corner cases
    • Easy parallelisation ⇒ declarative reactive programming model based on the dataflow paradigm; implicit, portable parallelism
    • Self-contained ⇒ functional approach: a task execution is idempotent, i.e. it cannot modify the state of other tasks; dependencies are isolated with containers
    • Portable deployments ⇒ executor abstraction layer + separation of deployment configuration from implementation logic
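
    For instance, a minimal sketch of this dataflow style (the values and names below are illustrative, not from the deck): channel operators compose with plain Groovy closures.

    // a channel is an async FIFO queue; operators take ordinary Groovy closures
    Channel
        .from('alpha', 'beta', 'delta')
        .map { it.toUpperCase() }            // general-purpose language for the corner cases
        .subscribe { println "Got: $it" }    // values are consumed asynchronously as they arrive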


  13. [Diagram: orchestration & parallelisation (Nextflow), scalability & portability (containers), deployment & reproducibility (Git/GitHub)]


  14. TASK EXAMPLE
    bwa mem reference.fa sample.fq \
    | samtools sort -o sample.bam


  15. TASK EXAMPLE
    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch
        output:
        file 'sample.bam' into bam_ch
        script:
        """
        bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam
        """
    }


  16. TASKS COMPOSITION
    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch
        output:
        file 'sample.bam' into bam_ch
        script:
        """
        bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam
        """
    }

    process index_sample {
        input:
        file 'sample.bam' from bam_ch
        output:
        file 'sample.bai' into bai_ch
        script:
        """
        samtools index sample.bam
        """
    }


  17. DATAFLOW
    • Declarative computational model for parallel process executions
    • Processes wait for data; when an input set is ready, the process is executed
    • They communicate via dataflow variables, i.e. async FIFO queues called channels
    • Parallelisation and task dependencies are implicitly defined by the process in/out declarations


  18. HOW PARALLELISATION WORKS
    samples_ch = Channel.fromPath('data/sample.fastq')

    process FASTQC {
        input:
        file reads from samples_ch
        output:
        file 'fastqc_logs' into fastqc_ch
        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }


  19. HOW PARALLELISATION WORKS
    samples_ch = Channel.fromPath('data/*.fastq')

    process FASTQC {
        input:
        file reads from samples_ch
        output:
        file 'fastqc_logs' into fastqc_ch
        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }


  20. IMPLICIT PARALLELISM
    [Diagram: Channel.fromPath("data/*.fastq") feeding multiple parallel FASTQC task executions]


  21. SUPPORTED PLATFORMS


  22. DEPLOYMENT SCENARIOS


  23. LOCAL EXECUTION
    • Common development scenario
    • Dependencies can be managed using a container runtime
    • Parallelisation is managed by spawning POSIX processes
    • Can scale vertically using a fat server / shared-memory machine
    [Diagram: nextflow on a laptop/workstation, running over docker/singularity, the OS and local storage]
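
    To make this scenario concrete, a typical local session might look like the following, using the public nextflow-io/hello example pipeline:

    # install with the one-liner installer
    curl -s https://get.nextflow.io | bash

    # run a pipeline; its tasks execute as local POSIX processes
    ./nextflow run nextflow-io/hello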


  24. CENTRALISED ORCHESTRATION
    • Nextflow orchestrates the workflow execution, submitting jobs to a compute cluster, e.g. SLURM
    • It can run on the head node or a compute node
    • Requires shared storage (e.g. NFS/Lustre) to exchange data between tasks
    • Ideal for coarse-grained parallelism
    [Diagram: nextflow on a cluster node submits jobs to the other cluster nodes over NFS/Lustre shared storage]


  25. DISTRIBUTED ORCHESTRATION
    • A single job request allocates the desired compute nodes
    • Nextflow deploys its own embedded compute cluster
    • The main instance orchestrates the workflow execution
    • The worker instances execute the workflow jobs (work-stealing approach)
    [Diagram: a launcher wrapper submitted from the login node spawns a nextflow driver and nextflow workers across the HPC cluster nodes, sharing NFS/Lustre storage]
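
    As a hedged sketch of that single job request, Nextflow's MPI launch mode lets the scheduler allocate the nodes while mpirun spawns one Nextflow instance per node; the first becomes the driver and the others become workers (the node count and pipeline name below are illustrative):

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1
    # one Nextflow instance per allocated node
    mpirun --pernode nextflow run my-pipeline -with-mpi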


  26. KUBERNETES
    • Next-generation, cloud-native clustering for containerised workloads
    • Workflow orchestration is still needed on top of it
    • The latest NF version includes a new command that streamlines workflow deployment to K8s


  27. K8S DEPLOYMENT
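    A minimal sketch of that command, assuming a PersistentVolumeClaim named vol-claim exists in the cluster to host the shared work directory:

    # run the workflow inside the K8s cluster; the driver itself runs in a pod
    nextflow kuberun nextflow-io/hello -v vol-claim:/mnt/shared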


  28. PORTABILITY


  29. PORTABILITY
    process {
        executor  = 'slurm'
        queue     = 'my-queue'
        memory    = '8 GB'
        cpus      = 4
        container = 'user/image'
    }


  30. PORTABILITY
    process {
        executor  = 'awsbatch'
        queue     = 'my-queue'
        memory    = '8 GB'
        cpus      = 4
        container = 'user/image'
    }


  31. CONFIGURATION DECOUPLING IS THE KEY TO PORTABLE DEPLOYMENTS
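
    One way to see this decoupling in practice (a sketch; the profile names are arbitrary): the two previous configurations can live side by side as profiles in a single nextflow.config, selected at launch time while the pipeline script stays untouched.

    profiles {
        cluster {
            process.executor  = 'slurm'
            process.queue     = 'my-queue'
            process.container = 'user/image'
        }
        cloud {
            process.executor  = 'awsbatch'
            process.queue     = 'my-queue'
            process.container = 'user/image'
        }
    }

    The target environment is then chosen with, e.g., nextflow run my-pipeline -profile cloud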


  32. DEMO!


  33. A QUICK COMPARISON


  34. GALAXY vs. NEXTFLOW
    Nextflow:
    • Command-line oriented tool
    • Can incorporate any tool without extra adapters
    • Fine control over task parallelisation
    • Scalability: 100 ⇒ 1M jobs
    • One-liner installer
    • Suited for production workflows and experienced bioinformaticians
    Galaxy:
    • Web-based platform
    • Built-in integration with many tools and datasets
    • Little control over task parallelisation
    • Scalability: 10 ⇒ 1K jobs
    • Complex installation and maintenance
    • Suited for training and less experienced bioinformaticians


  35. SNAKEMAKE vs. NEXTFLOW
    Nextflow:
    • Command-line oriented tool
    • Push model
    • Can manage any data structure
    • Computes the DAG at runtime
    • Supports all major container runtimes
    • Built-in support for clusters and cloud
    • No support (yet) for sub-workflows
    • Built-in support for Git/GitHub, etc. to manage pipeline revisions
    • Groovy/JVM based
    Snakemake:
    • Command-line oriented tool
    • Pull model
    • Rules defined using file-name patterns
    • Computes the DAG ahead of execution
    • Built-in support for Singularity
    • Custom scripts for cluster deployments
    • Support for sub-workflows
    • No support for source code management systems
    • Python based


  36. CWL vs. NEXTFLOW
    Nextflow:
    • Language + application runtime
    • DSL on top of a general-purpose programming language
    • Concise and fluent (at least it tries to be!)
    • Community driven
    • Single implementation, quick iterations
    CWL:
    • Language specification
    • Declarative meta-language (YAML/JSON)
    • Verbose
    • Committee driven
    • Many vendors/implementations (and specification versions)


  37. CONTAINERISATION


  38. CONTAINERISATION
    • Nextflow envisioned the use of software containers to fix computational reproducibility
    • Mar 2014 (ver 0.7): support for Docker
    • Dec 2016 (ver 0.23): support for Singularity
    [Diagram: Nextflow dispatching each job into its own container]


  39. SINGULARITY FEATURES
    Kurtzer et al. Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459


  40. BENCHMARK *
    Container execution can have an impact on short-running tasks, i.e. < 1 min
    * Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273. https://dx.doi.org/10.7717/peerj.1273


  41. SINGULARITY BENCHMARK
    The Singularity image format speeds up the execution of Python applications with many imports from a shared file system!
    https://github.com/wresch/python_import_problem


  42. WHEN TO USE CONTAINERS? ALWAYS!


  43. BEST PRACTICES
    • Containers help isolate dependencies from the development or local deployment environment
    • They provide a reproducible sandbox for third-party users
    • Binary images protect against software decay
    • Make them transparent, i.e. always include the Dockerfile
    • The Docker image format is the de-facto standard; it can be executed by different runtimes, e.g. Singularity, Shifter, uDocker, etc.
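
    As a sketch of how this looks in practice (the image name and tag are illustrative), the image is pinned once in the configuration and the container runtime is chosen at launch:

    // nextflow.config: pin the exact image used by every process
    process.container = 'user/image:1.0'

    The pipeline is then launched with nextflow run my-pipeline -with-docker (or -with-singularity), without changing any task script.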


  44. ERROR RECOVERY
    • Each task's outputs are saved in a separate directory
    • This allows interrupted executions to be safely resumed, discarding incomplete results
    • It dramatically simplifies debugging!
    • Computing resources can be defined in a *dynamic* manner, so that a failing task can automatically be re-executed with more memory, a longer timeout, etc.
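
    A sketch of such a dynamic directive (the values are illustrative): the requested memory and walltime grow with each attempt, and the task is resubmitted on failure.

    process align_sample {
        memory { 2.GB * task.attempt }    // scale memory with the retry attempt
        time   { 1.hour * task.attempt }  // scale the timeout as well
        errorStrategy 'retry'             // re-execute the failing task
        maxRetries 3

        script:
        """
        your_command_here
        """
    }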


  45. EXECUTION REPORT


  46. EXECUTION REPORT


  47. EXECUTION TIMELINE


  48. DAG VISUALISATION
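    The report, timeline and DAG shown on these slides are generated with built-in command line options (the file names are illustrative):

    # resource-usage report, execution timeline and DAG rendering
    nextflow run my-pipeline \
        -with-report report.html \
        -with-timeline timeline.html \
        -with-dag flow.png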


  49. EDITORS!


  50. WHAT'S NEXT


  51. IMPROVEMENTS
    • Built-in support for Bioconda recipes
    • Better metadata and provenance handling
    • Workflow composition, aka sub-workflows
    • Support for more clouds, i.e. Azure and GCP


  52. APACHE SPARK
    • Native support for Apache Spark clusters and its execution model
    • Allow hybrid Nextflow and Spark applications
    • Mix the best of the two worlds: Nextflow for legacy tools and coarse-grained parallelisation, Spark for fine-grained, distributed execution, e.g. GATK4


  53. GA4GH
    • Participate in the GA4GH Cloud Work Stream working group
    • TES: Task Execution API (working prototype)
    • WES: Workflow Execution API
    • Enable interoperability with GA4GH-compliant platforms, e.g. Cancer Genomics Cloud and Broad FireCloud


  54. WHO IS USING NEXTFLOW?


  55. nf-core
    • Community effort to collect production-ready analysis pipelines built with Nextflow
    • Initially supported by SciLifeLab, QBiC and the A*STAR Genome Institute Singapore
    • https://nf-core.github.io
    Alexander Peltzer, Phil Ewels, Andreas Wilm


  56. CONCLUSION
    • Data analysis reproducibility is hard and often underestimated.
    • Nextflow does not provide a magic solution, but it enables best practices and provides support for community and industry standards.
    • It strictly separates the application logic from the configuration and deployment logic, enabling self-contained workflows.
    • Applications can be deployed across different environments in a reproducible manner with a single command.
    • The functional/reactive model allows applications to scale to millions of jobs with ease.


  57. ACKNOWLEDGMENT
    Evan Floden
    Emilio Palumbo
    Cedric Notredame
    Notredame Lab, CRG
    http://nextflow.io
