Nextflow at the heart of UK genomics

Presentation given at the Nextflow workshop, 22-23 November 2018, Barcelona, Spain

Vladimir Kiselev

November 23, 2018

Transcript

  1. Nextflow at the heart of UK genomics

     Vladimir Kiselev (@wikiselev), Head of Programme Informatics, Wellcome Sanger Institute
  2. Cellular Genetics programme

     • Cell transcriptional atlassing
     • Large-scale imaging for spatial transcriptomics
     • Genomics study of cell lines (HipSci)
     • Computational methods development
  3. What is our team doing?

     • Our team provides computational/IT support for the Cellular Genetics programme (~180 people)
     • We work on both upstream (processing raw data) and downstream (gaining biological insights) analysis pipelines
     • We interact with all of the faculty members in the programme
     • We move the programme's pipelines to the Cloud (Sanger introduced the Flexible Compute Environment ~1.5 years ago)
  4. Why is Sanger moving to the Cloud? (or: computational challenges in academia)
  5. Sequencing data growth

     • Data security
     • The time and cost of moving large datasets
     • Cost of redundant copies of the data
     • Data governance constraints (personally identifiable data)
     [Chart: increase in storage of next-generation sequencing data in the Sequence Read Archive (SRA)]
     Cloud computing for genomic data analysis and collaboration, Ben Langmead & Abhinav Nellore, Nature Reviews Genetics, 2018
  6. Collaborative initiatives

     Cloud computing for genomic data analysis and collaboration, Ben Langmead & Abhinav Nellore, Nature Reviews Genetics, 2018
  7. Reproducibility

     Bespoke software:
     • Hard to run at other sites
     • Dependency problems
     Containers:
     • Docker
     • Singularity
     BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods, Gorgolewski KJ et al., PLOS Computational Biology, 2017
  8. Moving to the cloud

     Flexible Compute Environment, Tim Cutts, Head of Scientific Computing at the Sanger Institute, 2016
  9. Last two decades at Sanger…

  10. Sanger classical setup

     • LSF batch scheduler (not free)
     • Software has to be manually installed (no Docker or Singularity support)
     • Dependency problems
     • Custom bash scripts
  11. We use Nextflow (early 2018)

     • Phil Ewels
     • CWL is very hard to start with
     • Nextflow looked much more advanced than the others
  12. Why Nextflow?

     Orchestration & parallelisation; scalability & portability; deployment & reproducibility
     Flexible Compute Environment, Tim Cutts, Head of Scientific Computing at the Sanger Institute, 2016
     Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018, https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
  13. Nextflow pipelines https://nf-co.re https://github.com/cellgeni

  14. Nextflow on LSF

  15. Our setup

     • Worked from scratch; did not need any tweaking
     • Software: bioconda + local installation
     • Run Nextflow in a conda environment
     • iRods (that is why we forked nf-core)

     executor {
         name = 'lsf'
         queueSize = 400
     }
     process {
         queue = 'normal'
     }
  16. Errors

     • Failure is inevitable
     • Transient failures: storage system down, samples pending
     • Cosmic interference: one in every ~100K featureCounts runs may fail
     • Memory limit exceeded (retry)

     process {
         ...
         errorStrategy = 'ignore'
         ...
     }
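     The 'ignore' and 'retry' strategies mentioned above can be combined in a single process scope; a minimal sketch, assuming illustrative retry counts and memory values rather than the actual production config:

     ```nextflow
     process {
         // Retry transient failures a few times, then skip the sample
         // instead of failing the whole run (values are illustrative)
         errorStrategy = { task.attempt <= 3 ? 'retry' : 'ignore' }
         maxRetries    = 3
         // Grow the memory request on each attempt to absorb
         // "memory limit exceeded" errors
         memory        = { 8.GB * task.attempt }
     }
     ```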
  17. Failure reports

     • iRods not available
     • Too few reads in a sample
     • Too few reads aligned
     • Currently explicitly encoded in processes and channels
     • Propagated as a MultiQC table
     Feature request to Nextflow for onSuccess, onError, onComplete event handlers: Issue #903
  18. Conditional code

     We started with a lot of conditionally defined code, such as:

     if (params.aligner == 'STAR') {
         ...
     }

     then used 'when:' to deactivate a process and 'until' to deactivate a channel and its descendants, removing code branching and improving readability.
  19. Conditional code

     ch_fq_irods
         .mix(ch_fq_dirPE, ch_fq_dirSE)            // Mix three input channels
         .into { ch_rnaseq; ch_fastqc; ch_mixcr }  // Direct towards different computations

     ch_rnaseq
         .until { !params.run_rnaseq }             // RNA-seq can be turned off
         .into { ch_star; ch_hisat2; ch_salmon }   // and can be run in three modes

     process star {
         when:
         params.run_star
         ...
     }
  20. Onion feature

     • Whole Genome Sequencing alignment
     • Very heavy
     • Working with subsamples (onion) was not easy
  21. Onion feature

     • The groupKey feature now allows each onion to proceed independently
     • Issue #796

     ch_iget_file
         .map { tag, file -> [tag, file.readLines()] }
         .map { tag, lines -> tuple(groupKey(tag, lines.size()), lines) }
         .transpose()
         .set { ch_iget_item }
  22. The portability of Nextflow pipelines

  23. Nextflow profiles

     nextflow run cellgeni/rnaseq --samplefile samples.txt --genome GRCh38 -profile lsf
     nextflow run cellgeni/rnaseq --samplefile samples.txt --genome GRCh38 -profile cloud
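     A switch like -profile lsf vs -profile cloud is typically backed by a profiles block in nextflow.config; a minimal sketch, reusing the LSF settings shown earlier (the executor names and values are assumptions, not the actual cellgeni config):

     ```nextflow
     profiles {
         lsf {
             process.executor   = 'lsf'    // submit tasks to the LSF scheduler
             process.queue      = 'normal'
             executor.queueSize = 400
         }
         cloud {
             process.executor   = 'k8s'    // run each task as a Kubernetes pod
         }
     }
     ```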
  24. We use a Kubernetes cluster on the Cloud

     • Kubernetes cluster (open source)
     • Software in Docker containers
     • No dependency problems
     • Scripts are encapsulated in Nextflow pipelines
  25. Why Kubernetes?

  26. Kubernetes

     • Hardware abstraction
     • Container orchestration
     • Nextflow integration
     Kubernetes in Action, Marko Luksa, Manning Publications, 2018
  27. Nextflow integration

     • Benchmarking is needed
     • We are doing it
     • Can you share your experience?
     Nextflow documentation
  28. Kubernetes origins

     • In the early 2000s Google started looking for a better way of deploying and managing its software/infrastructure to scale globally
     • In ~2004 Google developed Borg, then Omega, and kept them secret until 2014
     • In 2014 Google introduced Kubernetes, an open-source system based on the experience gained through Borg and Omega
     • Kubernetes facilitates the deployment and serving of web applications
  29. Nextflow on Kubernetes

  30. How we run it

     nextflow kuberun pipeline ...     (least control)
     nextflow kuberun login ...        (not much control; Paolo's docker image)
     kubectl create -f nextflow.yaml
     kubectl exec pod bash             (full control)
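     The "full control" route manages the driver pod spec directly; a minimal sketch of what such a nextflow.yaml might look like (the image name and command are assumptions for illustration, not the actual file):

     ```yaml
     apiVersion: v1
     kind: Pod
     metadata:
       name: nextflow
     spec:
       containers:
         - name: nextflow
           image: nextflow/nextflow        # assumed image; the talk mentions Paolo's docker image
           command: ["sleep", "infinity"]  # keep the pod alive so we can kubectl exec into it
     ```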
  31. iRods secret

     secret.yml (Kubernetes secret):

     apiVersion: v1
     kind: Secret
     metadata:
       name: irods-secret
     type: Opaque
     data:
       IRODS_PASSWORD: PASSWORD
       IRODS_USER_NAME: USERNAME

     > kubectl create -f secret.yml

     Nextflow config:

     process {
         container = 'quay.io/cellgeni/rnaseq'
         $irods {
             container = 'quay.io/cellgeni/irods'
             pod = [secret: 'irods-secret', mountPath: '/secret']
             beforeScript = "/iinit.sh"
         }
     }
  32. process {
         withName: star {
             container = 'quay.io/biocontainers/star:2.5.4a--0'
             cpus = 4
         }
         withName: multiqc {
             container = 'quay.io/biocontainers/multiqc:1.6--py35h24bf2e0_0'
             cpus = 1
         }
         withName: indexbam {
             container = 'quay.io/biocontainers/samtools:1.8--4'
             cpus = 1
         }
         withName: mapsummary {
             container = 'quay.io/biocontainers/pandas:0.23.4--py36hf8a1672_0'
             cpus = 1
         }
     }
  33. Benchmarking (memory usage): LSF vs Kubernetes [charts]

  34. Benchmarking (execution time): LSF vs Kubernetes [charts]

  35. Benchmarking (Disk I/O - read): LSF vs Kubernetes [charts]

  36. Benchmarking (Disk I/O - write): LSF vs Kubernetes [charts]

  37. Benchmarking (CPU usage): LSF vs Kubernetes [charts]

  38. Kubernetes issues

     • Queuing system: K8s is not very clever when dealing with pending jobs (✓ Issues #773 and #824)
     • I/O / file system performance on K8s
     • S3 data pulling (Issue #686)
  39. Throughput (RNAseq) [charts: bulk RNAseq (# samples, May-Oct) and single-cell RNAseq (# cells, May-Nov)]
  40. Throughput (data sharing) [charts: # samples and TB shared, May-Nov]
  41. Bonuses

  42. Our web apps

  43. Downstream analysis with JupyterHub

     • Jupyter Project (http://jupyter.org)
     • Multiuser
     • Provides containerized R/Python/Julia environments
     • RStudio Server is included
     • Big compute in a browser window
     • We create templates for downstream analysis: https://github.com/cellgeni/notebooks
     • Installation on a Kubernetes cluster (one-liner):

     helm upgrade --install jpt jupyterhub/jupyterhub --namespace jpt --version 0.7.0-beta.2 --values jupyter-config.yaml
  44. Acknowledgements

     Sanger IT: • Helen Cousins • Theo Barber-Bany • Peter Clapham • Tim Cutts • Stijn van Dongen • Anton Khodak • Daniel Gaffney • Sarah Teichmann • Paolo Di Tommaso • Phil Ewels
  45. We are hiring! https://jobs.sanger.ac.uk/vacancies.html