Development And Operation Of Workflows, Tools And Infrastructure For Data Analysis

My presentation at Genomics and Synthetic Biology Series UK - https://www.oxfordglobal.co.uk/genomics-and-synthetic-biology-series-uk/

Vladimir Kiselev

November 09, 2018

Transcript

  1. Development And Operation Of Workflows, Tools And Infrastructure For Data Analysis

    Vladimir Kiselev (@wikiselev), Head of Programme Informatics, Wellcome Sanger Institute
  2. Cellular Genetics programme …

    • Cell Transcriptional Atlassing
    • Large scale imaging for spatial transcriptomics
    • Genomics study of cell lines (HipSci)
    • Computational methods development
  3. What is our team doing?

    • Our team provides Computational/IT support for the Cellular Genetics programme (~180 people)
    • We work on both upstream (processing raw data) and downstream (gaining biological insights) analysis pipelines
    • We interact with all of the faculty members in the programme
    • We move the programme's pipelines to the Cloud (Sanger introduced its Flexible Compute Environment ~1.5 years ago)
  4. Why do we need the Cloud? or Computational challenges in academia
  5. Sequencing data growth

    • Data security
    • The time and cost of moving large datasets
    • Costs of redundant copies of the data
    • Data governance constraints (personally identifiable data)

    Figure: Increase in storage of next-generation sequencing data in the Sequence Read Archive (SRA)

    Cloud computing for genomic data analysis and collaboration, Ben Langmead & Abhinav Nellore, Nature Reviews Genetics, 2018
  6. Collaborative initiatives

    Cloud computing for genomic data analysis and collaboration, Ben Langmead & Abhinav Nellore, Nature Reviews Genetics, 2018
  7. Reproducibility

    Bespoke software
    • Hard to run at other sites
    • Dependencies problem

    Containers
    • Docker
    • Singularity

    BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods, Gorgolewski KJ et al., PLOS Computational Biology, 2017
  8. Moving to the cloud

    Flexible Compute Environment, Tim Cutts, Head of Scientific Computing at the Sanger Institute, 2016
  9. Last two decades at Sanger…

  10. Sanger classical setup

    • LSF batch scheduler (not free)
    • Software has to be manually installed (no Docker or Singularity support)
    • Dependencies
    • Custom bash scripts
  11. We use Nextflow

  12. Why Nextflow?

    • Orchestration & Parallelisation
    • Scalability & Portability
    • Deployment & Reproducibility

    Flexible Compute Environment, Tim Cutts, Head of Scientific Computing at the Sanger Institute, 2016
    Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
  13. Nextflow Example

        bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam

    Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
  14. Nextflow Example

    Nextflow process:

        process align_sample {
            input:
            file 'reference.fa' from genome_ch
            file 'sample.fq' from reads_ch

            output:
            file 'sample.bam' into bam_ch

            script:
            """
            bwa mem reference.fa sample.fq \
                | samtools sort -o sample.bam
            """
        }

    Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
  15. Nextflow Example

    Nextflow process:

        process align_sample {
            input:
            file 'reference.fa' from genome_ch
            file 'sample.fq' from reads_ch

            output:
            file 'sample.bam' into bam_ch

            script:
            """
            bwa mem reference.fa sample.fq \
                | samtools sort -o sample.bam
            """
        }

    Implicit parallelism:

        reads_ch = Channel.fromPath('data/*.fq')

    Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
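
For comparison, here is a rough hand-written counterpart of what `Channel.fromPath('data/*.fq')` gives the process above: one alignment command per FASTQ file. This is only a sketch, not how Nextflow works internally. The function name `align_commands` and the per-sample output naming are hypothetical; Nextflow instead runs every task in its own working directory, so each task can write `sample.bam`.

```shell
#!/bin/sh
# Print one alignment command per FASTQ file found in a directory.
# Hypothetical helper: it only prints the commands, it does not run them.
align_commands() {
    data_dir=$1
    reference=$2
    for fq in "$data_dir"/*.fq; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -e "$fq" ] || continue
        # Assumed naming scheme: data/foo.fq -> foo.bam
        sample=$(basename "$fq" .fq)
        printf 'bwa mem %s %s | samtools sort -o %s.bam\n' \
            "$reference" "$fq" "$sample"
    done
}
```

Running `align_commands data reference.fa | sh` would execute the alignments one after another. With the channel, Nextflow creates one independent task per file and hands each to the configured executor, with no loop in the pipeline code.
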
  16. Nextflow Example

    Nextflow process:

        process align_sample {
            input:
            file 'reference.fa' from genome_ch
            file 'sample.fq' from reads_ch

            output:
            file 'sample.bam' into bam_ch

            script:
            """
            bwa mem reference.fa sample.fq \
                | samtools sort -o sample.bam
            """
        }

    Implicit parallelism:

        reads_ch = Channel.fromPath('data/*.fq')

    Nextflow profile file:

        executor {
            name = 'lsf'
            queueSize = 400
        }

        process {
            queue = 'normal'
        }

    Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
  17. Nextflow pipelines https://nf-co.re https://github.com/cellgeni

  18. The portability of Nextflow pipelines

  19. Nextflow Profiles

        nextflow run cellgeni/rnaseq --samplefile samples.txt --genome GRCh38 -profile lsf
        nextflow run cellgeni/rnaseq --samplefile samples.txt --genome GRCh38 -profile cloud
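
The two commands above differ only in the `-profile` value, so a small wrapper can pick the profile for the host it runs on. The sketch below makes one assumption that is not from the talk: a machine with LSF's `bsub` on its `PATH` is the on-premise cluster, and anything else is the cloud deployment. `pick_profile` and `rnaseq_command` are hypothetical helper names.

```shell
#!/bin/sh
# Choose a Nextflow profile from what the current host offers.
# Assumption: `bsub` (the LSF submission command) on PATH means "LSF cluster".
pick_profile() {
    if command -v bsub >/dev/null 2>&1; then
        echo lsf
    else
        echo cloud
    fi
}

# Print the launch command from the slide with the chosen profile filled in.
# $1 = sample file, $2 = genome name
rnaseq_command() {
    printf 'nextflow run cellgeni/rnaseq --samplefile %s --genome %s -profile %s\n' \
        "$1" "$2" "$(pick_profile)"
}
```

For example, `rnaseq_command samples.txt GRCh38` prints the command to run. The pipeline code itself stays identical on both platforms, and only the profile changes.
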
  20. We use a Kubernetes cluster on the Cloud

    • Kubernetes cluster (open source)
    • Software in Docker containers
    • No dependencies problem
    • Scripts are encapsulated in Nextflow pipelines
  21. Why Kubernetes?

  22. Kubernetes

    • Hardware abstraction
    • Container orchestration
    • Nextflow integration

    Kubernetes in Action, Marko Luksa, Manning Publications, 2018
  23. Nextflow Integration

    • Benchmarking is needed
    • We are doing it
    • Can you share your experience?

    Nextflow documentation
  24. Kubernetes origins

    • In the early 2000s Google started looking for a better way of deploying and managing its software and infrastructure at global scale
    • In ~2004 Google developed Borg, and later Omega, and kept both secret until 2014
    • In 2014 Google introduced Kubernetes, an open-source system based on the experience gained through Borg and Omega
    • Kubernetes facilitates the deployment and serving of web applications
  25. Our web apps

  26. Simple web apps (with backend) https://github.com/cellgeni/FORECasT https://github.com/hemberg-lab/scmap-shiny

  27. Downstream analysis with JupyterHub

    • Jupyter Project (http://jupyter.org)
    • Multiuser
    • Provides containerized R/Python/Julia environments
    • RStudio Server is included
    • Big compute in a browser window
    • We create templates for downstream analysis https://github.com/cellgeni/notebooks
    • Installation on a Kubernetes cluster (one-liner):

        helm upgrade --install jpt jupyterhub/jupyterhub --namespace jpt --version 0.7.0-beta.2 --values jupyter-config.yaml
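
The one-liner above only resolves `jupyterhub/jupyterhub` once Helm knows the JupyterHub chart repository. A minimal sketch of the full sequence follows. `install_jupyterhub` is a hypothetical wrapper name, and the sketch assumes that `helm` is already configured for the target cluster and that `jupyter-config.yaml` exists in the current directory.

```shell
#!/bin/sh
# Register the JupyterHub chart repository, refresh the local index,
# then run the install command from the slide.
# Hypothetical wrapper; stops at the first step that fails.
install_jupyterhub() {
    helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/ &&
    helm repo update &&
    helm upgrade --install jpt jupyterhub/jupyterhub \
        --namespace jpt \
        --version 0.7.0-beta.2 \
        --values jupyter-config.yaml
}
```

Calling `install_jupyterhub` a second time upgrades the existing `jpt` release instead of failing, which is what `upgrade --install` is for. Newer Helm releases may need the namespace to exist first (for example by adding `--create-namespace`).
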
  28. Downstream analysis with Galaxy

    • Expression group at EBI
    • Downstream analysis pipelines
    • Ready to be used
    • User-friendly
    • Installation on a Kubernetes cluster is also a one-liner https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary
  29. Acknowledgements

    • Sanger IT
    • Helen Cousins
    • Theo Barber-Bany
    • Peter Clapham
    • Tim Cutts
    • Stijn van Dongen
    • Anton Khodak
    • Ruben Chazarra
    • Daniel Gaffney
    • Sarah Teichmann
    • Paolo Di Tommaso
    • Phil Ewels
  30. We are hiring! https://jobs.sanger.ac.uk/vacancies.html