Development And Operation Of Workflows, Tools And Infrastructure For Data Analysis

Slide 1

Slide 1 text

Development And Operation Of Workflows, Tools And Infrastructure For Data Analysis
Vladimir Kiselev (@wikiselev)
Head of Programme Informatics, Wellcome Sanger Institute

Slide 2

Slide 2 text

Cellular Genetics programme …
• Cell Transcriptional Atlassing
• Large scale imaging for spatial transcriptomics
• Genomics study of cell lines (HipSci)
• Computational methods development

Slide 3

Slide 3 text

What is our team doing?
• Our team provides computational/IT support for the Cellular Genetics programme (~180 people)
• We work on both upstream (processing raw data) and downstream (gaining biological insights) analysis pipelines
• We interact with all of the faculty members in the programme
• We move the programme's pipelines to the Cloud (Sanger introduced its Flexible Compute Environment ~1.5 years ago)

Slide 4

Slide 4 text

Why do we need the Cloud? or Computational challenges in academia

Slide 5

Slide 5 text

Sequencing data growth
• Data security
• The time and cost of moving large datasets
• Costs of redundant copies of the data
• Data governance constraints (personally identifiable data)
Figure: Increase in storage of next-generation sequencing data in the Sequence Read Archive (SRA)
Source: Cloud computing for genomic data analysis and collaboration, Ben Langmead & Abhinav Nellore, Nature Reviews Genetics, 2018

Slide 6

Slide 6 text

Collaborative initiatives
Source: Cloud computing for genomic data analysis and collaboration, Ben Langmead & Abhinav Nellore, Nature Reviews Genetics, 2018

Slide 7

Slide 7 text

Reproducibility
Bespoke software
• Hard to run at other sites
• Dependencies problem
Containers
• Docker
• Singularity
Source: BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods, Gorgolewski KJ et al., PLOS Computational Biology, 2017

Slide 8

Slide 8 text

Moving to the cloud
Source: Flexible Compute Environment, Tim Cutts, Head of Scientific Computing at the Sanger Institute, 2016

Slide 9

Slide 9 text

Last two decades at Sanger…

Slide 10

Slide 10 text

Sanger classical setup
• LSF batch scheduler (not free)
• Software has to be manually installed (no Docker or Singularity support)
• Dependencies
• Custom bash scripts

Slide 11

Slide 11 text

We use Nextflow

Slide 12

Slide 12 text

Why Nextflow?
• Orchestration & Parallelisation
• Scalability & Portability
• Deployment & Reproducibility
Sources:
Flexible Compute Environment, Tim Cutts, Head of Scientific Computing at the Sanger Institute, 2016
Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow

Slide 13

Slide 13 text

Nextflow Example

    bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam

Source: Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow

Slide 14

Slide 14 text

Nextflow Example

Nextflow process:

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam
        """
    }

Source: Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow

Slide 15

Slide 15 text

Nextflow Example

Nextflow process:

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam
        """
    }

Implicit parallelism:

    reads_ch = Channel.fromPath('data/*.fq')

Source: Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
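The implicit parallelism on this slide can be pictured outside Nextflow. What follows is a minimal Python sketch of the idea only, not of how Nextflow is implemented: each path emitted by the channel becomes one independent task, and the tasks run concurrently. The names `align_sample` and `run_all` and the sample file names are hypothetical, and the alignment step is replaced by a stand-in that only derives the output file name.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def align_sample(fastq: str) -> str:
    """Stand-in for the bwa/samtools step: return the BAM name for one FASTQ."""
    return str(PurePosixPath(fastq).with_suffix(".bam"))


def run_all(fastq_paths: list[str], workers: int = 4) -> list[str]:
    """Run one independent task per input file, as a channel feeds a process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if tasks finish out of order
        return list(pool.map(align_sample, fastq_paths))


if __name__ == "__main__":
    # Hypothetical inputs standing in for the files matched by 'data/*.fq'
    print(run_all(["data/a.fq", "data/b.fq", "data/c.fq"]))
```

The point of the sketch is that nothing in `align_sample` mentions parallelism: the number of tasks follows from the number of inputs, which mirrors how the process definition stays unchanged when more files match the glob.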

Slide 16

Slide 16 text

Nextflow Example

Nextflow process:

    process align_sample {
        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
            | samtools sort -o sample.bam
        """
    }

Implicit parallelism:

    reads_ch = Channel.fromPath('data/*.fq')

Nextflow profile file:

    executor {
        name = 'lsf'
        queueSize = 400
    }

    process {
        queue = 'normal'
    }

Source: Enabling Reproducible In-Silico Data Analysis with Nextflow, Paolo Di Tommaso, 2018 https://speakerdeck.com/pditommaso/enabling-reproducible-in-silico-data-analises-with-nextflow
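As we understand the `queueSize` setting, it caps how many tasks the executor handles at once, so a run with thousands of samples does not flood the scheduler. The Python sketch below illustrates that kind of cap with a bounded worker pool. It is an analogy under that assumption, not Nextflow's own mechanism, and `run_with_queue_size` is a hypothetical name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_with_queue_size(n_tasks: int, queue_size: int) -> int:
    """Run n_tasks dummy jobs with at most queue_size in flight; return the peak."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def job(_: int) -> None:
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        # a real job would do its work here
        with lock:
            state["running"] -= 1

    # The pool size plays the role of queueSize: no more jobs run at once
    with ThreadPoolExecutor(max_workers=queue_size) as pool:
        list(pool.map(job, range(n_tasks)))
    return state["peak"]


if __name__ == "__main__":
    peak = run_with_queue_size(n_tasks=1000, queue_size=400)
    print(f"peak concurrent jobs: {peak}")
```

However many tasks are submitted, the observed peak never exceeds the configured limit.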

Slide 17

Slide 17 text

Nextflow pipelines
https://nf-co.re
https://github.com/cellgeni

Slide 18

Slide 18 text

The portability of Nextflow pipelines

Slide 19

Slide 19 text

Nextflow Profiles

    nextflow run cellgeni/rnaseq --samplefile samples.txt --genome GRCh38 -profile lsf

    nextflow run cellgeni/rnaseq --samplefile samples.txt --genome GRCh38 -profile cloud
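The two commands differ only in the `-profile` value: the pipeline and its parameters stay the same while the execution settings change. The Python sketch below models that separation with a plain lookup table. Only the `lsf` values are taken from the slides; the `cloud` entry (a Kubernetes-backed executor with the same queue size) and the function name `resolve` are assumptions for illustration.

```python
# Hypothetical profile table: only the 'lsf' values come from the slides;
# the 'cloud' entry is an assumed example of a Kubernetes-backed profile.
PROFILES = {
    "lsf": {"executor": "lsf", "queue_size": 400, "queue": "normal"},
    "cloud": {"executor": "k8s", "queue_size": 400, "queue": None},
}


def resolve(profile: str, params: dict) -> dict:
    """Combine unchanged pipeline parameters with the chosen profile's settings."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    return {"params": dict(params), "execution": dict(PROFILES[profile])}


if __name__ == "__main__":
    params = {"samplefile": "samples.txt", "genome": "GRCh38"}
    for name in ("lsf", "cloud"):
        print(name, resolve(name, params))
```

Keeping execution settings out of the pipeline definition is what lets the same workflow run on the LSF farm and on the cloud without edits.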

Slide 20

Slide 20 text

We use a Kubernetes cluster on the Cloud
• Kubernetes cluster (open source)
• Software in Docker containers
• No dependencies problem
• Scripts are encapsulated in Nextflow pipelines

Slide 21

Slide 21 text

Why Kubernetes?

Slide 22

Slide 22 text

Kubernetes
• Hardware abstraction
• Container orchestration
• Nextflow integration
Source: Kubernetes in Action, Marko Luksa, Manning Publications, 2018

Slide 23

Slide 23 text

Nextflow Integration
• Benchmarking is needed
• We are doing it
• Can you share your experience?
Source: Nextflow documentation

Slide 24

Slide 24 text

Kubernetes origins
• In the early 2000s Google looked for a better way of deploying and managing its software/infrastructure to scale globally
• In ~2004 Google developed Borg and then Omega, and kept them secret until 2014
• In 2014 Google introduced Kubernetes, an open-source system based on the experience gained through Borg and Omega
• Kubernetes facilitates the deployment and serving of web applications

Slide 25

Slide 25 text

Our web apps

Slide 26

Slide 26 text

Simple web apps (with backend)
https://github.com/cellgeni/FORECasT
https://github.com/hemberg-lab/scmap-shiny

Slide 27

Slide 27 text

Downstream analysis with JupyterHub
• Jupyter Project (http://jupyter.org)
• Multiuser
• Provides containerized R/Python/Julia environments
• RStudio Server is included
• Big compute in a browser window
• We create templates for downstream analysis: https://github.com/cellgeni/notebooks
• Installation on a Kubernetes cluster (one-liner):

    helm upgrade --install jpt jupyterhub/jupyterhub --namespace jpt --version 0.7.0-beta.2 --values jupyter-config.yaml

Slide 28

Slide 28 text

Downstream analysis with Galaxy
• Expression group at EBI
• Downstream analysis pipelines
• Ready to be used
• User-friendly
• Installation on a Kubernetes cluster is also a one-liner: https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary

Slide 29

Slide 29 text

Acknowledgements
• Sanger IT
• Helen Cousins
• Theo Barber-Bany
• Peter Clapham
• Tim Cutts
• Stijn van Dongen
• Anton Khodak
• Ruben Chazarra
• Daniel Gaffney
• Sarah Teichmann
• Paolo Di Tommaso
• Phil Ewels

Slide 30

Slide 30 text

We are hiring!
https://jobs.sanger.ac.uk/vacancies.html