Accelerating high-throughput image analysis using flexible Cloud environments

Ola Tarkowska
Solution Architect, Cellular Genetics Informatics, Wellcome Sanger Institute
Informatics Seminar, 2nd March 2021

New Sanger-EBI Initiative for Large-Scale Spatial Genomics https://www.sanger.ac.uk/group/high-throughput-spatial-genomics/

People involved

Instrumentation: The Opera Phenix High Content Screening Microscope
• One microscope produces up to ~2.4 TB of data per day.
• We are going to have more of them.

Estimated data acquisition in Petabytes
• End of 2021: 4.8 TB per day
• 2022: 8.4 TB per day
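The slide title speaks of petabytes while the figures are daily rates in terabytes. A minimal sketch of the conversion, assuming acquisition runs 365 days a year and 1 PB = 1000 TB (both assumptions, not figures from the slides):

```python
# Daily acquisition rates quoted on the slide, in TB per day.
DAILY_RATE_TB = {"end of 2021": 4.8, "2022": 8.4}

# Assumptions (not from the slides): acquisition every day of the year,
# decimal units (1 PB = 1000 TB).
DAYS_PER_YEAR = 365
TB_PER_PB = 1000


def yearly_volume_pb(tb_per_day: float) -> float:
    """Return the volume accumulated over one year, in petabytes."""
    return tb_per_day * DAYS_PER_YEAR / TB_PER_PB


for period, rate in DAILY_RATE_TB.items():
    print(f"{period}: {rate} TB/day is about {yearly_volume_pb(rate):.2f} PB/year")
```

Under these assumptions the two rates correspond to roughly 1.8 PB and 3.1 PB per year.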

Test dataset: running pipeline on the Cloud
• Sample provided by Roser Vento Lab
• Stringer, C., Wang, T., Michaelos, M. et al. Cellpose: a generalist algorithm for cellular segmentation. Nat Methods 18, 100–106 (2021). https://doi.org/10.1038/s41592-020-01018-x
• ~4.2 TB/day to be analysed
• Tissue [12391 × 8299 pix], dual-channel raw data (nuclei + cell border)
• 2D segmentation: 40 min, 0.4 GB

Preliminary observations
• GPU acceleration improved performance.
• We need hundreds of GPUs to accommodate the high throughput.
• Code can be significantly optimised, or the tools can even change.
• Estimated data acquisition can grow.
We need flexibility.
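The "hundreds of GPUs" figure can be sanity-checked from the test-dataset numbers. This is a back-of-the-envelope sketch, not the speaker's calculation. It assumes that the 40 min refers to segmenting one 0.4 GB image on a single GPU, that GPUs are busy around the clock, and that 1 TB = 1000 GB:

```python
import math

# Figures quoted on the test-dataset slide.
DATA_TO_ANALYSE_TB_PER_DAY = 4.2
IMAGE_SIZE_GB = 0.4
SEGMENTATION_MIN_PER_IMAGE = 40

# Assumptions (not from the slides): decimal units, one image per GPU at a
# time, every GPU fully utilised 24 hours a day.
GB_PER_TB = 1000
MINUTES_PER_DAY = 24 * 60


def gpus_needed(tb_per_day: float, image_gb: float, minutes_per_image: float) -> float:
    """Estimate how many GPUs keep up with the daily data volume."""
    images_per_day = tb_per_day * GB_PER_TB / image_gb
    gpu_minutes_per_day = images_per_day * minutes_per_image
    return gpu_minutes_per_day / MINUTES_PER_DAY


estimate = gpus_needed(
    DATA_TO_ANALYSE_TB_PER_DAY, IMAGE_SIZE_GB, SEGMENTATION_MIN_PER_IMAGE
)
print(f"about {math.ceil(estimate)} GPUs running continuously")
```

With these inputs the estimate lands just under 300 GPUs, which is consistent with the order of magnitude stated on the slide.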

How to be flexible when you need hundreds of GPUs?
• CellGen has 2 x V100 SXM2 GPUs.
• We need hundreds of GPU units.

Hardware on-premises

Hardware on the Cloud

Flexibility: Google Cloud Platform Compute Engine

GPU model                     Year  GPUs     GPU memory   Available vCPUs  Available memory  Tensor cores  CUDA cores
NVIDIA® Tesla® K80 (Kepler)   2014  8 GPUs   96 GB GDDR5  1 - 64 vCPUs     1 - 208 GB        ---           2,496
NVIDIA® Tesla® P100 (Pascal)  2016  4 GPUs   64 GB HBM2   1 - 96 vCPUs     1 - 624 GB        ---           3,584
NVIDIA® Tesla® P4 (Pascal)    2016  4 GPUs   32 GB GDDR5  1 - 96 vCPUs     1 - 624 GB        ---           2,560
NVIDIA® Tesla® V100 (Volta)   2017  8 GPUs   128 GB HBM2  1 - 96 vCPUs     1 - 624 GB        640           5,120
NVIDIA® Tesla® T4 (Turing)    2018  4 GPUs   64 GB GDDR6  1 - 96 vCPUs     1 - 624 GB        320           2,560
NVIDIA® Tesla® A100 (Ampere)  2020  16 GPUs  640 GB HBM2  Up to 96 vCPUs*  Up to 1.3 TB      432           6,912

*The A2 family uses Cascade Lake CPUs
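The "GPU memory" column appears to be the total across the maximum number of GPUs that can be attached to one VM, rather than the memory of a single card. That reading is an assumption. A short sketch that derives the per-GPU memory under it:

```python
# (model, maximum GPUs per VM, total GPU memory in GB), copied from the table.
GPU_TABLE = [
    ("K80", 8, 96),
    ("P100", 4, 64),
    ("P4", 4, 32),
    ("V100", 8, 128),
    ("T4", 4, 64),
    ("A100", 16, 640),
]


def memory_per_gpu_gb(max_gpus: int, total_memory_gb: int) -> float:
    """Split the quoted total evenly across the attached GPUs (assumption)."""
    return total_memory_gb / max_gpus


per_gpu = {model: memory_per_gpu_gb(n, total) for model, n, total in GPU_TABLE}

for model, gb in per_gpu.items():
    print(f"{model}: {gb:.0f} GB per GPU")
```

The per-GPU values that come out (12, 16, 8, 16, 16 and 40 GB) are the figures to compare against the memory footprint of a single segmentation job.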

Google Cloud Platform
• We ran a PoC with the Google Life Sciences team, funded by Google.
• Available GPU accelerators were tested.
• The results confirmed the need for flexible hardware.
Flexibility comes at a price (the cost of the cloud is always higher than on premises).

Estimated Cloud Cost
First estimation of price: £600,000, assuming
• current tool
• current data production rate
• current GPU landscape
All of them can change.

Flexibility makes it possible to reduce the higher cost
• Broad’s GATK pipeline: 10x cost reduction
• In our genes: How Google Cloud helps the Broad Institute slash the cost of research https://cloud.google.com/blog/topics/inside-google-cloud/our-genes-how-google-cloud-helps-broad-institute-slash-cost-research

Built-in tools to reduce the cost

How to run analysis on the Cloud with minimum effort?

Nextflow on LSF
NF-Tower provided by Martin Prete

Nextflow on GCP using the ‘google-lifesciences’ executor
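Switching a Nextflow pipeline to this executor is mostly a matter of configuration. The sketch below renders a minimal `nextflow.config` as text. The option names (`process.executor`, `process.container`, `workDir`, `google.project`, `google.region`) are standard Nextflow settings for this executor. The helper `render_nextflow_config` and every value passed to it are hypothetical placeholders, not the settings used in the talk:

```python
def render_nextflow_config(
    project: str, region: str, work_bucket: str, container: str
) -> str:
    """Render a minimal nextflow.config for the google-lifesciences executor."""
    lines = [
        "process.executor = 'google-lifesciences'",
        # Tasks run inside a container image on the cloud VMs.
        f"process.container = '{container}'",
        # Intermediate files live in a Cloud Storage bucket.
        f"workDir = 'gs://{work_bucket}/work'",
        f"google.project = '{project}'",
        f"google.region = '{region}'",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
config_text = render_nextflow_config(
    project="my-gcp-project",
    region="europe-west2",
    work_bucket="my-bucket",
    container="my-registry/segmentation:latest",
)
print(config_text)
```

Keeping the executor choice in configuration means the same pipeline definition can, in principle, run on LSF or on GCP by swapping the config file.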

Our infrastructure on GCP
• https://cloud.google.com/life-sciences
• Nextflow Tower UI

Recipe for processing large volumes of data
• Flexibility
• Future optimisation and cost reduction
• Nextflow + NF-Tower as a platform
• Google team support

Acknowledgement
• Google Team: Hatem Nawar, Evi Karakozoglou, Marina Perkins, Ilias Katsardis, Annalisa Pawlosky, Ulrike Gupta
• Bayraktar Lab: Tong Li, Omer Bayraktar
• CellGenI Team: Ola Tarkowska, Vladimir Kiselev
• Sanger IT: Pete Claphman, Tim Cutts
• Nextflow: Paolo Di Tommaso, Evan Floden

Thank you for listening
