Large Scale Data Ecosystem for Single Cell Imaging

Ola Tarkowska
February 12, 2019

Towards decentralized life science image data resources. Large scale data ecosystem for single cell imaging, automated processing and visualizations at Sanger.

The growing number of images produced daily challenges data management systems. File system storage alone is not enough to organize the data. Indexing data in a data management system like OMERO is essential because it allows data from decentralized resources to be integrated and enables programmatic access through a remote API in a Virtual Analysis Environment, supported by multiple programming languages.

Transcript

  1. Large Scale Data Ecosystem for Single Cell Imaging
    Automated workflows, data types & visualisations
    Ola Tarkowska, Solutions Architect, Cellular Genetics Informatics, Wellcome Sanger Institute
    28th February 2020
  2. Cellular Genetics programme
    • Single Cell Sequencing
    • Single Cell Imaging
    • Data Integration
    • Computational methods development
  3. single molecule FISH workflow
    ISS (in situ sequencing) 150 genes
    RNAscope 12 genes
    Bayraktar, O.A. et al. Astrocyte layers in the mammalian cerebral cortex revealed by a single-cell in situ transcriptomic map. Nat Neurosci 23, 500–509 (2020). https://doi.org/10.1038/s41593-020-0602-1
  4. Multidimensional data set at TB scale
    “Multidimensional data is acquired and written to disk in many small files, each of which contain a subset of one or more dimensions from the complete dataset.”
    • 19340 px x 35872 px x 30 Z-planes
    • Pixels Size 0.3µm
    • 5 Channels
    • 18,000 files -> 40GB
  5. Image dimensions and data set sizes
    Typical sample slide: 20mm x 15mm
    @ “40X” (.15mpp) = 80,000 x 60,000 pixels = 4.8Gp
    @ “40X” (.30mpp) or “20X” (.15mpp) = 40,000 x 30,000 pixels = 2.4Gp
    Digital Pathology - Whole Slide Images (WSI) @ .25mpp (“40X”) = 80,000 x 60,000 pixels = 4.8Gp = 15GB
    Strip of cerebral cortex: 4,400 images, 6 GB
    Whole mouse brain section: 10,000 images, 40 GB
    Human endometrium section: 87,000 images, 340 GB
    Heart section (3 sections): 127,000 images, 480 GB
  6. What is the efficient way to store the data?
    Strutz M et al. Transforming a Local Medical Image Analysis for Running on a Hadoop Cluster https://doi.org/10.1016/j.procs.2017.05.227; Available via license: CC BY-NC-ND 4.0
    Besson S et al. Bringing Open Data to Whole Slide Imaging https://doi.org/10.1007/978-3-030-23937-4_1
  7. Histology images
    The Image Data Resource: A Bioimage Data Integration and Publication Platform. Williams et al. Nat Methods Volume 14 (2017) p.775-781. https://doi.org/10.1038/nmeth.4326
  8. Current deployment
    Multiple production servers (VPN only)
    OMERO - 16 VCPUs, 64 GB RAM
    WEB - 8 VCPUs, 16 GB RAM
    Storage - 102TB raw | 2.4T processed
    NFS Dedicated 15 PB EMC Isilon, OneFS v8.0.1.2. NFS
  9. Analysis in Jupyter
    Van der Walt et al. scikit-image: Image processing in Python http://dx.doi.org/10.7717/peerj.453
    scikit-image is a collection of algorithms for image processing in Python.
    Thresholding - the simplest way to segment objects from a background
  10. Biostitch
    OpenCV based image stitcher for Opera Phenix HCS System
    Vasyl Vaskivskyi (Cellular Genetics Informatics) https://github.com/VasylVaskivskyi/biostitch
    High stitching precision
    Adaptive registration of images
    Suitable for pipeline usage
    Requires only metadata from microscope
    Output suitable for OME visualisation tools
    Metadata handling and conversion to OME-TIFF
    Efficient memory usage
    Loading images into batches
    Dataset size: 6 GB
    VM spec: 22GB RAM, 16 core CPU
  11. How to represent analysis results in OMERO at scale?
    Image analysis results generated by Tong Li (Bayraktar’s group)
    Cellpose: a generalist algorithm for cellular segmentation https://doi.org/10.1101/2020.02.02.931238
    2D Cellpose with the full-resolution nuclei image.
    A software for manual labelling and for curation of the automated results.
  12. Handling metadata
    The Image Data Resource: A Bioimage Data Integration and Publication Platform. Williams et al. Nat Methods Volume 14 (2017) p.775-781. https://doi.org/10.1038/nmeth.4326
  13. Publishing figures
    Schleicher et al (2017) The Ndc80 complex targets Bod1 to human mitotic kinetochores https://doi.org/10.1098/rsob.170099
  14. ?

  15. People behind
    Ola Tarkowska, Solution Architect
    Kenny Roberts, Postdoctoral Fellow (Bayraktar’s group)
    Omer Bayraktar, Group Leader
    Kwasi Kwakwa, Senior Bioinformatician (Marioni’s group)
    Vasyl Vaskivskyi, Software Developer
    Vladimir Kiselev, Cellular Genetics Informatics Team Leader
    Spatial Transcriptomics Lab
    Informatics Support Team
    Tong Li, Postdoctoral Fellow (Bayraktar’s group)
  16. Acknowledgement
    Sanger IT Teams
    OME consortium:
    Prof. Jason Swedlow, Josh Moore, Sébastien Besson, Eleanor Williams, Simon Li, Vladimir Kiselev, Maria Keays, Vasyl Vasylkiewski, Martin Prete, Stijn van Dongen, Anton Khodak, Simon Murray, Omer Bayraktar, Kenny Roberts, Tong Li, Kwasi Kwakwa
  17. Working life
    MSc Eng. in Software Engineering from the Faculty of Electrical, Electronic, Computer and Control Engineering of Lodz University of Technology
    Scholarship from EU Socrates-Erasmus Schema, University d’Artois
    2007-2017 University of Dundee – Open Microscopy Environment
    • Virtual Microscope for medical students
    • Image Data Resource
    2017-2018 EMBL-EBI – metagenomics team
    • REST APIs for microbiome derived data sets
    • MG-toolkit – meta-analysis
    2019 - present Wellcome Sanger Institute
    • Helping to build data and software infrastructure for Single Cell Imaging
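
Note on slide 5: the whole slide image figures quoted there (20mm x 15mm at .25mpp = 80,000 x 60,000 pixels = 4.8Gp, about 15GB) can be reproduced with a few lines of arithmetic. The sketch below is only an illustration. The function names are hypothetical, and the 3 bytes per pixel (8-bit RGB, uncompressed) is an assumption that the slide does not state; it is chosen because it lands close to the quoted 15GB.

```python
def slide_pixels(width_mm, height_mm, microns_per_pixel):
    """Return (width_px, height_px, total_px) for a scanned area."""
    width_px = round(width_mm * 1000 / microns_per_pixel)
    height_px = round(height_mm * 1000 / microns_per_pixel)
    return width_px, height_px, width_px * height_px


def raw_size_gb(total_px, bytes_per_pixel=3):
    """Uncompressed size in decimal gigabytes.

    bytes_per_pixel=3 assumes 8-bit RGB; this is an assumption,
    not a value given on the slide.
    """
    return total_px * bytes_per_pixel / 1e9


# 20mm x 15mm sample area scanned at 0.25 microns per pixel
width_px, height_px, total_px = slide_pixels(20, 15, 0.25)
print(f"{width_px} x {height_px} pixels = {total_px / 1e9:.1f} Gp")
print(f"about {raw_size_gb(total_px):.1f} GB uncompressed (assuming 8-bit RGB)")
```

Changing `bytes_per_pixel` (for example to 2 for a single 16-bit channel) shows how strongly the storage estimate depends on the pixel type, which is one reason the per-dataset sizes on the slide vary so much.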
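
Note on slide 9: thresholding is described there as the simplest way to segment objects from a background. The following is a minimal, dependency-free sketch of one common way to pick a global threshold, Otsu's method, applied to a tiny made-up image. It is not the code from the talk; the image values and function name are invented for illustration. In a Jupyter analysis with scikit-image, the library's own `skimage.filters.threshold_otsu` would normally be used instead.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises between-class variance.

    Pixels with value <= t are treated as background, the rest as foreground.
    """
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(levels):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already background
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Hypothetical 4x4 8-bit image: a bright 2x2 object on a dark background
image = [
    [10, 12, 11, 10],
    [11, 200, 210, 12],
    [10, 205, 198, 11],
    [12, 10, 11, 10],
]
flat = [value for row in image for value in row]
threshold = otsu_threshold(flat)
mask = [[value > threshold for value in row] for row in image]
print("threshold:", threshold)
for row in mask:
    print("".join("#" if is_object else "." for is_object in row))
```

The resulting boolean mask marks the object pixels; on real data it would typically be followed by labelling connected regions to obtain individual objects.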