WEHI Bioinformatics Seminar

Single-cells, simulation and kidneys in a dish

Single-cell RNA sequencing (scRNA-seq) is rapidly becoming a tool of choice for biologists wishing to investigate gene expression at greater resolution, particularly in areas such as development and differentiation. Single-cell data presents an array of bioinformatics challenges: data is sparse (for both biological and technical reasons), quality control is difficult, and it is unclear how to replicate measurements. As scRNA-seq datasets have become available, so has a plethora of analysis methods. Evaluation of these methods relies on having a truth to test against or on deep biological knowledge to interpret the results. Unfortunately, current scRNA-seq simulations are frequently poorly documented, not reproducible, and do not demonstrate similarity to real data or experimental designs. In this talk I will present Splatter, a Bioconductor package for simulating scRNA-seq data that is designed to address these issues. Splatter provides a consistent, easy-to-use interface to several previously published simulations, allowing researchers to estimate parameters, produce synthetic datasets and compare how well they replicate real data. Splatter also includes Splat, our own simulation model. Based on a gamma-Poisson hierarchical model, Splat includes additional features often seen in scRNA-seq data, such as dropout, and can be used to simulate complex experiments including multiple cell types, differentiation lineages and multiple batches. I will also discuss an analysis of a complex kidney organoid dataset, showing how more cells and different levels of clustering help to reveal greater biological insight.
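To make the idea of a gamma-Poisson hierarchical model with dropout concrete, here is a minimal base R sketch. It is not Splat's implementation: the parameter values, the variable names and the logistic form of the dropout probability are all assumptions chosen for illustration, and it leaves out library sizes, expression outliers and the mean-variance trend.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

n_genes <- 1000
n_cells <- 50

# Level 1: a mean expression level per gene, drawn from a gamma distribution
# (shape and rate are illustrative values, not estimates from real data).
gene_means <- rgamma(n_genes, shape = 0.6, rate = 0.3)

# Level 2: a cell-specific mean per gene, gamma-distributed around the gene
# mean. With shape = 1 / dispersion and scale = mean * dispersion the gamma
# has expectation equal to the gene mean. 'dispersion' is a hypothetical value.
dispersion <- 0.1
cell_means <- matrix(
  rgamma(n_genes * n_cells,
         shape = 1 / dispersion,
         scale = rep(gene_means, times = n_cells) * dispersion),
  nrow = n_genes, ncol = n_cells
)

# Level 3: Poisson counts around the cell-specific means. A gamma mixture of
# Poissons is a negative binomial, hence "gamma-Poisson".
counts <- matrix(rpois(n_genes * n_cells, lambda = cell_means),
                 nrow = n_genes, ncol = n_cells)

# Dropout (assumed logistic form): lowly expressed genes are more likely to
# be replaced by a zero. 'midpoint' and 'steepness' are hypothetical.
midpoint  <- 0
steepness <- -1
drop_prob <- plogis(steepness * (log(cell_means) - midpoint))
keep <- matrix(rbinom(n_genes * n_cells, size = 1, prob = 1 - drop_prob),
               nrow = n_genes, ncol = n_cells)
counts_dropout <- counts * keep

# Proportion of zeros before and after dropout
zeros_before <- mean(counts == 0)
zeros_after  <- mean(counts_dropout == 0)
```

Comparing `zeros_before` with `zeros_after` shows how much of the sparsity in the final matrix comes from the added dropout step rather than from low Poisson means.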

Luke Zappia

July 17, 2017

Transcript

  1. Single-cells, simulation and kidneys in a dish Luke Zappia MCRI

    Bioinformatics @_lazappi_
  2. Bulk RNA-seq

    ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA

    Gene  Sample 1
    A     43
    B     3
    C     17
    D     24
  3. Single-cell RNA-seq

    ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA
    ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA
    ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA
    ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA

    Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A     12      10      9       0
    B     0       0       1       4
    C     9       6       0       0
    D     7       0       4       0
  4. Moore’s Law Svensson et al. arXiv 1704.01379, 2017

  5. None
  6. Unique Molecular Identifiers UMIs 5’ 3’ AAAA (PCR){BC}[UMI]TTTT 5 4

    Aligned reads De-duplication and counting
  7. Gene Cell 1 Cell 2 Cell 3 Cell 4

    A 12 10 9 0
    B 0 0 1 4
    C 9 6 0 0
    D 7 0 4 0
  8. Gene Cell 1 Cell 2 Cell 3 Cell 4 A

    12 0 10 9 0 B 0 0 1 4 C 9 6 0 0 D 7 0 4 0 Bad cell? Low expression? Cell type specific? Cell cycle? Dropout?
  9. Analysis Over 120 packages - www.scRNA-tools.org Identify cell types -

    Clustering - Lineage tracing
  10. None
  11. Simulation Biology Evaluation

  12. Simulation

  13. Simulations

    Provide a truth to test against
    BUT - Often poorly documented and explained - Not easily reproducible or reusable - Don’t demonstrate similarity to real data
  14. None
  15. Splatter Bioconductor package Collection of simulation methods Consistent, easy to

    use, interface Functions for comparison
  16. Negative binomial

  17. Splat Negative binomial Expression outliers Defined library sizes Mean-variance trend

    Dropout
  18. Simulations

    Simple - Negative binomial
    Lun - NB with cell factors. Lun ATL, Bach K, Marioni JC. Genome Biology (2016). DOI: 10.1186/s13059-016-0947-7.
    Lun 2 - Sampled NB with batch effects. Lun ATL, Marioni JC. Biostatistics (2017). DOI: 10.1093/biostatistics/kxw055.
    scDD - NB with bimodality. Korthauer KD, et al. Genome Biology (2016). DOI: 10.1186/s13059-016-1077-y.
    BASiCS - NB with spike-ins. Vallejos CA, Marioni JC, Richardson S. PLoS Comp. Bio. (2015). DOI: 10.1371/journal.pcbi.1004333.
  19. Using Splatter

    # 1. Estimate
    params1 <- splatEstimate(real.data)
    params2 <- simpleEstimate(real.data)
    # 2. Simulate
    sim1 <- splatSimulate(params1, ...)
    sim2 <- simpleSimulate(params2, ...)
    # 3. Compare
    datasets <- list(Real = real.data, Splat = sim1, Simple = sim2)
    comp <- compareSCESets(datasets)
    diff <- diffSCESets(datasets, ref = "Real")
  20. Real data 3 HapMap individuals 3 plates each 200 random

    cells Tung P-Y et al. Sci. Rep. (2017) DOI:10.1038/srep39921 A1 A2 A3 A B1 B2 B3 B C1 C2 C3 C Tung et al. iPSCs, C1 capture
  21. Means Difference in Means

  22. Zeros per cell Difference in zeros

  23. Mean-zeros Difference

  24. Rank 1 8 Full-length

  25. Complex simulations Groups Batches Paths

  26. Example evaluation

    Parameters - Estimated from Tung data
    Simulation - 400 cells - 3 groups (60%, 25%, 15%) - 10% DE (~1700 genes) - 20 replicates
    Method - SC3 - k-means consensus clustering - Differential expression - Marker genes
  27. Clustering Gene identification

  28. Simulation summary

    Simulations are a great tool
    But they should be: - Reusable - Reproducible - Realistic
    Splatter is our solution
    bioRxiv 10.1101/133173
  29. Biology

  30. The kidney OpenStax College, CC BY 3.0 via Wikimedia Commons

  31. Organoids Day 0 4 7 10 18 25 CHIR FGF9

    FGF9 CHIR Form pellets No GF iPSCs organoid
  32. GATA3 ECAD LTL WT1 CD + DT + PT +

    Glo
  33. Fluidigm experiment 4 organoids C1 capture Full-length No spike-ins

  34. Analysis Alignment Quantification Quality control Clustering Gene detection Interpretation STAR

    featureCounts scater SC3 SC3 Biologists
  35. Quality control Cells - Alignment - Quantification - Expression 278

    -> 155 Genes - Expression - Class 23388
  36. Clustering

  37. 10x experiment 3 organoids Chromium capture UMI ~7000 cells

  38. Analysis Alignment Quantification Quality control Clustering Gene detection Interpretation CellRanger

    CellRanger scater Seurat Seurat Biologists
  39. Three clusters Vasculature Epithelium “Stroma”

  40. Many clusters

  41. Vasculature Proximal tubule Podocytes

  42. Mesangium Renal stroma

  43. Nephron? Neuronal?

  44. ?

  45. Cluster tree

    Resolution 0.01 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
  46. Resolution 0.01 to 1.0

  47. Resolution 0.01 to 1.0

    Vasculature
  48. Resolution 0.01 to 1.0

    Proximal Tubule Podocytes
  49. Resolution 0.01 to 1.0

    Mesangium
  50. Resolution 0.01 to 1.0

    Renal stroma
  51. Resolution 0.01 to 1.0

    Nephron/neuronal?
  52. Resolution 0.01 to 1.0

    ?
  53. Summary Kidney organoids are complex More cells help Cluster relationships

    can be useful Background knowledge is vital
  54. Acknowledgements

    Everyone that makes tools and data available
    Supervisors: Alicia Oshlack, Melissa Little
    MCRI Bioinformatics: Belinda Phipson, Breon Schmidt
    MCRI KDDR: Alex Combes
  55. bioconductor.org/packages/splatter bioRxiv: “Splatter: simulation of single-cell RNA sequencing data” @scRNAtools

    www.scRNA-tools.org @_lazappi_ oshlacklab.com
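The cluster tree on slides 45 to 52 relates clusters found at neighbouring resolutions. The slides do not say how the tree was built, so the following is only a sketch of one way the relationships could be tabulated in base R. The cluster labels here are made up (they are not the organoid data), and the names `low_res`, `high_res` and `in_prop` are hypothetical.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

n_cells <- 200

# Hypothetical cluster labels for the same cells at a low resolution...
low_res <- sample(c("A", "B", "C"), n_cells, replace = TRUE)

# ...and at a higher resolution, where each low-resolution cluster has been
# split into two sub-clusters (e.g. "A" becomes "A1" and "A2").
high_res <- paste0(low_res, sample(1:2, n_cells, replace = TRUE))

# Cross-tabulate the two clusterings: rows are low-resolution clusters,
# columns are high-resolution clusters, entries are numbers of shared cells.
overlap <- table(low = low_res, high = high_res)

# For each high-resolution cluster, the proportion of its cells that come
# from each low-resolution cluster. These proportions could be used as edge
# weights when drawing a tree between the two resolutions.
in_prop <- prop.table(overlap, margin = 2)

print(overlap)
print(round(in_prop, 2))
```

In this toy example every high-resolution cluster is nested inside one low-resolution cluster, so each column of `in_prop` has a single entry equal to 1. With real clusterings, columns with several sizeable entries would flag clusters that draw cells from more than one parent.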