
BiocAsia 2017

Luke Zappia
November 17, 2017

Single-cell RNA sequencing (scRNA-seq) has opened up a range of opportunities for investigating the transcriptome, but with the dramatic increase in resolution comes an array of bioinformatics challenges. Single-cell data is relatively sparse (for both biological and technical reasons), quality control is difficult and it is unclear if methods designed for bulk RNA-seq are appropriate for scRNA-seq data. Researchers have risen to address these challenges and there are now more than 140 scRNA-seq analysis tools available. However, with so many tools available researchers are faced with the difficult task of choosing which to use, making it important to be able to assess and compare the performance, quality and limitations of each tool. One common approach is to test methods on simulated datasets where the true answers are known. To aid this process we have developed Splatter, a Bioconductor R package for reproducible simulation of scRNA-seq datasets
(bioconductor.org/packages/splatter).

Splatter is a simulation framework that provides access to a variety of simulation models, allowing users to estimate parameters from real data in order to easily generate realistic synthetic scRNA-seq datasets. As part of Splatter we also introduce our own simulation model, Splat, capable of reproducing scRNA-seq datasets with multiple groups of cells, differentiation paths or batch effects. Here we will discuss how Splatter can be used to develop and compare analysis tools. We will also touch on our experience developing Splatter, some of the design choices we made and how we have integrated other Bioconductor infrastructure such as the SingleCellExperiment class.
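
As a rough illustration of the estimate-then-simulate idea described above, the base R sketch below fits a gamma distribution to gene means by the method of moments and then draws negative binomial counts from it. It is not Splatter's implementation: the `real_counts` matrix is a synthetic stand-in for real data, the helper names `estimate_params` and `simulate_counts` are hypothetical, and the dispersion is fixed at an assumed value rather than estimated.

```r
# Minimal gamma-Poisson (negative binomial) sketch in base R.
# NOT Splatter's code; all names and values here are illustrative assumptions.
set.seed(1)

# Synthetic stand-in for a real genes x cells count matrix.
real_counts <- matrix(rnbinom(200 * 50, mu = 5, size = 2),
                      nrow = 200, ncol = 50)

# Step 1: estimate parameters from the data.
# Gene means are summarised by a gamma distribution (method of moments).
estimate_params <- function(counts) {
  gene_means <- rowMeans(counts)
  gene_means <- gene_means[gene_means > 0]
  m <- mean(gene_means)
  v <- var(gene_means)
  list(
    n_genes = nrow(counts),
    n_cells = ncol(counts),
    mean_shape = m^2 / v,
    mean_rate = m / v,
    dispersion = 0.5  # assumed fixed value, not estimated
  )
}

# Step 2: simulate a new dataset from the parameters.
simulate_counts <- function(params) {
  gene_means <- rgamma(params$n_genes,
                       shape = params$mean_shape,
                       rate = params$mean_rate)
  # matrix() fills column by column, so each cell (column) sees every gene mean.
  matrix(
    rnbinom(params$n_genes * params$n_cells,
            mu = rep(gene_means, times = params$n_cells),
            size = 1 / params$dispersion),
    nrow = params$n_genes, ncol = params$n_cells
  )
}

params <- estimate_params(real_counts)
sim_counts <- simulate_counts(params)

# Step 3: compare a simple summary of the real and simulated data.
summary(rowMeans(real_counts))
summary(rowMeans(sim_counts))
```

In Splatter itself these steps correspond to the estimate and simulate functions shown later in the talk (for example `splatEstimate` and `splatSimulate`), which return richer objects than the plain list and matrix used here.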

Transcript

  1. Splatter, a package for simulating single-cell RNA sequencing data
    Luke Zappia @_lazappi_
  2. Single-cell RNA-seq
    Reads (e.g. ACTGACTCCA, TCAGTACTGA, CGTGTCATAG, GATTGACCTA) and the resulting count matrix:
    Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A     12      10      9       0
    B     0       0       0       1
    C     9       6       0       0
    D     7       0       4       0
  3. Svensson et al. arXiv 1704.01379, 2017

  4. Gene Cell 1 Cell 2 Cell 3 Cell 4
    A 12 0 10 9 0
    B 0 0 0 1
    C 9 6 0 0
    D 7 0 4 0
    Bad cell? Low expression? Cell type specific? Cell cycle? Dropout?
  5. None
  6. None
  7. None
  8. www. .org

  9. Simulations
    Provide a truth to test against, BUT:
    - Often poorly documented and explained
    - Not easily reproducible or reusable
    - Don’t demonstrate similarity to real data
  10. “Splatter: simulation of single-cell RNA sequencing data.” Genome Biology (2017) DOI: 10.1186/s13059-017-1305-0
  11. The idea... Multiple simulations, same interface
    Real data -> Estimation -> Parameters -> Simulation -> Dataset
  12. Building a package: Bioconductor, Devtools, Roxygen, testthat, Checkmate, GitHub, Codecov (covr), Travis/Appveyor?
  13. Checkmate
    library(checkmate)
    # Manual argument checking
    fact <- function(n, method = "stirling") {
      if (length(n) != 1)
        stop("Argument 'n' must have length 1")
      if (!is.numeric(n))
        stop("Argument 'n' must be numeric")
      if (is.na(n))
        stop("Argument 'n' may not be NA")
      if (is.double(n)) {
        if (is.nan(n))
          stop("Argument 'n' may not be NaN")
        if (is.infinite(n))
          stop("Argument 'n' must be finite")
        if (abs(n - round(n, 0)) > sqrt(.Machine$double.eps))
          stop("Argument 'n' must be an integerish value")
        n <- as.integer(n)
      }
      # ... further manual checks omitted
    }
    # The same checks using checkmate assertions
    fact <- function(n, method = "stirling") {
      assertCount(n)
      assertChoice(method, c("stirling", "factorial"))
      if (method == "factorial")
        factorial(n)
      else
        sqrt(2 * pi * n) * (n / exp(1))^n
    }
  14. Counts Parameters Estimation

  15. Storing parameters: Custom object. Same structure, names - One object? Different objects
  16. Params (Virtual): SimpleParams, SplatParams, LunParams, SCDDParams

  17. None
  18. None
  19. Parameters Dataset Simulation

  20. Storing simulations Simulated counts Intermediate values Parameters?

  21. SCESet (ExpressionSet): assayData (Matrix), featureData (AnnotatedDataFrame), phenoData (AnnotatedDataFrame)

  22. SingleCellExperiment (SummarizedExperiment): assays (Matrix), rowData (DataFrame), colData (DataFrame), metadata (List)

  23. None
  24. Science!

  25. Splat: Negative binomial, Expression outliers, Defined library sizes, Mean-variance trend, Dropout
  26. Simulations
    - Simple - Negative binomial
    - Lun - NB with cell factors. Lun ATL, Bach K, Marioni JC. Genome Biology (2016). DOI: 10.1186/s13059-016-0947-7.
    - Lun 2 - Sampled NB with batch effects. Lun ATL, Marioni JC. Biostatistics (2017). DOI: 10.1093/biostatistics/kxw055.
    - scDD - NB with bimodality. Korthauer KD, et al. Genome Biology (2016). DOI: 10.1186/s13059-016-1077-y.
    - BASiCS - NB with spike-ins. Vallejos CA, Marioni JC, Richardson S. PLoS Comp. Bio. (2015). DOI: 10.1371/journal.pcbi.1004333.
  27. Using Splatter
    # 1. Estimate
    params1 <- splatEstimate(real.data)
    params2 <- simpleEstimate(real.data)
    # 2. Simulate
    sim1 <- splatSimulate(params1, ...)
    sim2 <- simpleSimulate(params2, ...)
    # 3. Compare
    datasets <- list(Real = real.data, Splat = sim1, Simple = sim2)
    comp <- compareSCESets(datasets)
    diff <- diffSCESets(datasets, ref = "Real")
  28. Means Difference in Means

  29. Zeros per cell Difference in zeros

  30. Mean-zeros Difference

  31. Rank 1 8

  32. Rank 1 8 Full-length Full-length

  33. Complex simulations Groups Batches Paths

  34. New in Splatter 1.2.0 (Bioconductor 3.6)
    - SingleCellExperiment
    - Batch effects
    - Simulations: BASiCS, mfa, PhenoPath, ZINB-WaVE
  35. Summary
    - Many tools for scRNA-seq analysis
    - Catalogued in the scRNA-tools database
    - Can be tested using synthetic datasets
    - Splatter is our package for simulating scRNA-seq data
    - Making a package is not as hard as you think
  36. @_lazappi_ oshlacklab.com
    Supervisors: Alicia Oshlack, Melissa Little
    MCRI Bioinformatics: Belinda Phipson, Breon Schmidt
    Everyone that makes tools and data available
    www.scRNA-tools.org @scRNAtools
    “Splatter: simulation of single-cell RNA sequencing data.” Genome Biology (2017) DOI: 10.1186/s13059-017-1305-0
    “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database” bioRxiv (2017) DOI: 10.1101/206573
    bioconductor.org/packages/splatter
  37. Negative binomial

  38. Real data: Tung et al. iPSCs, C1 capture
    3 HapMap individuals (A, B, C), 3 plates each (A1-A3, B1-B3, C1-C3), 200 random cells
    Tung P-Y et al. Sci. Rep. (2017) DOI: 10.1038/srep39921
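
The parameter-storage design on slides 15 and 16 (a virtual Params class with one subclass per simulation) could be sketched with S4 classes roughly as follows. This is only an illustration under stated assumptions: the `Demo*` class names, the slots, their default values and the `describe` generic are all hypothetical and are not taken from Splatter's source.

```r
# Sketch of a virtual parameters class with model-specific subclasses.
# Class names, slots and defaults are hypothetical, for illustration only.
library(methods)

# Virtual base class: holds the slots every simulation shares.
# It cannot be instantiated directly.
setClass("DemoParams",
         representation("VIRTUAL",
                        nGenes = "numeric",
                        nCells = "numeric"),
         prototype(nGenes = 10000, nCells = 100))

# Each simulation model adds its own slots on top of the shared ones.
setClass("DemoSimpleParams",
         contains = "DemoParams",
         representation(mean.shape = "numeric", mean.rate = "numeric"),
         prototype(mean.shape = 0.4, mean.rate = 0.3))

setClass("DemoSplatParams",
         contains = "DemoParams",
         representation(dropout.present = "logical"),
         prototype(dropout.present = FALSE))

# A method defined once on the virtual class works for every subclass,
# which is what gives the different simulations the same interface.
setGeneric("describe", function(params) standardGeneric("describe"))
setMethod("describe", "DemoParams", function(params) {
  sprintf("%s: %d genes, %d cells",
          as.character(class(params)),
          as.integer(params@nGenes),
          as.integer(params@nCells))
})

simple <- new("DemoSimpleParams")
splat <- new("DemoSplatParams", nGenes = 500)

describe(simple)
describe(splat)
```

Keeping the shared slots in the virtual class means code that only needs the number of genes or cells can accept any parameters object, while model-specific functions can require the subclass they understand.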