Slide 1

Slide 1 text

Splatter, a package for simulating single-cell RNA sequencing data
Luke Zappia
@_lazappi_

Slide 2

Slide 2 text

Single-cell RNA-seq

[Figure: the reads ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA, repeated four times]

Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12      10      9       0
B     0       0       0       1
C     9       6       0       0
D     7       0       4       0

Slide 3

Slide 3 text

Svensson et al. arXiv:1704.01379, 2017

Slide 4

Slide 4 text

Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12      10      9       0
B     0       0       0       1
C     9       6       0       0
D     7       0       4       0

Bad cell?
Low expression?
Cell type specific?
Cell cycle?
Dropout?
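The zeros in the example table can be tallied directly. The following is a minimal base R sketch, assuming the four-gene, four-cell count matrix shown on slide 2; the variable names are illustrative and not part of Splatter.

```r
# Example count matrix from slide 2 (genes in rows, cells in columns)
counts <- matrix(
  c(12, 10, 9, 0,
     0,  0, 0, 1,
     9,  6, 0, 0,
     7,  0, 4, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Gene = c("A", "B", "C", "D"),
                  Cell = paste("Cell", 1:4))
)

# Number of zero counts in each cell (column) and each gene (row)
zeros_per_cell <- colSums(counts == 0)
zeros_per_gene <- rowSums(counts == 0)

# Overall proportion of zeros in the matrix
prop_zero <- mean(counts == 0)
```

Counting zeros only describes the matrix. It cannot say which of the explanations listed on the slide applies to any particular zero.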

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

www. .org

Slide 9

Slide 9 text

Simulations

Provide a truth to test against

BUT
- Often poorly documented and explained
- Not easily reproducible or reusable
- Don’t demonstrate similarity to real data

Slide 10

Slide 10 text

“Splatter: simulation of single-cell RNA sequencing data.” Genome Biology (2017) DOI: 10.1186/s13059-017-1305-0

Slide 11

Slide 11 text

The idea... Multiple simulations, same interface

[Diagram: Real data -> Estimation -> Parameters -> Simulation -> Dataset]

Slide 12

Slide 12 text

Building a package
- Bioconductor
- Devtools
- Roxygen
- testthat
- Checkmate
- GitHub
- Codecov (covr)
- Travis/Appveyor?

Slide 13

Slide 13 text

Checkmate

Manual argument checks (excerpt; the slide cuts the function off after the checks on 'n'):

    fact <- function(n, method = "stirling") {
      if (length(n) != 1)
        stop("Argument 'n' must have length 1")
      if (!is.numeric(n))
        stop("Argument 'n' must be numeric")
      if (is.na(n))
        stop("Argument 'n' may not be NA")
      if (is.double(n)) {
        if (is.nan(n))
          stop("Argument 'n' may not be NaN")
        if (is.infinite(n))
          stop("Argument 'n' must be finite")
        if (abs(n - round(n, 0)) > sqrt(.Machine$double.eps))
          stop("Argument 'n' must be an integerish value")
        n <- as.integer(n)
      }
      # ... (rest of the function is not shown on the slide)
    }

The same function using checkmate assertions:

    fact <- function(n, method = "stirling") {
      # n must be a single non-negative integerish value
      assertCount(n)
      # method must be one of the two supported choices
      assertChoice(method, c("stirling", "factorial"))

      if (method == "factorial")
        factorial(n)
      else
        sqrt(2 * pi * n) * (n / exp(1))^n
    }

Slide 14

Slide 14 text

Counts Parameters Estimation

Slide 15

Slide 15 text

Storing parameters Custom object Same structure, names - One object? Different objects

Slide 16

Slide 16 text

Params (Virtual)
- SimpleParams
- SplatParams
- LunParams
- SCDDParams
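A virtual parent class with concrete children can be sketched with S4 classes from the methods package. This is a minimal illustration, assuming S4 is the class system in use; the slots and default values below are placeholders chosen for the example, not the definitions used in Splatter.

```r
library(methods)

# Virtual parent: holds slots shared by every simulation's parameters.
# Slot names and defaults here are illustrative assumptions.
setClass("Params",
         contains = "VIRTUAL",
         slots = c(nGenes = "numeric", nCells = "numeric", seed = "numeric"),
         prototype = prototype(nGenes = 10000, nCells = 100, seed = 1))

# Concrete child: inherits the shared slots and adds its own.
setClass("SimpleParams",
         contains = "Params",
         slots = c(mean.shape = "numeric", mean.rate = "numeric"),
         prototype = prototype(mean.shape = 0.4, mean.rate = 0.3))

# A child can be instantiated, overriding any default.
params <- new("SimpleParams", nGenes = 500)
is_params <- is(params, "Params")

# The virtual parent cannot be instantiated directly.
virtual_error <- tryCatch(new("Params"),
                          error = function(e) conditionMessage(e))
```

With this layout, functions written against the parent class accept any of the child classes, while each simulation keeps its own parameter names.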

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Parameters Dataset Simulation

Slide 20

Slide 20 text

Storing simulations Simulated counts Intermediate values Parameters?

Slide 21

Slide 21 text

SCESet (extends ExpressionSet)
- assayData: Matrix
- featureData: AnnotatedDataFrame
- phenoData: AnnotatedDataFrame

Slide 22

Slide 22 text

SingleCellExperiment (extends SummarizedExperiment)
- assays: Matrix
- rowData: DataFrame
- colData: DataFrame
- metadata: List

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Science!

Slide 25

Slide 25 text

Splat
- Negative binomial
- Expression outliers
- Defined library sizes
- Mean-variance trend
- Dropout
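The five ingredients listed for Splat can be strung together in a few lines of base R. This is a heavily simplified sketch and not the Splat implementation: every distribution choice, parameter value and variable name below is an assumption made for illustration.

```r
set.seed(1)
n_genes <- 200
n_cells <- 30

# Gene means drawn from a gamma distribution (parameter values are arbitrary)
gene_means <- rgamma(n_genes, shape = 0.6, rate = 0.3)

# Expression outliers: a few genes get a mean far above the median
is_outlier <- rbinom(n_genes, 1, 0.05) == 1
gene_means[is_outlier] <- median(gene_means) *
  rlnorm(sum(is_outlier), meanlog = 4, sdlog = 0.5)

# Defined library sizes: each cell's expected total count is set explicitly
lib_sizes <- rlnorm(n_cells, meanlog = 11, sdlog = 0.2)
cell_means <- outer(gene_means / sum(gene_means), lib_sizes)

# Mean-variance trend: variability (BCV) shrinks as the mean grows
bcv <- 0.1 + 1 / sqrt(cell_means)

# Negative binomial counts as a gamma-Poisson mixture
true_means <- matrix(
  rgamma(n_genes * n_cells, shape = 1 / bcv^2, scale = cell_means * bcv^2),
  nrow = n_genes
)
true_counts <- matrix(rpois(n_genes * n_cells, lambda = true_means),
                      nrow = n_genes)

# Dropout: logistic in log mean, so low means are more likely to be zeroed
dropout_mid <- 0
drop_prob <- 1 / (1 + exp(log(true_means) - dropout_mid))
dropped <- matrix(rbinom(n_genes * n_cells, 1, drop_prob) == 1,
                  nrow = n_genes)
counts <- true_counts
counts[dropped] <- 0
```

Keeping `true_counts` alongside `counts` illustrates why intermediate values are worth storing: they record which zeros were introduced by the dropout step.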

Slide 26

Slide 26 text

Simulations

Simple - Negative binomial

Lun - NB with cell factors
Lun ATL, Bach K, Marioni JC. Genome Biology (2016). DOI: 10.1186/s13059-016-0947-7.

Lun 2 - Sampled NB with batch effects
Lun ATL, Marioni JC. Biostatistics (2017). DOI: 10.1093/biostatistics/kxw055.

scDD - NB with bimodality
Korthauer KD, et al. Genome Biology (2016). DOI: 10.1186/s13059-016-1077-y.

BASiCS - NB with spike-ins
Vallejos CA, Marioni JC, Richardson S. PLoS Comp. Bio. (2015). DOI: 10.1371/journal.pcbi.1004333.

Slide 27

Slide 27 text

Using Splatter

1. Estimate

    params1 <- splatEstimate(real.data)
    params2 <- simpleEstimate(real.data)

2. Simulate

    sim1 <- splatSimulate(params1, ...)
    sim2 <- simpleSimulate(params2, ...)

3. Compare

    datasets <- list(Real = real.data, Splat = sim1, Simple = sim2)
    comp <- compareSCESets(datasets)
    diff <- diffSCESets(datasets, ref = "Real")

Slide 28

Slide 28 text

Means Difference in Means

Slide 29

Slide 29 text

Zeros per cell Difference in zeros

Slide 30

Slide 30 text

Mean-zeros Difference
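The comparison panels (means, zeros per cell, mean-zeros) rest on simple per-gene and per-cell summaries. The base R sketch below shows one way such summaries and their differences could be computed. It is not the code behind compareSCESets or diffSCESets: the two matrices are made-up negative binomial draws standing in for real and simulated data, and taking differences between sorted values is an assumption of this example.

```r
set.seed(7)
n_genes <- 300
n_cells <- 40

# Stand-ins for a real and a simulated count matrix (both made up here)
real <- matrix(rnbinom(n_genes * n_cells, mu = 2, size = 0.5), nrow = n_genes)
sim  <- matrix(rnbinom(n_genes * n_cells, mu = 2, size = 5), nrow = n_genes)

# Per-gene and per-cell summaries of a count matrix
summarise_counts <- function(counts) {
  list(
    gene_means     = rowMeans(log2(counts + 1)),   # mean log expression per gene
    zeros_per_cell = colMeans(counts == 0) * 100,  # percentage of zeros per cell
    zeros_per_gene = rowMeans(counts == 0) * 100   # percentage of zeros per gene
  )
}

real_stats <- summarise_counts(real)
sim_stats  <- summarise_counts(sim)

# Genes and cells are not matched between datasets, so compare sorted values
diff_means <- sort(sim_stats$gene_means) - sort(real_stats$gene_means)
diff_zeros <- sort(sim_stats$zeros_per_cell) - sort(real_stats$zeros_per_cell)
```

A simulation that resembles the reference gives differences close to zero. Here the stand-in simulation is less dispersed than the stand-in real data, so it has fewer zeros per cell and `diff_zeros` is negative.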

Slide 31

Slide 31 text

[Figure: ranking plot; rank scale from 1 to 8]

Slide 32

Slide 32 text

[Figure: ranking plot; rank scale from 1 to 8; two panels labelled Full-length]

Slide 33

Slide 33 text

Complex simulations
- Groups
- Batches
- Paths

Slide 34

Slide 34 text

New in Splatter 1.2.0
Bioconductor 3.6

SingleCellExperiment
Batch effects
Simulations
- BASiCS
- mfa
- PhenoPath
- ZINB-WaVE

Slide 35

Slide 35 text

Summary
- Many tools for scRNA-seq analysis
- Catalogued in the scRNA-tools database
- Can be tested using synthetic datasets
- Splatter is our package for simulating scRNA-seq data
- Making a package is not as hard as you think

Slide 36

Slide 36 text

@_lazappi_
oshlacklab.com

Supervisors
Alicia Oshlack
Melissa Little

MCRI Bioinformatics
Belinda Phipson
Breon Schmidt

Everyone that makes tools and data available

www.scRNA-tools.org
@scRNAtools

“Splatter: simulation of single-cell RNA sequencing data.” Genome Biology (2017) DOI: 10.1186/s13059-017-1305-0

“Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database” bioRxiv (2017) DOI: 10.1101/206573

bioconductor.org/packages/splatter

Slide 37

Slide 37 text

Negative binomial

Slide 38

Slide 38 text

Real data

Tung et al. iPSCs, C1 capture
- 3 HapMap individuals
- 3 plates each
- 200 random cells

Tung P-Y et al. Sci. Rep. (2017) DOI: 10.1038/srep39921

[Figure: panels labelled A1 A2 A3 A, B1 B2 B3 B, C1 C2 C3 C]