BiocAsia 2017

Splatter, a package for simulating single-cell RNA sequencing data
Luke Zappia @_lazappi_

Single-cell RNA-seq
Reads: ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA
Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12      10      9       0
B     0       0       0       1
C     9       6       0       0
D     7       0       4       0
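The count matrix on this slide can be written down directly in R. The following is a minimal base R sketch; the object names (`counts`, `zeros_per_gene` and so on) are illustrative choices and do not come from the talk. It also tallies the zeros, which are the entries the next slide raises questions about.

```r
# Count matrix from the slide: genes in rows, cells in columns
counts <- matrix(
  c(12, 10, 9, 0,
     0,  0, 0, 1,
     9,  6, 0, 0,
     7,  0, 4, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), paste("Cell", 1:4))
)

# Number of zero counts for each gene and for each cell
zeros_per_gene <- rowSums(counts == 0)
zeros_per_cell <- colSums(counts == 0)

# Overall proportion of zeros in the matrix
prop_zero <- mean(counts == 0)
```

Even in this toy matrix a large share of the entries are zero, and the counts alone do not say why.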

Svensson et al. arXiv 1704.01379, 2017

Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12 0 10 9 0
B     0       0       0       1
C     9       6       0       0
D     7       0       4       0
- Bad cell?
- Low expression?
- Cell type specific?
- Cell cycle?
- Dropout?

www. .org

Simulations
Provide a truth to test against BUT
- Often poorly documented and explained
- Not easily reproducible or reusable
- Don’t demonstrate similarity to real data

“Splatter: simulation of single-cell RNA sequencing data.” Genome Biology (2017)
DOI: 10.1186/s13059-017-1305-0

The idea...
Multiple simulations, same interface
Real data -> [Estimation] -> Parameters -> [Simulation] -> Dataset

Building a package
- Bioconductor
- Devtools
- Roxygen
- testthat
- Checkmate
- GitHub
- Codecov (covr)
- Travis/Appveyor?

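Checkmate
Without checkmate (the slide cuts this version off partway through its checks):
    fact <- function(n, method = "stirling") {
      if (length(n) != 1)
        stop("Argument 'n' must have length 1")
      if (!is.numeric(n))
        stop("Argument 'n' must be numeric")
      if (is.na(n))
        stop("Argument 'n' may not be NA")
      if (is.double(n)) {
        if (is.nan(n))
          stop("Argument 'n' may not be NaN")
        if (is.infinite(n))
          stop("Argument 'n' must be finite")
        if (abs(n - round(n, 0)) > sqrt(.Machine$double.eps))
          stop("Argument 'n' must be an integerish value")
        n <- as.integer(n)
      }
      # Remaining checks are not shown on the slide. The computation below is
      # copied from the checkmate version so that the function is complete.
      if (method == "factorial")
        factorial(n)
      else
        sqrt(2 * pi * n) * (n / exp(1))^n
    }
With checkmate:
    library(checkmate)
    fact <- function(n, method = "stirling") {
      # One assertion per argument replaces the hand-written checks
      assertCount(n)
      assertChoice(method, c("stirling", "factorial"))
      if (method == "factorial")
        factorial(n)
      else
        sqrt(2 * pi * n) * (n / exp(1))^n
    }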
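To see what the two checkmate assertions accept and reject, the `test*()` counterparts of the `assert*()` functions can be used, since they return TRUE or FALSE rather than stopping. This is a small sketch that assumes the checkmate package is installed; the variable names are illustrative.

```r
library(checkmate)

# test*() variants return TRUE/FALSE instead of raising an error
ok_count   <- testCount(5)      # a single non-negative whole number
neg_count  <- testCount(-1)     # negative values are rejected
frac_count <- testCount(1.5)    # non-integerish values are rejected
na_count   <- testCount(NA)     # missing values are rejected by default

ok_choice  <- testChoice("stirling", c("stirling", "factorial"))
bad_choice <- testChoice("gamma", c("stirling", "factorial"))

# assert*() variants stop with a descriptive message on failure;
# here the message is captured rather than halting the script
msg <- tryCatch(assertCount(-1), error = function(e) conditionMessage(e))
```

The `assert*()` form suits argument checking at the top of a function, while the `test*()` form is handy inside conditionals.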

Estimation: Counts -> Parameters

Storing parameters
- Custom object
- Same structure, names
- One object?
- Different objects

Params (Virtual)
- SimpleParams
- SplatParams
- LunParams
- SCDDParams
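A virtual parent class with one concrete subclass per simulation can be expressed with S4 roughly as follows. This is a sketch of the pattern only: the slot names, default values and the accessor `paramValue()` are hypothetical and are not taken from Splatter's source.

```r
library(methods)

# Virtual parent: holds slots shared by every simulation, cannot be instantiated
setClass("Params",
         contains = "VIRTUAL",
         slots = c(nGenes = "numeric", nCells = "numeric", seed = "numeric"),
         prototype = prototype(nGenes = 10000, nCells = 100, seed = 1))

# Concrete subclass: adds the parameters specific to one simulation
# (slot names and defaults here are illustrative assumptions)
setClass("SimpleParams",
         contains = "Params",
         slots = c(mean.shape = "numeric", mean.rate = "numeric"),
         prototype = prototype(mean.shape = 0.4, mean.rate = 0.3))

# A hypothetical accessor defined once on the parent works for every subclass
setGeneric("paramValue", function(object, name) standardGeneric("paramValue"))
setMethod("paramValue", "Params", function(object, name) slot(object, name))

params <- new("SimpleParams", nGenes = 500)
n_genes <- paramValue(params, "nGenes")

# Trying to create the virtual class directly fails
virtual_error <- inherits(try(new("Params"), silent = TRUE), "try-error")
```

Methods written for the parent give every simulation the same interface, while each subclass keeps only the parameters it needs.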

Simulation: Parameters -> Dataset

Storing simulations
- Simulated counts
- Intermediate values
- Parameters?

SCESet (extends ExpressionSet)
- assayData: Matrix
- featureData: AnnotatedDataFrame
- phenoData: AnnotatedDataFrame

SingleCellExperiment (extends SummarizedExperiment)
- assays: Matrix
- rowData: DataFrame
- colData: DataFrame
- metadata: List
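The four components on this slide map onto the arguments of the `SingleCellExperiment()` constructor. The sketch below assumes the SingleCellExperiment Bioconductor package is installed; the toy data, column names (`GeneMean`, `LibSize`) and the `Params` metadata entry are made up for illustration and are not what Splatter stores.

```r
library(SingleCellExperiment)

set.seed(1)
# Toy counts: 5 genes by 4 cells
counts <- matrix(rpois(20, lambda = 5), nrow = 5,
                 dimnames = list(paste0("Gene", 1:5), paste0("Cell", 1:4)))

sce <- SingleCellExperiment(
  assays   = list(counts = counts),                    # matrices of values
  rowData  = DataFrame(GeneMean = rowMeans(counts)),   # per-gene information
  colData  = DataFrame(LibSize = colSums(counts)),     # per-cell information
  metadata = list(Params = list(nGenes = 5, nCells = 4))  # anything else
)

# Each component has a matching accessor
gene_info <- rowData(sce)
cell_info <- colData(sce)
```

Because `metadata` is a plain list, it can carry arbitrary objects alongside the counts, which is one way to address the "Parameters?" question on the earlier slide.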

Science!

Splat
- Negative binomial
- Expression outliers
- Defined library sizes
- Mean-variance trend
- Dropout
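To make the five ingredients concrete, here is a deliberately simplified base R sketch that touches each one in turn. It is not the Splat model as implemented in Splatter: the function name `simulate_counts()`, every distribution choice and every parameter value are assumptions picked only to keep the example short.

```r
simulate_counts <- function(n_genes = 200, n_cells = 50, seed = 1) {
  set.seed(seed)

  # Gene means drawn from a gamma distribution
  gene_means <- rgamma(n_genes, shape = 0.6, rate = 0.3)

  # Expression outliers: a small fraction of genes get inflated means
  is_outlier <- rbinom(n_genes, 1, 0.05) == 1
  gene_means[is_outlier] <- gene_means[is_outlier] *
    rlnorm(sum(is_outlier), meanlog = 2, sdlog = 0.5)

  # Defined library sizes: each cell's expected total is set explicitly
  lib_sizes <- rlnorm(n_cells, meanlog = 9, sdlog = 0.3)
  cell_means <- outer(gene_means / sum(gene_means), lib_sizes)

  # Mean-variance trend: dispersion shrinks as the mean grows
  disp <- 0.1 + 1 / sqrt(cell_means + 1)

  # Negative binomial written as a gamma-Poisson mixture
  adj_means <- rgamma(length(cell_means), shape = 1 / disp,
                      scale = cell_means * disp)
  counts <- matrix(rpois(length(adj_means), adj_means), nrow = n_genes)

  # Dropout: lowly expressed entries are more likely to be set to zero
  p_drop <- 1 / (1 + exp(log(cell_means + 1) - 1))
  keep <- matrix(rbinom(length(counts), 1, 1 - p_drop), nrow = n_genes)

  counts * keep
}

sim_counts <- simulate_counts()
```

Drawing a gamma-distributed mean and then a Poisson count is equivalent to drawing from a negative binomial, which is why the mixture form is convenient when intermediate values need to be kept.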

Simulations
- Simple - Negative binomial
- Lun - NB with cell factors
  Lun ATL, Bach K, Marioni JC. Genome Biology (2016). DOI: 10.1186/s13059-016-0947-7.
- Lun 2 - Sampled NB with batch effects
  Lun ATL, Marioni JC. Biostatistics (2017). DOI: 10.1093/biostatistics/kxw055.
- scDD - NB with bimodality
  Korthauer KD, et al. Genome Biology (2016). DOI: 10.1186/s13059-016-1077-y.
- BASiCS - NB with spike-ins
  Vallejos CA, Marioni JC, Richardson S. PLoS Comp. Bio. (2015). DOI: 10.1371/journal.pcbi.1004333.

Using Splatter
1. Estimate
    params1 <- splatEstimate(real.data)
    params2 <- simpleEstimate(real.data)
2. Simulate
    sim1 <- splatSimulate(params1, ...)
    sim2 <- simpleSimulate(params2, ...)
3. Compare
    datasets <- list(Real = real.data, Splat = sim1, Simple = sim2)
    comp <- compareSCESets(datasets)
    diff <- diffSCESets(datasets, ref = "Real")

Means Difference in Means

Zeros per cell Difference in zeros

Mean-zeros Difference

Rank 1 8

Rank 1 8 Full-length Full-length

Complex simulations
- Groups
- Batches
- Paths
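A sketch of how groups, batches and paths might be requested from the Splat simulation is shown below. It assumes the splatter package is installed and that `splatSimulate()` accepts the `method`, `batchCells` and `group.prob` arguments and records `Batch` and `Group` columns in the cell annotation, as in the Splatter releases I am aware of; argument names have changed between versions, so check the documentation for the installed release. The sizes and probabilities are arbitrary.

```r
library(splatter)

# Small default parameter set so the example runs quickly
params <- newSplatParams(nGenes = 200, seed = 1)

# Two batches of 50 cells, each cell assigned to one of two groups
sim_groups <- splatSimulate(params,
                            batchCells = c(50, 50),
                            group.prob = c(0.5, 0.5),
                            method = "groups",
                            verbose = FALSE)

# Same layout, but cells lie along continuous paths instead of discrete groups
sim_paths <- splatSimulate(params,
                           batchCells = c(50, 50),
                           group.prob = c(0.5, 0.5),
                           method = "paths",
                           verbose = FALSE)

# Batch and group labels are stored with the cells
cell_info <- colData(sim_groups)
```

Keeping the labels with the simulated cells is what makes these datasets usable as a known truth for testing clustering, batch correction or trajectory methods.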

New in Splatter 1.2.0 (Bioconductor 3.6)
- SingleCellExperiment
- Batch effects
- Simulations
  - BASiCS
  - mfa
  - PhenoPath
  - ZINB-WaVE

Summary
- Many tools for scRNA-seq analysis
- Catalogued in the scRNA-tools database
- Can be tested using synthetic datasets
- Splatter is our package for simulating scRNA-seq data
- Making a package is not as hard as you think

@_lazappi_
oshlacklab.com
Supervisors: Alicia Oshlack, Melissa Little
MCRI Bioinformatics: Belinda Phipson, Breon Schmidt
Everyone that makes tools and data available
www.scRNA-tools.org @scRNAtools
“Splatter: simulation of single-cell RNA sequencing data.” Genome Biology (2017) DOI: 10.1186/s13059-017-1305-0
“Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database” bioRxiv (2017) DOI: 10.1101/206573
bioconductor.org/packages/splatter

Negative binomial

Real data
Tung et al. iPSCs, C1 capture
- 3 HapMap individuals (A, B, C)
- 3 plates each (A1 A2 A3, B1 B2 B3, C1 C2 C3)
- 200 random cells
Tung P-Y et al. Sci. Rep. (2017) DOI: 10.1038/srep39921
