Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor

Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor Stephanie Hicks Assistant Professor,
Biostatistics Johns Hopkins Bloomberg School of Public Health Bioconductor Asia Conference November 29, 2018 AMSI BioinfoSummer December 4, 2018

Teaching: Data Science Research: Genomics (single-cell RNA-seq) • Bioconductor user
and developer (since 2009/2010) Other fun things about me: • Co-founded Baltimore • Creating a children’s book featuring women statisticians and data scientists ABOUT ME JOHNS HOPKINS BLOOMBERG SCHOOL OF PUBLIC HEALTH

• Began in 2001 and now includes 12 core developers
• Big priorities: reproducible research and high-quality documentation • Vignettes • Large community support • Workflows (super helpful to newbs) • Teaching resources and open development

New to R/Bioconductor? How to get started? #rcatladies #rladies

Workflows

Interoperability within R/Bioconductor data("airway", package = "airway") # SummarizedExperiment object
se <- airway # Simple subsetting se[1:100, 1:10]

se <- airway # Differential expression with DESeq2 library("DESeq2") dds <- DESeqDataSet(se = airway, design = ~ cell + dex) dds <- DESeq(dds) deRes <- as.data.frame(results(dds)) # Correction for multiple testing using Benjamini & Hochberg (1995) approach p.adjust(dds$pvalue, method = "BH")

se <- airway # Differential expression with DESeq2 library("DESeq2") dds <- DESeqDataSet(se = airway, design = ~ cell + dex) dds <- DESeq(dds) deRes <- as.data.frame(results(dds)) # Correction for multiple testing using IHW ihwRes <- ihw(pvalue ~ baseMean, data = deRes, alpha = 0.1)

cool, but isn’t this talk supposed to be about single-cell
RNA-sequencing?

Single-cell RNA-seq (scRNA-seq) Cell 1 Cell 2 … Gene 1
18 0 Gene 2 1010 506 Gene 3 0 49 Gene 4 22 0 … Read Counts Gene 1 Compare gene expression profiles of single cells Tissue (e.g. tumor) Isolate and sequence individual cells Cells Genes Principal Component 2 Principal Component 1 Cell 1

Svensson et al. (2017).

Variability in scRNA-Seq data Signal Noise Extrinsic noise (regulation by
transcription factors) vs Intrinstic noise (stochastic bursting/firing, cell cycle) overdispersion, batch effects, sequencing depth, gc bias, amplification bias e.g. capture efficiency (starting amount of mRNA) Variability visible in bulk RNA-Seq Additional variability in scRNA-Seq data Total variation Technical noise Biological noise

Common types scRNA-Seq data Adapted from Kolodziejczyk et al. (2015).
Molecular Cell 58 Heterogeneous cell populations Purified cell populations Battle droids Super battle droids! Big room full of unknown R-Series droids

Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor

Common data structures for single-cell data bioconductor/packages/ SingleCellExperiment

Common data structures for single-cell data Biocverse = project to
create an interactive data viz of the biocoductor ecosystem Shian Su

Let’s get started with some data!

> library(TENxBrainData) > tenx = TENxBrainData() # easy access >
tenx # big data! class: SingleCellExperiment dim: 27998 1306127 > pryr::object_size(tenx) 200 MB

library(TENxPBMCData) tenx_pbmc3k <- TENxPBMCData(dataset = "pbmc3k") tenx_pbmc3k ## class: SingleCellExperiment
## dim: 32738 2700 ## metadata(0): ## assays(1): counts ## rownames(32738): ENSG00000243485 ENSG00000237613 ... ## ENSG00000215616 ENSG00000215611 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(11): Sample Barcode ... Individual Date_published ## reducedDimNames(0): ## spikeNames(0):

counts(tenx_pbmc3k) ## <32738 x 2700> DelayedMatrix object of type "integer":
## [,1] [,2] [,3] [,4] ... [,2697] [,2698] ## ENSG00000243485 0 0 0 0 . 0 0 ## ENSG00000237613 0 0 0 0 . 0 0 ## ENSG00000186092 0 0 0 0 . 0 0 ## ENSG00000238009 0 0 0 0 . 0 0 ## ENSG00000239945 0 0 0 0 . 0 0 ## ... . . . . . . . ## ENSG00000215635 0 0 0 0 . 0 0 ## ENSG00000268590 0 0 0 0 . 0 0 ## ENSG00000251180 0 0 0 0 . 0 0 ## ENSG00000215616 0 0 0 0 . 0 0 ## ENSG00000215611 0 0 0 0 . 0 0 ## [,2699] [,2700] ## ENSG00000243485 0 0

Simulate scRNA-seq counts with • Uses a gamma-Poisson distribution used
to generate a gene by cell matrix of counts (SingleCellExperiment) Zappia, Phipson, Oshlack (2017). Genome Biol library(splatter) tenx_pbmc3k_sub <- as.matrix(counts(tenx_pbmc3k)) params <- splatEstimate(tenx_pbmc3k_sub[,1:100]) sim <- splatSimulate(params) sim ## class: SingleCellExperiment ## dim: 32738 100 ## metadata(1): Params ## assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts ## rownames(32738): Gene1 Gene2 ... Gene32737 Gene32738 … (and more!)

Calculate quality control metrics library(scater) tenx_pbmc3k <- calculateQCMetrics(tenx_pbmc3k)

library(scater) tenx_pbmc3k <- calculateQCMetrics(tenx_pbmc3k) colnames(colData(tenx_pbmc3k)) ## [1] "Sample" "Barcode" ##
[3] "Sequence" "Library" ## [5] "Cell_ranger_version" "Tissue_status" ## [7] "Barcode_type" "Chemistry" ## [9] "Sequence_platform" "Individual" ## [11] "Date_published" "is_cell_control" ## [13] "total_features_by_counts" "log10_total_features_by_counts" ## [15] "total_counts" "log10_total_counts" ## [17] "pct_counts_in_top_50_features" "pct_counts_in_top_100_features" ## [19] "pct_counts_in_top_200_features" "pct_counts_in_top_500_features"

colnames(rowData(tenx_pbmc3k)) ## [1] "ENSEMBL_ID" "Symbol_TENx" ## [3] "Symbol" "is_feature_control" ##
[5] "mean_counts" "log10_mean_counts" ## [7] "n_cells_by_counts" "pct_dropout_by_counts" ## [9] "total_counts" "log10_total_counts"

What if my data is more “raw” than a SingleCellExperiment
object?

Going from “raw” data to “clean” data Kindly borrowed from
Davis McCarthy’s Slides at Genome Informatics 2016 https://speakerdeck.com/davismcc/what-do-we-need-computationally-to-make-the-most-of-single-cell-rna-seq-data

Tian , Su, Dong, … , Ritchie (2018). PLoS Comp
Bio McCarthy, Campbell, Lun and Wills (2017). Bioinf SingleCellExperiment SingleCellExperiment SingleCellExperiment

Exploratory Data Analysis Adapted from Stegle et al. (2015) Nature
Reviews Genetics 16: 133-145 Lun et al. (2016) F1000 Cell-level quality control Gene-level quality control isOutlier(sce$unmapped_reads, type=”greater”, nmads=3)

Workflows

https://www.bioconductor.org/packages/release/ workflows/html/simpleSingleCell.html

https://www.bioconductor.org/packages/release/workflows/ vignettes/simpleSingleCell/inst/doc/xtra-5-bigdata.html

So… what about normalization and dealing with other technical variation
in scRNA-Seq data? Much to learn, we still have ….

Normalization • Without Spike-ins • Between-sample normalization methods • Global
scaling factors mostly developed for bulk RNA-Seq

Normalization “Case studies and simulated data sets indicated that …
bulk RNA-seq normalization methods are not appropriate in the context of scRNA-seq, where—on account of biological heterogeneity and technical artifacts—we typically observe more heterogeneous and sparser data sets.” Vallejos et al. (2017) Nature Methods

Normalization “The choice of the normalization method affects downstream analyses,
such as highly variable gene (HVG) detection that aim to uncover heterogeneity within the data.” Vallejos et al. (2017) Nature Methods

Normalization Vallejos et al. (2017) Nature Methods “Global-scaling methods rely
on different assumptions, these methods all fail if the number or fold change of differentially expressed genes across the cell population is too high.”

Normalization • Without Spike-ins • scran and SCnorm (standalone) •
BASiCs and zinbwave • scone • Convert relative read counts into mRNA counts • Census (Monocle)

Normalization • Without Spike-ins • scran and SCnorm (standalone) •
BASiCs and zinbwave • scone • Convert relative read counts into mRNA counts • Census (Monocle) • With Spike-ins • Spike-ins: theoretically a good idea, but many challenges still remain for scRNA-Seq • UMIs: Reduces amplification bias, not appropriate for isoform or allele-specific expression • Biological (nuisance?) variability • differences among cells in cell-cycle stage or cell size (try cyclone() in scran)

“Hey, someone told me of this thing called batch effects….
Should I be worried about them in my scRNA-Seq data?”

Discovering new cell types or subtypes • Lots of sparsity
à distance estimates • Cell-to-cell variation

Or is it a batch effect?

Adjusting for known & unknown covariates / batch effects •
RUVseq() or svaseq() • Linear models with e.g. removeBatchEffect() in limma or scater • ComBat() in sva • mnnCorrect() in scran

Feature selection Supervised • Differentially expressed genes between homogeneous subpops

Unsupervised • Identifying genes via a null model • Highly variable genes • trendVar() and decomposeVar() in scran

Unsupervised • Identifying genes via a null model • High dropout genes • M3drop

Unsupervised • Identifying genes via a null model • Highly and lowly variable genes • BASiCs

Unsupervised • Gene-gene correlations • PCA loadings

• BiocSingular • Exact and approximate methods for SVD and
PCA • Easily interacts with Bioconductor packages or workflows • Uses BiocParallel framework Dimensionality reduction https://github.com/LTLA/BiocSingular

Data visualization: Plot cells along first two Principal Components −5
0 5 −5 0 5 Component 1: 3% variance Component 2: 2% variance replicate r1 r2 r3 total_features 7000 7500 8000 8500 9000 9500

• Hierarchical clustering Clustering The hierarchical clustering dendrogram http://hemberg-lab.github.io/scRNA.seq.course

• Hierarchical clustering • K-means Clustering Schematic representation of the
k-means clustering http://hemberg-lab.github.io/scRNA.seq.course

• Hierarchical clustering • K-means • Graph-based methods Clustering http://hemberg-lab.github.io/scRNA.seq.course
Schematic representation of the graph network

SC3 (Kiselev et al. 2017) • Utilizes k-means & consensus
clustering A few CRAN & Bioconductor packages for clustering scRNA-seq that I like/use (but by no means exhaustive) clustree (Zappia and Oshlack 2018) • Visualization of relationships between clusters at multiple resolutions clusterExperiment (Purdom and Risso) • Resampling based consensus clustering

mbkmeans (Risso, Purdom and Hicks) • Mini-batch k-means (Sculley, 2010)
• Faster than regular k-means, and more accurate than stochastic gradient descent • No need to keep all the data in memory, but only one batch at the time, which makes it a good candidate for our purpose A recent contribution: mini-batch k-means

R / C++: ClusterR package (on CRAN) Python: implemented as
part of scikit-learn Mini-batch k-means: existing implementations

https://github.com/drisso/mbkmeans • Extend the ClusterR C++ implementation to allow HDF5
input (beachmat) • Interface with SingleCellExperiment (through beachmat and DelayedArray) • Starting to benchmark against Python implementation (already HDF5-ready) both directly and through the reticulate and BiocSklearn R packages • reticulate = embeds an interactive Python session within R session, translates between Python and R objects • BiocSklearn = Interfaces to sci-kit learn in Python allowing implementation of k-means with large out-of-memory data Mini-batch k-means: our approach

Interactive SummarizedExperiment (iSEE) https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k/ Creators: Federico Marini, Aaron Lun, Charlotte
Soneson, and Kevin Rue-Albrecht

Next: downstream analyses! • Imputation • SAVER, DrImpute, scImpute, MAGIC
• Andrews and Hemberg (2018). False signals induced by single-cell imputation. F1000 • Differential expression • SCDE, MAST, scDD, BASiCs, M3Drop… • zinbwave dropout weights + edgeR/DESeq2 • Trajectory analyses / pseudotime inference • slingshot, ouija, Monocole3 • Gene set analyses • Gene expression networks • Cell ontology / annotation

Challenge: Data standardizations and Interoperability

Disclosure: I’m biased towards R, but have a great deal
of respect for all languages!

Common data structures for single-cell data bioconductor/packages/ SingleCellExperiment scater, MAST,
scDD, scPipe, scran, splatter, zinbwave, DropletUtils, clusterExperiment, SC3, destiny, and BASiCS

Interoperability within R/Bioconductor

Convert between data structures for single-cell data

Interface with Python code using reticulate R package $ pip
install sklearn # install Python dependency $ R --vanilla # start R > sklearn = reticulate::import("sklearn") # load interface to sklearn > clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier

Challenge: Scalability and interactivity

Scalable interactivity with large data

Interactive Data Visualization (iSEE) https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k/ Creators: Federico Marini, Aaron Lun,
Charlotte Soneson, and Kevin Rue-Albrecht

Challenge: More human readable code

Standard Bioconductor Data Structures: GRanges GenomicRanges (GRanges) Lee et al.
(2018), bioRixv

Create granges object library(GenomicRanges) gr <- GRanges(seqnames = "chr1", strand
= c("+", "-"), ranges = IRanges(start = c(102012,520211), end=c(120303, 526211)), gene_id = c(1001,2151), score = c(10, 25)) gr ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | gene_id score ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr1 102012-120303 + | 1001 10 ## [2] chr1 520211-526211 - | 2151 25 ## ------- ## seqinfo: 1 sequence from an unspecified genome

Things you can do with Granges objects width(gr) ## [1]
18292 6001 gr[gr$score > 15, ] ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | gene_id score ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr1 520211-526211 - | 2151 25 ## ------- ## seqinfo: 1 sequence from an unspecified genome

Genomic verbs/actions + tidy data = plyranges • Goal: Write
human readable analysis workflows • Idea: Define an API (i.e. extend dplyr) that maps relational genomic algebra to “verbs” that act on ”tidy” genomic data • Another great idea: Borrow dplyr’s syntax and design principles • And another great idea: Compose verbs together with pipe operator from magrittr Stuart Lee Di Cook Michael Lawrence Lee et al. (2018), bioRixv

library(plyranges) gr %>% filter(score > 15) ## GRanges object with
1 range and 2 metadata columns: ## seqnames ranges strand | gene_id score ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr1 520211-526211 - | 2151 25 ## ------- ## seqinfo: 1 sequence from an unspecified genome gr %>% filter(score > 15) %>% width() ## [1] 6001

$ pip install sklearn # install Python dependency $ R
--vanilla # start R > sklearn = reticulate::import("sklearn") # load interface to sklearn > clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier > library(HCABrowser) > hca = HumanCellAtlas() # authorized access > hca %>% filter(organ == "brain") %>% # familiar paradigm results() %>% as_tibble() # lazy access # A tibble: 432 x 6 name size uuid bundle_fqid bundle_url orga <fct> <int> <fct> <fct> <fct> <fct> 1 a3a818c2-6c1… 29544 92732f5… ede36b63-2570-4… https://dss.inte… brain 2 a3a818c2-6c1… 4341 1e1cdfc… ede36b63-2570-4… https://dss.inte… brain 3 a3a818c2-6c1… 457 6499887… ede36b63-2570-4… https://dss.inte… brain # ... with 429 more rows Usability > library(TENxBrainData) > tenx = TENxBrainData() # easy access > tenx # big data! class: SingleCellExperiment dim: 27998 1306127 > pryr::object_size(tenx) 200 MB Interoperability Scalability Interactivity • ½ million downloads annually • >1600 open-source, open-development packages • Robust and well-documented software • Large-scale and remote access to HCA data • Standard data representations • State-of-the-art computational methods Workflows

Feel free to send comments/questions: Twitter: @stephaniehicks Email: [email protected] #BiocAsia
#rladies Thank you!

Orchestrating Single-Cell RNA-sequencing Analys...

Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor

More Decks by Stephanie Hicks

Other Decks in Science

Featured

Transcript