Orchestrating Single-Cell RNA-sequencing
Analysis with Bioconductor
Stephanie Hicks
Assistant Professor, Biostatistics
Johns Hopkins Bloomberg School of Public Health
Bioconductor Asia Conference
November 29, 2018
AMSI BioinfoSummer
December 4, 2018
Slide 2
Slide 2 text
Teaching: Data Science
Research: Genomics (single-cell RNA-seq)
• Bioconductor user and developer (since 2009/2010)
Other fun things about me:
• Co-founded Baltimore
• Creating a children’s book featuring women statisticians and data scientists
ABOUT ME JOHNS HOPKINS BLOOMBERG
SCHOOL OF PUBLIC HEALTH
Slide 3
Slide 3 text
• Began in 2001 and now includes 12 core developers
• Big priorities: reproducible research and high-quality documentation
• Vignettes
• Large community support
• Workflows (super helpful to newbs)
• Teaching resources
and open development
Slide 4
Slide 4 text
No content
Slide 5
Slide 5 text
No content
Slide 6
Slide 6 text
New to
R/Bioconductor?
How to get started?
#rcatladies
#rladies
Slide 7
Slide 7 text
Workflows
Slide 8
Slide 8 text
Workflows
Slide 9
Slide 9 text
Interoperability within R/Bioconductor
data("airway", package = "airway")
# SummarizedExperiment object
se <- airway
# Simple subsetting
se[1:100, 1:10]
Slide 10
Slide 10 text
Interoperability within R/Bioconductor
data("airway", package = "airway")
# SummarizedExperiment object
se <- airway
# Differential expression with DESeq2
library("DESeq2")
dds <- DESeqDataSet(se = airway,
design = ~ cell + dex)
dds <- DESeq(dds)
deRes <- as.data.frame(results(dds))
# Correction for multiple testing using Benjamini & Hochberg (1995) approach
p.adjust(dds$pvalue, method = "BH")
Slide 11
Slide 11 text
Interoperability within R/Bioconductor
data("airway", package = "airway")
# SummarizedExperiment object
se <- airway
# Differential expression with DESeq2
library("DESeq2")
dds <- DESeqDataSet(se = airway,
design = ~ cell + dex)
dds <- DESeq(dds)
deRes <- as.data.frame(results(dds))
# Correction for multiple testing using IHW
ihwRes <- ihw(pvalue ~ baseMean,
data = deRes, alpha = 0.1)
Slide 12
Slide 12 text
cool, but isn’t this talk supposed to be
about single-cell RNA-sequencing?
Variability in scRNA-Seq data
Signal
Noise
Extrinsic noise (regulation by transcription factors)
vs Intrinstic noise (stochastic bursting/firing, cell cycle)
overdispersion, batch effects, sequencing
depth, gc bias, amplification bias
e.g. capture efficiency (starting amount of mRNA)
Variability visible in bulk
RNA-Seq
Additional variability in
scRNA-Seq data
Total variation
Technical noise Biological noise
Slide 16
Slide 16 text
Common types scRNA-Seq data
Adapted from Kolodziejczyk et al. (2015). Molecular Cell 58
Heterogeneous
cell populations
Purified cell populations
Battle droids Super battle droids!
Big room full of unknown R-Series droids
Slide 17
Slide 17 text
Orchestrating Single-Cell RNA-sequencing
Analysis with Bioconductor
Slide 18
Slide 18 text
Common data structures for single-cell data
bioconductor/packages/
SingleCellExperiment
Slide 19
Slide 19 text
Common data structures for single-cell data
Biocverse =
project to create
an interactive
data viz of the
biocoductor
ecosystem
Shian Su
What if my data is more “raw” than
a SingleCellExperiment
object?
Slide 31
Slide 31 text
Going from “raw” data to “clean” data
Kindly borrowed from Davis McCarthy’s Slides at Genome Informatics 2016
https://speakerdeck.com/davismcc/what-do-we-need-computationally-to-make-the-most-of-single-cell-rna-seq-data
Slide 32
Slide 32 text
Tian , Su, Dong, … , Ritchie (2018). PLoS Comp Bio McCarthy, Campbell, Lun and
Wills (2017). Bioinf
SingleCellExperiment
SingleCellExperiment
SingleCellExperiment
Slide 33
Slide 33 text
Exploratory Data Analysis
Adapted from Stegle et al. (2015) Nature Reviews Genetics 16: 133-145
Lun et al. (2016) F1000
Cell-level quality control
Gene-level
quality control
isOutlier(sce$unmapped_reads,
type=”greater”, nmads=3)
So… what about normalization and dealing with
other technical variation in scRNA-Seq data?
Much to learn, we still have ….
Slide 39
Slide 39 text
Normalization
• Without Spike-ins
• Between-sample normalization methods
• Global scaling factors mostly developed for bulk RNA-Seq
Slide 40
Slide 40 text
Normalization
• Without Spike-ins
• Between-sample normalization methods
• Global scaling factors mostly developed for bulk RNA-Seq
Slide 41
Slide 41 text
Normalization
“Case studies and simulated data sets indicated that
… bulk RNA-seq normalization methods are not
appropriate in the context of scRNA-seq, where—on
account of biological heterogeneity and technical
artifacts—we typically observe more heterogeneous
and sparser data sets.”
Vallejos et al. (2017) Nature Methods
Slide 42
Slide 42 text
Normalization
“The choice of the
normalization method affects
downstream analyses, such as
highly variable gene (HVG)
detection that aim to uncover
heterogeneity within the
data.”
Vallejos et al. (2017) Nature Methods
Slide 43
Slide 43 text
Normalization
Vallejos et al. (2017) Nature Methods
“Global-scaling methods
rely on different
assumptions, these methods
all fail if the number or fold
change of differentially
expressed genes across the
cell population is too high.”
Slide 44
Slide 44 text
Normalization
• Without Spike-ins
• scran and SCnorm (standalone)
• BASiCs and zinbwave
• scone
• Convert relative read counts into mRNA counts
• Census (Monocle)
Slide 45
Slide 45 text
Normalization
• Without Spike-ins
• scran and SCnorm (standalone)
• BASiCs and zinbwave
• scone
• Convert relative read counts into mRNA counts
• Census (Monocle)
• With Spike-ins
• Spike-ins: theoretically a good idea, but many challenges still remain for scRNA-Seq
• UMIs: Reduces amplification bias, not appropriate for isoform or allele-specific
expression
• Biological (nuisance?) variability
• differences among cells in cell-cycle stage or cell size (try cyclone() in scran)
Slide 46
Slide 46 text
“Hey, someone told me of this thing called
batch effects…. Should I be worried about them in my
scRNA-Seq data?”
Slide 47
Slide 47 text
Discovering new cell types or subtypes
• Lots of sparsity à distance estimates
• Cell-to-cell variation
Slide 48
Slide 48 text
Or is it a batch effect?
Slide 49
Slide 49 text
50
Slide 50
Slide 50 text
51
Slide 51
Slide 51 text
Adjusting for known & unknown covariates / batch effects
• RUVseq() or svaseq()
• Linear models with e.g.
removeBatchEffect() in
limma or scater
• ComBat() in sva
• mnnCorrect() in scran
Slide 52
Slide 52 text
Feature selection
Supervised
• Differentially expressed genes between homogeneous subpops
Slide 53
Slide 53 text
Feature selection
Supervised
• Differentially expressed genes between homogeneous subpops
Unsupervised
• Identifying genes via a null model
• Highly variable genes
• trendVar() and
decomposeVar() in scran
Slide 54
Slide 54 text
Feature selection
Supervised
• Differentially expressed genes between homogeneous subpops
Unsupervised
• Identifying genes via a null model
• High dropout genes
• M3drop
Slide 55
Slide 55 text
Feature selection
Supervised
• Differentially expressed genes between homogeneous subpops
Unsupervised
• Identifying genes via a null model
• Highly and lowly variable genes
• BASiCs
• BiocSingular
• Exact and approximate methods for SVD and PCA
• Easily interacts with Bioconductor packages or workflows
• Uses BiocParallel framework
Dimensionality reduction https://github.com/LTLA/BiocSingular
Slide 58
Slide 58 text
Data visualization:
Plot cells along first two Principal Components
−5
0
5
−5 0 5
Component 1: 3% variance
Component 2: 2% variance
replicate
r1
r2
r3
total_features
7000
7500
8000
8500
9000
9500
Slide 59
Slide 59 text
• Hierarchical clustering
Clustering
The hierarchical clustering dendrogram
http://hemberg-lab.github.io/scRNA.seq.course
Slide 60
Slide 60 text
• Hierarchical clustering
• K-means
Clustering
Schematic representation of the k-means clustering
http://hemberg-lab.github.io/scRNA.seq.course
Slide 61
Slide 61 text
• Hierarchical clustering
• K-means
• Graph-based methods
Clustering
http://hemberg-lab.github.io/scRNA.seq.course
Schematic representation of the graph network
Slide 62
Slide 62 text
SC3 (Kiselev et al. 2017)
• Utilizes k-means & consensus clustering
A few CRAN & Bioconductor packages for clustering
scRNA-seq that I like/use (but by no means exhaustive)
clustree (Zappia and Oshlack 2018)
• Visualization of relationships between clusters at multiple resolutions
clusterExperiment (Purdom and Risso)
• Resampling based consensus clustering
Slide 63
Slide 63 text
mbkmeans (Risso, Purdom and Hicks)
• Mini-batch k-means (Sculley, 2010)
• Faster than regular k-means, and more accurate than stochastic
gradient descent
• No need to keep all the data in memory, but only one batch at the
time, which makes it a good candidate for our purpose
A recent contribution: mini-batch k-means
Slide 64
Slide 64 text
R / C++: ClusterR package (on CRAN)
Python: implemented as part of scikit-learn
Mini-batch k-means: existing
implementations
Slide 65
Slide 65 text
https://github.com/drisso/mbkmeans
• Extend the ClusterR C++ implementation to allow HDF5 input (beachmat)
• Interface with SingleCellExperiment (through beachmat and DelayedArray)
• Starting to benchmark against Python implementation (already HDF5-ready)
both directly and through the reticulate and BiocSklearn R packages
• reticulate = embeds an interactive Python session within R session,
translates between Python and R objects
• BiocSklearn = Interfaces to sci-kit learn in Python allowing
implementation of k-means with large out-of-memory data
Mini-batch k-means: our approach
Slide 66
Slide 66 text
Interactive SummarizedExperiment (iSEE)
https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k/
Creators: Federico Marini,
Aaron Lun, Charlotte Soneson,
and Kevin Rue-Albrecht
Challenge:
Data standardizations
and Interoperability
Slide 69
Slide 69 text
Disclosure: I’m biased towards R, but have a great deal of respect for all languages!
Slide 70
Slide 70 text
Common data structures for single-cell data
bioconductor/packages/
SingleCellExperiment
scater, MAST, scDD, scPipe, scran, splatter, zinbwave,
DropletUtils, clusterExperiment, SC3, destiny, and BASiCS
Slide 71
Slide 71 text
Interoperability within R/Bioconductor
Slide 72
Slide 72 text
Convert between data structures for single-cell data
Slide 73
Slide 73 text
Interface with Python code using reticulate R package
$ pip install sklearn # install Python dependency
$ R --vanilla # start R
> sklearn = reticulate::import("sklearn") # load interface to sklearn
> clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier
Slide 74
Slide 74 text
Challenge:
Scalability and interactivity
Slide 75
Slide 75 text
Scalable interactivity with large data
Slide 76
Slide 76 text
Interactive Data Visualization (iSEE)
https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k/
Creators: Federico Marini,
Aaron Lun, Charlotte Soneson,
and Kevin Rue-Albrecht
Slide 77
Slide 77 text
Challenge:
More human readable code
Slide 78
Slide 78 text
Standard Bioconductor Data Structures: GRanges
GenomicRanges (GRanges)
Lee et al. (2018), bioRixv
Things you can do with
Granges objects
width(gr)
## [1] 18292 6001
gr[gr$score > 15, ]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | gene_id score
## |
## [1] chr1 520211-526211 - | 2151 25
## -------
## seqinfo: 1 sequence from an unspecified genome
Slide 81
Slide 81 text
Genomic verbs/actions +
tidy data = plyranges
• Goal: Write human readable analysis
workflows
• Idea: Define an API (i.e. extend dplyr) that
maps relational genomic algebra to “verbs”
that act on ”tidy” genomic data
• Another great idea: Borrow dplyr’s syntax
and design principles
• And another great idea: Compose verbs
together with pipe operator from magrittr
Stuart Lee
Di Cook
Michael Lawrence
Lee et al. (2018), bioRixv
Slide 82
Slide 82 text
library(plyranges)
gr %>%
filter(score > 15)
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | gene_id score
## |
## [1] chr1 520211-526211 - | 2151 25
## -------
## seqinfo: 1 sequence from an unspecified genome
gr %>%
filter(score > 15) %>%
width()
## [1] 6001
Slide 83
Slide 83 text
$ pip install sklearn # install Python dependency
$ R --vanilla # start R
> sklearn = reticulate::import("sklearn") # load interface to sklearn
> clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier
> library(HCABrowser)
> hca = HumanCellAtlas() # authorized access
> hca %>% filter(organ == "brain") %>% # familiar paradigm
results() %>% as_tibble() # lazy access
# A tibble: 432 x 6
name size uuid bundle_fqid bundle_url orga
1 a3a818c2-6c1… 29544 92732f5… ede36b63-2570-4… https://dss.inte… brain
2 a3a818c2-6c1… 4341 1e1cdfc… ede36b63-2570-4… https://dss.inte… brain
3 a3a818c2-6c1… 457 6499887… ede36b63-2570-4… https://dss.inte… brain
# ... with 429 more rows
Usability
> library(TENxBrainData)
> tenx = TENxBrainData() # easy access
> tenx # big data!
class: SingleCellExperiment
dim: 27998 1306127
> pryr::object_size(tenx)
200 MB
Interoperability Scalability
Interactivity
● ½ million downloads annually
● >1600 open-source, open-development packages
● Robust and well-documented software
● Large-scale and remote access to HCA data
● Standard data representations
● State-of-the-art computational methods
Workflows