Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor

68c6191fa302627da003b9ac1eaba4b5?s=47 Stephanie Hicks
November 30, 2018

Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor

* Keynote at Bioconductor Asia in Melbourne, Australia on November 29, 2018 (https://bioconductor.github.io/BiocAsia/)
* Invited presentation at the Australian Mathematical Sciences Institute (AMSI) BioInfoSummer in Perth, Australia on December 4, 2018 (https://bis.amsi.org.au)

68c6191fa302627da003b9ac1eaba4b5?s=128

Stephanie Hicks

November 30, 2018
Tweet

Transcript

  1. 1.

    Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor Stephanie Hicks Assistant Professor,

    Biostatistics Johns Hopkins Bloomberg School of Public Health Bioconductor Asia Conference November 29, 2018 AMSI BioinfoSummer December 4, 2018
  2. 2.

    Teaching: Data Science Research: Genomics (single-cell RNA-seq) • Bioconductor user

    and developer (since 2009/2010) Other fun things about me: • Co-founded Baltimore • Creating a children’s book featuring women statisticians and data scientists ABOUT ME JOHNS HOPKINS BLOOMBERG SCHOOL OF PUBLIC HEALTH
  3. 3.

    • Began in 2001 and now includes 12 core developers

    • Big priorities: reproducible research and high-quality documentation • Vignettes • Large community support • Workflows (super helpful to newbs) • Teaching resources and open development
  4. 4.
  5. 5.
  6. 10.

    Interoperability within R/Bioconductor data("airway", package = "airway") # SummarizedExperiment object

    se <- airway # Differential expression with DESeq2 library("DESeq2") dds <- DESeqDataSet(se = airway, design = ~ cell + dex) dds <- DESeq(dds) deRes <- as.data.frame(results(dds)) # Correction for multiple testing using Benjamini & Hochberg (1995) approach p.adjust(dds$pvalue, method = "BH")
  7. 11.

    Interoperability within R/Bioconductor data("airway", package = "airway") # SummarizedExperiment object

    se <- airway # Differential expression with DESeq2 library("DESeq2") dds <- DESeqDataSet(se = airway, design = ~ cell + dex) dds <- DESeq(dds) deRes <- as.data.frame(results(dds)) # Correction for multiple testing using IHW ihwRes <- ihw(pvalue ~ baseMean, data = deRes, alpha = 0.1)
  8. 13.

    Single-cell RNA-seq (scRNA-seq) Cell 1 Cell 2 … Gene 1

    18 0 Gene 2 1010 506 Gene 3 0 49 Gene 4 22 0 … Read Counts Gene 1 Compare gene expression profiles of single cells Tissue (e.g. tumor) Isolate and sequence individual cells Cells Genes Principal Component 2 Principal Component 1 Cell 1
  9. 15.

    Variability in scRNA-Seq data Signal Noise Extrinsic noise (regulation by

    transcription factors) vs Intrinstic noise (stochastic bursting/firing, cell cycle) overdispersion, batch effects, sequencing depth, gc bias, amplification bias e.g. capture efficiency (starting amount of mRNA) Variability visible in bulk RNA-Seq Additional variability in scRNA-Seq data Total variation Technical noise Biological noise
  10. 16.

    Common types scRNA-Seq data Adapted from Kolodziejczyk et al. (2015).

    Molecular Cell 58 Heterogeneous cell populations Purified cell populations Battle droids Super battle droids! Big room full of unknown R-Series droids
  11. 19.

    Common data structures for single-cell data Biocverse = project to

    create an interactive data viz of the biocoductor ecosystem Shian Su
  12. 21.

    > library(TENxBrainData) > tenx = TENxBrainData() # easy access >

    tenx # big data! class: SingleCellExperiment dim: 27998 1306127 > pryr::object_size(tenx) 200 MB
  13. 22.
  14. 23.
  15. 24.

    library(TENxPBMCData) tenx_pbmc3k <- TENxPBMCData(dataset = "pbmc3k") tenx_pbmc3k ## class: SingleCellExperiment

    ## dim: 32738 2700 ## metadata(0): ## assays(1): counts ## rownames(32738): ENSG00000243485 ENSG00000237613 ... ## ENSG00000215616 ENSG00000215611 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(11): Sample Barcode ... Individual Date_published ## reducedDimNames(0): ## spikeNames(0):
  16. 25.

    counts(tenx_pbmc3k) ## <32738 x 2700> DelayedMatrix object of type "integer":

    ## [,1] [,2] [,3] [,4] ... [,2697] [,2698] ## ENSG00000243485 0 0 0 0 . 0 0 ## ENSG00000237613 0 0 0 0 . 0 0 ## ENSG00000186092 0 0 0 0 . 0 0 ## ENSG00000238009 0 0 0 0 . 0 0 ## ENSG00000239945 0 0 0 0 . 0 0 ## ... . . . . . . . ## ENSG00000215635 0 0 0 0 . 0 0 ## ENSG00000268590 0 0 0 0 . 0 0 ## ENSG00000251180 0 0 0 0 . 0 0 ## ENSG00000215616 0 0 0 0 . 0 0 ## ENSG00000215611 0 0 0 0 . 0 0 ## [,2699] [,2700] ## ENSG00000243485 0 0
  17. 26.

    Simulate scRNA-seq counts with • Uses a gamma-Poisson distribution used

    to generate a gene by cell matrix of counts (SingleCellExperiment) Zappia, Phipson, Oshlack (2017). Genome Biol library(splatter) tenx_pbmc3k_sub <- as.matrix(counts(tenx_pbmc3k)) params <- splatEstimate(tenx_pbmc3k_sub[,1:100]) sim <- splatSimulate(params) sim ## class: SingleCellExperiment ## dim: 32738 100 ## metadata(1): Params ## assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts ## rownames(32738): Gene1 Gene2 ... Gene32737 Gene32738 … (and more!)
  18. 28.

    library(scater) tenx_pbmc3k <- calculateQCMetrics(tenx_pbmc3k) colnames(colData(tenx_pbmc3k)) ## [1] "Sample" "Barcode" ##

    [3] "Sequence" "Library" ## [5] "Cell_ranger_version" "Tissue_status" ## [7] "Barcode_type" "Chemistry" ## [9] "Sequence_platform" "Individual" ## [11] "Date_published" "is_cell_control" ## [13] "total_features_by_counts" "log10_total_features_by_counts" ## [15] "total_counts" "log10_total_counts" ## [17] "pct_counts_in_top_50_features" "pct_counts_in_top_100_features" ## [19] "pct_counts_in_top_200_features" "pct_counts_in_top_500_features"
  19. 29.

    colnames(rowData(tenx_pbmc3k)) ## [1] "ENSEMBL_ID" "Symbol_TENx" ## [3] "Symbol" "is_feature_control" ##

    [5] "mean_counts" "log10_mean_counts" ## [7] "n_cells_by_counts" "pct_dropout_by_counts" ## [9] "total_counts" "log10_total_counts"
  20. 31.

    Going from “raw” data to “clean” data Kindly borrowed from

    Davis McCarthy’s Slides at Genome Informatics 2016 https://speakerdeck.com/davismcc/what-do-we-need-computationally-to-make-the-most-of-single-cell-rna-seq-data
  21. 32.

    Tian , Su, Dong, … , Ritchie (2018). PLoS Comp

    Bio McCarthy, Campbell, Lun and Wills (2017). Bioinf SingleCellExperiment SingleCellExperiment SingleCellExperiment
  22. 33.

    Exploratory Data Analysis Adapted from Stegle et al. (2015) Nature

    Reviews Genetics 16: 133-145 Lun et al. (2016) F1000 Cell-level quality control Gene-level quality control isOutlier(sce$unmapped_reads, type=”greater”, nmads=3)
  23. 34.
  24. 35.
  25. 38.

    So… what about normalization and dealing with other technical variation

    in scRNA-Seq data? Much to learn, we still have ….
  26. 41.

    Normalization “Case studies and simulated data sets indicated that …

    bulk RNA-seq normalization methods are not appropriate in the context of scRNA-seq, where—on account of biological heterogeneity and technical artifacts—we typically observe more heterogeneous and sparser data sets.” Vallejos et al. (2017) Nature Methods
  27. 42.

    Normalization “The choice of the normalization method affects downstream analyses,

    such as highly variable gene (HVG) detection that aim to uncover heterogeneity within the data.” Vallejos et al. (2017) Nature Methods
  28. 43.

    Normalization Vallejos et al. (2017) Nature Methods “Global-scaling methods rely

    on different assumptions, these methods all fail if the number or fold change of differentially expressed genes across the cell population is too high.”
  29. 44.

    Normalization • Without Spike-ins • scran and SCnorm (standalone) •

    BASiCs and zinbwave • scone • Convert relative read counts into mRNA counts • Census (Monocle)
  30. 45.

    Normalization • Without Spike-ins • scran and SCnorm (standalone) •

    BASiCs and zinbwave • scone • Convert relative read counts into mRNA counts • Census (Monocle) • With Spike-ins • Spike-ins: theoretically a good idea, but many challenges still remain for scRNA-Seq • UMIs: Reduces amplification bias, not appropriate for isoform or allele-specific expression • Biological (nuisance?) variability • differences among cells in cell-cycle stage or cell size (try cyclone() in scran)
  31. 46.

    “Hey, someone told me of this thing called batch effects….

    Should I be worried about them in my scRNA-Seq data?”
  32. 47.

    Discovering new cell types or subtypes • Lots of sparsity

    à distance estimates • Cell-to-cell variation
  33. 49.

    50

  34. 50.

    51

  35. 51.

    Adjusting for known & unknown covariates / batch effects •

    RUVseq() or svaseq() • Linear models with e.g. removeBatchEffect() in limma or scater • ComBat() in sva • mnnCorrect() in scran
  36. 53.

    Feature selection Supervised • Differentially expressed genes between homogeneous subpops

    Unsupervised • Identifying genes via a null model • Highly variable genes • trendVar() and decomposeVar() in scran
  37. 54.

    Feature selection Supervised • Differentially expressed genes between homogeneous subpops

    Unsupervised • Identifying genes via a null model • High dropout genes • M3drop
  38. 55.

    Feature selection Supervised • Differentially expressed genes between homogeneous subpops

    Unsupervised • Identifying genes via a null model • Highly and lowly variable genes • BASiCs
  39. 56.
  40. 57.

    • BiocSingular • Exact and approximate methods for SVD and

    PCA • Easily interacts with Bioconductor packages or workflows • Uses BiocParallel framework Dimensionality reduction https://github.com/LTLA/BiocSingular
  41. 58.

    Data visualization: Plot cells along first two Principal Components −5

    0 5 −5 0 5 Component 1: 3% variance Component 2: 2% variance replicate r1 r2 r3 total_features 7000 7500 8000 8500 9000 9500
  42. 60.

    • Hierarchical clustering • K-means Clustering Schematic representation of the

    k-means clustering http://hemberg-lab.github.io/scRNA.seq.course
  43. 62.

    SC3 (Kiselev et al. 2017) • Utilizes k-means & consensus

    clustering A few CRAN & Bioconductor packages for clustering scRNA-seq that I like/use (but by no means exhaustive) clustree (Zappia and Oshlack 2018) • Visualization of relationships between clusters at multiple resolutions clusterExperiment (Purdom and Risso) • Resampling based consensus clustering
  44. 63.

    mbkmeans (Risso, Purdom and Hicks) • Mini-batch k-means (Sculley, 2010)

    • Faster than regular k-means, and more accurate than stochastic gradient descent • No need to keep all the data in memory, but only one batch at the time, which makes it a good candidate for our purpose A recent contribution: mini-batch k-means
  45. 64.

    R / C++: ClusterR package (on CRAN) Python: implemented as

    part of scikit-learn Mini-batch k-means: existing implementations
  46. 65.

    https://github.com/drisso/mbkmeans • Extend the ClusterR C++ implementation to allow HDF5

    input (beachmat) • Interface with SingleCellExperiment (through beachmat and DelayedArray) • Starting to benchmark against Python implementation (already HDF5-ready) both directly and through the reticulate and BiocSklearn R packages • reticulate = embeds an interactive Python session within R session, translates between Python and R objects • BiocSklearn = Interfaces to sci-kit learn in Python allowing implementation of k-means with large out-of-memory data Mini-batch k-means: our approach
  47. 67.

    Next: downstream analyses! • Imputation • SAVER, DrImpute, scImpute, MAGIC

    • Andrews and Hemberg (2018). False signals induced by single-cell imputation. F1000 • Differential expression • SCDE, MAST, scDD, BASiCs, M3Drop… • zinbwave dropout weights + edgeR/DESeq2 • Trajectory analyses / pseudotime inference • slingshot, ouija, Monocole3 • Gene set analyses • Gene expression networks • Cell ontology / annotation
  48. 70.

    Common data structures for single-cell data bioconductor/packages/ SingleCellExperiment scater, MAST,

    scDD, scPipe, scran, splatter, zinbwave, DropletUtils, clusterExperiment, SC3, destiny, and BASiCS
  49. 73.

    Interface with Python code using reticulate R package $ pip

    install sklearn # install Python dependency $ R --vanilla # start R > sklearn = reticulate::import("sklearn") # load interface to sklearn > clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier
  50. 79.

    Create granges object library(GenomicRanges) gr <- GRanges(seqnames = "chr1", strand

    = c("+", "-"), ranges = IRanges(start = c(102012,520211), end=c(120303, 526211)), gene_id = c(1001,2151), score = c(10, 25)) gr ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | gene_id score ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr1 102012-120303 + | 1001 10 ## [2] chr1 520211-526211 - | 2151 25 ## ------- ## seqinfo: 1 sequence from an unspecified genome
  51. 80.

    Things you can do with Granges objects width(gr) ## [1]

    18292 6001 gr[gr$score > 15, ] ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | gene_id score ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr1 520211-526211 - | 2151 25 ## ------- ## seqinfo: 1 sequence from an unspecified genome
  52. 81.

    Genomic verbs/actions + tidy data = plyranges • Goal: Write

    human readable analysis workflows • Idea: Define an API (i.e. extend dplyr) that maps relational genomic algebra to “verbs” that act on ”tidy” genomic data • Another great idea: Borrow dplyr’s syntax and design principles • And another great idea: Compose verbs together with pipe operator from magrittr Stuart Lee Di Cook Michael Lawrence Lee et al. (2018), bioRixv
  53. 82.

    library(plyranges) gr %>% filter(score > 15) ## GRanges object with

    1 range and 2 metadata columns: ## seqnames ranges strand | gene_id score ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr1 520211-526211 - | 2151 25 ## ------- ## seqinfo: 1 sequence from an unspecified genome gr %>% filter(score > 15) %>% width() ## [1] 6001
  54. 83.

    $ pip install sklearn # install Python dependency $ R

    --vanilla # start R > sklearn = reticulate::import("sklearn") # load interface to sklearn > clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier > library(HCABrowser) > hca = HumanCellAtlas() # authorized access > hca %>% filter(organ == "brain") %>% # familiar paradigm results() %>% as_tibble() # lazy access # A tibble: 432 x 6 name size uuid bundle_fqid bundle_url orga <fct> <int> <fct> <fct> <fct> <fct> 1 a3a818c2-6c1… 29544 92732f5… ede36b63-2570-4… https://dss.inte… brain 2 a3a818c2-6c1… 4341 1e1cdfc… ede36b63-2570-4… https://dss.inte… brain 3 a3a818c2-6c1… 457 6499887… ede36b63-2570-4… https://dss.inte… brain # ... with 429 more rows Usability > library(TENxBrainData) > tenx = TENxBrainData() # easy access > tenx # big data! class: SingleCellExperiment dim: 27998 1306127 > pryr::object_size(tenx) 200 MB Interoperability Scalability Interactivity • ½ million downloads annually • >1600 open-source, open-development packages • Robust and well-documented software • Large-scale and remote access to HCA data • Standard data representations • State-of-the-art computational methods Workflows