Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor

Stephanie Hicks
November 30, 2018

Orchestrating Single-Cell RNA-sequencing Analysis with Bioconductor

* Keynote at Bioconductor Asia in Melbourne, Australia on November 29, 2018 (https://bioconductor.github.io/BiocAsia/)
* Invited presentation at the Australian Mathematical Sciences Institute (AMSI) BioInfoSummer in Perth, Australia on December 4, 2018 (https://bis.amsi.org.au)

Stephanie Hicks

November 30, 2018
Tweet

More Decks by Stephanie Hicks

Other Decks in Science

Transcript

  1. Orchestrating Single-Cell RNA-sequencing
    Analysis with Bioconductor
    Stephanie Hicks
    Assistant Professor, Biostatistics
    Johns Hopkins Bloomberg School of Public Health
    Bioconductor Asia Conference
    November 29, 2018
    AMSI BioinfoSummer
    December 4, 2018

    View Slide

  2. Teaching: Data Science
    Research: Genomics (single-cell RNA-seq)
    • Bioconductor user and developer (since 2009/2010)
    Other fun things about me:
    • Co-founded Baltimore
    • Creating a children’s book featuring women statisticians and data scientists
    ABOUT ME JOHNS HOPKINS BLOOMBERG
    SCHOOL OF PUBLIC HEALTH

    View Slide

  3. • Began in 2001 and now includes 12 core developers
    • Big priorities: reproducible research and high-quality documentation
    • Vignettes
    • Large community support
    • Workflows (super helpful to newbs)
    • Teaching resources
    and open development

    View Slide

  4. View Slide

  5. View Slide

  6. New to
    R/Bioconductor?
    How to get started?
    #rcatladies
    #rladies

    View Slide

  7. Workflows

    View Slide

  8. Workflows

    View Slide

  9. Interoperability within R/Bioconductor
    data("airway", package = "airway")
    # SummarizedExperiment object
    se <- airway
    # Simple subsetting
    se[1:100, 1:10]

    View Slide

  10. Interoperability within R/Bioconductor
    data("airway", package = "airway")
    # SummarizedExperiment object
    se <- airway
    # Differential expression with DESeq2
    library("DESeq2")
    dds <- DESeqDataSet(se = airway,
    design = ~ cell + dex)
    dds <- DESeq(dds)
    deRes <- as.data.frame(results(dds))
    # Correction for multiple testing using Benjamini & Hochberg (1995) approach
    p.adjust(dds$pvalue, method = "BH")

    View Slide

  11. Interoperability within R/Bioconductor
    data("airway", package = "airway")
    # SummarizedExperiment object
    se <- airway
    # Differential expression with DESeq2
    library("DESeq2")
    dds <- DESeqDataSet(se = airway,
    design = ~ cell + dex)
    dds <- DESeq(dds)
    deRes <- as.data.frame(results(dds))
    # Correction for multiple testing using IHW
    ihwRes <- ihw(pvalue ~ baseMean,
    data = deRes, alpha = 0.1)

    View Slide

  12. cool, but isn’t this talk supposed to be
    about single-cell RNA-sequencing?

    View Slide

  13. Single-cell RNA-seq (scRNA-seq)
    Cell 1 Cell 2 …
    Gene 1 18 0
    Gene 2 1010 506
    Gene 3 0 49
    Gene 4 22 0

    Read Counts
    Gene 1
    Compare gene expression
    profiles of single cells
    Tissue (e.g. tumor)
    Isolate and sequence
    individual cells
    Cells
    Genes
    Principal Component 2
    Principal Component 1
    Cell 1

    View Slide

  14. Svensson et al. (2017).

    View Slide

  15. Variability in scRNA-Seq data
    Signal
    Noise
    Extrinsic noise (regulation by transcription factors)
    vs Intrinstic noise (stochastic bursting/firing, cell cycle)
    overdispersion, batch effects, sequencing
    depth, gc bias, amplification bias
    e.g. capture efficiency (starting amount of mRNA)
    Variability visible in bulk
    RNA-Seq
    Additional variability in
    scRNA-Seq data
    Total variation
    Technical noise Biological noise

    View Slide

  16. Common types scRNA-Seq data
    Adapted from Kolodziejczyk et al. (2015). Molecular Cell 58
    Heterogeneous
    cell populations
    Purified cell populations
    Battle droids Super battle droids!
    Big room full of unknown R-Series droids

    View Slide

  17. Orchestrating Single-Cell RNA-sequencing
    Analysis with Bioconductor

    View Slide

  18. Common data structures for single-cell data
    bioconductor/packages/
    SingleCellExperiment

    View Slide

  19. Common data structures for single-cell data
    Biocverse =
    project to create
    an interactive
    data viz of the
    biocoductor
    ecosystem
    Shian Su

    View Slide

  20. Let’s get started with some data!

    View Slide

  21. > library(TENxBrainData)
    > tenx = TENxBrainData() # easy access
    > tenx # big data!
    class: SingleCellExperiment
    dim: 27998 1306127
    > pryr::object_size(tenx)
    200 MB

    View Slide

  22. View Slide

  23. View Slide

  24. library(TENxPBMCData)
    tenx_pbmc3k <- TENxPBMCData(dataset = "pbmc3k")
    tenx_pbmc3k
    ## class: SingleCellExperiment
    ## dim: 32738 2700
    ## metadata(0):
    ## assays(1): counts
    ## rownames(32738): ENSG00000243485 ENSG00000237613 ...
    ## ENSG00000215616 ENSG00000215611
    ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
    ## colnames: NULL
    ## colData names(11): Sample Barcode ... Individual Date_published
    ## reducedDimNames(0):
    ## spikeNames(0):

    View Slide

  25. counts(tenx_pbmc3k)
    ## <32738 x 2700> DelayedMatrix object of type "integer":
    ## [,1] [,2] [,3] [,4] ... [,2697] [,2698]
    ## ENSG00000243485 0 0 0 0 . 0 0
    ## ENSG00000237613 0 0 0 0 . 0 0
    ## ENSG00000186092 0 0 0 0 . 0 0
    ## ENSG00000238009 0 0 0 0 . 0 0
    ## ENSG00000239945 0 0 0 0 . 0 0
    ## ... . . . . . . .
    ## ENSG00000215635 0 0 0 0 . 0 0
    ## ENSG00000268590 0 0 0 0 . 0 0
    ## ENSG00000251180 0 0 0 0 . 0 0
    ## ENSG00000215616 0 0 0 0 . 0 0
    ## ENSG00000215611 0 0 0 0 . 0 0
    ## [,2699] [,2700]
    ## ENSG00000243485 0 0

    View Slide

  26. Simulate scRNA-seq counts with
    • Uses a gamma-Poisson distribution used to generate a gene
    by cell matrix of counts (SingleCellExperiment)
    Zappia, Phipson, Oshlack (2017). Genome Biol
    library(splatter)
    tenx_pbmc3k_sub <-
    as.matrix(counts(tenx_pbmc3k))
    params <- splatEstimate(tenx_pbmc3k_sub[,1:100])
    sim <- splatSimulate(params)
    sim
    ## class: SingleCellExperiment
    ## dim: 32738 100
    ## metadata(1): Params
    ## assays(6): BatchCellMeans BaseCellMeans ... TrueCounts counts
    ## rownames(32738): Gene1 Gene2 ... Gene32737 Gene32738 … (and more!)

    View Slide

  27. Calculate quality control metrics
    library(scater)
    tenx_pbmc3k <- calculateQCMetrics(tenx_pbmc3k)

    View Slide

  28. library(scater)
    tenx_pbmc3k <- calculateQCMetrics(tenx_pbmc3k)
    colnames(colData(tenx_pbmc3k))
    ## [1] "Sample" "Barcode"
    ## [3] "Sequence" "Library"
    ## [5] "Cell_ranger_version" "Tissue_status"
    ## [7] "Barcode_type" "Chemistry"
    ## [9] "Sequence_platform" "Individual"
    ## [11] "Date_published" "is_cell_control"
    ## [13] "total_features_by_counts" "log10_total_features_by_counts"
    ## [15] "total_counts" "log10_total_counts"
    ## [17] "pct_counts_in_top_50_features" "pct_counts_in_top_100_features"
    ## [19] "pct_counts_in_top_200_features" "pct_counts_in_top_500_features"

    View Slide

  29. colnames(rowData(tenx_pbmc3k))
    ## [1] "ENSEMBL_ID" "Symbol_TENx"
    ## [3] "Symbol" "is_feature_control"
    ## [5] "mean_counts" "log10_mean_counts"
    ## [7] "n_cells_by_counts" "pct_dropout_by_counts"
    ## [9] "total_counts" "log10_total_counts"

    View Slide

  30. What if my data is more “raw” than
    a SingleCellExperiment
    object?

    View Slide

  31. Going from “raw” data to “clean” data
    Kindly borrowed from Davis McCarthy’s Slides at Genome Informatics 2016
    https://speakerdeck.com/davismcc/what-do-we-need-computationally-to-make-the-most-of-single-cell-rna-seq-data

    View Slide

  32. Tian , Su, Dong, … , Ritchie (2018). PLoS Comp Bio McCarthy, Campbell, Lun and
    Wills (2017). Bioinf
    SingleCellExperiment
    SingleCellExperiment
    SingleCellExperiment

    View Slide

  33. Exploratory Data Analysis
    Adapted from Stegle et al. (2015) Nature Reviews Genetics 16: 133-145
    Lun et al. (2016) F1000
    Cell-level quality control
    Gene-level
    quality control
    isOutlier(sce$unmapped_reads,
    type=”greater”, nmads=3)

    View Slide

  34. Workflows

    View Slide

  35. Workflows

    View Slide

  36. https://www.bioconductor.org/packages/release/
    workflows/html/simpleSingleCell.html

    View Slide

  37. https://www.bioconductor.org/packages/release/workflows/
    vignettes/simpleSingleCell/inst/doc/xtra-5-bigdata.html

    View Slide

  38. So… what about normalization and dealing with
    other technical variation in scRNA-Seq data?
    Much to learn, we still have ….

    View Slide

  39. Normalization
    • Without Spike-ins
    • Between-sample normalization methods
    • Global scaling factors mostly developed for bulk RNA-Seq

    View Slide

  40. Normalization
    • Without Spike-ins
    • Between-sample normalization methods
    • Global scaling factors mostly developed for bulk RNA-Seq

    View Slide

  41. Normalization
    “Case studies and simulated data sets indicated that
    … bulk RNA-seq normalization methods are not
    appropriate in the context of scRNA-seq, where—on
    account of biological heterogeneity and technical
    artifacts—we typically observe more heterogeneous
    and sparser data sets.”
    Vallejos et al. (2017) Nature Methods

    View Slide

  42. Normalization
    “The choice of the
    normalization method affects
    downstream analyses, such as
    highly variable gene (HVG)
    detection that aim to uncover
    heterogeneity within the
    data.”
    Vallejos et al. (2017) Nature Methods

    View Slide

  43. Normalization
    Vallejos et al. (2017) Nature Methods
    “Global-scaling methods
    rely on different
    assumptions, these methods
    all fail if the number or fold
    change of differentially
    expressed genes across the
    cell population is too high.”

    View Slide

  44. Normalization
    • Without Spike-ins
    • scran and SCnorm (standalone)
    • BASiCs and zinbwave
    • scone
    • Convert relative read counts into mRNA counts
    • Census (Monocle)

    View Slide

  45. Normalization
    • Without Spike-ins
    • scran and SCnorm (standalone)
    • BASiCs and zinbwave
    • scone
    • Convert relative read counts into mRNA counts
    • Census (Monocle)
    • With Spike-ins
    • Spike-ins: theoretically a good idea, but many challenges still remain for scRNA-Seq
    • UMIs: Reduces amplification bias, not appropriate for isoform or allele-specific
    expression
    • Biological (nuisance?) variability
    • differences among cells in cell-cycle stage or cell size (try cyclone() in scran)

    View Slide

  46. “Hey, someone told me of this thing called
    batch effects…. Should I be worried about them in my
    scRNA-Seq data?”

    View Slide

  47. Discovering new cell types or subtypes
    • Lots of sparsity à distance estimates
    • Cell-to-cell variation

    View Slide

  48. Or is it a batch effect?

    View Slide

  49. 50

    View Slide

  50. 51

    View Slide

  51. Adjusting for known & unknown covariates / batch effects
    • RUVseq() or svaseq()
    • Linear models with e.g.
    removeBatchEffect() in
    limma or scater
    • ComBat() in sva
    • mnnCorrect() in scran

    View Slide

  52. Feature selection
    Supervised
    • Differentially expressed genes between homogeneous subpops

    View Slide

  53. Feature selection
    Supervised
    • Differentially expressed genes between homogeneous subpops
    Unsupervised
    • Identifying genes via a null model
    • Highly variable genes
    • trendVar() and
    decomposeVar() in scran

    View Slide

  54. Feature selection
    Supervised
    • Differentially expressed genes between homogeneous subpops
    Unsupervised
    • Identifying genes via a null model
    • High dropout genes
    • M3drop

    View Slide

  55. Feature selection
    Supervised
    • Differentially expressed genes between homogeneous subpops
    Unsupervised
    • Identifying genes via a null model
    • Highly and lowly variable genes
    • BASiCs

    View Slide

  56. Feature selection
    Supervised
    • Differentially expressed genes between homogeneous subpops
    Unsupervised
    • Gene-gene correlations
    • PCA loadings

    View Slide

  57. • BiocSingular
    • Exact and approximate methods for SVD and PCA
    • Easily interacts with Bioconductor packages or workflows
    • Uses BiocParallel framework
    Dimensionality reduction https://github.com/LTLA/BiocSingular

    View Slide

  58. Data visualization:
    Plot cells along first two Principal Components
    −5
    0
    5
    −5 0 5
    Component 1: 3% variance
    Component 2: 2% variance
    replicate
    r1
    r2
    r3
    total_features
    7000
    7500
    8000
    8500
    9000
    9500

    View Slide

  59. • Hierarchical clustering
    Clustering
    The hierarchical clustering dendrogram
    http://hemberg-lab.github.io/scRNA.seq.course

    View Slide

  60. • Hierarchical clustering
    • K-means
    Clustering
    Schematic representation of the k-means clustering
    http://hemberg-lab.github.io/scRNA.seq.course

    View Slide

  61. • Hierarchical clustering
    • K-means
    • Graph-based methods
    Clustering
    http://hemberg-lab.github.io/scRNA.seq.course
    Schematic representation of the graph network

    View Slide

  62. SC3 (Kiselev et al. 2017)
    • Utilizes k-means & consensus clustering
    A few CRAN & Bioconductor packages for clustering
    scRNA-seq that I like/use (but by no means exhaustive)
    clustree (Zappia and Oshlack 2018)
    • Visualization of relationships between clusters at multiple resolutions
    clusterExperiment (Purdom and Risso)
    • Resampling based consensus clustering

    View Slide

  63. mbkmeans (Risso, Purdom and Hicks)
    • Mini-batch k-means (Sculley, 2010)
    • Faster than regular k-means, and more accurate than stochastic
    gradient descent
    • No need to keep all the data in memory, but only one batch at the
    time, which makes it a good candidate for our purpose
    A recent contribution: mini-batch k-means

    View Slide

  64. R / C++: ClusterR package (on CRAN)
    Python: implemented as part of scikit-learn
    Mini-batch k-means: existing
    implementations

    View Slide

  65. https://github.com/drisso/mbkmeans
    • Extend the ClusterR C++ implementation to allow HDF5 input (beachmat)
    • Interface with SingleCellExperiment (through beachmat and DelayedArray)
    • Starting to benchmark against Python implementation (already HDF5-ready)
    both directly and through the reticulate and BiocSklearn R packages
    • reticulate = embeds an interactive Python session within R session,
    translates between Python and R objects
    • BiocSklearn = Interfaces to sci-kit learn in Python allowing
    implementation of k-means with large out-of-memory data
    Mini-batch k-means: our approach

    View Slide

  66. Interactive SummarizedExperiment (iSEE)
    https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k/
    Creators: Federico Marini,
    Aaron Lun, Charlotte Soneson,
    and Kevin Rue-Albrecht

    View Slide

  67. Next: downstream analyses!
    • Imputation
    • SAVER, DrImpute, scImpute, MAGIC
    • Andrews and Hemberg (2018). False signals induced by single-cell imputation. F1000
    • Differential expression
    • SCDE, MAST, scDD, BASiCs, M3Drop…
    • zinbwave dropout weights + edgeR/DESeq2
    • Trajectory analyses / pseudotime inference
    • slingshot, ouija, Monocole3
    • Gene set analyses
    • Gene expression networks
    • Cell ontology / annotation

    View Slide

  68. Challenge:
    Data standardizations
    and Interoperability

    View Slide

  69. Disclosure: I’m biased towards R, but have a great deal of respect for all languages!

    View Slide

  70. Common data structures for single-cell data
    bioconductor/packages/
    SingleCellExperiment
    scater, MAST, scDD, scPipe, scran, splatter, zinbwave,
    DropletUtils, clusterExperiment, SC3, destiny, and BASiCS

    View Slide

  71. Interoperability within R/Bioconductor

    View Slide

  72. Convert between data structures for single-cell data

    View Slide

  73. Interface with Python code using reticulate R package
    $ pip install sklearn # install Python dependency
    $ R --vanilla # start R
    > sklearn = reticulate::import("sklearn") # load interface to sklearn
    > clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier

    View Slide

  74. Challenge:
    Scalability and interactivity

    View Slide

  75. Scalable interactivity with large data

    View Slide

  76. Interactive Data Visualization (iSEE)
    https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k/
    Creators: Federico Marini,
    Aaron Lun, Charlotte Soneson,
    and Kevin Rue-Albrecht

    View Slide

  77. Challenge:
    More human readable code

    View Slide

  78. Standard Bioconductor Data Structures: GRanges
    GenomicRanges (GRanges)
    Lee et al. (2018), bioRixv

    View Slide

  79. Create granges object
    library(GenomicRanges)
    gr <- GRanges(seqnames = "chr1", strand = c("+", "-"),
    ranges = IRanges(start = c(102012,520211),
    end=c(120303, 526211)),
    gene_id = c(1001,2151),
    score = c(10, 25))
    gr
    ## GRanges object with 2 ranges and 2 metadata columns:
    ## seqnames ranges strand | gene_id score
    ## |
    ## [1] chr1 102012-120303 + | 1001 10
    ## [2] chr1 520211-526211 - | 2151 25
    ## -------
    ## seqinfo: 1 sequence from an unspecified genome

    View Slide

  80. Things you can do with
    Granges objects
    width(gr)
    ## [1] 18292 6001
    gr[gr$score > 15, ]
    ## GRanges object with 1 range and 2 metadata columns:
    ## seqnames ranges strand | gene_id score
    ## |
    ## [1] chr1 520211-526211 - | 2151 25
    ## -------
    ## seqinfo: 1 sequence from an unspecified genome

    View Slide

  81. Genomic verbs/actions +
    tidy data = plyranges
    • Goal: Write human readable analysis
    workflows
    • Idea: Define an API (i.e. extend dplyr) that
    maps relational genomic algebra to “verbs”
    that act on ”tidy” genomic data
    • Another great idea: Borrow dplyr’s syntax
    and design principles
    • And another great idea: Compose verbs
    together with pipe operator from magrittr
    Stuart Lee
    Di Cook
    Michael Lawrence
    Lee et al. (2018), bioRixv

    View Slide

  82. library(plyranges)
    gr %>%
    filter(score > 15)
    ## GRanges object with 1 range and 2 metadata columns:
    ## seqnames ranges strand | gene_id score
    ## |
    ## [1] chr1 520211-526211 - | 2151 25
    ## -------
    ## seqinfo: 1 sequence from an unspecified genome
    gr %>%
    filter(score > 15) %>%
    width()
    ## [1] 6001

    View Slide

  83. $ pip install sklearn # install Python dependency
    $ R --vanilla # start R
    > sklearn = reticulate::import("sklearn") # load interface to sklearn
    > clf = sklearn$svm$SVC(gamma = .001, C=100.) # create sklearn SVM classifier
    > library(HCABrowser)
    > hca = HumanCellAtlas() # authorized access
    > hca %>% filter(organ == "brain") %>% # familiar paradigm
    results() %>% as_tibble() # lazy access
    # A tibble: 432 x 6
    name size uuid bundle_fqid bundle_url orga

    1 a3a818c2-6c1… 29544 92732f5… ede36b63-2570-4… https://dss.inte… brain
    2 a3a818c2-6c1… 4341 1e1cdfc… ede36b63-2570-4… https://dss.inte… brain
    3 a3a818c2-6c1… 457 6499887… ede36b63-2570-4… https://dss.inte… brain
    # ... with 429 more rows
    Usability
    > library(TENxBrainData)
    > tenx = TENxBrainData() # easy access
    > tenx # big data!
    class: SingleCellExperiment
    dim: 27998 1306127
    > pryr::object_size(tenx)
    200 MB
    Interoperability Scalability
    Interactivity
    ● ½ million downloads annually
    ● >1600 open-source, open-development packages
    ● Robust and well-documented software
    ● Large-scale and remote access to HCA data
    ● Standard data representations
    ● State-of-the-art computational methods
    Workflows

    View Slide

  84. Feel free to send comments/questions:
    Twitter: @stephaniehicks
    Email: [email protected]
    #BiocAsia
    #rladies
    Thank you!

    View Slide