Upgrade to Pro — share decks privately, control downloads, hide ads and more …

HCA-LA-2021

 HCA-LA-2021

Talk for the Human Cell Atlas 2021 workshop. See https://github.com/lcolladotor/HCA_LA_2021 for more.

Tweet

More Decks by Leonardo Collado-Torres

Other Decks in Science

Transcript

  1. Getting started with scRNA-seq analyses with Bioconductor Leonardo Collado Torres

    lcolladotor.github.io 2021-05-03 Human Cell Atlas Latin America
  2. Slides & Code • Slides : https://docs.google.com/presentation/d/1CgBtdoE5GAhbqAaGzxR5HEkUTbXNZsKEfym9099 6vWQ/edit?usp=sharing • Code

    https://lcolladotor.github.io/HCA_LA_2021 • R Code https://github.com/lcolladotor/HCA_LA_2021/blob/master/2021-05-03_code_HCA_LA.R
  3. LIEBER INSTITUTE for BRAIN DEVELOPMENT R code for installing R/Bioconductor

    packages used here ## For installing Bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") ## Install required packages BiocManager::install( c('SingleCellExperiment', 'scran', 'scater', 'scRNAseq', 'AnnotationHub', 'ExperimentHub', 'uwot', ‘Rtsne’, 'pryr', 'sessioninfo'))
  4. LIEBER INSTITUTE for BRAIN DEVELOPMENT https://lcolladotor.github.io/RBiocIntro2020/ July 2020 - Spanish

    https://youtu.be/SZVPXT2McQM & https://www.youtube.com/channel/UCHCdYfAXVzJIUkMoMSGiZMw
  5. LIEBER INSTITUTE for BRAIN DEVELOPMENT Find news about the latest

    developments Twitter • https://twitter.com/bioconductor • https://twitter.com/LIBDrstats (my work R club) • https://twitter.com/CDSBMexico • #rstats: general • #rstatsES: general & in Spanish • #BioC2021: upcoming Bioconductor conference
  6. LIEBER INSTITUTE for BRAIN DEVELOPMENT Join a community: virtual “academic

    department” Slack • Bioconductor (English) https://bioc-community.herokuapp.com • CDSB (Spanish) https://join.slack.com/t/comunidadbioinfo/shared_invite/z t-8lsvpm84-Fne1W0hadk6cpjgJS17Tnw • LatinR (Spanish, English, Portuguese) https://latin-r.com/blog/communities Blog post with Stephanie Hicks http://lcolladotor.github.io/2018/06/19/using-slack-for-academic-departmental-communication/
  7. LIEBER INSTITUTE for BRAIN DEVELOPMENT Foster a welcoming environment: follow

    the code of conduct! http://bioconductor.org/about/ code-of-conduct/
  8. LIEBER INSTITUTE for BRAIN DEVELOPMENT About Bioconductor Bioconductor provides tools

    for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. http://bioconductor.org/
  9. LIEBER INSTITUTE for BRAIN DEVELOPMENT R/Bioconductor packages (BioC 3.12 from

    October 2020) • Software: 1975 • Annotation: 971 • Experimental Data: 398 • Workflow guides: 28 http://bioconductor.org/packages/release/BiocViews.html#___S oftware
  10. LIEBER INSTITUTE for BRAIN DEVELOPMENT Exercise: navigating the biocViews terms

    (tree-like structure) Who is the author of the top ranked Bioconductor workflow that mentions “cancer” in the title? Hint: • http://bioconductor.org/ -> • http://bioconductor.org/packages/release/BiocViews.html#_ __Software ->
  11. LIEBER INSTITUTE for BRAIN DEVELOPMENT Learn more about it before

    installing it: text description & DOI https://bioconductor.org/packages/scran
  12. LIEBER INSTITUTE for BRAIN DEVELOPMENT Learn more about it before

    installing it: vignette https://bioconductor.org/packages/scran
  13. LIEBER INSTITUTE for BRAIN DEVELOPMENT Learn more about it before

    installing it: vignette http://bioconductor.org/packages/release/bioc/vignettes/scran/inst/doc/scran.html
  14. LIEBER INSTITUTE for BRAIN DEVELOPMENT Ok, now we can install

    it (if we have the right version of R) https://bioconductor.org/packages/scran
  15. LIEBER INSTITUTE for BRAIN DEVELOPMENT OSCA Book & paper http://bioconductor.org/books/release/OSCA/

    https://doi.org/10.1038/s41592-019-0654-x This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.
  16. Single-cell RNA-sequencing (scRNA-seq) workshop Peter Hickey @PeteHaitch Single Cell Open

    Research Endeavour (SCORE), Walter and Eliza Hall Institute of Medical Research bit.ly/WEHIscRNAworkshop This material has been adapted from Pete’s WEHI course
  17. LIEBER INSTITUTE for BRAIN DEVELOPMENT Other versions of this material

    • (March 2020; 4 days) Intro to OSCA for LGC-UNAM at LIIGH-UNAM undergrad students (code in English) https://lcolladotor.github.io/osca_LIIGH_UNAM_2020/ • (August 2020; 2.5 days) Intro to OSCA at CSBD2020 https://github.com/ComunidadBioInfo/cdsb2020 (code comments in Spanish) • (August 2021; 5 days) http://www.nnb.unam.mx/
  18. Background Common data infrastructure makes life easier - Can switch

    between different tools more easily - No need to convert between formats What data do we need to analyse a scRNA-seq experiment?
  19. What data do we start with? Cell 1 Cell 2

    ... Cell N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts
  20. What data do we start with? Barcode Donor ... Treatment

    Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells
  21. What data do we start with? ID Symbol ... Chromosome

    Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes
  22. What data do we create in an analysis? Cell 1

    Cell 2 ... Cell N Gene 1 0 0.6 ... 0 Gene 2 0.3 0.8 ... 0 ... ... ... ... ... Gene M 0.35 0.67 2.1 Log-normalized gene expression
  23. What data do we create in an analysis? PCA 1

    PCA 2 ... PCA K Cell 1 0.93 1.28 ... 0.03 Cell 2 0.32 1.22 ... 0.09 ... ... ... ... ... Cell N -0.66 1.00 0.15 Dimensionality reduction t-SNE 1 t-SNE 2 Cell 1 1.24 8.93 Cell 2 -0.33 7.85 ... ... ... Cell N 0.46 3.41
  24. How to coordinate this? Cell 1 Cell 2 ... Cell

    N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes
  25. How to coordinate this? Cell 1 Cell 2 ... Cell

    N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes
  26. How to coordinate this? Cell 1 Cell 2 ... Cell

    N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes
  27. How to coordinate this? Cell 1 Cell 2 ... Cell

    N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes
  28. How to coordinate this? Cell 1 Cell 2 ... Cell

    N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes Also have to coordinate derived data (log-normalized gene expression, PCA, t-SNE, etc.)
  29. The anatomy of a SingleCellExperiment Each piece of (meta)data in

    the SingleCellExperiment is represented by a separate 'slot' An analogy - An SCE is a cargo ship - Each 'slot' is a cargo box - Certain cargo boxes (slots) expect certain types of cargo (data)
  30. Illustrative dataset: 416B library(scRNAseq) sce.416b <- LunSpikeInData(which="416b") Lun, A. T.

    L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html
  31. Storing primary experimental data Cell 1 Cell 2 ... Cell

    N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts
  32. Filling the assays slot # Load the SingleCellExperiment package library(SingleCellExperiment)

    # Extract the count matrix from the 416b dataset counts.416b <- counts(sce.416b) # Construct a new SCE from the counts matrix sce <- SingleCellExperiment( assays = list(counts = counts.416b))
  33. Filling the assays slot # Access the counts matrix from

    the assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "counts") # 2. The special method for the assay named "counts" counts(sce) # Tip: Limit the output to just a few samples & genes counts(sce)[1:30, 1:2]
  34. Adding to the assays slot sce <- scater::logNormCounts(sce) # Inspect

    the object we just updated sce 👉We overwrote our previous 'sce' by reassigning the results back to 'sce' (possible because this particular function returns a 'SingleCellExperiment' that contains the results in addition to the original data) ⚠ Some functions - especially those outside the single-cell oriented Bioconductor packages - do not do this, then we need to do extra work
  35. Adding more assays # Access the logcounts matrix from the

    assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "logcounts") # 2. The special method for the assay named "logcounts" logcounts(sce) # Tip: Limit the output to just a few samples & genes logcounts(sce)[1:30, 1:2]
  36. Adding more assays # assign a new entry to assays

    slot assay(sce, "counts_100") <- assay(sce, "counts") + 100 # List the assays in the object assays(sce) 🤔 What if I want to add a custom assay?
  37. Storing metadata Barcode Donor ... Treatment Cell 1 ACTGTA D1

    ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes Information about the experiment - "Look at expression of genes A, B, and C" - "There could be problems with samples from the first batch because of issue with FACS sort"
  38. Storing metadata on the columns (cells) Barcode Donor ... Treatment

    Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells
  39. Filling the colData slot # Extract the sample metadata from

    the 416b dataset colData.416b <- colData(sce.416b) # Add some of the sample metadata to our SCE colData(sce) <- colData.416b[, c("phenotype", "block")] # Inspect the object we just updated sce # Access the sample metadata from our SCE colData(sce) # Access a specific column of sample metadata from our SCE sce$block
  40. Adding to the colData slot # Example of function that

    adds extra fields to colData sce <- scater::addPerCellQC(sce.416b) # Access the sample metadata from our updated SCE colData(sce)
  41. Using colData for subsetting # E.g., subset data to just

    wild type cells # Remember, cells are columns of the SCE sce[, sce$phenotype == "wild type phenotype"] 🤔 What if I want to subset the data to certain cells?
  42. Storing metadata on the rows (features/genes) ID Symbol ... Chromosome

    Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes
  43. Adding to the rowData slot # Access the feature metadata

    from our SCE # It's currently empty! rowData(sce) # Example of function that adds extra fields to rowData sce <- scater::addPerFeatureQC(sce) # Access the feature metadata from our updated SCE rowData(sce)
  44. Adding to the rowData slot # Download the relevant Ensembl

    annotation database # using AnnotationHub resources library(AnnotationHub) ah <- AnnotationHub() query(ah, c("Mus musculus", "Ensembl", "v97")) 🤔 What if I want to add the chromosome of each gene?
  45. Adding to the rowData slot # Annotate each gene with

    its chromosome location ensdb <- ah[["AH73905"]] chromosome <- mapIds(ensdb, keys=rownames(sce), keytype="GENEID", column="SEQNAME") rowData(sce)$chromosome <- chromosome # Access the feature metadata from our updated SCE rowData(sce)
  46. Using rowData for subsetting # E.g., subset data to just

    genes on chromosome 3 # NOTE: which() needed to cope with NA chromosome names sce[which(rowData(sce)$chromosome == "3"), ] 🤔 What if I want to subset the data to certain features?
  47. Storing other metadata 😬 Information about the experiment - "Look

    at expression of genes A, B, and C" - "There could be problems with samples from the first batch because of issue with FACS sort"
  48. Adding to the metadata slot # Access the metadata from

    our SCE # It's currently empty! metadata(sce) # The metadata slot is Vegas - anything goes metadata(sce) <- list( favourite_genes=c("Shh", "Nck1", "Diablo"), analyst=c("Pete")) # Access the metadata from our updated SCE metadata(sce)
  49. Background So far, we've covered: - 'assays' (primary data) -

    'colData' (cell metadata) - 'rowData' (feature metadata) - 'metadata' (experiment-level metadata) 🤓 The above is a SummarizedExperiment
  50. So why do we need SingleCellExperiment? For single-cell data, we

    commonly also have: - Dimensionality reduction results (e.g., t-SNE, UMAP) - Special or 'alternative' types of features (e.g., spike-in RNAs, antibody-derived sequencing tags) 🤓 SingleCellExperiment is an extension of SummarizedExperiment
  51. Storing dimensionality reduction results PCA 1 PCA 2 ... PCA

    K Cell 1 0.93 1.28 ... 0.03 Cell 2 0.32 1.22 ... 0.09 ... ... ... ... ... Cell N -0.66 1.00 0.15 Dimensionality reduction t-SNE 1 t-SNE 2 Cell 1 1.24 8.93 Cell 2 -0.33 7.85 ... ... ... Cell N 0.46 3.41
  52. Adding to the reducedDims slot # E.g., add the PCA

    of logcounts # NOTE: We'll learn more about PCA later sce <- scater::runPCA(sce) # Inspect the object we just updated sce # Access the PCA matrix from the reducedDims slot reducedDim(sce, "PCA") 🤔 What if I want to store a dimensionality reduced version of the data?
  53. Adding to the reducedDims slot # E.g., add a t-SNE

    representation of logcounts # NOTE: We'll learn more about t-SNE later sce <- scater::runTSNE(sce) # Inspect the object we just updated sce # Access the t-SNE matrix from the reducedDims slot reducedDim(sce, "TSNE") 🤔 What if I want to store a dimensionality reduced version of the data?
  54. Adding to the reducedDims slot # E.g., add a 'manual'

    UMAP representation of logcounts # NOTE: We'll learn more about UMAP later and a # simpler way to compute it. u <- uwot::umap(t(logcounts(sce)), n_component=2) # Add the UMAP matrix to the reducedDims slot # Access the UMAP matrix from the reducedDims slot reducedDim(sce, "UMAP") <- u # List the dimensionality reduction results stored in # the object reducedDims(sce) 🤔 What if I want to store a dimensionality reduced version of the data?
  55. Reasons for 💩 Cell damage during dissociation Failures in library

    preparation Inefficient reverse transcription PCR amplification Plate alignment issues during FACS
  56. Signs of 💩 'Cells' featuring one or more of: ⚠

    Low total counts ⚠ Few expressed genes ⚠ High proportion of reads coming from mitochondria ⚠ High proportion of reads coming from spike-ins
  57. Consequences of 💩 Distinct cluster(s) with lots of 💩, complicating

    interpretation of results Distortion of population heterogeneity Artificial 'upregulation' of certain genes
  58. Avoiding (or mitigating) 💩 🥇Careful experimental design and sample preparation

    🥈Quality control (QC) on the cells Identifying/removing 'bad' cells Identifying/removing 'bad' features
  59. Illustrative dataset: 416B library(scRNAseq) sce.416b <- LunSpikeInData(which="416b") sce.416b$block <- factor(sce.416b$block)

    Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation. - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html
  60. Common QC metrics sum: Library size detected: Number of expressed

    genes/features altexps_ERCC_percent: Percentage of reads mapped to (ERCC) spike-in transcripts subsets_Mito_percent: Percentage of reads mapped to mitochondrial transcripts
  61. Aside: Identifying the mitochondrial transcripts # Download the relevant Ensembl

    annotation database # using AnnotationHub resources library(AnnotationHub) ah <- AnnotationHub() query(ah, c("Mus musculus", "Ensembl", "v97")) # Annotate each gene with its chromosome location ens.mm.v97 <- ah[["AH73905"]] location <- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype="GENEID", column="SEQNAME") # Identify the mitochondrial genes is.mito <- which(location=="MT")
  62. Visualizing QC metrics scater::plotColData() plotColData(sce.416b, x="block", y="detected") plotColData(sce.416b, x="block", y="detected")

    + scale_y_log10() plotColData(sce.416b, x="block", y="detected", other_fields="phenotype") + scale_y_log10() + facet_wrap(~phenotype) ▶ Try plotting one of the other QC metrics
  63. Using fixed thresholds # Example thresholds qc.lib <- sce.416b$sum <

    100000 qc.nexprs <- sce.416b$detected < 5000 qc.spike <- sce.416b$altexps_ERCC_percent > 10 qc.mito <- sce.416b$subsets_Mito_percent > 10 discard <- qc.lib | qc.nexprs | qc.spike | qc.mito # Summarize the number of cells removed for each reason DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs), SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard))
  64. Using fixed thresholds ➕ Simple ➖ Hard to determine appropriate

    thresholds for each experimental protocol and biological system
  65. Using adaptive thresholds scater::isOutlier() qc.lib2 <- isOutlier(sce.416b$sum, log=TRUE, type="lower") qc.nexprs2

    <- isOutlier(sce.416b$detected, log=TRUE, type="lower") qc.spike2 <- isOutlier(sce.416b$altexps_ERCC_percent, type="higher") qc.mito2 <- isOutlier(sce.416b$subsets_Mito_percent, type="higher") discard2 <- qc.lib2 | qc.nexprs2 | qc.spike2 | qc.mito2
  66. Using adaptive thresholds scater::isOutlier() # Extract the thresholds attr(qc.lib2, "thresholds")

    attr(qc.nexprs2, "thresholds") # Summarize the number of cells removed for each reason. DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2), SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2))
  67. How scater::isOutlier() works - Assume that most of the dataset

    consists of high-quality cells1 - (Optional: log-transform the QC metric) - Compute median QC metric - Compute median absolute deviation (MAD2) of QC metric - Identify outliers as those cells with a QC metric more than 3 MADs3 from the median in the "problematic" direction. - You can control how many MADs are acceptable - You can decide which direction is problematic 1But we'll relax that assumption in a few moments 2MAD is similar to standard deviation 3Loosely, 3 MADs will retain 99% of non-outliers values that follow a Normal distribution
  68. Assumptions of (adaptive) outlier detection Most cells are of acceptable

    quality ⚠ Adaptive thresholds will fail if majority of cells are of unacceptably low quality 🔬 Experimental support (e.g., imaging, cell viability) 🤔 What is 'acceptable' will vary by cell type Poor QC metrics are driven by technical factors rather than biological processes ⚠ Highly heterogeneous cell populations may contain cells with systematically low RNA content, for example 🤔 What cell types might be discarded by your QC metric thresholds
  69. Considering experimental factors More complex experiments may involve batches of

    cells with different experimental parameters (e.g., sequencing depth) 👉 Need to consider QC metrics stratified by batch
  70. Adaptive outlier detection accounting for batch batch <- paste0(sce.416b$phenotype, "-",

    sce.416b$block) qc.lib3 <- isOutlier(sce.416b$sum, log=TRUE, type="lower", batch=batch) qc.nexprs3 <- isOutlier(sce.416b$detected, log=TRUE, type="lower", batch=batch) qc.spike3 <- isOutlier(sce.416b$altexps_ERCC_percent, type="higher", batch=batch) qc.mito3 <- isOutlier(sce.416b$subsets_Mito_percent, type="higher", batch=batch) discard3 <- qc.lib3 | qc.nexprs3 | qc.spike3 | qc.mito3
  71. Adaptive outlier detection accounting for batch # Extract the thresholds

    attr(qc.lib3, "thresholds") attr(qc.nexprs3, "thresholds") # Summarize the number of cells removed for each reason DataFrame(LibSize=sum(qc.lib3), NExprs=sum(qc.nexprs3), SpikeProp=sum(qc.spike3), MitoProp=sum(qc.mito3), Total=sum(discard3)) ⚠ Use of batch requires assumption that most cells in each batch are of acceptable quality 🤔 What if an entire batch failed
  72. Illustrative dataset: Grun pancreas sce.grun <- GrunPancreasData() sce.grun <- addPerCellQC(sce.grun)

    Grün, D. et al. De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data. Cell Stem Cell 19, 266–277 (2016) Human pancreas samples from various donors processed with CEL-Seq2 https://osca.bioconductor.org/grun-human-pancreas-cel-seq2.html
  73. What to do if an entire batch failed plotColData(sce.grun, x="donor",

    y="altexps_ERCC_percent") ❓What systematic patterns does this show ❓How might this affect the assumptions required for outlier detection
  74. Adaptive outlier detection using a subset of the data discard.ercc

    <- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor) discard.ercc2 <- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor, subset=sce.grun$donor %in% c("D17", "D2", "D7")) plotColData(sce.grun, x="donor", y="altexps_ERCC_percent", colour_by=data.frame(discard = discard.ercc)) plotColData(sce.grun, x="donor", y="altexps_ERCC_percent", colour_by=data.frame(discard = discard.ercc2))
  75. Checking diagnostic plots # Add info about which cells are

    outliers sce.416b$discard <- discard2 # Look at this plot for each QC metric plotColData(sce.416b, x="block", y="sum", colour_by="discard", other_fields="phenotype") + facet_wrap(~phenotype) + scale_y_log10()
  76. Checking diagnostic plots # Another useful diagnostic plot plotColData(sce.416b, x="sum",

    y="subsets_Mito_percent", colour_by="discard", other_fields=c("block", "phenotype")) + facet_grid(block~phenotype) ⚠ Cells with both large total counts and large mitochondrial counts may be high-quality cells that happen to be highly metabolically active