HCA-LA-2021 - Speaker Deck

Slide 1

Slide 1 text

Getting started with scRNA-seq analyses with Bioconductor Leonardo Collado Torres lcolladotor.github.io 2021-05-03 Human Cell Atlas Latin America

Slide 2

Slide 2 text

Slides & Code ● Slides : https://docs.google.com/presentation/d/1CgBtdoE5GAhbqAaGzxR5HEkUTbXNZsKEfym9099 6vWQ/edit?usp=sharing ● Code https://lcolladotor.github.io/HCA_LA_2021 ● R Code https://github.com/lcolladotor/HCA_LA_2021/blob/master/2021-05-03_code_HCA_LA.R

Slide 3

Slide 3 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT R code for installing R/Bioconductor packages used here ## For installing Bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") ## Install required packages BiocManager::install( c('SingleCellExperiment', 'scran', 'scater', 'scRNAseq', 'AnnotationHub', 'ExperimentHub', 'uwot', ‘Rtsne’, 'pryr', 'sessioninfo'))

Slide 4

Slide 4 text

Continuous learning & community Me: Learning to teach

Slide 5

Slide 5 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT http://bioconductor.org/about/community-advisory-board/ https://comunidadbioinfo.github.io/

Slide 6

Slide 6 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT http://research.libd.org/rstatsclub/ & https://www.youtube.com/c/LeonardoColladoTorres https://youtu.be/lqxtgpD-heM May 2020 (English)

Slide 7

Slide 7 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT https://lcolladotor.github.io/RBiocIntro2020/ July 2020 - Spanish https://youtu.be/SZVPXT2McQM & https://www.youtube.com/channel/UCHCdYfAXVzJIUkMoMSGiZMw

Slide 8

Slide 8 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Find news about the latest developments Twitter ● https://twitter.com/bioconductor ● https://twitter.com/LIBDrstats (my work R club) ● https://twitter.com/CDSBMexico ● #rstats: general ● #rstatsES: general & in Spanish ● #BioC2021: upcoming Bioconductor conference

Slide 9

Slide 9 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Join a community: virtual “academic department” Slack ● Bioconductor (English) https://bioc-community.herokuapp.com ● CDSB (Spanish) https://join.slack.com/t/comunidadbioinfo/shared_invite/z t-8lsvpm84-Fne1W0hadk6cpjgJS17Tnw ● LatinR (Spanish, English, Portuguese) https://latin-r.com/blog/communities Blog post with Stephanie Hicks http://lcolladotor.github.io/2018/06/19/using-slack-for-academic-departmental-communication/

Slide 10

Slide 10 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Foster a welcoming environment: follow the code of conduct! http://bioconductor.org/about/ code-of-conduct/

Slide 11

Slide 11 text

Bioconductor

Slide 12

Slide 12 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT About Bioconductor Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. http://bioconductor.org/

Slide 13

Slide 13 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT R/Bioconductor packages (BioC 3.12 from October 2020) ● Software: 1975 ● Annotation: 971 ● Experimental Data: 398 ● Workflow guides: 28 http://bioconductor.org/packages/release/BiocViews.html#___S oftware

Slide 14

Slide 14 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Exercise: navigating the biocViews terms (tree-like structure) Who is the author of the top ranked Bioconductor workflow that mentions “cancer” in the title? Hint: ● http://bioconductor.org/ -> ● http://bioconductor.org/packages/release/BiocViews.html#_ __Software ->

Slide 15

Slide 15 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Answer: Tiago Chedraoui Silva http://bioconductor.org/packages/release/BiocViews.html#___W orkflow

Slide 16

Slide 16 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Let’s say you found a package of interest: scran

Slide 17

Slide 17 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Learn more about it before installing it: text description & DOI https://bioconductor.org/packages/scran

Slide 18

Slide 18 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Learn more about it before installing it: vignette https://bioconductor.org/packages/scran

Slide 19

Slide 19 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Learn more about it before installing it: vignette http://bioconductor.org/packages/release/bioc/vignettes/scran/inst/doc/scran.html

Slide 20

Slide 20 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Ok, now we can install it (if we have the right version of R) https://bioconductor.org/packages/scran

Slide 21

Slide 21 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT R and Bioconductor versions https://twitter.com/lcolladotor/status/1388641318604316673?s=20

Slide 22

Slide 22 text

Orchestrating Single Cell Analysis (OSCA) with Bioconductor

Slide 23

Slide 23 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT OSCA Book & paper http://bioconductor.org/books/release/OSCA/ https://doi.org/10.1038/s41592-019-0654-x This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.

Slide 24

Slide 24 text

Single-cell RNA-sequencing (scRNA-seq) workshop Peter Hickey @PeteHaitch Single Cell Open Research Endeavour (SCORE), Walter and Eliza Hall Institute of Medical Research bit.ly/WEHIscRNAworkshop This material has been adapted from Pete’s WEHI course

Slide 25

Slide 25 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Other versions of this material ● (March 2020; 4 days) Intro to OSCA for LGC-UNAM at LIIGH-UNAM undergrad students (code in English) https://lcolladotor.github.io/osca_LIIGH_UNAM_2020/ ● (August 2020; 2.5 days) Intro to OSCA at CSBD2020 https://github.com/ComunidadBioInfo/cdsb2020 (code comments in Spanish) ● (August 2021; 5 days) http://www.nnb.unam.mx/

Slide 26

Slide 26 text

Workflows not pipelines or recipes

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

R Data Structures for scRNA-seq

Slide 30

Slide 30 text

Background

Slide 31

Slide 31 text

Background Common data infrastructure makes life easier - Can switch between different tools more easily - No need to convert between formats What data do we need to analyse a scRNA-seq experiment?

Slide 32

Slide 32 text

What data do we start with? Cell 1 Cell 2 ... Cell N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts

Slide 33

Slide 33 text

What data do we start with? Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells

Slide 34

Slide 34 text

What data do we start with? ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes

Slide 35

Slide 35 text

What data do we create in an analysis? Cell 1 Cell 2 ... Cell N Gene 1 0 0.6 ... 0 Gene 2 0.3 0.8 ... 0 ... ... ... ... ... Gene M 0.35 0.67 2.1 Log-normalized gene expression

Slide 36

Slide 36 text

What data do we create in an analysis? PCA 1 PCA 2 ... PCA K Cell 1 0.93 1.28 ... 0.03 Cell 2 0.32 1.22 ... 0.09 ... ... ... ... ... Cell N -0.66 1.00 0.15 Dimensionality reduction t-SNE 1 t-SNE 2 Cell 1 1.24 8.93 Cell 2 -0.33 7.85 ... ... ... Cell N 0.46 3.41

Slide 37

Slide 37 text

How to coordinate this? Cell 1 Cell 2 ... Cell N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Slide 42

Slide 42 text

SingleCellExperiment

Slide 43

Slide 43 text

The anatomy of a SingleCellExperiment Each piece of (meta)data in the SingleCellExperiment is represented by a separate 'slot' An analogy - An SCE is a cargo ship - Each 'slot' is a cargo box - Certain cargo boxes (slots) expect certain types of cargo (data)

Slide 44

Slide 44 text

Illustrative dataset: 416B

Slide 45

Slide 45 text

Illustrative dataset: 416B library(scRNAseq) sce.416b <- LunSpikeInData(which="416b") Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html

Slide 46

Slide 46 text

https://github.com/lcolladotor/osca_LIIGH_UNAM_2020/blob/master/02-data- infrastructure-and-import.R#L57-L208 https://lcolladotor.github.io/HCA_LA_2021/ R Code

Slide 47

Slide 47 text

Storing primary experimental data

Slide 48

Slide 48 text

Storing primary experimental data Cell 1 Cell 2 ... Cell N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts

Slide 49

Slide 49 text

Filling the assays slot # Load the SingleCellExperiment package library(SingleCellExperiment) # Extract the count matrix from the 416b dataset counts.416b <- counts(sce.416b) # Construct a new SCE from the counts matrix sce <- SingleCellExperiment( assays = list(counts = counts.416b))

Slide 50

Slide 50 text

Filling the assays slot # Inspect the object we just created sce

Slide 51

Slide 51 text

Filling the assays slot # Access the counts matrix from the assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "counts") # 2. The special method for the assay named "counts" counts(sce) # Tip: Limit the output to just a few samples & genes counts(sce)[1:30, 1:2]

Slide 52

Slide 52 text

Adding to the assays slot sce <- scater::logNormCounts(sce) # Inspect the object we just updated sce

Slide 53

Slide 53 text

Adding to the assays slot sce <- scater::logNormCounts(sce) # Inspect the object we just updated sce 👉We overwrote our previous 'sce' by reassigning the results back to 'sce' (possible because this particular function returns a 'SingleCellExperiment' that contains the results in addition to the original data) ⚠ Some functions - especially those outside the single-cell oriented Bioconductor packages - do not do this, then we need to do extra work

Slide 54

Slide 54 text

Adding more assays # Access the logcounts matrix from the assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "logcounts") # 2. The special method for the assay named "logcounts" logcounts(sce) # Tip: Limit the output to just a few samples & genes logcounts(sce)[1:30, 1:2]

Slide 55

Slide 55 text

Adding more assays # assign a new entry to assays slot assay(sce, "counts_100") <- assay(sce, "counts") + 100 # List the assays in the object assays(sce) 🤔 What if I want to add a custom assay?

Slide 56

Slide 56 text

Storing metadata

Slide 57

Slide 57 text

Storing metadata Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes Information about the experiment - "Look at expression of genes A, B, and C" - "There could be problems with samples from the first batch because of issue with FACS sort"

Slide 58

Slide 58 text

Storing metadata on the columns (cells) Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells

Slide 59

Slide 59 text

Filling the colData slot # Extract the sample metadata from the 416b dataset colData.416b <- colData(sce.416b) # Add some of the sample metadata to our SCE colData(sce) <- colData.416b[, c("phenotype", "block")] # Inspect the object we just updated sce # Access the sample metadata from our SCE colData(sce) # Access a specific column of sample metadata from our SCE sce$block

Slide 60

Slide 60 text

Adding to the colData slot # Example of function that adds extra fields to colData sce <- scater::addPerCellQC(sce.416b) # Access the sample metadata from our updated SCE colData(sce)

Slide 61

Slide 61 text

Using colData for subsetting # E.g., subset data to just wild type cells # Remember, cells are columns of the SCE sce[, sce$phenotype == "wild type phenotype"] 🤔 What if I want to subset the data to certain cells?

Slide 62

Slide 62 text

Storing metadata on the rows (features/genes) ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes

Slide 63

Slide 63 text

Adding to the rowData slot # Access the feature metadata from our SCE # It's currently empty! rowData(sce) # Example of function that adds extra fields to rowData sce <- scater::addPerFeatureQC(sce) # Access the feature metadata from our updated SCE rowData(sce)

Slide 64

Slide 64 text

Adding to the rowData slot # Download the relevant Ensembl annotation database # using AnnotationHub resources library(AnnotationHub) ah <- AnnotationHub() query(ah, c("Mus musculus", "Ensembl", "v97")) 🤔 What if I want to add the chromosome of each gene?

Slide 65

Slide 65 text

Adding to the rowData slot # Annotate each gene with its chromosome location ensdb <- ah[["AH73905"]] chromosome <- mapIds(ensdb, keys=rownames(sce), keytype="GENEID", column="SEQNAME") rowData(sce)$chromosome <- chromosome # Access the feature metadata from our updated SCE rowData(sce)

Slide 66

Slide 66 text

Using rowData for subsetting # E.g., subset data to just genes on chromosome 3 # NOTE: which() needed to cope with NA chromosome names sce[which(rowData(sce)$chromosome == "3"), ] 🤔 What if I want to subset the data to certain features?

Slide 67

Slide 67 text

Storing other metadata 😬 Information about the experiment - "Look at expression of genes A, B, and C" - "There could be problems with samples from the first batch because of issue with FACS sort"

Slide 68

Slide 68 text

Adding to the metadata slot # Access the metadata from our SCE # It's currently empty! metadata(sce) # The metadata slot is Vegas - anything goes metadata(sce) <- list( favourite_genes=c("Shh", "Nck1", "Diablo"), analyst=c("Pete")) # Access the metadata from our updated SCE metadata(sce)

Slide 69

Slide 69 text

Storing single-cell-specific data

Slide 70

Slide 70 text

Background So far, we've covered: - 'assays' (primary data) - 'colData' (cell metadata) - 'rowData' (feature metadata) - 'metadata' (experiment-level metadata) 🤓 The above is a SummarizedExperiment

Slide 71

Slide 71 text

So why do we need SingleCellExperiment? For single-cell data, we commonly also have: - Dimensionality reduction results (e.g., t-SNE, UMAP) - Special or 'alternative' types of features (e.g., spike-in RNAs, antibody-derived sequencing tags) 🤓 SingleCellExperiment is an extension of SummarizedExperiment

Slide 72

Slide 72 text

Storing dimensionality reduction results PCA 1 PCA 2 ... PCA K Cell 1 0.93 1.28 ... 0.03 Cell 2 0.32 1.22 ... 0.09 ... ... ... ... ... Cell N -0.66 1.00 0.15 Dimensionality reduction t-SNE 1 t-SNE 2 Cell 1 1.24 8.93 Cell 2 -0.33 7.85 ... ... ... Cell N 0.46 3.41

Slide 73

Slide 73 text

Adding to the reducedDims slot # E.g., add the PCA of logcounts # NOTE: We'll learn more about PCA later sce <- scater::runPCA(sce) # Inspect the object we just updated sce # Access the PCA matrix from the reducedDims slot reducedDim(sce, "PCA") 🤔 What if I want to store a dimensionality reduced version of the data?

Slide 74

Slide 74 text

Adding to the reducedDims slot # E.g., add a t-SNE representation of logcounts # NOTE: We'll learn more about t-SNE later sce <- scater::runTSNE(sce) # Inspect the object we just updated sce # Access the t-SNE matrix from the reducedDims slot reducedDim(sce, "TSNE") 🤔 What if I want to store a dimensionality reduced version of the data?

Slide 75

Slide 75 text

Adding to the reducedDims slot # E.g., add a 'manual' UMAP representation of logcounts # NOTE: We'll learn more about UMAP later and a # simpler way to compute it. u <- uwot::umap(t(logcounts(sce)), n_component=2) # Add the UMAP matrix to the reducedDims slot # Access the UMAP matrix from the reducedDims slot reducedDim(sce, "UMAP") <- u # List the dimensionality reduction results stored in # the object reducedDims(sce) 🤔 What if I want to store a dimensionality reduced version of the data?

Slide 76

Slide 76 text

Quick Quality Control

Slide 77

Slide 77 text

https://github.com/lcolladotor/osca_LIIGH_UNAM_2020/blob/master/03-quality- control.R#L57-L226 https://lcolladotor.github.io/HCA_LA_2021/ R Code

Slide 78

Slide 78 text

Motivation

Slide 79

Slide 79 text

Motivation 💩 💩 Fancy single-cell method

Slide 80

Slide 80 text

Reasons for 💩 Cell damage during dissociation Failures in library preparation Inefficient reverse transcription PCR amplification Plate alignment issues during FACS

Slide 81

Slide 81 text

Signs of 💩 'Cells' featuring one or more of: ⚠ Low total counts ⚠ Few expressed genes ⚠ High proportion of reads coming from mitochondria ⚠ High proportion of reads coming from spike-ins

Slide 82

Slide 82 text

Consequences of 💩 Distinct cluster(s) with lots of 💩, complicating interpretation of results Distortion of population heterogeneity Artificial 'upregulation' of certain genes

Slide 83

Slide 83 text

Avoiding (or mitigating) 💩 🥇Careful experimental design and sample preparation 🥈Quality control (QC) on the cells Identifying/removing 'bad' cells Identifying/removing 'bad' features

Slide 84

Slide 84 text

Illustrative dataset: 416B

Slide 85

Slide 85 text

Illustrative dataset: 416B library(scRNAseq) sce.416b <- LunSpikeInData(which="416b") sce.416b$block <- factor(sce.416b$block) Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation. - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html

Slide 86

Slide 86 text

Illustrative dataset: 416B

Slide 87

Slide 87 text

Choice of QC metrics

Slide 88

Slide 88 text

Common QC metrics sum: Library size detected: Number of expressed genes/features altexps_ERCC_percent: Percentage of reads mapped to (ERCC) spike-in transcripts subsets_Mito_percent: Percentage of reads mapped to mitochondrial transcripts

Slide 89

Slide 89 text

Aside: Identifying the mitochondrial transcripts # Download the relevant Ensembl annotation database # using AnnotationHub resources library(AnnotationHub) ah <- AnnotationHub() query(ah, c("Mus musculus", "Ensembl", "v97")) # Annotate each gene with its chromosome location ens.mm.v97 <- ah[["AH73905"]] location <- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype="GENEID", column="SEQNAME") # Identify the mitochondrial genes is.mito <- which(location=="MT")

Slide 90

Slide 90 text

Computing the QC metrics scater::addPerCellQC() library(scater) sce.416b <- addPerCellQC(sce.416b, subsets=list(Mito=is.mito)) ❗We'll also use scater::perCellQCMetrics() in this workshop

Slide 91

Slide 91 text

Visualizing QC metrics

Slide 92

Slide 92 text

Visualizing QC metrics scater::plotColData() plotColData(sce.416b, x="block", y="detected") plotColData(sce.416b, x="block", y="detected") + scale_y_log10() plotColData(sce.416b, x="block", y="detected", other_fields="phenotype") + scale_y_log10() + facet_wrap(~phenotype) ▶ Try plotting one of the other QC metrics

Slide 93

Slide 93 text

Identifying low-quality cells

Slide 94

Slide 94 text

Identifying low-quality cells 🥈Using fixed thresholds 🥇Using adaptive thresholds Other approaches

Slide 95

Slide 95 text

Using fixed thresholds # Example thresholds qc.lib <- sce.416b$sum < 100000 qc.nexprs <- sce.416b$detected < 5000 qc.spike <- sce.416b$altexps_ERCC_percent > 10 qc.mito <- sce.416b$subsets_Mito_percent > 10 discard <- qc.lib | qc.nexprs | qc.spike | qc.mito # Summarize the number of cells removed for each reason DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs), SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard))

Slide 96

Slide 96 text

Using fixed thresholds ➕ Simple ➖ Hard to determine appropriate thresholds for each experimental protocol and biological system

Slide 97

Slide 97 text

Using adaptive thresholds scater::isOutlier() qc.lib2 <- isOutlier(sce.416b$sum, log=TRUE, type="lower") qc.nexprs2 <- isOutlier(sce.416b$detected, log=TRUE, type="lower") qc.spike2 <- isOutlier(sce.416b$altexps_ERCC_percent, type="higher") qc.mito2 <- isOutlier(sce.416b$subsets_Mito_percent, type="higher") discard2 <- qc.lib2 | qc.nexprs2 | qc.spike2 | qc.mito2

Slide 98

Slide 98 text

Using adaptive thresholds scater::isOutlier() # Extract the thresholds attr(qc.lib2, "thresholds") attr(qc.nexprs2, "thresholds") # Summarize the number of cells removed for each reason. DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2), SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2))

Slide 99

Slide 99 text

How scater::isOutlier() works - Assume that most of the dataset consists of high-quality cells1 - (Optional: log-transform the QC metric) - Compute median QC metric - Compute median absolute deviation (MAD2) of QC metric - Identify outliers as those cells with a QC metric more than 3 MADs3 from the median in the "problematic" direction. - You can control how many MADs are acceptable - You can decide which direction is problematic 1But we'll relax that assumption in a few moments 2MAD is similar to standard deviation 3Loosely, 3 MADs will retain 99% of non-outliers values that follow a Normal distribution

Slide 100

Slide 100 text

isOutlier(sce.416b$sum, log=TRUE, type="lower")

Slide 101

Slide 101 text

Assumptions of (adaptive) outlier detection Most cells are of acceptable quality ⚠ Adaptive thresholds will fail if majority of cells are of unacceptably low quality 🔬 Experimental support (e.g., imaging, cell viability) 🤔 What is 'acceptable' will vary by cell type Poor QC metrics are driven by technical factors rather than biological processes ⚠ Highly heterogeneous cell populations may contain cells with systematically low RNA content, for example 🤔 What cell types might be discarded by your QC metric thresholds

Slide 102

Slide 102 text

Considering experimental factors More complex experiments may involve batches of cells with different experimental parameters (e.g., sequencing depth) 👉 Need to consider QC metrics stratified by batch

Slide 103

Slide 103 text

Considering experimental factors plotColData(sce.416b, x="block", y="detected", other_fields="phenotype") + scale_y_log10() + facet_wrap(~phenotype) ❓What systematic patterns does this show

Slide 104

Slide 104 text

Adaptive outlier detection accounting for batch batch <- paste0(sce.416b$phenotype, "-", sce.416b$block) qc.lib3 <- isOutlier(sce.416b$sum, log=TRUE, type="lower", batch=batch) qc.nexprs3 <- isOutlier(sce.416b$detected, log=TRUE, type="lower", batch=batch) qc.spike3 <- isOutlier(sce.416b$altexps_ERCC_percent, type="higher", batch=batch) qc.mito3 <- isOutlier(sce.416b$subsets_Mito_percent, type="higher", batch=batch) discard3 <- qc.lib3 | qc.nexprs3 | qc.spike3 | qc.mito3

Slide 105

Slide 105 text

Adaptive outlier detection accounting for batch # Extract the thresholds attr(qc.lib3, "thresholds") attr(qc.nexprs3, "thresholds") # Summarize the number of cells removed for each reason DataFrame(LibSize=sum(qc.lib3), NExprs=sum(qc.nexprs3), SpikeProp=sum(qc.spike3), MitoProp=sum(qc.mito3), Total=sum(discard3)) ⚠ Use of batch requires assumption that most cells in each batch are of acceptable quality 🤔 What if an entire batch failed

Slide 106

Slide 106 text

Illustrative dataset: Grun pancreas

Slide 107

Slide 107 text

Illustrative dataset: Grun pancreas sce.grun <- GrunPancreasData() sce.grun <- addPerCellQC(sce.grun) Grün, D. et al. De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data. Cell Stem Cell 19, 266–277 (2016) Human pancreas samples from various donors processed with CEL-Seq2 https://osca.bioconductor.org/grun-human-pancreas-cel-seq2.html

Slide 108

Slide 108 text

What to do if an entire batch failed plotColData(sce.grun, x="donor", y="altexps_ERCC_percent") ❓What systematic patterns does this show ❓How might this affect the assumptions required for outlier detection

Slide 109

Slide 109 text

Adaptive outlier detection using a subset of the data discard.ercc <- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor) discard.ercc2 <- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor, subset=sce.grun$donor %in% c("D17", "D2", "D7")) plotColData(sce.grun, x="donor", y="altexps_ERCC_percent", colour_by=data.frame(discard = discard.ercc)) plotColData(sce.grun, x="donor", y="altexps_ERCC_percent", colour_by=data.frame(discard = discard.ercc2))

Slide 110

Slide 110 text

Adaptive outlier detection using a subset

Slide 111

Slide 111 text

Checking diagnostic plots

Slide 112

Slide 112 text

Checking diagnostic plots # Add info about which cells are outliers sce.416b$discard <- discard2 # Look at this plot for each QC metric plotColData(sce.416b, x="block", y="sum", colour_by="discard", other_fields="phenotype") + facet_wrap(~phenotype) + scale_y_log10()

Slide 113

Slide 113 text

Checking diagnostic plots # Another useful diagnostic plot plotColData(sce.416b, x="sum", y="subsets_Mito_percent", colour_by="discard", other_fields=c("block", "phenotype")) + facet_grid(block~phenotype) ⚠ Cells with both large total counts and large mitochondrial counts may be high-quality cells that happen to be highly metabolically active

Slide 114

Slide 114 text

Hands on

Slide 115

Slide 115 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT https://lcolladotor.github.io/HCA_LA_2021/ & beyond

Slide 116

Slide 116 text

LIEBER INSTITUTE for BRAIN DEVELOPMENT Interested in working with us? Check https://www.libd.org/careers/