HCA-LA-2021

Getting started with scRNA-seq analyses with Bioconductor Leonardo Collado Torres
lcolladotor.github.io 2021-05-03 Human Cell Atlas Latin America

Slides & Code • Slides : https://docs.google.com/presentation/d/1CgBtdoE5GAhbqAaGzxR5HEkUTbXNZsKEfym9099 6vWQ/edit?usp=sharing • Code
https://lcolladotor.github.io/HCA_LA_2021 • R Code https://github.com/lcolladotor/HCA_LA_2021/blob/master/2021-05-03_code_HCA_LA.R

LIEBER INSTITUTE for BRAIN DEVELOPMENT R code for installing R/Bioconductor
packages used here ## For installing Bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") ## Install required packages BiocManager::install( c('SingleCellExperiment', 'scran', 'scater', 'scRNAseq', 'AnnotationHub', 'ExperimentHub', 'uwot', ‘Rtsne’, 'pryr', 'sessioninfo'))

Continuous learning & community Me: Learning to teach

LIEBER INSTITUTE for BRAIN DEVELOPMENT http://bioconductor.org/about/community-advisory-board/ https://comunidadbioinfo.github.io/

LIEBER INSTITUTE for BRAIN DEVELOPMENT http://research.libd.org/rstatsclub/ & https://www.youtube.com/c/LeonardoColladoTorres https://youtu.be/lqxtgpD-heM May
2020 (English)

LIEBER INSTITUTE for BRAIN DEVELOPMENT https://lcolladotor.github.io/RBiocIntro2020/ July 2020 - Spanish
https://youtu.be/SZVPXT2McQM & https://www.youtube.com/channel/UCHCdYfAXVzJIUkMoMSGiZMw

LIEBER INSTITUTE for BRAIN DEVELOPMENT Find news about the latest
developments Twitter • https://twitter.com/bioconductor • https://twitter.com/LIBDrstats (my work R club) • https://twitter.com/CDSBMexico • #rstats: general • #rstatsES: general & in Spanish • #BioC2021: upcoming Bioconductor conference

LIEBER INSTITUTE for BRAIN DEVELOPMENT Join a community: virtual “academic
department” Slack • Bioconductor (English) https://bioc-community.herokuapp.com • CDSB (Spanish) https://join.slack.com/t/comunidadbioinfo/shared_invite/z t-8lsvpm84-Fne1W0hadk6cpjgJS17Tnw • LatinR (Spanish, English, Portuguese) https://latin-r.com/blog/communities Blog post with Stephanie Hicks http://lcolladotor.github.io/2018/06/19/using-slack-for-academic-departmental-communication/

LIEBER INSTITUTE for BRAIN DEVELOPMENT Foster a welcoming environment: follow
the code of conduct! http://bioconductor.org/about/ code-of-conduct/

Bioconductor

LIEBER INSTITUTE for BRAIN DEVELOPMENT About Bioconductor Bioconductor provides tools
for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. http://bioconductor.org/

LIEBER INSTITUTE for BRAIN DEVELOPMENT R/Bioconductor packages (BioC 3.12 from
October 2020) • Software: 1975 • Annotation: 971 • Experimental Data: 398 • Workflow guides: 28 http://bioconductor.org/packages/release/BiocViews.html#___S oftware

LIEBER INSTITUTE for BRAIN DEVELOPMENT Exercise: navigating the biocViews terms
(tree-like structure) Who is the author of the top ranked Bioconductor workflow that mentions “cancer” in the title? Hint: • http://bioconductor.org/ -> • http://bioconductor.org/packages/release/BiocViews.html#_ __Software ->

LIEBER INSTITUTE for BRAIN DEVELOPMENT Answer: Tiago Chedraoui Silva http://bioconductor.org/packages/release/BiocViews.html#___W
orkflow

LIEBER INSTITUTE for BRAIN DEVELOPMENT Let’s say you found a
package of interest: scran

LIEBER INSTITUTE for BRAIN DEVELOPMENT Learn more about it before
installing it: text description & DOI https://bioconductor.org/packages/scran

installing it: vignette https://bioconductor.org/packages/scran

installing it: vignette http://bioconductor.org/packages/release/bioc/vignettes/scran/inst/doc/scran.html

LIEBER INSTITUTE for BRAIN DEVELOPMENT Ok, now we can install
it (if we have the right version of R) https://bioconductor.org/packages/scran

LIEBER INSTITUTE for BRAIN DEVELOPMENT R and Bioconductor versions https://twitter.com/lcolladotor/status/1388641318604316673?s=20

Orchestrating Single Cell Analysis (OSCA) with Bioconductor

LIEBER INSTITUTE for BRAIN DEVELOPMENT OSCA Book & paper http://bioconductor.org/books/release/OSCA/
https://doi.org/10.1038/s41592-019-0654-x This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.

Single-cell RNA-sequencing (scRNA-seq) workshop Peter Hickey @PeteHaitch Single Cell Open
Research Endeavour (SCORE), Walter and Eliza Hall Institute of Medical Research bit.ly/WEHIscRNAworkshop This material has been adapted from Pete’s WEHI course

LIEBER INSTITUTE for BRAIN DEVELOPMENT Other versions of this material
• (March 2020; 4 days) Intro to OSCA for LGC-UNAM at LIIGH-UNAM undergrad students (code in English) https://lcolladotor.github.io/osca_LIIGH_UNAM_2020/ • (August 2020; 2.5 days) Intro to OSCA at CSBD2020 https://github.com/ComunidadBioInfo/cdsb2020 (code comments in Spanish) • (August 2021; 5 days) http://www.nnb.unam.mx/

Workflows not pipelines or recipes

R Data Structures for scRNA-seq

Background

Background Common data infrastructure makes life easier - Can switch
between different tools more easily - No need to convert between formats What data do we need to analyse a scRNA-seq experiment?

What data do we start with? Cell 1 Cell 2
... Cell N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts

What data do we start with? Barcode Donor ... Treatment
Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells

What data do we start with? ID Symbol ... Chromosome
Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes

What data do we create in an analysis? Cell 1
Cell 2 ... Cell N Gene 1 0 0.6 ... 0 Gene 2 0.3 0.8 ... 0 ... ... ... ... ... Gene M 0.35 0.67 2.1 Log-normalized gene expression

What data do we create in an analysis? PCA 1
PCA 2 ... PCA K Cell 1 0.93 1.28 ... 0.03 Cell 2 0.32 1.22 ... 0.09 ... ... ... ... ... Cell N -0.66 1.00 0.15 Dimensionality reduction t-SNE 1 t-SNE 2 Cell 1 1.24 8.93 Cell 2 -0.33 7.85 ... ... ... Cell N 0.46 3.41

How to coordinate this? Cell 1 Cell 2 ... Cell
N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes

How to coordinate this? Cell 1 Cell 2 ... Cell
N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes Also have to coordinate derived data (log-normalized gene expression, PCA, t-SNE, etc.)

SingleCellExperiment

The anatomy of a SingleCellExperiment Each piece of (meta)data in
the SingleCellExperiment is represented by a separate 'slot' An analogy - An SCE is a cargo ship - Each 'slot' is a cargo box - Certain cargo boxes (slots) expect certain types of cargo (data)

Illustrative dataset: 416B

Illustrative dataset: 416B library(scRNAseq) sce.416b <- LunSpikeInData(which="416b") Lun, A. T.
L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html

https://github.com/lcolladotor/osca_LIIGH_UNAM_2020/blob/master/02-data- infrastructure-and-import.R#L57-L208 https://lcolladotor.github.io/HCA_LA_2021/ R Code

Storing primary experimental data

Storing primary experimental data Cell 1 Cell 2 ... Cell
N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts

Filling the assays slot # Load the SingleCellExperiment package library(SingleCellExperiment)
# Extract the count matrix from the 416b dataset counts.416b <- counts(sce.416b) # Construct a new SCE from the counts matrix sce <- SingleCellExperiment( assays = list(counts = counts.416b))

Filling the assays slot # Inspect the object we just
created sce

Filling the assays slot # Access the counts matrix from
the assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "counts") # 2. The special method for the assay named "counts" counts(sce) # Tip: Limit the output to just a few samples & genes counts(sce)[1:30, 1:2]

Adding to the assays slot sce <- scater::logNormCounts(sce) # Inspect
the object we just updated sce

Adding to the assays slot sce <- scater::logNormCounts(sce) # Inspect
the object we just updated sce 👉We overwrote our previous 'sce' by reassigning the results back to 'sce' (possible because this particular function returns a 'SingleCellExperiment' that contains the results in addition to the original data) ⚠ Some functions - especially those outside the single-cell oriented Bioconductor packages - do not do this, then we need to do extra work

Adding more assays # Access the logcounts matrix from the
assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "logcounts") # 2. The special method for the assay named "logcounts" logcounts(sce) # Tip: Limit the output to just a few samples & genes logcounts(sce)[1:30, 1:2]

Adding more assays # assign a new entry to assays
slot assay(sce, "counts_100") <- assay(sce, "counts") + 100 # List the assays in the object assays(sce) 🤔 What if I want to add a custom assay?

Storing metadata

Storing metadata Barcode Donor ... Treatment Cell 1 ACTGTA D1
... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes Information about the experiment - "Look at expression of genes A, B, and C" - "There could be problems with samples from the first batch because of issue with FACS sort"

Storing metadata on the columns (cells) Barcode Donor ... Treatment
Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells

Filling the colData slot # Extract the sample metadata from
the 416b dataset colData.416b <- colData(sce.416b) # Add some of the sample metadata to our SCE colData(sce) <- colData.416b[, c("phenotype", "block")] # Inspect the object we just updated sce # Access the sample metadata from our SCE colData(sce) # Access a specific column of sample metadata from our SCE sce$block

Adding to the colData slot # Example of function that
adds extra fields to colData sce <- scater::addPerCellQC(sce.416b) # Access the sample metadata from our updated SCE colData(sce)

Using colData for subsetting # E.g., subset data to just
wild type cells # Remember, cells are columns of the SCE sce[, sce$phenotype == "wild type phenotype"] 🤔 What if I want to subset the data to certain cells?

Storing metadata on the rows (features/genes) ID Symbol ... Chromosome
Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes

Adding to the rowData slot # Access the feature metadata
from our SCE # It's currently empty! rowData(sce) # Example of function that adds extra fields to rowData sce <- scater::addPerFeatureQC(sce) # Access the feature metadata from our updated SCE rowData(sce)

Adding to the rowData slot # Download the relevant Ensembl
annotation database # using AnnotationHub resources library(AnnotationHub) ah <- AnnotationHub() query(ah, c("Mus musculus", "Ensembl", "v97")) 🤔 What if I want to add the chromosome of each gene?

Adding to the rowData slot # Annotate each gene with
its chromosome location ensdb <- ah[["AH73905"]] chromosome <- mapIds(ensdb, keys=rownames(sce), keytype="GENEID", column="SEQNAME") rowData(sce)$chromosome <- chromosome # Access the feature metadata from our updated SCE rowData(sce)

Using rowData for subsetting # E.g., subset data to just
genes on chromosome 3 # NOTE: which() needed to cope with NA chromosome names sce[which(rowData(sce)$chromosome == "3"), ] 🤔 What if I want to subset the data to certain features?

Storing other metadata 😬 Information about the experiment - "Look
at expression of genes A, B, and C" - "There could be problems with samples from the first batch because of issue with FACS sort"

Adding to the metadata slot # Access the metadata from
our SCE # It's currently empty! metadata(sce) # The metadata slot is Vegas - anything goes metadata(sce) <- list( favourite_genes=c("Shh", "Nck1", "Diablo"), analyst=c("Pete")) # Access the metadata from our updated SCE metadata(sce)

Storing single-cell-specific data

Background So far, we've covered: - 'assays' (primary data) -
'colData' (cell metadata) - 'rowData' (feature metadata) - 'metadata' (experiment-level metadata) 🤓 The above is a SummarizedExperiment

So why do we need SingleCellExperiment? For single-cell data, we
commonly also have: - Dimensionality reduction results (e.g., t-SNE, UMAP) - Special or 'alternative' types of features (e.g., spike-in RNAs, antibody-derived sequencing tags) 🤓 SingleCellExperiment is an extension of SummarizedExperiment

Storing dimensionality reduction results PCA 1 PCA 2 ... PCA
K Cell 1 0.93 1.28 ... 0.03 Cell 2 0.32 1.22 ... 0.09 ... ... ... ... ... Cell N -0.66 1.00 0.15 Dimensionality reduction t-SNE 1 t-SNE 2 Cell 1 1.24 8.93 Cell 2 -0.33 7.85 ... ... ... Cell N 0.46 3.41

Adding to the reducedDims slot # E.g., add the PCA
of logcounts # NOTE: We'll learn more about PCA later sce <- scater::runPCA(sce) # Inspect the object we just updated sce # Access the PCA matrix from the reducedDims slot reducedDim(sce, "PCA") 🤔 What if I want to store a dimensionality reduced version of the data?

Adding to the reducedDims slot # E.g., add a t-SNE
representation of logcounts # NOTE: We'll learn more about t-SNE later sce <- scater::runTSNE(sce) # Inspect the object we just updated sce # Access the t-SNE matrix from the reducedDims slot reducedDim(sce, "TSNE") 🤔 What if I want to store a dimensionality reduced version of the data?

Adding to the reducedDims slot # E.g., add a 'manual'
UMAP representation of logcounts # NOTE: We'll learn more about UMAP later and a # simpler way to compute it. u <- uwot::umap(t(logcounts(sce)), n_component=2) # Add the UMAP matrix to the reducedDims slot # Access the UMAP matrix from the reducedDims slot reducedDim(sce, "UMAP") <- u # List the dimensionality reduction results stored in # the object reducedDims(sce) 🤔 What if I want to store a dimensionality reduced version of the data?

Quick Quality Control

https://github.com/lcolladotor/osca_LIIGH_UNAM_2020/blob/master/03-quality- control.R#L57-L226 https://lcolladotor.github.io/HCA_LA_2021/ R Code

Motivation

Motivation 💩 💩 Fancy single-cell method

Reasons for 💩 Cell damage during dissociation Failures in library
preparation Inefficient reverse transcription PCR amplification Plate alignment issues during FACS

Signs of 💩 'Cells' featuring one or more of: ⚠
Low total counts ⚠ Few expressed genes ⚠ High proportion of reads coming from mitochondria ⚠ High proportion of reads coming from spike-ins

Consequences of 💩 Distinct cluster(s) with lots of 💩, complicating
interpretation of results Distortion of population heterogeneity Artificial 'upregulation' of certain genes

Avoiding (or mitigating) 💩 🥇Careful experimental design and sample preparation
🥈Quality control (QC) on the cells Identifying/removing 'bad' cells Identifying/removing 'bad' features

Illustrative dataset: 416B library(scRNAseq) sce.416b <- LunSpikeInData(which="416b") sce.416b$block <- factor(sce.416b$block)
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation. - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html

Choice of QC metrics

Common QC metrics sum: Library size detected: Number of expressed
genes/features altexps_ERCC_percent: Percentage of reads mapped to (ERCC) spike-in transcripts subsets_Mito_percent: Percentage of reads mapped to mitochondrial transcripts

Aside: Identifying the mitochondrial transcripts # Download the relevant Ensembl
annotation database # using AnnotationHub resources library(AnnotationHub) ah <- AnnotationHub() query(ah, c("Mus musculus", "Ensembl", "v97")) # Annotate each gene with its chromosome location ens.mm.v97 <- ah[["AH73905"]] location <- mapIds(ens.mm.v97, keys=rownames(sce.416b), keytype="GENEID", column="SEQNAME") # Identify the mitochondrial genes is.mito <- which(location=="MT")

Computing the QC metrics scater::addPerCellQC() library(scater) sce.416b <- addPerCellQC(sce.416b, subsets=list(Mito=is.mito))
❗We'll also use scater::perCellQCMetrics() in this workshop

Visualizing QC metrics

Visualizing QC metrics scater::plotColData() plotColData(sce.416b, x="block", y="detected") plotColData(sce.416b, x="block", y="detected")
+ scale_y_log10() plotColData(sce.416b, x="block", y="detected", other_fields="phenotype") + scale_y_log10() + facet_wrap(~phenotype) ▶ Try plotting one of the other QC metrics

Identifying low-quality cells

Identifying low-quality cells 🥈Using fixed thresholds 🥇Using adaptive thresholds Other
approaches

Using fixed thresholds # Example thresholds qc.lib <- sce.416b$sum <
100000 qc.nexprs <- sce.416b$detected < 5000 qc.spike <- sce.416b$altexps_ERCC_percent > 10 qc.mito <- sce.416b$subsets_Mito_percent > 10 discard <- qc.lib | qc.nexprs | qc.spike | qc.mito # Summarize the number of cells removed for each reason DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs), SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard))

Using fixed thresholds ➕ Simple ➖ Hard to determine appropriate
thresholds for each experimental protocol and biological system

Using adaptive thresholds scater::isOutlier() qc.lib2 <- isOutlier(sce.416b$sum, log=TRUE, type="lower") qc.nexprs2
<- isOutlier(sce.416b$detected, log=TRUE, type="lower") qc.spike2 <- isOutlier(sce.416b$altexps_ERCC_percent, type="higher") qc.mito2 <- isOutlier(sce.416b$subsets_Mito_percent, type="higher") discard2 <- qc.lib2 | qc.nexprs2 | qc.spike2 | qc.mito2

Using adaptive thresholds scater::isOutlier() # Extract the thresholds attr(qc.lib2, "thresholds")
attr(qc.nexprs2, "thresholds") # Summarize the number of cells removed for each reason. DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2), SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2))

How scater::isOutlier() works - Assume that most of the dataset
consists of high-quality cells1 - (Optional: log-transform the QC metric) - Compute median QC metric - Compute median absolute deviation (MAD2) of QC metric - Identify outliers as those cells with a QC metric more than 3 MADs3 from the median in the "problematic" direction. - You can control how many MADs are acceptable - You can decide which direction is problematic 1But we'll relax that assumption in a few moments 2MAD is similar to standard deviation 3Loosely, 3 MADs will retain 99% of non-outliers values that follow a Normal distribution

isOutlier(sce.416b$sum, log=TRUE, type="lower")

Assumptions of (adaptive) outlier detection Most cells are of acceptable
quality ⚠ Adaptive thresholds will fail if majority of cells are of unacceptably low quality 🔬 Experimental support (e.g., imaging, cell viability) 🤔 What is 'acceptable' will vary by cell type Poor QC metrics are driven by technical factors rather than biological processes ⚠ Highly heterogeneous cell populations may contain cells with systematically low RNA content, for example 🤔 What cell types might be discarded by your QC metric thresholds

Considering experimental factors More complex experiments may involve batches of
cells with different experimental parameters (e.g., sequencing depth) 👉 Need to consider QC metrics stratified by batch

Considering experimental factors plotColData(sce.416b, x="block", y="detected", other_fields="phenotype") + scale_y_log10() +
facet_wrap(~phenotype) ❓What systematic patterns does this show

Adaptive outlier detection accounting for batch batch <- paste0(sce.416b$phenotype, "-",
sce.416b$block) qc.lib3 <- isOutlier(sce.416b$sum, log=TRUE, type="lower", batch=batch) qc.nexprs3 <- isOutlier(sce.416b$detected, log=TRUE, type="lower", batch=batch) qc.spike3 <- isOutlier(sce.416b$altexps_ERCC_percent, type="higher", batch=batch) qc.mito3 <- isOutlier(sce.416b$subsets_Mito_percent, type="higher", batch=batch) discard3 <- qc.lib3 | qc.nexprs3 | qc.spike3 | qc.mito3

Adaptive outlier detection accounting for batch # Extract the thresholds
attr(qc.lib3, "thresholds") attr(qc.nexprs3, "thresholds") # Summarize the number of cells removed for each reason DataFrame(LibSize=sum(qc.lib3), NExprs=sum(qc.nexprs3), SpikeProp=sum(qc.spike3), MitoProp=sum(qc.mito3), Total=sum(discard3)) ⚠ Use of batch requires assumption that most cells in each batch are of acceptable quality 🤔 What if an entire batch failed

Illustrative dataset: Grun pancreas

Illustrative dataset: Grun pancreas sce.grun <- GrunPancreasData() sce.grun <- addPerCellQC(sce.grun)
Grün, D. et al. De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data. Cell Stem Cell 19, 266–277 (2016) Human pancreas samples from various donors processed with CEL-Seq2 https://osca.bioconductor.org/grun-human-pancreas-cel-seq2.html

What to do if an entire batch failed plotColData(sce.grun, x="donor",
y="altexps_ERCC_percent") ❓What systematic patterns does this show ❓How might this affect the assumptions required for outlier detection

Adaptive outlier detection using a subset of the data discard.ercc
<- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor) discard.ercc2 <- isOutlier(sce.grun$altexps_ERCC_percent, type="higher", batch=sce.grun$donor, subset=sce.grun$donor %in% c("D17", "D2", "D7")) plotColData(sce.grun, x="donor", y="altexps_ERCC_percent", colour_by=data.frame(discard = discard.ercc)) plotColData(sce.grun, x="donor", y="altexps_ERCC_percent", colour_by=data.frame(discard = discard.ercc2))

Adaptive outlier detection using a subset

Checking diagnostic plots

Checking diagnostic plots # Add info about which cells are
outliers sce.416b$discard <- discard2 # Look at this plot for each QC metric plotColData(sce.416b, x="block", y="sum", colour_by="discard", other_fields="phenotype") + facet_wrap(~phenotype) + scale_y_log10()

Checking diagnostic plots # Another useful diagnostic plot plotColData(sce.416b, x="sum",
y="subsets_Mito_percent", colour_by="discard", other_fields=c("block", "phenotype")) + facet_grid(block~phenotype) ⚠ Cells with both large total counts and large mitochondrial counts may be high-quality cells that happen to be highly metabolically active

Hands on

LIEBER INSTITUTE for BRAIN DEVELOPMENT https://lcolladotor.github.io/HCA_LA_2021/ & beyond

LIEBER INSTITUTE for BRAIN DEVELOPMENT Interested in working with us?
Check https://www.libd.org/careers/

HCA-LA-2021

HCA-LA-2021

More Decks by Leonardo Collado-Torres

Other Decks in Science

Featured

Transcript