for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. http://bioconductor.org/
(tree-like structure) Who is the author of the top ranked Bioconductor workflow that mentions “cancer” in the title? Hint: • http://bioconductor.org/ -> • http://bioconductor.org/packages/release/BiocViews.html#_ __Software ->
https://doi.org/10.1038/s41592-019-0654-x This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.
Research Endeavour (SCORE), Walter and Eliza Hall Institute of Medical Research bit.ly/WEHIscRNAworkshop This material has been adapted from Pete’s WEHI course
• (March 2020; 4 days) Intro to OSCA for LGC-UNAM at LIIGH-UNAM undergrad students (code in English) https://lcolladotor.github.io/osca_LIIGH_UNAM_2020/ • (August 2020; 2.5 days) Intro to OSCA at CSBD2020 https://github.com/ComunidadBioInfo/cdsb2020 (code comments in Spanish) • (August 2021; 5 days) http://www.nnb.unam.mx/
N Gene 1 0 1 ... 0 Gene 2 1 3 ... 0 ... ... ... ... ... Gene M 2 2 4 Gene counts Barcode Donor ... Treatment Cell 1 ACTGTA D1 ... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes Also have to coordinate derived data (log-normalized gene expression, PCA, t-SNE, etc.)
the SingleCellExperiment is represented by a separate 'slot' An analogy - An SCE is a cargo ship - Each 'slot' is a cargo box - Certain cargo boxes (slots) expect certain types of cargo (data)
L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html
# Extract the count matrix from the 416b dataset counts.416b <- counts(sce.416b) # Construct a new SCE from the counts matrix sce <- SingleCellExperiment( assays = list(counts = counts.416b))
the assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "counts") # 2. The special method for the assay named "counts" counts(sce) # Tip: Limit the output to just a few samples & genes counts(sce)[1:30, 1:2]
the object we just updated sce 👉We overwrote our previous 'sce' by reassigning the results back to 'sce' (possible because this particular function returns a 'SingleCellExperiment' that contains the results in addition to the original data) ⚠ Some functions - especially those outside the single-cell oriented Bioconductor packages - do not do this, then we need to do extra work
assays slot # WARNING: This will flood RStudio with output! # 1. The general method assay(sce, "logcounts") # 2. The special method for the assay named "logcounts" logcounts(sce) # Tip: Limit the output to just a few samples & genes logcounts(sce)[1:30, 1:2]
... Drug Cell 2 TGCATA D1 ... Control ... ... ... ... ... Cell N CCTATA D6 Drug Information about the cells ID Symbol ... Chromosome Gene 1 ENSG00000155816 FMN2 ... 1 Gene 2 ENSG00000229807 XIST ... X ... ... ... ... ... Gene M ENSG00000139618 BRCA2 13 Information about the genes Information about the experiment - "Look at expression of genes A, B, and C" - "There could be problems with samples from the first batch because of issue with FACS sort"
the 416b dataset colData.416b <- colData(sce.416b) # Add some of the sample metadata to our SCE colData(sce) <- colData.416b[, c("phenotype", "block")] # Inspect the object we just updated sce # Access the sample metadata from our SCE colData(sce) # Access a specific column of sample metadata from our SCE sce$block
wild type cells # Remember, cells are columns of the SCE sce[, sce$phenotype == "wild type phenotype"] 🤔 What if I want to subset the data to certain cells?
from our SCE # It's currently empty! rowData(sce) # Example of function that adds extra fields to rowData sce <- scater::addPerFeatureQC(sce) # Access the feature metadata from our updated SCE rowData(sce)
annotation database # using AnnotationHub resources library(AnnotationHub) ah <- AnnotationHub() query(ah, c("Mus musculus", "Ensembl", "v97")) 🤔 What if I want to add the chromosome of each gene?
genes on chromosome 3 # NOTE: which() needed to cope with NA chromosome names sce[which(rowData(sce)$chromosome == "3"), ] 🤔 What if I want to subset the data to certain features?
commonly also have: - Dimensionality reduction results (e.g., t-SNE, UMAP) - Special or 'alternative' types of features (e.g., spike-in RNAs, antibody-derived sequencing tags) 🤓 SingleCellExperiment is an extension of SummarizedExperiment
of logcounts # NOTE: We'll learn more about PCA later sce <- scater::runPCA(sce) # Inspect the object we just updated sce # Access the PCA matrix from the reducedDims slot reducedDim(sce, "PCA") 🤔 What if I want to store a dimensionality reduced version of the data?
representation of logcounts # NOTE: We'll learn more about t-SNE later sce <- scater::runTSNE(sce) # Inspect the object we just updated sce # Access the t-SNE matrix from the reducedDims slot reducedDim(sce, "TSNE") 🤔 What if I want to store a dimensionality reduced version of the data?
UMAP representation of logcounts # NOTE: We'll learn more about UMAP later and a # simpler way to compute it. u <- uwot::umap(t(logcounts(sce)), n_component=2) # Add the UMAP matrix to the reducedDims slot # Access the UMAP matrix from the reducedDims slot reducedDim(sce, "UMAP") <- u # List the dimensionality reduction results stored in # the object reducedDims(sce) 🤔 What if I want to store a dimensionality reduced version of the data?
Lun, A. T. L., Calero-Nieto, F. J., Haim-Vilmovsky, L., Göttgens, B. & Marioni, J. C. Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data. Genome Res. 27, 1795–1806 (2017) - Immortalized mouse myeloid progenitor cell line processed using SmartSeq2 - A constant amount of ERCC spike-in RNA was added to each cell’s lysate prior to library preparation. - https://osca.bioconductor.org/lun-416b-cell-line-smart-seq2.html
genes/features altexps_ERCC_percent: Percentage of reads mapped to (ERCC) spike-in transcripts subsets_Mito_percent: Percentage of reads mapped to mitochondrial transcripts
+ scale_y_log10() plotColData(sce.416b, x="block", y="detected", other_fields="phenotype") + scale_y_log10() + facet_wrap(~phenotype) ▶ Try plotting one of the other QC metrics
attr(qc.nexprs2, "thresholds") # Summarize the number of cells removed for each reason. DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2), SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2))
consists of high-quality cells1 - (Optional: log-transform the QC metric) - Compute median QC metric - Compute median absolute deviation (MAD2) of QC metric - Identify outliers as those cells with a QC metric more than 3 MADs3 from the median in the "problematic" direction. - You can control how many MADs are acceptable - You can decide which direction is problematic 1But we'll relax that assumption in a few moments 2MAD is similar to standard deviation 3Loosely, 3 MADs will retain 99% of non-outliers values that follow a Normal distribution
quality ⚠ Adaptive thresholds will fail if majority of cells are of unacceptably low quality 🔬 Experimental support (e.g., imaging, cell viability) 🤔 What is 'acceptable' will vary by cell type Poor QC metrics are driven by technical factors rather than biological processes ⚠ Highly heterogeneous cell populations may contain cells with systematically low RNA content, for example 🤔 What cell types might be discarded by your QC metric thresholds
attr(qc.lib3, "thresholds") attr(qc.nexprs3, "thresholds") # Summarize the number of cells removed for each reason DataFrame(LibSize=sum(qc.lib3), NExprs=sum(qc.nexprs3), SpikeProp=sum(qc.spike3), MitoProp=sum(qc.mito3), Total=sum(discard3)) ⚠ Use of batch requires assumption that most cells in each batch are of acceptable quality 🤔 What if an entire batch failed
Grün, D. et al. De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data. Cell Stem Cell 19, 266–277 (2016) Human pancreas samples from various donors processed with CEL-Seq2 https://osca.bioconductor.org/grun-human-pancreas-cel-seq2.html
y="subsets_Mito_percent", colour_by="discard", other_fields=c("block", "phenotype")) + facet_grid(block~phenotype) ⚠ Cells with both large total counts and large mitochondrial counts may be high-quality cells that happen to be highly metabolically active