Successful scRNA-seq analysis

Successful scRNA-seq analysis ILC Summer School 2022

Luke Zappia
- Postdoctoral researcher (Theis Lab, Helmholtz Munich)
- Chemistry, Informatics, Bioinformatics
- scRNA-seq
  - Methods development
  - Software development
  - Benchmarking
  - Data analysis
@_lazappi_ @lazappi lazappi.id.au

Theis Lab
- Apply machine learning to biological data
- scRNA-seq
  - Integration and perturbations
  - Modelling of transitions
  - Multimodal analysis
@fabian_theis @ICBmunich www.comp.bio

1. What is scRNA-seq?
2. Designing an scRNA-seq experiment
3. Standard scRNA-seq analysis
4. Advanced analysis topics

1. What is scRNA-seq?

single-cell RNA sequencing

Why single-cell?

Single-cell capture
- Droplet-based
  - More cells
  - Easier
  - UMI
- Plate/well-based
  - Fewer cells
  - Custom setup
  - Full length, higher depth
  - More flexible

mccarrolllab.com/dropseq/ Macosko et al. DOI: 10.1016/j.cell.2015.05.002

UMI vs full-length
- Unique Molecular Identifiers
  - 5' AAAA (PCR){BARCODE}[UMI]TTTT
  - Better quantification
  - Less sequencing
  - No gene-length bias
- Full-length
  - Full coverage
  - More sequencing
  - Affected by gene length

Extensions
- Protein expression (CITE-seq, feature barcoding)
- Chromatin accessibility (scATAC-seq, 10x Multiome)
- Spatial location (10x Visium, MERFISH)
- Immune receptors (TCR/BCR profiling)
- Methylation, CRISPR screens, electrophysiology, ...
- Pre-sorting (FACS to enrich target cells)

CITE-seq
- Simultaneous measurement of RNA and protein expression
  - Protein ≠ RNA
- Uses nucleotide-tagged antibodies
- Targets need to be carefully selected
- Particularly useful for PBMCs

Multiplexing
- Genetic multiplexing
  - Easier but requires genetic diversity and reference panels
- Cell hashing
  - More complex but can be more flexible
- More samples, fewer batch effects

Comparison to bulk
- Gives insight into cellular variability
- Avoids the composition problem
- Much more complex analysis
- Much noisier
- Much sparser
  - But UMI data isn't zero inflated!

2. Experimental design

Who should be involved?
- Experimentalists
- Bioinformaticians
- PIs
- Collaborators

What is the question?
- What do you want to answer with this experiment?
  - Not necessarily a hypothesis
- Experimentalists should have a clear idea that is refined with input from analysts
  - Discuss everything that is relevant
- PIs and external collaborators need to be on board

Things to consider
- Cells are not replicates!
  - Proper analysis requires multiple samples from each condition
- Avoid confounding batches and conditions
  - How will the samples be multiplexed?
- What are your controls?
- How rare are the cells you are interested in?
- Are you using the right assay?
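
The point about confounding batches and conditions can be checked on a planned sample sheet before any cells are captured. The following is a minimal sketch, not a substitute for a proper design review; the field names `batch` and `condition` and the example samples are hypothetical.

```python
from collections import defaultdict

def confounded_batches(samples):
    """Return batches that contain only a single condition.

    `samples` is a list of dicts with hypothetical keys "batch" and
    "condition". A batch holding only one condition cannot separate
    the batch effect from the condition effect.
    """
    conditions_per_batch = defaultdict(set)
    for sample in samples:
        conditions_per_batch[sample["batch"]].add(sample["condition"])
    return sorted(b for b, conds in conditions_per_batch.items() if len(conds) < 2)

# Hypothetical design: batch "B1" only contains controls, so it is confounded.
design = [
    {"sample": "S1", "batch": "B1", "condition": "control"},
    {"sample": "S2", "batch": "B1", "condition": "control"},
    {"sample": "S3", "batch": "B2", "condition": "control"},
    {"sample": "S4", "batch": "B2", "condition": "treated"},
]
print(confounded_batches(design))
```

An empty result only means every batch contains at least two conditions; it does not show that the design is balanced.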

Example designs
- Exploratory
- Case/control
- Multiple conditions
- Time series
- Cohort study
- Many others…

How long will it take?
- Experiments take time, so does analysis
  - Often getting results takes longer than generating data
- Simpler experiments with clearer questions are quicker and easier to analyse
- You will likely be competing with other projects, so good relationships are key!

Make a plan
- What is the question?
- What is the design?
- Who is involved?
- What is everyone's role (authorship)?
- What if somebody leaves?
- What is the timeline?
- How is it funded?
- Write it down!

Tips for good collaborations
- Involve everyone in the process
  - Give everyone ownership over the project
- Good, clear communication
  - Keep everyone in the loop
- Share all the (relevant) data
  - If you did FACS, share the measurements
- Keep good records
  - Complete, consistent, machine-readable metadata

3. Standard analysis

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12      10      9       0
B     0       0       1       4
C     9       6       0       0
D     7       0       4       0

?

Alignment and quantification
1. Align to reference genome
2. Compare to gene annotation
3. Deduplication
4. Quantification

Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12      10      9       0
B     0       0       1       4
C     9       6       0       0
D     7       0       4       0
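
The deduplication and quantification steps can be sketched in a few lines. This is an illustration under simplifying assumptions rather than how any particular pipeline works: reads are assumed to be already aligned and assigned to a gene, and the barcodes, UMIs and the `count_umis` helper are hypothetical. Real tools typically also handle sequencing errors in barcodes and UMIs.

```python
from collections import defaultdict

def count_umis(reads):
    """Collapse reads into a gene-by-cell UMI count table.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples. Reads
    sharing the same barcode, UMI and gene are treated as PCR duplicates
    of one original molecule.
    """
    molecules = set(reads)  # deduplication: keep one copy per molecule
    counts = defaultdict(lambda: defaultdict(int))
    for cell, _umi, gene in molecules:
        counts[gene][cell] += 1  # quantification: one count per molecule
    return {gene: dict(cells) for gene, cells in counts.items()}

reads = [
    ("CELL1", "AAGT", "A"),
    ("CELL1", "AAGT", "A"),  # PCR duplicate of the read above
    ("CELL1", "CCTG", "A"),
    ("CELL2", "AAGT", "A"),  # same UMI, different cell: a separate molecule
    ("CELL2", "GGCA", "B"),
]
table = count_umis(reads)
print(table["A"]["CELL1"])  # two distinct molecules of gene A in CELL1
```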

Over 1300 scRNA-seq tools www.scRNA-tools.org

Ecosystems scverse

Which ecosystem?
- They all have strengths and weaknesses
- Possible to convert between them
- Use whatever is best for the task
- For simple tasks use whichever is easiest

Which tool?
- Independent benchmarks are the best measure of performance
- Try commonly used tools first
- Look for good documentation/maintenance
- Prefer tools that can be installed from major repositories
- Read more than just the introductory tutorial
  - Paper, package documentation

Quality control
- Not every droplet contains a cell
- Not every cell is in good condition
- Not every cell is informative
- Not every cell is a single cell
- Sometimes whole samples can be low-quality

Quality control
- Cell selection
- Cell filtering
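
Cell filtering is commonly based on a few per-cell metrics. The sketch below assumes three of them (total counts, genes detected, fraction of mitochondrial counts) and uses thresholds that are purely illustrative; appropriate values depend on the dataset and should be chosen by looking at the distributions. Mitochondrial genes are identified here by the human `MT-` gene symbol prefix.

```python
def qc_metrics(cell_counts):
    """Compute simple per-cell QC metrics from a {gene: count} dict."""
    total = sum(cell_counts.values())
    detected = sum(1 for c in cell_counts.values() if c > 0)
    mito = sum(c for g, c in cell_counts.items() if g.startswith("MT-"))
    return {
        "total_counts": total,
        "genes_detected": detected,
        "mito_fraction": mito / total if total else 0.0,
    }

def filter_cells(cells, min_counts=5, min_genes=2, max_mito=0.2):
    """Keep barcodes passing all thresholds (values are illustrative only)."""
    kept = []
    for barcode, counts in cells.items():
        m = qc_metrics(counts)
        if (m["total_counts"] >= min_counts
                and m["genes_detected"] >= min_genes
                and m["mito_fraction"] <= max_mito):
            kept.append(barcode)
    return kept

# Hypothetical barcodes: one healthy cell, one near-empty droplet,
# one cell dominated by mitochondrial counts.
cells = {
    "good_cell": {"A": 12, "C": 9, "D": 7, "MT-CO1": 2},
    "empty_droplet": {"A": 1, "MT-CO1": 0},
    "damaged_cell": {"A": 3, "B": 1, "MT-CO1": 16},
}
print(filter_cells(cells))
```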

Normalisation
- Correct for technical differences between cells (number of counts)
- Most commonly used is simple (log) depth normalisation
- scran can compute more sophisticated size factors
- Seurat provides a regression-based method called sctransform
- Other options…
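
Simple (log) depth normalisation can be written down directly: scale each cell so its counts sum to a common target, then take log(1 + x). A minimal sketch using two cells from the example count table; the `target_sum` of 10,000 is an assumed, commonly used value rather than something stated in the slides.

```python
import math

def log_depth_normalise(counts, target_sum=10_000):
    """Scale one cell's counts to `target_sum` and apply log(1 + x).

    `counts` is a {gene: count} dict for a single cell.
    """
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    scale = target_sum / total
    return {gene: math.log1p(c * scale) for gene, c in counts.items()}

cell_1 = {"A": 12, "B": 0, "C": 9, "D": 7}
cell_2 = {"A": 10, "B": 0, "C": 6, "D": 0}

for name, cell in [("Cell 1", cell_1), ("Cell 2", cell_2)]:
    normalised = log_depth_normalise(cell)
    print(name, {gene: round(value, 2) for gene, value in normalised.items()})
```

After this step a cell sequenced twice as deeply, with the same relative expression, gets the same normalised values.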

Integration
- Remove technical effects between batches*

*Deciding what a “batch” is can be difficult

Integration
- Top performer in benchmarks
- Well-documented, maintained, easy-to-use package
- Able to map new samples
- Models for different modalities

*Personal opinion, other packages can also produce good results

Clustering
- Group cells based on similar expression profiles
- Graph-based algorithms are most common
- Selecting a clustering resolution is difficult
- Sub-clustering often required
- No clustering is perfect

Visualisation
- 2D embeddings are the most common visualisation
  - t-SNE, UMAP etc.
- Can be useful BUT:
  - Easy to overinterpret
  - Hides lots of complexity
  - Potentially misleading

Marker genes
- Genes that are specifically expressed in a cluster
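
As a rough illustration of "specifically expressed", genes can be ranked by the log fold change of their mean expression inside a cluster versus all other cells. This sketch is an assumption-laden simplification: real marker detection usually relies on statistical tests, and the cluster labels `x` and `y`, the pseudocount and the `marker_scores` helper are hypothetical. The counts come from the example table.

```python
import math

def marker_scores(expression, clusters, target, pseudocount=1.0):
    """Rank genes by log2 fold change of mean expression in `target`
    versus all other cells.

    `expression` maps gene -> {cell: value}; `clusters` maps cell -> label.
    Assumes both the target cluster and the remaining cells are non-empty.
    """
    inside = [c for c, label in clusters.items() if label == target]
    outside = [c for c, label in clusters.items() if label != target]
    scores = {}
    for gene, values in expression.items():
        mean_in = sum(values.get(c, 0) for c in inside) / len(inside)
        mean_out = sum(values.get(c, 0) for c in outside) / len(outside)
        scores[gene] = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

expression = {
    "A": {"Cell 1": 12, "Cell 2": 10, "Cell 3": 9, "Cell 4": 0},
    "B": {"Cell 1": 0, "Cell 2": 0, "Cell 3": 1, "Cell 4": 4},
    "C": {"Cell 1": 9, "Cell 2": 6, "Cell 3": 0, "Cell 4": 0},
    "D": {"Cell 1": 7, "Cell 2": 0, "Cell 3": 4, "Cell 4": 0},
}
clusters = {"Cell 1": "x", "Cell 2": "x", "Cell 3": "y", "Cell 4": "y"}

ranked = marker_scores(expression, clusters, target="x")
print(ranked[0][0])  # gene with the highest fold change in cluster x
```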

Annotation
- Maybe the most difficult part of the process
- Usually relies on interpreting marker genes (and iteratively clustering)
- Prior knowledge can help:
  - Automatic classification
  - Label transfer
  - Gene sets (maybe)

Explore the data
- Always look at the output of each step
  - Make sure you understand what it has done
  - Every method will produce an output, but that doesn't mean it makes any sense
- Make lots of plots!
  - Use these to make decisions

4. Advanced analysis

Differential expression
- Differences in expression between conditions
- Multiple benchmarks show that “pseudobulk” analysis performs best
  - Models sample level variation
  - Arbitrarily complex models
  - Benefits from 10+ years of development
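
The aggregation behind a pseudobulk analysis is simply a sum of counts over all cells from the same sample and cell type; the resulting table can then be analysed with methods developed for bulk data. A minimal sketch of that aggregation step only, with hypothetical sample names, cell types and dictionary keys:

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum counts over all cells sharing a (sample, cell type) pair.

    `cells` is a list of dicts with hypothetical keys "sample",
    "cell_type" and "counts" (a {gene: count} dict).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample"], cell["cell_type"])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}

cells = [
    {"sample": "ctrl_1", "cell_type": "T cell", "counts": {"A": 12, "C": 9}},
    {"sample": "ctrl_1", "cell_type": "T cell", "counts": {"A": 10, "C": 6}},
    {"sample": "ctrl_1", "cell_type": "B cell", "counts": {"B": 4}},
    {"sample": "treat_1", "cell_type": "T cell", "counts": {"A": 9, "B": 1}},
]
bulk = pseudobulk(cells)
print(bulk[("ctrl_1", "T cell")])
```

Because each row now represents a sample rather than a cell, the downstream test sees variation between samples, which is why multiple samples per condition are needed.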

Differential abundance
- Differences in cell type proportions between conditions

[Figure: Condition 1 vs Condition 2]
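
The quantity being compared is the proportion of each cell type within each sample. A minimal sketch of computing those proportions, with hypothetical sample names and labels; deciding whether a difference is meaningful needs a statistical model across replicate samples, which is not shown here.

```python
from collections import Counter, defaultdict

def cell_type_proportions(cells):
    """Return {sample: {cell_type: proportion}}.

    `cells` is an iterable of (sample, cell_type) pairs, one per cell.
    """
    counts = defaultdict(Counter)
    for sample, cell_type in cells:
        counts[sample][cell_type] += 1
    return {
        sample: {t: n / sum(c.values()) for t, n in c.items()}
        for sample, c in counts.items()
    }

cells = [
    ("ctrl_1", "T cell"), ("ctrl_1", "T cell"),
    ("ctrl_1", "B cell"), ("ctrl_1", "B cell"),
    ("treat_1", "T cell"),
    ("treat_1", "B cell"), ("treat_1", "B cell"), ("treat_1", "B cell"),
]
proportions = cell_type_proportions(cells)
print(proportions["ctrl_1"])
print(proportions["treat_1"])
```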

Trajectories
- Analysis of continuous processes
- Pseudotime
- RNA velocity

Multimodal analysis
- Analysis of multiple different measurements
- Can provide more context and insight…
- …but methods are still developing
- Depends on the modalities and the question
- Unclear whether combined modelling is useful or it's better to analyse each modality and combine the results

Questions?

Resources
- Current best practices in single-cell RNA-seq analysis: a tutorial, Malte Lücken, Fabian Theis, DOI: 10.15252/msb.20188746
- Extended best practices, Theis Lab (and the community): https://github.com/theislab/extended-single-cell-best-practices
- Orchestrating Single-Cell Analysis with Bioconductor: https://bioconductor.org/books/release/OSCA/
- Seurat documentation: https://satijalab.org/seurat/
- Scanpy documentation: https://scanpy.readthedocs.io/en/stable/
- scverse community: https://scverse.org/
- scRNA-tools: https://scRNA-tools.org/
- Open Problems in Single-Cell Analysis: https://openproblems.bio/

Acknowledgements
- Theis lab
- Twitter
- Everyone who has written documentation, tutorials etc.
- Everyone who has developed tools and made their code available
