Successful scRNA-seq analysis

Slide 1

Slide 1 text

Successful scRNA-seq analysis ILC Summer School 2022

Slide 2

Slide 2 text

Luke Zappia
- Postdoctoral researcher (Theis Lab, Helmholtz Munich)
- Chemistry, Informatics, Bioinformatics
- scRNA-seq
  - Methods development
  - Software development
  - Benchmarking
  - Data analysis
- @_lazappi_ @lazappi lazappi.id.au

Slide 3

Slide 3 text

Theis Lab
- Apply machine learning to biological data
- scRNA-seq
  - Integration and perturbations
  - Modelling of transitions
  - Multimodal analysis
- @fabian_theis @ICBmunich www.comp.bio

Slide 4

Slide 4 text

1. What is scRNA-seq?
2. Designing an scRNA-seq experiment
3. Standard scRNA-seq analysis
4. Advanced analysis topics

Slide 5

Slide 5 text

1. What is scRNA-seq?

Slide 6

Slide 6 text

single-cell RNA sequencing

Slide 7

Slide 7 text

Why single-cell?

Slide 8

Slide 8 text

Single-cell capture
- Droplet-based
  - More cells
  - Easier
  - UMI
- Plate/well-based
  - Fewer cells
  - Custom setup
  - Full length, higher depth
  - More flexible

Slide 9

Slide 9 text

mccarrolllab.com/dropseq/ Macosko et al. DOI: 10.1016/j.cell.2015.05.002

Slide 10

Slide 10 text

UMI vs full-length
5’ AAAA (PCR){BARCODE}[UMI]TTTT
- Unique Molecular Identifiers
  - Better quantification
  - Less sequencing
  - No gene-length bias
- Full-length
  - Full coverage
  - More sequencing
  - Affected by gene length
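
The read layout on this slide ({BARCODE}[UMI] followed by poly-T) can be illustrated with a minimal sketch that splits a read into its cell barcode and UMI. The function name `split_barcode_read` is hypothetical, and the default lengths (16 bp barcode, 12 bp UMI) are an assumption that matches some droplet chemistries but not all, so they would need to be checked against the protocol actually used.

```python
def split_barcode_read(sequence, barcode_length=16, umi_length=12):
    """Split a barcode read into (cell barcode, UMI).

    The lengths are assumptions for illustration; they differ between
    protocols and chemistry versions.
    """
    needed = barcode_length + umi_length
    if len(sequence) < needed:
        raise ValueError(f"read is shorter than {needed} bases")
    barcode = sequence[:barcode_length]
    umi = sequence[barcode_length:needed]
    return barcode, umi


# Made-up read: 16 bp barcode + 12 bp UMI + start of the poly-T stretch
read = "ACGTACGTACGTACGT" + "TTGCATTGCATT" + "TTTTTTTT"
barcode, umi = split_barcode_read(read)
```

Reads that share a barcode come from the same cell (or droplet), and reads that also share a UMI and a gene are treated as copies of one original molecule.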

Slide 11

Slide 11 text

Extensions
- Protein expression (CITE-seq, feature barcoding)
- Chromatin accessibility (scATAC-seq, 10x Multiome)
- Spatial location (10x Visium, MERFISH)
- Immune receptors (TCR/BCR profiling)
- Methylation, CRISPR screens, electrophysiology, ...
- Pre-sorting (FACS to enrich target cells)

Slide 12

Slide 12 text

CITE-seq
- Simultaneous measurement of RNA and protein expression
  - Protein ≠ RNA
- Uses nucleotide-tagged antibodies
- Targets need to be carefully selected
- Particularly useful for PBMCs

Slide 13

Slide 13 text

Multiplexing
- Genetic multiplexing
  - Easier but requires genetic diversity and reference panels
- Cell hashing
  - More complex but can be more flexible
- More samples, fewer batch effects
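
As a rough illustration of what demultiplexing hashed cells involves, the sketch below assigns each cell to its dominant hashtag. Everything here is an assumption for illustration: the function name `assign_hashtag`, the hashtag labels, and the fixed thresholds. Real demultiplexing tools typically model the background signal of each hashtag rather than using fixed cut-offs.

```python
def assign_hashtag(hashtag_counts, min_count=10, min_ratio=3.0):
    """Assign a cell to a sample from its hashtag counts.

    Returns the dominant hashtag, "negative" when no hashtag has enough
    counts, or "doublet" when the two strongest hashtags are too similar.
    The thresholds are illustrative assumptions, not recommendations.
    """
    if len(hashtag_counts) < 2:
        raise ValueError("need counts for at least two hashtags")
    ranked = sorted(hashtag_counts.items(), key=lambda item: item[1], reverse=True)
    top_tag, top_count = ranked[0]
    second_count = ranked[1][1]
    if top_count < min_count:
        return "negative"
    if second_count > 0 and top_count / second_count < min_ratio:
        return "doublet"
    return top_tag


# Made-up hashtag counts for three cells
hashed_cells = {
    "cell1": {"HTO1": 120, "HTO2": 4},
    "cell2": {"HTO1": 80, "HTO2": 70},
    "cell3": {"HTO1": 2, "HTO2": 1},
}
calls = {cell: assign_hashtag(counts) for cell, counts in hashed_cells.items()}
```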

Slide 14

Slide 14 text

Comparison to bulk
- Gives insight into cellular variability
- Avoids the composition problem
- Much more complex analysis
- Much noisier
- Much sparser
  - But UMI data isn’t zero inflated!

Slide 15

Slide 15 text

2. Experimental design

Slide 16

Slide 16 text

Who should be involved?
- Experimentalists
- Bioinformaticians
- PIs
- Collaborators

Slide 17

Slide 17 text

What is the question?
- What do you want to answer with this experiment?
  - Not necessarily a hypothesis
- Experimentalists should have a clear idea that is refined with input from analysts
  - Discuss everything that is relevant
- PIs and external collaborators need to be on board

Slide 18

Slide 18 text

Things to consider
- Cells are not replicates!
  - Proper analysis requires multiple samples from each condition
- Avoid confounding batches and conditions
  - How will the samples be multiplexed?
- What are your controls?
- How rare are the cells you are interested in?
- Are you using the right assay?
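
Two of the points above (replicate samples per condition, and batches confounded with conditions) can be checked from a sample sheet before any cells are captured. The sketch below is only an illustration: the function name `design_summary` and the metadata fields (`sample`, `batch`, `condition`) are assumptions, and it only detects the fully confounded case where every batch contains a single condition.

```python
from collections import defaultdict


def design_summary(samples):
    """Return (fully_confounded, replicates_per_condition) for a sample sheet."""
    conditions_per_batch = defaultdict(set)
    samples_per_condition = defaultdict(set)
    for entry in samples:
        conditions_per_batch[entry["batch"]].add(entry["condition"])
        samples_per_condition[entry["condition"]].add(entry["sample"])
    # Fully confounded: more than one condition, but no batch mixes conditions
    confounded = len(samples_per_condition) > 1 and all(
        len(conditions) == 1 for conditions in conditions_per_batch.values()
    )
    replicates = {cond: len(ids) for cond, ids in samples_per_condition.items()}
    return confounded, replicates


# Made-up designs: the first processes each condition in its own batch
confounded_design = [
    {"sample": "s1", "batch": "b1", "condition": "control"},
    {"sample": "s2", "batch": "b1", "condition": "control"},
    {"sample": "s3", "batch": "b2", "condition": "treated"},
    {"sample": "s4", "batch": "b2", "condition": "treated"},
]
balanced_design = [
    {"sample": "s1", "batch": "b1", "condition": "control"},
    {"sample": "s2", "batch": "b1", "condition": "treated"},
    {"sample": "s3", "batch": "b2", "condition": "control"},
    {"sample": "s4", "batch": "b2", "condition": "treated"},
]
confounded_result = design_summary(confounded_design)
balanced_result = design_summary(balanced_design)
```

In the first design any difference between conditions is indistinguishable from a difference between batches; in the second, each batch contains both conditions.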

Slide 19

Slide 19 text

Example designs
- Exploratory
- Case/control
- Multiple conditions
- Time series
- Cohort study
- Many others…

Slide 20

Slide 20 text

How long will it take?
- Experiments take time, so does analysis
  - Often getting results takes longer than generating data
- Simpler experiments with clearer questions are quicker and easier to analyse
- You will likely be competing with other projects, so good relationships are key!

Slide 21

Slide 21 text

Make a plan
- What is the question?
- What is the design?
- Who is involved?
- What is everyone’s role (authorship)?
- What if somebody leaves?
- What is the timeline?
- How is it funded?
- Write it down!

Slide 22

Slide 22 text

Tips for good collaborations
- Involve everyone in the process
  - Give everyone ownership over the project
- Good, clear communication
  - Keep everyone in the loop
- Share all the (relevant) data
  - If you did FACS, share the measurements
- Keep good records
  - Complete, consistent, machine-readable metadata

Slide 23

Slide 23 text

3. Standard analysis

Slide 24

Slide 24 text

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12      10      9       0
B     0       0       1       4
C     9       6       0       0
D     7       0       4       0

?

Slide 25

Slide 25 text

Alignment and quantification
1. Align to reference genome
2. Compare to gene annotation
3. Deduplication
4. Quantification

Gene  Cell 1  Cell 2  Cell 3  Cell 4
A     12      10      9       0
B     0       0       1       4
C     9       6       0       0
D     7       0       4       0
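
Steps 2 to 4 above can be sketched on a handful of already-aligned reads. This is a simplified illustration rather than how any particular pipeline works: the function names, the annotation format (gene mapped to chromosome, start, end) and the example coordinates are all assumptions, strand and splicing are ignored, and UMIs are compared by exact match although real tools usually also correct sequencing errors in UMIs.

```python
from collections import defaultdict


def assign_gene(chrom, position, annotation):
    """Step 2: return the gene whose interval contains the position, if any."""
    for gene, (gene_chrom, start, end) in annotation.items():
        if chrom == gene_chrom and start <= position <= end:
            return gene
    return None


def quantify(alignments, annotation):
    """Steps 3 and 4: count unique UMIs per (gene, cell)."""
    molecules = defaultdict(set)
    for cell, umi, chrom, position in alignments:
        gene = assign_gene(chrom, position, annotation)
        if gene is not None:
            # A set keeps each UMI once, which removes PCR duplicates
            molecules[(gene, cell)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}


# Made-up annotation and alignments: (cell, UMI, chromosome, position)
annotation = {"A": ("chr1", 100, 500), "B": ("chr1", 900, 1500)}
alignments = [
    ("cell1", "AAT", "chr1", 120),
    ("cell1", "AAT", "chr1", 130),   # same cell, gene and UMI: a duplicate
    ("cell1", "CGG", "chr1", 300),
    ("cell2", "AAT", "chr1", 1000),
    ("cell2", "TTA", "chr1", 700),   # outside any gene: not counted
]
counts = quantify(alignments, annotation)
```

The resulting dictionary holds the non-zero entries of a genes-by-cells count matrix like the one on the slide.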

Slide 26

Slide 26 text

Over 1300 scRNA-seq tools www.scRNA-tools.org

Slide 27

Slide 27 text

Ecosystems scverse

Slide 28

Slide 28 text

Which ecosystem?
- They all have strengths and weaknesses
- Possible to convert between them
- Use whatever is best for the task
- For simple tasks use whichever is easiest

Slide 29

Slide 29 text

Which tool?
- Independent benchmarks are the best measure of performance
- Try commonly used tools first
- Look for good documentation/maintenance
- Prefer tools that can be installed from major repositories
- Read more than just the introductory tutorial
  - Paper, package documentation

Slide 30

Slide 30 text

Quality control
- Not every droplet contains a cell
- Not every cell is in good condition
- Not every cell is informative
- Not every cell is a single cell
- Sometimes whole samples can be low-quality

Slide 31

Slide 31 text

Quality control
- Cell selection
- Cell filtering
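
Cell filtering is usually based on a few per-cell summary statistics. The sketch below computes three common ones and applies fixed cut-offs. The function names and thresholds are assumptions chosen to suit the tiny example, not recommendations, and the "MT-" prefix for mitochondrial genes assumes human gene symbols.

```python
def qc_metrics(cell_counts):
    """Per-cell QC metrics from a {gene: count} dictionary."""
    total = sum(cell_counts.values())
    detected = sum(1 for count in cell_counts.values() if count > 0)
    mito = sum(c for gene, c in cell_counts.items() if gene.startswith("MT-"))
    return {
        "total_counts": total,
        "genes_detected": detected,
        "mito_fraction": mito / total if total else 0.0,
    }


def passes_qc(metrics, min_counts, min_genes, max_mito_fraction):
    """Keep cells with enough counts and genes and a low mitochondrial fraction."""
    return (
        metrics["total_counts"] >= min_counts
        and metrics["genes_detected"] >= min_genes
        and metrics["mito_fraction"] <= max_mito_fraction
    )


# Made-up counts; thresholds are deliberately tiny to match the toy data
raw_cells = {
    "cell1": {"GeneA": 12, "GeneB": 3, "MT-CO1": 1},
    "cell2": {"GeneA": 1, "GeneB": 0, "MT-CO1": 9},   # mostly mitochondrial
    "cell3": {"GeneA": 0, "GeneB": 0, "MT-CO1": 0},   # empty droplet
}
metrics = {cell: qc_metrics(counts) for cell, counts in raw_cells.items()}
kept = [
    cell for cell, m in metrics.items()
    if passes_qc(m, min_counts=5, min_genes=2, max_mito_fraction=0.2)
]
```

In practice the cut-offs are chosen by plotting the distribution of each metric per sample rather than fixed in advance.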

Slide 32

Slide 32 text

Normalisation
- Correct for technical differences between cells (number of counts)
- Most commonly used is simple (log) depth normalisation
- scran can compute more sophisticated size factors
- Seurat provides a regression-based method called sctransform
- Other options…
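
Simple (log) depth normalisation can be written in a few lines: scale each cell to a common total, then apply log(1 + x). This is a minimal sketch; the function name `log_normalise` and the default target of 10,000 counts are assumptions (a common convention, but not the only one).

```python
import math


def log_normalise(cell_counts, target_sum=10_000):
    """Scale a cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {
        gene: math.log1p(count / total * target_sum)
        for gene, count in cell_counts.items()
    }


# Two made-up cells with the same composition but a 10-fold depth difference
shallow = log_normalise({"GeneA": 1, "GeneB": 3}, target_sum=4)
deep = log_normalise({"GeneA": 10, "GeneB": 30}, target_sum=4)
```

After normalisation the two cells have the same values, which is the point: the difference in sequencing depth has been removed while the relative expression is kept.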

Slide 33

Slide 33 text

Integration
- Remove technical effects between batches*

*Deciding what a “batch” is can be difficult

Slide 34

Slide 34 text

Integration
- Top performer in benchmarks
- Well-documented, maintained, easy-to-use package
- Able to map new samples
- Models for different modalities

*Personal opinion, other packages can also produce good results

Slide 35

Slide 35 text

Clustering
- Group cells based on similar expression profiles
- Graph-based algorithms are most common
- Selecting a clustering resolution is difficult
- Sub-clustering often required
- No clustering is perfect
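
Graph-based clustering generally starts by connecting each cell to its nearest neighbours in a low-dimensional space, then runs a community detection algorithm (for example Louvain or Leiden) on that graph. The sketch below covers only the first step, with a brute-force search that is fine for a toy example but far too slow for real datasets. The function name `knn_graph` and the coordinates are assumptions for illustration.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbours."""
    graph = {}
    for i, point in enumerate(points):
        distances = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(points)
            if j != i
        )
        graph[i] = [j for _, j in distances[:k]]
    return graph


# Made-up 2D coordinates (e.g. two principal components) for four cells
embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
neighbours = knn_graph(embedding, k=1)
```

Here cells 0 and 1 point at each other, as do cells 2 and 3, so a community detection step would be expected to separate them into two groups. The number of neighbours and the resolution of the community detection both change the result, which is part of why choosing a clustering is difficult.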

Slide 36

Slide 36 text

Visualisation
- 2D embeddings are the most common visualisation
  - t-SNE, UMAP etc.
- Can be useful BUT:
  - Easy to overinterpret
  - Hides lots of complexity
  - Potentially misleading

Slide 37

Slide 37 text

Marker genes
- Genes that are specifically expressed in a cluster
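
One simple way to rank candidate marker genes is to compare mean expression inside a cluster with mean expression in all other cells. The sketch below does only that: on log-normalised values the difference in means behaves roughly like a log fold change. The function name `rank_markers` and the data are assumptions for illustration, and marker detection in practice normally adds a statistical test on top of an effect size like this one.

```python
def rank_markers(expression, clusters, target):
    """Rank genes by mean expression in the target cluster minus the rest.

    expression: {gene: [value per cell]}, assumed to be log-normalised
    clusters:   cluster label for each cell, in the same order
    """
    if target not in clusters or all(label == target for label in clusters):
        raise ValueError("need cells both inside and outside the target cluster")
    scores = {}
    for gene, values in expression.items():
        inside = [v for v, label in zip(values, clusters) if label == target]
        outside = [v for v, label in zip(values, clusters) if label != target]
        scores[gene] = sum(inside) / len(inside) - sum(outside) / len(outside)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up log-normalised expression for four cells in two clusters
cluster_labels = ["x", "x", "y", "y"]
log_expression = {
    "GeneA": [2.0, 2.2, 0.1, 0.0],   # high in cluster x only
    "GeneB": [0.0, 0.1, 1.9, 2.1],   # high in cluster y only
    "GeneC": [1.0, 1.0, 1.0, 1.0],   # expressed everywhere
}
ranked = rank_markers(log_expression, cluster_labels, target="x")
```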

Slide 38

Slide 38 text

Annotation
- Maybe the most difficult part of the process
- Usually relies on interpreting marker genes (and iteratively clustering)
- Prior knowledge can help:
  - Automatic classification
  - Label transfer
  - Gene sets (maybe)

Slide 39

Slide 39 text

Explore the data
- Always look at the output of each step
  - Make sure you understand what it has done
  - Every method will produce an output, but that doesn’t mean it makes any sense
- Make lots of plots!
  - Use these to make decisions

Slide 40

Slide 40 text

4. Advanced analysis

Slide 41

Slide 41 text

Differential expression
- Differences in expression between conditions
- Multiple benchmarks show that “pseudobulk” analysis performs best
- Models sample-level variation
- Arbitrarily complex models
- Benefit from 10+ years of development
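
The aggregation step behind a pseudobulk analysis is simple: sum the raw counts of all cells that share a sample and a cell type, so each sample contributes one profile per cell type. The sketch below shows only this step; the function name `pseudobulk` and the record fields are assumptions for illustration. The aggregated profiles would then be passed to a method designed for bulk RNA-seq, with samples (not cells) as replicates.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts per (sample, cell type)."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample"], cell["cell_type"])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


# Made-up cells from two samples
annotated_cells = [
    {"sample": "s1", "cell_type": "T cell", "counts": {"GeneA": 3, "GeneB": 0}},
    {"sample": "s1", "cell_type": "T cell", "counts": {"GeneA": 2, "GeneB": 1}},
    {"sample": "s1", "cell_type": "B cell", "counts": {"GeneA": 0, "GeneB": 4}},
    {"sample": "s2", "cell_type": "T cell", "counts": {"GeneA": 5, "GeneB": 1}},
]
profiles = pseudobulk(annotated_cells)
```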

Slide 42

Slide 42 text

Differential abundance
- Differences in cell type proportions between conditions
- Condition 1 vs Condition 2
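
The quantity being compared here is the proportion of each cell type within each sample, which can be sketched as below. The function name `cell_type_proportions` and the data are assumptions for illustration. Computing proportions is only the starting point: because they sum to one within a sample, an increase in one cell type forces a decrease elsewhere, and a comparison between conditions still needs multiple samples per condition.

```python
from collections import Counter, defaultdict


def cell_type_proportions(cells):
    """Return {sample: {cell type: proportion}} from (sample, cell type) pairs."""
    counts = defaultdict(Counter)
    for sample, cell_type in cells:
        counts[sample][cell_type] += 1
    return {
        sample: {
            cell_type: n / sum(type_counts.values())
            for cell_type, n in type_counts.items()
        }
        for sample, type_counts in counts.items()
    }


# Made-up annotations: one sample per condition, four cells each
labelled_cells = [
    ("s1", "T cell"), ("s1", "T cell"), ("s1", "T cell"), ("s1", "B cell"),
    ("s2", "T cell"), ("s2", "B cell"), ("s2", "B cell"), ("s2", "B cell"),
]
proportions = cell_type_proportions(labelled_cells)
```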

Slide 43

Slide 43 text

Trajectories
- Analysis of continuous processes
- Pseudotime
- RNA velocity

Slide 44

Slide 44 text

Multimodal analysis
- Analysis of multiple different measurements
- Can provide more context and insight…
- …but methods are still developing
- Depends on the modalities and the question
- Unclear whether combined modelling is useful or it’s better to analyse each modality and combine the results

Slide 45

Slide 45 text

Questions?

Slide 46

Slide 46 text

Resources
- Current best practices in single-cell RNA-seq analysis: a tutorial
  Malte Lücken, Fabian Theis
  DOI: 10.15252/msb.20188746
- Extended best practices - Theis Lab (and the community)
  https://github.com/theislab/extended-single-cell-best-practices
- Orchestrating Single-Cell Analysis with Bioconductor
  https://bioconductor.org/books/release/OSCA/
- Seurat documentation
  https://satijalab.org/seurat/
- Scanpy documentation
  https://scanpy.readthedocs.io/en/stable/
- scverse community
  https://scverse.org/
- scRNA-tools
  https://scRNA-tools.org/
- Open Problems in Single-Cell Analysis
  https://openproblems.bio/

Slide 47

Slide 47 text

Acknowledgements
- Theis lab
- Twitter
- Everyone who has written documentation, tutorials etc.
- Everyone who has developed tools and made their code available