Successful scRNA-seq analysis

A brief introduction to how single-cell technologies work, how to plan a successful experiment (from an analyst's point of view), the steps in a standard scRNA-seq analysis and touching on some more advanced topics. Presented at the ILC Summer School 2022.

Luke Zappia

June 30, 2022

Transcript

  1. Successful scRNA-seq analysis
    ILC Summer School 2022

  2. Luke Zappia
    Postdoctoral researcher (Theis Lab, Helmholtz Munich)
    Chemistry, Informatics, Bioinformatics
    scRNA-seq
    - Methods development
    - Software development
    - Benchmarking
    - Data analysis
    @_lazappi_
    @lazappi
    lazappi.id.au

  3. Apply machine learning to biological data
    scRNA-seq
    - Integration and perturbations
    - Modelling of transitions
    - Multimodal analysis
    Theis Lab
    @fabian_theis
    @ICBmunich
    www.comp.bio

  4. 1. What is scRNA-seq?
    2. Designing an scRNA-seq experiment
    3. Standard scRNA-seq analysis
    4. Advanced analysis topics

  5. 1. What is scRNA-seq?

  6. single-cell RNA sequencing

  7. Why single-cell?

  8. Single-cell capture
    Droplet-based
    - More cells
    - Easier
    - UMI
    Plate/well-based
    - Fewer cells
    - Custom setup
    - Full length, higher depth
    - More flexible

  9. mccarrolllab.com/dropseq/
    Macosko et al. DOI: 10.1016/j.cell.2015.05.002

  10. UMI vs full-length
    Unique Molecular Identifiers
    5’ AAAA
    (PCR){BARCODE}[UMI]TTTT
    - Better quantification
    - Less sequencing
    - No gene-length bias
    Full-length
    - Full coverage
    - More sequencing
    - Affected by gene length

  11. Extensions
    Protein expression
    (CITE-seq, feature barcoding)
    Chromatin accessibility
    (scATAC-seq, 10x Multiome)
    Spatial location
    (10x Visium, MERFISH)
    Immune receptors
    (TCR/BCR profiling)
    Methylation, CRISPR screens,
    electrophysiology,...
    Pre-sorting
    (FACS to enrich target cells)

  12. CITE-seq
    Simultaneous measurement of RNA and
    protein expression
    - Protein ≠ RNA
    Uses nucleotide-tagged antibodies
    Targets need to be carefully selected
    Particularly useful for PBMCs

  13. Multiplexing
    Genetic multiplexing
    - Easier but requires genetic diversity and reference panels
    Cell hashing
    - More complex but can be more flexible
    More samples, fewer batch effects

  14. Comparison to bulk
    Gives insight into cellular variability
    Avoids the composition problem
    Much more complex analysis
    Much noisier
    Much sparser
    - But UMI data isn’t zero inflated!

  15. 2. Experimental design

  16. Who should be involved?
    Experimentalists
    Bioinformaticians
    PIs
    Collaborators

  17. What is the question?
    What do you want to answer with this experiment?
    - Not necessarily a hypothesis
    Experimentalists should have a clear idea that is
    refined with input from analysts
    - Discuss everything that is relevant
    PIs and external collaborators need to be on board

  18. Things to consider
    Cells are not replicates!
    - Proper analysis requires multiple samples from
    each condition
    Avoid confounding batches and conditions
    - How will the samples be multiplexed?
    What are your controls?
    How rare are the cells you are interested in?
    Are you using the right assay?
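
One way to catch confounding before any cells are captured is to check the sample sheet. The following is a minimal sketch, assuming a hypothetical sample sheet of (sample, batch, condition) tuples; the function name and the data are made up for illustration. It flags batches that contain only one condition, where batch and condition effects could not be separated.

```python
from collections import defaultdict


def confounded_batches(design):
    """Return batches that contain a single condition only."""
    conditions_by_batch = defaultdict(set)
    for _sample, batch, condition in design:
        conditions_by_batch[batch].add(condition)
    return sorted(
        batch
        for batch, conditions in conditions_by_batch.items()
        if len(conditions) < 2
    )


# Hypothetical design where each batch holds one condition (confounded)
confounded_design = [
    ("s1", "batch1", "control"),
    ("s2", "batch1", "control"),
    ("s3", "batch2", "treated"),
    ("s4", "batch2", "treated"),
]

# Hypothetical design where every batch holds both conditions
balanced_design = [
    ("s1", "batch1", "control"),
    ("s2", "batch1", "treated"),
    ("s3", "batch2", "control"),
    ("s4", "batch2", "treated"),
]

print(confounded_batches(confounded_design))
print(confounded_batches(balanced_design))
```

This check is deliberately simple: it says nothing about replication or about partially unbalanced designs, which still need to be discussed with the analysts.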

  19. Example designs
    - Exploratory
    - Case/control
    - Multiple conditions
    - Time series
    - Cohort study
    - Many others…

  20. How long will it take?
    Experiments take time, so does analysis
    - Often getting results takes longer than generating data
    Simpler experiments with clearer questions are quicker
    and easier to analyse
    You will likely be competing with other projects, so good
    relationships are key!

  21. Make a plan
    What is the question?
    What is the design?
    Who is involved?
    What is everyone’s role (authorship)?
    What if somebody leaves?
    What is the timeline?
    How is it funded?
    Write it down!

  22. Tips for good collaborations
    Involve everyone in the process
    - Give everyone ownership over the project
    Good, clear communication
    - Keep everyone in the loop
    Share all the (relevant) data
    - If you did FACS, share the measurements
    Keep good records
    - Complete, consistent, machine-readable metadata

  23. 3. Standard analysis

  24. @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    ?
    Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A     12      10      9       0
    B     0       0       1       4
    C     9       6       0       0
    D     7       0       4       0
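
The four lines at the top of this slide form a single FASTQ record: an identifier, the read sequence, a separator and one quality character per base. A minimal sketch of reading it, assuming the common Phred+33 quality encoding (the helper name `parse_fastq_record` is made up for illustration):

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into its parts."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so '!' (ASCII 33) means quality 0
    phred = [ord(char) - 33 for char in quality]
    return {"id": header[1:], "sequence": sequence, "phred": phred}


record = parse_fastq_record([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
])
print(record["id"], len(record["sequence"]), record["phred"][:4])
```

Getting from millions of such records to the count table is what the alignment and quantification steps do.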

  25. Alignment and quantification
    1. Align to reference genome
    2. Compare to gene annotation
    3. Deduplication
    4. Quantification
    Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A     12      10      9       0
    B     0       0       1       4
    C     9       6       0       0
    D     7       0       4       0
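
The deduplication and quantification steps can be sketched for UMI data as follows. This assumes reads have already been aligned and assigned to genes; the read tuples are hypothetical, and UMIs are matched exactly, whereas real pipelines typically also handle sequencing errors in barcodes and UMIs.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell barcode, gene, UMI)
reads = [
    ("cell1", "A", "AACG"),
    ("cell1", "A", "AACG"),  # PCR duplicate of the read above
    ("cell1", "A", "GGTC"),
    ("cell1", "C", "TTAG"),
    ("cell2", "A", "AACG"),  # same UMI in another cell: a separate molecule
    ("cell2", "B", "CATG"),
    ("cell2", "B", "CATG"),  # PCR duplicate
]

# Deduplication: keep each (cell, gene, UMI) combination once
molecules = set(reads)

# Quantification: count unique molecules per gene and cell
counts = defaultdict(lambda: defaultdict(int))
for cell, gene, _umi in molecules:
    counts[gene][cell] += 1

cells = sorted({cell for cell, _, _ in reads})
print("Gene", *cells)
for gene in sorted(counts):
    print(gene, *(counts[gene][cell] for cell in cells))
```

Seven reads collapse to five molecules here, which is why UMI counts are less affected by amplification than raw read counts.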

  26. Over 1300 scRNA-seq tools
    www.scRNA-tools.org

  27. Ecosystems
    scverse

  28. Which ecosystem?
    They all have strengths and weaknesses
    Possible to convert between them
    Use whatever is best for the task
    For simple tasks use whichever is easiest

  29. Which tool?
    Independent benchmarks are the best
    measure of performance
    Try commonly used tools first
    Look for good documentation/maintenance
    Prefer tools that can be installed from major
    repositories
    Read more than just the introductory tutorial
    - Paper, package documentation

  30. Quality control
    Not every droplet contains a cell
    Not every cell is in good condition
    Not every cell is informative
    Not every cell is a single cell
    Sometimes whole samples can be low-quality

  31. Quality control
    Cell selection
    Cell filtering
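
Cell filtering is often done with simple per-cell metrics. The sketch below assumes three commonly used ones (total counts, number of detected genes and the fraction of mitochondrial counts); the slides do not name specific metrics, and the data and thresholds are arbitrary values chosen for illustration only.

```python
# Hypothetical raw counts per cell; gene names starting with "MT-" are
# treated as mitochondrial (an assumption that fits human gene symbols)
cells = {
    "cell1": {"GeneA": 120, "GeneB": 80, "MT-CO1": 10},
    "cell2": {"GeneA": 3, "MT-CO1": 1},                 # very few counts
    "cell3": {"GeneA": 20, "GeneB": 10, "MT-CO1": 70},  # mostly mitochondrial
}

# Arbitrary illustrative thresholds, not recommendations
MIN_COUNTS = 50
MIN_GENES = 2
MAX_MITO_FRACTION = 0.2


def qc_metrics(counts):
    """Compute simple quality metrics for one cell."""
    total = sum(counts.values())
    n_genes = sum(1 for value in counts.values() if value > 0)
    mito = sum(v for gene, v in counts.items() if gene.startswith("MT-"))
    mito_fraction = mito / total if total else 0.0
    return {"total": total, "n_genes": n_genes, "mito_fraction": mito_fraction}


def passes_qc(metrics):
    """Apply all three thresholds to one cell's metrics."""
    return (
        metrics["total"] >= MIN_COUNTS
        and metrics["n_genes"] >= MIN_GENES
        and metrics["mito_fraction"] <= MAX_MITO_FRACTION
    )


kept = [name for name, counts in cells.items() if passes_qc(qc_metrics(counts))]
print(kept)
```

In practice thresholds should be chosen per dataset after plotting the metric distributions, and doublet detection needs dedicated methods that this sketch does not cover.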

  32. Normalisation
    Correct for technical differences between cells (number of
    counts)
    Most commonly used is simple (log) depth normalisation
    scran can compute more sophisticated size factors
    Seurat provides a regression-based method called sctransform
    Other options…
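
Simple (log) depth normalisation can be sketched in a few lines: scale each cell's counts to a common total, then apply log(1 + x). The counts below are Cell 1 and Cell 4 from the example table earlier in the deck; the target sum of 10,000 is an assumption (a common convention, not something the slides specify).

```python
import math

counts = {
    "Cell 1": {"A": 12, "B": 0, "C": 9, "D": 7},
    "Cell 4": {"A": 0, "B": 4, "C": 0, "D": 0},
}

TARGET_SUM = 10_000  # assumption: a commonly used scaling target


def log_depth_normalise(cell_counts, target_sum=TARGET_SUM):
    """Scale one cell to a common depth, then log1p-transform."""
    total = sum(cell_counts.values())
    if total == 0:
        # An empty cell has nothing to scale; return zeros
        return {gene: 0.0 for gene in cell_counts}
    return {
        gene: math.log1p(value / total * target_sum)
        for gene, value in cell_counts.items()
    }


normalised = {cell: log_depth_normalise(c) for cell, c in counts.items()}
for cell, values in normalised.items():
    print(cell, {gene: round(value, 2) for gene, value in values.items()})
```

Zeros stay zero under this transform, and cells with very different depths (28 and 4 counts here) end up on a comparable scale.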

  33. Integration
    Remove technical effects between batches*
    *Deciding what a “batch” is can be difficult

  34. Integration
    Top performer in benchmarks
    Well-documented, maintained, easy-to-use package
    Able to map new samples
    Models for different modalities
    *Personal opinion, other packages can also produce good results

  35. Clustering
    Group cells based on similar expression profiles
    Graph-based algorithms are most common
    Selecting a clustering resolution is difficult
    Sub-clustering often required
    No clustering is perfect

  36. Visualisation
    2D embeddings are the most
    common visualisation
    - t-SNE, UMAP etc.
    Can be useful BUT:
    - Easy to overinterpret
    - Hides lots of complexity
    - Potentially misleading

  37. Marker genes
    Genes that are specifically expressed in a cluster
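
A rough way to see what "specifically expressed" means is to rank genes by how much higher their mean expression is in one cluster than in all other cells. This is only a sketch with made-up normalised values and cluster labels; it computes an effect size (log2 fold change with a pseudocount) and does no statistical testing, which real marker detection methods normally include.

```python
import math

# Hypothetical normalised expression per cell, and cluster labels
expression = {
    "cell1": {"GeneA": 5.0, "GeneB": 0.1},
    "cell2": {"GeneA": 4.0, "GeneB": 0.3},
    "cell3": {"GeneA": 0.2, "GeneB": 3.0},
    "cell4": {"GeneA": 0.4, "GeneB": 5.0},
}
clusters = {"cell1": "c1", "cell2": "c1", "cell3": "c2", "cell4": "c2"}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def rank_markers(cluster, pseudocount=1.0):
    """Rank genes by log2 fold change of the cluster versus all other cells."""
    inside = [cell for cell, label in clusters.items() if label == cluster]
    outside = [cell for cell, label in clusters.items() if label != cluster]
    scores = {}
    for gene in expression[inside[0]]:
        mean_in = mean(expression[cell][gene] for cell in inside)
        mean_out = mean(expression[cell][gene] for cell in outside)
        scores[gene] = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(rank_markers("c1"))
print(rank_markers("c2"))
```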

  38. Annotation
    Maybe the most difficult part of the process
    Usually relies on interpreting marker genes
    (and iteratively clustering)
    Prior knowledge can help:
    - Automatic classification
    - Label transfer
    - Gene sets (maybe)

  39. Explore the data
    Always look at the output of each step
    - Make sure you understand what it has done
    - Every method will produce an output, but that
    doesn’t mean it makes any sense
    Make lots of plots!
    - Use these to make decisions

  40. 4. Advanced analysis

  41. Differential expression
    Differences in expression between conditions
    Multiple benchmarks show that
    “pseudobulk” analysis performs best
    - Models sample-level variation
    - Allows arbitrarily complex models
    - Benefits from 10+ years of
    development
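
The aggregation behind a pseudobulk analysis is a sum of raw counts over all cells that share a sample and a cell type. A minimal sketch with hypothetical cells (the sample names, cell types and counts are made up):

```python
from collections import defaultdict

# Hypothetical cells: (sample, cell type, raw counts per gene)
cells = [
    ("sample1", "T cell", {"A": 3, "B": 0}),
    ("sample1", "T cell", {"A": 1, "B": 2}),
    ("sample1", "B cell", {"A": 0, "B": 5}),
    ("sample2", "T cell", {"A": 4, "B": 1}),
]

# Sum counts within each (sample, cell type) group
pseudobulk = defaultdict(lambda: defaultdict(int))
for sample, cell_type, counts in cells:
    for gene, value in counts.items():
        pseudobulk[(sample, cell_type)][gene] += value

for group in sorted(pseudobulk):
    print(group, dict(pseudobulk[group]))
```

Each group then behaves like one bulk RNA-seq sample, so samples (not cells) are the replicates and established bulk differential expression methods can be applied per cell type.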

  42. Differential abundance
    Differences in cell type proportions between conditions
    Condition 1 vs Condition 2
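
The starting point for a differential abundance analysis is the proportion of each cell type in each sample. The sketch below only computes those proportions from hypothetical annotations; testing whether they differ between conditions needs multiple samples per condition and a method that accounts for the compositional nature of the data, which is not shown here.

```python
from collections import Counter

# Hypothetical cell type labels per sample
labels = {
    "sample1": ["T cell", "T cell", "B cell", "T cell"],  # condition 1
    "sample2": ["T cell", "B cell", "B cell", "B cell"],  # condition 2
}


def proportions(cell_types):
    """Return the fraction of cells belonging to each cell type."""
    counts = Counter(cell_types)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


props = {sample: proportions(types) for sample, types in labels.items()}
for sample, values in props.items():
    print(sample, values)
```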

  43. Trajectories
    Analysis of continuous processes
    - Pseudotime
    - RNA velocity

  44. Multimodal analysis
    Analysis of multiple different measurements
    Can provide more context and insight…
    …but methods are still developing
    Depends on the modalities and the question
    Unclear whether combined modelling is useful
    or whether it’s better to analyse each modality
    and combine the results

  45. Questions?

  46. Resources
    Current best practices in single-cell RNA-seq analysis: a tutorial
    Malte Lücken, Fabian Theis DOI: 10.15252/msb.20188746
    Extended best practices - Theis Lab (and the community)
    https://github.com/theislab/extended-single-cell-best-practices
    Orchestrating Single-Cell Analysis with Bioconductor
    https://bioconductor.org/books/release/OSCA/
    Seurat documentation
    https://satijalab.org/seurat/
    Scanpy documentation
    https://scanpy.readthedocs.io/en/stable/
    scverse community
    https://scverse.org/
    scRNA-tools
    https://scRNA-tools.org/
    Open Problems in Single-Cell Analysis
    https://openproblems.bio/

  47. Acknowledgements
    Theis lab
    Twitter
    Everyone who has written documentation,
    tutorials etc.
    Everyone who has developed tools and made their
    code available
