Cynthia SC
February 13, 2025

Clustering methods in multiome scRNA-seq and scATAC-seq

Main topics covered in this talk:
Introducing the Weighted-Nearest Neighbor (WNN) method for multimodal analysis
A glance at the WNN clustering algorithms available in Seurat
Some clustering evaluation methods:
- Concordance analysis based on the Jaccard index
- Examining the biological signal based on the expression of canonical genes
- Proximity analysis with silhouette width to assess cluster separation


Transcript

  1. Main topics
    1. Introducing the Weighted-Nearest Neighbor (WNN) method for multimodal analysis
    2. A glance at the WNN clustering algorithms available in Seurat
    3. Some clustering evaluation methods:
       a. Concordance analysis based on the Jaccard index
       b. Examining the biological signal based on the expression of canonical genes
       c. Proximity analysis with silhouette width to assess cluster separation
  2. Multiome scRNA-seq / scATAC-seq strategy
    The strategy analyzes Chromium Single Cell Multiome data, connecting gene expression and chromatin accessibility.
  3. WNN clustering in multiome datasets with Seurat
    Big picture:
    (1) Pre-processed data, normalized and scaled
    (2) RNA: PCA/Harmony, normalized
    (3) ATAC: the TF-IDF implementation
    (4) WNN multimodal clustering: reduction.list = list("pca", "lsi")
    ATAC: Latent Semantic Indexing (LSI)
    LSI is a linear dimensionality reduction method originally used in text mining. In the context of scATAC-seq, it transforms the sparse, high-dimensional ATAC data into a lower-dimensional space that captures the important patterns.
    How it works:
    - TF-IDF transformation: normalization using Term Frequency-Inverse Document Frequency (TF-IDF) to adjust for biases.
    - Singular Value Decomposition (SVD): applied to the TF-IDF matrix to identify the components that capture the most variance in the data.
    - Output: reduced dimensions (latent components) that summarize chromatin accessibility patterns.
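The TF-IDF step described above can be sketched in a few lines of plain Python. This is a minimal illustration of one common log-scaled TF-IDF variant, not Signac's implementation; the function name `tf_idf`, the `scale_factor` default, and the toy count matrix are assumptions made for the example, and the SVD step is left out.

```python
import math

def tf_idf(counts, scale_factor=1e4):
    """Log-scaled TF-IDF for a peaks x cells count matrix (list of rows).

    Hypothetical helper for illustration; the exact formula used by a
    given package may differ.
    """
    n_peaks = len(counts)
    n_cells = len(counts[0])
    # Term frequency denominator: total counts observed in each cell.
    cell_totals = [sum(counts[p][c] for p in range(n_peaks)) for c in range(n_cells)]
    weights = []
    for p in range(n_peaks):
        peak_total = sum(counts[p])
        # Inverse document frequency: peaks with few counts across cells
        # receive a larger weight.
        idf = n_cells / peak_total if peak_total > 0 else 0.0
        row = []
        for c in range(n_cells):
            tf = counts[p][c] / cell_totals[c] if cell_totals[c] > 0 else 0.0
            row.append(math.log1p(tf * idf * scale_factor))
        weights.append(row)
    return weights

# Toy matrix: 3 peaks (rows) x 4 cells (columns), made up for the example.
counts = [
    [1, 1, 1, 1],  # peak open in every cell
    [2, 0, 0, 0],  # peak specific to the first cell
    [0, 0, 3, 1],
]
weights = tf_idf(counts)
```

In the first cell, the cell-specific peak ends up with a larger weight than the peak that is open everywhere, which is the bias adjustment the slide refers to.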
  4. WNN clustering with Seurat: WNN GEX+ATAC, RNA + ATAC (pca / lsi)
    Algorithms: 1 = Louvain, 2 = Louvain multilevel, 3 = SLM, 4 = Leiden
    Seurat applies a graph-based clustering approach. The methods embed cells in a graph structure, for example a KNN graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’. However, the algorithms differ in their methodology, performance, and the quality of the clusters produced.
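To make the graph construction concrete, here is a toy sketch of a KNN graph with edges weighted by shared-neighbor overlap. It only illustrates the idea and is not Seurat's implementation; the helper names `knn_graph` and `snn_weight` and the six 2-D points standing in for a reduced embedding are hypothetical.

```python
import math

def knn_graph(points, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, closest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def snn_weight(neighbors, i, j):
    """Jaccard overlap of the neighborhoods of cells i and j (each cell counts itself)."""
    a = neighbors[i] | {i}
    b = neighbors[j] | {j}
    return len(a & b) / len(a | b)

# Two well-separated groups of three "cells" (made-up coordinates).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nn = knn_graph(points, k=2)
within = snn_weight(nn, 0, 1)   # cells from the same group
between = snn_weight(nn, 0, 3)  # cells from different groups
```

Cells in the same group share their whole neighborhood, while cells from different groups share none of it; community detection algorithms such as Louvain or Leiden then partition a graph weighted in this spirit.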
  5. Clustering WNN Seurat + Signac
    Leiden ensures well-connected clusters and is the better choice if computational resources are not a concern. Louvain suits quick clustering on very large datasets, where speed is critical and minor connectivity issues are less of a concern. For datasets with more than 100,000 cells, the Leiden clustering algorithm is generally recommended over Louvain. Traag, V.A., Waltman, L. & van Eck, N.J. (2019)
  6. WNN GEX+ATAC, RNA + ATAC (pca / lsi), n=48
    Algorithms: 1 = Louvain, 2 = Louvain multilevel, 3 = SLM, 4 = Leiden
    Values shown: res=0.8, res=1, res=1.5, res=2; k.nn=20, k.nn=30, k.nn=40
    k.nn:
    - Lower k.nn (e.g., 10–15): better for small datasets or datasets with very distinct clusters.
    - Higher k.nn (e.g., 30–50): better for larger datasets or where clusters are less distinct, as it captures broader neighborhood relationships.
    Resolution:
    - Lower values (e.g., 0.5) yield fewer clusters.
    - Higher values (e.g., 1.0) yield more clusters.
  7. Clustering evaluation steps
    1. Select WNN clusterings that integrate both RNA and ATAC data, excluding methods that produce excessive singletons (keeping those with ≤4 singletons). We identified 4 options that meet this criterion. “Orchestrating Single-Cell Analysis with Bioconductor” book (2025). https://bioconductor.org/books/3.20/OSCA.advanced/
    2. Run a concordance analysis based on the Jaccard index for each pair of previously selected clustering methods, comparing the clusters with the best match and noting cases of overclustering among the methods.
    3. Use the silhouette width for proximity analysis to assess cluster separation and determine which methods best capture Habenula subdomains or, conversely, lead to overclustering.
    4. Look at the biological signal based on the expression of canonical genes.
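The singleton filter in step 1 could be sketched as below. This assumes a singleton means a cluster containing a single cell and that raw per-cell labels are available before any singleton merging; the names `count_singletons` and `select_methods` and the toy labelings are hypothetical, and only the default threshold of 4 comes from the slide.

```python
from collections import Counter

def count_singletons(labels):
    """Number of clusters that contain exactly one cell."""
    sizes = Counter(labels)
    return sum(1 for size in sizes.values() if size == 1)

def select_methods(labelings, max_singletons=4):
    """Keep the clusterings whose singleton count is within the threshold."""
    return [name for name, labels in labelings.items()
            if count_singletons(labels) <= max_singletons]

# Toy per-cell labels for two made-up clusterings of five cells.
labelings = {
    "method_a": [0, 0, 1, 1, 2],  # one singleton (cluster 2)
    "method_b": [0, 1, 2, 3, 4],  # every cluster is a singleton
}
kept = select_methods(labelings, max_singletons=1)
```

With the toy threshold of 1, only `method_a` is kept; on real data the threshold would be the ≤4 used in the talk.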
  8. Final clusters and singletons per method
    Leiden: 33 final clusters, 4 singletons identified
    SLM: 41 final clusters, 4 singletons identified
    LouvainM: 41 final clusters, 4 singletons identified
    Louvain: 38 final clusters, 4 singletons identified
    Figure panels: Louvain, LouvainM, Leiden, SLM; res=1; knn=30, knn=40
    Louvain: 37 final clusters, 7 singletons identified
    Leiden: 28 final clusters, 7 singletons identified
    Louvain: 41 final clusters, 9 singletons identified
  9. Exploration based on singletons
    - Louvain or LouvainM at res=1 with knn=30
    - Leiden at res=1 with knn=30
    - SLM (Smart Local Moving) at res=1 with knn=30
  10. Figure labels: Leiden, Louvain
    The Jaccard index is computed for each pair of clusters. This normalizes for the differences in cluster abundance so that large clusters do not dominate the color scale.
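The pairwise Jaccard comparison can be written out as a short sketch. It assumes two label vectors over the same cells; the function name `jaccard_matrix` and the toy labels are hypothetical.

```python
def jaccard_matrix(labels_x, labels_y):
    """Jaccard index for every pair (cluster in X, cluster in Y).

    Both label lists must describe the same cells in the same order.
    """
    cells_x = {}
    cells_y = {}
    for cell, (lx, ly) in enumerate(zip(labels_x, labels_y)):
        cells_x.setdefault(lx, set()).add(cell)
        cells_y.setdefault(ly, set()).add(cell)
    return {
        (cx, cy): len(sx & sy) / len(sx | sy)
        for cx, sx in cells_x.items()
        for cy, sy in cells_y.items()
    }

# Toy example: clustering Y splits cluster "a" of clustering X in two.
labels_x = ["a", "a", "a", "b", "b", "b"]
labels_y = [1, 1, 2, 3, 3, 3]
jm = jaccard_matrix(labels_x, labels_y)

# Best-matching Y cluster for each X cluster.
best = {cx: max((cy for (x, cy) in jm if x == cx), key=lambda cy: jm[(cx, cy)])
        for cx in set(labels_x)}
```

Here cluster "b" matches cluster 3 perfectly (index 1.0), while cluster "a" is spread over clusters 1 and 2 with indices below 1, the pattern one would read as possible overclustering in the second method.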
  11. Pairwise comparisons
    - (X) SLM, (Y) LouvainM
    - (X) Leiden, (Y) LouvainM
    - (X) SLM, (Y) Louvain
    - (X) LouvainM, (Y) Louvain
    - (X) Leiden, (Y) Louvain
    - (X) SLM, (Y) Leiden
  12. Clusters with large negative widths are poorly separated from a neighboring cluster and may represent cell subtypes.
    Distribution of the approximate silhouette width across cells in each cluster. For each cell, compute the average distance to all cells in the same cluster and the average distance to all cells in another cluster, taking the minimum of these averages across all other clusters.
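The computation described above can be sketched as follows. The slide refers to an approximate silhouette width; this sketch computes the exact width instead, which is only practical for small data. The function name `silhouette_widths`, the convention of returning 0 for a cell that is alone in its cluster, and the toy coordinates are assumptions for illustration, and at least two clusters are required.

```python
import math
from collections import defaultdict

def silhouette_widths(points, labels):
    """Exact silhouette width per cell: (b - a) / max(a, b).

    a = mean distance to the other cells of the same cluster,
    b = smallest mean distance to the cells of any other cluster.
    """
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        members[lab].append(idx)
    widths = []
    for i, p in enumerate(points):
        own = [j for j in members[labels[i]] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for single-cell clusters
            continue
        a = sum(math.dist(p, points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(p, points[j]) for j in idxs) / len(idxs)
            for lab, idxs in members.items() if lab != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Two tight groups of made-up 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_widths(points, [0, 0, 1, 1])  # sensible labels
bad = silhouette_widths(points, [0, 1, 1, 1])   # second cell mislabeled
```

With sensible labels every width is close to 1; the mislabeled cell gets a negative width because it sits closer to the other cluster than to its own.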
  13. Leiden: In the silhouette table, each row corresponds to one cluster; large off-diagonal counts indicate that its cells are easily confused with those from another cluster.
  14. LouvainM: 15 clusters, % [8.97-0.34]; Leiden: 15 clusters, % [9.9-2.25]
    * Cluster numbering is associated with cluster size: lower-numbered clusters are larger.
  15. Some notes
    • For multiome data, it might be relevant to measure the number of cells per cluster, as we expect to need ~1000 cells per cluster to identify 80% of the pCREs.
    • There are other diagnostic methods to evaluate cluster separation and stability, methods to compare clusterings that represent different views of the data, and some strategies to choose the number of clusters.
    https://bioconductor.org/books/3.20/OSCA.advanced/clustering-redux.html
    https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html
    https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/
    https://www.geeksforgeeks.org/clustering-metrics/