Clustering methods in multiome scRNA-seq and scATAC-seq

Clustering in Multi-ome data (snATAC-seq + snRNA-seq) Presented By Cynthia
Soto Cardinault February 07, 2025

Main topics
1. Introducing the Weighted-Nearest Neighbor (WNN) approach for multimodal analysis
2. A glance at the WNN clustering algorithms available in Seurat
3. Some clustering evaluation methods:
   a. Concordance analysis based on the Jaccard index
   b. Looking at the biological signal based on the expression of canonical genes
   c. Proximity analysis with silhouette width to assess cluster separation

Multiome scRNA-seq / scATAC-seq strategy
The strategy analyzes Chromium Single Cell Multiome data, connecting gene expression and chromatin accessibility.

WNN clustering in multiome datasets with Seurat
Big picture:
(1) Pre-processed data, normalized and scaled
(2) RNA: PCA/Harmony normalized
(3) ATAC: the TF-IDF implementation
(4) WNN multimodal clustering: reduction.list = list("pca", "lsi")

ATAC: Latent Semantic Indexing (LSI)
LSI is a linear dimensionality reduction method originally used in text mining. In the context of scATAC-seq, it transforms the sparse, high-dimensional ATAC data into a lower-dimensional space that captures the important patterns.
How it works:
• TF-IDF transformation: normalization using Term Frequency-Inverse Document Frequency (TF-IDF) to adjust for biases.
• Singular Value Decomposition (SVD): applied to the TF-IDF matrix to identify the components that capture the most variance in the data.
• Output: reduced dimensions (latent components) that summarize chromatin accessibility patterns.
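The TF-IDF step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the log-scaled TF × IDF variant, the `scale_factor` value, and the toy `peaks_by_cells` matrix are all chosen here for the example and are not taken from the slides. The SVD step is left out, since in practice it would come from a linear algebra library.

```python
import math

def tf_idf(counts, scale_factor=1e4):
    """Weight a peaks-by-cells count matrix (list of rows, one row per peak).

    Assumed variant: log1p(TF * IDF * scale_factor), where
    TF  = count / total counts in that cell, and
    IDF = number of cells / total counts for that peak.
    """
    n_peaks = len(counts)
    n_cells = len(counts[0])
    # Column sums: total accessibility signal per cell.
    cell_totals = [sum(counts[p][c] for p in range(n_peaks)) for c in range(n_cells)]
    # Row sums: how widespread each peak is across cells.
    peak_totals = [sum(row) for row in counts]
    weighted = []
    for p in range(n_peaks):
        idf = n_cells / peak_totals[p] if peak_totals[p] else 0.0
        row = []
        for c in range(n_cells):
            tf = counts[p][c] / cell_totals[c] if cell_totals[c] else 0.0
            row.append(math.log1p(tf * idf * scale_factor))
        weighted.append(row)
    return weighted

# Hypothetical toy data: 3 peaks (rows) x 4 cells (columns).
peaks_by_cells = [
    [1, 1, 1, 1],  # peak open in every cell: low information
    [1, 0, 0, 0],  # peak open in a single cell: high information
    [0, 1, 1, 0],
]
weighted = tf_idf(peaks_by_cells)
```

In the first cell, the rare peak (second row) ends up with a larger weight than the ubiquitous peak (first row), which is the point of the IDF term.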

WNN clustering with Seurat
WNN GEX+ATAC, RNA + ATAC (pca / lsi)
Algorithms: 1 = Louvain, 2 = Louvain multilevel, 3 = SLM, 4 = Leiden
Seurat applies a graph-based clustering approach. The methods embed cells in a graph structure, for example a KNN graph with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected 'quasi-cliques'. However, they differ in terms of their methodology, performance, and the quality of the clusters produced.
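To make the graph idea concrete, here is a minimal sketch of a KNN graph whose edges are weighted by the Jaccard overlap of the two cells' neighbourhoods, in the spirit of a shared-nearest-neighbor graph. It is an assumption-laden toy: the function names, the 2-D coordinates, and `k=2` are hypothetical, and the modality weighting that WNN performs is not modelled.

```python
import math

def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def snn_edges(neighbors, prune=0.0):
    """Weight each pair of cells by the Jaccard overlap of their neighbourhoods.

    Each cell is counted as a member of its own neighbourhood. Pairs whose
    weight is not above `prune` get no edge.
    """
    edges = {}
    n = len(neighbors)
    for i in range(n):
        a = neighbors[i] | {i}
        for j in range(i + 1, n):
            b = neighbors[j] | {j}
            weight = len(a & b) / len(a | b)
            if weight > prune:
                edges[(i, j)] = weight
    return edges

# Hypothetical embedding: two tight groups of three cells each.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = snn_edges(knn_indices(cells, k=2))
```

On this toy input the graph contains edges only within each group, so the two groups are exactly the kind of 'quasi-cliques' a community detection algorithm would then recover.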

Clustering WNN Seurat + Signac
• Leiden ensures well-connected clusters and is a good choice if computational resources are not a concern.
• Louvain is for quick clustering on very large datasets, where speed is critical and you are less concerned about minor connectivity issues.
• For >100,000 cells, the Leiden clustering algorithm is generally recommended over Louvain.
Traag, V.A., Waltman, L. & van Eck, N.J. (2019)

WNN PCA/Harmony + LSI, Louvain, resolution=1, knn=20: 46 final clusters, 9 singletons

WNN GEX+ATAC, RNA + ATAC (pca / lsi), n=48 (4 algorithms × 4 resolutions × 3 k.nn values)
Algorithms: 1 = Louvain, 2 = Louvain multilevel, 3 = SLM, 4 = Leiden
Resolution: res=0.8, res=1, res=1.5, res=2
Neighbors: k.nn=20, k.nn=30, k.nn=40
• k.nn: lower values (e.g., 10–15) are better for small datasets or datasets with very distinct clusters; higher values (e.g., 30–50) are better for larger datasets or where clusters are less distinct, as they capture broader neighborhood relationships.
• Resolution: lower values (e.g., 0.5) yield fewer clusters; higher values (e.g., 1.0) yield more clusters.
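Why resolution changes the number of clusters can be seen from the objective itself. The sketch below assumes the standard resolution-parameterized modularity, Q = Σ_c [ L_c / m − γ (d_c / 2m)² ], that Louvain-type methods commonly optimize; the slides do not state which objective was used, and the graph and both partitions are hypothetical.

```python
def modularity(edges, labels, resolution=1.0):
    """Resolution-parameterized modularity of an unweighted, undirected graph.

    edges: list of (u, v) pairs; labels: mapping from node to community.
    """
    m = len(edges)
    intra = {}   # number of edges inside each community (L_c)
    degree = {}  # sum of node degrees in each community (d_c)
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            intra[labels[u]] = intra.get(labels[u], 0) + 1
    return sum(
        intra.get(c, 0) / m - resolution * (d / (2 * m)) ** 2
        for c, d in degree.items()
    )

# Hypothetical graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
coarse = {n: "all" for n in range(6)}                  # one big cluster
fine = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}  # one cluster per triangle

low_res = (modularity(edges, coarse, 0.2), modularity(edges, fine, 0.2))
high_res = (modularity(edges, coarse, 1.0), modularity(edges, fine, 1.0))
```

At the low resolution the single-cluster partition scores higher, while at the higher resolution the two-cluster partition wins, matching the rule of thumb that higher resolution yields more clusters.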

Clustering evaluation steps
1. Select WNN clusterings that integrate both RNA and ATAC data, discarding methods that produce excessive singletons (keeping those with ≤4 singletons). Four options met this criterion.
2. Run a concordance analysis based on the Jaccard index for each pair of the previously selected clustering methods, comparing the clusters with the best match and noting cases of overclustering among the methods.
3. Use silhouette width for proximity analysis to assess cluster separation and determine which methods best capture Habenula subdomains or, conversely, lead to overclustering.
4. Look at the biological signal based on the expression of canonical genes.
Reference: "Orchestrating Single-Cell Analysis with Bioconductor" book (2025). https://bioconductor.org/books/3.20/OSCA.advanced/
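Step 1 can be sketched as a simple filter over cluster sizes. This is a minimal illustration that assumes a singleton is a cluster containing exactly one cell; the function names and the toy label vectors are hypothetical, and only the threshold of 4 comes from the slide.

```python
from collections import Counter

def singleton_summary(labels):
    """Count clusters and single-cell clusters in one clustering result."""
    sizes = Counter(labels)
    singletons = sorted(c for c, n in sizes.items() if n == 1)
    return {
        "n_clusters": len(sizes),
        "n_singletons": len(singletons),
        "singletons": singletons,
    }

def passes_singleton_filter(labels, max_singletons=4):
    """Keep a clustering only if it has at most `max_singletons` singletons."""
    return singleton_summary(labels)["n_singletons"] <= max_singletons

# Hypothetical cluster assignments, one label per cell.
run_a = ["c0"] * 50 + ["c1"] * 30 + ["s1", "s2"]
run_b = ["c0"] * 50 + ["c1"] * 30 + ["s1", "s2", "s3", "s4", "s5"]

summary_a = singleton_summary(run_a)
keep_a = passes_singleton_filter(run_a)
keep_b = passes_singleton_filter(run_b)
```

Here `run_a` has two singletons and is kept, while `run_b` has five and is dropped.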

Final clusters and singletons per method (panels: Louvain, LouvainM, Leiden, SLM; res=1; knn=30, knn=40)

Method     Final clusters   Singletons identified
Leiden     33               4
SLM        41               4
LouvainM   41               4
Louvain    38               4
Louvain    37               7
Leiden     28               7
Louvain    41               9

Exploration based on singletons
• Louvain or LouvainM at res=1 with knn=30
• Leiden at res=1 with knn=30
• SLM (Smart Local Moving) at res=1 with knn=30

Leiden vs. Louvain
The heatmap shows the Jaccard index computed for each pair of clusters. This normalizes for the differences in cluster abundance so that large clusters do not dominate the color scale.
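The pairwise Jaccard index can be sketched as follows: for a cluster x from one method and a cluster y from another, it is the number of cells they share divided by the number of cells in either. The function name and the toy label vectors are hypothetical; the cluster names only mimic a Leiden versus Louvain comparison.

```python
def jaccard_matrix(labels_x, labels_y):
    """Jaccard index for every pair of clusters from two clusterings.

    labels_x and labels_y give the cluster of each cell, in the same cell order.
    Returns a dict keyed by (cluster in x, cluster in y).
    """
    cells_x, cells_y = {}, {}
    for cell, (x, y) in enumerate(zip(labels_x, labels_y)):
        cells_x.setdefault(x, set()).add(cell)
        cells_y.setdefault(y, set()).add(cell)
    return {
        (x, y): len(a & b) / len(a | b)
        for x, a in cells_x.items()
        for y, b in cells_y.items()
    }

# Hypothetical labels for six cells under two methods.
leiden = ["A", "A", "A", "A", "B", "B"]
louvain = ["1", "1", "2", "2", "3", "3"]
concordance = jaccard_matrix(leiden, louvain)
```

In this toy case cluster B matches cluster 3 perfectly (index 1.0), whereas cluster A is shared equally between clusters 1 and 2 (index 0.5 each), the pattern one would read as one method splitting a cluster that the other keeps whole.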

Pairwise comparisons (X vs. Y):
• (X) SLM, (Y) LouvainM
• (X) Leiden, (Y) LouvainM
• (X) SLM, (Y) Louvain
• (X) LouvainM, (Y) Louvain
• (X) Leiden, (Y) Louvain
• (X) SLM, (Y) Leiden

Distribution of the approximate silhouette width across cells in each cluster. For each cell, compute the average distance to all cells in the same cluster and the average distance to all cells in each other cluster, taking the minimum of these averages across all other clusters. Clusters with many large negative widths are poorly separated from a neighbouring cluster and may represent cell subtypes.
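A minimal sketch of the calculation follows. Note the assumptions: it computes the exact silhouette width, (b − a) / max(a, b), rather than the approximate version mentioned on the slide, it expects at least two clusters, and the coordinates and labels are hypothetical. Here a is the average distance to the other cells of the same cluster and b is the smallest average distance to any other cluster.

```python
import math

def silhouette_widths(points, labels):
    """Return (width, closest other cluster) for every cell.

    Exact silhouette: a = mean distance to the other cells of the same
    cluster, b = minimum over other clusters of the mean distance to
    their cells. Requires at least two clusters.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    result = []
    for i, lab in enumerate(labels):
        b, closest = min(
            (sum(math.dist(points[i], points[j]) for j in members) / len(members), other)
            for other, members in clusters.items()
            if other != lab
        )
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a cell alone in its cluster gets width 0.
            result.append((0.0, closest))
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        denom = max(a, b)
        result.append(((b - a) / denom if denom else 0.0, closest))
    return result

# Hypothetical embedding: the last cell sits next to cluster B but is labelled A.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (9.5, 0.5)]
labels = ["A", "A", "B", "B", "A"]
widths = silhouette_widths(points, labels)
```

The misplaced cell gets a negative width and reports B as its closest other cluster, while the cells of B have positive widths.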

Leiden: In the silhouette table, each row corresponds to one
cluster; large off-diagonal counts indicate that its cells are easily confused with those from another cluster.

Cell-specificity: Endo, Astrocytes, Oligos. Likely cell subtypes.

Louvain: 14 clusters
LouvainM: 15 clusters
SLM: 15 clusters
Leiden: 15 clusters

LouvainM: 15 clusters, % [8.97-0.34]
Leiden: 15 clusters, % [9.9-2.25]
* Cluster numbering is associated with cluster size: lower-numbered clusters are larger.
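The size-based numbering can be sketched as below. This assumes that the percentages on the slide are each cluster's share of all cells and that numbering starts at 0; neither is stated in the transcript, and the function name and toy labels are hypothetical.

```python
from collections import Counter

def relabel_by_size(labels):
    """Renumber clusters so that 0 is the largest, and report % of cells each."""
    sizes = Counter(labels)
    # Largest cluster first; ties are broken by the original label.
    ordered = sorted(sizes, key=lambda c: (-sizes[c], c))
    new_id = {old: rank for rank, old in enumerate(ordered)}
    relabeled = [new_id[lab] for lab in labels]
    percent = {new_id[old]: 100.0 * sizes[old] / len(labels) for old in ordered}
    return relabeled, percent

# Hypothetical labels for ten cells.
raw = ["x"] * 2 + ["y"] * 5 + ["z"] * 3
relabeled, percent = relabel_by_size(raw)
```

With this input, cluster y becomes 0 (50% of cells), z becomes 1 (30%), and x becomes 2 (20%).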

LouvainM: 15 Hb clusters
Leiden: 15 Hb clusters
One cluster was likely split in two.

Some notes
• For multiome data, it might be relevant to measure the number of cells per cluster, as we expect to need ~1000 cells per cluster to identify 80% of the pCREs.
• There are other diagnostic methods to evaluate cluster separation and stability, methods to compare clusterings that represent different views of the data, and some strategies to choose the number of clusters.
https://bioconductor.org/books/3.20/OSCA.advanced/clustering-redux.html
https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html
https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/
https://www.geeksforgeeks.org/clustering-metrics/

Thanks
