ArchR software package for integrative single-cell chromatin accessibility analysis

libd_lcolladotor_team meetings
Cynthia S Cardinault
December 11, 2024

2021-03 Published: 25 February 2021

ArchR is a software suite for both routine and advanced analysis of massive-scale single-cell chromatin accessibility data without the need for high-performance computing environments.
• Supports data from the 10x Genomics Chromium system, the Bio-Rad droplet scATAC-seq system, single-cell combinatorial indexing and the Fluidigm C1 system.

Main ArchR features
1. File structure called Arrow
2. Data integration and analysis: normalization, dimensionality reduction, and clustering based on chromatin accessibility patterns
3. Data visualization and integration with other data types, such as scRNA-seq
4. Performance optimization: designed to handle datasets with millions of cells

• In addition to using fragment files as input, ArchR can directly convert BAM files to Arrow files, enabling analysis of data from diverse single-cell platforms.

What is an Arrow file?

S Fig. 1 – Schematics of file infrastructure and information access
Key Features of an Arrow File
• Designed for scalability and speed
• Facilitates random access to specific data components
• Supports merging of data from multiple samples
Data Components in an Arrow File
• Gene Activity Matrix: links accessibility to gene expression by aggregating signals near genes.
• Tile Matrix: accessibility aggregated across genomic tiles (e.g., 500 bp intervals).
• Peak Matrix: accessibility at specific peaks.
• Metadata: includes cell-level annotations, quality control metrics, and other attributes.
createArrowFiles()
• Operates on each chromosome independently and in parallel.
• A subset of each chromosome's matrix is accessed from each Arrow file in parallel, and the subsets are then merged.
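The tile matrix idea can be sketched in a few lines. This is a minimal illustration, not ArchR's implementation: the fragment tuples, the barcode names and the choice to count both fragment ends as insertions are assumptions made for the example.

```python
from collections import Counter

TILE_SIZE = 500  # bp, matching the 500 bp tile example on the slide


def tile_counts(fragments, tile_size=TILE_SIZE):
    """Count fragment ends per (cell barcode, chromosome, tile index).

    fragments: iterable of (chrom, start, end, barcode) tuples (assumed layout).
    """
    counts = Counter()
    for chrom, start, end, barcode in fragments:
        # Assumption: each fragment end is treated as one insertion event.
        for pos in (start, end):
            counts[(barcode, chrom, pos // tile_size)] += 1
    return counts


# Toy fragments (hypothetical data)
fragments = [
    ("chr1", 120, 480, "cellA"),
    ("chr1", 490, 760, "cellA"),
    ("chr1", 1010, 1200, "cellB"),
]
counts = tile_counts(fragments)
```

Because each chromosome is keyed separately, the same function could be run per chromosome and the resulting counters merged, which mirrors the per-chromosome parallelism described above.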

S Fig. 2 – Quality control metrics for PBMC, bone marrow, and mouse atlas datasets
Datasets:
(1) Peripheral blood mononuclear cells (PBMCs), which represent discrete primary cell types
(2) Bone marrow stem and progenitor cells and differentiated cells, which represent a continuous cellular hierarchy
(3) A large atlas of murine cell types from diverse organ systems
Panel (k): QC filtering plots for each individual organ type from the mouse atlas dataset, showing the TSS enrichment score vs. unique nuclear fragments per cell. Dot color represents the density of points in the plot, in arbitrary units.
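The QC plots above separate cells on two axes, so the filter itself is a pair of cutoffs. A minimal sketch, assuming each cell is a dict with hypothetical keys `tss_enrichment` and `n_frags`; the cutoff values are placeholders, not values taken from the slides.

```python
def passes_qc(cell, min_tss=4.0, min_frags=1000):
    """Keep a cell only if it clears both cutoffs (placeholder values)."""
    return cell["tss_enrichment"] >= min_tss and cell["n_frags"] >= min_frags


# Hypothetical per-cell QC metrics
cells = [
    {"barcode": "cellA", "tss_enrichment": 9.2, "n_frags": 5400},
    {"barcode": "cellB", "tss_enrichment": 2.1, "n_frags": 8000},  # low TSS enrichment
    {"barcode": "cellC", "tss_enrichment": 7.5, "n_frags": 300},   # too few fragments
]
kept = [c["barcode"] for c in cells if passes_qc(c)]
```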

ArchR outperforms SnapATAC and Signac in speed and memory usage across all comparisons, enabling analysis of 70,000-cell datasets in under 1 hour with 32 GB of RAM and 8 cores.
Figures shown:
• Fig. 1 (b), (c)
• S Fig. 3 – Benchmarking of SnapATAC performance
• Ex Fig. 2 – Benchmarking comparisons of runtime and memory usage for ArchR, Signac, and SnapATAC

Putative doublets in scATAC-seq data
Fig. 1
• Doublet detection is employed similarly to methods used for scRNA-seq.
• Dataset: 38k cells across 2 replicates, representing 10 different cell lines.
• Uses synthetic doublets and a KNN-based method to identify cells that resemble them. The synthetic doublets are projected onto the UMAP to highlight regions enriched for potential doublets.
Ex Fig. 4
• Doublets present (black, N = 10,887)
• ArchR-identified doublets removed (N = 9,702)
• Some doublets predicted by demuxlet were not identified by ArchR; these reside within cluster boundaries rather than in intermediate zones.
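The synthetic-doublet idea can be sketched on a toy 2D embedding. This is a simplified illustration under stated assumptions: synthetic doublets are made by averaging the embedding coordinates of two cells (a stand-in for mixing their reads and projecting the result), every pair is used instead of a random sample, and the cell names and coordinates are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def doublet_scores(cells, k=1):
    """Count how often each cell is among the k nearest neighbours of a synthetic doublet.

    cells: dict mapping cell name -> (x, y) embedding coordinates.
    """
    hits = Counter()
    names = sorted(cells)
    for a, b in combinations(names, 2):
        # Assumption: a synthetic doublet sits midway between its two parent cells.
        synth = tuple((p + q) / 2 for p, q in zip(cells[a], cells[b]))
        # Exclude the parents so they do not trivially match their own doublet.
        candidates = [n for n in names if n not in (a, b)]
        candidates.sort(key=lambda n: math.dist(cells[n], synth))
        hits.update(candidates[:k])
    return hits


# Two tight clusters plus one cell in the intermediate zone (hypothetical)
cells = {
    "c1": (0.0, 0.0), "c2": (0.0, 1.0),    # cluster 1
    "c3": (10.0, 0.0), "c4": (10.0, 1.0),  # cluster 2
    "d": (5.0, 0.5),                       # sits between the clusters
}
scores = doublet_scores(cells)
putative_doublet = max(scores, key=scores.get)
```

Cells in intermediate zones collect the most hits, which is consistent with the observation above that doublets sitting inside cluster boundaries are harder to flag.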

Identification of a peak set
In the context of scATAC-seq, identifying peak regions before cluster identification requires calling peaks on all cells as a single group.
*SnapATAC can use a genome-wide 500-bp tile matrix, but downstream computation with this high-resolution matrix exceeds the memory limits of common computational infrastructure.

Dimensionality reduction and clustering
• Latent semantic indexing (LSI) = Signac
• Landmark diffusion maps (LDM) = SnapATAC
• Optimized iterative LSI = ArchR
LSI with TF-IDF (Standard Method)
• Preprocessing: uses TF-IDF (Term Frequency-Inverse Document Frequency) normalization to weight features (regions) based on their importance across cells. TF-IDF emphasizes features that are frequent in specific cells but infrequent across the dataset.
• Dimensionality Reduction: applies Singular Value Decomposition (SVD) to the TF-IDF matrix, capturing major patterns of variability while suppressing noise. Runs once, directly on the preprocessed data, without refinement of feature selection.
Iterative LSI (Refined Method)
• Feature Selection: starts with an initial round of LSI, followed by iterative refinement of the features (regions, peaks, terms) used in the analysis.
• Iterative Workflow: after each iteration, (1) identifies the features contributing the most meaningful variability, (2) refines the set of features included in subsequent rounds of LSI, and (3) re-applies LSI with the updated features.
• The iteration helps eliminate noise and non-informative features. Iterative LSI dynamically adjusts to the data and can improve clustering, visualization, and biological insights.
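The TF-IDF weighting and the feature-refinement step can be sketched as follows. This is a minimal illustration, assuming one common TF-IDF variant (term frequency times log(1 + N/df)); the exact formulas used by Signac and ArchR may differ, the SVD step is omitted to keep the example dependency-free, and `top_variable_features` is a hypothetical helper standing in for the per-iteration feature selection.

```python
import math


def tf_idf(counts):
    """counts: list of rows (cells), each a list of per-feature counts."""
    n_cells = len(counts)
    n_features = len(counts[0])
    # Document frequency: number of cells in which each feature is accessible.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_features)]
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([
            (row[j] / total) * math.log(1 + n_cells / df[j]) if total and df[j] else 0.0
            for j in range(n_features)
        ])
    return weighted


def top_variable_features(matrix, n_top):
    """Return indices of the n_top columns with the highest variance."""
    n = len(matrix)
    variances = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / n)
    return sorted(range(len(variances)), key=lambda j: -variances[j])[:n_top]


# Feature 0 is open in every cell; features 1 and 2 are cell-type specific.
counts = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
weights = tf_idf(counts)
selected = top_variable_features(weights, n_top=2)
```

In an iterative scheme, the selected features would feed the next round of TF-IDF and SVD; here the ubiquitous feature 0 is dropped because it carries no variance across cells.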

Dimensionality reduction and clustering
Ex Fig. 5 (d)
• Latent semantic indexing (LSI) = Signac
• Optimized iterative LSI = ArchR
• Landmark diffusion maps (LDM) = SnapATAC
t-SNE of bulk ATAC-seq data from hematopoietic cells (N = 7,200), downsampled to various data quality scales:
• Low-quality: ~1,000 fragments/cell
• Medium-quality: ~5,000 fragments/cell
• High-quality (right): ~10,000 fragments/cell

S Fig. 4 – Comparison of clustering results in scATAC-seq data derived from PBMCs
In both cases, ArchR identified clusters similar to those from other methods while being less biased by low-quality cells and doublets. However, when comparing clustering of the bone marrow cell dataset, ArchR alone maintained the structure of the continuous differentiation trajectories from immature CD34+ hematopoietic stem and progenitor cells through differentiated myeloid, erythroid and B cells.

S Fig. 7 – Comparison of estimated LSI in ArchR
and estimated LDM in SnapATAC in bone marrow cells.

What about Signac vs ArchR?
• LSI with TF-IDF: works well on smaller, cleaner datasets where the initial feature selection captures meaningful biological signals.
• Iterative LSI: better for large, noisy datasets or when a more refined analysis is needed, such as identifying subtle subpopulations or resolving complex chromatin accessibility patterns.

Fig. 2: Optimized gene score inference models improve prediction of gene expression from scATAC-seq data
• Gene scores represent inferred gene expression and are critical for annotating biological states in clusters.
• Previous methods for deriving gene scores were not extensively optimized, so the ArchR authors benchmarked 56 models using matched scATAC-seq and scRNA-seq datasets from PBMCs and bone marrow cells.
• The models varied by the regions included, the sizes of those regions and the weights (based on genomic distance) applied to each region.
• The benchmark used the canonical correlation analysis-based integration implemented in Seurat.
• ArchR now uses the optimized gene score model (Model 42) for all downstream analyses, enabling more accurate cluster annotation and biological state identification.
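A distance-weighted gene score can be sketched as a weighted sum of accessibility around a gene. This is an illustrative model only, not Model 42: the exponential decay, the 5 kb scale, the rule that positions inside the gene body get distance 0, and the toy coordinates are all assumptions.

```python
import math


def distance_weight(distance_bp, scale_bp=5000.0):
    """Weight that decays exponentially with genomic distance (assumed form)."""
    return math.exp(-abs(distance_bp) / scale_bp)


def gene_score(tiles, gene_start, gene_end, scale_bp=5000.0):
    """Sum accessibility over tiles, down-weighting tiles far from the gene.

    tiles: iterable of (position_bp, accessibility) pairs (assumed layout).
    """
    score = 0.0
    for position, accessibility in tiles:
        if gene_start <= position <= gene_end:
            distance = 0  # inside the gene body
        else:
            distance = min(abs(position - gene_start), abs(position - gene_end))
        score += accessibility * distance_weight(distance, scale_bp)
    return score


# Same accessibility signal placed near and far from a hypothetical gene
near = gene_score([(10_500, 3)], gene_start=10_000, gene_end=12_000)
far = gene_score([(50_000, 3)], gene_start=10_000, gene_end=12_000)
```

The benchmarked models differ in exactly these choices: which regions are included, how large they are, and how the weight falls off with distance.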

Fig. 3: ArchR at massive scale
Feature: Details
• Scalability and Performance
  - Processes large datasets (>1 million cells) efficiently.
  - ~220,000 cells in under 3 hours on modest systems.
  - Simulated 1.2 million cells in under 8 hours.
• Dimensionality Reduction
  - Landmark-based Latent Semantic Indexing (LSI) for effective reduction.
• Clustering
  - In addition to broad clusters, can identify rare cell types such as plasma cells (~0.1% of the population).
• Peak Identification
  - Identified 215,916 differentially accessible peaks across the 21 clusters.
• Motif Enrichment Analysis
  - Side-by-side UMAPs of gene scores (left) and motif deviation scores.
• Integration with Bulk Data
  - Integrated bulk ATAC-seq data with single-cell UMAP embeddings.
• TF Deviation Analysis
  - Improved chromVAR implementation. Lines are colored by the 21 clusters shown in “c”.
chromVAR is a method for determining TF deviations. TFs whose expression is highly correlated with motif accessibility can therefore be identified from the correlation of the inferred gene score with the chromVAR motif deviation. This analysis identifies known drivers of hematopoietic differentiation, such as GATA1 and EOMES.
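The correlation step can be sketched with a plain Pearson coefficient. This is a minimal illustration, assuming per-cluster averages of a TF's gene score and of its motif deviation; the numbers are hypothetical and the use of Pearson correlation is itself an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-cluster averages for one TF
gene_scores = [0.1, 0.4, 0.9, 1.5]
motif_deviations = [-1.0, -0.2, 0.8, 2.0]
r = pearson(gene_scores, motif_deviations)
```

A TF whose inferred gene score rises and falls with the accessibility of its motif scores close to 1, which is the pattern used to nominate candidate drivers.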

S Fig. 10 The interactive ArchR genome browser in real time.
• CRE with active transcription
• Distal CRE

Supplementary Fig. 10 The interactive ArchR genome browser in real time.
A coverage plot in a genome accessibility track visualization represents the distribution of chromatin accessibility signals (sequencing reads) across a genomic region. Elements shown:
• Genomic coordinates
• Signal intensity or coverage
• Peaks
• Annotations, such as TSSs, exons, and introns
• Multiple tracks
• CRE with active transcription
• Distal CRE
• Typically associated with specific transcription factor binding sites
• Associated with super enhancers
• Co-accessibility relationships, or interaction links between different genomic regions

Genome accessibility track visualization of marker genes with peak co-accessibility
• CD34 genome track showing greater accessibility in earlier hematopoietic clusters (1–5, 7–8 and 12–13).
• CD14 genome track (chr5, 139,963,285–140,023,286) showing greater accessibility in earlier monocytic clusters (13–15).

Integration of scRNA-seq and scATAC-seq
• ArchR enables seamless integration of scRNA-seq and scATAC-seq data using Seurat.
• In ArchR, clustering is performed with the addClusters() function, which permits additional clustering parameters to be passed to Seurat::FindClusters().
• Clustering with Seurat::FindClusters() is deterministic: the exact same input always produces the exact same output.
• ArchR can also identify clusters with scran by changing the method parameter in addClusters().

Fig. 4: Integration of scATAC-seq and scRNA-seq data by ArchR
identifies gene regulatory trajectories of hematopoietic differentiation

Ext Fig. 1: Comparison of supported features from currently available scATAC-seq software
Cluster Peak Calling in Signac
• Signac integrates with external peak-calling tools, such as MACS2, to call peaks on aggregated data for each cluster.
• The workflow aggregates the single-cell data for each cluster into pseudo-bulk profiles and then calls peaks on these profiles.
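The pseudo-bulk aggregation step can be sketched as a per-cluster sum. This is a minimal illustration with hypothetical cell names, region labels and cluster assignments; it covers only the aggregation, and the peak calling itself would be handed to an external tool such as MACS2.

```python
from collections import defaultdict


def pseudo_bulk(cell_counts, cell_cluster):
    """Sum per-cell region counts into one profile per cluster.

    cell_counts: dict cell -> dict region -> count (assumed layout).
    cell_cluster: dict cell -> cluster label.
    """
    profiles = defaultdict(lambda: defaultdict(int))
    for cell, regions in cell_counts.items():
        cluster = cell_cluster[cell]
        for region, n in regions.items():
            profiles[cluster][region] += n
    return {cluster: dict(regions) for cluster, regions in profiles.items()}


# Hypothetical per-cell counts and cluster labels
cell_counts = {
    "cellA": {"chr1:0-500": 2, "chr1:500-1000": 1},
    "cellB": {"chr1:0-500": 1},
    "cellC": {"chr2:0-500": 4},
}
cell_cluster = {"cellA": "C1", "cellB": "C1", "cellC": "C2"}
profiles = pseudo_bulk(cell_counts, cell_cluster)
```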

ArchR, SnapATAC or Signac?
Common Feature: Description
• Data Normalization: Both use TF-IDF normalization to preprocess chromatin accessibility data.
• Dimensionality Reduction: Employ Latent Semantic Indexing (LSI) as the primary method.
• Peak Calling: Support identifying peaks of chromatin accessibility to study CREs.
• Integration with RNA data: Integrate with the Seurat framework.
• Clustering and Visualization: Provide tools for clustering and embedding in reduced dimensions (e.g., UMAP, t-SNE).
• Co-accessibility Analysis: Allow analysis of chromatin co-accessibility to identify interacting regulatory regions.
• Motif Analysis: Support motif enrichment analysis to identify TF binding sites in accessible regions.
• Open-source and R-based: Both are open-source and implemented in R.

ArchR and Signac: main differences?
Feature: ArchR vs. Signac
• Normalization and Dimensionality Reduction
  - ArchR: Iterative Latent Semantic Indexing (LSI) approach.
  - Signac: LSI with Term Frequency-Inverse Document Frequency (TF-IDF) normalization followed by Singular Value Decomposition (SVD).
• Scalability and Performance
  - ArchR: High scalability, capable of processing large datasets efficiently (over 1.2 million cells within 8 hours on standard hardware).
  - Signac: Effective for smaller datasets; it may face challenges with memory usage and speed when handling larger datasets.
• Downstream Analysis Features
  - ArchR: Includes doublet removal, unified peak set generation, cellular trajectory identification and TF footprinting.
  - Signac: Peak calling, motif analysis, and co-accessibility analysis, but may not encompass the full range of downstream analysis features available in ArchR.
• Community Support and Development
  - ArchR: Greenleaf Lab
  - Signac: Satija Lab

Thanks
