Cynthia SC
December 11, 2024

ArchR software package for integrative single-cell chromatin accessibility analysis

In this presentation I review ArchR software, a comprehensive and scalable R package designed for integrative single-cell chromatin accessibility analysis, excelling in handling large datasets and seamlessly integrating scATAC-seq with single-cell RNA-seq for multi-modal studies.
ArchR offers a user-friendly workflow with advanced features such as trajectory analysis, pseudo-bulk profiling, and high-quality visualizations. By comparison, Signac, which extends Seurat for small to medium datasets, also provides a unified RNA + ATAC workflow but is less efficient for large-scale analyses, while SnapATAC specializes in large-scale scATAC-seq data with efficient cell barcoding but lacks advanced visualization and integration capabilities. Choosing the ideal tool can be challenging for researchers who need both scalability and extensive multi-modal functionality; this deck aims to give a comprehensive overview.

Transcript

  1. Software suite for both routine and advanced analysis of massive-scale single-cell chromatin accessibility data without the need for high-performance computing environments.

    • Includes implementations for the 10x Genomics Chromium system, the Bio-Rad droplet scATAC-seq system, single-cell combinatorial indexing and the Fluidigm C1 system.
  2. Main ArchR features

    1. Schematic file structure called Arrow
    2. Data integration and analysis: normalization, dimensionality reduction, and clustering based on chromatin accessibility patterns
    3. Data visualization and integration with other data types, such as scRNA-seq
    4. Performance optimization: designed to handle datasets with millions of cells
  3. • In addition to using fragment files as input, ArchR can directly convert BAM files to Arrow files, enabling analysis of data from diverse single-cell platforms.

    What is an Arrow file?
  4. S Fig. 1 – Schematics of file infrastructure and information access

    Key Features of an Arrow File
    • Designed for scalability and speed
    • Facilitates random access to specific data components
    • Supports merging of data from multiple samples

    Data Components in an Arrow File
    • Gene Activity Matrix: links accessibility to gene expression by aggregating signals near genes.
    • Tile Matrix: accessibility aggregated across genomic tiles (e.g., 500 bp intervals).
    • Peak Matrix: accessibility at specific peaks.
    • Metadata: includes cell-level annotations, quality control metrics, and other attributes.

    createArrowFiles() operates on each chromosome independently and in parallel. Matrix access reads a subset of each chromosome’s matrix from each Arrow file in parallel; the subsets are then merged.
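To make the tile matrix idea concrete, here is a minimal sketch of binning fragments into 500 bp tiles per cell. It is not ArchR's implementation: the function name `tile_counts`, the tuple layout of a fragment, and the choice to count each fragment end as one insertion are assumptions made for illustration.

```python
from collections import defaultdict

TILE_SIZE = 500  # bp, matching the 500 bp example on the slide


def tile_counts(fragments, tile_size=TILE_SIZE):
    """Count insertions per (barcode, chromosome, tile index).

    `fragments` is an iterable of (chrom, start, end, barcode) tuples.
    Assumption: each fragment end is counted as one insertion.
    """
    counts = defaultdict(int)
    for chrom, start, end, barcode in fragments:
        for position in (start, end):
            tile_index = position // tile_size
            counts[(barcode, chrom, tile_index)] += 1
    return dict(counts)


# Hypothetical toy fragments
fragments = [
    ("chr1", 100, 450, "cellA"),    # both ends fall in tile 0
    ("chr1", 480, 720, "cellA"),    # ends fall in tiles 0 and 1
    ("chr1", 1010, 1200, "cellB"),  # both ends fall in tile 2
]
matrix = tile_counts(fragments)
```

Storing only the non-zero (cell, tile) entries keeps the structure sparse, which is the property that makes genome-wide tiling workable at scale.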
  5. S Fig. 2 – Quality control metrics for PBMC, bone marrow, and mouse atlas datasets

    (1) peripheral blood mononuclear cells (PBMCs), which represent discrete primary cell types
    (2) bone marrow stem and progenitor cells and differentiated cells, which represent a continuous cellular hierarchy
    (3) a large atlas of murine cell types from diverse organ systems

    (k) QC filtering plots for each individual organ type from the mouse atlas dataset, showing the TSS enrichment score vs. unique nuclear fragments per cell. Dot color represents the density, in arbitrary units, of points in the plot.
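The QC plots above place each cell by TSS enrichment score and number of unique nuclear fragments, and cells are kept or dropped by thresholding both axes. A minimal sketch of that filter follows. The cutoff values, the field names, and the function name `passes_qc` are assumptions for illustration and are not taken from the slides.

```python
def passes_qc(cell, min_tss=4.0, min_frags=1000):
    """Keep a cell only if both QC metrics clear their cutoffs.

    Assumption: the cutoffs are illustrative placeholders.
    """
    return cell["tss_enrichment"] >= min_tss and cell["n_frags"] >= min_frags


# Hypothetical per-cell QC records
cells = [
    {"barcode": "cellA", "tss_enrichment": 9.5, "n_frags": 12000},
    {"barcode": "cellB", "tss_enrichment": 2.1, "n_frags": 15000},  # low signal-to-background
    {"barcode": "cellC", "tss_enrichment": 8.0, "n_frags": 300},    # too few fragments
]
kept = [cell["barcode"] for cell in cells if passes_qc(cell)]
```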
  6. ArchR outperforms SnapATAC and Signac in speed and memory usage across all comparisons, enabling analysis of 70,000-cell datasets in under 1 hour with 32 GB of RAM and 8 cores.

    • Fig. 1 (b), (c)
    • S Fig. 3 – Benchmarking of SnapATAC performance
    • Ex Fig. 2 – Benchmarking comparisons of runtime and memory usage for ArchR, Signac, and SnapATAC
  7. Putative doublets in scATAC-seq data

    Fig. 1. Doublet detection is employed similarly to methods for doublet detection in scRNA-seq.
    • 38k cells across 2 replicates, representing 10 different cell lines.
    • Synthetic doublets are generated, projected onto the UMAP, and matched to their nearest neighbors (KNN) to highlight regions enriched for potential doublets.

    Ex Fig. 4.
    • Doublets present (black, N = 10,887)
    • ArchR-identified doublets removed (N = 9,702)
    • Some doublets predicted by demuxlet were not identified by ArchR; these resided within cluster boundaries rather than in intermediate zones.
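The synthetic-doublet idea on this slide can be sketched on a toy 2D embedding. This is a simplification under stated assumptions, not ArchR's procedure: synthetic doublets are formed here by averaging the embedding coordinates of two cells (rather than mixing reads and re-projecting), every pair is enumerated instead of sampled, and the function name `doublet_scores` is hypothetical.

```python
from itertools import combinations
from math import dist


def doublet_scores(embedding, k=1):
    """Count, per cell, how many synthetic doublets have it among their k nearest cells."""
    scores = [0] * len(embedding)
    for i, j in combinations(range(len(embedding)), 2):
        # Synthetic doublet: midpoint of two real cells (an illustrative simplification)
        synthetic = tuple((a + b) / 2 for a, b in zip(embedding[i], embedding[j]))
        # Rank real cells by distance to the synthetic doublet; ties broken by index
        ranked = sorted(
            range(len(embedding)),
            key=lambda c: (dist(embedding[c], synthetic), c),
        )
        for neighbor in ranked[:k]:
            scores[neighbor] += 1
    return scores


# Two hypothetical clusters plus one cell sitting between them (index 8)
embedding = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # cluster A
    (10, 0), (10, 1), (11, 0), (11, 1),      # cluster B
    (5.5, 0.5),                              # intermediate cell
]
scores = doublet_scores(embedding, k=1)
suspect = scores.index(max(scores))
```

The intermediate cell collects every cross-cluster synthetic doublet, which is why it scores highest. The same logic explains the slide's last point: a doublet sitting inside a cluster boundary gets no such enrichment and can be missed.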
  8. Identification of a peak set

    In the context of scATAC-seq, identification of peak regions before cluster identification requires peak calling from all cells as a single group.

    *Although SnapATAC can use a genome-wide 500-bp tile matrix, downstream computation using this high-resolution matrix exceeds the memory limits of common computational infrastructure.
  9. Dimensionality reduction and clustering

    • Latent semantic indexing (LSI) = Signac
    • Landmark diffusion maps (LDM) = SnapATAC
    • Optimized iterative LSI = ArchR

    LSI with TF-IDF (Standard Method)
    • Preprocessing: uses TF-IDF (Term Frequency-Inverse Document Frequency) normalization to weight features (regions) based on their importance across cells. TF-IDF emphasizes features that are frequent in specific cells but infrequent across the dataset.
    • Dimensionality Reduction: applies Singular Value Decomposition (SVD) to the TF-IDF matrix, capturing major patterns of variability while suppressing noise. Runs once, directly on the preprocessed data, without refinement of feature selection.

    Iterative LSI (Refined Method)
    • Feature Selection: starts with an initial round of LSI, followed by an iterative refinement of the features (regions, peaks, terms) used in the analysis.
    • Iterative Workflow: after each iteration, (1) identifies the features contributing the most meaningful variability, (2) refines the set of features included in subsequent rounds of LSI, and (3) re-applies LSI with the updated features.
    • The iteration helps eliminate noise and non-informative features. Iterative LSI dynamically adjusts to the data and can improve clustering, visualization, and biological insights.
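The TF-IDF preprocessing step described above can be sketched in a few lines. This covers only the weighting; the SVD that follows it is omitted and would normally come from a linear algebra library. The IDF form used here, log(1 + N / n), is one common variant and is an assumption, as is the function name `tf_idf`; neither ArchR nor Signac is claimed to use exactly this formula.

```python
from math import log


def tf_idf(counts):
    """Weight a cells-by-features count matrix (list of rows) with TF-IDF.

    TF  = feature count / total counts in the cell
    IDF = log(1 + number of cells / number of cells where the feature is accessible)
    """
    n_cells = len(counts)
    n_features = len(counts[0])
    cells_with_feature = [
        sum(1 for row in counts if row[f] > 0) for f in range(n_features)
    ]
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([
            (row[f] / total) * log(1 + n_cells / cells_with_feature[f])
            if total > 0 and cells_with_feature[f] > 0
            else 0.0
            for f in range(n_features)
        ])
    return weighted


# Hypothetical toy matrix: 2 cells x 2 regions
counts = [
    [2, 0],  # cell 0: only region 0 is accessible
    [1, 1],  # cell 1: both regions are accessible
]
weights = tf_idf(counts)
```

Region 1 is accessible in only one cell, so it receives the larger IDF, which is the "frequent in specific cells but infrequent across the dataset" behavior named on the slide.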
  10. Dimensionality reduction and clustering, Ex Fig. 5 (d)

    • Latent semantic indexing (LSI) = Signac
    • Optimized iterative LSI = ArchR
    • Landmark diffusion maps (LDM) = SnapATAC

    t-SNE of bulk ATAC-seq data from hematopoietic cells (N = 7,200), downsampled to various data quality scales:
    • Low-quality: ~1,000 fragments/cell
    • Medium-quality: ~5,000 fragments/cell
    • High-quality (right): ~10,000 fragments/cell
  11. S Fig. 4 – Comparison of clustering results in scATAC-seq data derived from PBMCs

    In both cases, ArchR identified clusters similar to those in other methods while being less biased by low-quality cells and doublets. However, when comparing clustering of the bone marrow cell dataset, ArchR alone maintained the structure of the continuous differentiation trajectories from immature CD34+ hematopoietic stem and progenitor cells through differentiated myeloid, erythroid and B cells.
  12. S Fig. 7 – Comparison of estimated LSI in ArchR and estimated LDM in SnapATAC in bone marrow cells.
  13. What about Signac vs ArchR?

    • LSI with TF-IDF: works well on smaller, cleaner datasets where the initial feature selection captures meaningful biological signals.
    • Iterative LSI: better for large, noisy datasets or when a more refined analysis is needed, such as identifying subtle subpopulations or resolving complex chromatin accessibility patterns.
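What separates iterative LSI from a single pass is the feature refinement between rounds. Below is a minimal sketch of one such refinement step: given cluster labels from a previous round, keep the features whose mean accessibility varies most across clusters. Ranking by variance of cluster means and the function name `most_variable_features` are assumptions for illustration, and the LSI and clustering rounds themselves are not shown.

```python
from statistics import mean, pvariance


def most_variable_features(matrix, clusters, n_top):
    """Return indices of the n_top features that vary most across cluster means."""
    labels = sorted(set(clusters))
    n_features = len(matrix[0])
    variances = []
    for f in range(n_features):
        cluster_means = [
            mean(row[f] for row, label in zip(matrix, clusters) if label == c)
            for c in labels
        ]
        variances.append(pvariance(cluster_means))
    # Highest variance first; ties broken by feature index
    ranked = sorted(range(n_features), key=lambda f: (-variances[f], f))
    return ranked[:n_top]


# Hypothetical toy data: 4 cells x 3 features, clusters from a previous round
matrix = [
    [1, 5, 0],
    [1, 4, 0],
    [1, 0, 1],
    [1, 1, 1],
]
clusters = [0, 0, 1, 1]
selected = most_variable_features(matrix, clusters, n_top=2)
```

Feature 0 is equally accessible everywhere and is dropped, which is the sense in which iteration removes non-informative features before the next round.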
  14. Fig. 2: Optimized gene score inference models improve prediction of gene expression from scATAC-seq data

    • Gene scores, which represent inferred gene expression, are critical for annotating biological states in clusters.
    • Previous methods for deriving gene scores were not extensively optimized, so the ArchR authors benchmarked 56 models using matched scATAC-seq and scRNA-seq datasets from PBMCs and bone marrow cells. The models varied by the regions included, the sizes of those regions and the weights (based on genomic distance) applied to each region.
    • The benchmark used the canonical correlation analysis-based integration implemented in Seurat.
    • ArchR now uses the optimized gene score model (Model 42) for all downstream analyses, enabling more accurate cluster annotation and biological state identification.
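A distance-weighted gene score of the kind benchmarked above can be sketched as a weighted sum of accessibility near a gene. This is an illustrative sketch only and is not claimed to be Model 42: the exponential decay form, the 5 kb scale, the 100 kb window, and the function names are all assumptions.

```python
from math import exp


def distance_weight(distance_bp, scale_bp=5000):
    """Weight that decays with distance from the gene (assumed exponential form)."""
    return exp(-abs(distance_bp) / scale_bp) + exp(-1)


def gene_score(tiles, gene_start, max_distance_bp=100_000):
    """Sum accessibility over nearby tiles, weighted by distance to the gene start.

    `tiles` is a list of (tile_center_bp, accessibility) pairs.
    """
    score = 0.0
    for center, accessibility in tiles:
        distance = center - gene_start
        if abs(distance) <= max_distance_bp:
            score += accessibility * distance_weight(distance)
    return score


# Hypothetical tiles: one at the gene start, one far outside the window
tiles = [(1000, 2.0), (500_000, 10.0)]
score = gene_score(tiles, gene_start=1000)
```

Varying the window size and the weight function is exactly the kind of knob the 56 benchmarked models differ in.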
  15. Fig. 3: ArchR at massive scale

    Feature: Details
    • Scalability and Performance: processes large datasets (>1 million cells) efficiently; ~220,000 cells in under 3 hours on modest systems; simulated 1.2 million cells in under 8 hours.
    • Dimensionality Reduction: landmark-based Latent Semantic Indexing (LSI) for effective reduction.
    • Clustering: in addition to identifying broad clusters, can identify rare cell types such as plasma cells (~0.1% of the population).
    • Peak Identification: identified 215,916 differentially accessible peaks across the 21 clusters.
    • Motif Enrichment Analysis: side-by-side UMAPs of gene scores (left) and motif deviation scores.
    • Integration with Bulk Data: integrated bulk ATAC-seq data with single-cell UMAP embeddings.
    • TF Deviation Analysis: improved chromVAR implementation. Lines are colored by the 21 clusters shown in “c”.

    chromVAR is a method for determining TF deviations. TFs whose expression is highly correlated with motif accessibility can therefore be identified based on the correlation of the inferred gene score to the chromVAR motif deviation. This analysis identifies known drivers of hematopoietic differentiation, such as GATA1 and EOMES.
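The last point, finding TFs whose inferred gene score tracks their motif deviation, comes down to a per-TF correlation. Here is a minimal sketch using Pearson correlation across clusters. The 0.5 cutoff, the TF names, the values, and the function names are hypothetical and serve only to illustrate the idea.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_tfs(gene_scores, motif_deviations, min_corr=0.5):
    """Names of TFs whose gene score correlates with motif deviation above min_corr."""
    return sorted(
        tf
        for tf in gene_scores
        if tf in motif_deviations
        and pearson(gene_scores[tf], motif_deviations[tf]) > min_corr
    )


# Hypothetical per-cluster values for two placeholder TFs
gene_scores = {"TF_A": [1, 2, 3, 4], "TF_B": [1, 2, 3, 4]}
motif_deviations = {"TF_A": [2, 4, 6, 8], "TF_B": [4, 3, 2, 1]}
candidates = correlated_tfs(gene_scores, motif_deviations)
```

A TF like the placeholder TF_B, whose motif becomes less accessible as its gene score rises, is excluded by the positive cutoff.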
  16. S Fig. 10 – The interactive ArchR genome browser in real time

    • CRE with active transcription
    • Distal CRE
  17. Supplementary Fig. 10 – The interactive ArchR genome browser in real time

    A coverage plot in genome accessibility track visualization represents the distribution of chromatin accessibility signals (sequencing reads) across a genomic region.
    • Genomic coordinates
    • Signal intensity or coverage
    • Peaks
    • Annotations, like TSS, exons, and introns
    • Multiple tracks

    • CRE with active transcription
    • Distal CRE
    • Typically associated with specific transcription factor binding sites.
    • Associated with super enhancers
    • Co-accessibility relationships or interaction links between different genomic regions
  18. Genome accessibility track visualization of marker genes with peak co-accessibility

    • CD34 genome track showing greater accessibility in earlier hematopoietic clusters (1–5, 7–8 and 12–13).
    • CD14 genome track (chr5, 139,963,285–140,023,286) showing greater accessibility in earlier monocytic clusters (13–15).
  19. Integration of scRNA-seq and scATAC-seq

    • ArchR enables seamless integration of scRNA-seq and scATAC-seq data using Seurat.
    • In ArchR, clustering is performed using the addClusters() function, which permits additional clustering parameters to be passed to the Seurat::FindClusters() function.
    • Clustering using Seurat::FindClusters() is deterministic, meaning that the exact same input will always result in the exact same output.
    • ArchR also allows clusters to be identified with scran by changing the method parameter in addClusters().
  20. Fig. 4: Integration of scATAC-seq and scRNA-seq data by ArchR identifies gene regulatory trajectories of hematopoietic differentiation
  21. Ext Fig. 1: Comparison of supported features from currently available scATAC-seq software

    Cluster Peak Calling in Signac
    Signac integrates with external peak-calling tools, such as MACS2, to perform peak calling on aggregated data for each cluster. The workflow involves aggregating single-cell data for each cluster into pseudo-bulk profiles and then calling peaks on these profiles.
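The pseudo-bulk step in this workflow is a per-cluster sum of single-cell counts. A minimal sketch follows; the peak calling itself (for example with MACS2) is not shown, and the function name `pseudo_bulk`, the data layout, and the toy values are assumptions for illustration.

```python
from collections import defaultdict


def pseudo_bulk(counts_by_cell, cluster_by_cell):
    """Sum per-cell region counts into one profile per cluster."""
    profiles = defaultdict(lambda: defaultdict(int))
    for cell, region_counts in counts_by_cell.items():
        cluster = cluster_by_cell[cell]
        for region, count in region_counts.items():
            profiles[cluster][region] += count
    return {cluster: dict(regions) for cluster, regions in profiles.items()}


# Hypothetical sparse counts per cell and cluster assignments
counts_by_cell = {
    "cellA": {"chr1:0-500": 2, "chr1:500-1000": 1},
    "cellB": {"chr1:0-500": 1},
    "cellC": {"chr1:500-1000": 4},
}
cluster_by_cell = {"cellA": "C1", "cellB": "C1", "cellC": "C2"}
profiles = pseudo_bulk(counts_by_cell, cluster_by_cell)
```

Pooling cells this way gives each cluster enough depth for a peak caller designed for bulk data, which is why the aggregation comes before peak calling.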
  22. ArchR, SnapATAC or Signac?

    Common Feature
    • Data Normalization: both use TF-IDF normalization to preprocess chromatin accessibility data.
    • Dimensionality Reduction: employ Latent Semantic Indexing (LSI) as the primary method.
    • Peak Calling: support identifying peaks of chromatin accessibility to study CREs.
    • Integration with RNA data: integrate with the Seurat framework.
    • Clustering and Visualization: provide tools for clustering and embedding in reduced dimensions (e.g., UMAP, t-SNE).
    • Co-accessibility Analysis: allow analysis of chromatin co-accessibility to identify interacting regulatory regions.
    • Motif Analysis: support motif enrichment analysis to identify TF binding sites in accessible regions.
    • Open-source and R-based: both are open-source and implemented in R.
  23. ArchR and Signac: main differences?

    Normalization and Dimensionality Reduction
    • ArchR: iterative Latent Semantic Indexing (LSI) approach.
    • Signac: LSI with Term Frequency-Inverse Document Frequency (TF-IDF) normalization followed by Singular Value Decomposition (SVD).

    Scalability and Performance
    • ArchR: high scalability, capable of processing large datasets efficiently (over 1.2 million cells within 8 hours on standard hardware).
    • Signac: effective for smaller datasets; it may face challenges with memory usage and speed when handling larger datasets.

    Downstream Analysis Features
    • ArchR: includes doublet removal, unified peak set generation, cellular trajectory identification and TF footprinting.
    • Signac: peak calling, motif analysis, and co-accessibility analysis, but may not encompass the full range of downstream analysis features available in ArchR.

    Community Support and Development
    • ArchR: Greenleaf Lab
    • Signac: Satija Lab