Effective and Comparative Methods for Single-Cell Embedding Visualizations

November 13, 2024
Visual Analytics Lab at Tufts University
Fritz Lekschas, Head of Visualization Research at Ozette Technologies
lekschas.de
linkedin.com/in/flekschas

MASSIVE SHOUT OUTS!
Trevor Manz, PhD Candidate at HMS: first author of CEV paper and former Ozette intern
Evan Greene, Ozette Co-Founder: first author and creator of data transformation methods
Nezar Abdennur, Asst. Prof. at UMASS MED: long-term collaborator and embedding nerd
Arpan Neupane, Principal Computational Biologist: helps me better understand immunology

EDUCATION: PhD '21 in CS from Harvard University; MSc '16 in Bioinformatics from Freie Universität Berlin
RESEARCH: Visualization, Human-Centered ML, Design
WORK: Head of Visualization Research at Ozette

Data-Driven Discovery of High-Resolution and Interpretable Cell Phenotypes in Single-Cell Cytometry Data

Proteins (a.k.a. features)

From: General Cell Types
To: Well-Resolved High-Resolution Cell Phenotypes
Cell types shown: Cytotoxic T Cells, T Helper Cells, B Cells, Naïve T Cells
Data from Mair et al., 2022, Nature.

Healthy Tissue vs. Cancer Tissue (Cytotoxic T Cells, T Helper Cells, B Cells, Naïve T Cells). Data from Mair et al., 2022, Nature.

Single-Cell Embeddings
FEATURES: Cell-Surface Antibodies (Greene et al., 2021, Patterns.)
FEATURES: Chromatin Accessibility Peaks (Granja et al., 2020, Nature Biotechnology.)
FEATURES: Genes (Tabula Sapiens Consortium, 2022, Science.)

Why Visualize Embeddings?
OVERVIEW: Broad distribution & cell heterogeneity; Hypothesis Generation
COMPARE: Relative similarity of cell populations; Trajectory Analysis
CLUSTER: Identify cell types/phenotypes; Annotate clusters

Examples: Sheih et al., 2020, Nature Communications. Tabula Sapiens Consortium, 2022, Science (CD4+). Greene et al., 2021, Patterns (CD4, CD3, PD-1, and HLADR expression: CD4+, CD3+, PD-1+, HLADR+).

Visualization Challenges
CLUSTER RESOLUTION: Focus on general or specific cellular phenotypes? (CD8+ T Cells, T Helper Cells, B Cells, Naive T Cells)
SAMPLE COMPARISON: How to handle batch effects and aligning embeddings? (Sample A vs. Sample B)
EXPLORATION VS EXPLANATION: Is the visualization a representation of the clustering?

How can we create Ozette’s embedding plot?
How can we compare embedding plots?


How can we create Ozette’s embedding plot?

FAUST Annotation + Clustering
ANNOTATE: Define expression levels, e.g., Positive / Negative
Fully interpretable clusters
Greene et al., 2021, Patterns.

Data Transformation
FOR EACH PHENOTYPE:
1. Remove outlier expression values: winsorize to the [1st, 99th] percentile
2. Remove inter-marker differences: normalize to zero mean and unit variance
3. Align marker expressions by their expression level: translate the mean to a fixed value

Figure panels (example phenotype CD3+ CD4+ CD8-): 0. Raw Expression, 1. Winsorized Expression, 2. Normalized Expression, 3. Translated Expression
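The three per-phenotype steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used at Ozette: the function name `transform_phenotype`, the `LEVEL_TARGETS` values, and the choice to compute percentiles per marker are all assumptions made for the example.

```python
import numpy as np

# Hypothetical target means per annotated expression level (assumed values,
# not taken from the talk).
LEVEL_TARGETS = {"-": 0.0, "+": 2.0}


def transform_phenotype(expr, levels, lower=1.0, upper=99.0):
    """Transform the cells of ONE phenotype.

    expr   : (n_cells, n_markers) raw expression values
    levels : per-marker annotation, e.g. ["+", "+", "-"] for CD3+ CD4+ CD8-
    """
    expr = np.asarray(expr, dtype=float)

    # 1. Winsorize each marker to its [1st, 99th] percentile.
    lo, hi = np.percentile(expr, [lower, upper], axis=0)
    out = np.clip(expr, lo, hi)

    # 2. Normalize each marker to zero mean and unit variance.
    std = out.std(axis=0)
    std[std == 0] = 1.0  # guard against constant markers
    out = (out - out.mean(axis=0)) / std

    # 3. Translate each marker's mean to the fixed value of its level.
    targets = np.array([LEVEL_TARGETS[level] for level in levels])
    return out + targets
```

Applied to every phenotype in turn, markers with the same annotated level end up centered on the same value, which is what aligns them before the embedding is computed.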

Tumor sample 6 from Mair et al., 2022, Nature.
Figure slides: UMAP Embedding (Untransformed vs. Transformed); t-SNE Embedding; VAE Embedding; VAE Embedding (Winsorized vs. Transformed); Cluster Coherence

CD38 Expression Difference
CD38-: CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38- CD127- Tim3-
CD38+: CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38+ CD127- Tim3-
“Our study suggest that increased CD38 expression defines tumor-infiltrating CD8+ T cells been pre-activated …” Wu et al., 2021, Cancer Immunology, Immunotherapy.

Joint Embedding (Untransformed vs. Transformed)
Samples: Tumor 27, Tissue 138. Data from Mair et al., 2022, Nature.
Highlighted phenotype (Mair et al., 2022, Nature): CD8- CD4+ CD45RA- CD27+ CD103- CD69- CD28+ HLADR+ GranzymeB- PD1+ CD25+ ICOS+ TCRgd- CD38+ Tim3+

SEMI-CONCLUSION
• “Tune” the data and the embedding method
• Use a data transformation close to your objective
• The annotation transformation is not bound to FAUST


How can we compare embedding plots?

SAME DATA, different random seeds (Seed 42 vs. Seed 123). Data from Mair et al., 2022, Nature.
High visual similarity, but low Jaccard similarity.
Figure: cumulative probability of the point-wise Jaccard similarity in kNN graphs, for different set sizes.
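To make the point-wise comparison concrete, the sketch below computes, for every point, the Jaccard similarity of its k nearest neighbors in two embeddings of the same data. It is a minimal illustration using SciPy's `cKDTree`; the function names and the default `k=15` are assumptions for the example, and it assumes no duplicate points (so each point's nearest hit is itself).

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_sets(points, k):
    """Indices of the k nearest neighbors of every point (self excluded)."""
    tree = cKDTree(points)
    # Query k + 1 neighbors because the closest hit is the point itself.
    _, idx = tree.query(points, k=k + 1)
    return idx[:, 1:]


def pointwise_jaccard(emb_a, emb_b, k=15):
    """Jaccard similarity of each point's kNN sets in two embeddings.

    Row i of emb_a and row i of emb_b must describe the same cell.
    """
    nn_a = knn_sets(np.asarray(emb_a, dtype=float), k)
    nn_b = knn_sets(np.asarray(emb_b, dtype=float), k)
    sims = np.empty(len(nn_a))
    for i, (a, b) in enumerate(zip(nn_a, nn_b)):
        shared = len(np.intersect1d(a, b))
        sims[i] = shared / (2 * k - shared)  # |A ∩ B| / |A ∪ B|
    return sims
```

Sorting the returned values and plotting them against their rank fraction gives a cumulative distribution like the one on the slide; repeating this for several values of `k` shows how the result depends on the set size.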

TUMOR vs. TISSUE. Data from Mair et al., 2022, Nature.
How can we facilitate more effective and systematic comparisons of these complex 2D scatters?
Challenge: establish meaningful relationships between points in different views

Class-based comparison
• Compare groups of points rather than individual points
• Flexible comparisons at various abstraction levels
• Key considerations:
  • Intermixing / separation
  • Similarity / cohesion of neighbor groups
  • Shifts in relative size (for data comparison)

Where do class labels come from?
• External metadata (e.g., ground truth)
• Unsupervised methods (e.g., clustering algorithms)
• Can be hierarchical (animal: ..., fruit: ...)

Embedding comparison properties: Confusion, Neighborhood, Size (example labels: Orange, Plum, Lime, Blueberry)
Confusion: the degree of intermixing between points of the same label and others.
Neighborhood stability: the degree to which local neighbors are shared between visualizations.
Size: the change in relative class-label sizes with respect to the neighborhood.
Figure panels: Core, Context, Combined
Example comparison: Confusion: Less; Neighborhood: Same; Size: Decreased

Methodology
• Create Delaunay graph
• For each label: conduct a breadth-first search for every point with that label
• Points within one hop count toward the label's confusion set
• Points at 1+ hops that are not in the confusion set count toward the neighborhood set

Figure steps (for the label Yellow): Candidate Confusion Set; Confusion Distance Adjustment (Distance Cutoff); Final Confusion Set; Neighborhood Set
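The graph traversal can be sketched with SciPy's Delaunay triangulation. This is a simplified illustration under stated assumptions, not the `cev` implementation: the function names, the `max_hops` limit, and the single `cutoff` on edge length (standing in for the confusion distance adjustment) are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay


def delaunay_adjacency(points):
    """Neighbor sets of every point in the Delaunay graph."""
    indptr, indices = Delaunay(points).vertex_neighbor_vertices
    return [set(indices[indptr[i]:indptr[i + 1]].tolist())
            for i in range(len(points))]


def confusion_and_neighborhood(points, labels, target,
                               max_hops=2, cutoff=np.inf):
    """Return (confusion set, neighborhood set) of point indices for `target`.

    cutoff   : hypothetical distance cutoff; a one-hop candidate only joins
               the confusion set if its edge to a target point is this short
    max_hops : hypothetical limit on how far the breadth-first search expands
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    adj = delaunay_adjacency(points)
    seeds = set(np.flatnonzero(labels == target).tolist())

    # Hop 1: differently labeled points adjacent to a target point.
    confusion, frontier = set(), set()
    for s in seeds:
        for n in adj[s] - seeds:
            frontier.add(n)
            if np.linalg.norm(points[s] - points[n]) <= cutoff:
                confusion.add(n)

    # Candidates rejected by the cutoff fall back to the neighborhood set.
    neighborhood = frontier - confusion
    visited = seeds | frontier

    # Hops 2..max_hops: expand breadth-first; new points are neighborhood.
    for _ in range(max_hops - 1):
        frontier = set().union(*(adj[i] for i in frontier)) - visited
        neighborhood |= frontier
        visited |= frontier
    return confusion, neighborhood
```

Counting the labels inside each returned set gives per-label tallies that a confusion or neighborhood score could be built from; how those tallies are weighted is left open here.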

Methodology: Neighborhood connectivity-based adjustment
Scale neighborhood strength of each neighboring label by:
1. Average number of connections between all labels
2. Average distances of connections between all labels
Example for Yellow: 5 connections to blue, 2 connections to gray, 1 connection to green and purple
Neighborhood Likelihoods for Yellow: 1.0, 0.3, 0.6, 0.8

pip install cev

Figure: embeddings colored by label vs. colored by metric

In summary
• We compare embedding visualizations based on class labels
• Class labels can be defined dynamically and at different levels of abstraction
• Addresses limitations of traditional point-based methods
• Evaluation study: Guided comparisons increase confidence in findings
• Note: Intended to complement embedding quality assessment methods

Thanks!
What’s Next?
How to compare more than two embedding plots?
How to conditionally and dynamically balance feature importance?
How to dynamically and continuously adjust local vs global patterns?
