November 13, 2024
Effective and Comparative
Methods for Single-Cell
Embedding Visualizations
Fritz Lekschas
Head of Visualization Research at Ozette Technologies
lekschas.de
linkedin.com/in/flekschas
Visual Analytics Lab at Tufts University
Slide 2
Slide 2 text
MASSIVE SHOUT OUTS!
Trevor Manz, PhD Candidate at HMS
First author of CEV paper and former Ozette intern
Evan Greene, Ozette Co-Founder
First author and creator of data transformation methods
Nezar Abdennur, Asst. Prof. at UMASS MED
Long-term collaborator and embedding nerd
Arpan Neupane, Principal Computational Biologist
Helps me better understand immunology
Slide 3
Slide 3 text
EDUCATION
PhD '21 in CS from Harvard University
MSc '16 in Bioinformatics from Freie Universität Berlin
RESEARCH
Visualization, Human-Centered ML, Design
WORK
Head of Visualization Research at Ozette
Slide 4
Slide 4 text
Data-Driven Discovery of
High-Resolution and Interpretable
Cell Phenotypes in
Single-Cell Cytometry Data
Slide 5
Slide 5 text
a.k.a. features
Proteins
Slide 6
Slide 6 text
Data from Mair et al., 2022. Nature.
From: General Cell Types
To: High-Resolution Cell Phenotypes
To: Well-Resolved High-Resolution Cell Phenotypes
Cytotoxic T Cells
T Helper Cells
B Cells
Naïve T Cells
Slide 7
Slide 7 text
Data from Mair et al., 2022. Nature.
Cytotoxic T Cells
T Helper Cells
B Cells
Naïve T Cells
Healthy Tissue Cancer Tissue
Slide 8
Slide 8 text
Single-Cell Embeddings
Greene et al., 2021, Patterns. Granja et al., 2020, Nature Biotechnology.
FEATURES
Chromatin Accessibility Peaks
FEATURES
Cell-Surface Antibodies
FEATURES
Genes
Tabula Sapiens Consortium, 2022, Science.
Visualization Challenges
CLUSTER RESOLUTION
Focus on general or specific cellular phenotypes?
SAMPLE COMPARISON
How to handle batch effects and align embeddings?
vs
CD8+ T Cells
T Helper Cells
B Cells
Naive T Cells
Sample B
Sample A
Slide 14
Slide 14 text
EXPLORATION VS EXPLANATION
Is the visualization a representation of the clustering?
vs
CD8+ T Cells
T Helper Cells
B Cells
Naive T Cells
Slide 15
Slide 15 text
How can we create Ozette’s embedding plot?
How can we compare embedding plots?
Slide 16
Slide 16 text
Cytotoxic T Cells
T Helper Cells
B Cells
Naïve T Cells
Healthy Tissue Cancer Tissue
Data from Mair et al., 2022. Nature.
Data Transformation
FOR EACH PHENOTYPE:
1. Remove outlier expression values
Winsorize to the [1st, 99th] percentile range
2. Remove inter-marker differences
Normalize to zero mean and unit variance
3. Align marker expressions by their expression level
Translate mean to a fixed value
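The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `transform_phenotype`, the `target_means` parameter, and the example target values are assumptions made for this sketch. The input is assumed to hold only the cells of one phenotype, with one column per marker.

```python
import numpy as np


def transform_phenotype(expr, target_means, low_pct=1.0, high_pct=99.0):
    """Winsorize, standardize, and translate the markers of ONE phenotype.

    expr: (n_cells, n_markers) expression values of the phenotype's cells.
    target_means: per-marker fixed mean to translate to (hypothetical values,
        e.g. a higher value for "+" markers than for "-" markers).
    """
    expr = np.asarray(expr, dtype=float)

    # 1. Winsorize: clip each marker to its [1st, 99th] percentile range.
    lo, hi = np.percentile(expr, [low_pct, high_pct], axis=0)
    clipped = np.clip(expr, lo, hi)

    # 2. Normalize each marker to zero mean and unit variance.
    std = clipped.std(axis=0)
    std[std == 0] = 1.0  # guard against constant markers
    z = (clipped - clipped.mean(axis=0)) / std

    # 3. Translate each marker's mean to its fixed target value.
    return z + np.asarray(target_means, dtype=float)


# Usage with synthetic data: three markers on very different scales.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 3)) * [1, 5, 10] + [3, 50, -2]
transformed = transform_phenotype(raw, target_means=[2.0, 2.0, 0.0])
```

After the transformation every marker has unit variance, so the only remaining per-marker difference within the phenotype is the mean chosen in step 3.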
Slide 26
Slide 26 text
Data Transformation
[Figure panels, shown for markers CD3+, CD4+, CD8-]
0. Raw Expression
1. Winsorized Expression
2. Normalized Expression
3. Translated Expression
Slide 27
Slide 27 text
Untransformed Transformed
Tumor sample 6 from Mair et al., 2022, Nature.
UMAP Embedding
Slide 28
Slide 28 text
Untransformed Transformed
Tumor sample 6 from Mair et al., 2022, Nature.
t-SNE Embedding
Slide 29
Slide 29 text
Untransformed Transformed
Tumor sample 6 from Mair et al., 2022, Nature.
VAE Embedding
Slide 30
Slide 30 text
Winsorized Transformed
Tumor sample 6 from Mair et al., 2022, Nature.
VAE Embedding
Slide 31
Slide 31 text
Untransformed Transformed
Tumor sample 6 from Mair et al., 2022, Nature.
Cluster Coherence
Untransformed Transformed
Tumor sample 6 from Mair et al., 2022, Nature.
CD38 Expression Difference
CD38-
CD38+
CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38- CD127- Tim3-
CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38+ CD127- Tim3-
“Our study suggest that increased CD38 expression defines
tumor-infiltrating CD8+ T cells been pre-activated …”
Wu et al., 2021, Cancer Immunology, Immunotherapy.
Slide 35
Slide 35 text
Joint Embedding
Data from Mair et al., 2022, Nature.
Untransformed Transformed
Tumor 27
Tissue 138
Mair et al., 2022, Nature.
CD8- CD4+ CD45RA- CD27+ CD103- CD69- CD28+ HLADR+ GranzymeB- PD1+ CD25+ ICOS+ TCRgd- CD38+ Tim3+
Slide 38
Slide 38 text
SEMI-CONCLUSION
• “Tune” the data and the embedding method
• Use a data transformation close to your objective
• The annotation transformation is not bound to FAUST
Slide 39
Slide 39 text
Cytotoxic T Cells
T Helper Cells
B Cells
Naïve T Cells
Healthy Tissue Cancer Tissue
Data from Mair et al., 2022. Nature.
Slide 40
Slide 40 text
How can we compare embedding plots?
Slide 46
Slide 46 text
SAME DATA
Data from Mair et al., 2022, Nature.
Seed 42 Seed 123
High Visual Similarity
Low Jaccard Similarity
[Plot: cumulative probability of point-wise Jaccard similarity, for different set sizes in kNN graphs]
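A point-wise Jaccard comparison of kNN graphs can be sketched as below. This is an illustrative sketch under stated assumptions: the helper names `knn_indices` and `pointwise_jaccard` are made up for this example, kNN search is brute force (suitable only for small inputs), and both embeddings are assumed to list the same points in the same row order.

```python
import numpy as np


def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point (brute force)."""
    points = np.asarray(points, dtype=float)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def pointwise_jaccard(emb_a, emb_b, k=10):
    """Jaccard similarity of each point's kNN set in two embeddings."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    scores = np.empty(len(nn_a))
    for i, (a, b) in enumerate(zip(nn_a, nn_b)):
        set_a, set_b = set(a.tolist()), set(b.tolist())
        scores[i] = len(set_a & set_b) / len(set_a | set_b)
    return scores


# Usage: a mirrored copy keeps all neighbors; an unrelated layout does not.
rng = np.random.default_rng(42)
emb = rng.normal(size=(200, 2))
mirrored = emb * [-1.0, 1.0]
unrelated = rng.normal(size=(200, 2))
same_scores = pointwise_jaccard(emb, mirrored, k=10)
diff_scores = pointwise_jaccard(emb, unrelated, k=10)
```

Because the score depends on exact neighbor identity, two runs that look alike at the cluster level can still score low point by point, which is the limitation this slide highlights.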
Slide 51
Slide 51 text
Data from Mair et al., 2022, Nature.
TUMOR
TISSUE
???
Slide 54
Slide 54 text
TUMOR
TISSUE
How can we facilitate more effective and systematic comparisons
of these complex 2D scatter plots?
Challenge: establish meaningful relationships
between points in different views
Slide 55
Slide 55 text
Class-based comparison
• Compare groups of points rather than individual points
• Flexible comparisons at various abstraction levels
• Key considerations:
• Intermixing / separation
• Similarity / cohesion of neighbor groups
• Shifts in relative size (for data comparison)
Where do class labels come from?
External metadata (e.g., ground truth)
Unsupervised methods (e.g., clustering algorithms)
Can be hierarchical (animal: ..., fruit: ...)
Embedding Confusion Neighborhood Size
Confusion: the degree of intermixing between points of
the same label and others.
Orange
Plum
Lime
Blueberry
Core
Slide 61
Slide 61 text
Neighborhood stability: the degree to which local
neighbors are shared between visualizations.
Embedding Confusion Neighborhood Size
Orange
Plum
Lime
Blueberry
Context
Slide 62
Slide 62 text
Size: the change in relative class-label sizes with
respect to the neighborhood.
Embedding Confusion Neighborhood Size
Orange
Plum
Lime
Blueberry
Combined
Embedding Confusion Neighborhood Size
Less Same Decreased
Orange
Plum
Lime
Blueberry
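As a rough illustration of the size idea, the sketch below compares each label's share of all points between two datasets as a log2 fold change. It is a deliberate simplification: it ignores the neighborhood-relative normalization named in the definition above, and the function names are hypothetical.

```python
import math
from collections import Counter


def relative_sizes(labels):
    """Fraction of all points carrying each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def size_change(labels_a, labels_b):
    """log2 fold change of each label's share from A to B.

    Only labels present in both datasets are reported.
    """
    a, b = relative_sizes(labels_a), relative_sizes(labels_b)
    return {label: math.log2(b[label] / a[label]) for label in a.keys() & b.keys()}


# Usage: Orange halves its share, Plum grows.
labels_a = ["Orange"] * 50 + ["Plum"] * 50
labels_b = ["Orange"] * 25 + ["Plum"] * 75
change = size_change(labels_a, labels_b)
```

A negative value means the label takes up a smaller share in the second dataset, a positive value a larger one.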
Slide 66
Slide 66 text
Methodology
• Create a Delaunay graph
• For each label: conduct a breadth-first search from every point with that label
• Points within one hop make up the label's confusion set
• Points one or more hops away that are not in the confusion set make up the neighborhood set
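The bullets above can be sketched with SciPy's Delaunay triangulation and a breadth-first search. This is a minimal sketch, not the `cev` implementation: the function names and the `max_hops` limit are assumptions, and the distance-based and connectivity-based adjustments shown on later slides are left out.

```python
from collections import deque

import numpy as np
from scipy.spatial import Delaunay


def delaunay_adjacency(points):
    """Adjacency sets of the Delaunay triangulation of 2D points."""
    adjacency = [set() for _ in range(len(points))]
    for simplex in Delaunay(points).simplices:
        for i in simplex:
            for j in simplex:
                if i != j:
                    adjacency[int(i)].add(int(j))
    return adjacency


def confusion_and_neighborhood(points, labels, target, max_hops=2):
    """Confusion and neighborhood point sets for one label.

    max_hops limits how far the search expands (an assumption of this sketch).
    """
    labels = np.asarray(labels)
    adjacency = delaunay_adjacency(np.asarray(points, dtype=float))

    # Breadth-first search started from every point with the target label.
    sources = np.flatnonzero(labels == target).tolist()
    hops = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if hops[node] >= max_hops:
            continue
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)

    # One hop away: confusion set. Further away: neighborhood set.
    confusion = {n for n, h in hops.items() if h == 1}
    neighborhood = {n for n, h in hops.items() if h >= 1} - confusion
    return confusion, neighborhood


# Usage: three labels laid out in vertical bands.
rng = np.random.default_rng(0)
pts = rng.random((60, 2))
labs = np.where(pts[:, 0] < 0.33, "a", np.where(pts[:, 0] < 0.66, "b", "c"))
conf_set, nbh_set = confusion_and_neighborhood(pts, labs, "a")
```

Since every point with the target label starts at hop 0, both returned sets contain only points of other labels.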
Slide 77
Slide 77 text
Methodology
Candidate Confusion Set for Yellow
Slide 78
Slide 78 text
Methodology
Confusion Distance Adjustment
Distance Cutoff
Slide 79
Slide 79 text
Methodology
Confusion Distance Adjustment
Final Confusion Set for Yellow
Slide 81
Slide 81 text
Methodology
Neighborhood Set for Yellow
Slide 82
Slide 82 text
Methodology
Neighborhood connectivity-based adjustment
Scale the neighborhood strength of each neighboring label by:
1. Average number of connections between all labels
2. Average distance of connections between all labels
5 connections to blue
2 connections to gray
1 connection to green and purple
Slide 83
Slide 83 text
Methodology
Neighborhood connectivity-based adjustment
Neighborhood Likelihoods for Yellow: 1.0, 0.3, 0.6, 0.8
Slide 84
Slide 84 text
pip install cev
Slide 86
Slide 86 text
VS
Slide 87
Slide 87 text
Colored by label
Colored by metric
Slide 97
Slide 97 text
In summary
• We compare embedding visualizations based on class labels
• Class labels can be defined dynamically and at different levels of abstraction
• Addresses limitations of traditional point-based methods
• Evaluation study: Guided comparisons increase confidence in findings
• Note: Intended to complement embedding quality assessment methods
Slide 98
Slide 98 text
Thanks! What’s Next?
How to compare more than two embedding plots?
How to conditionally and dynamically balance feature importance?
How to dynamically and continuously adjust local vs global patterns?