Slide 1

Slide 1 text

November 13, 2024 Effective and Comparative Methods for Single-Cell Embedding Visualizations Fritz Lekschas Head of Visualization Research at Ozette Technologies lekschas.de linkedin.com/in/flekschas Visual Analytics Lab at Tufts University

Slide 2

Slide 2 text

MASSIVE SHOUT OUTS! Trevor Manz, PhD Candidate at HMS: first author of the CEV paper and former Ozette intern. Evan Greene, Ozette Co-Founder: first author and creator of the data transformation methods. Nezar Abdennur, Asst. Prof. at UMass Med: long-term collaborator and embedding nerd. Arpan Neupane, Principal Computational Biologist: helps me better understand immunology.

Slide 3

Slide 3 text

EDUCATION PhD '21 in CS from Harvard University MSc '16 in Bioinformatics from Freie Universität Berlin RESEARCH Visualization Human-Centered ML Design WORK Head of Visualization Research at Ozette

Slide 4

Slide 4 text

Data-Driven Discovery of High-Resolution and Interpretable Cell Phenotypes in Single-Cell Cytometry Data

Slide 5

Slide 5 text

Proteins (a.k.a. features)

Slide 6

Slide 6 text

From: General Cell Types To: High-Resolution Cell Phenotypes / Well-Resolved High-Resolution Cell Phenotypes Cytotoxic T Cells T Helper Cells B Cells Naïve T Cells Data from Mair et al., 2022, Nature.

Slide 7

Slide 7 text

Cytotoxic T Cells T Helper Cells B Cells Naïve T Cells Healthy Tissue Cancer Tissue Data from Mair et al., 2022, Nature.

Slide 8

Slide 8 text

Single-Cell Embeddings • FEATURES: Cell-Surface Antibodies (Greene et al., 2021, Patterns) • FEATURES: Chromatin Accessibility Peaks (Granja et al., 2020, Nature Biotechnology) • FEATURES: Genes (Tabula Sapiens Consortium, 2022, Science)

Slide 9

Slide 9 text

Why Visualize Embeddings? OVERVIEW Broad distribution & cell heterogeneity Hypothesis Generation COMPARE Relative similarity of cell populations Trajectory Analysis CLUSTER Identify cell types/phenotypes Annotate clusters

Slide 10

Slide 10 text

Why Visualize Embeddings? OVERVIEW Broad distribution & cell heterogeneity Hypothesis Generation COMPARE Relative similarity of cell populations Trajectory Analysis CLUSTER Identify cell types/phenotypes Annotate clusters Sheih et al., 2020, Nature Communications.

Slide 11

Slide 11 text

Why Visualize Embeddings? OVERVIEW Broad distribution & cell heterogeneity Hypothesis Generation COMPARE Relative similarity of cell populations Trajectory Analysis CLUSTER Identify cell types/phenotypes Annotate clusters CD4+ Tabula Sapiens Consortium, 2022, Science.

Slide 12

Slide 12 text

Why Visualize Embeddings? OVERVIEW Broad distribution & cell heterogeneity Hypothesis Generation COMPARE Relative similarity of cell populations Trajectory Analysis CLUSTER Identify cell types/phenotypes Annotate clusters CD4 Expression CD4+ CD3 Expression CD3+ Greene et al., 2021, Patterns PD-1 Expression PD-1+ HLADR Expression HLADR+

Slide 13

Slide 13 text

Visualization Challenges CLUSTER RESOLUTION Focus on general or specific cellular phenotypes? SAMPLE COMPARISON How to handle batch effects and align embeddings? vs CD8+ T Cells T Helper Cells B Cells Naïve T Cells Sample B Sample A

Slide 14

Slide 14 text

Visualization Challenges CLUSTER RESOLUTION Focus on general or specific cellular phenotypes? SAMPLE COMPARISON How to handle batch effects and align embeddings? Sample B Sample A EXPLORATION VS EXPLANATION Is the visualization a representation of the clustering? vs CD8+ T Cells T Helper Cells B Cells Naïve T Cells

Slide 15

Slide 15 text

How can we create Ozette’s embedding plot? How can we compare embedding plots?

Slide 16

Slide 16 text

Cytotoxic T Cells T Helper Cells B Cells Naïve T Cells Healthy Tissue Cancer Tissue Data from Mair et al., 2022, Nature.

Slide 17

Slide 17 text

How can we create Ozette’s embedding plot?

Slide 18

Slide 18 text

FAUST Annotation + Clustering ANNOTATE Define expression levels E.g.: Positive / Negative Fully interpretable clusters Greene et al., 2021, Patterns.

Slide 21

Slide 21 text

Data Transformation FOR EACH PHENOTYPE: 1. Remove outlier expression values: winsorize to the [1st, 99th] percentile range 2. Remove inter-marker differences: normalize to zero mean and unit variance 3. Align marker expressions by their expression level: translate the mean to a fixed value

Slide 22

Slide 22 text

Data Transformation FOR EACH PHENOTYPE: 1. Remove outlier expression values: winsorize to the [1st, 99th] percentile range 2. Remove inter-marker differences: normalize to zero mean and unit variance 3. Align marker expressions by their expression level: translate the mean to a fixed value 0. Raw Expression CD3+ CD4+ CD8-

Slide 24

Slide 24 text

Data Transformation FOR EACH PHENOTYPE: 1. Remove outlier expression values: winsorize to the [1st, 99th] percentile range 2. Remove inter-marker differences: normalize to zero mean and unit variance 3. Align marker expressions by their expression level: translate the mean to a fixed value 0. Raw Expression 1. Winsorized Expression CD3+ CD4+ CD8-

Slide 25

Slide 25 text

Data Transformation FOR EACH PHENOTYPE: 1. Remove outlier expression values: winsorize to the [1st, 99th] percentile range 2. Remove inter-marker differences: normalize to zero mean and unit variance 3. Align marker expressions by their expression level: translate the mean to a fixed value 0. Raw Expression 1. Winsorized Expression 2. Normalized Expression CD3+ CD4+ CD8-

Slide 26

Slide 26 text

Data Transformation FOR EACH PHENOTYPE: 1. Remove outlier expression values: winsorize to the [1st, 99th] percentile range 2. Remove inter-marker differences: normalize to zero mean and unit variance 3. Align marker expressions by their expression level: translate the mean to a fixed value 0. Raw Expression 1. Winsorized Expression 2. Normalized Expression 3. Translated Expression CD3+ CD4+ CD8-
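
The three transformation steps can be sketched in a few lines of Python. This is a minimal illustration for a single marker within a single phenotype, not Ozette's implementation: the function names and the target means of +1.0 and -1.0 are assumptions, since the slides only say the mean is translated "to a fixed value".

```python
from statistics import mean, pstdev


def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0-100) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def transform_marker(values, is_positive, pos_target=1.0, neg_target=-1.0):
    """Transform one marker's expression values for the cells of one phenotype.

    pos_target / neg_target are hypothetical fixed means for markers annotated
    as positive (+) or negative (-); the slides do not state the actual values.
    """
    ordered = sorted(values)
    low, high = percentile(ordered, 1), percentile(ordered, 99)
    # Step 1: winsorize, i.e. clip values to the [1st, 99th] percentile range.
    clipped = [min(max(v, low), high) for v in values]
    # Step 2: normalize to zero mean and unit variance.
    mu, sigma = mean(clipped), pstdev(clipped)
    scaled = [(v - mu) / sigma if sigma > 0 else 0.0 for v in clipped]
    # Step 3: translate so the mean lands on the fixed value for this level.
    target = pos_target if is_positive else neg_target
    return [v + target for v in scaled]


cd4_in_phenotype = [0.0, 1.0, 2.0, 3.0, 100.0]  # made-up expression values
transformed = transform_marker(cd4_in_phenotype, is_positive=True)
```

Applied to every marker of every phenotype, a scheme like this makes cells with the same +/- annotation look alike to the embedding method, regardless of the marker's raw intensity scale.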

Slide 27

Slide 27 text

Untransformed Transformed Tumor sample 6 from Mair et al., 2022, Nature. UMAP Embedding

Slide 28

Slide 28 text

Untransformed Transformed Tumor sample 6 from Mair et al., 2022, Nature. t-SNE Embedding

Slide 29

Slide 29 text

Untransformed Transformed Tumor sample 6 from Mair et al., 2022, Nature. VAE Embedding

Slide 30

Slide 30 text

Winsorized Transformed Tumor sample 6 from Mair et al., 2022, Nature. VAE Embedding

Slide 31

Slide 31 text

Untransformed Transformed Tumor sample 6 from Mair et al., 2022, Nature. Cluster Coherence

Slide 32

Slide 32 text

Untransformed Transformed Tumor sample 6 from Mair et al., 2022, Nature. CD38 Expression Difference CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38- CD127- Tim3- CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38+ CD127- Tim3-

Slide 33

Slide 33 text

Untransformed Transformed Tumor sample 6 from Mair et al., 2022, Nature. CD38 Expression Difference CD38- CD38+ CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38- CD127- Tim3- CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38+ CD127- Tim3-

Slide 34

Slide 34 text

Untransformed Transformed Tumor sample 6 from Mair et al., 2022, Nature. CD38 Expression Difference CD38- CD38+ CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38- CD127- Tim3- CD4- CD8+ CD3+ CD45RA- CD27+ CD19- CD103+ CD28+ CD69+ PD1+ HLADR- GranzymeB- CD25- ICOS- TCRgd- CD38+ CD127- Tim3- “Our study suggest that increased CD38 expression defines tumor-infiltrating CD8+ T cells been pre-activated …” Wu et al., 2021, Cancer Immunology, Immunotherapy.

Slide 35

Slide 35 text

Joint Embedding Data from Mair et al., 2022, Nature. Untransformed Transformed Tumor 27 Tissue 138

Slide 37

Slide 37 text

Joint Embedding Data from Mair et al., 2022, Nature. Untransformed Transformed Tumor 27 Tissue 138 Mair et al., 2022, Nature. CD8- CD4+ CD45RA- CD27+ CD103- CD69- CD28+ HLADR+ GranzymeB- PD1+ CD25+ ICOS+ TCRgd- CD38+ Tim3+

Slide 38

Slide 38 text

SEMI-CONCLUSION • “Tune” the data and the embedding method • Use a data transformation close to your objective • The annotation transformation is not bound to FAUST

Slide 39

Slide 39 text

Cytotoxic T Cells T Helper Cells B Cells Naïve T Cells Healthy Tissue Cancer Tissue Data from Mair et al., 2022, Nature.

Slide 40

Slide 40 text

How can we compare embedding plots?

Slide 41

Slide 41 text

Data from Mair et al., 2022, Nature.

Slide 42

Slide 42 text

SAME DATA Data from Mair et al., 2022, Nature.

Slide 44

Slide 44 text

SAME DATA Data from Mair et al., 2022, Nature. Seed 42 Seed 123

Slide 45

Slide 45 text

SAME DATA Data from Mair et al., 2022, Nature. Seed 42 Seed 123 High Visual Similarity

Slide 46

Slide 46 text

SAME DATA Data from Mair et al., 2022, Nature. Seed 42 vs. Seed 123: High Visual Similarity but Low Jaccard Similarity. [Plot: cumulative probability of point-wise Jaccard similarity in kNN graphs, for different set sizes]
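
To make the point-wise measure concrete, here is a minimal sketch of Jaccard similarity between the kNN sets of the same points in two embeddings. It uses a brute-force neighbor search and hypothetical function names; it illustrates the general idea and is not the code behind the plot.

```python
from math import dist


def knn_sets(points, k):
    """Brute-force k-nearest-neighbor index sets for a list of 2D points."""
    neighbors = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        order = sorted(others, key=lambda j: dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors


def pointwise_jaccard(embedding_a, embedding_b, k):
    """Per-point Jaccard similarity of kNN sets (needs k >= 1 and 2+ points)."""
    knn_a, knn_b = knn_sets(embedding_a, k), knn_sets(embedding_b, k)
    return [len(a & b) / len(a | b) for a, b in zip(knn_a, knn_b)]


layout_a = [(0, 0), (1, 0), (3, 0), (10, 0)]
rotated = [(0, 0), (0, 2), (0, 6), (0, 20)]   # same layout, rotated and scaled
shuffled = [(0, 0), (10, 0), (3, 0), (1, 0)]  # points 1 and 3 swapped

same = pointwise_jaccard(layout_a, rotated, k=1)
different = pointwise_jaccard(layout_a, shuffled, k=1)
```

The measure is invariant to rotation and scaling, but small local rearrangements, which two seeds of the same method readily produce, lower it sharply even when the overall picture looks the same.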

Slide 47

Slide 47 text

SAME DATA Data from Mair et al., 2022, Nature.

Slide 48

Slide 48 text

Data from Mair et al., 2022, Nature. TISSUE

Slide 49

Slide 49 text

Data from Mair et al., 2022, Nature. TUMOR TISSUE

Slide 51

Slide 51 text

Data from Mair et al., 2022, Nature. TUMOR TISSUE ???

Slide 52

Slide 52 text

TUMOR TISSUE

Slide 53

Slide 53 text

TUMOR TISSUE How can we facilitate more effective and systematic comparisons of these complex 2D scatters?

Slide 54

Slide 54 text

TUMOR TISSUE How can we facilitate more effective and systematic comparisons of these complex 2D scatters? Challenge: establish meaningful relationships between points in different views

Slide 55

Slide 55 text

Class-based comparison • Compare groups of points rather than individual points • Flexible comparisons at various abstraction levels • Key considerations: • Intermixing / separation • Similarity / cohesion of neighbor groups • Shifts in relative size (for data comparison)

Slide 56

Slide 56 text

Class-based comparison • Compare groups of points rather than individual points • Flexible comparisons at various abstraction levels • Key considerations: • Intermixing / separation • Similarity / cohesion of neighbor groups • Shifts in relative size (for data comparison) Where do class labels come from?

Slide 57

Slide 57 text

Class-based comparison • Compare groups of points rather than individual points • Flexible comparisons at various abstraction levels • Key considerations: • Intermixing / separation • Similarity / cohesion of neighbor groups • Shifts in relative size (for data comparison) Where do class labels come from? • External metadata (e.g., ground truth) • Unsupervised methods (e.g., clustering algorithms) • Can be hierarchical (e.g., animal or fruit, each with more specific subclasses)

Slide 58

Slide 58 text

Embedding Confusion Neighborhood Size Plum Lime Blueberry Orange

Slide 60

Slide 60 text

Embedding Confusion Neighborhood Size Confusion: the degree of intermixing between points of the same label and others. Orange Plum Lime Blueberry Core

Slide 61

Slide 61 text

Neighborhood stability: the degree to which local neighbors are shared between visualizations. Embedding Confusion Neighborhood Size Orange Plum Lime Blueberry Context

Slide 62

Slide 62 text

Size: the change in relative class-label sizes with respect to the neighborhood. Embedding Confusion Neighborhood Size Orange Plum Lime Blueberry Combined

Slide 63

Slide 63 text

Embedding Confusion Neighborhood Size Orange Plum Lime Blueberry

Slide 65

Slide 65 text

Embedding Confusion Neighborhood Size Less Same Decreased Orange Plum Lime Blueberry

Slide 66

Slide 66 text

Methodology • Create a Delaunay graph • For each label: conduct a breadth-first search from every point with that label • Points within one hop count toward the label’s confusion set • Points one or more hops away that are not in the confusion set count toward the neighborhood set

Slide 67

Slide 67 text

Methodology • Create a Delaunay graph • For each label: conduct a breadth-first search from every point with that label • Points within one hop count toward the label’s confusion set • Points one or more hops away that are not in the confusion set count toward the neighborhood set

Slide 77

Slide 77 text

Methodology • Create a Delaunay graph • For each label: conduct a breadth-first search from every point with that label • Points within one hop count toward the label’s confusion set • Points one or more hops away that are not in the confusion set count toward the neighborhood set Candidate Confusion Set for Yellow

Slide 78

Slide 78 text

Methodology • Create a Delaunay graph • For each label: conduct a breadth-first search from every point with that label • Points within one hop count toward the label’s confusion set • Points one or more hops away that are not in the confusion set count toward the neighborhood set Confusion Distance Adjustment: Distance Cutoff

Slide 79

Slide 79 text

Methodology • Create a Delaunay graph • For each label: conduct a breadth-first search from every point with that label • Points within one hop count toward the label’s confusion set • Points one or more hops away that are not in the confusion set count toward the neighborhood set Confusion Distance Adjustment: Final Confusion Set for Yellow

Slide 80

Slide 80 text

Methodology • Create a Delaunay graph • For each label: conduct a breadth-first search from every point with that label • Points within one hop count toward the label’s confusion set • Points one or more hops away that are not in the confusion set count toward the neighborhood set

Slide 81

Slide 81 text

Methodology • Create a Delaunay graph • For each label: conduct a breadth-first search from every point with that label • Points within one hop count toward the label’s confusion set • Points one or more hops away that are not in the confusion set count toward the neighborhood set Neighborhood Set for Yellow
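
The breadth-first search over the graph can be sketched as follows. This is a simplified reading of the bullets, not the `cev` implementation: the graph is passed in as an adjacency dict (in practice it could be built from a triangulation such as `scipy.spatial.Delaunay`), and the `cutoff` and `max_hops` parameters, along with the rule that only other-label points enter the two sets, are assumptions made for illustration.

```python
from collections import deque
from math import dist


def label_sets(points, adjacency, labels, label, cutoff=float("inf"), max_hops=2):
    """Return (confusion_set, neighborhood_set) of point indices for one label.

    cutoff   -- hypothetical distance cutoff for one-hop confusion candidates
    max_hops -- hypothetical search depth limit for the neighborhood set
    """
    confusion, neighborhood = set(), set()
    sources = [i for i, l in enumerate(labels) if l == label]
    for s in sources:
        # Breadth-first search from s, recording the hop count of each point.
        hops = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if hops[u] == max_hops:
                continue  # do not expand beyond the depth limit
            for v in adjacency[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        for v, h in hops.items():
            if labels[v] == label:
                continue  # only points of other labels are collected
            if h == 1 and dist(points[s], points[v]) <= cutoff:
                confusion.add(v)  # close one-hop neighbor: intermixed
            else:
                neighborhood.add(v)  # reachable, but not a confusion candidate
    # A point in the confusion set never also counts as neighborhood.
    return confusion, neighborhood - confusion


# Toy graph: a path 0 - 1 - 2 - 3 with labels yellow, yellow, blue, gray.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
adjacency = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
labels = ["yellow", "yellow", "blue", "gray"]
confusion, neighborhood = label_sets(points, adjacency, labels, "yellow")
```

In the toy graph, the blue point touches a yellow point directly and lands in the confusion set, while the gray point is two hops away and lands in the neighborhood set. Tightening `cutoff` below the edge length moves the blue point from confusion to neighborhood, which mirrors the distance adjustment step.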

Slide 82

Slide 82 text

Methodology Neighborhood connectivity-based adjustment Scale the neighborhood strength of each neighboring label by: 1. Average number of connections between all labels 2. Average distance of connections between all labels Example: 5 connections to blue, 2 connections to gray, 1 connection each to green and purple

Slide 83

Slide 83 text

Methodology Neighborhood connectivity-based adjustment Scale the neighborhood strength of each neighboring label by: 1. Average number of connections between all labels 2. Average distance of connections between all labels Neighborhood Likelihoods for Yellow: 1.0, 0.3, 0.6, 0.8

Slide 84

Slide 84 text

pip install cev

Slide 86

Slide 86 text

VS

Slide 87

Slide 87 text

Colored by label Colored by metric

Slide 88

Slide 88 text

No content

Slide 97

Slide 97 text

In summary • We compare embedding visualizations based on class labels • Class labels can be defined dynamically and at different levels of abstraction • Addresses limitations of traditional point-based methods • Evaluation study: Guided comparisons increase confidence in findings • Note: Intended to complement embedding quality assessment methods

Slide 98

Slide 98 text

Thanks! What’s Next? How to compare more than two embedding plots? How to conditionally and dynamically balance feature importance? How to dynamically and continuously adjust local vs global patterns?