Slide 1

Slide 1 text

Peax
Interactive Visual Pattern Search in Epigenomic Data Using Unsupervised Deep Representation Learning
Fritz Lekschas, Ph.D. candidate, Harvard University
July 27, 2020
Brand Peterson, Daniel Haehn, Eric Ma, Nils Gehlenborg, and Hanspeter Pfister

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Search

Slide 4

Slide 4 text

Search ? ? ? ? ? ?

Slide 5

Slide 5 text

Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.

Slide 6

Slide 6 text

Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.
Maurano et al. (2012): >90% of disease-associated variants found in GWAS are located in non-coding regions.

Slide 7

Slide 7 text

Why not just search computationally?

Slide 8

Slide 8 text

• Little to no ground truth
• Peak calling is not solved
• Feature calling is very hard
• Formally defining patterns is hard

Slide 9

Slide 9 text

• Little to no ground truth
• Peak calling is not solved
• Feature calling is very hard
• Formally defining patterns is hard

Visual quality control

Slide 10

Slide 10 text

• Little to no ground truth
• Peak calling is not solved
• Feature calling is very hard
• Formally defining patterns is hard

Visual quality control

Slide 11

Slide 11 text

• Little to no ground truth
• Peak calling is not solved
• Feature calling is very hard
• Formally defining patterns is hard

Visual quality control
Interactive visual query

Slide 12

Slide 12 text

Search Query

Slide 13

Slide 13 text

Search
Query
Features

Number of peaks | Height of peaks | Shape of peaks | Position of peaks | Average signal | ...
3               | 37              | 0.9            | 14                | 5              |
2               | 51              | 1.3            | 12                | 7              |
4               | 29              | 9.1            | 14                | 11             |
2               | 41              | 1.0            | 14                | 8              |

Example

Slide 14

Slide 14 text

Search
Query
Features

Number of peaks | Height of peaks | Shape of peaks | Position of peaks | Average signal | ...
3               | 37              | 0.9            | 14                | 5              |
2               | 51              | 1.3            | 12                | 7              |
4               | 29              | 9.1            | 14                | 11             |
2               | 41              | 1.0            | 14                | 8              |

Example
Result
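The feature-based search idea on this slide can be sketched as a nearest-neighbor lookup over feature vectors. This is only an illustration: treating the first row as the query, standardizing the features, and using Euclidean distance are all assumptions that the slide does not state.

```python
import numpy as np

# Feature rows from the slide: number, height, shape, and position of peaks,
# followed by the average signal.
# Treating the first row as the query is an assumption made for illustration.
features = np.array([
    [3, 37, 0.9, 14, 5],
    [2, 51, 1.3, 12, 7],
    [4, 29, 9.1, 14, 11],
    [2, 41, 1.0, 14, 8],
], dtype=float)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar, plus distances."""
    stacked = np.vstack([query, candidates])
    # Standardize each feature so that no single scale dominates the distance.
    scale = stacked.std(axis=0)
    scale[scale == 0] = 1.0
    z = (stacked - stacked.mean(axis=0)) / scale
    distances = np.linalg.norm(z[1:] - z[0], axis=1)
    return np.argsort(distances), distances


order, distances = rank_by_similarity(features[0], features[1:])
```

The quality of such a search depends entirely on how well the hand-crafted features capture the pattern, which is the difficulty the talk goes on to address with a learned encoding.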

Slide 15

Slide 15 text

1. Encoding 2. Active Learning & User Interface PEAX

Slide 16

Slide 16 text

1. Encoding 2. Active Learning & User Interface

Slide 17

Slide 17 text

2. Active Learning & User Interface

Slide 18

Slide 18 text

1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

Slide 19

Slide 19 text

1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

Slide 20

Slide 20 text

1. Data Processing 2. Convolutional Autoencoder 1 0 DATA ENCODING

Slide 21

Slide 21 text

1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

Slide 22

Slide 22 text

1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

Slide 23

Slide 23 text

Trained 6 autoencoders: 3 window sizes × 2 data types
• Data types: DNase, histone mark ChIP
• Window and bin sizes: 3 kb (25 bp), 12 kb (100 bp), 120 kb (1000 bp)
• 120 DNase-seq datasets from ENCODE
• 49 histone mark ChIP-seq experiments from Roadmap Epigenomics: H3K4me1/me3, H3K27ac/me3, H3K9ac/me3, H3K36me3
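Each window and bin size pair listed above yields 120 bins per window (3000/25 = 12000/100 = 120000/1000 = 120). A minimal sketch of how a signal could be binned and cut into windows follows. The function names are hypothetical, and reading `step_freq` as "steps per window" is an assumption, not something the slides confirm.

```python
import numpy as np

# (window size in bp, bin size in bp) pairs listed on the slide
CONFIGS = {"3kb": (3_000, 25), "12kb": (12_000, 100), "120kb": (120_000, 1_000)}


def bins_per_window(window_bp, bin_bp):
    """Number of bins that make up one window."""
    return window_bp // bin_bp


def to_windows(signal, window_bp, bin_bp, step_freq=2):
    """Bin a base-pair-resolution signal and cut it into overlapping windows.

    step_freq is assumed to be the number of steps per window, so
    step_freq=2 means that consecutive windows overlap by half.
    The signal must cover at least one full window.
    """
    n_bins = len(signal) // bin_bp
    # Average the signal within each bin.
    binned = signal[: n_bins * bin_bp].reshape(n_bins, bin_bp).mean(axis=1)
    width = bins_per_window(window_bp, bin_bp)
    step = width // step_freq
    starts = range(0, n_bins - width + 1, step)
    return np.stack([binned[s : s + width] for s in starts])


signal = np.random.default_rng(0).random(12_000)  # synthetic 12 kb signal
windows = to_windows(signal, *CONFIGS["3kb"])
```

With a 12 kb synthetic signal and the 3 kb configuration, this produces 7 half-overlapping windows of 120 bins each.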

Slide 24

Slide 24 text

3 kb 12 kb 120 kb DNase-seq ChIP-seq

Slide 25

Slide 25 text

R² by data type and window size:

          | 3 kb | 12 kb | 120 kb
DNase-seq | .98  | .90   | .78
ChIP-seq  | .84  | .69   | .73
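Assuming R² here denotes the coefficient of determination between the original windows and their autoencoder reconstructions (the slide does not say how it was computed or aggregated), it could be calculated as in this sketch.

```python
import numpy as np


def r_squared(original, reconstructed):
    """Coefficient of determination between a signal and its reconstruction."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    ss_res = np.sum((original - reconstructed) ** 2)    # residual sum of squares
    ss_tot = np.sum((original - original.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect reconstruction gives 1, and a reconstruction that only predicts the mean of the signal gives 0.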

Slide 26

Slide 26 text

Is the learned encoding useful for similarity search? Yes!

Slide 27

Slide 27 text

Is the learned encoding useful for similarity search? Yes! Yes!

Slide 28

Slide 28 text

Ours: CAE ED SAX DTW XCORR UMAP TSFRESH

Slide 29

Slide 29 text

Ours: CAE ED UMAP TSFRESH XCORR DTW SAX

Slide 30

Slide 30 text

Ours: CAE ED UMAP TSFRESH XCORR DTW SAX

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Query view Reconstructions

Slide 33

Slide 33 text

List view Labeling Tabs

Slide 34

Slide 34 text

Embedding

Slide 35

Slide 35 text

Progress

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Find Asymmetrical Peaks Video screencast available at
 youtu.be/FlzTdFUVE-M?t=220

Slide 38

Slide 38 text

Steps:
1. Select Pattern for Querying
2. Initial Sampling
3. Binary Labeling
4. Train First Classifier
5. Active Learning Sampling
6. Training Progress
7. Embedding View
8. Resolve Conflicts
9. Explore Final Results Spatially

Select Pattern for Querying
• Freely browse and select a region for querying
• The size of the selected region is fixed and determined by the autoencoder
• We use HiGlass (Kerpedjiev et al. 2018) as the genome browser

Slide 39

Slide 39 text

Initial Sampling
• Increase distance of samples to the query
• Sample regions in dense areas
• Maximize pairwise distance between samples
• All in the latent space

Slide 40

Slide 40 text

Initial Sampling
• Increase distance of samples to the query
• Sample regions in dense areas
• Maximize pairwise distance between samples
• All in the latent space
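One way the three sampling criteria could be combined is sketched below. This is not Peax's actual implementation: the band-based selection, the k-nearest-neighbor density estimate, the scoring formula, and the function name `initial_samples` are all assumptions made for illustration.

```python
import numpy as np


def initial_samples(latent, query_idx, n_samples=5, k=5):
    """Pick up to n_samples regions at increasing latent distance from the query."""
    dist_to_query = np.linalg.norm(latent - latent[query_idx], axis=1)
    order = np.argsort(dist_to_query)
    order = order[order != query_idx]  # the query itself is not a sample

    # Density estimate: a small mean distance to the k nearest neighbors
    # means a dense area. The full pairwise matrix is fine for a sketch
    # but too large for genome-scale data.
    pairwise = np.linalg.norm(latent[:, None, :] - latent[None, :, :], axis=2)
    knn_dist = np.sort(pairwise, axis=1)[:, 1 : k + 1].mean(axis=1)
    density = 1.0 / (knn_dist + 1e-12)

    chosen = []
    # Split the candidates into bands of increasing distance to the query
    # and take one sample per band.
    for band in np.array_split(order, n_samples):
        if len(band) == 0:
            continue
        if not chosen:
            score = density[band]
        else:
            # Favor dense points that are far from the samples chosen so far.
            min_dist = pairwise[np.ix_(band, chosen)].min(axis=1)
            score = density[band] * min_dist
        chosen.append(int(band[np.argmax(score)]))
    return chosen
```

Because the bands are ordered by distance, the returned samples move progressively farther away from the query.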

Slide 41

Slide 41 text

Binary Labeling
• Select regions that match and do not match the query
• Inconclusive regions can simply be skipped

Slide 42

Slide 42 text

Train First Classifier
• A random forest classifier is trained online with the labels
• Each time a new set of samples is requested, a new classifier is trained
• A new classifier can also be trained in between, after labels have changed
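A minimal sketch of this training step using scikit-learn's `RandomForestClassifier`. The label format, the hyperparameters, and the helper names are assumptions and do not reflect Peax's actual code. "Online" is taken here to mean that a fresh classifier is fit on all current labels each round.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_classifier(latent, labels):
    """Train a fresh random forest on the currently labeled regions.

    labels maps region index -> 1 (match) or 0 (no match); skipped regions
    are simply absent. Both classes must be present.
    """
    idx = np.fromiter(labels.keys(), dtype=int)
    y = np.fromiter(labels.values(), dtype=int)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(latent[idx], y)
    return clf


def match_probabilities(clf, latent):
    """Probability that each encoded region matches the query."""
    positive_col = list(clf.classes_).index(1)
    return clf.predict_proba(latent)[:, positive_col]
```

Retraining from scratch keeps the logic simple, and it stays cheap as long as the number of labeled regions is small.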

Slide 43

Slide 43 text

Active Learning Sampling
Regions are sampled based on:
 - prediction uncertainty
 - proximity to the target
 - location in dense neighborhoods
 - high pairwise distance
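The sampling criteria could be combined into a greedy selection like the sketch below. The equal weighting, the scoring formulas, and the function name are assumptions. The density criterion is left out for brevity, so this is a simplified variant rather than the method used in Peax.

```python
import numpy as np


def active_learning_samples(latent, probs, query_idx, labeled, n_samples=5):
    """Greedily pick unlabeled regions to show to the user next."""
    # Uncertainty is 1 at p = 0.5 and 0 at p = 0 or p = 1.
    uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)
    dist_to_query = np.linalg.norm(latent - latent[query_idx], axis=1)
    proximity = 1.0 - dist_to_query / (dist_to_query.max() + 1e-12)
    base = uncertainty + proximity  # equal weights are an arbitrary choice

    candidates = [
        i for i in range(len(latent)) if i not in labeled and i != query_idx
    ]
    chosen = []
    for _ in range(min(n_samples, len(candidates))):
        scores = base[candidates].copy()
        if chosen:
            # Diversity: distance to the closest already-chosen sample.
            diff = latent[candidates][:, None] - latent[chosen][None]
            d = np.linalg.norm(diff, axis=2).min(axis=1)
            scores += d / (d.max() + 1e-12)
        best = candidates[int(np.argmax(scores))]
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

The diversity term is recomputed after every pick so that each new sample is pushed away from the ones already selected.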

Slide 44

Slide 44 text

Training Progress
• Progress is tracked for every trained classifier
• Uncertainty is the overall prediction probability
• Change of the prediction probability
• Convergence and divergence
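The slide names these progress metrics without defining them. The sketch below shows one plausible set of definitions. All three formulas are assumptions: uncertainty as closeness of the predictions to 0.5, change as the mean absolute difference between consecutive classifiers, and convergence or divergence as the share of regions whose predictions keep or flip direction across three classifiers.

```python
import numpy as np


def uncertainty(probs):
    """Mean closeness of the predictions to 0.5, scaled to [0, 1]."""
    return float(np.mean(1.0 - 2.0 * np.abs(np.asarray(probs) - 0.5)))


def prediction_change(prev_probs, curr_probs):
    """Mean absolute change in prediction probability between two classifiers."""
    return float(np.mean(np.abs(np.asarray(curr_probs) - np.asarray(prev_probs))))


def convergence_divergence(p0, p1, p2):
    """Share of regions whose predictions kept moving in one direction
    (convergence) or reversed direction (divergence) over three classifiers."""
    d1 = np.asarray(p1) - np.asarray(p0)
    d2 = np.asarray(p2) - np.asarray(p1)
    converging = float(np.mean(d1 * d2 > 0))
    diverging = float(np.mean(d1 * d2 < 0))
    return converging, diverging
```

Tracking these values per classifier gives a rough signal for when further labeling stops changing the predictions.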

Slide 45

Slide 45 text

Embedding View
• 2D UMAP embedding of all encoded regions
• Probability color encoding:
 ⬤ means matching
 ⬤ means non-matching
 ⬤ means unpredictable
• View is interactive and dots are selectable

Slide 46

Slide 46 text

Embedding View
• 2D UMAP embedding of all encoded regions
• Probability color encoding:
 ⬤ means matching
 ⬤ means non-matching
 ⬤ means unpredictable
• View is interactive and dots are selectable

Slide 47

Slide 47 text

Resolve Conflicts
• Peax warns about false positives and negatives when the labels and the classifier's predictions disagree
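Detecting such conflicts amounts to comparing each user label with the classifier's prediction for the same region. A minimal sketch, assuming a 0.5 decision threshold and a hypothetical `find_conflicts` helper:

```python
def find_conflicts(labels, probs, threshold=0.5):
    """Return labeled regions whose prediction disagrees with the label.

    labels maps region index -> 1 (match) or 0 (no match);
    probs holds the predicted match probability per region.
    """
    # Labeled as non-matching but predicted as matching.
    false_positives = [
        i for i, y in labels.items() if y == 0 and probs[i] >= threshold
    ]
    # Labeled as matching but predicted as non-matching.
    false_negatives = [
        i for i, y in labels.items() if y == 1 and probs[i] < threshold
    ]
    return false_positives, false_negatives
```

A conflict can point to a mislabeled region or to a pattern the classifier has not yet learned, so the user decides whether to relabel or keep training.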

Slide 48

Slide 48 text

Explore Final Results Spatially
• The query view is interactive
• A BED-like track shows the prediction probabilities:
 ⬤ means matching
 ⬤ means non-matching
 ⬤ means unpredictable

Slide 49

Slide 49 text

Find Differentially Accessible Peaks Using Existing Peak Calls

Slide 50

Slide 50 text

Initial Classifier
• ENCODE e11.5 DNase-seq from face and hindbrain
• Differential, central, strong peak calls
• Balance positives and negatives

Slide 51

Slide 51 text

Initial Classifier
• ENCODE e11.5 DNase-seq from face and hindbrain
• Differential, central, strong peak calls
• Balance positives and negatives

Slide 52

Slide 52 text

Initial Classifier
• ENCODE e11.5 DNase-seq from face and hindbrain
• Differential, central, strong peak calls
• Balance positives and negatives
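Balancing positives and negatives before training the initial classifier could be done by downsampling the larger class, as in the sketch below. The slide does not say how the balancing was done. Random downsampling and the `balance_labels` name are assumptions.

```python
import numpy as np


def balance_labels(positive_idx, negative_idx, seed=0):
    """Randomly downsample the larger class so both classes have equal size."""
    rng = np.random.default_rng(seed)
    n = min(len(positive_idx), len(negative_idx))
    pos = rng.choice(positive_idx, size=n, replace=False)
    neg = rng.choice(negative_idx, size=n, replace=False)
    return np.sort(pos), np.sort(neg)
```

Without balancing, a classifier trained on peak calls would see far more non-matching than matching windows and could default to predicting "no match".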

Slide 53

Slide 53 text

Resolve Conflicts Refine Labels & Assess Recall

Slide 54

Slide 54 text

Explore Local Neighborhoods

Slide 55

Slide 55 text

Final Results Excluding Labels

Slide 56

Slide 56 text

CONCLUSION
• Leverage deep learning to augment human intelligence for visual pattern exploration
• Complementary to specialized feature detectors

FUTURE WORK
• Explore other types of encoders
• Evaluate different active learning strategies

Slide 57

Slide 57 text

Thank You!

PAPER, SOURCE CODE, MODELS, ETC.
vcg.seas.harvard.edu/pubs/peax

CONTACT
[email protected]
@flekschas
lekschas.de

Slide 58

Slide 58 text

Individual histone marks: H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9ac, H3K9me3, H3K36me3
Window sizes: 3 kb, 12 kb, 120 kb

Slide 59

Slide 59 text

TECHNOLOGY
• Backend: Python (Flask)
• Frontend: JavaScript (React)
• Autoencoders: Keras (TensorFlow)
• Genome Browser: HiGlass
• Search Setup: JSON file

EXAMPLE SEARCH SETUP

    {
      "encoders": [
        {
          "content_type": "dnase-seq-3kb",
          "from_file": "examples/autoencoders.json"
        },
        {
          "content_type": "histone-mark-chip-seq-3kb",
          "from_file": "examples/autoencoders.json"
        }
      ],
      "datasets": [
        {
          "filepath": "examples/data/ENCFF641OPE.bigWig",
          "content_type": "dnase-seq-3kb",
          "id": "encode-e11-5-limb-dnase-rdns",
          "name": "e11.5 limb DNase rdn signal"
        },
        {
          "filepath": "examples/data/ENCFF336LAW.bigWig",
          "content_type": "histone-mark-chip-seq-3kb",
          "id": "encode-e11-5-limb-chip-h3k27ac-fc",
          "name": "e11.5 limb H3K27ac fc"
        }
      ],
      "coords": "mm10",
      "chroms": ["chr12"],
      "step_freq": 2,
      "db_path": "examples/search-e11-5-limb.db"
    }
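A search setup of this shape can be loaded and sanity-checked with Python's standard `json` module. The sketch below uses a shortened copy of the example. The rule that every dataset's `content_type` must have a matching encoder is an assumption inferred from the example, and `load_search_config` is a hypothetical helper, not part of Peax.

```python
import json

# Shortened copy of the example search setup (one encoder, one dataset).
EXAMPLE = """
{
  "encoders": [
    {"content_type": "dnase-seq-3kb", "from_file": "examples/autoencoders.json"}
  ],
  "datasets": [
    {
      "filepath": "examples/data/ENCFF641OPE.bigWig",
      "content_type": "dnase-seq-3kb",
      "id": "encode-e11-5-limb-dnase-rdns",
      "name": "e11.5 limb DNase rdn signal"
    }
  ],
  "coords": "mm10",
  "chroms": ["chr12"],
  "step_freq": 2,
  "db_path": "examples/search-e11-5-limb.db"
}
"""


def load_search_config(text):
    """Parse a search setup and check that each dataset has a matching encoder."""
    config = json.loads(text)
    encoder_types = {encoder["content_type"] for encoder in config["encoders"]}
    for dataset in config["datasets"]:
        if dataset["content_type"] not in encoder_types:
            raise ValueError(
                f"No encoder for content type {dataset['content_type']!r} "
                f"(dataset {dataset['id']!r})"
            )
    return config


config = load_search_config(EXAMPLE)
```

Checking the content types up front surfaces a mismatched dataset before any encoding work starts.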