Interactive Visual Pattern Search in Epigenomic Data with Peax: NIH ENCODE presentation

Peax: Interactive Visual Pattern Search in Epigenomic Data Using Unsupervised Deep Representation Learning
Fritz Lekschas, Ph.D. candidate, Harvard University
July 27, 2020
Brand Peterson, Daniel Haehn, Eric Ma, Nils Gehlenborg, and Hanspeter Pfister

Search ? ? ? ? ? ?

Davis et al. (2018). The Encyclopedia of DNA Elements (ENCODE): data portal update.
Maurano et al. (2012). >90% of disease-associated variants found in GWAS are located in non-coding regions.

Why not just search computationally?

• Little to no ground truth
• Peak calling is not solved
• Feature calling is very hard
• Formally defining patterns is hard

Visual quality control
Interactive visual query

Search
Labels: Query, Example, Result

Features
Number of peaks   Height of peaks   Shape of peaks   Position of peaks   Average signal   ...
3                 37                0.9              14                  5
2                 51                1.3              12                  7
4                 29                9.1              14                  11
2                 41                1.0              14                  8
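
Feature-based search, as outlined on this slide, describes each region by a handful of handcrafted numbers and compares regions through them. The following is a minimal sketch of such a descriptor, assuming a region is a 1D NumPy array of signal values. The function name `describe_window`, the threshold-based peak definition, and the chosen features are hypothetical and only illustrate the idea.

```python
import numpy as np

def describe_window(signal, threshold=1.0):
    """Summarize a 1D signal with a few handcrafted features (illustrative only)."""
    signal = np.asarray(signal, dtype=float)
    inner = signal[1:-1]
    # A bin counts as a peak if it exceeds the threshold, is greater than its
    # left neighbor, and is at least as large as its right neighbor.
    is_peak = (inner > threshold) & (inner > signal[:-2]) & (inner >= signal[2:])
    peak_idx = np.flatnonzero(is_peak) + 1
    return {
        "num_peaks": int(peak_idx.size),
        "max_height": float(signal[peak_idx].max()) if peak_idx.size else 0.0,
        "first_peak_pos": int(peak_idx[0]) if peak_idx.size else -1,
        "mean_signal": float(signal.mean()),
    }

window = [0, 1, 5, 1, 0, 0, 3, 0]
features = describe_window(window)
print(features)
```

Every choice in such a descriptor (what counts as a peak, which threshold to use) has to be made by hand, which is the difficulty the preceding slides point to.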

PEAX
1. Encoding
2. Active Learning & User Interface

DATA ENCODING
1. Data Processing
2. Convolutional Autoencoder

Trained 6 autoencoders: 3 window sizes × 2 data types
• Data types: DNase, histone mark ChIP
• Window and bin sizes: 3 kb (25 bp), 12 kb (100 bp), 120 kb (1000 bp)
• 120 DNase-seq datasets from ENCODE
• 49 histone mark ChIP-seq experiments from Roadmap Epigenomics: H3K4me1/me3, H3K27ac/me3, H3K9ac/me3, H3K36me3
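
The window and bin sizes listed here all yield 120 bins per window (3000/25 = 12000/100 = 120000/1000 = 120). The sketch below shows one way to cut a signal into binned windows. It assumes a base-pair-resolution signal held in a NumPy array, mean aggregation per bin, and a caller-chosen step size; none of these details are stated on the slide, and `bin_windows` is a hypothetical helper.

```python
import numpy as np

# Window size and bin size in base pairs, as listed on the slide.
WINDOW_CONFIGS = {"3kb": (3_000, 25), "12kb": (12_000, 100), "120kb": (120_000, 1_000)}

def bin_windows(signal, window_size, bin_size, step):
    """Cut a base-pair-resolution signal into windows and average each bin."""
    signal = np.asarray(signal, dtype=float)
    bins_per_window = window_size // bin_size
    starts = range(0, signal.size - window_size + 1, step)
    windows = [
        signal[s:s + window_size].reshape(bins_per_window, bin_size).mean(axis=1)
        for s in starts
    ]
    return np.array(windows)

bins_per_config = {name: w // b for name, (w, b) in WINDOW_CONFIGS.items()}
print(bins_per_config)

rng = np.random.default_rng(0)
signal = rng.random(12_000)  # synthetic stand-in for a coverage track
windows = bin_windows(signal, window_size=3_000, bin_size=25, step=1_500)
print(windows.shape)
```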

R² by data type and window size

             3 kb   12 kb   120 kb
DNase-seq    .98    .90     .78
ChIP-seq     .84    .69     .73

Is the learned encoding useful for similarity search? Yes! Yes!

[Figures: similarity search comparison. Methods shown: CAE (ours), ED, SAX, DTW, XCORR, UMAP, TSFRESH]
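
In its simplest form, similarity search over an encoding ranks all windows by their distance to the query in the latent space. The sketch below illustrates this with Euclidean distance. The random vectors stand in for encoded windows, and the latent size of 10 and the helper name `nearest_windows` are assumptions made for the example.

```python
import numpy as np

def nearest_windows(latent, query_idx, k=5):
    """Return indices of the k windows closest to the query (Euclidean distance)."""
    dists = np.linalg.norm(latent - latent[query_idx], axis=1)
    dists[query_idx] = np.inf  # exclude the query itself
    return np.argsort(dists)[:k]

rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 10))  # synthetic stand-in for encoded windows
latent[42] = latent[7] + 0.001        # plant a near-duplicate of window 7
hits = nearest_windows(latent, query_idx=7, k=5)
print(hits)
```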

Query view Reconstructions

List view Labeling Tabs

Embedding

Progress

Find Asymmetrical Peaks
Video screencast available at youtu.be/FlzTdFUVE-M?t=220

Steps: Select Pattern for Querying, Initial Sampling, Binary Labeling, Train First Classifier, Active Learning Sampling, Training Progress, Embedding View, Resolve Conflicts, Explore Final Results Spatially

Select Pattern for Querying
• Freely browse and select a region for querying
• Size of the selected region is fixed and based on the autoencoder
• We use HiGlass (Kerpedjiev et al. 2018) as the genome browser

Initial Sampling
• Increase distance of samples to the query
• Sample regions in dense areas
• Maximize pairwise distance between samples
• All in the latent space
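
The three sampling criteria can be combined in many ways. The sketch below is one illustrative heuristic and not Peax's actual implementation: it takes candidates at geometrically increasing rank from the query, keeps the denser half (smallest mean distance to the k nearest neighbors), and then greedily picks samples that are far apart. The function name, the parameters, and the order of the steps are all assumptions.

```python
import numpy as np

def initial_sample(latent, query_idx, n_samples=5, n_candidates=40, k=5):
    """Pick a diverse first batch of windows to label (illustrative heuristic)."""
    dist_to_query = np.linalg.norm(latent - latent[query_idx], axis=1)
    order = np.argsort(dist_to_query)
    order = order[order != query_idx]

    # 1. Candidates at increasing distance from the query (geometric rank spacing).
    ranks = np.unique(np.geomspace(1, order.size, n_candidates).astype(int)) - 1
    candidates = order[ranks]

    # 2. Prefer dense areas: keep candidates with the smallest mean k-NN distance.
    pairwise = np.linalg.norm(latent[candidates, None] - latent[None], axis=2)
    knn_dist = np.sort(pairwise, axis=1)[:, 1:k + 1].mean(axis=1)
    n_keep = max(n_samples, candidates.size // 2)
    dense = candidates[np.argsort(knn_dist)[:n_keep]]

    # 3. Greedily maximize the pairwise distance between the chosen samples.
    chosen = [dense[0]]
    while len(chosen) < min(n_samples, dense.size):
        d = np.linalg.norm(latent[dense, None] - latent[chosen][None], axis=2)
        chosen.append(dense[np.argmax(d.min(axis=1))])
    return np.array(chosen)

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))  # synthetic stand-in for encoded windows
sample = initial_sample(latent, query_idx=0)
print(sample)
```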

Binary Labeling
• Select regions that match and do not match the query
• Inconclusive regions can simply be skipped

Train First Classifier
• A random forest classifier is trained online with the labels
• Each time a new set of samples is requested, a new classifier is trained
• A new classifier can also be trained in between, after labels have changed
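
A minimal sketch of this retrain-on-demand loop, assuming scikit-learn's `RandomForestClassifier`, labels stored as a dictionary from window index to 0 (no match) or 1 (match), and synthetic latent vectors. The helper `train_classifier` and the hyperparameters are hypothetical; the slides do not state which library or settings Peax uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_classifier(latent, labels, seed=0):
    """Fit a fresh random forest on the currently labeled windows."""
    idx = np.fromiter(labels.keys(), dtype=int)
    y = np.fromiter(labels.values(), dtype=int)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(latent[idx], y)
    return clf

rng = np.random.default_rng(0)
# Two synthetic clusters stand in for non-matching and matching windows.
latent = np.vstack([rng.normal(-2, 0.5, (50, 4)), rng.normal(2, 0.5, (50, 4))])

labels = {0: 0, 1: 0, 2: 0, 50: 1, 51: 1, 52: 1}
clf = train_classifier(latent, labels)
prob_match = clf.predict_proba(latent)[:, 1]

# After the labels change, a new classifier is trained from scratch.
labels[60] = 1
clf = train_classifier(latent, labels)
prob_match = clf.predict_proba(latent)[:, 1]
print(prob_match[:3], prob_match[-3:])
```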

Active Learning Sampling
Regions are sampled by their:
- prediction uncertainty
- proximity to the target
- location in dense neighborhoods
- high pairwise distance
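
One way to combine these four criteria is sketched below. It is an illustration under stated assumptions, not Peax's sampling code: uncertainty is taken as 1 - |2p - 1|, the first three criteria are min-max scaled and summed with equal weights, and pairwise distance is handled by a greedy farthest-point pass over the top-scoring pool. All names and weights are hypothetical.

```python
import numpy as np

def active_sample(latent, prob_match, query_idx, labeled, n_samples=5, k=5):
    """Rank unlabeled windows by a combined heuristic score, then diversify."""
    def scale(x):
        # Min-max scale to [0, 1]; constant inputs map to 0.
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    uncertainty = 1.0 - np.abs(2.0 * prob_match - 1.0)  # 1 at p = 0.5
    proximity = 1.0 - scale(np.linalg.norm(latent - latent[query_idx], axis=1))
    pairwise = np.linalg.norm(latent[:, None] - latent[None], axis=2)
    density = 1.0 - scale(np.sort(pairwise, axis=1)[:, 1:k + 1].mean(axis=1))

    score = uncertainty + proximity + density  # equal weights are an assumption
    score[np.fromiter(labeled, dtype=int)] = -np.inf  # skip labeled windows
    pool = np.argsort(score)[::-1][:4 * n_samples]

    # Greedy farthest-point selection keeps the batch spread out.
    chosen = [pool[0]]
    while len(chosen) < n_samples:
        d = pairwise[np.ix_(pool, chosen)].min(axis=1)
        chosen.append(pool[np.argmax(d)])
    return np.array(chosen)

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))  # synthetic stand-in for encoded windows
prob_match = rng.random(300)        # stand-in for classifier probabilities
labeled = {0, 1, 2}
batch = active_sample(latent, prob_match, query_idx=0, labeled=labeled)
print(batch)
```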

Training Progress
• Progress is tracked for every trained classifier
• Uncertainty is the overall prediction probability
• Change of the prediction probability
• Convergence and divergence
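
The slide does not give formulas for these metrics, so the sketch below uses plausible stand-ins: uncertainty as the mean of 1 - |2p - 1| over all windows, and change as the mean absolute difference between the probabilities of two consecutive classifiers. Both definitions and both function names are assumptions. Convergence and divergence are left out because their definition is not stated.

```python
import numpy as np

def uncertainty(prob):
    """Mean uncertainty in [0, 1]: 0 if every p is 0 or 1, 1 if every p is 0.5."""
    prob = np.asarray(prob, dtype=float)
    return float(np.mean(1.0 - np.abs(2.0 * prob - 1.0)))

def mean_change(prev_prob, curr_prob):
    """Mean absolute change in prediction probability between two classifiers."""
    diff = np.asarray(curr_prob, dtype=float) - np.asarray(prev_prob, dtype=float)
    return float(np.mean(np.abs(diff)))

# Prediction probabilities of three consecutively trained classifiers (made up).
history = [
    np.array([0.5, 0.5, 0.5, 0.5]),
    np.array([0.7, 0.4, 0.6, 0.2]),
    np.array([0.9, 0.1, 0.9, 0.1]),
]
uncertainties = [uncertainty(p) for p in history]
changes = [mean_change(a, b) for a, b in zip(history, history[1:])]
print(uncertainties, changes)
```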

Embedding View
• 2D UMAP embedding of all encoded regions
• Probability color encoding: ⬤ means matching, ⬤ means non-matching, ⬤ means unpredictable
• View is interactive and dots are selectable

Resolve Conflicts
• Peax warns about false positives and negatives when the labels and the classifier's predictions disagree
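
Detecting such conflicts amounts to comparing each user label with the thresholded prediction for the same window. A minimal sketch, assuming labels of 0 (no match) and 1 (match) and a decision threshold of 0.5; the function name and the threshold are hypothetical.

```python
def find_conflicts(labels, prob_match, threshold=0.5):
    """Return labeled windows where the prediction contradicts the label."""
    false_pos, false_neg = [], []
    for idx, label in labels.items():
        predicted = int(prob_match[idx] >= threshold)
        if predicted == 1 and label == 0:
            false_pos.append(idx)  # predicted match, labeled non-match
        elif predicted == 0 and label == 1:
            false_neg.append(idx)  # predicted non-match, labeled match
    return false_pos, false_neg

prob_match = [0.9, 0.2, 0.8, 0.1, 0.6]  # made-up classifier probabilities
labels = {0: 1, 1: 0, 2: 0, 3: 1}       # window 4 is unlabeled
false_pos, false_neg = find_conflicts(labels, prob_match)
print(false_pos, false_neg)
```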

Explore Final Results Spatially
• The query view is interactive
• A BED-like track shows the prediction probabilities: ⬤ means matching, ⬤ means non-matching, ⬤ means unpredictable

Find Differentially Accessible Peaks Using Existing Peak Calls

Initial Classifier
• ENCODE e11.5 DNase-seq from face and hindbrain
• Differential, central, strong peak calls
• Balance positives and negatives

Resolve Conflicts
Refine Labels & Assess Recall

Explore Local Neighborhoods

Final Results Excluding Labels

CONCLUSION
• Leverage deep learning to augment human intelligence for visual pattern exploration
• Complementary to specialized feature detectors

FUTURE WORK
• Explore other types of encoders
• Evaluate different active learning strategies

Thank You!

PAPER, SOURCE CODE, MODELS, ETC.
vcg.seas.harvard.edu/pubs/peax

CONTACT
lekschas@seas.harvard.edu
@flekschas
lekschas.de

[Figure: Individual histone marks (H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9ac, H3K9me3, H3K36me3) at window sizes 3 kb, 12 kb, 120 kb]

TECHNOLOGY
Backend: Python (Flask)
Frontend: JavaScript (React)
Autoencoders: Keras (TensorFlow)
Genome Browser: HiGlass
Search Setup: JSON file

EXAMPLE SEARCH SETUP
{
  "encoders": [{
    "content_type": "dnase-seq-3kb",
    "from_file": "examples/autoencoders.json"
  }, {
    "content_type": "histone-mark-chip-seq-3kb",
    "from_file": "examples/autoencoders.json"
  }],
  "datasets": [{
    "filepath": "examples/data/ENCFF641OPE.bigWig",
    "content_type": "dnase-seq-3kb",
    "id": "encode-e11-5-limb-dnase-rdns",
    "name": "e11.5 limb DNase rdn signal"
  }, {
    "filepath": "examples/data/ENCFF336LAW.bigWig",
    "content_type": "histone-mark-chip-seq-3kb",
    "id": "encode-e11-5-limb-chip-h3k27ac-fc",
    "name": "e11.5 limb H3K27ac fc"
  }],
  "coords": "mm10",
  "chroms": ["chr12"],
  "step_freq": 2,
  "db_path": "examples/search-e11-5-limb.db"
}
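
In the example setup, each dataset names a `content_type` that also appears in the list of encoders. The sketch below parses the same JSON with Python's standard library and checks that every dataset has a matching encoder. The check itself and the helper `check_config` are hypothetical; whether Peax performs such validation is not stated in the slides.

```python
import json

CONFIG_TEXT = """
{
  "encoders": [
    {"content_type": "dnase-seq-3kb", "from_file": "examples/autoencoders.json"},
    {"content_type": "histone-mark-chip-seq-3kb", "from_file": "examples/autoencoders.json"}
  ],
  "datasets": [
    {"filepath": "examples/data/ENCFF641OPE.bigWig",
     "content_type": "dnase-seq-3kb",
     "id": "encode-e11-5-limb-dnase-rdns",
     "name": "e11.5 limb DNase rdn signal"},
    {"filepath": "examples/data/ENCFF336LAW.bigWig",
     "content_type": "histone-mark-chip-seq-3kb",
     "id": "encode-e11-5-limb-chip-h3k27ac-fc",
     "name": "e11.5 limb H3K27ac fc"}
  ],
  "coords": "mm10",
  "chroms": ["chr12"],
  "step_freq": 2,
  "db_path": "examples/search-e11-5-limb.db"
}
"""

def check_config(config):
    """Return the ids of datasets whose content_type has no matching encoder."""
    encoder_types = {enc["content_type"] for enc in config["encoders"]}
    return [
        ds["id"] for ds in config["datasets"]
        if ds["content_type"] not in encoder_types
    ]

config = json.loads(CONFIG_TEXT)
unmatched = check_config(config)
print(unmatched)  # [] for the example setup
```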
