Interactive Visual Pattern Search in Epigenomic Data with Peax: NIH ENCODE presentation

Slides from my presentation on visual pattern search in epigenomic data with Peax for the NIH ENCODE Analysis Working Group.

Project page: http://peax.lekschas.de
Paper: https://vcg.seas.harvard.edu/pubs/peax
Video introduction: https://youtu.be/FlzTdFUVE-M

Fritz Lekschas

July 30, 2020

Transcript

  1. Peax: Interactive Visual Pattern Search in Epigenomic Data Using Unsupervised Deep Representation Learning
    Fritz Lekschas, Ph.D. candidate, Harvard University
    July 27, 2020
    Brant Peterson, Daniel Haehn, Eric Ma, Nils Gehlenborg, and Hanspeter Pfister
  2. None
  3. Search

  4. Search ? ? ? ? ? ?

  5. Davis et al. (2018). The Encyclopedia of DNA elements (ENCODE): data portal update.
  6. Davis et al. (2018). The Encyclopedia of DNA elements (ENCODE): data portal update.
    Maurano et al. (2012). >90% of disease-associated variants found in GWAS are located in non-coding regions
  7. Why not just search computationally?

  8. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
  9. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
  10. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
  11. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
    Interactive visual query
  12. Search Query

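  13. Search Query
    Features: number of peaks, height of peaks, shape of peaks, position of peaks, average signal, ...
    Feature values (one row per region, in the order of the features above):
    3, 37, 0.9, 14, 5
    2, 51, 1.3, 12, 7
    4, 29, 9.1, 14, 11
    2, 41, 1.0, 14, 8
    Labels: Example
  14. Search Query
    Features: number of peaks, height of peaks, shape of peaks, position of peaks, average signal, ...
    Feature values (one row per region, in the order of the features above):
    3, 37, 0.9, 14, 5
    2, 51, 1.3, 12, 7
    4, 29, 9.1, 14, 11
    2, 41, 1.0, 14, 8
    Labels: Example, Result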
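The idea behind slides 13 and 14 can be sketched as a nearest-neighbor lookup over handcrafted feature vectors. The snippet below is only an illustration: it reads the numbers on the slide in groups of five (one value per listed feature), treats the first group as the query and the other three as candidates, and ranks them by unscaled Euclidean distance. Which row is the query and which distance is used are assumptions; the slide states neither.

```python
import math

# Feature order as on the slide: number, height, shape, and position of peaks, average signal.
query = (3, 37, 0.9, 14, 5)          # assumed to be the query region
candidates = [
    (2, 51, 1.3, 12, 7),
    (4, 29, 9.1, 14, 11),
    (2, 41, 1.0, 14, 8),
]


def euclidean(a, b):
    """Plain Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Rank candidates from most to least similar to the query.
ranked = sorted(candidates, key=lambda c: euclidean(query, c))
nearest = ranked[0]
```

Because the features live on very different scales, a real search would normally standardize them first; that step is left out here to keep the sketch short.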
  15. 1. Encoding 2. Active Learning & User Interface PEAX

  16. 1. Encoding 2. Active Learning & User

  17. 2. Active Learning & User Interface

  18. 1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

  19. 1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

  20. 1. Data Processing 2. Convolutional Autoencoder 1 0 DATA ENCODING

  21. 1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

  22. 1. Data Processing 2. Convolutional Autoencoder DATA ENCODING

  23. Trained 6 autoencoders: 3 window sizes × 2 data types
    Data types: DNase, histone mark ChIP
    Window and bin sizes: 3 kb (25 bp), 12 kb (100 bp), 120 kb (1000 bp)
    120 DNase-seq datasets from ENCODE
    49 histone mark ChIP-seq experiments from Roadmap Epigenomics: H3K4me1/me3, H3K27ac/me3, H3K9ac/me3, H3K36me3
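The three window and bin size pairs on slide 23 all work out to the same number of bins per window. A quick check follows; the `bin_signal` helper is hypothetical and uses mean aggregation purely for illustration, since the slide does not say how the signal is aggregated into bins.

```python
# (window size in bp, bin size in bp), as listed on the slide
WINDOWS = {
    "3 kb": (3_000, 25),
    "12 kb": (12_000, 100),
    "120 kb": (120_000, 1_000),
}


def bins_per_window(window_bp, bin_bp):
    """Number of bins that exactly tile one window."""
    if window_bp % bin_bp != 0:
        raise ValueError("bin size must divide the window size")
    return window_bp // bin_bp


def bin_signal(signal, bin_bp):
    """Average a per-base signal into consecutive, non-overlapping bins.

    Hypothetical helper: mean aggregation is an assumption.
    """
    return [
        sum(signal[i:i + bin_bp]) / bin_bp
        for i in range(0, len(signal) - bin_bp + 1, bin_bp)
    ]


bins = {name: bins_per_window(w, b) for name, (w, b) in WINDOWS.items()}
binned = bin_signal([1.0] * 3_000, 25)  # a flat 3 kb toy signal
```

Every entry in `bins` is 120, so under this reading each autoencoder receives an input vector of the same length regardless of the genomic window size.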
  24. 3 kb 12 kb 120 kb DNase-seq ChIP-seq

  25. R² by window size (3 kb, 12 kb, 120 kb)
    DNase-seq: .98, .90, .78
    ChIP-seq: .84, .69, .73
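Slide 25 reports R² values without a definition. Presumably they compare the autoencoder reconstructions with the original windows, but that reading is an assumption. For reference, a minimal sketch of the standard coefficient of determination:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant observed signal")
    return 1.0 - ss_res / ss_tot


window = [0.0, 1.0, 2.0, 3.0]             # toy binned signal
perfect = r_squared(window, window)        # reconstruction equals the input
baseline = r_squared(window, [1.5] * 4)    # always predicting the mean
```

A perfect reconstruction scores 1, and a reconstruction that only predicts the mean of the window scores 0.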
  26. Is the learned encoding useful for similarity search? Yes!

  27. Is the learned encoding useful for similarity search? Yes! Yes!

  28. Ours: CAE ED SAX DTW XCORR UMAP TSFRESH

  29. Ours: CAE ED UMAP TSFRESH XCORR DTW SAX

  30. Ours: CAE ED UMAP TSFRESH XCORR DTW SAX

  31. None
  32. Query view Reconstructions

  33. List view Labeling Tabs

  34. Embedding

  35. Progress

  36. None
  37. Find Asymmetrical Peaks
    Video screencast available at youtu.be/FlzTdFUVE-M?t=220

  38. Select Pattern for Querying
    Steps: Select Pattern for Querying, Initial Sampling, Binary Labeling, Train First Classifier, Active Learning Sampling, Training Progress, Embedding View, Resolve Conflicts, Explore Final Results Spatially
    Freely browse and select a region for querying
    Size of the selected region is fixed and based on the autoencoder
    We use HiGlass (Kerpedjiev et al. 2018) as the genome browser
  39. Initial Sampling
    Increase distance of samples to the query
    Sample regions in dense areas
    Maximize pairwise distance between samples
    All in the latent space
  40. Initial Sampling
    Increase distance of samples to the query
    Sample regions in dense areas
    Maximize pairwise distance between samples
    All in the latent space
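One way to read "maximize pairwise distance between samples" is greedy farthest-point selection in the latent space. The sketch below shows only that criterion on 2D toy vectors; the distance-to-query and density criteria are not modeled, and nothing here is taken from the Peax source code.

```python
import math


def farthest_point_sample(points, k, start=0):
    """Greedily pick up to k indices, each as far as possible from those already picked."""
    chosen = [start]
    # Distance from every point to its closest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [
            min(d, math.dist(p, points[nxt]))
            for d, p in zip(min_dist, points)
        ]
    return chosen


# Toy latent vectors; index 0 stands in for the query region.
latent = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
picked = farthest_point_sample(latent, k=3)
```

In this toy case the point at index 1, which sits almost on top of the starting point, is never picked, because it adds little diversity to the sample.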
  41. Binary Labeling
    Select regions that match and do not match the query
    Inconclusive regions can simply be skipped
  42. Train First Classifier
    A random forest classifier is trained online with the labels
    Each time a new set of samples is requested, a new classifier is trained
    A new classifier can also be trained in between, after labels have changed
  43. Active Learning Sampling
    Regions are sampled by their:
    - prediction uncertainty
    - proximity to the target
    - location in dense neighborhoods
    - high pairwise distance
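The first criterion on slide 43, prediction uncertainty, is the classic uncertainty-sampling step of active learning. A minimal sketch, assuming the classifier outputs a match probability per region and that uncertainty peaks at 0.5; the scoring formula is an assumption, and the other three criteria are not modeled here.

```python
def uncertainty(p):
    """1.0 when the classifier is undecided (p = 0.5), 0.0 at p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(p - 0.5)


def most_uncertain(probabilities, k):
    """Indices of the k regions whose predictions are closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: uncertainty(probabilities[i]),
        reverse=True,
    )
    return ranked[:k]


probs = [0.95, 0.52, 0.10, 0.45, 0.70]   # toy prediction probabilities
to_label = most_uncertain(probs, k=2)
```

Here `to_label` contains the regions with probabilities 0.52 and 0.45, the two that the classifier is least sure about.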
  44. Training Progress
    Progress is tracked for every trained classifier
    Uncertainty is the overall prediction probability
    Change of the prediction probability
    Convergence and divergence
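The slide names the progress signals without defining them. The sketch below shows one plausible reading of the first two: mean uncertainty per classifier and mean absolute change in prediction probability between consecutive classifiers. Both formulas are assumptions, and convergence and divergence are not modeled.

```python
def mean_uncertainty(probs):
    """Average of 1 - 2*|p - 0.5| over all regions (1.0 = fully undecided)."""
    return sum(1.0 - 2.0 * abs(p - 0.5) for p in probs) / len(probs)


def mean_probability_change(previous, current):
    """Average absolute change in prediction probability between two classifiers."""
    return sum(abs(c - p) for p, c in zip(previous, current)) / len(current)


# Toy prediction probabilities for four regions after three training rounds.
history = [
    [0.50, 0.50, 0.50, 0.50],
    [0.75, 0.25, 0.50, 0.50],
    [1.00, 0.00, 0.75, 0.25],
]

uncertainties = [mean_uncertainty(p) for p in history]
changes = [mean_probability_change(a, b) for a, b in zip(history, history[1:])]
```

In this toy run the uncertainty falls from 1.0 to 0.25 as labels accumulate, which is the kind of trend a progress view would make visible.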
  45. Embedding View
    2D UMAP embedding of all encoded regions
    Probability color encoder:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable
    View is interactive and dots are selectable
  46. Embedding View
    2D UMAP embedding of all encoded regions
    Probability color encoder:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable
    View is interactive and dots are selectable
  47. Resolve Conflicts
    Peax warns about false positives and negatives when the labels and the classifier's predictions disagree
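The conflict check described on slide 47 amounts to comparing each user label with the classifier's prediction for the same region. A minimal sketch, assuming a 0.5 decision threshold and dictionary inputs keyed by a region id; both are illustrative choices, not Peax's actual data model.

```python
def find_conflicts(labels, probs, threshold=0.5):
    """Return (false_positives, false_negatives) among the labeled regions.

    labels: region id -> True (matches the query) or False (does not match)
    probs:  region id -> predicted probability of a match
    """
    false_positives = [
        r for r, is_match in labels.items()
        if not is_match and probs[r] >= threshold
    ]
    false_negatives = [
        r for r, is_match in labels.items()
        if is_match and probs[r] < threshold
    ]
    return false_positives, false_negatives


labels = {"r1": True, "r2": False, "r3": True, "r4": False}
probs = {"r1": 0.9, "r2": 0.8, "r3": 0.2, "r4": 0.1}
fp, fn = find_conflicts(labels, probs)
```

Region `r2` is labeled as non-matching but predicted to match, and `r3` is the reverse, so both would be flagged for the user to review.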
  48. Explore Final Results Spatially
    The query view is interactive
    A BED-like track shows the prediction probabilities:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable
  49. Find Differentially Accessible Peaks Using Existing Peak Calls

  50. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier
  51. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier
  52. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier
  53. Resolve Conflicts Refine Labels & Assess Recall

  54. Explore Local Neighborhoods

  55. Final Results Excluding Labels

  56. CONCLUSION
    Leverage deep learning to augment human intelligence for visual pattern exploration
    Complementary to specialized feature detectors
    FUTURE WORK
    Explore other types of encoders
    Evaluate different active learning strategies
  57. Thank You!
    PAPER, SOURCE CODE, MODELS, ETC.: vcg.seas.harvard.edu/pubs/peax
    CONTACT: lekschas@seas.harvard.edu, @flekschas, lekschas.de
  58. Individual histone marks: H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9ac, H3K9me3, H3K36me3
    Window sizes: 3 kb, 12 kb, 120 kb
  59. TECHNOLOGY
    Backend: Python (Flask)
    Frontend: JavaScript (React)
    Autoencoders: Keras (TensorFlow)
    Genome Browser: HiGlass
    Search Setup: JSON file
    EXAMPLE SEARCH SETUP
    {
      "encoders": [
        {
          "content_type": "dnase-seq-3kb",
          "from_file": "examples/autoencoders.json"
        },
        {
          "content_type": "histone-mark-chip-seq-3kb",
          "from_file": "examples/autoencoders.json"
        }
      ],
      "datasets": [
        {
          "filepath": "examples/data/ENCFF641OPE.bigWig",
          "content_type": "dnase-seq-3kb",
          "id": "encode-e11-5-limb-dnase-rdns",
          "name": "e11.5 limb DNase rdn signal"
        },
        {
          "filepath": "examples/data/ENCFF336LAW.bigWig",
          "content_type": "histone-mark-chip-seq-3kb",
          "id": "encode-e11-5-limb-chip-h3k27ac-fc",
          "name": "e11.5 limb H3K27ac fc"
        }
      ],
      "coords": "mm10",
      "chroms": ["chr12"],
      "step_freq": 2,
      "db_path": "examples/search-e11-5-limb.db"
    }
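In the example search setup, each dataset names a `content_type` that also appears in the `encoders` list. The sketch below parses the configuration from the slide and checks that pairing. Whether Peax itself performs such a check is not stated in the deck, so treat this as a hypothetical sanity check rather than a description of its behavior.

```python
import json

# The search setup exactly as shown on the slide.
CONFIG_TEXT = """
{
  "encoders": [
    {"content_type": "dnase-seq-3kb", "from_file": "examples/autoencoders.json"},
    {"content_type": "histone-mark-chip-seq-3kb", "from_file": "examples/autoencoders.json"}
  ],
  "datasets": [
    {
      "filepath": "examples/data/ENCFF641OPE.bigWig",
      "content_type": "dnase-seq-3kb",
      "id": "encode-e11-5-limb-dnase-rdns",
      "name": "e11.5 limb DNase rdn signal"
    },
    {
      "filepath": "examples/data/ENCFF336LAW.bigWig",
      "content_type": "histone-mark-chip-seq-3kb",
      "id": "encode-e11-5-limb-chip-h3k27ac-fc",
      "name": "e11.5 limb H3K27ac fc"
    }
  ],
  "coords": "mm10",
  "chroms": ["chr12"],
  "step_freq": 2,
  "db_path": "examples/search-e11-5-limb.db"
}
"""

config = json.loads(CONFIG_TEXT)

# Content types for which an encoder is declared.
encoder_types = {enc["content_type"] for enc in config["encoders"]}

# Datasets whose content type has no declared encoder (expected to be empty).
missing = [
    ds["id"] for ds in config["datasets"]
    if ds["content_type"] not in encoder_types
]
```

An empty `missing` list means every dataset in the setup can be paired with a declared encoder of the same content type.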