Peax Interactive Concept Learning for Visual Exploration of Epigenetic Patterns

Epigenomic data contains a rich body of diverse patterns that help identify regulatory elements such as promoters and enhancers, but finding these patterns reliably genome-wide is challenging. Peax is a tool for interactive visual pattern search and exploration of epigenomic patterns based on unsupervised representation learning with convolutional autoencoders. The visual search is driven by manually labeled genomic regions, which are used to actively learn a classifier that reflects your notion of interestingness.

Fritz Lekschas

April 17, 2019

Transcript

  1. Peax: Interactive Concept Learning for Visual Exploration of Epigenetic Patterns
    Fritz Lekschas, Ph.D. candidate, Harvard University
    Brand Peterson, Daniel Haehn, Eric Ma, Nils Gehlenborg, and Hanspeter Pfister
    Bio-IT World, April 17, 2019
  3. Search

  4. Search ? ? ? ? ? ?

  6. The Epigenome

  7. Davis et al. (2018). The Encyclopedia of DNA Elements (ENCODE): data portal update.
  9. Davis et al. (2018). The Encyclopedia of DNA Elements (ENCODE): data portal update.
    Maurano et al. (2012): >90% of disease-associated variants found in GWAS are located in non-coding regions.
  10. Why not just search computationally?

  11. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
  12. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
  14. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
    Interactive visual query
  15. How does Peax find similar patterns?

  16. 1. Encoding 2. Active Learning & User Interface

  17. 1. Data Processing 2. Convolutional Autoencoder

  18. Trained 6 autoencoders: 3 window sizes × 2 data types
    • Data types: DNase, histone mark ChIP
    • Window and bin sizes: 3 kb (25 bp), 12 kb (100 bp), 120 kb (1000 bp)
    • 120 DNase-seq datasets from ENCODE
    • 49 histone mark ChIP-seq experiments from Roadmap Epigenomics: H3K4me1/me3, H3K27ac/me3, H3K9ac/me3, H3K36me3
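The window and bin sizes on this slide imply that every autoencoder sees the same input length: 3 kb / 25 bp, 12 kb / 100 bp, and 120 kb / 1000 bp each give 120 bins per window. The following is a minimal sketch of that arithmetic. The helper `bin_signal` and its mean aggregation are assumptions made for illustration; the slide does not say how Peax aggregates the signal within a bin.

```python
# Window sizes mapped to bin sizes (both in base pairs), as listed on the slide.
WINDOWS = {3_000: 25, 12_000: 100, 120_000: 1_000}


def bins_per_window(window_bp: int, bin_bp: int) -> int:
    """Return how many bins a window of `window_bp` base pairs is split into."""
    if window_bp % bin_bp != 0:
        raise ValueError("window size must be a multiple of the bin size")
    return window_bp // bin_bp


def bin_signal(signal: list[float], bin_bp: int) -> list[float]:
    """Hypothetical helper: average a per-base-pair signal into fixed-size bins.

    Mean aggregation is an assumption, not something stated in the talk.
    """
    if len(signal) % bin_bp != 0:
        raise ValueError("signal length must be a multiple of the bin size")
    return [
        sum(signal[i:i + bin_bp]) / bin_bp
        for i in range(0, len(signal), bin_bp)
    ]


# Every configuration ends up with the same number of bins per window.
bin_counts = {w: bins_per_window(w, b) for w, b in WINDOWS.items()}
```

Because all three configurations yield the same number of bins, one network architecture could in principle be reused across window sizes, with only the genomic resolution changing.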
  19. DNase-seq R²: 0.98 (3 kb), 0.90 (12 kb), 0.78 (120 kb)
  20. Average histone mark ChIP-seq R²: 0.84 (3 kb), 0.69 (12 kb), 0.73 (120 kb)
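The slides do not define R2. Assuming it denotes the coefficient of determination between an input window and its autoencoder reconstruction, it could be computed as in this sketch. The function name `r_squared` is hypothetical.

```python
def r_squared(original: list[float], reconstructed: list[float]) -> float:
    """Coefficient of determination between a signal and its reconstruction.

    Assumes both sequences hold one value per bin of the same window.
    """
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(original) / len(original)
    # Residual sum of squares: error left after reconstruction.
    ss_res = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    # Total sum of squares: variance of the original signal around its mean.
    ss_tot = sum((o - mean) ** 2 for o in original)
    if ss_tot == 0:
        raise ValueError("R² is undefined for a constant signal")
    return 1.0 - ss_res / ss_tot
```

Under this definition, a perfect reconstruction scores 1, and a reconstruction that only reproduces the window's mean scores 0.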
  21. Is the learned encoding useful for similarity search? Yes!

  22. Is the learned encoding useful for similarity search? Yes! Yes!

  23. Ours: CAE ED SAX DTW XCORR UMAP TSFRESH

  24. Ours: CAE ED UMAP TSFRESH XCORR DTW SAX

  27. Query view Reconstructions

  28. List view Labeling Tabs

  29. Embedding

  30. Progress

  32. Find Asymmetrical Peaks Using a Visual Query as the Start

  33. Select Pattern for Querying
    Steps: Select Pattern for Querying, Initial Sampling, Binary Labeling, Train First Classifier, Active Learning Sampling, Training Progress, Embedding View, Resolve Conflicts, Explore Final Results Spatially
    • Freely browse and select a region for querying
    • The size of the selected region is fixed and determined by the autoencoder
    • We use HiGlass (Kerpedjiev et al. 2018) as the genome browser
  35. Initial Sampling
    • Increase distance of samples to the query
    • Sample regions in dense areas
    • Maximize pairwise distance between samples
    • All in the latent space
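One of the criteria on this slide, maximizing the pairwise distance between samples in the latent space, can be approximated with greedy farthest-point selection. This is a generic technique shown for illustration, not necessarily how Peax implements its sampling. The function `farthest_point_sample` and the toy 2D latent vectors are hypothetical.

```python
import math


def farthest_point_sample(points, k, start=0):
    """Greedily pick k point indices that are spread out in latent space.

    Each new pick is the unselected point whose distance to its nearest
    already-selected point is largest (max-min criterion).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    chosen = {start}
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        candidates = (i for i in range(len(points)) if i not in chosen)
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy latent vectors (assumed 2D for readability; real encodings are larger).
latent = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.0, 10.0), (0.5, 0.5)]
picked = farthest_point_sample(latent, 3)
```

The other criteria on the slide (distance to the query and local density) would need additional terms; they are left out here to keep the sketch focused.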
  36. Binary Labeling
    • Select regions that match and do not match the query
    • Inconclusive regions can simply be skipped
  37. Train First Classifier
    • A random forest classifier is trained online with the labels
    • Each time a new set of samples is requested, a new classifier is trained
    • A new classifier can also be trained in between, after labels have changed
  38. Active Learning Sampling
    Regions are sampled by their:
    - prediction uncertainty
    - proximity to the target
    - location in dense neighborhoods
    - high pairwise distance
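The first criterion, prediction uncertainty, can be sketched as follows. The scoring formula is an assumption: a common choice for binary classifiers is to treat probabilities near 0.5 as most uncertain. The names `uncertainty` and `most_uncertain` and the region identifiers are hypothetical, and the other three criteria are not modeled.

```python
def uncertainty(probability: float) -> float:
    """Map a prediction probability to [0, 1], where 1 is most uncertain (p = 0.5)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    return 1.0 - 2.0 * abs(probability - 0.5)


def most_uncertain(probabilities: dict[str, float], k: int) -> list[str]:
    """Return the k region ids whose predictions are closest to 0.5."""
    ranked = sorted(
        probabilities,
        key=lambda region: uncertainty(probabilities[region]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical prediction probabilities for four unlabeled regions.
predictions = {"a": 0.9, "b": 0.55, "c": 0.1, "d": 0.4}
to_label = most_uncertain(predictions, 2)
```

Labeling the regions the classifier is least sure about tends to give the most information per label, which is the general motivation behind uncertainty sampling.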
  39. Training Progress
    • Progress is tracked for every trained classifier
    • Uncertainty is the overall prediction probability
    • Change of the prediction probability
    • Convergence and divergence
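How the change of the prediction probability is measured is not specified on the slide. One plausible reading, sketched here as an assumption, is the mean absolute difference between the predictions of two consecutive classifiers over the same regions. The function name `mean_probability_change` is hypothetical.

```python
def mean_probability_change(previous: list[float], current: list[float]) -> float:
    """Mean absolute change in prediction probability between two classifiers.

    Both lists are assumed to hold probabilities for the same regions,
    in the same order.
    """
    if not previous or len(previous) != len(current):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(c - p) for p, c in zip(previous, current)) / len(previous)
```

A value that shrinks from one classifier to the next would suggest the predictions are stabilizing, while a growing value would suggest the new labels are still shifting the model.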
  40. Embedding View
    • 2D UMAP embedding of all encoded regions
    • Probability color encoding:
      ⬤ means matching
      ⬤ means non-matching
      ⬤ means unpredictable
    • View is interactive and dots are selectable
  42. Resolve Conflicts
    • Peax warns about false positives and negatives when the labels and the classifier's predictions disagree
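The conflict check described on this slide can be sketched as a comparison between user labels and thresholded predictions. The 0.5 threshold, the function name `find_conflicts`, and the region identifiers are assumptions for illustration; the talk does not state how Peax decides that a prediction disagrees with a label.

```python
def find_conflicts(
    labels: dict[str, bool],
    probabilities: dict[str, float],
    threshold: float = 0.5,
) -> tuple[list[str], list[str]]:
    """Return (false_positives, false_negatives) among labeled regions.

    A false positive is labeled non-matching but predicted matching;
    a false negative is labeled matching but predicted non-matching.
    """
    false_positives = []
    false_negatives = []
    for region, is_match in labels.items():
        predicted_match = probabilities[region] >= threshold
        if predicted_match and not is_match:
            false_positives.append(region)
        elif is_match and not predicted_match:
            false_negatives.append(region)
    return false_positives, false_negatives


# Hypothetical labels and classifier outputs for four regions.
user_labels = {"r1": True, "r2": False, "r3": True, "r4": False}
predicted = {"r1": 0.9, "r2": 0.8, "r3": 0.2, "r4": 0.1}
fp, fn = find_conflicts(user_labels, predicted)
```

Surfacing these regions lets the user either correct a mislabeled region or add more labels nearby so that the next classifier resolves the disagreement.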
  43. Explore Final Results Spatially
    • The query view is interactive
    • A BED-like track shows the prediction probabilities:
      ⬤ means matching
      ⬤ means non-matching
      ⬤ means unpredictable
  44. Find Differentially Accessible Peaks Using Existing Peak Calls

  45. ENCODE e11.5 DNase-seq from face and hindbrain
    • Differential, central, strong peak calls
    • Balance positives and negatives
    • Initial classifier
  48. Resolve Conflicts Refine Labels & Assess Recall

  49. Explore Local Neighborhoods

  50. Final Results Excluding Labels

  51. CONCLUSION
    • Leverage machine learning to improve visual exploratory pattern search
    • Generative models work well as they show the encoding quality
    FUTURE WORK
    • Explore other types of encoders (VAE, GAN, LSTM, etc.)
    • Compare and evaluate different active learning strategies
  52. Thank You!
    PREPRINT, SOURCE CODE, MODELS, ETC.: peax.lekschas.de
    CONTACT: lekschas@seas.harvard.edu, @flekschas, lekschas.de