Peax: Interactive Concept Learning for Visual Exploration of Epigenetic Patterns

Epigenomic data contain a rich body of diverse patterns that help identify regulatory elements such as promoters and enhancers, but finding these patterns reliably genome-wide is challenging. Peax is a tool for interactive visual search and exploration of epigenomic patterns, based on unsupervised representation learning with convolutional autoencoders. The search is driven by manually labeled genomic regions, which are used to actively learn a classifier that reflects your notion of interestingness.

Fritz Lekschas

April 17, 2019

Transcript

  1. Peax
    Interactive Concept Learning for Visual
    Exploration of Epigenetic Patterns
    Fritz Lekschas, Ph.D. candidate
    Harvard University
    Brant Peterson, Daniel Haehn, Eric Ma, Nils Gehlenborg, and Hanspeter Pfister
    Bio-IT World
    April 17, 2019

  3. Search

  4. Search
    ? ? ? ? ? ?

  6. The Epigenome

  7. Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.

  9. >90% of disease-associated variants found in GWAS are located in
    non-coding regions (Maurano et al. 2012).
    Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.

  10. Why not just search
    computationally?

  14. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
    Interactive visual query

  15. How does Peax
    find similar patterns?

  16. 1. Encoding
    2. Active Learning & User Interface

  17. 1. Data Processing
    2. Convolutional Autoencoder

  18. Trained 6 autoencoders: 3 window sizes × 2 data types
    Data types: DNase-seq, histone mark ChIP-seq
    Window sizes (bin sizes): 3 kb (25 bp), 12 kb (100 bp), 120 kb (1000 bp)
    120 DNase-seq datasets from ENCODE
    49 histone mark ChIP-seq experiments from Roadmap Epigenomics:
    H3K4me1/me3, H3K27ac/me3, H3K9ac/me3, H3K36me3
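
The window and bin sizes on this slide can be checked with a short sketch. The helper names `bins_per_window` and `bin_signal` are hypothetical, and averaging the signal within each bin is an assumption; the slide does not say how values are aggregated.

```python
from statistics import fmean

# Window and bin sizes in base pairs, as listed on the slide.
WINDOW_CONFIGS = {
    "3 kb": (3_000, 25),
    "12 kb": (12_000, 100),
    "120 kb": (120_000, 1_000),
}


def bins_per_window(window_bp, bin_bp):
    """Number of bins that one window is split into."""
    return window_bp // bin_bp


def bin_signal(signal, bin_bp):
    """Average a per-base-pair signal into consecutive bins (assumed aggregation)."""
    if len(signal) % bin_bp != 0:
        raise ValueError("signal length must be a multiple of the bin size")
    return [fmean(signal[i:i + bin_bp]) for i in range(0, len(signal), bin_bp)]


for name, (window_bp, bin_bp) in WINDOW_CONFIGS.items():
    print(name, "->", bins_per_window(window_bp, bin_bp), "bins")
```

All three configurations divide into 120 bins per window, which would give every autoencoder an input of the same length.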

  19. DNase-seq
    Window size   3 kb   12 kb   120 kb
    R²            .98    .90     .78

  20. Average histone mark ChIP-seq
    Window size   3 kb   12 kb   120 kb
    R²            .84    .69     .73
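
For readers unfamiliar with the metric, here is a minimal sketch of how an R² score between an original window and its reconstruction could be computed. It assumes the standard coefficient of determination; the slides do not state the exact formula used, and `r_squared` is a hypothetical helper.

```python
def r_squared(original, reconstruction):
    """Coefficient of determination between a signal and its reconstruction."""
    if len(original) != len(reconstruction):
        raise ValueError("signals must have the same length")
    mean = sum(original) / len(original)
    # Residual sum of squares: what the reconstruction gets wrong.
    ss_res = sum((o - r) ** 2 for o, r in zip(original, reconstruction))
    # Total sum of squares: variance of the original around its mean.
    ss_tot = sum((o - mean) ** 2 for o in original)
    if ss_tot == 0:
        raise ValueError("original signal is constant; R² is undefined")
    return 1.0 - ss_res / ss_tot


print(r_squared([0.0, 1.0, 4.0, 1.0], [0.0, 1.0, 3.0, 1.0]))
```

A perfect reconstruction scores 1, and a reconstruction no better than the mean of the original scores 0.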

  21. Is the learned encoding useful for similarity search?
    Yes!
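
A minimal sketch of what similarity search over encoded windows can look like, assuming each region is represented by a latent vector and compared by Euclidean distance. The vectors are toy values and `nearest_neighbors` is a hypothetical helper, not Peax's API.

```python
import math


def nearest_neighbors(query, encodings, k=3):
    """Return the indices of the k encodings closest to the query vector."""
    order = sorted(
        range(len(encodings)),
        key=lambda i: math.dist(query, encodings[i]),
    )
    return order[:k]


# Toy 2D latent vectors standing in for autoencoder encodings.
encodings = [[0.0, 0.0], [1.0, 0.0], [0.1, 0.1], [5.0, 5.0]]
print(nearest_neighbors([0.0, 0.2], encodings, k=2))
```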

  23. [Figure: methods compared. Ours: CAE, ED, SAX, DTW, XCORR, UMAP, TSFRESH]

  27. Query view
    Reconstructions

  28. List view
    Labeling
    Tabs

  29. Embedding

  30. Progress

  32. Find Asymmetrical Peaks
    Using a Visual Query as the Start

  33. Select Pattern for Querying
    Initial Sampling
    Binary Labeling
    Train First Classifier
    Active Learning Sampling
    Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    • Freely browse and select a region for querying
    • Size of the selected region is fixed and based on the autoencoder
    • We use HiGlass (Kerpedjiev et al. 2018) as the genome browser

  35. Initial Sampling
    • Increase distance of samples to the query
    • Sample regions in dense areas
    • Maximize pairwise distance between samples
    • All in the latent space
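
The three sampling criteria can be combined in many ways. The sketch below is one greedy reading of them, not the procedure Peax uses: it keeps only points with enough neighbors (dense areas), starts at the dense point nearest the query, and then repeatedly adds the point farthest from everything already selected, which also moves samples away from the query. The names `density` and `initial_sample`, and the `radius` and `min_neighbors` parameters, are hypothetical.

```python
import math


def density(i, points, radius):
    """Number of other points within `radius` of point i."""
    return sum(
        1
        for j, p in enumerate(points)
        if j != i and math.dist(points[i], p) <= radius
    )


def initial_sample(query, encodings, n_samples, radius=1.0, min_neighbors=1):
    """Greedy sketch: dense points only, nearest to the query first, then spread out."""
    dense = [
        i for i in range(len(encodings))
        if density(i, encodings, radius) >= min_neighbors
    ]
    remaining = sorted(dense, key=lambda i: math.dist(query, encodings[i]))
    if not remaining or n_samples < 1:
        return []
    selected = [remaining.pop(0)]
    while remaining and len(selected) < n_samples:
        # Farthest-point step: maximize the minimum distance to selected samples.
        best = max(
            remaining,
            key=lambda i: min(math.dist(encodings[i], encodings[s]) for s in selected),
        )
        remaining.remove(best)
        selected.append(best)
    return selected


points = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0], [3.5, 0.0], [10.0, 10.0]]
print(initial_sample([0.0, 0.0], points, n_samples=2))
```

The isolated point at `[10.0, 10.0]` is never sampled because it has no neighbors within the radius.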

  36. Binary Labeling
    • Select regions that match and do not match the query
    • Inconclusive regions can simply be skipped

  37. Train First Classifier
    • A random forest classifier is trained online with the labels
    • Each time a new set of samples is requested, a new classifier is trained
    • A new classifier can also be trained in between, after labels have changed

  38. Active Learning Sampling
    Regions are sampled by their:
    - prediction uncertainty
    - proximity to the target
    - location in dense neighborhoods
    - high pairwise distance to each other
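
A minimal sketch of how such criteria might be combined, under stated assumptions: uncertainty is taken as closeness of the predicted probability to 0.5, uncertainty and proximity are weighted equally, diversity is enforced with a minimum gap between picks, and the density term is left out for brevity. The function names and the `min_gap` parameter are hypothetical; the slides do not give the actual scoring.

```python
import math


def uncertainty(prob):
    """1.0 when the classifier is undecided (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(2.0 * prob - 1.0)


def active_sample(query, encodings, probs, n_samples, min_gap=0.5):
    """Rank regions by uncertainty plus proximity, then keep a spread-out subset."""

    def score(i):
        proximity = 1.0 / (1.0 + math.dist(query, encodings[i]))
        return uncertainty(probs[i]) + proximity

    ranked = sorted(range(len(encodings)), key=score, reverse=True)
    selected = []
    for i in ranked:
        if len(selected) >= n_samples:
            break
        # Skip candidates that sit too close to an already selected region.
        if all(math.dist(encodings[i], encodings[s]) >= min_gap for s in selected):
            selected.append(i)
    return selected


latent = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [4.0, 0.0]]
probabilities = [0.5, 0.5, 0.9, 0.5]
print(active_sample([0.0, 0.0], latent, probabilities, n_samples=2))
```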

  39. Training Progress
    • Progress is tracked for every trained classifier
    • Uncertainty is the overall prediction probability
    • Change of the prediction probability
    • Convergence and divergence
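
To make these progress measures concrete, here is a sketch that compares the predictions of two consecutive classifiers. The definitions are assumptions chosen for illustration: uncertainty as mean closeness to 0.5, change as mean absolute difference in probability, and `flip_rate` as one possible reading of divergence (with convergence as its complement). Peax's exact definitions may differ.

```python
def mean_uncertainty(probs):
    """Average closeness of the predictions to 0.5 (1.0 = fully undecided)."""
    return sum(1.0 - abs(2.0 * p - 1.0) for p in probs) / len(probs)


def mean_change(prev_probs, curr_probs):
    """Average absolute change in prediction probability between classifiers."""
    return sum(abs(c - p) for p, c in zip(prev_probs, curr_probs)) / len(curr_probs)


def flip_rate(prev_probs, curr_probs, threshold=0.5):
    """Fraction of regions whose predicted class flipped between classifiers."""
    flipped = sum(
        (p >= threshold) != (c >= threshold)
        for p, c in zip(prev_probs, curr_probs)
    )
    return flipped / len(curr_probs)


previous = [0.2, 0.6, 0.9, 0.4]
current = [0.1, 0.4, 1.0, 0.6]
print(mean_uncertainty(current), mean_change(previous, current), flip_rate(previous, current))
```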

  40. Embedding View
    • 2D UMAP embedding of all encoded regions
    • Probability color encoding:
      ⬤ means matching
      ⬤ means non-matching
      ⬤ means unpredictable
    • View is interactive and dots are selectable
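
The three legend categories suggest a mapping from prediction probability to a display class. A sketch of such a mapping follows; the `margin` around 0.5 is an assumed value for illustration, since the slides do not give the thresholds, and `prediction_category` is a hypothetical helper.

```python
def prediction_category(prob, margin=0.15):
    """Map a prediction probability to one of the three legend categories."""
    if prob >= 0.5 + margin:
        return "matching"
    if prob <= 0.5 - margin:
        return "non-matching"
    # Probabilities near 0.5 are treated as undecided.
    return "unpredictable"


for p in (0.9, 0.5, 0.1):
    print(p, prediction_category(p))
```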

  42. Resolve Conflicts
    • Peax warns about false positives and negatives when the labels and the classifier's predictions disagree
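
Detecting such conflicts amounts to comparing each user label with the classifier's prediction for the same region. A minimal sketch, assuming labels are stored as a mapping from region index to a boolean and that 0.5 is the decision threshold; `find_conflicts` is a hypothetical helper.

```python
def find_conflicts(labels, probs, threshold=0.5):
    """Return (false_positives, false_negatives) as lists of region indices.

    labels: dict mapping region index -> True (match) or False (no match)
    probs:  list of predicted match probabilities, one per region
    """
    false_positives = [
        i for i, is_match in labels.items()
        if not is_match and probs[i] >= threshold
    ]
    false_negatives = [
        i for i, is_match in labels.items()
        if is_match and probs[i] < threshold
    ]
    return false_positives, false_negatives


user_labels = {0: True, 1: True, 2: False, 3: False}
predicted = [0.9, 0.2, 0.1, 0.8, 0.5]
print(find_conflicts(user_labels, predicted))
```

Region 4 has no label, so it can never be reported as a conflict.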

  43. Explore Final Results Spatially
    • The query view is interactive
    • A BED-like track shows the prediction probabilities:
      ⬤ means matching
      ⬤ means non-matching
      ⬤ means unpredictable

  44. Find Differentially Accessible Peaks
    Using Existing Peak Calls

  45. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier
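
One common way to balance positives and negatives before training is to downsample the larger class. The sketch below assumes that approach; the slide does not say how the balancing was done, and `balance` and its `seed` parameter are hypothetical.

```python
import random


def balance(positives, negatives, seed=0):
    """Downsample both classes to the size of the smaller one."""
    rng = random.Random(seed)  # Fixed seed so the sketch is reproducible.
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)


pos_regions = ["peak_a", "peak_b"]
neg_regions = ["bg_1", "bg_2", "bg_3", "bg_4", "bg_5"]
balanced_pos, balanced_neg = balance(pos_regions, neg_regions)
print(len(balanced_pos), len(balanced_neg))
```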

  48. Resolve Conflicts
    Refine Labels & Assess Recall

  49. Explore Local Neighborhoods

  50. Final Results Excluding Labels

  51. CONCLUSION
    Leverage machine learning to improve visual exploratory pattern search
    Generative models work well as they show the encoding quality
    FUTURE WORK
    Explore other types of encoders (VAE, GAN, LSTM, etc.)
    Compare and evaluate different active learning strategies

  52. Thank You!
    PREPRINT, SOURCE CODE, MODELS, ETC.
    peax.lekschas.de
    CONTACT
    [email protected]
    @flekschas
    lekschas.de
