Peax: Interactive Concept Learning for Visual Exploration of Epigenetic Patterns

Epigenomic data contains a rich body of diverse patterns that help identify regulatory elements such as promoters and enhancers, but finding these patterns reliably genome-wide is challenging. Peax is a tool for interactive visual pattern search and exploration of epigenomic patterns, based on unsupervised representation learning with convolutional autoencoders. The visual search is driven by manually labeled genomic regions, which are used to actively learn a classifier that reflects your notion of interestingness.

Fritz Lekschas

April 17, 2019

Transcript

  1. Peax
    Interactive Concept Learning for Visual Exploration of Epigenetic Patterns
    Fritz Lekschas, Ph.D. candidate, Harvard University
    Brand Peterson, Daniel Haehn, Eric Ma, Nils Gehlenborg, and Hanspeter Pfister
    Bio-IT World, April 17, 2019

  2. Search
    ? ? ? ? ? ?

  3. The Epigenome

  4. Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.

  5. Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.

  6. Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.
    >90% of disease-associated variants found in GWAS are located in non-coding regions (Maurano et al. 2012).

  7. Why not just search
    computationally?

  8. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard

  9. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control

  10. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control

  11. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
    Interactive visual query

  12. How does Peax
    find similar patterns?

  13. 1. Encoding 2. Active Learning & User Interface

  14. 1. Data Processing 2. Convolutional Autoencoder

  15. Trained 6 autoencoders: 3 window sizes × 2 data types
    Data types: DNase, histone mark ChIP
    Window and bin sizes: 3 kb (25 bp), 12 kb (100 bp), 120 kb (1000 bp)
    120 DNase-seq datasets from ENCODE
    49 histone mark ChIP-seq experiments from Roadmap Epigenomics: H3K4me1/me3, H3K27ac/me3, H3K9ac/me3, H3K36me3
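
The window and bin sizes listed on this slide imply that every window is summarized by the same number of bins (3 kb / 25 bp = 12 kb / 100 bp = 120 kb / 1000 bp = 120). Below is a minimal sketch of that arithmetic and of a simple binning step. It assumes a per-base-pair signal held in a plain Python list and mean aggregation per bin; the helper names `bins_per_window` and `bin_window` are hypothetical and not taken from Peax.

```python
from statistics import fmean

# (window size in bp, bin size in bp) as listed on the slide
WINDOW_CONFIGS = [(3_000, 25), (12_000, 100), (120_000, 1_000)]


def bins_per_window(window_bp: int, bin_bp: int) -> int:
    """Number of bins a window is divided into."""
    if window_bp % bin_bp != 0:
        raise ValueError("window size must be a multiple of the bin size")
    return window_bp // bin_bp


def bin_window(signal: list[float], bin_bp: int) -> list[float]:
    """Average a per-base-pair signal into consecutive bins (assumed aggregation)."""
    if len(signal) % bin_bp != 0:
        raise ValueError("signal length must be a multiple of the bin size")
    return [fmean(signal[i:i + bin_bp]) for i in range(0, len(signal), bin_bp)]


# All three configurations yield the same number of bins per window.
bin_counts = [bins_per_window(w, b) for w, b in WINDOW_CONFIGS]
```

Under these assumptions, an input window has the same length (120 values) at every scale, whatever genomic span it covers.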

  16. DNase-seq
    Window size   3 kb   12 kb   120 kb
    R²            .98    .90     .78

  17. Average histone mark ChIP-seq
    Window size   3 kb   12 kb   120 kb
    R²            .84    .69     .73
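
The slides do not define how the reported R2 values are computed. If they are read as the coefficient of determination between an input window and its autoencoder reconstruction (an assumption), the metric can be sketched as follows. The function name `r_squared` is illustrative.

```python
def r_squared(original: list[float], reconstruction: list[float]) -> float:
    """Coefficient of determination between a signal and its reconstruction."""
    if not original or len(original) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(original) / len(original)
    # Total variance of the original signal around its mean.
    ss_total = sum((y - mean) ** 2 for y in original)
    # Squared error left unexplained by the reconstruction.
    ss_residual = sum((y - r) ** 2 for y, r in zip(original, reconstruction))
    if ss_total == 0:
        raise ValueError("R2 is undefined for a constant signal")
    return 1.0 - ss_residual / ss_total
```

A perfect reconstruction scores 1.0, while a reconstruction that only predicts the mean of the window scores 0.0.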

  18. Is the learned encoding useful for similarity search?
    Yes!

  19. Is the learned encoding useful for similarity search?
    Yes!
    Yes!

  20. Ours: CAE
    ED
    SAX
    DTW
    XCORR
    UMAP
    TSFRESH

  21. Ours: CAE
    ED
    UMAP TSFRESH XCORR DTW
    SAX

  22. Ours: CAE
    ED
    UMAP TSFRESH XCORR DTW
    SAX

  23. Query view
    Reconstructions

  24. List view
    Labeling
    Tabs

  25. Find Asymmetrical Peaks
    Using a Visual Query as the Start

  26. Select Pattern for Querying
    Freely browse and select a region for querying
    Size of the selected region is fixed and based on the autoencoder
    We use HiGlass (Kerpedjiev et al. 2018) as the genome browser
    Steps that follow: Initial Sampling, Binary Labeling, Train First Classifier, Active Learning Sampling, Training Progress, Embedding View, Resolve Conflicts, Explore Final Results Spatially

  27. Select Pattern for Querying
    Freely browse and select a region for querying
    Size of the selected region is fixed and based on the autoencoder
    We use HiGlass (Kerpedjiev et al. 2018) as the genome browser

  28. Initial Sampling
    Increase distance of samples to the query
    Sample regions in dense areas
    Maximize pairwise distance between samples
    All in the latent space
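
One of the criteria above, maximizing the pairwise distance between samples in the latent space, can be illustrated with greedy max-min (farthest-point) selection. This is a minimal sketch of that single criterion only; it is not Peax's actual sampling code, it ignores the density and distance-to-query criteria, and the function name `farthest_point_sample` and the toy vectors are hypothetical.

```python
import math


def farthest_point_sample(points, k, start=0):
    """Greedily pick k point indices with large pairwise distances.

    Each new pick is the unpicked point whose distance to its nearest
    already-picked point is largest (max-min selection).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    picked = [start]
    # Distance from every point to its nearest picked point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(picked) < k:
        remaining = (i for i in range(len(points)) if i not in picked)
        best = max(remaining, key=lambda i: nearest[i])
        picked.append(best)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[best]))
    return picked


# Toy 2D "latent" vectors; index 0 stands in for the region closest to the query.
latent = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.5, 0.0), (5.0, 0.0)]
sample = farthest_point_sample(latent, k=3)
```

In this toy example the selection spreads out along the line instead of clustering around the starting point, which is the effect the pairwise-distance criterion aims for.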

  29. Binary Labeling
    Select regions that match and do not match the query
    Inconclusive regions can simply be skipped

  30. Train First Classifier
    A random forest classifier is trained online with the labels
    Each time a new set of samples is requested, a new classifier is trained
    A new classifier can also be trained in between, after labels have changed

  31. Active Learning Sampling
    Regions are sampled by their:
    - prediction uncertainty
    - proximity to the target
    - location in dense neighborhoods
    - high pairwise distance
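
The first criterion, prediction uncertainty, can be sketched as ranking unlabeled regions by how close their predicted probability is to 0.5. This is a minimal illustration under that assumption; the slides do not state the exact uncertainty formula or how the four criteria are combined, and the names `uncertainty` and `most_uncertain` are hypothetical.

```python
def uncertainty(p: float) -> float:
    """1.0 when the classifier is undecided (p == 0.5), 0.0 when p is 0 or 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return 1.0 - 2.0 * abs(p - 0.5)


def most_uncertain(probabilities: dict[str, float], k: int) -> list[str]:
    """Return the k region ids whose predictions are the most uncertain."""
    ranked = sorted(
        probabilities,
        key=lambda region: uncertainty(probabilities[region]),
        reverse=True,
    )
    return ranked[:k]


# Toy prediction probabilities keyed by made-up region ids.
predictions = {"a": 0.9, "b": 0.55, "c": 0.1, "d": 0.4}
to_label = most_uncertain(predictions, k=2)
```

Labeling such borderline regions first tends to give the next classifier the most informative feedback, which is the usual motivation for uncertainty sampling.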

  32. Training Progress
    Progress is tracked for every trained classifier
    Uncertainty is the overall prediction probability
    Change of the prediction probability
    Convergence and divergence
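
As an illustration of the "change of the prediction probability" measure, the sketch below compares the predictions of two successive classifiers over the same regions. It assumes the change is summarized as a mean absolute difference, which the slides do not confirm; the function name `mean_abs_change` is hypothetical, and convergence and divergence are not modeled here.

```python
def mean_abs_change(previous: list[float], current: list[float]) -> float:
    """Mean absolute difference between two classifiers' prediction probabilities."""
    if not previous or len(previous) != len(current):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(c - p) for p, c in zip(previous, current)) / len(previous)


# Toy probabilities for the same three regions under two successive classifiers.
change = mean_abs_change([0.2, 0.5, 0.9], [0.3, 0.5, 0.7])
```

A value that shrinks from one classifier to the next would suggest that additional labels are changing the predictions less and less.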

  33. Embedding View
    2D UMAP embedding of all encoded regions
    Probability color encoding:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable
    View is interactive and dots are selectable

  34. Embedding View
    2D UMAP embedding of all encoded regions
    Probability color encoding:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable
    View is interactive and dots are selectable

  35. Resolve Conflicts
    Peax warns about false positives and negatives when the labels and the classifier's predictions disagree
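
The conflict check described here can be sketched as comparing each manual label with the thresholded prediction for the same region. This is an illustrative sketch, not Peax's implementation: the 0.5 threshold, the data layout, and the function name `find_conflicts` are assumptions.

```python
def find_conflicts(
    labels: dict[str, bool],
    probabilities: dict[str, float],
    threshold: float = 0.5,
) -> tuple[list[str], list[str]]:
    """Return (false_positives, false_negatives) among the labeled regions."""
    false_positives = []  # predicted as matching, labeled as non-matching
    false_negatives = []  # predicted as non-matching, labeled as matching
    for region, is_match in labels.items():
        if region not in probabilities:
            continue  # no prediction available for this region
        predicted_match = probabilities[region] >= threshold
        if predicted_match and not is_match:
            false_positives.append(region)
        elif is_match and not predicted_match:
            false_negatives.append(region)
    return false_positives, false_negatives


# Toy labels and predictions keyed by made-up region ids.
labels = {"r1": True, "r2": False, "r3": True, "r4": False}
predictions = {"r1": 0.9, "r2": 0.8, "r3": 0.2, "r4": 0.1}
fp, fn = find_conflicts(labels, predictions)
```

Each flagged region is a candidate for review: either the label was a mistake or the classifier needs more examples of that kind.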

  36. Explore Final Results Spatially
    The query view is interactive
    A BED-like track shows the prediction probabilities:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable

  37. Find Differentially Accessible Peaks
    Using Existing Peak Calls

  38. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier

  39. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier

  40. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier

  41. Resolve Conflicts
    Refine Labels & Assess Recall

  42. Explore Local Neighborhoods

  43. Final Results Excluding Labels

  44. CONCLUSION
    Leverage machine learning to improve visual exploratory pattern search
    Generative models work well as they show the encoding quality
    FUTURE WORK
    Explore other types of encoders (VAE, GAN, LSTM, etc.)
    Compare and evaluate different active learning strategies

  45. Thank You!
    PREPRINT, SOURCE CODE, MODELS, ETC.
    peax.lekschas.de
    CONTACT
    [email protected]
    @flekschas
    lekschas.de
