Interactive Visual Pattern Search in Epigenomic Data with Peax: NIH ENCODE presentation

Slides from my presentation on visual pattern search in epigenomic data with Peax for the NIH ENCODE Analysis Working Group.

Project page: http://peax.lekschas.de
Paper: https://vcg.seas.harvard.edu/pubs/peax
Video introduction: https://youtu.be/FlzTdFUVE-M

Fritz Lekschas

July 30, 2020

Transcript

  1. Peax
    Fritz Lekschas, Ph.D. candidate
    Harvard University
    Interactive Visual Pattern Search in Epigenomic Data
    Using Unsupervised Deep Representation Learning
    July 27, 2020
    Brant Peterson, Daniel Haehn, Eric Ma, Nils Gehlenborg, and Hanspeter Pfister

  2. View Slide

  3. Search

  4. Search
    ? ? ? ? ? ?

  5. Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update.

  6. Davis et al. (2018) The Encyclopedia of DNA elements (ENCODE): data portal update. Maurano et al. (2012).
    >90% of disease-associated variants found in GWAS are located in non-coding regions

  7. Why not just search
    computationally?

  8. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard

  9. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control

  10. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control

  11. • Little to no ground truth
    • Peak calling is not solved
    • Feature calling is very hard
    • Formally defining patterns is hard
    Visual quality control
    Interactive visual query

  12. Search Query

  13. Search Query Features
    Number of peaks
    Height of peaks
    Shape of peaks
    Position of peaks
    Average signal
    ...
    3, 37, 0.9, 14, 5
    2, 51, 1.3, 12, 7
    4, 29, 9.1, 14, 11
    2, 41, 1.0, 14, 8
    Example

  14. Search Query Features
    Number of peaks
    Height of peaks
    Shape of peaks
    Position of peaks
    Average signal
    ...
    3, 37, 0.9, 14, 5
    2, 51, 1.3, 12, 7
    4, 29, 9.1, 14, 11
    2, 41, 1.0, 14, 8
    Example
    Result
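
A feature-based search like the one illustrated on this slide could be sketched as follows. This is only an illustration, not how Peax works: treating the first feature vector as the query, reading the columns in the order of the listed features, and using unscaled Euclidean distance are all assumptions.

```python
import math

# Feature vectors as listed on the slide. Which feature each column
# holds, and which vector is the query, are assumptions.
query = (3, 37, 0.9, 14, 5)
candidates = [
    (2, 51, 1.3, 12, 7),
    (4, 29, 9.1, 14, 11),
    (2, 41, 1.0, 14, 8),
]


def euclidean(a, b):
    """Straight-line distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Indices of the candidates, ordered from most to least similar to the query.
ranking = sorted(range(len(candidates)), key=lambda i: euclidean(query, candidates[i]))
```

Because the features live on very different scales, the peak height dominates the unscaled distance here; hand-crafted features of this kind usually need normalization and careful weighting.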

  15. 1. Encoding 2. Active Learning & User Interface
    PEAX

  16. 1. Encoding 2. Active Learning & User

  17. 2. Active Learning & User Interface

  18. 1. Data Processing 2. Convolutional Autoencoder
    DATA ENCODING

  19. 1. Data Processing 2. Convolutional Autoencoder
    DATA ENCODING

  20. 1. Data Processing 2. Convolutional Autoencoder
    1
    0
    DATA ENCODING

  21. 1. Data Processing 2. Convolutional Autoencoder
    DATA ENCODING

  22. 1. Data Processing 2. Convolutional Autoencoder
    DATA ENCODING

  23. Trained 6 autoencoders: 3 window sizes × 2 data types
    Data types: DNase, histone mark ChIP
    Window and bin sizes: 3 kb (25 bp), 12 kb (100 bp), 120 kb (1000 bp)
    120 DNase-seq datasets from ENCODE
    49 histone mark ChIP-seq experiments from Roadmap Epigenomics: H3K4me1/me3, H3K27ac/me3, H3K9ac/me3, H3K36me3
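
As a quick check of the window and bin sizes listed above, the following sketch computes the number of bins per window. The dictionary and function names are made up for illustration.

```python
# (window size in bp, bin size in bp) as listed on the slide
WINDOW_CONFIGS = {
    "3 kb": (3_000, 25),
    "12 kb": (12_000, 100),
    "120 kb": (120_000, 1_000),
}


def bins_per_window(window_bp, bin_bp):
    """Number of equally sized bins that tile one window."""
    if window_bp % bin_bp != 0:
        raise ValueError("bin size must divide the window size evenly")
    return window_bp // bin_bp


bins = {name: bins_per_window(w, b) for name, (w, b) in WINDOW_CONFIGS.items()}
```

All three settings yield 120 bins per window, so each window is a vector of the same length regardless of its genomic extent.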

  24. 3 kb 12 kb 120 kb
    DNase-seq
    ChIP-seq

  25. 3 kb 12 kb 120 kb
    DNase-seq
    R2 .98 .90 .78
    R2 .84 .69 .73
    ChIP-seq
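
Assuming the R2 values on this slide are coefficients of determination between an input window and its autoencoder reconstruction (the slide does not say so explicitly), such a score could be computed as in this sketch. The signal values are made up.

```python
def r_squared(original, reconstruction):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(original) / len(original)
    ss_res = sum((o - r) ** 2 for o, r in zip(original, reconstruction))
    ss_tot = sum((o - mean) ** 2 for o in original)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant signal")
    return 1 - ss_res / ss_tot


# Made-up binned signal and a slightly imperfect reconstruction of it
window = [0.0, 0.1, 0.8, 1.0, 0.3, 0.0]
reconstruction = [0.0, 0.2, 0.7, 0.9, 0.3, 0.1]
score = r_squared(window, reconstruction)
```

A perfect reconstruction scores 1, and a reconstruction that only predicts the mean of the window scores 0.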

  26. Is the learned encoding useful for similarity search?
    Yes!

  27. Is the learned encoding useful for similarity search?
    Yes!
    Yes!

  28. Ours: CAE
    ED
    SAX
    DTW
    XCORR
    UMAP
    TSFRESH

  29. Ours: CAE
    ED
    UMAP TSFRESH XCORR DTW
    SAX

  30. Ours: CAE
    ED
    UMAP TSFRESH XCORR DTW
    SAX

  31. View Slide

  32. Query view
    Reconstructions

  33. List view
    Labeling
    Tabs

  34. Embedding

  35. Progress

  36. View Slide

  37. Find Asymmetrical Peaks
    Video screencast available at youtu.be/FlzTdFUVE-M?t=220

  38. Select Pattern for Querying
    Initial Sampling
    Binary Labeling
    Train First Classifier
    Active Learning Sampling
    Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    Freely browse and select a region for querying
    Size of the selected region is fixed and based on the autoencoder
    We use HiGlass (Kerpedjiev et al. 2018) as the genome browser

  39. Initial Sampling
    Binary Labeling
    Train First Classifier
    Active Learning Sampling
    Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    Increase distance of samples to the query
    Sample regions in dense areas
    Maximize pairwise distance between samples
    All in the latent space

  40. Initial Sampling
    Binary Labeling
    Train First Classifier
    Active Learning Sampling
    Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    Increase distance of samples to the query
    Sample regions in dense areas
    Maximize pairwise distance between samples
    All in the latent space
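
The three sampling criteria above could be combined greedily as in the following sketch. It is not Peax's actual sampling procedure: the distance bands, the k-nearest-neighbor density estimate, and all names are assumptions made for illustration.

```python
import math


def knn_radius(points, i, k):
    """Distance from point i to its k-th nearest neighbor (small = dense area)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[min(k, len(dists)) - 1]


def initial_sample(points, query_idx, n_samples, k=2):
    """Greedily pick up to n_samples indices at increasing distance to the query."""
    others = [i for i in range(len(points)) if i != query_idx]
    if not others or n_samples < 1:
        return []
    # Order all regions by their latent-space distance to the query.
    others.sort(key=lambda i: math.dist(points[i], points[query_idx]))
    band_size = math.ceil(len(others) / n_samples)
    selected = []
    for start in range(0, len(others), band_size):
        band = others[start:start + band_size]
        # Keep the denser half of this distance band.
        band.sort(key=lambda i: knn_radius(points, i, k))
        dense = band[:max(1, len(band) // 2)]

        # Among those, take the point farthest from the samples chosen so far.
        def spread(i):
            return min((math.dist(points[i], points[s]) for s in selected), default=0.0)

        selected.append(max(dense, key=spread))
    return selected


# Made-up 2D latent vectors; index 0 plays the role of the query.
latent = [(0, 0), (1, 0), (1.2, 0), (2, 0), (2.1, 0.1), (4, 0), (4.1, 0), (8, 0), (8, 0.2)]
picked = initial_sample(latent, query_idx=0, n_samples=2)
```

Taking one sample per distance band makes the samples move away from the query step by step, while the density filter and the spread criterion avoid outliers and near-duplicates.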

  41. Binary Labeling
    Train First Classifier
    Active Learning Sampling
    Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    Select regions that match and do not match the query
    Inconclusive regions can simply be skipped

  42. Train First Classifier
    Active Learning Sampling
    Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    A random forest classifier is trained online with the labels
    Each time a new set of samples is requested, a new classifier is trained
    A new classifier can also be trained in between after labels have changed

  43. Active Learning Sampling
    Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    Regions are sampled by their:
    - prediction uncertainty
    - proximity to the target
    - location in dense neighborhoods
    - high pairwise distance
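
To make the first two criteria concrete, the sketch below ranks unlabeled regions by prediction uncertainty weighted by proximity to the query. The scoring formula is an assumption chosen for illustration, not the weighting Peax uses, and the density and pairwise-distance criteria are left out for brevity.

```python
import math


def uncertainty(p):
    """1 when the classifier is undecided (p = 0.5), 0 when it is certain."""
    return 1.0 - 2.0 * abs(p - 0.5)


def rank_for_labeling(latent, probs, query, n):
    """Indices of the n regions with the highest uncertainty-times-proximity score."""
    def score(i):
        proximity = 1.0 / (1.0 + math.dist(latent[i], query))
        return uncertainty(probs[i]) * proximity

    return sorted(range(len(latent)), key=score, reverse=True)[:n]


# Made-up latent vectors and predicted match probabilities
latent = [(0.1, 0.0), (0.2, 0.0), (3.0, 0.0), (0.1, 0.1)]
probs = [0.5, 0.95, 0.5, 0.1]
to_label = rank_for_labeling(latent, probs, query=(0.0, 0.0), n=2)
```

Region 0 is both undecided and close to the query, so it is proposed first. Region 2 is equally undecided but far away, so it comes second.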

  44. Training Progress
    Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    Progress is tracked for every trained classifier
    Uncertainty is the overall prediction probability
    Change of the prediction probability
    Convergence and divergence
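
Two of these progress measures could be computed per classifier as sketched below. The exact definitions are assumptions: uncertainty is taken as the mean closeness of the predictions to 0.5, and change as the mean absolute difference between consecutive classifiers. Convergence and divergence are not covered.

```python
def mean_uncertainty(probs):
    """Average closeness of the predictions to 0.5, scaled to [0, 1]."""
    return sum(1.0 - 2.0 * abs(p - 0.5) for p in probs) / len(probs)


def mean_change(prev_probs, curr_probs):
    """Average absolute change in prediction probability between two classifiers."""
    return sum(abs(c - p) for p, c in zip(prev_probs, curr_probs)) / len(curr_probs)


# Made-up predictions of three successive classifiers for the same four regions
rounds = [
    [0.5, 0.5, 0.5, 0.5],
    [0.7, 0.4, 0.2, 0.6],
    [0.9, 0.1, 0.1, 0.8],
]
uncertainties = [mean_uncertainty(r) for r in rounds]
changes = [mean_change(a, b) for a, b in zip(rounds, rounds[1:])]
```

In this toy example the uncertainty drops with every round, which is the trend one would hope to see as more labels are added.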

  45. Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    2D UMAP embedding of all encoded regions
    Probability color encoding:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable
    View is interactive and dots are selectable

  46. Embedding View
    Resolve Conflicts
    Explore Final Results Spatially
    2D UMAP embedding of all encoded regions
    Probability color encoding:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable
    View is interactive and dots are selectable

  47. Resolve Conflicts
    Explore Final Results Spatially
    Peax warns about false positives and negatives when the labels and the classifier's predictions disagree
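
Detecting such conflicts amounts to comparing each user label with the thresholded prediction for the same region. A minimal sketch, assuming a probability threshold of 0.5 and a made-up data layout:

```python
def find_conflicts(labels, probs, threshold=0.5):
    """Return (false_positives, false_negatives) among the labeled regions.

    labels maps a region index to True (match) or False (no match);
    probs holds the predicted match probability of every region.
    """
    false_pos, false_neg = [], []
    for idx, is_match in sorted(labels.items()):
        predicted_match = probs[idx] >= threshold
        if predicted_match and not is_match:
            false_pos.append(idx)  # predicted as match, labeled as non-match
        elif is_match and not predicted_match:
            false_neg.append(idx)  # labeled as match, predicted as non-match
    return false_pos, false_neg


# Made-up labels and predictions; region 4 is unlabeled and therefore ignored
labels = {0: True, 1: False, 2: True, 3: False}
probs = [0.9, 0.8, 0.2, 0.1, 0.6]
false_pos, false_neg = find_conflicts(labels, probs)
```

Each conflict is either a labeling mistake or a pattern the classifier has not yet learned, so it is worth a second look.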

  48. Explore Final Results Spatially
    The query view is interactive
    A BED-like track shows the prediction probabilities:
    ⬤ means matching
    ⬤ means non-matching
    ⬤ means unpredictable

  49. Find Differentially Accessible Peaks
    Using Existing Peak Calls

  50. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier

  51. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier

  52. ENCODE e11.5 DNase-seq from face and hindbrain
    Differential, central, strong peak calls
    Balance positives and negatives
    Initial Classifier

  53. Resolve Conflicts
    Refine Labels & Assess Recall

  54. Explore Local Neighborhoods

  55. Final Results Excluding Labels

  56. CONCLUSION
    Leverage deep learning to augment human intelligence for visual pattern exploration
    Complementary to specialized feature detectors
    FUTURE WORK
    Explore other types of encoders
    Evaluate different active learning strategies

  57. PAPER, SOURCE CODE, MODELS, ETC.
    vcg.seas.harvard.edu/pubs/peax
    CONTACT
    [email protected]
    @flekschas
    lekschas.de
    Thank You!

  58. Individual histone marks
    H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9ac, H3K9me3, H3K36me3
    3 kb, 12 kb, 120 kb

  59. TECHNOLOGY
    Backend: Python (Flask)
    Frontend: JavaScript (React)
    Autoencoders: Keras (TensorFlow)
    Genome Browser: HiGlass
    Search Setup: JSON file
    EXAMPLE SEARCH SETUP
    {
      "encoders": [{
        "content_type": "dnase-seq-3kb",
        "from_file": "examples/autoencoders.json"
      }, {
        "content_type": "histone-mark-chip-seq-3kb",
        "from_file": "examples/autoencoders.json"
      }],
      "datasets": [{
        "filepath": "examples/data/ENCFF641OPE.bigWig",
        "content_type": "dnase-seq-3kb",
        "id": "encode-e11-5-limb-dnase-rdns",
        "name": "e11.5 limb DNase rdn signal"
      }, {
        "filepath": "examples/data/ENCFF336LAW.bigWig",
        "content_type": "histone-mark-chip-seq-3kb",
        "id": "encode-e11-5-limb-chip-h3k27ac-fc",
        "name": "e11.5 limb H3K27ac fc"
      }],
      "coords": "mm10",
      "chroms": ["chr12"],
      "step_freq": 2,
      "db_path": "examples/search-e11-5-limb.db"
    }
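
Before starting a search, a setup file like this one could be sanity-checked with a few lines of Python. The sketch below is not part of Peax: the helper name is hypothetical, and reading `step_freq` as the number of sliding-window steps per window (with 3,000 bp windows for the "3kb" encoders) is an assumption.

```python
import json

# Copy of the search setup shown on the slide
CONFIG_TEXT = """
{
  "encoders": [
    {"content_type": "dnase-seq-3kb", "from_file": "examples/autoencoders.json"},
    {"content_type": "histone-mark-chip-seq-3kb", "from_file": "examples/autoencoders.json"}
  ],
  "datasets": [
    {"filepath": "examples/data/ENCFF641OPE.bigWig",
     "content_type": "dnase-seq-3kb",
     "id": "encode-e11-5-limb-dnase-rdns",
     "name": "e11.5 limb DNase rdn signal"},
    {"filepath": "examples/data/ENCFF336LAW.bigWig",
     "content_type": "histone-mark-chip-seq-3kb",
     "id": "encode-e11-5-limb-chip-h3k27ac-fc",
     "name": "e11.5 limb H3K27ac fc"}
  ],
  "coords": "mm10",
  "chroms": ["chr12"],
  "step_freq": 2,
  "db_path": "examples/search-e11-5-limb.db"
}
"""


def datasets_without_encoder(cfg):
    """Hypothetical check: ids of datasets whose content_type has no encoder entry."""
    encoder_types = {enc["content_type"] for enc in cfg["encoders"]}
    return [ds["id"] for ds in cfg["datasets"] if ds["content_type"] not in encoder_types]


config = json.loads(CONFIG_TEXT)
unmatched = datasets_without_encoder(config)

# Assumption: step_freq is the number of steps per window, so a 3,000 bp
# window with step_freq 2 would advance by 1,500 bp at a time.
WINDOW_BP = 3_000
step_bp = WINDOW_BP // config["step_freq"]
```

Each dataset is tied to an encoder through its `content_type`, so an empty `unmatched` list means every track has a matching autoencoder.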
