
Scoring functions, heuristics, and prior data in read alignment

Ben Langmead
September 29, 2018



Transcript

  1. Ben Langmead
    Assistant Professor, JHU Computer Science
    langmead-lab.org, @BenLangmead
    Workshop on the Future of Algorithms in Biology,
    September 29, 2018
    Scoring functions, heuristics, and prior
    data in read alignment


  2. Read alignment
    "Sequence mapping is the cornerstone of modern genomics"
    Novak AM, Rosen Y, Haussler D, Paten B. Canonical, stable, general mapping
    using context schemes. Bioinformatics. 2015 Nov 15;31(22):3569-76.
    But how do you...
    Predict mapping quality?
    Characterize what the heuristics are missing?
    Select an appropriate scoring scheme?
    See John Kececioglu's talk!


  3. Affine-gap scoring
    Bowtie 2:
          A    C    G    T   op   ex
    A     2   -6   -6   -6   -5   -3
    C    -6    2   -6   -6   -5   -3
    G    -6   -6    2   -6   -5   -3
    T    -6   -6   -6    2   -5   -3
    op   -5   -5   -5   -5
    ex   -3   -3   -3   -3
    minimap2:
          A    C    G    T   op   ex
    A     2   -4   -4   -4   -4   -2
    C    -4    2   -4   -4   -4   -2
    G    -4   -4    2   -4   -4   -2
    T    -4   -4   -4    2   -4   -2
    op   -4   -4   -4   -4
    ex   -2   -2   -2   -2
    Novoalign:
          A    C    G    T   op   ex
    A     9   Qu   Qu   Qu  -40   -6
    C    Qu    9   Qu   Qu  -40   -6
    G    Qu   Qu    9   Qu  -40   -6
    T    Qu   Qu   Qu    9  -40   -6
    op  -40  -40  -40  -40
    ex   -6   -6   -6   -6
    Classic: PAM250
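To make the matrices concrete, here is a minimal Python sketch that scores one already-gapped pairwise alignment under an affine-gap scheme. The match, mismatch, open, and extend values are read from the Bowtie 2 and minimap2 matrices on this slide; the names `AffineScheme` and `score_alignment`, and the convention that a gap of length k costs `gap_open + k * gap_extend`, are assumptions made for illustration and are not taken from either tool's source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineScheme:
    """Hypothetical container for one row/column pattern of the slide's matrices."""
    match: int       # score for an identical pair of bases
    mismatch: int    # (negative) score for a differing pair of bases
    gap_open: int    # (negative) score charged once per gap
    gap_extend: int  # (negative) score charged for every gap position


# Values read from the matrices on the slide.
BOWTIE2_LIKE = AffineScheme(match=2, mismatch=-6, gap_open=-5, gap_extend=-3)
MINIMAP2_LIKE = AffineScheme(match=2, mismatch=-4, gap_open=-4, gap_extend=-2)


def score_alignment(read_row: str, ref_row: str, scheme: AffineScheme) -> int:
    """Score two equal-length rows of an alignment; '-' marks a gap position."""
    if len(read_row) != len(ref_row):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column, if any
    for a, b in zip(read_row, ref_row):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            gap_row = "read" if a == "-" else "ref"
            if gap_row != prev_gap_row:
                score += scheme.gap_open  # a new gap starts in this column
            score += scheme.gap_extend
            prev_gap_row = gap_row
        else:
            score += scheme.match if a == b else scheme.mismatch
            prev_gap_row = None
    return score


if __name__ == "__main__":
    print(score_alignment("ACGTTACG", "ACG--ACG", BOWTIE2_LIKE))
    print(score_alignment("ACGTTACG", "ACG--ACG", MINIMAP2_LIKE))
```

Under the stated convention, the example alignment has six matches and one gap of length 2, so it scores 12 - 5 - 6 = 1 with the Bowtie 2 values and 12 - 4 - 4 = 4 with the minimap2 values.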


  4. Heuristics
    Seed and extend
    [Figure: a read and its substrings placed against reference strings and substrings; labels: Ref string 1, Ref string 3, Ref substring, Read, Read substring; coordinate ticks 0, 30, 35]
    Basically a good match for affine-gap scoring
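As an illustration of the seed-and-extend idea, the following toy Python sketch indexes every k-mer of a reference, looks up exact k-mer seeds from the read, and then scores each candidate placement. It is a simplification under stated assumptions: the extension step is ungapped and end-to-end, whereas real aligners extend with dynamic programming and affine gaps. The function names and the default `k` are hypothetical; the match and mismatch defaults are the Bowtie 2 values from the scoring slide.

```python
from collections import defaultdict


def build_kmer_index(ref: str, k: int) -> dict:
    """Map each k-mer of the reference to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def seed_and_extend(read: str, ref: str, k: int = 4,
                    match: int = 2, mismatch: int = -6):
    """Return (best_score, ref_offset) for an ungapped placement, or None.

    Seed: every exact k-mer hit proposes a placement of the whole read.
    Extend: each proposed placement is scored base by base (no gaps here).
    """
    index = build_kmer_index(ref, k)
    candidates = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            start = i - j  # where the read would begin if this seed is right
            if 0 <= start <= len(ref) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = ref[start:start + len(read)]
        score = sum(match if r == g else mismatch for r, g in zip(read, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    reference = "TTTTACGTACGGATCCTTTT"
    print(seed_and_extend("ACGTACGGATCC", reference))  # exact read
    print(seed_and_extend("ACGTACGCATCC", reference))  # read with one mismatch
```

The heuristic part is the seeding: placements that share no exact k-mer with the read are never scored, which is what makes the search fast and also what it can miss.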


  5. Spliced alignment
    Image by Rgocs


  6. Fair BJ, Pleiss JA. The power of fission: yeast as a tool for understanding
    complex splicing. Curr Genet. 2017 Jun; 63(3):375-380.
    Aspects of splicing not captured by an affine-gap scoring function: (a) nearby motifs,
    notably donors and acceptors; (b) intron length distributions.
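The contrast can be sketched in Python. An affine gap penalty grows linearly with gap length and looks at no sequence context, whereas a splice-aware score could consult both the dinucleotides at the gap boundaries and a length prior. Everything specific below is an assumption for illustration only: the function names, the motif bonus and penalty, and the log-normal length prior with its `median_len` and `sigma` parameters are hypothetical and are not taken from the talk or from any aligner. The only biological fact relied on is that canonical introns begin with GT and end with AG. Adding a log-density to integer alignment scores is also ad hoc.

```python
import math


def affine_gap_score(length: int, gap_open: int = -5, gap_extend: int = -3) -> int:
    """Affine penalty for a gap of the given length (Bowtie 2 values from the slide)."""
    return gap_open + gap_extend * length


def intron_score(ref: str, start: int, end: int,
                 motif_bonus: float = 5.0, motif_penalty: float = -10.0,
                 median_len: float = 100.0, sigma: float = 1.0) -> float:
    """Hypothetical score for treating ref[start:end] as an intron.

    (a) Motif term: reward the canonical GT...AG donor/acceptor pair.
    (b) Length term: log-density of an assumed log-normal length distribution.
    """
    length = end - start
    if length < 4:
        raise ValueError("intron too short to hold donor and acceptor motifs")
    intron = ref[start:end]
    canonical = intron.startswith("GT") and intron.endswith("AG")
    motif_term = motif_bonus if canonical else motif_penalty
    z = (math.log(length) - math.log(median_len)) / sigma
    length_term = -0.5 * z * z - math.log(length * sigma * math.sqrt(2 * math.pi))
    return motif_term + length_term


if __name__ == "__main__":
    ref = "AAAA" + "GT" + "C" * 96 + "AG" + "TTTT"
    print(affine_gap_score(100))       # linear in length, blind to motifs
    print(intron_score(ref, 4, 104))   # depends on motifs and length prior
```

With these made-up parameters a 100-base gap flanked by GT and AG is penalized far less than under the affine scheme, which charges -5 - 3 * 100 = -305 regardless of context.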


  7. Gotoh O. Modeling one thousand intron length distributions with fitild.
    Bioinformatics. 2018 Oct 1;34(19):3258-3264.
    [Figure panels: Chicken, Nematode, Green alga, Fungus, Protist, Plant, Tapeworms]


  8. Pass 1: align to genome, make junction calls
    Pass 2: re-align to genome with putative junctions
    [Figure labels: Reads, Ref, Readlets]
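The two passes can be sketched with a toy Python example. This is a deliberately simplified stand-in, not the method of any particular spliced aligner: readlets are just the first and last `readlet_len` bases of a read, matching is exact, and all function names and parameters are hypothetical. Pass 1 calls a putative junction when the two readlets sit farther apart in the reference than in the read. Pass 2 re-aligns reads against the spliced sequence around each called junction, which rescues reads whose overhang across the junction is too short to support a readlet.

```python
def call_junction(read: str, ref: str, readlet_len: int = 8):
    """Pass 1 (toy): return (donor, acceptor) so that ref[donor:acceptor] is the
    putative intron, or None if the read gives no evidence of a junction."""
    head, tail = read[:readlet_len], read[-readlet_len:]
    head_pos = ref.find(head)
    if head_pos < 0:
        return None
    tail_pos = ref.find(tail, head_pos + readlet_len)
    if tail_pos < 0:
        return None
    # Extra reference distance between the readlets, beyond their read distance.
    gap = (tail_pos - head_pos) - (len(read) - readlet_len)
    if gap <= 0:
        return None
    # Find a split point where both halves of the read match exactly.
    for split in range(readlet_len, len(read) - readlet_len + 1):
        right_start = head_pos + split + gap
        left_ok = ref[head_pos:head_pos + split] == read[:split]
        right_ok = ref[right_start:right_start + len(read) - split] == read[split:]
        if left_ok and right_ok:
            return (head_pos + split, right_start)
    return None


def realign_with_junctions(read: str, ref: str, junctions, flank: int):
    """Pass 2 (toy): return (donor, acceptor, ref_start) for the first junction
    whose spliced flanks contain the read across the splice point, else None."""
    for donor, acceptor in junctions:
        left_start = max(0, donor - flank)
        left = ref[left_start:donor]
        right = ref[acceptor:acceptor + flank]
        off = (left + right).find(read)
        # Require the read to actually span the splice point.
        if off >= 0 and off < len(left) < off + len(read):
            return (donor, acceptor, left_start + off)
    return None


if __name__ == "__main__":
    exon1, exon2 = "ACGTTGCAAGGCTA", "TTGACCATGCAAGT"
    ref = "GGGG" + exon1 + "GT" + "C" * 30 + "AG" + exon2 + "GGGG"
    junction = call_junction(exon1[-10:] + exon2[:10], ref)
    short_overhang = exon1[-12:] + exon2[:3]
    print(junction)
    print(call_junction(short_overhang, ref))
    print(realign_with_junctions(short_overhang, ref, [junction], flank=14))
```

In the usage example, the second read extends only three bases past the junction, so pass 1 alone cannot place it, while pass 2 aligns it using the junction called from the first read.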


  9. Graph alignment
    Sirén J, Välimäki N, Mäkinen V. "Indexing graphs for path queries with applications in genome
    research." IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 11.2 (2014): 375-388.
    Works around the fact that the typical scoring
    scheme ignores known ALTs
    [Figure labels: REF, ALT (x4)]


  10. Graph alignment
    ALTs to include in graph are selected (carefully)
    offline. Figure: results from FORGe study.
    Jacob Pritt, Nae-Chyun Chen, Ben Langmead
    doi: https://doi.org/10.1101/311720


  11. Graph alignment
    Phase 1: Select ALTs to include and form genome representation & index containing them
    Phase 2: Align with standard heuristics and scoring function
    FORGe: Model & score ALTs -> Select top X% -> Build graph index -> Align to reference with ALTs
    Inputs: VCF, FASTA, FASTQ
    Outputs: ALT subset, Index w/ ALTs, SAM
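The "Model & score ALTs, select top X%" step can be sketched as a rank-and-truncate operation in Python. This is a minimal illustration under assumptions, not FORGe's implementation: the `Alt` record, the `select_top_alts` function, and the choice of allele frequency as the score are all hypothetical stand-ins for whatever model is used to score ALTs.

```python
import math
from typing import List, NamedTuple


class Alt(NamedTuple):
    """Hypothetical record for one alternate allele, as might be parsed from a VCF."""
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_freq: float  # stand-in score; any per-ALT score could be used


def select_top_alts(alts: List[Alt], top_percent: float) -> List[Alt]:
    """Rank ALTs by score (highest first) and keep the top `top_percent` percent."""
    if not 0 <= top_percent <= 100:
        raise ValueError("top_percent must be between 0 and 100")
    keep = math.ceil(len(alts) * top_percent / 100)
    ranked = sorted(alts, key=lambda a: a.allele_freq, reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    alts = [
        Alt("chr1", 100, "A", "G", 0.40),
        Alt("chr1", 250, "C", "T", 0.01),
        Alt("chr1", 300, "G", "A", 0.75),
        Alt("chr1", 420, "T", "C", 0.10),
    ]
    for alt in select_top_alts(alts, top_percent=50):
        print(alt)
```

The selected subset would then be the input to index building, so the aligner in phase 2 never needs to know how the ALTs were scored.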


  12. Using prior data
    FORGe: Model & score ALTs -> Select top X% -> Build graph index -> Align to reference with ALTs
    Junction discovery: Align readlets -> Call junctions -> Build graph index -> Align to reference with splice junctions


  13. Questions
    Two variations on read alignment -- spliced and graph
    alignment -- modify the reference to work around the fact
    that typical heuristics and scoring functions aren’t
    appropriate in those settings.
    Is this a paradigm that we can or should improve on?
    What can we do differently?
    Thank you: Jacob Pritt, Nae-Chyun Chen
    NSF: IIS-1349906
