Scoring functions, heuristics, and prior data in read alignment

Ben Langmead
September 29, 2018

Transcript

  1. Scoring functions, heuristics, and prior data in read alignment

    Ben Langmead, Assistant Professor, JHU Computer Science. langmea@cs.jhu.edu, langmead-lab.org, @BenLangmead

    Workshop on the Future of Algorithms in Biology, September 29, 2018
  2. Read alignment

    "Sequence mapping is the cornerstone of modern genomics"
    Novak AM, Rosen Y, Haussler D, Paten B. Canonical, stable, general mapping using context schemes. Bioinformatics. 2015 Nov 15;31(22):3569-76.

    But how do you...
    Predict mapping quality?
    Characterize what the heuristics are missing?
    Select an appropriate scoring scheme?
    See John Kececioglu's talk!
  3. Affine-gap scoring (op = gap open, ex = gap extend)

    Bowtie 2:
          A   C   G   T  op  ex
      A   2  -6  -6  -6  -5  -3
      C  -6   2  -6  -6  -5  -3
      G  -6  -6   2  -6  -5  -3
      T  -6  -6  -6   2  -5  -3
      op -5  -5  -5  -5
      ex -3  -3  -3  -3

    minimap 2:
          A   C   G   T  op  ex
      A   2  -4  -4  -4  -4  -2
      C  -4   2  -4  -4  -4  -2
      G  -4  -4   2  -4  -4  -2
      T  -4  -4  -4   2  -4  -2
      op -4  -4  -4  -4
      ex -2  -2  -2  -2

    Novoalign:
           A    C    G    T   op   ex
      A    9   Qu   Qu   Qu  -40   -6
      C   Qu    9   Qu   Qu  -40   -6
      G   Qu   Qu    9   Qu  -40   -6
      T   Qu   Qu   Qu    9  -40   -6
      op -40  -40  -40  -40
      ex  -6   -6   -6   -6

    Classic: PAM250
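To make the matrices above concrete, here is a minimal sketch of affine-gap scoring as a three-state dynamic program (global alignment). The function name `affine_gap_score` and its parameters are illustrative, not taken from any aligner. It assumes that a gap of length k costs open + k * extend, and it defaults to the Bowtie 2 values from the first matrix.

```python
NEG_INF = float("-inf")

def affine_gap_score(read, ref, match=2, mismatch=-6, gap_open=-5, gap_extend=-3):
    """Best global alignment score under affine gaps.

    Assumption: a gap of length k contributes gap_open + k * gap_extend.
    """
    n, m = len(read), len(ref)
    # M: last column is a match/mismatch
    # X: last column is a read character opposite a gap in the reference
    # Y: last column is a reference character opposite a gap in the read
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a gap pays open + extend; continuing one pays extend only.
            X[i][j] = max(M[i - 1][j] + gap_open + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

# Seven matches and one length-1 gap.
bowtie2_like = affine_gap_score("ACGTACGT", "ACGTCGT")            # 7*2 - (5+3) = 6
minimap2_like = affine_gap_score("ACGTACGT", "ACGTCGT",
                                 mismatch=-4, gap_open=-4, gap_extend=-2)  # 7*2 - (4+2) = 8
```

The same alignment scores differently under the two parameter sets, which is the sense in which choosing a scoring scheme matters.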
  4. Heuristics: seed and extend

    [Figure: seed-and-extend diagram. Labels: Read, Read substring, Ref substring, Ref string 1, Ref string 3, ∅, x]

    Basically a good match for affine-gap scoring
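A minimal sketch of the seed-and-extend idea, under simplifying assumptions: seeds are exact, non-overlapping k-mers, and the extension step is ungapped (a real aligner would extend with an affine-gap dynamic program). The names `build_kmer_index` and `seed_and_extend` are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(ref, k):
    """Map every k-mer of the reference to the offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def seed_and_extend(read, ref, index, k, match=2, mismatch=-6):
    """Return (ref_start, score) of the best ungapped placement, or None."""
    # Seed: each exact k-mer hit proposes a reference start for the whole read.
    candidates = set()
    for off in range(0, len(read) - k + 1, k):
        for pos in index.get(read[off:off + k], ()):
            start = pos - off
            if 0 <= start <= len(ref) - len(read):
                candidates.add(start)
    # Extend: score the full read at each candidate (ungapped, for brevity).
    best = None
    for start in sorted(candidates):
        window = ref[start:start + len(read)]
        score = sum(match if a == b else mismatch for a, b in zip(read, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

ref = "TTGACCATGCAAGTCCGATTACGGA"
read = "CATGCATGTCCG"  # ref[5:17] with one substitution
index = build_kmer_index(ref, 4)
hit = seed_and_extend(read, ref, index, 4)  # (5, 16): 11 matches, 1 mismatch
```

Only the seeded candidates are ever scored, so anything the seeds miss is invisible to the scoring function; that is the gap between heuristic and scoring scheme that the slide points at.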
  5. Spliced alignment Image by Rgocs

  6. Aspects of splicing not captured by an affine-gap scoring function: (a) nearby motifs, notably donors & acceptors; (b) intron length distributions.

    Fair BJ, Pleiss JA. The power of fission: yeast as a tool for understanding complex splicing. Curr Genet. 2017 Jun;63(3):375-380.
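A small illustration of the two points above. The affine penalty uses the Bowtie 2 values from the scoring slide. The `splice_aware_penalty` function is purely hypothetical (its constants and log-length form are made up for illustration and come from no published aligner). The only biological fact assumed is that canonical introns begin with GT and end with AG.

```python
import math

def affine_gap_penalty(length, gap_open=5, gap_extend=3):
    """Cost of treating an intron as an ordinary gap of the given length."""
    return gap_open + gap_extend * length

def has_canonical_motifs(ref, intron_start, intron_end):
    """True if ref[intron_start:intron_end] starts with GT and ends with AG."""
    intron = ref[intron_start:intron_end]
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

def splice_aware_penalty(ref, intron_start, intron_end, base=8, noncanonical=12):
    """Hypothetical penalty: grows slowly with length, rewards GT..AG motifs."""
    length = intron_end - intron_start
    penalty = base + math.log2(length)
    if not has_canonical_motifs(ref, intron_start, intron_end):
        penalty += noncanonical
    return penalty

# Toy reference: exon, 100-base intron (GT ... AG), exon.
ref = "ACGT" + "GT" + "A" * 96 + "AG" + "TTGC"
as_gap = affine_gap_penalty(100)                # 5 + 3*100 = 305
as_intron = splice_aware_penalty(ref, 4, 104)   # 8 + log2(100), about 14.6
```

Under the affine scheme a 100-base intron costs as much as dozens of mismatches, so the aligner would never choose it; a penalty that looks at motifs and length distributions makes the spliced alignment competitive.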
  7. [Figure panels: Chicken, Nematode, Green alga, Fungus, Protist, Plant, Tapeworms]

    Gotoh O. Modeling one thousand intron length distributions with fitild. Bioinformatics. 2018 Oct 1;34(19):3258-3264.
  8. Pass 1: align to genome, make junction calls

    Pass 2: re-align to genome with putative junctions

    [Figure labels: Reads, Ref, Readlets]
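The two passes can be sketched on a toy example. This is a simplified illustration, not any particular spliced aligner: readlets are placed by unique exact match only, one junction is called per read, and the names `readlets`, `call_junction`, and `realign_with_junction` are hypothetical.

```python
def readlets(read, size):
    """Cut the read into consecutive, non-overlapping pieces of length size."""
    return [(off, read[off:off + size])
            for off in range(0, len(read) - size + 1, size)]

def call_junction(read, genome, size):
    """Pass 1: place readlets; a jump between placements suggests an intron.

    Returns (intron_start, intron_end) in genome coordinates, or None.
    """
    placed = []
    for off, piece in readlets(read, size):
        pos = genome.find(piece)
        if pos >= 0 and genome.find(piece, pos + 1) < 0:  # unique hits only
            placed.append((off, pos))
    for (off1, pos1), (off2, pos2) in zip(placed, placed[1:]):
        d1, d2 = pos1 - off1, pos2 - off2  # implied genome start of the read
        if d2 <= d1:
            continue
        # Find the leftmost split point s in the read consistent with both placements.
        for s in range(off1 + size, off2 + 1):
            left_ok = read[off1 + size:s] == genome[d1 + off1 + size:d1 + s]
            right_ok = read[s:off2] == genome[d2 + s:d2 + off2]
            if left_ok and right_ok:
                return (d1 + s, d2 + s)
    return None

def realign_with_junction(read, genome, junction):
    """Pass 2: align against the genome with the putative intron spliced out."""
    start, end = junction
    spliced = genome[:start] + genome[end:]
    pos = spliced.find(read)
    if pos < 0:
        return None
    return pos if pos < start else pos + (end - start)

exon1, intron, exon2 = "ACGTTGCAAGGCTA", "GT" + "C" * 12 + "AG", "TCGGATACCATGAA"
genome = exon1 + intron + exon2
read = exon1[-7:] + exon2[:8]                         # spans the junction
junction = call_junction(read, genome, 5)             # (14, 30)
pos = realign_with_junction(read, genome, junction)   # 7
```

The readlet that straddles the junction fails to place, but its neighbours pin down the intron; the second pass then aligns the whole read end to end against the modified reference.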
  9. Graph alignment

    Works around the fact that a typical scoring scheme ignores known ALTs.

    [Figure labels: REF, ALT, ALT, ALT, ALT]

    Sirén J, Välimäki N, Mäkinen V. "Indexing graphs for path queries with applications in genome research." IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 11.2 (2014): 375-388.
  10. Graph alignment

    ALTs to include in graph are selected (carefully) offline.

    Figure: results from FORGe study. Jacob Pritt, Nae-Chyun Chen, Ben Langmead. doi: https://doi.org/10.1101/311720
  11. Graph alignment

    Phase 1: Select ALTs to include and form genome representation & index containing them
    Phase 2: Align with standard heuristics and scoring function

    FORGe: Model & score ALTs -> Select top X% -> Build graph index -> Align to reference with ALTs

    [Diagram labels: Inputs, Outputs; VCF, FASTA, FASTQ, SAM; ALT subset, Index w/ ALTs]
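The "score ALTs, select top X%" step can be sketched as a simple ranking. This is not FORGe's model: the score here is plain allele frequency, used as a stand-in, and the `Variant` type and `select_top_alts` function are hypothetical.

```python
import math
from typing import NamedTuple

class Variant(NamedTuple):
    pos: int            # reference offset of the variant
    ref: str            # REF allele
    alt: str            # ALT allele
    allele_freq: float  # stand-in score; an assumption, not FORGe's model

def select_top_alts(variants, percent):
    """Keep the top `percent` of ALTs by score, returned in reference order."""
    keep = math.ceil(len(variants) * percent / 100)
    ranked = sorted(variants, key=lambda v: v.allele_freq, reverse=True)
    return sorted(ranked[:keep], key=lambda v: v.pos)

variants = [
    Variant(120, "A", "G", 0.02),
    Variant(340, "C", "T", 0.45),
    Variant(515, "G", "A", 0.30),
    Variant(780, "T", "C", 0.01),
]
alt_subset = select_top_alts(variants, 50)  # keeps the variants at 340 and 515
```

The selected subset is what would go into the index in Phase 1; Phase 2 alignment then runs unchanged.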
  12. Using prior data

    FORGe: Model & score ALTs -> Select top X% -> Build graph index -> Align to reference with ALTs

    Junction discovery: Align readlets -> Call junctions -> Build graph index -> Align to reference with splice junctions
  13. Questions

    Two variations on read alignment (spliced and graph alignment) modify the reference to work around the fact that typical heuristics and scoring functions aren’t appropriate in those settings.

    Is this a paradigm that we can or should improve on? What can we do differently?

    Thank you: Jacob Pritt, Nae-Chyun Chen. NSF: IIS-1349906