Scoring functions, heuristics, and prior data in read alignment

Ben Langmead Assistant Professor, JHU Computer Science [email protected], langmead-lab.org, @BenLangmead
Workshop on the Future of Algorithms in Biology, September 29, 2018

Read alignment

"Sequence mapping is the cornerstone of modern genomics"
Novak AM, Rosen Y, Haussler D, Paten B. Canonical, stable, general mapping using context schemes. Bioinformatics. 2015 Nov 15;31(22):3569-76.

But how do you...
Predict mapping quality?
Characterize what the heuristics are missing?
Select an appropriate scoring scheme?
See John Kececioglu's talk!

Affine-gap scoring

Bowtie 2
       A     C     G     T    op    ex
A      2    -6    -6    -6    -5    -3
C     -6     2    -6    -6    -5    -3
G     -6    -6     2    -6    -5    -3
T     -6    -6    -6     2    -5    -3
op    -5    -5    -5    -5
ex    -3    -3    -3    -3

minimap2
       A     C     G     T    op    ex
A      2    -4    -4    -4    -4    -2
C     -4     2    -4    -4    -4    -2
G     -4    -4     2    -4    -4    -2
T     -4    -4    -4     2    -4    -2
op    -4    -4    -4    -4
ex    -2    -2    -2    -2

Novoalign
       A     C     G     T    op    ex
A      9    Qu    Qu    Qu   -40    -6
C     Qu     9    Qu    Qu   -40    -6
G     Qu    Qu     9    Qu   -40    -6
T     Qu    Qu    Qu     9   -40    -6
op   -40   -40   -40   -40
ex    -6    -6    -6    -6

Classic: PAM250
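To make the matrices above concrete, here is a minimal Python sketch that scores an already-gapped pairwise alignment under the Bowtie 2 and minimap2 values from the slide. The names `AffineScheme` and `score_alignment` are illustrative and do not come from either tool. The sketch assumes a gap of length k costs op + k * ex, and it leaves out the quality-dependent Novoalign mismatch entries (Qu).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineScheme:
    """Match/mismatch scores plus gap open (op) and gap extend (ex) penalties."""
    match: int
    mismatch: int
    gap_open: int
    gap_extend: int


# Values copied from the slide's matrices.
BOWTIE2 = AffineScheme(match=2, mismatch=-6, gap_open=-5, gap_extend=-3)
MINIMAP2 = AffineScheme(match=2, mismatch=-4, gap_open=-4, gap_extend=-2)


def score_alignment(ref_row: str, read_row: str, scheme: AffineScheme) -> int:
    """Score two equal-length alignment rows in which '-' marks a gap.

    Assumption (not stated on the slide): a gap of length k costs
    gap_open + k * gap_extend.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have equal length")
    total = 0
    open_gap = None  # which row the currently open gap is in, if any
    for r, q in zip(ref_row, read_row):
        if r == "-" and q == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if r == "-" or q == "-":
            gap_row = "ref" if r == "-" else "read"
            if gap_row != open_gap:
                total += scheme.gap_open  # a new gap starts in this column
            total += scheme.gap_extend
            open_gap = gap_row
        else:
            total += scheme.match if r == q else scheme.mismatch
            open_gap = None
    return total
```

For example, `score_alignment("ACGTACGT", "ACG--CGT", BOWTIE2)` gives 1 (six matches, one gap of length 2), while the same alignment under `MINIMAP2` gives 4.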

Heuristics

Seed and extend
[Figure labels: Read, Read substring, Ref string 1, Ref string 3, Ref substring, ∅]
Basically a good match for affine-gap scoring
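A toy version of the seed-and-extend heuristic can be sketched as below. It is a simplification under stated assumptions: seeds are exact k-mers looked up in a plain dictionary (real aligners use structures such as FM indexes or minimizer tables), and the extension step is ungapped and scored with only the match and mismatch values from the Bowtie 2 matrix. The function names are illustrative.

```python
from collections import defaultdict


def build_kmer_index(ref: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def seed_and_extend(read: str, ref: str, index: dict, k: int,
                    match: int = 2, mismatch: int = -6):
    """Return (ref_start, score) of the best ungapped placement, or None.

    Seed: every k-mer of the read is looked up exactly in the index, and each
    hit proposes the reference offset where the whole read would start.
    Extend: each proposed offset is scored base by base, without gaps.
    """
    candidates = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            start = i - j
            if 0 <= start <= len(ref) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        window = ref[start:start + len(read)]
        score = sum(match if a == b else mismatch for a, b in zip(read, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

A read with no exact seed hit returns `None`, which is one small illustration of what the heuristic can miss.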

Spliced alignment
[Image by Rgocs]

Fair BJ, Pleiss JA. The power of fission: yeast as a tool for understanding complex splicing. Curr Genet. 2017 Jun;63(3):375-380.

Aspects of splicing not captured by an affine-gap scoring function:
(a) nearby motifs, notably donors & acceptors
(b) intron length distributions
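The two aspects listed above can be illustrated with a short Python sketch. It contrasts the penalty an affine-gap scheme would charge for an intron-sized gap (using the Bowtie 2 gap values from the earlier slide) with a hypothetical splice-aware penalty that checks for the canonical GT...AG intron boundaries and adds a length term. The log-normal length model, its parameters `mu` and `sigma`, and the `noncanonical_penalty` value are placeholders chosen for illustration, not fitted or published values.

```python
import math


def affine_gap_penalty(length: int, gap_open: int = -5, gap_extend: int = -3) -> int:
    """Penalty for one gap of `length` bases, assuming op + length * ex."""
    return gap_open + length * gap_extend


def has_canonical_motifs(ref: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron ref[intron_start:intron_end] starts with GT and ends with AG."""
    if intron_end - intron_start < 4:
        raise ValueError("intron too short to carry both motifs")
    return (ref[intron_start:intron_start + 2] == "GT"
            and ref[intron_end - 2:intron_end] == "AG")


def lognormal_logpdf(x: float, mu: float, sigma: float) -> float:
    """Log density of a log-normal distribution at x > 0."""
    return (-math.log(x) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
            - (math.log(x) - mu) ** 2 / (2 * sigma ** 2))


def splice_aware_penalty(ref: str, intron_start: int, intron_end: int,
                         noncanonical_penalty: float = -10.0,
                         mu: float = math.log(100), sigma: float = 1.0) -> float:
    """Hypothetical intron penalty: motif term plus log-likelihood of the length.

    All default parameter values are illustrative placeholders.
    """
    length = intron_end - intron_start
    motif_term = 0.0 if has_canonical_motifs(ref, intron_start, intron_end) \
        else noncanonical_penalty
    return motif_term + lognormal_logpdf(length, mu, sigma)
```

Under the affine scheme a 100-base intron costs -305 regardless of its boundaries, whereas the splice-aware sketch distinguishes a GT...AG intron from one without those motifs. Adding a log-likelihood to an alignment score is itself a modeling shortcut.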

Gotoh O. Modeling one thousand intron length distributions with fitild. Bioinformatics. 2018 Oct 1;34(19):3258-3264.
[Figure panels: Chicken, Nematode, Green alga, Fungus, Protist, Plant, Tapeworms]

Pass 1: align to genome, make junction calls
Pass 2: re-align to genome with putative junctions
[Figure labels: Reads, Ref, Readlets]
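The two passes can be sketched in Python as follows. This is a toy under strong assumptions: all matching is exact, each read contributes only a prefix readlet and a suffix readlet, and a junction is reported as the half-open intron interval (start, end) on the reference. The function names are illustrative and are not taken from any aligner.

```python
def find_unique(ref: str, s: str):
    """Return the offset of s in ref if it occurs exactly once, else None."""
    i = ref.find(s)
    if i == -1 or ref.find(s, i + 1) != -1:
        return None
    return i


def call_junctions(reads, ref: str, readlet_len: int) -> set:
    """Pass 1: place prefix and suffix readlets, then locate the intron between them."""
    junctions = set()
    for read in reads:
        n = len(read)
        if n < 2 * readlet_len:
            continue
        p = find_unique(ref, read[:readlet_len])
        s = find_unique(ref, read[n - readlet_len:])
        if p is None or s is None:
            continue
        intron_len = (s - p) - (n - readlet_len)
        if intron_len <= 0:
            continue  # readlets are collinear without a reference gap
        # Try each split point j: read[:j] before the intron, read[j:] after it.
        for j in range(readlet_len, n - readlet_len + 1):
            right = p + j + intron_len
            if read[:j] == ref[p:p + j] and read[j:] == ref[right:right + n - j]:
                junctions.add((p + j, right))
                break
    return junctions


def align_with_junctions(read: str, ref: str, junctions):
    """Pass 2: try the plain reference, then sequences spliced at each junction."""
    pos = ref.find(read)
    if pos != -1:
        return ("unspliced", pos, None)
    n = len(read)
    for a, b in sorted(junctions):
        left_start = max(0, a - (n - 1))
        # Flanks of at most n - 1 bases force any hit to span the junction.
        spliced = ref[left_start:a] + ref[b:b + n - 1]
        k = spliced.find(read)
        if k != -1:
            return ("spliced", left_start + k, (a, b))
    return None
```

The point of the second pass is that a read overlapping one exon by only a few bases cannot anchor a readlet there in pass 1, but it can still be placed once another read has revealed the junction.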

Graph alignment

Sirén J, Välimäki N, Mäkinen V. "Indexing graphs for path queries with applications in genome research." IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 11.2 (2014): 375-388.

Works around the fact that the typical scoring scheme ignores known ALTs
[Figure labels: REF, ALT, ALT, ALT, ALT]
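A small Python sketch of the workaround: when a read carries a known ALT allele, a linear reference charges it a mismatch, while a reference augmented with that ALT scores the same base as a match. The sketch assumes ungapped alignment, single-nucleotide ALTs only, and a plain dictionary from reference offset to ALT base. A real graph index handles far more than this, and the function names are illustrative.

```python
def best_ungapped(read: str, ref: str, alts=None,
                  match: int = 2, mismatch: int = -6):
    """Return (ref_start, score) of the best ungapped placement of read on ref.

    alts maps a reference offset to a known alternate base (SNVs only).
    A read base counts as a match if it equals the REF base or the ALT base
    at that offset, which mimics following either branch of a graph bubble.
    """
    alts = alts or {}
    best = None
    for start in range(len(ref) - len(read) + 1):
        score = 0
        for j, base in enumerate(read):
            pos = start + j
            if base == ref[pos] or base == alts.get(pos):
                score += match
            else:
                score += mismatch
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

With `ref = "ACGTACGTAC"`, a known ALT `{4: "G"}`, and the read `"GTGCGT"`, the linear call `best_ungapped(read, ref)` returns `(2, 4)`, and the ALT-aware call `best_ungapped(read, ref, {4: "G"})` returns `(2, 12)`.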

Graph alignment

ALTs to include in the graph are selected (carefully) offline.
Figure: results from FORGe study. Jacob Pritt, Nae-Chyun Chen, Ben Langmead. doi: https://doi.org/10.1101/311720

Graph alignment

Phase 1: Select ALTs to include and form genome representation & index containing them
Phase 2: Align with standard heuristics and scoring function

FORGe: Model & score ALTs -> Select top X% -> Build graph index -> Align to reference with ALTs
Inputs: VCF, FASTA, FASTQ
Outputs: ALT subset, Index w/ ALTs, SAM
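The "model & score ALTs, select top X%" step can be sketched as a ranking followed by a cutoff. The sketch below is not FORGe's code: the `Variant` record, the `select_top_percent` function, and the default use of allele frequency as the score are stand-ins chosen for illustration, and any scoring function could be passed in instead.

```python
import math
from typing import Callable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record, roughly one VCF line."""
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_freq: float


def select_top_percent(variants: List[Variant], percent: float,
                       score: Callable[[Variant], float] = lambda v: v.allele_freq
                       ) -> List[Variant]:
    """Rank variants by score (highest first) and keep the top `percent` of them.

    The default score, allele frequency, is an assumed stand-in for whatever
    model is used to prioritize ALTs.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    keep = math.ceil(len(variants) * percent / 100)
    ranked = sorted(variants, key=score, reverse=True)
    return ranked[:keep]
```

The selected subset would then feed index construction, with alignment itself left unchanged, which matches the two-phase split described above.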

Using prior data

FORGe: Model & score ALTs -> Select top X% -> Build graph index -> Align to reference with ALTs
Junction discovery: Align readlets -> Call junctions -> Build graph index -> Align to reference with splice junctions

Questions

Two variations on read alignment -- spliced and graph alignment -- modify the reference to work around the fact that typical heuristics and scoring functions aren't appropriate in those settings.
Is this a paradigm that we can or should improve on? What can we do differently?

Thank you: Jacob Pritt, Nae-Chyun Chen
NSF: IIS-1349906
