Scoring functions, heuristics, and prior data in read alignment
Ben Langmead
Assistant Professor, JHU Computer Science
[email protected], langmead-lab.org, @BenLangmead
Workshop on the Future of Algorithms in Biology, September 29, 2018

Read alignment
"Sequence mapping is the cornerstone of modern genomics"
Novak AM, Rosen Y, Haussler D, Paten B. Canonical, stable, general mapping using context schemes. Bioinformatics. 2015 Nov 15;31(22):3569-76.
But how do you...
  Predict mapping quality?
  Characterize what the heuristics are missing?
  Select an appropriate scoring scheme? See John Kececioglu's talk!

Affine-gap scoring

Bowtie 2
       A    C    G    T   op   ex
A      2   -6   -6   -6   -5   -3
C     -6    2   -6   -6   -5   -3
G     -6   -6    2   -6   -5   -3
T     -6   -6   -6    2   -5   -3
op    -5   -5   -5   -5
ex    -3   -3   -3   -3

minimap2
       A    C    G    T   op   ex
A      2   -4   -4   -4   -4   -2
C     -4    2   -4   -4   -4   -2
G     -4   -4    2   -4   -4   -2
T     -4   -4   -4    2   -4   -2
op    -4   -4   -4   -4
ex    -2   -2   -2   -2

Novoalign
       A    C    G    T   op   ex
A      9   Qu   Qu   Qu  -40   -6
C     Qu    9   Qu   Qu  -40   -6
G     Qu   Qu    9   Qu  -40   -6
T     Qu   Qu   Qu    9  -40   -6
op   -40  -40  -40  -40
ex    -6   -6   -6   -6

Classic: PAM250

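As a rough illustration of how an affine-gap scheme turns an alignment into a number, the sketch below scores an already-gapped pair of rows using the Bowtie 2 and minimap2 values from the matrices above. The names `AffineScheme` and `score_alignment` are hypothetical, and the sketch assumes a gap of length n costs op + n * ex; it scores a given alignment and does not search for the best one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineScheme:
    """Match bonus, mismatch penalty, gap-open and gap-extend penalties."""
    match: int
    mismatch: int
    gap_open: int
    gap_extend: int


# Values copied from the slide's matrices (names are illustrative only).
BOWTIE2_LIKE = AffineScheme(match=2, mismatch=-6, gap_open=-5, gap_extend=-3)
MINIMAP2_LIKE = AffineScheme(match=2, mismatch=-4, gap_open=-4, gap_extend=-2)


def score_alignment(ref_row: str, read_row: str, scheme: AffineScheme) -> int:
    """Score two equal-length alignment rows, where '-' marks a gap.

    Assumption: a gap of length n costs gap_open + n * gap_extend.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have equal length")
    score = 0
    prev_gap = None  # which row the previous column's gap was in, if any
    for r, q in zip(ref_row, read_row):
        if r == "-" and q == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if r == "-" or q == "-":
            gap_row = "ref" if r == "-" else "read"
            if gap_row != prev_gap:
                score += scheme.gap_open  # charged once per gap
            score += scheme.gap_extend    # charged for every gap column
            prev_gap = gap_row
        else:
            score += scheme.match if r == q else scheme.mismatch
            prev_gap = None
    return score


# Six matches and one 2-base gap in the read row.
print(score_alignment("ACGTACGT", "ACG--CGT", BOWTIE2_LIKE))   # 12 - 5 - 6 = 1
print(score_alignment("ACGTACGT", "ACG--CGT", MINIMAP2_LIKE))  # 12 - 4 - 4 = 4
```

The same alignment gets different scores under the two schemes, which is one reason the choice of scoring scheme matters for which alignment is reported.
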
Aspects of splicing not captured by an affine-gap scoring function
(a) Nearby motifs, notably donors & acceptors
(b) Intron length distributions
Fair BJ, Pleiss JA. The power of fission: yeast as a tool for understanding complex splicing. Curr Genet. 2017 Jun;63(3):375-380.

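To make points (a) and (b) concrete, here is a minimal sketch of what an affine-gap penalty sees and what it misses. All function names are hypothetical. The default penalties are the minimap2-like values from the scoring slide, and the log-normal length model is purely an illustrative assumption, not the model of any particular tool.

```python
import math


def affine_gap_penalty(length, gap_open=-4, gap_extend=-2):
    """Penalty for one gap of `length` bases: gap_open + length * gap_extend."""
    return gap_open + length * gap_extend


def has_canonical_motifs(ref, intron_start, intron_end):
    """True if ref[intron_start:intron_end] starts with the GT donor
    dinucleotide and ends with the AG acceptor dinucleotide."""
    intron = ref[intron_start:intron_end]
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def lognormal_log_density(length, mu, sigma):
    """Log density of a log-normal distribution at `length`.

    Stand-in for an intron length model; mu and sigma are made-up inputs.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    z = (math.log(length) - mu) / sigma
    return (-math.log(length) - math.log(sigma)
            - 0.5 * math.log(2 * math.pi) - 0.5 * z * z)


# The affine penalty only grows with length, so a 500-base intron is
# treated as an extremely unlikely deletion, whatever its sequence.
print(affine_gap_penalty(500))  # -1004

# A motif check looks at the sequence at the gap boundaries instead.
ref = "AA" + "GTCCCCAG" + "TT"
print(has_canonical_motifs(ref, 2, 10))  # True
```

A splice-aware score could combine terms like these, for example by replacing the linear gap penalty with a length log-density and adjusting it when the boundary motifs are canonical. How to weight such terms is exactly the kind of choice the slide leaves open.
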
Gotoh O. Modeling one thousand intron length distributions with fitild. Bioinformatics. 2018 Oct 1;34(19):3258-3264.
Figure panels: Chicken, Nematode, Green alga, Fungus, Protist, Plant, Tapeworms

Graph alignment
Works around the fact that the typical scoring scheme ignores known ALTs
Figure labels: REF, ALT, ALT, ALT, ALT
Sirén J, Välimäki N, Mäkinen V. "Indexing graphs for path queries with applications in genome research." IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 11.2 (2014): 375-388.

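A small sketch of the problem this slide describes: against a linear reference, a read carrying a known ALT pays a mismatch penalty, while a representation that includes the ALT lets it score as a match. The function `score_ungapped` and the `alts` mapping are hypothetical, only single-base ALTs are handled, and the match and mismatch values are the Bowtie 2-like ones from the scoring slide.

```python
def score_ungapped(read, ref, offset, alts=None, match=2, mismatch=-6):
    """Score `read` against `ref` starting at `offset`, with no gaps.

    `alts` maps a reference position to the set of known alternate bases
    there. A read base equal to REF or to a known ALT counts as a match.
    """
    alts = alts or {}
    if offset < 0 or offset + len(read) > len(ref):
        raise ValueError("read does not fit in the reference at this offset")
    score = 0
    for i, base in enumerate(read):
        pos = offset + i
        if base == ref[pos] or base in alts.get(pos, ()):
            score += match
        else:
            score += mismatch
    return score


ref = "ACGTACGT"
read = "GTTC"           # differs from ref[2:6] == "GTAC" at position 4
known_alts = {4: {"T"}}  # hypothetical known SNP: A -> T at position 4

print(score_ungapped(read, ref, 2))                   # 3 matches, 1 mismatch: 0
print(score_ungapped(read, ref, 2, alts=known_alts))  # 4 matches: 8
```

Graph indexes achieve this effect by changing what the read is aligned to, rather than by changing the scoring function itself.
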
Graph alignment
ALTs to include in the graph are selected (carefully) offline.
Figure: results from FORGe study. Jacob Pritt, Nae-Chyun Chen, Ben Langmead. doi: https://doi.org/10.1101/311720

Graph alignment
Phase 1: Select ALTs to include and form genome representation & index containing them
Phase 2: Align with standard heuristics and scoring function
Pipeline: FORGe (Model & score ALTs -> Select top X%) -> Build graph index -> Align to reference with ALTs
Inputs: VCF, FASTA, FASTQ
Outputs: ALT subset, Index w/ ALTs, SAM

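The "model & score ALTs, select top X%" step of Phase 1 can be sketched as a rank-and-truncate operation. Everything here is an assumption made for illustration: the `Alt` record, the `select_top_alts` function, and especially the default score (plain allele frequency), which is a stand-in and not necessarily the model FORGe uses.

```python
import math
from typing import Callable, List, NamedTuple


class Alt(NamedTuple):
    """One alternate allele, as might be read from a VCF record."""
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_freq: float


def by_allele_freq(a: Alt) -> float:
    """Illustrative scoring model: more common ALTs score higher."""
    return a.allele_freq


def select_top_alts(alts: List[Alt], top_percent: float,
                    score: Callable[[Alt], float] = by_allele_freq) -> List[Alt]:
    """Return the highest-scoring `top_percent` percent of `alts`."""
    if not 0 <= top_percent <= 100:
        raise ValueError("top_percent must be between 0 and 100")
    ranked = sorted(alts, key=score, reverse=True)
    keep = math.ceil(len(ranked) * top_percent / 100)
    return ranked[:keep]


alts = [
    Alt("chr1", 100, "A", "T", 0.50),
    Alt("chr1", 250, "G", "C", 0.01),
    Alt("chr1", 400, "C", "G", 0.30),
    Alt("chr1", 900, "T", "A", 0.20),
]
subset = select_top_alts(alts, top_percent=50)
print([a.pos for a in subset])  # [100, 400]
```

The selected subset is what would then be passed on to index construction. Passing a different `score` callable swaps the model without touching the selection step.
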
Using prior data
FORGe (Model & score ALTs -> Select top X%) -> Build graph index -> Align to reference with ALTs
Junction discovery (Align readlets -> Call junctions) -> Build graph index -> Align to reference with splice junctions

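One simple way to picture "align to reference with splice junctions" is to add, for each called junction, a sequence in which the intron has been removed, so that a spliced read aligns without a long gap. The sketch below does this with plain strings. The function `junction_flank` is hypothetical, and a real graph index represents junctions quite differently; this only shows the idea of modifying the reference.

```python
def junction_flank(ref, intron_start, intron_end, flank):
    """Join `flank` bases upstream of an intron to `flank` bases downstream.

    The intron is ref[intron_start:intron_end] (0-based, end-exclusive).
    Flanks are truncated at the ends of the reference.
    """
    if not 0 <= intron_start < intron_end <= len(ref):
        raise ValueError("intron coordinates are out of range")
    left = ref[max(0, intron_start - flank):intron_start]
    right = ref[intron_end:intron_end + flank]
    return left + right


# Toy reference: exon, intron (GT...AG), exon.
ref = "AAACCC" + "GTTTTTAG" + "GGGTTT"
spliced_read = "ACCCGGGT"  # spans the exon-exon boundary

augmented = junction_flank(ref, intron_start=6, intron_end=14, flank=4)
print(spliced_read in ref)        # False: needs an 8-base gap against ref
print(spliced_read in augmented)  # True: matches the junction sequence
```

With the junction sequence available, the standard heuristics and scoring function can be used unchanged, which is the workaround the deck is pointing at.
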
Questions
Two variations on read alignment (spliced and graph alignment) modify the reference to work around the fact that typical heuristics and scoring functions aren’t appropriate in those settings.
Is this a paradigm that we can or should improve on?
What can we do differently?
Thank you: Jacob Pritt, Nae-Chyun Chen
NSF: IIS-1349906