A Tandem Simulation Approach to Predicting Mapping Quality

Ben Langmead Assistant Professor, JHU Computer Science [email protected], www.langmead-lab.org, @BenLangmead
Biological Data Science, CSHL, October 28, 2016 A Tandem Simulation Approach to Predicting Mapping Quality

• Foes conspire to create alignment error: Alignment errors •
Alignment error: aligner fails to map a read to its true point of origin • Contamination • Genetic variation • Sequencing error • Repetitive genomes

Mapping quality • q known as Mapping Quality, MAPQ p
cor ⌘ Probability correct q = 10 · log10(1 p cor)

MAPQ in SAM ERR050083.28 0 5 10643
1 100M * 0 0 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.24 16 1 10023 11 100M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTACC ERR050083.26 0 X 156030515 11 15M1D85M * 0 0 GGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGGTTAGGGT ERR050083.21 0 12 10063 11 100M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.14 16 5 11575 1 21S79M * 0 0 CCCCTCACCCCCACTCCCTCTACCCCAAACCCTAACCCTAACC ERR050083.19 0 15 101981086 11 76M1D24M * 0 0 TAGGGTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.33 16 3 10591 11 7S93M * 0 0 ACCCCTACCTCCCCCCACCCCTCCCCCTACCCCTAACCCTAAC ERR050083.30 16 1 10108 1 39M1D61M * 0 0 CACCCCTAACCCTATCCCTAACCCTACCCCTAACCCTAA ERR050083.34 16 1 10052 11 98M2S * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC ERR050083.32 16 10 10275 11 57M1D5M1D38M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTACCCCTA ERR050083.39 0 1 10013 11 100M * 0 0 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.36 0 5 10216 1 94M1D6M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.38 16 5 10308 1 100M * 0 0 AACCCCAACCCTAACCCCAACCCTAACCCTAACCCTAACCCTA ERR050083.35 16 5 11274 1 100M * 0 0 AACCCTACACCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.29 0 1 248945997 1 60M1D40M * 0 0 AGGGTTAGGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGG ERR050083.43 0 1 10012 11 100M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.41 0 X 156030454 1 85M2I13M * 0 0 AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.49 0 3 10740 14 73M1D12M1D14M1S * 0 0 AACCCTAACCCTTAACCCTACCCCTAACCCTAACC ERR050083.44 16 17 83247157 11 4S7M1D6M1D6M1D45M1D9M1I22M * 0 0 GTTTAGGGTTAGGGTTAGGGTTAG ERR050083.40 0 1 10014 14 100M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.45 16 1 10185 11 57M1I42M * 0 0 CTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.50 16 11 191803 2 43S8M1I47M1S * 0 0 CAACCCCACCCCCACCCCCACCCCCAACCCCAACC ERR050083.42 16 7 10001 11 1S65M3D34M * 0 0 CCTACCCCTACCCCTAACCCTACCCCTAACCCTAACCCT ERR050083.48 16 1 10134 11 44M2I54M * 0 0 ACCCCAACCCTAACCCCTATCCCTACCCCTAACCCTAAC ERR050083.37 16 X 156030581 11 3S12M1D85M * 0 0 GTGGGGTGAGGGTTAGGTTAGGGTTAGGGTTAGGGTGAG ERR050083.52 16 1 10295 11 58M1I40M1S * 0 0 CCCAACCCCAACCCCAACCCCTACCCCAACCCCAACCCC ERR050083.46 16 X 156030594 14 100M * 0 0 GGTTGGGGTTAGGGTTAGGGTGAGGGTTAGGGTTAGGGTTAGG round( q = 10 · log10(1 p cor))

MAPQ GATK McKenna A, Hanna M, Banks E, Sivachenko A,
Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303.

MAPQ MAQ Li H, Ruan J, Durbin R. Mapping short
DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008 Nov;18(11):1851-8.

Tandem simulation framework • Predicts accurate, useful MAPQs • Works
alongside various aligners (w/ tweaks) • Already adapted to Bowtie 2, BWA-MEM, SNAP • Uses supervised learning, but training is automatic & tailored to problem at hand

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem
simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7 ?

Templates Template M M M M M D M M
M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + read40 0 chr5 10643 20 5M1D6M * 0 0 GATTTAGCCAC 4%ICDH99983 MD:Z:5^T3G2 CIGAR MD:Z FLAGS QUALS

Tandem simulator “Apply” template to create new read mimicking the
original C C T C T T A A C T A A Mutate C C T C T - A A C G A A Template 1 CIGAR, MD:Z M M M M M D M M M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + Add quals C C T C T A A C G A A 4 % I C D H 9 9 9 8 3 Possibly reverse comp Draw substring Genome

Tandem simulator “Apply” template to create new read mimicking the
original Genome C C T C T T A A C T A A Mutate C C T C T - A A C G A A Template 1 CIGAR, MD:Z M M M M M D M M M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + Add quals @ m y _ o r i g i n C C T C T A A C G A A + 4 % I C D H 9 9 9 8 3 Possibly reverse comp Draw substring Tandem read

MAPQ model RandomForestRegressor from scikit-learn (or similar) Tandem chr5 106
5M1D6M AS:i:90 … chr1 994 1S8M AS:i:95 … chr9 463 9M AS:i:80 … chr1 450 9M AS:i:90 … SAM Correct? 1 1 0 1 ∝ x N p N Input chr3 106 9M AS:i:92 … SAM Correct? ? ? ? ? chr4 924 5M1D6M AS:i:80 … chr5 62 9M AS:i:88 … chr2 422 7M2S AS:i:75 … 30 15 32 4 q Features & labels Features Predictions

Features E.g., for unpaired read aligned with Bowtie 2: •
Alignment score of best aln (AS:i) • Diﬀerence b/t best, 2nd-best (AS:i - XS:i) From SAM From aligner • How well did seed-and-extend heuristic work? • Fraction seeds aligned • Fraction aligned repetitively • Average # hits per seed • etc

Qtip • Implemented the tandem simulation framework in software tool
called Qtip and evaluated it • Qtip: https://github.com/BenLangmead/qtip • Experiments: https://github.com/BenLangmead/qtip-experiments • Coming soon: preprint

Evaluation High q Qtip’s predictions Aligner’s predictions High q Low
q Cum error2 Cum error2 Low q Qtip - Aligner 0 When line is below y=0, Qtip’s predictions give lower cum error2 for alignments over x threshold (p cor 1 cor )2

Evaluation: various aligners & modes −10000 −1000 −100 −10 −1
1 10 100 1000 10000 1M 2M 3M 100 bp unpaired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 1M 2M 3M 4M 250 bp unpaired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 2M 4M 6M 100 bp paired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 2M 4M 6M 250 bp paired Bowtie 2 end−to−end Bowtie 2 local BWA−MEM SNAP

Evaluation: various aligners & species −1e+05 −1e+04 −1e+03 −1e+02 −1e+01
−1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1M 2M 3M 4M 100 bp unpaired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1M 2M 3M 4M 250 bp unpaired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 2M 4M 6M 100 bp paired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 2M 4M 6M 250 bp paired Human (GRCh38) Mouse (GRCm38) Zea Mays (AGPv4) Bowtie 2 BWA−MEM SNAP

Evaluation: various aligners & species 100 nt 250 nt RCA
RCE RCA RCE mean sd mean sd mean sd mean sd Unpaired Human Bowtie 2 -7.43 1.82 -18.53 2.61 -16.82 2.01 -19.97 0.44 BWA-MEM -15.96 0.20 -47.46 0.19 -16.52 0.28 -51.40 0.13 SNAP -19.58 0.18 -36.58 0.47 -14.74 0.27 -25.47 0.32 Mouse Bowtie 2 -5.60 0.24 -17.19 0.45 -7.05 0.33 -17.73 0.37 BWA-MEM -13.69 0.05 -46.50 0.08 -16.80 0.08 -51.40 0.13 SNAP -9.02 0.17 -31.07 0.33 -10.61 0.20 -31.78 0.43 Zea mays Bowtie 2 -6.63 0.32 -19.56 0.25 -17.09 0.38 -25.89 0.44 BWA-MEM -19.23 0.05 -58.74 0.07 -25.17 0.11 -67.15 0.08 SNAP -13.02 0.24 -38.48 0.53 -24.01 0.43 -53.80 0.42 Paired Human Bowtie 2 -34.93 1.00 -54.32 0.19 -47.93 0.26 -57.77 0.19 BWA-MEM -14.16 0.12 -41.29 0.22 -12.32 0.42 -42.73 0.28 SNAP -51.36 0.98 -11.16 1.12 -51.32 1.45 4.34 2.29 Mouse Bowtie 2 -22.67 0.22 -43.87 0.21 -31.07 0.18 -49.77 0.23 BWA-MEM -11.84 0.11 -36.11 0.32 -13.25 0.20 -39.82 0.20 SNAP -29.90 0.67 -17.04 0.76 -30.16 0.30 -15.79 0.47 Zea mays Bowtie 2 -31.92 0.18 -52.15 0.17 -57.90 0.13 -76.08 0.19 BWA-MEM -17.02 0.13 -47.42 0.14 -21.35 0.24 -56.47 0.16 SNAP -36.28 0.55 -17.08 0.79 -26.45 4.44 -20.05 0.52 Table 1: Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) for various aligners and reference genomes, expressed as percent change. The experiments used 100 nt or 250 nt reads, and unpaired or paired-end reads, as indicated. Results are means and standard deviations over RCE: relative change in sum of squared error, as percentage Mean and standard deviation over 10 trials

Computational overhead Time (minutes) Peak memory (gigabytes) Time +Qtip %
inc Memory +Qtip % inc ERR050082 Unpaired Bowtie 2 258.08 282.12 9.31 3.16 3.36 6.28 BWA-MEM 252.52 278.05 10.11 5.59 5.83 4.26 SNAP 98.00 123.60 26.11 28.42 28.66 0.85 Paired Bowtie 2 518.00 577.95 11.57 3.18 3.72 17.04 BWA-MEM 572.72 638.40 11.47 5.82 6.34 9.00 SNAP 103.73 161.52 55.71 40.09 40.59 1.25 ERR050083 Unpaired Bowtie 2 389.27 427.48 9.81 3.16 3.39 7.03 BWA-MEM 347.80 388.07 11.58 5.84 6.08 4.11 SNAP 130.60 167.95 28.60 40.94 41.20 0.63 Paired Bowtie 2 887.30 979.60 10.40 3.18 3.84 20.84 BWA-MEM 872.05 978.52 12.21 5.87 6.57 11.88 SNAP 182.57 275.43 50.86 34.34 34.99 1.87 Table 1: Overhead of the Qtip tool. Measured as the increase in running time (left) and peak memory footprint (right) from when the aligner runs by itself (“Time”) to when the aligner runs in combination with Qtip (“+Qtip”). “% inc” columns give the percent increase. Times are in minutes and memory footprints are in gigabytes. Adds 9.3 - 12.2% time overhead for Bowtie 2, BWA-MEM, >50% for SNAP Adds 6 - 21 % peak memory overhead for Bowtie 2 4 - 12% for BWA-MEM, <2% for SNAP

Thank you Funding: • NSF: IIS-1349906 • Sloan foundation: BR2014-008
Looking for a postdoc! [email protected] Or chat with me. • Rafa Irizarry (Dana Farber) • Abhi Nellore (OHSU) • Alexis Battle (JHU) • Mike Schatz (JHU)

A Tandem Simulation Approach to Predicting Mapp...

A Tandem Simulation Approach to Predicting Mapping Quality

Ben Langmead

More Decks by Ben Langmead

Other Decks in Science

Featured

Transcript

Ben Langmead Assistant Professor, JHU Computer Science [email protected], www.langmead-lab.org, @BenLangmead

• Foes conspire to create alignment error: Alignment errors •

Mapping quality • q known as Mapping Quality, MAPQ p

MAPQ in SAM ERR050083.28 0 5 10643

MAPQ GATK McKenna A, Hanna M, Banks E, Sivachenko A,

MAPQ MAQ Li H, Ruan J, Durbin R. Mapping short

Tandem simulation framework • Predicts accurate, useful MAPQs • Works

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

Templates Template M M M M M D M M

Tandem simulator “Apply” template to create new read mimicking the

Tandem simulator “Apply” template to create new read mimicking the

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

MAPQ model RandomForestRegressor from scikit-learn (or similar) Tandem chr5 106

Features E.g., for unpaired read aligned with Bowtie 2: •

Qtip • Implemented the tandem simulation framework in software tool

Evaluation High q Qtip’s predictions Aligner’s predictions High q Low

Evaluation: various aligners & modes −10000 −1000 −100 −10 −1

Evaluation: various aligners & species −1e+05 −1e+04 −1e+03 −1e+02 −1e+01

Evaluation: various aligners & species 100 nt 250 nt RCA

Computational overhead Time (minutes) Peak memory (gigabytes) Time +Qtip %

Thank you Funding: • NSF: IIS-1349906 • Sloan foundation: BR2014-008