Slide 1

Slide 1 text

Ben Langmead Assistant Professor, JHU Computer Science [email protected], www.langmead-lab.org, @BenLangmead Biological Data Science, CSHL, October 28, 2016 A Tandem Simulation Approach to Predicting Mapping Quality

Slide 2

Slide 2 text

• Foes conspire to create alignment error: Alignment errors • Alignment error: aligner fails to map a read to its true point of origin • Contamination • Genetic variation • Sequencing error • Repetitive genomes

Slide 3

Slide 3 text

Mapping quality • q known as Mapping Quality, MAPQ p cor ⌘ Probability correct q = 10 · log10(1 p cor)

Slide 4

Slide 4 text

MAPQ in SAM ERR050083.28   0   5   10643              1   100M   *   0   0   TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.24   16   1   10023              11   100M   *   0   0   CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTACC ERR050083.26   0   X   156030515      11   15M1D85M   *   0   0   GGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGGTTAGGGT ERR050083.21   0   12   10063              11   100M   *   0   0   AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.14   16   5   11575              1   21S79M   *   0   0   CCCCTCACCCCCACTCCCTCTACCCCAAACCCTAACCCTAACC ERR050083.19   0   15   101981086      11   76M1D24M   *   0   0   TAGGGTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.33   16   3   10591              11   7S93M   *   0   0   ACCCCTACCTCCCCCCACCCCTCCCCCTACCCCTAACCCTAAC ERR050083.30   16   1   10108              1   39M1D61M   *   0   0   CACCCCTAACCCTATCCCTAACCCTACCCCTAACCCTAA ERR050083.34   16   1   10052              11   98M2S   *   0   0   CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC ERR050083.32   16   10   10275              11   57M1D5M1D38M   *   0   0   CCCTAACCCTAACCCTAACCCTAACCCTACCCCTA ERR050083.39   0   1   10013              11   100M   *   0   0   TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.36   0   5   10216              1   94M1D6M  *   0   0   CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.38   16   5   10308              1   100M   *   0   0   AACCCCAACCCTAACCCCAACCCTAACCCTAACCCTAACCCTA ERR050083.35   16   5   11274              1   100M   *   0   0   AACCCTACACCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.29   0   1   248945997      1   60M1D40M   *   0   0   AGGGTTAGGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGG ERR050083.43   0   1   10012              11   100M   *   0   0   CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.41   0   X   156030454      1   85M2I13M   *   0   0   AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.49   0   3   10740              14   73M1D12M1D14M1S   *   0   0   AACCCTAACCCTTAACCCTACCCCTAACCCTAACC ERR050083.44   16   17   83247157        11   4S7M1D6M1D6M1D45M1D9M1I22M  *   0   0   GTTTAGGGTTAGGGTTAGGGTTAG ERR050083.40   0   1   10014              14   100M   *   0   0   AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.45   16   1   10185              11   57M1I42M   *   0   0   CTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.50   16   11   191803            2   43S8M1I47M1S   *   0   0   CAACCCCACCCCCACCCCCACCCCCAACCCCAACC ERR050083.42   16   7   10001              11   1S65M3D34M   *   0   0   CCTACCCCTACCCCTAACCCTACCCCTAACCCTAACCCT ERR050083.48   16   1   10134              11   44M2I54M   *   0   0   ACCCCAACCCTAACCCCTATCCCTACCCCTAACCCTAAC ERR050083.37   16   X   156030581      11   3S12M1D85M   *   0   0   GTGGGGTGAGGGTTAGGTTAGGGTTAGGGTTAGGGTGAG ERR050083.52   16   1   10295              11   58M1I40M1S   *   0   0   CCCAACCCCAACCCCAACCCCTACCCCAACCCCAACCCC ERR050083.46   16   X   156030594      14   100M   *   0   0   GGTTGGGGTTAGGGTTAGGGTGAGGGTTAGGGTTAGGGTTAGG round( q = 10 · log10(1 p cor))

Slide 5

Slide 5 text

MAPQ GATK McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303.

Slide 6

Slide 6 text

MAPQ MAQ Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008 Nov;18(11):1851-8.

Slide 7

Slide 7 text

Tandem simulation framework • Predicts accurate, useful MAPQs • Works alongside various aligners (w/ tweaks) • Already adapted to Bowtie 2, BWA-MEM, SNAP • Uses supervised learning, but training is automatic & tailored to problem at hand

Slide 8

Slide 8 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

Slide 9

Slide 9 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7 ?

Slide 10

Slide 10 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

Slide 11

Slide 11 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

Slide 12

Slide 12 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

Slide 13

Slide 13 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

Slide 14

Slide 14 text

Templates Template M M M M M D M M M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + read40   0   chr5  10643   20      5M1D6M  *   0   0   GATTTAGCCAC    4%ICDH99983      MD:Z:5^T3G2 CIGAR MD:Z FLAGS QUALS

Slide 15

Slide 15 text

Tandem simulator “Apply” template to create new read mimicking the original C C T C T T A A C T A A Mutate C C T C T - A A C G A A Template 1 CIGAR, MD:Z M M M M M D M M M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + Add quals C C T C T A A C G A A 4 % I C D H 9 9 9 8 3 Possibly reverse comp Draw substring Genome

Slide 16

Slide 16 text

Tandem simulator “Apply” template to create new read mimicking the original Genome C C T C T T A A C T A A Mutate C C T C T - A A C G A A Template 1 CIGAR, MD:Z M M M M M D M M M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + Add quals @ m y _ o r i g i n C C T C T A A C G A A + 4 % I C D H 9 9 9 8 3 Possibly reverse comp Draw substring Tandem read

Slide 17

Slide 17 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

Slide 18

Slide 18 text

Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7

Slide 19

Slide 19 text

MAPQ model RandomForestRegressor from scikit-learn (or similar) Tandem chr5  106  5M1D6M  AS:i:90  … chr1  994  1S8M  AS:i:95  … chr9  463  9M  AS:i:80  … chr1  450  9M  AS:i:90  … SAM Correct? 1 1 0 1 ∝ x N p N Input chr3  106  9M  AS:i:92  … SAM Correct? ? ? ? ? chr4  924  5M1D6M  AS:i:80  … chr5  62  9M  AS:i:88  … chr2  422  7M2S  AS:i:75  … 30 15 32 4 q Features & labels Features Predictions

Slide 20

Slide 20 text

Features E.g., for unpaired read aligned with Bowtie 2: • Alignment score of best aln (AS:i) • Difference b/t best, 2nd-best (AS:i - XS:i) From SAM From aligner • How well did seed-and-extend heuristic work? • Fraction seeds aligned • Fraction aligned repetitively • Average # hits per seed • etc

Slide 21

Slide 21 text

Qtip • Implemented the tandem simulation framework in software tool called Qtip and evaluated it • Qtip: https://github.com/BenLangmead/qtip • Experiments: https://github.com/BenLangmead/qtip-experiments • Coming soon: preprint

Slide 22

Slide 22 text

Evaluation High q Qtip’s predictions Aligner’s predictions High q Low q Cum error2 Cum error2 Low q Qtip - Aligner 0 When line is below y=0, Qtip’s predictions give lower cum error2 for alignments over x threshold (p cor 1 cor )2

Slide 23

Slide 23 text

Evaluation: various aligners & modes −10000 −1000 −100 −10 −1 1 10 100 1000 10000 1M 2M 3M 100 bp unpaired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 1M 2M 3M 4M 250 bp unpaired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 2M 4M 6M 100 bp paired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 2M 4M 6M 250 bp paired Bowtie 2 end−to−end Bowtie 2 local BWA−MEM SNAP

Slide 24

Slide 24 text

Evaluation: various aligners & species −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1M 2M 3M 4M 100 bp unpaired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1M 2M 3M 4M 250 bp unpaired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 2M 4M 6M 100 bp paired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 2M 4M 6M 250 bp paired Human (GRCh38) Mouse (GRCm38) Zea Mays (AGPv4) Bowtie 2 BWA−MEM SNAP

Slide 25

Slide 25 text

Evaluation: various aligners & species 100 nt 250 nt RCA RCE RCA RCE mean sd mean sd mean sd mean sd Unpaired Human Bowtie 2 -7.43 1.82 -18.53 2.61 -16.82 2.01 -19.97 0.44 BWA-MEM -15.96 0.20 -47.46 0.19 -16.52 0.28 -51.40 0.13 SNAP -19.58 0.18 -36.58 0.47 -14.74 0.27 -25.47 0.32 Mouse Bowtie 2 -5.60 0.24 -17.19 0.45 -7.05 0.33 -17.73 0.37 BWA-MEM -13.69 0.05 -46.50 0.08 -16.80 0.08 -51.40 0.13 SNAP -9.02 0.17 -31.07 0.33 -10.61 0.20 -31.78 0.43 Zea mays Bowtie 2 -6.63 0.32 -19.56 0.25 -17.09 0.38 -25.89 0.44 BWA-MEM -19.23 0.05 -58.74 0.07 -25.17 0.11 -67.15 0.08 SNAP -13.02 0.24 -38.48 0.53 -24.01 0.43 -53.80 0.42 Paired Human Bowtie 2 -34.93 1.00 -54.32 0.19 -47.93 0.26 -57.77 0.19 BWA-MEM -14.16 0.12 -41.29 0.22 -12.32 0.42 -42.73 0.28 SNAP -51.36 0.98 -11.16 1.12 -51.32 1.45 4.34 2.29 Mouse Bowtie 2 -22.67 0.22 -43.87 0.21 -31.07 0.18 -49.77 0.23 BWA-MEM -11.84 0.11 -36.11 0.32 -13.25 0.20 -39.82 0.20 SNAP -29.90 0.67 -17.04 0.76 -30.16 0.30 -15.79 0.47 Zea mays Bowtie 2 -31.92 0.18 -52.15 0.17 -57.90 0.13 -76.08 0.19 BWA-MEM -17.02 0.13 -47.42 0.14 -21.35 0.24 -56.47 0.16 SNAP -36.28 0.55 -17.08 0.79 -26.45 4.44 -20.05 0.52 Table 1: Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) for various aligners and reference genomes, expressed as percent change. The experiments used 100 nt or 250 nt reads, and unpaired or paired-end reads, as indicated. Results are means and standard deviations over RCE: relative change in sum of squared error, as percentage Mean and standard deviation over 10 trials

Slide 26

Slide 26 text

Computational overhead Time (minutes) Peak memory (gigabytes) Time +Qtip % inc Memory +Qtip % inc ERR050082 Unpaired Bowtie 2 258.08 282.12 9.31 3.16 3.36 6.28 BWA-MEM 252.52 278.05 10.11 5.59 5.83 4.26 SNAP 98.00 123.60 26.11 28.42 28.66 0.85 Paired Bowtie 2 518.00 577.95 11.57 3.18 3.72 17.04 BWA-MEM 572.72 638.40 11.47 5.82 6.34 9.00 SNAP 103.73 161.52 55.71 40.09 40.59 1.25 ERR050083 Unpaired Bowtie 2 389.27 427.48 9.81 3.16 3.39 7.03 BWA-MEM 347.80 388.07 11.58 5.84 6.08 4.11 SNAP 130.60 167.95 28.60 40.94 41.20 0.63 Paired Bowtie 2 887.30 979.60 10.40 3.18 3.84 20.84 BWA-MEM 872.05 978.52 12.21 5.87 6.57 11.88 SNAP 182.57 275.43 50.86 34.34 34.99 1.87 Table 1: Overhead of the Qtip tool. Measured as the increase in running time (left) and peak memory footprint (right) from when the aligner runs by itself (“Time”) to when the aligner runs in combination with Qtip (“+Qtip”). “% inc” columns give the percent increase. Times are in minutes and memory footprints are in gigabytes. Adds 9.3 - 12.2% time overhead for Bowtie 2, BWA-MEM, >50% for SNAP Adds 6 - 21 % peak memory overhead for Bowtie 2 4 - 12% for BWA-MEM, <2% for SNAP

Slide 27

Slide 27 text

Thank you Funding: • NSF: IIS-1349906 • Sloan foundation: BR2014-008 Looking for a postdoc! [email protected] Or chat with me. • Rafa Irizarry (Dana Farber) • Abhi Nellore (OHSU) • Alexis Battle (JHU) • Mike Schatz (JHU)