Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A Tandem Simulation Approach to Predicting Mapping Quality

Ben Langmead
October 28, 2016

A Tandem Simulation Approach to Predicting Mapping Quality

Read alignment is the first step in most sequencing data analyses. It is also a source of errors and interpretability problems. Repetitive genomes, algorithmic shortcuts, and genetic variation impede the aligner's ability to find a read's true point of origin. Aligners therefore report a mapping quality: the probability the reported point of origin for a read is incorrect. However, there are no proposals for how to calculate mapping qualities in a general way, applicable across tools and alignment scenarios. I describe an accurate, aligner-agnostic framework for predicting mapping qualities that works by simulating a set of "tandem" reads, similar to the input reads in important ways, but for which the true point of origin is known. Alignments of tandem reads are used to build a model for predicting mapping quality, which is then applied to the input-read alignments. The model is automatically tailored to the alignment scenario at hand, allowing it to make accurate mapping-quality predictions across a range of read lengths, alignment parameters, genomes, and read aligners. I implement this approach in a software tool called Qtip, which is accurate, low-overhead, and compatible with popular read aligners. Qtip is open source software available from https://github.com/BenLangmead/qtip and experiments are published at https://github.com/BenLangmead/qtip-experiments.

Ben Langmead

October 28, 2016
Tweet

More Decks by Ben Langmead

Other Decks in Science

Transcript

  1. Ben Langmead Assistant Professor, JHU Computer Science [email protected], www.langmead-lab.org, @BenLangmead

    Biological Data Science, CSHL, October 28, 2016 A Tandem Simulation Approach to Predicting Mapping Quality
  2. • Foes conspire to create alignment error: Alignment errors •

    Alignment error: aligner fails to map a read to its true point of origin • Contamination • Genetic variation • Sequencing error • Repetitive genomes
  3. Mapping quality • q known as Mapping Quality, MAPQ p

    cor ⌘ Probability correct q = 10 · log10(1 p cor)
  4. MAPQ in SAM ERR050083.28   0   5   10643

                 1   100M   *   0   0   TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.24   16   1   10023              11   100M   *   0   0   CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTACC ERR050083.26   0   X   156030515      11   15M1D85M   *   0   0   GGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGGTTAGGGT ERR050083.21   0   12   10063              11   100M   *   0   0   AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.14   16   5   11575              1   21S79M   *   0   0   CCCCTCACCCCCACTCCCTCTACCCCAAACCCTAACCCTAACC ERR050083.19   0   15   101981086      11   76M1D24M   *   0   0   TAGGGTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.33   16   3   10591              11   7S93M   *   0   0   ACCCCTACCTCCCCCCACCCCTCCCCCTACCCCTAACCCTAAC ERR050083.30   16   1   10108              1   39M1D61M   *   0   0   CACCCCTAACCCTATCCCTAACCCTACCCCTAACCCTAA ERR050083.34   16   1   10052              11   98M2S   *   0   0   CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC ERR050083.32   16   10   10275              11   57M1D5M1D38M   *   0   0   CCCTAACCCTAACCCTAACCCTAACCCTACCCCTA ERR050083.39   0   1   10013              11   100M   *   0   0   TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT ERR050083.36   0   5   10216              1   94M1D6M  *   0   0   CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.38   16   5   10308              1   100M   *   0   0   AACCCCAACCCTAACCCCAACCCTAACCCTAACCCTAACCCTA ERR050083.35   16   5   11274              1   100M   *   0   0   AACCCTACACCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.29   0   1   248945997      1   60M1D40M   *   0   0   AGGGTTAGGGTTAGGGTTAGGGTAGGGTTAGGGTTAGGG ERR050083.43   0   1   10012              11   100M   *   0   0   CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ERR050083.41   0   X   156030454      1   85M2I13M   *   0   0   AGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG ERR050083.49   0   3   10740              14   73M1D12M1D14M1S   *   0   0   AACCCTAACCCTTAACCCTACCCCTAACCCTAACC ERR050083.44   16   17   83247157        11   4S7M1D6M1D6M1D45M1D9M1I22M  *   0   0   GTTTAGGGTTAGGGTTAGGGTTAG ERR050083.40   0   1   10014              14   100M   *   0   0   AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.45   16   1   10185              11   57M1I42M   *   0   0   CTAACCCTACCCCTAACCCTAACCCTAACCCTAACCCTA ERR050083.50   16   11   191803            2   43S8M1I47M1S   *   0   0   CAACCCCACCCCCACCCCCACCCCCAACCCCAACC ERR050083.42   16   7   10001              11   1S65M3D34M   *   0   0   CCTACCCCTACCCCTAACCCTACCCCTAACCCTAACCCT ERR050083.48   16   1   10134              11   44M2I54M   *   0   0   ACCCCAACCCTAACCCCTATCCCTACCCCTAACCCTAAC ERR050083.37   16   X   156030581      11   3S12M1D85M   *   0   0   GTGGGGTGAGGGTTAGGTTAGGGTTAGGGTTAGGGTGAG ERR050083.52   16   1   10295              11   58M1I40M1S   *   0   0   CCCAACCCCAACCCCAACCCCTACCCCAACCCCAACCCC ERR050083.46   16   X   156030594      14   100M   *   0   0   GGTTGGGGTTAGGGTTAGGGTGAGGGTTAGGGTTAGGGTTAGG round( q = 10 · log10(1 p cor))
  5. MAPQ GATK McKenna A, Hanna M, Banks E, Sivachenko A,

    Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303.
  6. MAPQ MAQ Li H, Ruan J, Durbin R. Mapping short

    DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008 Nov;18(11):1851-8.
  7. Tandem simulation framework • Predicts accurate, useful MAPQs • Works

    alongside various aligners (w/ tweaks) • Already adapted to Bowtie 2, BWA-MEM, SNAP • Uses supervised learning, but training is automatic & tailored to problem at hand
  8. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7
  9. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7 ?
  10. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7
  11. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7
  12. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7
  13. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7
  14. Templates Template M M M M M D M M

    M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + read40   0   chr5  10643   20      5M1D6M  *   0   0   GATTTAGCCAC    4%ICDH99983      MD:Z:5^T3G2 CIGAR MD:Z FLAGS QUALS
  15. Tandem simulator “Apply” template to create new read mimicking the

    original C C T C T T A A C T A A Mutate C C T C T - A A C G A A Template 1 CIGAR, MD:Z M M M M M D M M M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + Add quals C C T C T A A C G A A 4 % I C D H 9 9 9 8 3 Possibly reverse comp Draw substring Genome
  16. Tandem simulator “Apply” template to create new read mimicking the

    original Genome C C T C T T A A C T A A Mutate C C T C T - A A C G A A Template 1 CIGAR, MD:Z M M M M M D M M M R M M | | | | | | | | | | M M M M M - M M M G M M 4 % I C D H 9 9 9 8 3 Ref Read Strand: + Add quals @ m y _ o r i g i n C C T C T A A C G A A + 4 % I C D H 9 9 9 8 3 Possibly reverse comp Draw substring Tandem read
  17. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7
  18. Tandem simulation framework Aligner (Bowtie 2, BWA MEM, SNAP) Tandem

    simulator Input reads Input alignments Tandem reads Tandem alignments Aligner (Bowtie 2, BWA MEM, SNAP) MAPQ Estimator Input alignments with corrected MAPQ Fit model Estimate MAPQs MAPQ model 1 Build model 2 Input model Simulate reads 3 4 Training records Extract records 5 6 7
  19. MAPQ model RandomForestRegressor from scikit-learn (or similar) Tandem chr5  106

     5M1D6M  AS:i:90  … chr1  994  1S8M  AS:i:95  … chr9  463  9M  AS:i:80  … chr1  450  9M  AS:i:90  … SAM Correct? 1 1 0 1 ∝ x N p N Input chr3  106  9M  AS:i:92  … SAM Correct? ? ? ? ? chr4  924  5M1D6M  AS:i:80  … chr5  62  9M  AS:i:88  … chr2  422  7M2S  AS:i:75  … 30 15 32 4 q Features & labels Features Predictions
  20. Features E.g., for unpaired read aligned with Bowtie 2: •

    Alignment score of best aln (AS:i) • Difference b/t best, 2nd-best (AS:i - XS:i) From SAM From aligner • How well did seed-and-extend heuristic work? • Fraction seeds aligned • Fraction aligned repetitively • Average # hits per seed • etc
  21. Qtip • Implemented the tandem simulation framework in software tool

    called Qtip and evaluated it • Qtip: https://github.com/BenLangmead/qtip • Experiments: https://github.com/BenLangmead/qtip-experiments • Coming soon: preprint
  22. Evaluation High q Qtip’s predictions Aligner’s predictions High q Low

    q Cum error2 Cum error2 Low q Qtip - Aligner 0 When line is below y=0, Qtip’s predictions give lower cum error2 for alignments over x threshold (p cor 1 cor )2
  23. Evaluation: various aligners & modes −10000 −1000 −100 −10 −1

    1 10 100 1000 10000 1M 2M 3M 100 bp unpaired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 1M 2M 3M 4M 250 bp unpaired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 2M 4M 6M 100 bp paired −10000 −1000 −100 −10 −1 1 10 100 1000 10000 2M 4M 6M 250 bp paired Bowtie 2 end−to−end Bowtie 2 local BWA−MEM SNAP
  24. Evaluation: various aligners & species −1e+05 −1e+04 −1e+03 −1e+02 −1e+01

    −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1M 2M 3M 4M 100 bp unpaired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1M 2M 3M 4M 250 bp unpaired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 2M 4M 6M 100 bp paired −1e+05 −1e+04 −1e+03 −1e+02 −1e+01 −1e+00 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 2M 4M 6M 250 bp paired Human (GRCh38) Mouse (GRCm38) Zea Mays (AGPv4) Bowtie 2 BWA−MEM SNAP
  25. Evaluation: various aligners & species 100 nt 250 nt RCA

    RCE RCA RCE mean sd mean sd mean sd mean sd Unpaired Human Bowtie 2 -7.43 1.82 -18.53 2.61 -16.82 2.01 -19.97 0.44 BWA-MEM -15.96 0.20 -47.46 0.19 -16.52 0.28 -51.40 0.13 SNAP -19.58 0.18 -36.58 0.47 -14.74 0.27 -25.47 0.32 Mouse Bowtie 2 -5.60 0.24 -17.19 0.45 -7.05 0.33 -17.73 0.37 BWA-MEM -13.69 0.05 -46.50 0.08 -16.80 0.08 -51.40 0.13 SNAP -9.02 0.17 -31.07 0.33 -10.61 0.20 -31.78 0.43 Zea mays Bowtie 2 -6.63 0.32 -19.56 0.25 -17.09 0.38 -25.89 0.44 BWA-MEM -19.23 0.05 -58.74 0.07 -25.17 0.11 -67.15 0.08 SNAP -13.02 0.24 -38.48 0.53 -24.01 0.43 -53.80 0.42 Paired Human Bowtie 2 -34.93 1.00 -54.32 0.19 -47.93 0.26 -57.77 0.19 BWA-MEM -14.16 0.12 -41.29 0.22 -12.32 0.42 -42.73 0.28 SNAP -51.36 0.98 -11.16 1.12 -51.32 1.45 4.34 2.29 Mouse Bowtie 2 -22.67 0.22 -43.87 0.21 -31.07 0.18 -49.77 0.23 BWA-MEM -11.84 0.11 -36.11 0.32 -13.25 0.20 -39.82 0.20 SNAP -29.90 0.67 -17.04 0.76 -30.16 0.30 -15.79 0.47 Zea mays Bowtie 2 -31.92 0.18 -52.15 0.17 -57.90 0.13 -76.08 0.19 BWA-MEM -17.02 0.13 -47.42 0.14 -21.35 0.24 -56.47 0.16 SNAP -36.28 0.55 -17.08 0.79 -26.45 4.44 -20.05 0.52 Table 1: Relative change in area under CID (RCA) and relative change in sum of squared error (RCE) for various aligners and reference genomes, expressed as percent change. The experiments used 100 nt or 250 nt reads, and unpaired or paired-end reads, as indicated. Results are means and standard deviations over RCE: relative change in sum of squared error, as percentage Mean and standard deviation over 10 trials
  26. Computational overhead Time (minutes) Peak memory (gigabytes) Time +Qtip %

    inc Memory +Qtip % inc ERR050082 Unpaired Bowtie 2 258.08 282.12 9.31 3.16 3.36 6.28 BWA-MEM 252.52 278.05 10.11 5.59 5.83 4.26 SNAP 98.00 123.60 26.11 28.42 28.66 0.85 Paired Bowtie 2 518.00 577.95 11.57 3.18 3.72 17.04 BWA-MEM 572.72 638.40 11.47 5.82 6.34 9.00 SNAP 103.73 161.52 55.71 40.09 40.59 1.25 ERR050083 Unpaired Bowtie 2 389.27 427.48 9.81 3.16 3.39 7.03 BWA-MEM 347.80 388.07 11.58 5.84 6.08 4.11 SNAP 130.60 167.95 28.60 40.94 41.20 0.63 Paired Bowtie 2 887.30 979.60 10.40 3.18 3.84 20.84 BWA-MEM 872.05 978.52 12.21 5.87 6.57 11.88 SNAP 182.57 275.43 50.86 34.34 34.99 1.87 Table 1: Overhead of the Qtip tool. Measured as the increase in running time (left) and peak memory footprint (right) from when the aligner runs by itself (“Time”) to when the aligner runs in combination with Qtip (“+Qtip”). “% inc” columns give the percent increase. Times are in minutes and memory footprints are in gigabytes. Adds 9.3 - 12.2% time overhead for Bowtie 2, BWA-MEM, >50% for SNAP Adds 6 - 21 % peak memory overhead for Bowtie 2 4 - 12% for BWA-MEM, <2% for SNAP
  27. Thank you Funding: • NSF: IIS-1349906 • Sloan foundation: BR2014-008

    Looking for a postdoc! [email protected] Or chat with me. • Rafa Irizarry (Dana Farber) • Abhi Nellore (OHSU) • Alexis Battle (JHU) • Mike Schatz (JHU)