Advances in pan-genomics for addressing reference bias

Ben Langmead
February 11, 2021

Transcript

  1. Ben Langmead, Associate Professor, JHU Computer Science

    langmea@cs.jhu.edu, langmead-lab.org, @BenLangmead. Stanford Biostatistics Seminar, February 11, 2021: Advances in pan-genomics for addressing reference bias
  2. Today

    Outline (with our work):
    1. References & reference bias
    2. Graphs for fighting reference bias
    2a. Graphs can include too much (our work: FORGe)
    3. Many linear references for fighting bias (our work: Reference flow)
    4. Indexing reference panels (our work: FM index & r-index)
  3. Human genome Image: Russ London, https://commons.wikimedia.org/wiki/File:Wellcome_genome_bookcase.png

  4. Human genome

    Human Genome Project yielded a single reference genome (haplotype). Image: Abizar Lakdawalla. https://en.wikipedia.org/wiki/Ploidy#Diploid
  5. More variants

    Figure labels (super populations): AFR, EAS, AMR, EUR, SAS. 1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., … Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74.
  6. More genomes

  7. More genomes

    Let’s finish the human genome. The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome. Karen Miga (@khmiga), Adam Phillippy (@aphillippy). https://www.slideshare.net/GenomeInABottle/how-giab-fits-in-the-rest-of-the-world-telomere-to-telomere-consortium https://github.com/nanopore-wgs-consortium/chm13
  8. Alignment (x billions, x million)

    Read: CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
    Reference genome: >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1
        GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
        CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC
        GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
        (the FASTA sequence continues across the slide background)
  9. Alignment

    Read: CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
    Reference genome: >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 (same FASTA sequence as on slide 8)

    Candidate 1:
        Read       CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC
                   ||||||  ||||   ||||  |||||||||     |||| |||||
        Reference  CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA

    Candidate 2:
        Read       CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
                   |||||||||||| ||||||||||||||||||||| ||||| |
        Reference  CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC
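
    To make the comparison between the two candidate alignments concrete, here is a minimal sketch that tallies matching, mismatching and gap columns for each pair of aligned rows. The function name `tally_columns` is hypothetical, and counting the read's `N` as a mismatch is an assumption of this sketch; a real aligner uses its own scoring scheme.

```python
def tally_columns(read_row: str, ref_row: str) -> dict:
    """Count match, mismatch and gap columns of a gapped pairwise alignment."""
    if len(read_row) != len(ref_row):
        raise ValueError("aligned rows must have equal length")
    tally = {"match": 0, "mismatch": 0, "gap": 0}
    for read_char, ref_char in zip(read_row, ref_row):
        if read_char == "-" or ref_char == "-":
            tally["gap"] += 1
        elif read_char == ref_char:
            tally["match"] += 1
        else:
            # Assumption: an ambiguous base (N) simply counts as a mismatch.
            tally["mismatch"] += 1
    return tally


# Aligned rows copied from the slide (Candidate 1 and Candidate 2).
candidate_1 = tally_columns(
    "CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC",
    "CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA",
)
candidate_2 = tally_columns(
    "CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC",
    "CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC",
)
print(candidate_1)  # {'match': 32, 'mismatch': 8, 'gap': 7}
print(candidate_2)  # {'match': 39, 'mismatch': 2, 'gap': 1}
```

    Under any reasonable scoring, Candidate 2 (one gap, two mismatches) is preferable to Candidate 1.
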
  10. Reference bias

    Tendency to miss or misalign reads containing non-reference alleles. Figure labels: REF, ALT, Ref; panels: No bias, Biased against.
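
    One common way to quantify this, sketched here as an assumption rather than as the exact measure used in the talk, is the fraction of reads overlapping a heterozygous site that carry the REF allele: about 0.5 when there is no bias, above 0.5 when ALT reads are being lost. The function name `reference_bias` is hypothetical.

```python
def reference_bias(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a heterozygous site that carry the REF allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads overlap the site")
    return ref_reads / total


print(reference_bias(50, 50))  # 0.5, no bias
print(reference_bias(60, 40))  # 0.6, biased toward REF
```
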
  11. Reference bias

    Confounder in allele-specific analyses. Figure labels: REF, ALT; MAT, PAT; Gene 1 (slight bias -> PAT); Gene 2 (strong bias -> MAT).
  12. Reference bias

    Confounder in allele-specific analyses. Degner, J. F., Marioni, J. C., Pai, A. A., Pickrell, J. K., Nkadori, E., Gilad, Y., & Pritchard, J. K. (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 25(24), 3207–3212.
  13. Reference bias

    Confounder in hypervariable regions (poor coverage in MHC region). Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
  14. Reference bias

    Confounder when comparing inbred strains. Wulfridge, P., Langmead, B., Feinberg, A. P., & Hansen, K. D. (2019). Analyzing whole genome bisulfite sequencing data from highly divergent genotypes. Nucleic acids research, 47(19), e117.
  15. Why attack reference bias?

    “By not including diversity we are missing out on great opportunities to make novel discoveries and to be more inclusive of world populations,” [Esteban] Burchard said. https://www.pbs.org/newshour/science/genetic-research-has-a-white-bias-and-it-may-be-hurting-everyones-health
  16. Why attack reference bias?

    To avoid a world where diagnostics & therapeutics are differentially effective by population. Figure labels (super populations): AFR, EAS, AMR, EUR, SAS. 1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., … Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74.
  17. Pangenomics

    "Variation graphs... which encode the genetic variation within a population as a graph, have been proposed as a solution to the reference bias [problem]."
    Quote: Sirén, Jouni. "Indexing variation graphs." In 2017 Proceedings of the nineteenth workshop on algorithm engineering and experiments (ALENEX), pp. 13-27. Society for Industrial and Applied Mathematics, 2017.
    Image: Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879.
  18. Pangenomics

    Timeline of tools, 2009 to 2019 (+ more!):
    GenomeMapper: Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., & Weigel, D. (2009). Simultaneous alignment of short reads against multiple genomes. Genome biology, 10(9), R98.
    ERG: Satya RV, Zavaljevski N, Reifman J. A new strategy to reduce allelic bias in RNA-Seq readmapping. Nucleic Acids Res. 2012 Sep;40(16):e127.
    BWBBLE: Huang, L., Popic, V., & Batzoglou, S. (2013). Short read alignment with populations of genomes. Bioinformatics (Oxford, England), 29(13), i361–i370.
    GCSA: Sirén, J, Välimäki, N, and Mäkinen, V. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375–388, 2014.
    VG/GCSA2: Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879.
    HISAT2: Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8), 907–915.
  19. Pangenomics

    Catalog vs Map.
    Catalog. Goal: an inclusive picture for answering relatedness questions.
    Map. Goal: answer "where do I match?" questions.
    Is more variation always better?
  20. Is more better?

    Adding variants to the reference can remove undesirable penalties in alignment score. It also adds ambiguity to the reference, confusing the aligner.
  21. FORGe

    Find the Optimal Reference Genome.
    Population coverage: $C(G) = \sum_{\langle s,j \rangle \in G} p(\langle s,j \rangle)$. High allele frequency gets high priority.
    Uniqueness: $U(G) = \sum_{\langle s,j \rangle \in G} \frac{1}{f_G(s)}$. Variants adding more copies of existing k-mers get low priority.
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
  22. FORGe

    Hybrid: $H(G) = \sum_{\langle s,j \rangle \in G} \frac{p(\langle s,j \rangle)}{f_G(s)}$. Product of previous measures.
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
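
    The three measures can be illustrated with a toy computation. This is a minimal sketch, not the FORGe implementation: it assumes the graph G is given as a flat list of k-mer occurrences (s, j, p), where p stands in for p(<s, j>), and it takes f_G(s) to be the number of occurrences of s in that list. The function name `forge_scores` and the toy values are hypothetical.

```python
from collections import Counter


def forge_scores(kmers):
    """Return (C, U, H) for a list of (s, j, p) k-mer occurrences in G.

    s is the k-mer string, j its position, p the assumed probability
    p(<s, j>) of that k-mer occurrence in the population.
    """
    copies = Counter(s for s, _, _ in kmers)           # f_G(s)
    coverage = sum(p for _, _, p in kmers)             # C(G)
    uniqueness = sum(1 / copies[s] for s, _, _ in kmers)  # U(G)
    hybrid = sum(p / copies[s] for s, _, p in kmers)   # H(G)
    return coverage, uniqueness, hybrid


# Toy graph: "ACG" occurs twice, so each copy is down-weighted by 1/2.
toy = [("ACG", 0, 1.0), ("CGT", 1, 0.25), ("ACG", 7, 0.5)]
print(forge_scores(toy))  # (1.75, 2.0, 1.0)
```

    In this toy setting a variant that only creates extra copies of an existing k-mer raises C but adds little to H, which is the trade-off the hybrid measure is meant to capture.
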
  23. FORGe

    Diagram labels: FASTA, VCF, FORGe. HISAT2 indexes built with increasing shares of variants (+ more variants): 0%, 2%, 4%, 6%, 8%, 10%, 15%, 20%, 30%, ..., 100%.
    Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8), 907–915.
  24-30. FORGe

    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
  31. FORGe

    (b) Bias avoidance saturates at ~10% of variants. Plot axes: % Variants (0 to 70), Reference Bias (0.3 to 0.7), Mappings (Billions; 1.431 to 1.433). Series: HISAT2, Auto, SNVs + Indels, SNVs Only.
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
  32-34. FORGe

    • By modeling variants, we can balance pros and cons
    • Even a modest # of variants can alleviate bias, approaching accuracy of ideal, personalized genome
    • Peak accuracy is at ~10% of variants, about a ≥5% allele frequency cutoff
    • Similar result (in cow) from another group
    Is more variation always better? Not necessarily (when using a graph).
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
  35. Today

    Outline (with our work):
    1. References & reference bias
    2. Graphs for fighting reference bias
    2a. Graphs can include too much (our work: FORGe)
    3. Many linear references for fighting bias (our work: Reference flow)
    4. Indexing reference panels (our work: FM index & r-index)
  36-37. Reference flow

    Workflow (each alignment step is labeled bt2):
    1. Align the read to GRCh38.
    2. Aligned uniquely? Yes: goes to final alignments.
    3. No: align the read to the EUR, AFR, EAS, SAS and AMR references.
    4. Take best; lift back to GRC. Goes to final alignments.
    Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.
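
    The two-pass workflow on these slides can be sketched as a thin wrapper around any aligner. This is a hedged illustration, not the authors' code: `Alignment`, `reference_flow`, the `align` callback and the `lift_to_grch38` callback are all hypothetical names, "unique" is whatever the wrapped aligner reports (for example a mapping-quality threshold), and keeping the first-pass hit among the candidates is an assumption of this sketch.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Alignment(NamedTuple):
    reference: str  # name of the reference the read was aligned to
    position: int   # position in that reference's coordinates
    score: int      # alignment score, higher is better
    unique: bool    # True if the aligner considers the placement unique


# Hypothetical callback types: a real wrapper would call an aligner such as
# Bowtie 2 and a coordinate lift-over tool here.
Aligner = Callable[[str, str], Optional[Alignment]]
Lifter = Callable[[Alignment], Alignment]

POPULATION_REFS = ("EUR", "AFR", "EAS", "SAS", "AMR")


def reference_flow(
    read: str,
    align: Aligner,
    lift_to_grch38: Lifter,
    first_pass: str = "GRCh38",
    second_pass: Sequence[str] = POPULATION_REFS,
) -> Optional[Alignment]:
    """Align to GRCh38 first; retry on population references if not unique."""
    first = align(read, first_pass)
    if first is not None and first.unique:
        return first  # aligned uniquely: this is the final alignment

    # Second pass: try every population reference and lift hits back.
    candidates = [first] if first is not None else []
    for name in second_pass:
        hit = align(read, name)
        if hit is not None:
            candidates.append(lift_to_grch38(hit))

    if not candidates:
        return None
    return max(candidates, key=lambda a: a.score)  # take best
```

    Because the wrapper only needs an aligner callback and a lift-over callback, existing linear aligners can be reused unchanged, which is the point made on the summary slide.
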
  38. Reference flow

    Simulation using human chromosome 21; 100 individuals, 2 million reads per individual. Reference flow achieves nearly the same % correct alignments as personalized reference; improvement over linear, major-allele & vg. Plot axis: Super population of simulated individual.
  39. Reference flow

    Reference flow avoids nearly as much bias as vg. Plot axis: Super population of simulated individual. Plot annotations: no bias; More reference bias.
  40. Reference flow

    • Align to multiple linear reference genomes, selected to cover the genotype space
    • Similar accuracy/bias as vg, at a fraction of the time (18%) and memory footprint (14%)
    • Simple wrapper around existing aligner
    • But misses many rare alleles when used with a small number of references
    Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.
  41. Today

    Outline (with our work):
    1. References & reference bias
    2. Graphs for fighting reference bias
    2a. Graphs can include too much (our work: FORGe)
    3. Many linear references for fighting bias (our work: Reference flow)
    4. Indexing reference panels (our work: FM index & r-index)
  42. FM Index

    FM index behind Bowtie & BWA consists of Burrows-Wheeler Transform (BWT), plus auxiliary structures. BWT reorders the letters according to alphabetical order of their right contexts in T (e.g. genome).

    T = abaaba$
    All rotations, sorted (Burrows-Wheeler Matrix):
        $abaaba
        a$abaab
        aaba$ab
        aba$aba
        abaaba$
        ba$abaa
        baaba$a
    BWT(T) = last column = abba$aa
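
    The construction on this slide (all rotations, sort, take the last column) translates directly into a few lines of code. This is a teaching sketch only: sorting full rotations takes quadratic space, whereas real indexes such as those behind Bowtie and BWA build the BWT via suffix arrays. The function name `bwt_via_rotations` is hypothetical.

```python
def bwt_via_rotations(text: str) -> str:
    """Burrows-Wheeler Transform via the sorted matrix of all rotations.

    The text must end with a unique terminator '$', which sorts before
    letters in ASCII order.
    """
    if not text.endswith("$") or "$" in text[:-1]:
        raise ValueError("text must end with a single '$' terminator")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)  # last column


print(bwt_via_rotations("abaaba$"))  # abba$aa
```
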
  43-44. FM Index

    BWT gathers “like” characters (sharing right context) into runs. E.g. for a text where "rectangle" appears many times, the "ectangle" tends to be preceded by "r". These rs come together in a BWT run.
    T      = rectangular_rectangle_divided_into_rectangles$
    BWT(T) = sedrotttleeeei_lrrrdlnnnv_duggaaaita__$ecccngi
  45-48. FM Index

    BWT gathers “like” characters (sharing right context) into runs, e.g. rrrrrr. When T is more repetitive, BWT runs are longer & fewer.

    String                                                        Avg. run length
    T       Tomorrow_and_tomorrow_and_tomorrow$                   1.09
    BWT(T)  w$wwdd__nnoooaattTmmmrrrrrrooo__ooo                   2.33
    T       It_was_the_best_of_times_it_was_the_worst_of_times$   1.00
    BWT(T)  s$esttssfftteww_hhmmbootttt_ii__woeeaaressIi_______   1.76
    T       in_the_jingle_jangle_morning_Ill_come_following_you$  1.04
    BWT(T)  u_gleeeengj_mlhl_nnnnt$nwj__lggIolo_iiiiarfcmylo_oo_  1.30
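
    The "Avg. run length" column is the string length divided by the number of maximal runs of equal characters. A small sketch reproducing the first pair of rows (the helper names are hypothetical, and the BWT is built by the simple sorted-rotations method, which is only practical for short strings):

```python
from itertools import groupby


def bwt_via_rotations(text: str) -> str:
    """BWT as the last column of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def average_run_length(s: str) -> float:
    """Length of s divided by its number of maximal same-character runs."""
    runs = sum(1 for _ in groupby(s))
    return len(s) / runs


t = "Tomorrow_and_tomorrow_and_tomorrow$"
b = bwt_via_rotations(t)
print(b)                                # w$wwdd__nnoooaattTmmmrrrrrrooo__ooo
print(f"{average_run_length(t):.2f}")   # 1.09
print(f"{average_run_length(b):.2f}")   # 2.33
```
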
  49. FM Index

    As we index more diploid genomes, n (total length) grows linearly while r (total # BWT runs) grows sublinearly. From 1000 Genomes project phase-3 callset.

    # genomes   n           r
    1           6,072 M     3,264 M
    2           12,144 M    3,282 M
    3           18,217 M    3,386 M
    4           24,408 M    3,423 M
    5           30,480 M    3,436 M
    6           36,671 M    3,449 M

    Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
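
    A quick way to read this table is through the ratio n/r, the average BWT run length, which keeps growing as genomes are added. The sketch below only redoes arithmetic on the numbers from the slide (in millions); the function name `average_run_lengths` is hypothetical.

```python
# (number of genomes, n in millions, r in millions), copied from the slide.
ROWS = [
    (1, 6072, 3264),
    (2, 12144, 3282),
    (3, 18217, 3386),
    (4, 24408, 3423),
    (5, 30480, 3436),
    (6, 36671, 3449),
]


def average_run_lengths(rows):
    """Return n / r for each row: total length per BWT run."""
    return [n / r for _, n, r in rows]


for (genomes, _, _), ratio in zip(ROWS, average_run_lengths(ROWS)):
    print(genomes, f"{ratio:.2f}")
```

    Going from 1 to 6 genomes multiplies n by about six while r grows by under 6%, so an index whose size depends on r rather than n has room to scale.
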
  50-52. From FM Index to r-index

                         Count            Locate
                         Space   Time     Space   Time
    FM Index (2000)      O(n)    O(m)     O(n)    O(m + occ)
    RLFM Index (2005)    O(r)    O(m)     O(n)    O(m + occ)
    r-index (2018)       O(r)    O(m)     O(r)    O(m + occ)

    Where n is total reference length, m is query-string length, r is total # BWT runs (log factors omitted).
    FM: Ferragina P, and Manzini G. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000.
    RLFM: Mäkinen V, and Navarro G. Succinct suffix arrays based on run-length encoding. Annual Symposium on CPM. Springer, Berlin, Heidelberg. 2005. pp. 45–56.
    r-index: Gagie T, Navarro G, and Prezza P. Optimal-time text indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp. 1459–1477.
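
    The "Count" query in the table is backward search: the pattern is processed right to left, one step per character, narrowing a range of rows in the sorted rotation matrix. The sketch below shows the idea on a plain BWT string. It is not how any of the three indexes is implemented: rank is computed naively by scanning the BWT, so each step costs O(n) here, whereas the real structures answer rank queries quickly in O(n) or O(r) space. The function name `fm_count` is hypothetical.

```python
from collections import Counter


def fm_count(bwt: str, pattern: str) -> int:
    """Count occurrences of pattern in the text whose BWT is given."""
    # first_row[c] = number of characters in the text smaller than c,
    # i.e. the first matrix row that starts with c.
    counts = Counter(bwt)
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt)  # half-open range of rows matching the suffix so far
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        # Naive rank: occurrences of ch in bwt[:lo] and bwt[:hi].
        lo = first_row[ch] + bwt[:lo].count(ch)
        hi = first_row[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


# BWT of T = abaaba$ from the FM Index slide.
print(fm_count("abba$aa", "aba"))  # 2
print(fm_count("abba$aa", "bb"))   # 0
```
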
  53. r-index

    Index many human genomes with similar queries & speed as FM Index, in O(r) space; sublinear in # & length of genomes.
    Gonzalo Navarro, Nicola Prezza: Gagie T, Navarro G, and Prezza P. Optimal-time text indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp. 1459–1477.
    Christina Boucher, Travis Gagie, Alan Kuhnle, Giovanni Manzini: Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
  54. Panel alignment with r-index

    For larger collections, index is smaller than that of Bowtie and compressed competitors (chr19s from 1000 Genomes Project). Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
  55. Panel alignment with r-index

    For larger collections, query time is faster than Bowtie (chr19s from 1000 Genomes Project). Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
  56. Panel alignment with r-index

    Handles many human genome assemblies! (whole human genomes; forward + reverse complement). Plot: Indexing Peak Mem. (MB), 0 to 60000, against Total Length of Collection (MB), 0 to 200000; series: 1KG, LRA.
    Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
  57. Future of r-index

    • Alignment to collection (panel) of linear references
    • Fast genotyping with respect to panel
    • Fast online matching statistics
    Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: A Pangenomics Index for Finding MEMs. Accepted, RECOMB 2021.
    Boucher C, Gagie T, I T, Köppl D, Langmead B, Manzini G, Navarro G, Pacheco A, Rossi M. PHONI: Streamed Matching Statistics with Multi-Genome References. Accepted, DCC 2021.
  58. Conclusions for Practitioners

  59. Conclusions for Practitioners

    • If avoiding reference bias is the goal, that’s how we should evaluate
    • A , not a
    • With multiple references, fast & familiar linear aligners have comparable benefits to (current) graph aligners
    • New assemblies are coming, but we lack good ways to put them in common coordinates. Need methods that let genomes be linear.
  60. Conclusions for Methods

    • Pangenome graphs suffer from ambiguity that comes with adding many rare variants
    • Is this a failing of the method?
    • Can index & queries be made frequency-aware, representing rare variation while understanding it is rare?
    Figure labels: 11%, 89%; 1%, 1%, 4%, 94% 🤔
  61. Thank you!

    And thanks to the team: Jacob Pritt, Nae-Chyun Chen, Taher Mun, Brad Solomon, Christina Boucher, Travis Gagie, Alan Kuhnle, Sheila Iyer, Giovanni Manzini + MONI & PHONI teams.
    NSF: IIS-1349906, DBI-2029552. NIH: R01GM118568, R01HG011392.
  62. Photo: Elizabeth Colantuoni Thank you