Advances in pan-genomics for addressing reference bias

Ben Langmead
Associate Professor, JHU Computer Science
[email protected], langmead-lab.org, @BenLangmead
Stanford Biostatistics Seminar, February 11, 2021

Outline (today)
1. References & reference bias
2. Graphs for fighting reference bias
2a. Graphs can include too much (our work: FORGe)
3. Many linear references for fighting bias (our work: Reference flow)
4. Indexing reference panels (our work: FM index & r-index)

Human genome Image: Russ London, https://commons.wikimedia.org/wiki/File:Wellcome_genome_bookcase.png

Human genome
The Human Genome Project yielded a single reference genome (haplotype).
Image: Abizar Lakdawalla
https://en.wikipedia.org/wiki/Ploidy#Diploid

More variants 1000 Genomes Project Consortium, Auton, A., Brooks, L.
D., Durbin, R. M., Garrison, E. P., Kang, H. M., … Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. AFR EAS AMR EUR SAS

More genomes

More genomes
Let's finish the human genome: the Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome.
Karen Miga (@khmiga), Adam Phillippy (@aphillippy)
https://www.slideshare.net/GenomeInABottle/how-giab-fits-in-the-rest-of-the-world-telomere-to-telomere-consortium
https://github.com/nanopore-wgs-consortium/chm13

CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC Read: Reference genome: >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT Alignment x billions x million

Alignment
Read: CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
Reference genome: >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1 (same sequence as on the previous slide)

Candidate 1:
Read       CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC
           ||||||  ||||   ||||  |||||||||     |||| |||||
Reference  CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA

Candidate 2:
Read       CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
           |||||||||||| ||||||||||||||||||||| ||||| |
Reference  CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC
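To make the comparison between the two candidates concrete, here is a minimal Python sketch that tallies matches, mismatches and gap columns for each gapped alignment and turns them into a score. The penalty values are illustrative assumptions, not the scoring scheme of any particular aligner, and the function names are hypothetical.

```python
# Alignment rows copied from the slide; "-" marks a gap column.
CANDIDATE_1 = (
    "CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC",
    "CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA",
)
CANDIDATE_2 = (
    "CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC",
    "CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC",
)


def tally_alignment(read_row, ref_row):
    """Return (matches, mismatches, gap_columns) for two aligned rows."""
    if len(read_row) != len(ref_row):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(read_row, ref_row):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1  # an N in the read is counted as a mismatch here
    return matches, mismatches, gaps


def alignment_score(read_row, ref_row, mismatch=-6, gap=-5):
    """Score with assumed (hypothetical) per-column penalties; matches score 0."""
    _, mismatches, gaps = tally_alignment(read_row, ref_row)
    return mismatches * mismatch + gaps * gap


for name, rows in (("Candidate 1", CANDIDATE_1), ("Candidate 2", CANDIDATE_2)):
    print(name, tally_alignment(*rows), alignment_score(*rows))
```

Under any penalty scheme of this kind, Candidate 2 wins because it needs far fewer mismatch and gap columns to explain the read.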

Reference bias
The tendency to miss or misalign reads containing non-reference (ALT) alleles.
(Figure: reads carrying REF and ALT alleles, contrasting an unbiased pileup with a biased one.)
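One common way to put a number on this, sketched below in Python, is the fraction of reads supporting the REF allele at a heterozygous site: with no bias the expected value is 0.5, and values above 0.5 indicate bias toward the reference. This definition is an assumption for illustration and may differ from the exact measure used in the studies cited in this talk; the function names are hypothetical.

```python
def ref_fraction(ref_reads, alt_reads):
    """Fraction of reads at a heterozygous site that support the REF allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("site has no covering reads")
    return ref_reads / total


def pooled_ref_fraction(sites):
    """Pool (ref_reads, alt_reads) counts over many heterozygous sites."""
    ref_total = sum(ref for ref, _ in sites)
    alt_total = sum(alt for _, alt in sites)
    return ref_fraction(ref_total, alt_total)


# Toy counts: an unbiased site, then two sites that lost ALT-carrying reads.
print(ref_fraction(50, 50))
print(pooled_ref_fraction([(30, 20), (45, 15), (25, 15)]))
```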

Reference bias REF ALT Gene 1 (slight bias -> PAT)
Gene 2 (strong bias -> MAT) MAT PAT Confounder in allele-specific analyses

Reference bias
Confounder in allele-specific analyses
Degner, J. F., Marioni, J. C., Pai, A. A., Pickrell, J. K., Nkadori, E., Gilad, Y., & Pritchard, J. K. (2009). Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 25(24), 3207–3212.

Reference bias Pritt, J., Chen, N. C., Langmead, B. (2018).
FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220. (Poor coverage in MHC region) Confounder in hypervariable regions

Reference bias Wulfridge, P., Langmead, B., Feinberg, A. P., &
Hansen, K. D. (2019). Analyzing whole genome bisulfite sequencing data from highly divergent genotypes. Nucleic acids research, 47(19), e117. Confounder when comparing inbred strains

Why attack reference bias?
"By not including diversity we are missing out on great opportunities to make novel discoveries and to be more inclusive of world populations," [Esteban] Burchard said.
https://www.pbs.org/newshour/science/genetic-research-has-a-white-bias-and-it-may-be-hurting-everyones-health

Why attack reference bias?
To avoid a world where diagnostics & therapeutics are differentially effective by population.
(Figure labels: AFR, EAS, AMR, EUR, SAS)
1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., … Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74.

Pangenomics "Variation graphs... which encode the genetic variation within a
population as a graph, have been proposed as a solution to the reference bias [problem]." Quote: Sirén, Jouni. "Indexing variation graphs." In 2017 Proceedings of the nineteenth workshop on algorithm engineering and experiments (ALENEX), pp. 13-27. Society for Industrial and Applied Mathematics, 2017. Image: Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E.T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879.

Pangenomics
Timeline of tools, 2009 to 2019:
• GenomeMapper (2009): Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., & Weigel, D. (2009). Simultaneous alignment of short reads against multiple genomes. Genome biology, 10(9), R98.
• ERG (2012): Satya RV, Zavaljevski N, Reifman J. A new strategy to reduce allelic bias in RNA-Seq readmapping. Nucleic Acids Res. 2012 Sep;40(16):e127.
• BWBBLE (2013): Huang, L., Popic, V., & Batzoglou, S. (2013). Short read alignment with populations of genomes. Bioinformatics (Oxford, England), 29(13), i361–i370.
• GCSA (2014): Sirén, J, Välimäki, N, and Mäkinen, V. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375–388, 2014.
• VG/GCSA2 (2018): Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879.
• HISAT2 (2019): Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8), 907–915.
+ more!

Pangenomics: catalog vs. map
Catalog goal: an inclusive picture for answering relatedness questions
Map goal: answer "where do I match?" questions
Is more variation always better?

Is more better?
Adding variants to the reference can remove undesirable penalties in alignment score.
It also adds ambiguity to the reference, confusing the aligner.

FORGe: Find the Optimal Reference Genome
Population coverage: high allele frequency gets high priority
$C(G) = \sum_{\langle s,j \rangle \in G} p(\langle s,j \rangle)$
Uniqueness: variants adding more copies of existing k-mers get low priority
$U(G) = \sum_{\langle s,j \rangle \in G} \frac{1}{f_G(s)}$
Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.

FORGe
Hybrid: product of previous measures
$H(G) = \sum_{\langle s,j \rangle \in G} \frac{p(\langle s,j \rangle)}{f_G(s)}$
Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
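The three sums can be sketched in a few lines of Python. The sketch takes the slide formulas at face value: C sums p over k-mer occurrences, U sums 1/f_G(s), and H sums p/f_G(s). It assumes that p(⟨s, j⟩) is the population frequency of k-mer s at offset j and that f_G(s) is the number of times s occurs in G; the slides do not spell these out, and the data layout and function name here are hypothetical.

```python
from collections import Counter


def forge_scores(kmer_entries):
    """Compute (C, U, H) for a toy genome graph G.

    kmer_entries: list of (s, j, p) tuples, one per k-mer occurrence in G,
    where s is the k-mer string, j its offset, and p its assumed
    population frequency.
    """
    # f_G(s): how many times each k-mer string occurs in G (assumed meaning).
    f = Counter(s for s, _, _ in kmer_entries)
    coverage = sum(p for _, _, p in kmer_entries)            # C(G)
    uniqueness = sum(1.0 / f[s] for s, _, _ in kmer_entries)  # U(G)
    hybrid = sum(p / f[s] for s, _, p in kmer_entries)        # H(G)
    return coverage, uniqueness, hybrid


# Toy example: "ACG" occurs twice in G, so each copy is down-weighted.
toy = [("ACG", 0, 1.0), ("CGT", 1, 0.5), ("ACG", 7, 0.25)]
print(forge_scores(toy))
```

In this toy graph the rare second copy of "ACG" adds little to C and is halved again in H, which is the intuition behind giving repeat-creating variants low priority.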

FORGe pipeline
(Diagram: VCF + FASTA go into FORGe, which emits VCF + FASTA subsets with progressively more variants; each subset is used to build a HISAT2 index.)
HISAT2 indexes: 0%, 2%, 4%, 6%, 8%, 10%, 15%, 20%, 30%, ..., 100% of variants
Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8), 907–915.
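The step of keeping the top x% of ranked variants can be sketched as follows. This is a minimal Python illustration, not FORGe's actual interface: the variant identifiers, scores and function name are hypothetical, and the sketch assumes one precomputed priority score per variant.

```python
def top_fraction(scored_variants, percent):
    """Return the ids of the highest-scoring `percent` % of variants.

    scored_variants: list of (variant_id, score) pairs.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ranked = sorted(scored_variants, key=lambda pair: pair[1], reverse=True)
    keep = int(len(ranked) * percent / 100)  # round down to whole variants
    return [variant_id for variant_id, _ in ranked[:keep]]


# Ten made-up variants with made-up scores; v0 has the lowest score.
variants = [("v%d" % i, i / 10) for i in range(10)]
for pct in (0, 10, 30, 100):
    print(pct, top_fraction(variants, pct))
```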

FORGe Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe:
prioritizing variants for graph genomes. Genome biology, 19(1), 220.

FORGe
Bias avoidance saturates at ~10% of variants
(Figure, panel (b): Mappings (Billions) and Reference Bias plotted against % Variants; legend: HISAT2 Auto, SNVs + Indels, SNVs Only.)
Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.

FORGe
• By modeling variants, we can balance pros and cons
• Even a modest number of variants can alleviate bias, approaching the accuracy of an ideal, personalized genome
• Peak accuracy is at ~10% of variants, about a ≥5% allele frequency cutoff
• Similar result (in cow) from another group:
Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.

Is more variation always better? Not necessarily (when using a graph).


Reference flow
1. Align each read to GRCh38 (bt2).
2. Aligned uniquely? Yes: keep the alignment. No: re-align to the EUR, AFR, EAS, SAS and AMR references (bt2 for each).
3. Take best; lift back to GRC.
4. Final alignments.
Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.

Reference flow
Simulation using human chromosome 21; 100 individuals, 2 million reads per individual.
Reference flow achieves nearly the same % correct alignments as a personalized reference; improvement over linear, major-allele & vg.
(Figure x-axis: super population of simulated individual.)

Reference flow
Reference flow avoids nearly as much bias as vg.
(Figure x-axis: super population of simulated individual; y-axis runs from "no bias" to "more reference bias".)

Reference flow
• Align to multiple linear reference genomes, selected to cover the genotype space
• Similar accuracy/bias as vg, at a fraction of the time (18%) and memory footprint (14%)
• Simple wrapper around existing aligner
• But misses many rare alleles when used with a small number of references
Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.
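The two-pass idea can be sketched as a thin wrapper around any aligner. The Python below is a minimal illustration under stated assumptions, not the published implementation: the aligners and the lift-over are passed in as plain callables (hypothetical stand-ins for real tools), "aligned uniquely" is approximated by a mapping-quality threshold, and the first-pass alignment is kept as a candidate when picking the best.

```python
from typing import Callable, Dict, NamedTuple


class Alignment(NamedTuple):
    ref: str     # name of the reference the read was aligned to
    pos: int     # position in that reference's coordinates
    score: int   # alignment score (higher is better)
    mapq: int    # mapping quality


def reference_flow(
    read: str,
    first_pass: Callable[[str], Alignment],
    second_pass: Dict[str, Callable[[str], Alignment]],
    lift_over: Callable[[str, int], int],
    mapq_threshold: int = 10,  # assumed cutoff for "aligned uniquely"
) -> Alignment:
    initial = first_pass(read)
    if initial.mapq >= mapq_threshold:
        return initial  # confident on the first pass; no second pass needed
    # Second pass: try every population reference and take the best score.
    candidates = [initial] + [align(read) for align in second_pass.values()]
    best = max(candidates, key=lambda aln: aln.score)
    if best.ref != initial.ref:
        # Translate back into the first-pass reference's coordinate system.
        best = best._replace(ref=initial.ref, pos=lift_over(best.ref, best.pos))
    return best


# Stub aligners and lift-over standing in for real tools (made-up numbers).
offsets = {"AFR": 3, "EUR": 1}
result = reference_flow(
    "ACGT",
    first_pass=lambda r: Alignment("GRCh38", 100, -30, 1),
    second_pass={
        "AFR": lambda r: Alignment("AFR", 103, -6, 40),
        "EUR": lambda r: Alignment("EUR", 101, -12, 20),
    },
    lift_over=lambda ref, pos: pos - offsets[ref],
)
print(result)
```

Because only ambiguous reads go through the second pass, most reads cost a single alignment, which is consistent with the time and memory savings reported above.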


FM Index
T = abaaba$
All rotations of T, sorted, form the Burrows-Wheeler Matrix:
$abaaba
a$abaab
aaba$ab
aba$aba
abaaba$
ba$abaa
baaba$a
BWT(T) is the last column: abba$aa
The FM index behind Bowtie & BWA consists of the Burrows-Wheeler Transform (BWT), plus auxiliary structures.
The BWT reorders the letters according to the alphabetical order of their right contexts in T (e.g. a genome).
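The construction on this slide (all rotations, sort, take the last column) translates directly into Python. This is a teaching sketch only: it builds every rotation explicitly, which is far too slow and memory-hungry for a genome, and real indexes use suffix-array-based construction instead. The function name is hypothetical.

```python
def bwt_via_rotations(text):
    """Return (BWT string, sorted rotation matrix) for a '$'-terminated text."""
    if not text.endswith("$"):
        raise ValueError("text must end with the sentinel '$'")
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    matrix = sorted(rotations)  # '$' sorts before letters in ASCII
    last_column = "".join(row[-1] for row in matrix)
    return last_column, matrix


bwt, matrix = bwt_via_rotations("abaaba$")
for row in matrix:
    print(row)
print("BWT:", bwt)
```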

FM Index
BWT gathers “like” characters (sharing right context) into runs.
E.g. for a text where "rectangle" appears many times, "ectangle" tends to be preceded by "r". These r's come together in a BWT run.
T       rectangular_rectangle_divided_into_rectangles$
BWT(T)  sedrotttleeeei_lrrrdlnnnv_duggaaaita__$ecccngi

FM Index
BWT gathers “like” characters (sharing right context) into runs, e.g. rrrrrr. When T is more repetitive, BWT runs are longer & fewer.

String (avg. run length)
T       Tomorrow_and_tomorrow_and_tomorrow$                    (1.09)
BWT(T)  w$wwdd__nnoooaattTmmmrrrrrrooo__ooo                    (2.33)
T       It_was_the_best_of_times_it_was_the_worst_of_times$    (1.00)
BWT(T)  s$esttssfftteww_hhmmbootttt_ii__woeeaaressIi_______    (1.76)
T       in_the_jingle_jangle_morning_Ill_come_following_you$   (1.04)
BWT(T)  u_gleeeengj_mlhl_nnnnt$nwj__lggIolo_iiiiarfcmylo_oo_   (1.30)
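The average run lengths quoted on this slide are simply the string length divided by the number of maximal runs. A short Python sketch, using the first text and its BWT exactly as printed on the slide (helper names are hypothetical):

```python
from itertools import groupby


def run_length_encode(s):
    """Collapse a string into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def average_run_length(s):
    """String length n divided by number of runs r."""
    return len(s) / len(run_length_encode(s))


text = "Tomorrow_and_tomorrow_and_tomorrow$"
bwt_of_text = "w$wwdd__nnoooaattTmmmrrrrrrooo__ooo"  # as printed on the slide

print(round(average_run_length(text), 2))
print(round(average_run_length(bwt_of_text), 2))
print(run_length_encode(bwt_of_text))
```

Storing the run-length-encoded pairs instead of the raw string is what lets space scale with the number of runs rather than the text length.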

FM Index
# genomes   n (total length)   r (total # BWT runs)
1           6,072 M            3,264 M
2           12,144 M           3,282 M
3           18,217 M           3,386 M
4           24,408 M           3,423 M
5           30,480 M           3,436 M
6           36,671 M           3,449 M
As we index more diploid genomes, n (total length) grows linearly while r (total # BWT runs) grows sublinearly.
From 1000 Genomes project phase-3 callset.
Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
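The gap between the two growth rates is easiest to see as the ratio n/r, which is the average BWT run length. The Python below just does that arithmetic on the six rows of the table (values in millions, copied from the slide); the variable names are hypothetical.

```python
# (number of genomes, n in millions, r in millions), copied from the slide.
rows = [
    (1, 6072, 3264),
    (2, 12144, 3282),
    (3, 18217, 3386),
    (4, 24408, 3423),
    (5, 30480, 3436),
    (6, 36671, 3449),
]

ratios = [round(n / r, 1) for _, n, r in rows]
for (genomes, n, r), ratio in zip(rows, ratios):
    print(f"{genomes} genome(s): n/r = {ratio}")

# Growth from one genome to six: n grows about sixfold, r by a few percent.
n_growth = rows[-1][1] / rows[0][1]
r_growth = rows[-1][2] / rows[0][2]
print(round(n_growth, 2), round(r_growth, 2))
```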

From FM Index to r-index
                    Count               Locate
                    Space    Time       Space    Time
FM Index (2000)     O(n)     O(m)       O(n)     O(m + occ)
RLFM Index (2005)   O(r)     O(m)       O(n)     O(m + occ)
r-index (2018)      O(r)     O(m)       O(r)     O(m + occ)
(log factors omitted)
Where n is total reference length, m is query-string length, r is total # BWT runs.
FM: Ferragina P, and Manzini G. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000.
RLFM: Mäkinen V, and Navarro G. Succinct suffix arrays based on run-length encoding. Annual Symposium on CPM. Springer, Berlin, Heidelberg. 2005. pp. 45–56.
r-index: Gagie T, Navarro G, and Prezza N. Optimal-time text indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp. 1459–1477.
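The Count query in the table is answered by backward search over the BWT: one step per pattern character, hence O(m) steps. The Python below is a minimal sketch using the BWT of abaaba$ from the earlier slide. It is deliberately naive: the rank function rescans the string, whereas a real FM index or r-index answers rank from compact auxiliary structures. Function and variable names are hypothetical.

```python
from collections import Counter


def fm_count(bwt_str, pattern):
    """Count occurrences of `pattern` in the text whose BWT is `bwt_str`."""
    # first_row[c]: number of characters in the text strictly smaller than c,
    # i.e. the first row of the sorted matrix that starts with c.
    counts = Counter(bwt_str)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def rank(c, i):
        """Occurrences of c in bwt_str[:i] (naive linear scan)."""
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # half-open range of matrix rows
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


BWT_ABAABA = "abba$aa"  # BWT of T = abaaba$
for query in ("aba", "baa", "a", "bb"):
    print(query, fm_count(BWT_ABAABA, query))
```

Locate additionally needs suffix-array samples to turn the final row range into text positions, which is where the occ term in the table comes from.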

r-index
Index many human genomes with similar queries & speed as the FM Index, in O(r) space; sublinear in # & length of genomes.
People: Gonzalo Navarro, Nicola Prezza, Christina Boucher, Travis Gagie, Alan Kuhnle, Giovanni Manzini
Gagie T, Navarro G, and Prezza N. Optimal-time text indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp. 1459–1477.
Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.

Panel alignment with r-index
For larger collections, the index is smaller than that of Bowtie and compressed competitors. (chr19s from 1000 Genomes Project)

Panel alignment with r-index
For larger collections, query time is faster than Bowtie. (chr19s from 1000 Genomes Project)

Panel alignment with r-index
Handles many human genome assemblies! (whole human genomes)
(Figure: Indexing Peak Mem. (MB) vs. Total Length of Collection (MB); labels: 1KG, LRA, forward + reverse complement.)
Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.

Future of r-index
• Alignment to collection (panel) of linear references
• Fast genotyping with respect to panel
• Fast online matching statistics
Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: A Pangenomics Index for Finding MEMs. Accepted, RECOMB 2021.
Boucher C, Gagie T, I T, Köppl D, Langmead B, Manzini G, Navarro G, Pacheco A, Rossi M. PHONI: Streamed Matching Statistics with Multi-Genome References. Accepted, DCC 2021.
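For readers unfamiliar with matching statistics, the Python below gives a brute-force definition by example: for each position i in a pattern, the length of the longest prefix of pattern[i:] that occurs somewhere in the text. This is only meant to pin down what is being computed. It is quadratic or worse and uses plain substring search, not the r-index-based algorithms of MONI or PHONI. The function name is hypothetical.

```python
def matching_statistics(pattern, text):
    """Brute-force matching-statistics lengths of `pattern` against `text`."""
    lengths = []
    for i in range(len(pattern)):
        length = 0
        # Grow the match while the extended prefix still occurs in the text.
        while (i + length < len(pattern)
               and pattern[i:i + length + 1] in text):
            length += 1
        lengths.append(length)
    return lengths


# Pattern "abba" against the running example text "abaaba".
print(matching_statistics("abba", "abaaba"))
```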

Conclusions for Practitioners
• If avoiding reference bias is the goal, that’s how we should evaluate
• A [image], not a [image]
• With multiple references, fast & familiar linear aligners have comparable benefits to (current) graph aligners
• New assemblies are coming, but we lack good ways to put them in common coordinates. Need methods that let genomes be linear

Conclusions for Methods
• Pangenome graphs suffer from ambiguity that comes with adding many rare variants
• Is this a failing of the method?
• Can index & queries be made frequency-aware, representing rare variation while understanding it is rare?
(Figure labels: 11%, 89%, 1%, 1%, 4%, 94%)

Thank you! And thanks to the team:
Jacob Pritt, Nae-Chyun Chen, Taher Mun, Brad Solomon, Christina Boucher, Travis Gagie, Alan Kuhnle, Sheila Iyer, Giovanni Manzini + MONI & PHONI teams
NSF: IIS-1349906, DBI-2029552
NIH: R01GM118568, R01HG011392

Photo: Elizabeth Colantuoni Thank you
