Advances in pan-genomics for addressing reference bias

Ben Langmead
February 11, 2021

Transcript

  1. Ben Langmead
    Associate Professor, JHU Computer Science
    langmead-lab.org, @BenLangmead
    Stanford Biostatistics Seminar
    February 11, 2021
    Advances in pan-genomics for
    addressing reference bias

  2. Today
    Outline                                        Our work
    1. References & reference bias
    2. Graphs for fighting reference bias
    2a. Graphs can include too much                FORGe
    3. Many linear references for fighting bias    Reference flow
    4. Indexing reference panels                   FM index & r-index

  3. Human genome
    Image: Russ London, https://commons.wikimedia.org/wiki/File:Wellcome_genome_bookcase.png

  4. Human genome
    Image: Abizar Lakdawalla
    Human Genome Project yielded a
    single reference genome (haplotype)
    https://en.wikipedia.org/wiki/Ploidy#Diploid

  5. More variants
    1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M.,
    Garrison, E. P., Kang, H. M., … Abecasis, G. R. (2015). A global reference for
    human genetic variation. Nature, 526(7571), 68–74.
    Figure legend: AFR, EAS, AMR, EUR, SAS

  6. More genomes

  7. More genomes
    Karen Miga (@khmiga), Adam Phillippy (@aphillippy)
    Let’s finish the human genome
    The Telomere-to-Telomere (T2T) consortium is an
    open, community-based effort to generate the
    first complete assembly of a human genome.
    https://www.slideshare.net/GenomeInABottle/how-giab-fits-in-the-rest-of-the-world-telomere-to-telomere-consortium
    https://github.com/nanopore-wgs-consortium/chm13

  8. Read:
    CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
    Reference genome:
    >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1
    GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
    CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC
    GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
    ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA
    ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA
    AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA
    ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC
    TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT
    CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA
    CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA
    GCAATACACTGACCCGCTCAAACTCCTGGATTTTGGATCCACCCAGCGCCTTGGCCTAAA
    CTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGT
    TCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTC
    AAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAA
    ACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGC
    GGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCC
    TCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGAC
    TACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGA
    TACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAA
    CACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGG
    AGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATA
    CCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAG
    ACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAG
    AAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAG
    AGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTC
    AAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGT
    CGTAACCTCAAACTCCTGCCTTTGGTGATCCACCCGCCTTGGCCTACCTGCATAATGAAG
    AAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGAGCTAAACCTA
    GCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTACCCAAATAA
    AGTATAGGCGATAGAAATTGAAACCTGGCGCAATAGATATAGTACCGCAAGGGAAAGATG
    AAAAATTATAACCAAGCATAATATAGCAAGGACTAACCCCTATACCTTCTGCATAATGAA
    TTAACTAGAAATAACTTTGCAAGGAGAGCCAAAGCTAAGACCCCCGAAACCAGACGAGCT
    Alignment
    x billions
    x million

  9. Read:
    CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
    Reference genome:
    >MT dna:chromosome chromosome:GRCh37:MT:1:16569:1
    (same sequence excerpt as on slide 8)
    Alignment
    Candidate 1:
    Read       CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC
               ||||||  ||||   ||||  |||||||||     |||| |||||
    Reference  CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA
    Candidate 2:
    Read       CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC
               |||||||||||| ||||||||||||||||||||| ||||| |
    Reference  CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC
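
    One way to compare the two candidates is to count matching, mismatching, and gap columns. The sketch below does that for the alignments shown on this slide. The penalty values in `score` are illustrative assumptions, not the scoring scheme of any particular aligner.

```python
def summarize_alignment(read_row: str, ref_row: str) -> dict:
    """Count match, mismatch, and gap columns of a gapped pairwise alignment."""
    if len(read_row) != len(ref_row):
        raise ValueError("alignment rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(read_row, ref_row):
        if a == "-" or b == "-":
            counts["gap"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1  # the ambiguous base N is counted here
    return counts


def score(counts: dict, mismatch: int = -6, gap: int = -5) -> int:
    """Hypothetical linear penalties, chosen only for illustration."""
    return mismatch * counts["mismatch"] + gap * counts["gap"]


candidate_1 = summarize_alignment(
    "CTCAAAGACCTGACCTTTGGTGATCCACCC-----GCCTNGGCCTTC",
    "CTCAAACTCCTGGATTTTG--GATCCACCCAGCTGGCCTTGGCCTAA",
)
candidate_2 = summarize_alignment(
    "CTCAAACTCCTGACCTTTGGTGATCCACCCGCCTNGGCCTTC",
    "CTCAAACTCCTG-CCTTTGGTGATCCACCCGCCTTGGCCTAC",
)
print(candidate_1, score(candidate_1))
print(candidate_2, score(candidate_2))
```

    Under these assumed penalties, Candidate 2 scores higher because it needs far fewer mismatches and gaps.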

  10. Reference bias
    Tendency to miss or misalign reads containing
    non-reference alleles
    REF ALT
    No bias
    Biased against
    Ref:
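
    A simple way to quantify this, assuming a heterozygous site where REF and ALT reads are expected in equal numbers, is the fraction of overlapping reads that carry the REF allele. The function name and the example counts below are hypothetical.

```python
def ref_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at a site that carry the REF allele.

    At a heterozygous site, 0.5 suggests no bias; values above 0.5
    suggest that reads carrying the ALT allele are being lost.
    """
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads overlap this site")
    return ref_reads / total


unbiased = ref_allele_fraction(50, 50)   # equal support for REF and ALT
biased = ref_allele_fraction(60, 40)     # ALT-carrying reads under-represented
print(unbiased, biased)
```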

  11. Reference bias
    REF ALT
    Gene 1
    (slight bias -> PAT)
    Gene 2
    (strong bias -> MAT)
    MAT
    PAT
    Confounder in allele-specific analyses

  12. Reference bias
    Degner, J. F., Marioni, J. C., Pai, A. A., Pickrell, J. K., Nkadori, E., Gilad, Y., &
    Pritchard, J. K. (2009). Effect of read-mapping biases on detecting allele-specific
    expression from RNA-sequencing data. Bioinformatics, 25(24), 3207–3212
    Confounder in allele-specific analyses

  13. Reference bias
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants
    for graph genomes. Genome biology, 19(1), 220.
    (Poor coverage in MHC region)
    Confounder in hypervariable regions

  14. Reference bias
    Wulfridge, P., Langmead, B., Feinberg, A. P., & Hansen, K. D. (2019).
    Analyzing whole genome bisulfite sequencing data from highly divergent
    genotypes. Nucleic acids research, 47(19), e117.
    Confounder when comparing inbred strains

  15. Why attack reference bias?
    https://www.pbs.org/newshour/science/genetic-research-has-a-white-bias-and-it-may-be-hurting-everyones-health
    “By not including diversity we are missing out on great opportunities to make novel discoveries and to be more inclusive of world populations,” [Esteban] Burchard said.

  16. Why attack reference bias?
    1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M.,
    … Abecasis, G. R. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74.
    To avoid a world where diagnostics & therapeutics
    are differentially effective by population
    Figure legend: AFR, EAS, AMR, EUR, SAS

  17. Pangenomics
    "Variation graphs... which encode the genetic variation within a
    population as a graph, have been proposed as a solution to the
    reference bias [problem]."
    Quote: Sirén, Jouni. "Indexing variation graphs."
    In 2017 Proceedings of the nineteenth workshop
    on algorithm engineering and experiments
    (ALENEX), pp. 13-27. Society for Industrial and
    Applied Mathematics, 2017.
    Image: Garrison, E., Sirén, J., Novak, A. M., Hickey,
    G., Eizenga, J. M., Dawson, E.T., … Durbin, R.
    (2018). Variation graph toolkit improves read
    mapping by representing genetic variation in the
    reference. Nature biotechnology, 36(9), 875–879.

  18. Pangenomics
    Timeline of tools, 2009–2019 (placed by citation year):
    2009  GenomeMapper: Schneeberger, K., Hagmann, J., Ossowski, S., Warthmann, N., Gesing, S., Kohlbacher, O., & Weigel, D. (2009). Simultaneous alignment of short reads against multiple genomes. Genome biology, 10(9), R98.
    2012  ERG: Satya RV, Zavaljevski N, Reifman J. A new strategy to reduce allelic bias in RNA-Seq readmapping. Nucleic Acids Res. 2012 Sep;40(16):e127.
    2013  BWBBLE: Huang, L., Popic, V., & Batzoglou, S. (2013). Short read alignment with populations of genomes. Bioinformatics (Oxford, England), 29(13), i361–i370.
    2014  GCSA: Sirén, J, Välimäki, N, and Mäkinen, V. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375–388, 2014.
    2018  VG/GCSA2: Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9), 875–879.
    2019  HISAT2: Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8), 907–915.
    + more!

  19. Pangenomics
    Catalog vs Map
    Catalog goal: an inclusive picture for answering relatedness questions
    Map goal: answer "where do I match?" questions
    Is more variation always better?

  20. Is more better?
    Adding variants to the reference can remove
    undesirable penalties in alignment score
    It also adds ambiguity to the reference,
    confusing the aligner

  21. FORGe
    Population coverage: high allele frequency gets high priority
    $C(G) = \sum_{\langle s,j \rangle \in G} p(\langle s,j \rangle)$
    Uniqueness: variants adding more copies of existing k-mers get low priority
    $U(G) = \sum_{\langle s,j \rangle \in G} \frac{1}{f_G(s)}$
    Find the Optimal Reference Genome
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants
    for graph genomes. Genome biology, 19(1), 220.

  22. FORGe
    Hybrid: product of previous measures
    $H(G) = \sum_{\langle s,j \rangle \in G} \frac{p(\langle s,j \rangle)}{f_G(s)}$
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants
    for graph genomes. Genome biology, 19(1), 220.
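
    The three measures can be sketched on a toy input. This is a minimal illustration, assuming each pair ⟨s, j⟩ is a k-mer s at offset j with probability p derived from allele frequencies, and that f_G(s) is the number of times s occurs in G. The triples below are invented, and the code is not the FORGe implementation.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical (k-mer s, offset j, probability p) triples standing in for G.
Kmer = Tuple[str, int, float]


def forge_measures(kmers: List[Kmer]) -> Tuple[float, float, float]:
    """Return (C, U, H) for a toy collection of <s, j> pairs."""
    f = Counter(s for s, _, _ in kmers)               # f_G(s): copies of s in G
    coverage = sum(p for _, _, p in kmers)            # C(G)
    uniqueness = sum(1 / f[s] for s, _, _ in kmers)   # U(G)
    hybrid = sum(p / f[s] for s, _, p in kmers)       # H(G)
    return coverage, uniqueness, hybrid


toy = [("ACG", 0, 1.0), ("CGT", 1, 0.3), ("ACG", 7, 0.5)]
C, U, H = forge_measures(toy)
print(C, U, H)
```

    The repeated k-mer "ACG" contributes only half weight per copy to U and H, which is how extra copies of an existing k-mer lower a variant's priority.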

    [Diagram: VCF -> FORGe -> FASTA + VCF inputs for HISAT2 indexes containing 0%, 2%, 4%, 6%, 8%, 10%, 15%, 20%, 30%, ..., 100% of variants (more variants toward 100%)]
    Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology, 37(8), 907–915.

  24. FORGe
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants
    for graph genomes. Genome biology, 19(1), 220.

  31. FORGe
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants
    for graph genomes. Genome biology, 19(1), 220.
    [Figure (b): axes Mappings (Billions), ticks 1.431 to 1.433, and Reference Bias, ticks 0.3 to 0.7; % Variants scale 0 to 70; series: HISAT2, Auto, SNVs + Indels, SNVs Only; points at 0%, 30%, and 70% annotated]
    Bias avoidance saturates at ~10% of variants

  34. FORGe
    • By modeling variants, we can balance pros and cons
    • Even a modest number of variants can alleviate bias, approaching the accuracy of an ideal, personalized genome
    • Peak accuracy is at ~10% of variants, about a ≥5% allele frequency cutoff
    Pritt, J., Chen, N. C., Langmead, B. (2018). FORGe: prioritizing variants for graph genomes. Genome biology, 19(1), 220.
    • Similar result (in cow) from another group:
    Is more variation always better? Not necessarily (when using a graph)
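
    The allele frequency cutoff mentioned above can be sketched as a plain filter. This shows only the simple ≥5% threshold, not FORGe's ranking model. The `Variant` record and the example variants are hypothetical.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_freq: float  # population allele frequency of ALT


def apply_af_cutoff(variants: List[Variant], min_af: float = 0.05) -> List[Variant]:
    """Keep only variants whose allele frequency is at least min_af."""
    return [v for v in variants if v.allele_freq >= min_af]


variants = [
    Variant("chr21", 1000, "A", "G", 0.42),
    Variant("chr21", 2000, "C", "T", 0.05),
    Variant("chr21", 3000, "G", "GA", 0.004),  # rare variant, dropped
]
kept = apply_af_cutoff(variants)
print([v.pos for v in kept])
```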

  35. Today
    Outline                                        Our work
    1. References & reference bias
    2. Graphs for fighting reference bias
    2a. Graphs can include too much                FORGe
    3. Many linear references for fighting bias    Reference flow
    4. Indexing reference panels                   FM index & r-index

  36. Reference flow
    Read -> align to GRCh38 -> Aligned uniquely?
      Yes -> Final alignments
      No  -> align to EUR, AFR, EAS, SAS, AMR -> Take best; lift back to GRC -> Final alignments
    Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.

  37. Reference flow
    Read -> align to GRCh38 (bt2) -> Aligned uniquely?
      Yes -> Final alignments
      No  -> align to EUR, AFR, EAS, SAS, AMR (bt2 for each) -> Take best; lift back to GRC -> Final alignments
    Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.

  38. Reference flow
    Super population of
    simulated individual
    Simulation using human chromosome 21; 100 individuals, 2 million reads per individual
    Reference flow achieves nearly the
    same % correct alignments as
    personalized reference; improvement
    over linear, major-allele & vg

  39. Reference flow
    Super population of
    simulated individual
    Reference flow avoids
    nearly as much bias as vg
    no bias
    More
    reference
    bias

  40. Reference flow
    • Align to multiple linear reference genomes, selected to cover the genotype space
    • Similar accuracy/bias as vg, at a fraction of the time (18%) and memory footprint (14%)
    • Simple wrapper around an existing aligner
    • But misses many rare alleles when used with a small number of references
    Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference
    bias using multiple population genomes. Genome Biol. 2021 Jan 4;22(1):8.
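
    The two-pass idea can be sketched as a small wrapper. This is an illustration under stated assumptions, not the published tool: the aligner callables stand in for an existing aligner such as Bowtie 2, `lift_over` is a hypothetical coordinate-conversion function, and letting the first-pass alignment compete in the "take best" step is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Alignment:
    reference: str   # name of the reference the read was aligned to
    position: int
    score: int       # higher is better
    unique: bool     # True if the aligner reports a unique placement


Aligner = Callable[[str], Optional[Alignment]]


def reference_flow(read: str,
                   first_pass: Aligner,
                   second_pass: Dict[str, Aligner],
                   lift_over: Callable[[Alignment], Alignment]) -> Optional[Alignment]:
    """Align to the primary reference; fall back to population references."""
    first = first_pass(read)
    if first is not None and first.unique:
        return first                      # "Aligned uniquely? Yes"
    candidates = [] if first is None else [first]
    for align in second_pass.values():    # "No": try each population genome
        hit = align(read)
        if hit is not None:
            candidates.append(hit)
    if not candidates:
        return None
    best = max(candidates, key=lambda a: a.score)   # "Take best"
    return best if best is first else lift_over(best)  # "lift back to GRC"


# Toy stand-ins for real aligners and a real liftover step.
def grc(read): return Alignment("GRCh38", 100, -12, False)
def afr(read): return Alignment("AFR", 103, -2, True)
def eur(read): return Alignment("EUR", 100, -8, True)
def lift_to_grc(a): return Alignment("GRCh38", a.position - 3, a.score, a.unique)

result = reference_flow("ACGT", grc, {"AFR": afr, "EUR": eur}, lift_to_grc)
print(result)
```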

  41. Today
    Outline                                        Our work
    1. References & reference bias
    2. Graphs for fighting reference bias
    2a. Graphs can include too much                FORGe
    3. Many linear references for fighting bias    Reference flow
    4. Indexing reference panels                   FM index & r-index

  42. FM Index
    T (e.g. genome): a b a a b a $
    All rotations of T, sorted (Burrows-Wheeler Matrix):
    $ a b a a b a
    a $ a b a a b
    a a b a $ a b
    a b a $ a b a
    a b a a b a $
    b a $ a b a a
    b a a b a $ a
    BWT(T) (last column): a b b a $ a a
    FM index behind Bowtie & BWA consists of Burrows-Wheeler Transform (BWT), plus auxiliary structures
    BWT reorders the letters according to alphabetical order of their right contexts in T
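
    The construction on this slide can be reproduced directly. The sketch below builds the matrix by sorting all rotations, which is fine for a toy string but far too slow and memory-hungry for a genome. The function names are illustrative.

```python
def rotations(t: str) -> list:
    """All cyclic rotations of t."""
    return [t[i:] + t[:i] for i in range(len(t))]


def bw_matrix(t: str) -> list:
    """Burrows-Wheeler Matrix: the rotations of t in sorted order."""
    return sorted(rotations(t))


def bwt(t: str) -> str:
    """BWT(T): the last column of the Burrows-Wheeler Matrix."""
    return "".join(row[-1] for row in bw_matrix(t))


for row in bw_matrix("abaaba$"):
    print(row)
print(bwt("abaaba$"))
```

    The terminator `$` must sort before every other character, which holds here for Python's default string ordering.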

  44. FM Index
    BWT gathers “like” characters (sharing right context)
    into runs
    E.g. for a text where rectangle appears many times,
    the ectangle tends to be preceded by r
    These rs come together in a BWT run
    T rectangular_rectangle_divided_into_rectangles$
    BWT(T) sedrotttleeeei_lrrrdlnnnv_duggaaaita__$ecccngi

  45. FM Index
    String                                                          Avg. run length
    T       Tomorrow_and_tomorrow_and_tomorrow$                     1.09
    BWT(T)  w$wwdd__nnoooaattTmmmrrrrrrooo__ooo                     2.33
    T       It_was_the_best_of_times_it_was_the_worst_of_times$     1.00
    BWT(T)  s$esttssfftteww_hhmmbootttt_ii__woeeaaressIi_______     1.76
    T       in_the_jingle_jangle_morning_Ill_come_following_you$    1.04
    BWT(T)  u_gleeeengj_mlhl_nnnnt$nwj__lggIolo_iiiiarfcmylo_oo_    1.30
    BWT gathers “like” characters (sharing right context) into runs: rrrrrr
    When T is more repetitive, BWT runs are longer & fewer
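
    The average run length is the string length divided by the number of maximal runs of equal characters. A minimal sketch, using the first pair of strings from this slide (the function names are illustrative):

```python
from itertools import groupby


def count_runs(s: str) -> int:
    """Number of maximal runs of identical consecutive characters."""
    return sum(1 for _ in groupby(s))


def avg_run_length(s: str) -> float:
    """String length divided by number of runs."""
    return len(s) / count_runs(s)


text = "Tomorrow_and_tomorrow_and_tomorrow$"
transformed = "w$wwdd__nnoooaattTmmmrrrrrrooo__ooo"
print(round(avg_run_length(text), 2), round(avg_run_length(transformed), 2))
```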

  49. FM Index
    # genomes    n            r
    1            6,072 M      3,264 M
    2            12,144 M     3,282 M
    3            18,217 M     3,386 M
    4            24,408 M     3,423 M
    5            30,480 M     3,436 M
    6            36,671 M     3,449 M
    As we index more diploid genomes, n (total length) grows linearly while r (total # BWT runs) grows sublinearly
    From 1000 Genomes project phase-3 callset
    Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a
    Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
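
    The gap between n and r can be made concrete by computing n/r for each row of the table. The numbers below are copied from this slide (in millions); the variable names are illustrative.

```python
# (# genomes, n in millions, r in millions), copied from the table on this slide
panel = [
    (1, 6072, 3264),
    (2, 12144, 3282),
    (3, 18217, 3386),
    (4, 24408, 3423),
    (5, 30480, 3436),
    (6, 36671, 3449),
]

ratios = [n / r for _, n, r in panel]
for (genomes, n, r), ratio in zip(panel, ratios):
    print(f"{genomes} genomes: n/r = {ratio:.2f}")

# Growth from 1 genome to 6 genomes
n_growth = panel[-1][1] / panel[0][1]
r_growth = panel[-1][2] / panel[0][2]
print(f"n grew {n_growth:.2f}x, r grew {r_growth:.2f}x")
```

    Going from one to six genomes, n grows about sixfold while r grows by about 6%, so the average BWT run length n/r keeps increasing.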

  52. From FM Index to r-index
                         Count              Locate
                         Space    Time      Space    Time
    FM Index (2000)      O(n)     O(m)      O(n)     O(m + occ)
    RLFM Index (2005)    O(r)     O(m)      O(n)     O(m + occ)
    r-index (2018)       O(r)     O(m)      O(r)     O(m + occ)
    (log factors omitted)
    Where n is total reference length, m is query-string length, r is total # BWT runs, and occ is the number of occurrences of the query
    r-index: Gagie T, Navarro G, and Prezza P. Optimal-time text indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp. 1459–1477.
    RLFM: Mäkinen V, and Navarro G. Succinct suffix arrays based on run-length encoding. Annual Symposium on CPM. Springer, Berlin, Heidelberg. 2005. pp. 45–56.
    FM: Ferragina P, and Manzini G. Opportunistic data structures with applications. Proceedings of 41st FOCS. IEEE, 2000.
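
    The Count query can be sketched with backward search over the BWT alone. This is a teaching sketch, not any of the indexes in the table: it finds ranks by scanning the BWT, so it does not achieve the O(m) time that real indexes get from auxiliary structures. The example uses BWT(abaaba$) = abba$aa from the FM Index slide.

```python
from collections import Counter


def fm_count(bwt: str, pattern: str) -> int:
    """Count occurrences of pattern in T using only BWT(T)."""
    # first[c] = number of characters in T that sort before c
    first = {}
    total = 0
    for c, k in sorted(Counter(bwt).items()):
        first[c] = total
        total += k

    lo, hi = 0, len(bwt)          # half-open range of matrix rows
    for c in reversed(pattern):   # extend the match one character to the left
        if c not in first:
            return 0
        # Naive rank: count c in bwt[:lo] and bwt[:hi] by scanning.
        lo = first[c] + bwt.count(c, 0, lo)
        hi = first[c] + bwt.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


print(fm_count("abba$aa", "aba"))
```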

  53. r-index
    Index many human genomes with similar queries & speed as FM Index, in O(r) space; sublinear in # & length of genomes
    Gonzalo Navarro, Nicola Prezza
    Gagie T, Navarro G, and Prezza P. Optimal-time text indexing in BWT-runs bounded space. Proceedings of 29th SODA, ACM-SIAM. 2018. pp. 1459–1477.
    Christina Boucher, Travis Gagie, Alan Kuhnle, Giovanni Manzini
    Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.

  54. Panel alignment with r-index
    For larger collections, index is
    smaller than that of Bowtie and
    compressed competitors
    Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a
    Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
    (chr19s from 1000
    Genomes Project)

  55. Panel alignment with r-index
    For larger collections, query
    time is faster than Bowtie
    Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a
    Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.
    (chr19s from 1000
    Genomes Project)

  56. Panel alignment with r-index
    [Plot: Indexing Peak Mem. (MB), 0 to 200000, vs. Total Length of Collection (MB), 0 to 60000; series: 1KG, LRA; forward + reverse complement; whole human genomes]
    Handles many human genome assemblies!
    Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, Manzini G. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment. J Comput Biol. 2020 Apr;27(4):500-513.

  57. Future of r-index
    • Alignment to collection (panel) of linear references
    • Fast genotyping with respect to panel
    • Fast online matching statistics
    Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: A
    Pangenomics Index for Finding MEMs. Accepted, RECOMB 2021.
    Boucher C, Gagie T, I T, Köppl D, Langmead B, Manzini G, Navarro G,
    Pacheco A, Rossi M. PHONI: Streamed Matching Statistics with
    Multi-Genome References. Accepted, DCC 2021.
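
    For readers unfamiliar with matching statistics: for each position i of a query, the matching statistic is the length of the longest prefix of the query suffix starting at i that occurs somewhere in the reference text. The brute-force sketch below only illustrates that definition. It is not how MONI or PHONI compute it, and the function name is illustrative.

```python
def matching_statistics(text: str, pattern: str) -> list:
    """For each i, length of the longest prefix of pattern[i:] found in text."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend while the next-longer prefix still occurs in the text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


print(matching_statistics("abaaba", "baab"))
print(matching_statistics("abaaba", "abc"))
```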

  58. Conclusions for Practitioners

  59. Conclusions for Practitioners
    • If avoiding reference bias is the goal, that’s how we
    should evaluate
    • A map, not a catalog
    • With multiple references, fast & familiar linear aligners
    have comparable benefits to (current) graph aligners
    • New assemblies are coming, but we lack good
    ways to put them in common coordinates.
    Need methods that let genomes be linear

  60. Conclusions for Methods
    • Pangenome graphs suffer from ambiguity that comes with adding many rare variants
    • Is this a failing of the method?
    • Can index & queries be made frequency-aware, representing rare variation while understanding it is rare?
    [Figure labels: 11%, 89%; 1%, 1%, 4%, 94%] 🤔

  61. Thank you! And thanks to the team:
    Jacob Pritt, Nae-Chyun Chen, Taher Mun, Brad Solomon, Sheila Iyer
    Christina Boucher, Travis Gagie, Alan Kuhnle, Giovanni Manzini
    + MONI & PHONI teams
    NSF: IIS-1349906, DBI-2029552
    NIH: R01GM118568, R01HG011392

  62. Photo: Elizabeth Colantuoni
    Thank you
