PAG XXIV: How to Compare and Cluster Every Known Genome in about an Hour

Sergey Koren
January 12, 2016

Given a massive collection of sequences, it is infeasible to perform pairwise alignment for basic tasks like sequence clustering and search. To address this problem, we demonstrate that the MinHash technique, first applied to clustering web pages, can be applied to biological sequences with similar effect, and extend this idea to include biologically relevant distance and significance measures. Our new tool, Mash, uses MinHash locality-sensitive hashing to reduce large sequences to a representative sketch and rapidly estimate pairwise distances between genomes or metagenomes. Using Mash, we explored several use cases, including a 5,000-fold size reduction and clustering of all 55,000 NCBI RefSeq genomes in 46 CPU hours. The resulting 93 MB sketch database includes all RefSeq genomes, effectively delineates known species boundaries, reconstructs approximate phylogenies, and can be searched in seconds using assembled genomes or raw sequencing runs from Illumina, Pacific Biosciences, and Oxford Nanopore. For metagenomics, Mash scales to thousands of samples and can replicate Human Microbiome Project and Global Ocean Survey results in a fraction of the time. Other potential applications include any problem where an approximate, global sequence distance is acceptable, e.g. to triage and cluster sequence data, assign species labels to unknown genomes, quickly identify mis-tracked samples, and search massive genomic databases. In addition, the Mash distance metric is based on simple set intersections, which are compatible with homomorphic encryption schemes. To facilitate integration with other software, Mash is implemented as a lightweight C++ toolkit and freely released under a BSD license at https://github.com/marbl/mash

Transcript

  1. How to compare and cluster every
    known genome in about an hour
    Sergey Koren, @sergekoren
    Genome Informatics Section, NHGRI

  2. Why MinHash?
    - Large compression
      - 3 Gbp primate genome: 8 kB vs. 750 MB
      - 5 Tbp of samples: 71 MB vs. 1.25 TB
    - Fast comparisons
      - Cluster all of RefSeq: 46 CPU hours
      - Linear search of RefSeq: 1 CPU second
    Primary overhead is in the sketching; comparisons are instantaneous
    Assembling large genomes with single-molecule sequencing and locality sensitive hashing. Berlin et al. (2015)
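
    The size figures on this slide can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not anything taken from Mash itself. It assumes a genome packed at 2 bits per base and sketches stored as 64-bit hash values; the sketch sizes (1,000 per genome, 10,000 per metagenome sample) and the 888-sample count appear on later slides, and linking those 888 samples to the "5 Tbp of samples" figure is an assumption. The function names are hypothetical.

```python
def packed_bytes(bases: int, bits_per_base: int = 2) -> float:
    """Bytes needed to store a sequence at a fixed number of bits per base (assumed 2)."""
    return bases * bits_per_base / 8


def sketch_bytes(sketch_size: int, hash_bytes: int = 8) -> int:
    """Bytes needed for one sketch of `sketch_size` hashes (assumed 64-bit each)."""
    return sketch_size * hash_bytes


primate_genome = packed_bytes(3_000_000_000)       # 3 Gbp
primate_sketch = sketch_bytes(1_000)               # s = 1,000
all_samples = packed_bytes(5_000_000_000_000)      # 5 Tbp
sample_sketches = 888 * sketch_bytes(10_000)       # 888 samples, s = 10,000 (assumed link)

print(f"genome {primate_genome / 1e6:.0f} MB vs. sketch {primate_sketch / 1e3:.0f} kB")
print(f"samples {all_samples / 1e12:.2f} TB vs. sketches {sample_sketches / 1e6:.0f} MB")
```

    Under these assumptions the numbers line up with the slide: the sketch size depends only on the number of hashes kept, not on the length of the input.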

  3. What is a sketch?
    [Diagram: k-mers of sequence A (GGATT, TGACG, GTACT, ...) are passed through a hash function h]
    S(A) = {42, 64, 82, 128, 139}
    On the resemblance and containment of documents. Broder (1997)
    mash sketch A.fasta
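
    As an illustration of the idea on this slide, here is a minimal bottom-s MinHash sketch in Python: hash every k-mer and keep the s smallest hash values. This is a sketch under stated assumptions, not Mash's implementation. It uses the first 8 bytes of SHA-1 as a stand-in hash (Mash uses a different hash function and also handles reverse complements, which is omitted here), and the function names and toy sequence are hypothetical.

```python
import hashlib


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer: str) -> int:
    """Stand-in 64-bit hash: first 8 bytes of SHA-1 (an assumption, not Mash's hash)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq: str, k: int = 5, s: int = 5) -> list:
    """Return the s smallest distinct k-mer hashes, sorted ascending."""
    hashes = {hash_kmer(km) for km in kmers(seq.upper(), k)}
    return sorted(hashes)[:s]


toy_sequence = "GGATTTGACGGTACT"  # hypothetical input containing the slide's k-mers
print(sketch(toy_sequence, k=5, s=5))
```

    The result is a fixed-size list of integers regardless of the input length, which is what makes later comparisons cheap.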

  4. Estimating Jaccard with MinHash
    [Diagram: sequences A and B and their sketches]
    S(A) = {42, 64, 82, 128, 139}
    S(A∪B) = {42, 64, 66, 82, 87}
    S(B) = {66, 82, 87, 104, 127}
    On the resemblance and containment of documents. Broder (1997)

  5. Estimating Jaccard with MinHash
    [Diagram: sequences A and B]
    S(A∪B) = {42, 64, 66, 82, 87}
    On the resemblance and containment of documents. Broder (1997)
    mash dist A.msh B.msh
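
    The estimate can be worked through with the sketches shown on the slides. The following is a minimal illustration, assuming the standard bottom-s estimator: merge the two sketches, keep the s smallest values as S(A∪B), and count how many of those appear in both S(A) and S(B). The function name is hypothetical.

```python
def jaccard_estimate(sketch_a, sketch_b, s: int) -> float:
    """Estimate the Jaccard index from two bottom-s sketches."""
    a, b = set(sketch_a), set(sketch_b)
    union_sketch = sorted(a | b)[:s]          # S(A ∪ B): s smallest of the merged hashes
    if not union_sketch:
        return 0.0
    shared = [h for h in union_sketch if h in a and h in b]
    return len(shared) / len(union_sketch)


S_A = [42, 64, 82, 128, 139]
S_B = [66, 82, 87, 104, 127]
j = jaccard_estimate(S_A, S_B, s=5)
print(j)
```

    For the slide's values, S(A∪B) is {42, 64, 66, 82, 87} and only 82 is present in both sketches, so the estimate is 1/5.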

  6. Mash distance correlates with ANI
    All-pairs comparison of 500 Escherichia genomes
    RMSE=0.00274
    s=1,000
    k=21
    [Figure: scatter plot of Mash D against 1–ANI; both axes run from 0.0 to 0.1]
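
    The Mash distance plotted here is derived from the Jaccard estimate j and the k-mer size k as D = -(1/k) ln(2j / (1 + j)), the formula given in the Mash preprint. The snippet below is a small sketch of that conversion; the function name is hypothetical, and capping the result at 1 and returning 1 when no hashes are shared are choices made for this illustration.

```python
import math


def mash_distance(j: float, k: int) -> float:
    """Convert a Jaccard estimate j into a Mash distance for k-mer size k."""
    if j <= 0.0:
        return 1.0   # no shared hashes: treat as maximally distant (illustrative choice)
    if j >= 1.0:
        return 0.0   # identical sketches
    # D = -(1/k) * ln(2j / (1 + j)), capped at 1 for this illustration
    return min(1.0, -math.log(2.0 * j / (1.0 + j)) / k)


print(mash_distance(0.2, k=21))
```

    Because D approximates a per-base mutation rate, it is comparable to 1–ANI, which is the relationship the slide's plot reports.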

  7. Unsupervised database clustering
    RefSeq = ~1.5 billion distances in 46 CPU h, sketches <100 MB, linear search in 1s
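
    The "~1.5 billion distances" figure follows from the number of genome pairs. A quick check, assuming the 55,000 RefSeq genomes quoted in the abstract and one distance per unordered pair:

```python
from math import comb

n_genomes = 55_000                  # RefSeq genome count from the abstract
n_pairs = comb(n_genomes, 2)        # unordered pairs: n * (n - 1) / 2

print(f"{n_pairs:,} pairwise distances")
```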

  8. Whole-genome phylogeny
    Each genome = 1,000 values, fasta to phylogeny in <30m on a laptop
    [Figure, panel b) Mash: primate tree with the following leaf labels]
    Chimpanzee
    Bonobo
    Human
    Gorilla
    Orangutan
    Northern white-cheeked gibbon
    Rhesus macaque
    Crab-eating macaque
    Olive baboon
    Green monkey
    Proboscis monkey
    Golden snub-nosed monkey
    Common marmoset
    Black-capped squirrel monkey
    Philippine tarsier
    Gray mouse lemur
    Northern greater galago

  9. Metagenome sample clustering
    - 888 HMP and MetaHIT samples (s=10,000, k=21)
    Sketch: 4.4 CPU hours (assemblies), 279 CPU hours (reads); Clustering <1s

  10. Database search
    - Discriminates between B. anthracis and B. cereus
    - Bloom filter to remove single-copy k-mers
    - Can be used to index/search SRA
    - Read mapping on the way (cf. Heng Li’s minimap)

    Strain                       Tech        Size     Time     LCA
    Zaire ebolavirus             MinION 7.3  9 Mbp    2.43 s   Zaire ebolavirus
    E. coli K12                  MinION 7.3  46 Mbp   11.45 s  E. coli
    K. pneumoniae ATCC BAA-2146  MinION 7.3  87 Mbp   20.03 s  K. pneumoniae
    S. aureus SASCBU26           MinION 7.0  231 Mbp  50.23 s  S. aureus
    B. anthracis Ames            MinION 7.3  176 Mbp  38.68 s  B. anthracis
    B. cereus ATCC 10987         MinION 7.3  266 Mbp  58.07 s  B. cereus ATCC 10987

    mash dist -u RefSeq.msh A.fast(aq)
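
    Conceptually, a database search of this kind is a linear scan: compare the query sketch against every reference sketch and rank the references by distance. The following is a minimal sketch of that idea, assuming the bottom-s Jaccard estimator and the Mash distance formula D = -(1/k) ln(2j / (1 + j)). The reference names, the toy integer sketches, and all function names are hypothetical, and the Bloom filter step mentioned on the slide is not modeled.

```python
import math


def jaccard_estimate(sketch_a, sketch_b, s: int) -> float:
    """Estimate the Jaccard index from two bottom-s sketches."""
    a, b = set(sketch_a), set(sketch_b)
    union_sketch = sorted(a | b)[:s]
    if not union_sketch:
        return 0.0
    return sum(1 for h in union_sketch if h in a and h in b) / len(union_sketch)


def mash_distance(j: float, k: int) -> float:
    """Convert a Jaccard estimate into a Mash distance, capped at 1."""
    if j <= 0.0:
        return 1.0
    if j >= 1.0:
        return 0.0
    return min(1.0, -math.log(2.0 * j / (1.0 + j)) / k)


def search(query_sketch, database: dict, k: int, s: int) -> list:
    """Linear scan: return (distance, name) pairs sorted from closest to farthest."""
    hits = [
        (mash_distance(jaccard_estimate(query_sketch, ref, s), k), name)
        for name, ref in database.items()
    ]
    return sorted(hits)


# Hypothetical toy data: small integers stand in for real hash values.
database = {
    "ref_close": [3, 8, 15, 22, 40],
    "ref_far": [5, 9, 15, 30, 41],
}
query = [3, 8, 15, 21, 40]

for distance, name in search(query, database, k=21, s=5):
    print(f"{name}\t{distance:.4f}")
```

    Each comparison touches only the fixed-size sketches, so the cost of the scan grows with the number of references rather than with genome length.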

  11. POC disease surveillance

  12. Sequencing as a sensor
    hint.fm/wind
    NOAA NDFD

  13. Sequencing as a sensor
    The rise of a digital immune system. Schatz and Phillippy (2012)

  14. Mash preprint on bioRxiv
    - Comments welcome
    - http://mash.readthedocs.org
    - examples and RefSeq database
    - Fast distance estimation
      - Database search
      - Rapid species assignment
      - Very large guide trees
      - Sample quality control
      - Metagenome sample clustering
    Fast genome and metagenome distance estimation using MinHash. Ondov et al.

  15. Acknowledgements
    - Mash
      - Brian Ondov
      - Todd Treangen
      - Adam Phillippy
    - Canu
      - Adam Phillippy
      - Brian Walenz
    - NHGRI
    - Postdocs wanted!
      - Genome Informatics Section
      - Assembly
      - Structural variation
      - Infectious disease
      - Undiagnosed disease
      - http://www.genome.gov/27563366
    /MarBL

  16. PUBLIC DOMAIN NOTICE
    This presentation is "United States Government Work" under the
    terms of the United States Copyright Act. It was written as part of
    the authors' official duties for the United States Government and
    thus cannot be copyrighted. This presentation is freely available to
    the public for use without a copyright notice. Restrictions cannot
    be placed on its present or future use.
    Although all reasonable efforts have been taken to ensure the
    accuracy and reliability of the presentation and associated data,
    the National Human Genome Research Institute (NHGRI),
    National Institutes of Health (NIH) and the U.S. Government do
    not and cannot warrant the performance or results that may be
    obtained based on this presentation or data. NHGRI, NIH and the
    U.S. Government disclaim all warranties as to performance,
    merchantability or fitness for any particular purpose. Please cite
    the authors in any work or product based on this material.
