PAG XXIV: How to Compare and Cluster Every Known Genome in about an Hour

Sergey Koren
January 12, 2016

Given a massive collection of sequences, it is infeasible to perform pairwise alignment for basic tasks like sequence clustering and search. To address this problem, we demonstrate that the MinHash technique, first applied to clustering web pages, can be applied to biological sequences with similar effect, and extend this idea to include biologically relevant distance and significance measures. Our new tool, Mash, uses MinHash locality-sensitive hashing to reduce large sequences to a representative sketch and rapidly estimate pairwise distances between genomes or metagenomes. Using Mash, we explored several use cases, including a 5,000-fold size reduction and clustering of all 55,000 NCBI RefSeq genomes in 46 CPU hours. The resulting 93 MB sketch database includes all RefSeq genomes, effectively delineates known species boundaries, reconstructs approximate phylogenies, and can be searched in seconds using assembled genomes or raw sequencing runs from Illumina, Pacific Biosciences, and Oxford Nanopore. For metagenomics, Mash scales to thousands of samples and can replicate Human Microbiome Project and Global Ocean Survey results in a fraction of the time. Other potential applications include any problem where an approximate, global sequence distance is acceptable, e.g., to triage and cluster sequence data, assign species labels to unknown genomes, quickly identify mis-tracked samples, and search massive genomic databases. In addition, the Mash distance metric is based on simple set intersections, which are compatible with homomorphic encryption schemes. To facilitate integration with other software, Mash is implemented as a lightweight C++ toolkit and freely released under a BSD license at
https://github.com/marbl/mash


Transcript

  1. How to compare and cluster every known genome in about an hour

    Sergey Koren, @sergekoren, Genome Informatics Section, NHGRI
  2. Why MinHash?

    - Large compression
      - 3 Gbp primate genome: 8 kB vs. 750 MB
      - 5 Tbp of samples: 71 MB vs. 1.25 TB
    - Fast comparisons
      - Cluster all of RefSeq: 46 CPU hours
      - Linear search of RefSeq: 1 CPU second
    Primary overhead is in the sketching; comparisons are instantaneous.
    Assembling large genomes with single-molecule sequencing and locality sensitive hashing. Berlin et al. (2015)
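
The sizes quoted on this slide are consistent with simple arithmetic. The following is a rough check, assuming 2 bits per base for the packed sequence and a sketch of 1,000 hash values stored as 64-bit integers; both encodings are assumptions for illustration and are not stated on the slide, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the "8 kB vs. 750 MB" and "1.25 TB" figures.
# ASSUMPTIONS (not from the slide): 2 bits per base, 64-bit hash values.
BITS_PER_BASE = 2
SKETCH_SIZE = 1000      # s = 1,000 values per genome
BYTES_PER_HASH = 8      # 64-bit hashes


def packed_sequence_bytes(num_bases: int) -> int:
    """Bytes needed to store num_bases at 2 bits per base."""
    return num_bases * BITS_PER_BASE // 8


def sketch_bytes(sketch_size: int = SKETCH_SIZE,
                 bytes_per_hash: int = BYTES_PER_HASH) -> int:
    """Bytes needed to store one sketch."""
    return sketch_size * bytes_per_hash


if __name__ == "__main__":
    print(packed_sequence_bytes(3_000_000_000))      # 3 Gbp genome
    print(sketch_bytes())                            # one sketch
    print(packed_sequence_bytes(5_000_000_000_000))  # 5 Tbp of samples
```

Under these assumptions a 3 Gbp genome packs into 750,000,000 bytes (750 MB), one sketch takes 8,000 bytes (8 kB), and 5 Tbp of samples pack into 1.25 TB. The 71 MB figure depends on the number of samples and the per-sample sketch size, which the slide does not give, so it is not reproduced here.
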
  3. What is a sketch?

    [Figure: the k-mers of sequence A (GGATT, TGACG, GTACT, ...) pass through a hash function h, giving the sketch S(A) = {42, 64, 82, 128, 139}]
    mash sketch A.fasta
    On the resemblance and containment of documents. Broder (1997)
  4. Estimating Jaccard with MinHash

    [Figure: sequences A and B with their sketches]
    S(A) = {42, 64, 82, 128, 139}
    S(B) = {66, 82, 87, 104, 127}
    S(A∪B) = {42, 64, 66, 82, 87}
    On the resemblance and containment of documents. Broder (1997)
  5. Estimating Jaccard with MinHash

    [Figure: sequences A and B with the merged sketch]
    S(A∪B) = {42, 64, 66, 82, 87}
    mash dist A.msh B.msh
    On the resemblance and containment of documents. Broder (1997)
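
The numbers on slides 4 and 5 can be worked through in a few lines of Python. This is a minimal illustration of the bottom-s estimator described by Broder (1997), not Mash's own code; the function name `jaccard_estimate` is hypothetical. The merged sketch S(A∪B) is the s smallest values in the union of the two sketches, and the Jaccard estimate is the fraction of those values present in both.

```python
from typing import Iterable


def jaccard_estimate(sketch_a: Iterable[int], sketch_b: Iterable[int], s: int) -> float:
    """Estimate the Jaccard index from two bottom-s sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # S(A∪B): the s smallest hashes seen in either sketch.
    merged = sorted(set_a | set_b)[:s]
    # Count merged hashes that occur in both sketches.
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


if __name__ == "__main__":
    s_a = [42, 64, 82, 128, 139]
    s_b = [66, 82, 87, 104, 127]
    print(jaccard_estimate(s_a, s_b, s=5))  # prints 0.2
```

With the slide's values the merged sketch is {42, 64, 66, 82, 87}; only 82 appears in both S(A) and S(B), so the estimate is 1/5 = 0.2.
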
  6. Mash distance correlates with ANI

    All-pairs comparison of 500 Escherichia genomes (s=1,000, k=21): RMSE=0.00274
    [Figure: scatter plot of Mash D against 1 – ANI, both axes from 0.0 to 0.1]
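
For context, the Mash preprint cited on slide 14 defines the Mash distance from a Jaccard estimate j and k-mer size k as D = -(1/k) · ln(2j / (1 + j)), which is the quantity compared with 1 – ANI here. A small Python illustration follows; the function name `mash_distance` is hypothetical, and returning 1.0 when no hashes are shared is a convention assumed for this example.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2j / (1 + j)) for a Jaccard estimate j."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard estimate must lie in [0, 1]")
    if jaccard == 0.0:
        return 1.0  # assumed convention: no shared hashes means maximal distance
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


if __name__ == "__main__":
    # j = 0.2 corresponds to one shared hash out of five, as in the sets on slides 4 and 5.
    print(round(mash_distance(0.2, k=21), 4))  # prints 0.0523
```

Identical sketches (j = 1) give D = 0, and the distance grows as fewer hashes are shared. A sketch of only five values is far too small for real use; the slide's comparison uses s=1,000.
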
  7. Unsupervised database clustering

    RefSeq = ~1.5 billion distances in 46 CPU h, sketches <100 MB, linear search in 1 s
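
The "~1.5 billion distances" figure matches the number of unordered pairs among the 55,000 RefSeq genomes mentioned in the abstract. A quick check in Python (the function name `all_pairs` is hypothetical):

```python
import math


def all_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return math.comb(n, 2)


if __name__ == "__main__":
    print(all_pairs(55_000))  # prints 1512472500
```
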
  8. Whole-genome phylogeny

    Each genome = 1,000 values, fasta to phylogeny in <30 min on a laptop
    [Figure, panel b) Mash: primate tree with tips Chimpanzee, Bonobo, Human, Gorilla, Orangutan, Northern white-cheeked gibbon, Rhesus macaque, Crab-eating macaque, Olive baboon, Green monkey, Proboscis monkey, Golden snub-nosed monkey, Common marmoset, Black-capped squirrel monkey, Philippine tarsier, Gray mouse lemur, Northern greater galago]
  9. Metagenome sample clustering

    - 888 HMP and MetaHIT samples (s=10,000, k=21)
    Sketch: 4.4 CPU hours (assemblies), 279 CPU hours (reads); Clustering <1 s
  10. Database search

    - Discriminates between B. anthracis and B. cereus
    - Bloom filter to remove single-copy k-mers
    - Can be used to index/search SRA
    - Read mapping on the way (cf. Heng Li’s minimap)

    Strain                        Tech        Size      Time     LCA
    Zaire ebolavirus              MinION 7.3  9 Mbp     2.43 s   Zaire ebolavirus
    E. coli K12                   MinION 7.3  46 Mbp    11.45 s  E. coli
    K. pneumoniae ATCC BAA-2146   MinION 7.3  87 Mbp    20.03 s  K. pneumoniae
    S. aureus SASCBU26            MinION 7.0  231 Mbp   50.23 s  S. aureus
    B. anthracis Ames             MinION 7.3  176 Mbp   38.68 s  B. anthracis
    B. cereus ATCC 10987          MinION 7.3  266 Mbp   58.07 s  B. cereus ATCC 10987

    mash dist -u RefSeq.msh A.fast(aq)
  11. POC disease surveillance

  12. Sequencing as a sensor hint.fm/wind NOAA NDFD

  13. Sequencing as a sensor

    The rise of a digital immune system. Schatz and Phillippy (2012)
  14. Mash preprint on bioRxiv

    - Comments welcome
    - http://mash.readthedocs.org
    - examples and RefSeq database
    - Fast distance estimation
      - Database search
      - Rapid species assignment
      - Very large guide trees
      - Sample quality control
      - Metagenome sample clustering
    Fast genome and metagenome distance estimation using MinHash. Ondov et al.
  15. Acknowledgements

    - Mash
      - Brian Ondov
      - Todd Treangen
      - Adam Phillippy
    - Canu
      - Adam Phillippy
      - Brian Walenz
    - NHGRI
    - Postdocs wanted!
      - Genome Informatics Section
      - Assembly
      - Structural variation
      - Infectious disease
      - Undiagnosed disease
      - http://www.genome.gov/27563366
    /MarBL
  16. PUBLIC DOMAIN NOTICE

    This presentation is "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This presentation is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the presentation and associated data, the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH) and the U.S. Government do not and cannot warrant the performance or results that may be obtained based on this presentation or data. NHGRI, NIH and the U.S. Government disclaim all warranties as to performance, merchantability or fitness for any particular purpose. Please cite the authors in any work or product based on this material.