PAG XXIV: How to Compare and Cluster Every Known Genome in about an Hour

How to compare and cluster every known genome in about an hour
Sergey Koren, @sergekoren
Genome Informatics Section, NHGRI

Why MinHash?
- Large compression
  - 3 Gbp primate genome: 8 kB vs. 750 MB
  - 5 Tbp of samples: 71 MB vs. 1.25 TB
- Fast comparisons
  - Cluster all of RefSeq: 46 CPU hours
  - Linear search of RefSeq: 1 CPU second
Primary overhead is in the sketching; comparisons are instantaneous.
Assembling large genomes with single-molecule sequencing and locality sensitive hashing. Berlin et al. (2015)

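What is a sketch?
[Figure: genome A is split into k-mers (GGATT, TGACG, GTACT, ...), each k-mer is hashed with h, and the smallest hash values form the sketch.]
S(A) = {42, 64, 82, 128, 139}
mash sketch A.fasta
On the resemblance and containment of documents. Broder (1997)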
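
The sketching step can be illustrated with a minimal Python example. This is a sketch under stated assumptions, not Mash's implementation: the function names (`kmers`, `hash_kmer`, `sketch`) are hypothetical, BLAKE2b stands in for whatever hash function Mash uses, and reverse-complement handling is left out.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is a stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, s=5):
    """Return the s smallest distinct k-mer hashes of seq, in ascending order."""
    hashes = {hash_kmer(km) for km in kmers(seq.upper(), k)}
    return sorted(hashes)[:s]


if __name__ == "__main__":
    # Toy sequence; real genomes would use larger k and s (the slides use k=21, s=1,000).
    print(sketch("GGATTGACGTACT", k=5, s=5))
```

Whatever the genome length, the sketch holds at most s integers, which is where the compression quoted earlier comes from.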

Estimating Jaccard with MinHash
S(A) = {42, 64, 82, 128, 139}
S(B) = {66, 82, 87, 104, 127}
S(A∪B) = {42, 64, 66, 82, 87}
On the resemblance and containment of documents. Broder (1997)

Estimating Jaccard with MinHash
S(A∪B) = {42, 64, 66, 82, 87}
mash dist A.msh B.msh
On the resemblance and containment of documents. Broder (1997)
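
The estimate can be worked through on the two example sketches from the slides. The following is a minimal illustration, assuming the usual bottom-s estimator: merge the two sketches, keep the s smallest values as S(A∪B), and count how many of those appear in both S(A) and S(B). The function name `jaccard_estimate` is hypothetical.

```python
def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index of two sets from their bottom-s sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # S(A∪B): the s smallest hashes found in either sketch.
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    # Hashes of S(A∪B) that are present in both input sketches.
    shared = [h for h in merged if h in set_a and h in set_b]
    return len(shared) / len(merged)


if __name__ == "__main__":
    s_a = [42, 64, 82, 128, 139]
    s_b = [66, 82, 87, 104, 127]
    print(jaccard_estimate(s_a, s_b, s=5))
```

With these values S(A∪B) is {42, 64, 66, 82, 87}, only 82 occurs in both sketches, and the estimate is 1/5.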

Mash distance correlates with ANI
All-pairs comparison of 500 Escherichia genomes
[Figure: plot of Mash D against 1–ANI, both axes from 0.0 to 0.1; s=1,000, k=21, RMSE=0.00274]
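
For reference, the conversion from a Jaccard estimate j to a Mash distance can be written in a few lines. The formula D = -(1/k) ln(2j / (1 + j)) is taken from the Mash paper (Ondov et al.), not from the slides, and the function name `mash_distance` is hypothetical.

```python
import math


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash distance for k-mer size k."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    if j >= 1.0:
        return 0.0  # identical sketches
    return -math.log(2.0 * j / (1.0 + j)) / k


if __name__ == "__main__":
    # j = 0.2 is the value obtained from the two toy sketches S(A) and S(B).
    print(mash_distance(0.2, k=21))
```

Because D approximates a per-base mutation rate, it is directly comparable to 1–ANI, which is what the plot on this slide shows.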

Unsupervised database clustering
RefSeq = ~1.5 billion distances in 46 CPU h, sketches <100 MB, linear search in 1 s
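
One simple way to turn pairwise distances into clusters is to link every pair below a distance threshold and report the connected components. The example below is only an illustration of that idea: the slides do not state which clustering method was used, and the function name `cluster_by_threshold`, the threshold of 0.05 and the toy distances are all assumptions.

```python
def cluster_by_threshold(names, distances, threshold=0.05):
    """Group names into connected components of the graph with edges where d <= threshold.

    distances maps (name_a, name_b) pairs to a distance; threshold is an assumed cutoff.
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    toy = {("A", "B"): 0.01, ("B", "C"): 0.04, ("A", "C"): 0.06,
           ("A", "D"): 0.30, ("B", "D"): 0.30, ("C", "D"): 0.30}
    print(cluster_by_threshold(["A", "B", "C", "D"], toy))
```

Here A and C end up in the same cluster through B even though their direct distance exceeds the threshold, which is the usual behaviour of single-linkage grouping.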

Whole-genome phylogeny
Each genome = 1,000 values, fasta to phylogeny in <30 min on a laptop
[Figure, panel b) Mash: primate tree with tips Chimpanzee, Bonobo, Human, Gorilla, Orangutan, Northern white-cheeked gibbon, Rhesus macaque, Crab-eating macaque, Olive baboon, Green monkey, Proboscis monkey, Golden snub-nosed monkey, Common marmoset, Black-capped squirrel monkey, Philippine tarsier, Gray mouse lemur, Northern greater galago]

Metagenome sample clustering
- 888 HMP and MetaHIT samples (s=10,000, k=21)
Sketch: 4.4 CPU hours (assemblies), 279 CPU hours (reads); Clustering: <1 s

Database search
- Discriminates between B. anthracis and B. cereus
- Bloom filter to remove single-copy k-mers
- Can be used to index/search SRA
- Read mapping on the way (cf. Heng Li’s minimap)

Strain                        Tech        Size      Time     LCA
Zaire ebolavirus              MinION 7.3  9 Mbp     2.43 s   Zaire ebolavirus
E. coli K12                   MinION 7.3  46 Mbp    11.45 s  E. coli
K. pneumoniae ATCC BAA-2146   MinION 7.3  87 Mbp    20.03 s  K. pneumoniae
S. aureus SASCBU26            MinION 7.0  231 Mbp   50.23 s  S. aureus
B. anthracis Ames             MinION 7.3  176 Mbp   38.68 s  B. anthracis
B. cereus ATCC 10987          MinION 7.3  266 Mbp   58.07 s  B. cereus ATCC 10987

mash dist -u RefSeq.msh A.fast(aq)
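
The Bloom filter step mentioned above can be illustrated as follows. This is a rough sketch of the general idea, assuming that a k-mer is only kept once it has been seen before, so that k-mers occurring a single time in the reads (often sequencing errors) are mostly dropped. The class `BloomFilter`, the function `repeated_kmers`, the filter size and the use of BLAKE2b are all assumptions, not Mash's implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (sizes chosen for illustration only)."""

    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode("ascii"), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def repeated_kmers(reads, k):
    """Return k-mers seen at least twice; a few singletons may slip through as false positives."""
    seen = BloomFilter()
    kept = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                kept.add(kmer)   # second or later occurrence: keep it
            else:
                seen.add(kmer)   # first occurrence: remember it, do not keep yet
    return kept


if __name__ == "__main__":
    print(repeated_kmers(["ACGTAC", "ACGTTT"], k=4))
```

A Bloom filter never forgets a k-mer it has stored but can occasionally report an unseen one as present, so this removes most, not all, single-copy k-mers while using a fixed amount of memory.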

POC disease surveillance

Sequencing as a sensor
hint.fm/wind, NOAA NDFD

Sequencing as a sensor
The rise of a digital immune system. Schatz and Phillippy (2012)

Mash preprint on bioRxiv
- Comments welcome
- http://mash.readthedocs.org
  - examples and RefSeq database
- Fast distance estimation
  - Database search
  - Rapid species assignment
  - Very large guide trees
  - Sample quality control
  - Metagenome sample clustering
Fast genome and metagenome distance estimation using MinHash. Ondov et al.

Acknowledgements
- Mash
  - Brian Ondov
  - Todd Treangen
  - Adam Phillippy
- Canu
  - Adam Phillippy
  - Brian Walenz
- NHGRI
- Postdocs wanted!
  - Genome Informatics Section
  - Assembly
  - Structural variation
  - Infectious disease
  - Undiagnosed disease
  - http://www.genome.gov/27563366
/MarBL

PUBLIC DOMAIN NOTICE
This presentation is "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This presentation is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the presentation and associated data, the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH) and the U.S. Government do not and cannot warrant the performance or results that may be obtained based on this presentation or data. NHGRI, NIH and the U.S. Government disclaim all warranties as to performance, merchantability or fitness for any particular purpose. Please cite the authors in any work or product based on this material.
