
Phylogenetics and Barcoding


Slides from a presentation at the Smithsonian NHM on 01-18-2012 covering the use of UCE loci for phylogenomics and the use of massively parallel DNA sequencing for barcoding (or metabarcoding) applications.

Brant Faircloth

January 29, 2012


Transcript

  1. The problem / value of MPS: you (7 days of 8 plates per day containing 5 multiplexed loci) vs. the machine (Illumina HiSeq 2000 100 PE run, 100X depth for each amplicon). Friday, January 27, 12
  2. The value of MPS: 1.22 years of 95 samples per plate, 5 multiplexed loci per plate, 8 plates per day, 7 days per week.
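As a rough check on the scale slide 2 describes, the workload can be multiplied out. This is a minimal sketch, assuming 365.25 days per year (the slide gives only the per-plate, per-day, and per-week figures); the constant and function names are illustrative and do not come from the talk.

```python
# Workload figures taken from slide 2.
SAMPLES_PER_PLATE = 95
LOCI_PER_PLATE = 5       # multiplexed loci per plate
PLATES_PER_DAY = 8
DAYS_PER_WEEK = 7
YEARS = 1.22

# Assumption (not on the slide): average length of a year in days.
DAYS_PER_YEAR = 365.25


def amplicons_per_day() -> int:
    """Sample-by-locus amplicons produced in one day of plate work."""
    return SAMPLES_PER_PLATE * LOCI_PER_PLATE * PLATES_PER_DAY


def amplicons_per_week() -> int:
    """Same workload over a 7-day week."""
    return amplicons_per_day() * DAYS_PER_WEEK


def amplicons_total(years: float = YEARS) -> float:
    """Total amplicons over the stated number of years."""
    return amplicons_per_day() * DAYS_PER_YEAR * years


if __name__ == "__main__":
    print(f"per day:  {amplicons_per_day():,}")
    print(f"per week: {amplicons_per_week():,}")
    print(f"total:    {amplicons_total():,.0f}")
```

Under these assumptions the total comes to roughly 1.7 million sample-by-locus amplicons.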
  3. MPS Costs. [Bar charts for the 3730xl, 454 (Ti), and HiSeq 2000. Panel 1: millions of reads per run (axis 0 to 1000; values shown: 1,000.0000, 26.0000, 0.0001). Panel 2: cost per megabase (axis 0 to 1500; values shown: $0.10, $15.00, $1,500.00).] Source: Glenn 2011, Mol. Ecol. Res.
  4. The problem / value of MPS: you vs. the machine (Illumina HiSeq 2000 100 PE run, 100X depth for each amplicon).
  5. PCR vs. Capture. [Chart on a Log(targets) axis from 1 to 1,000,000. Methods shown: PCR and capture-based (Biotinylated Oligos, MySelect 25k, SureSelect 55k, MySelect 200k).]
  6. UCE Discovery. Ultraconserved Elements in the Human Genome. Gill Bejerano,1* Michael Pheasant,3 Igor Makunin,3 Stuart Stephen,3 W. James Kent,1 John S. Mattick,3 David Haussler2*. There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates. Although only about 1.2% of the human genome appears to code for proteins (1–3), it has been estimated that as much as 5% is more conserved than would be expected from neutral evolution since the split with rodents, and hence may be under negative […] Bejerano et al. 2004, Science
  7. Many vertebrate alignments. 28-Way vertebrate alignment and conservation track in the UCSC Genome Browser. Webb Miller,1,11 Kate Rosenbloom,2 Ross C. Hardison,1 Minmei Hou,1 James Taylor,3 Brian Raney,2 Richard Burhans,1 David C. King,1 Robert Baertsch,2 Daniel Blankenberg,1 Sergei L. Kosakovsky Pond,4 Anton Nekrutenko,1 Belinda Giardine,1 Robert S. Harris,1 Svitlana Tyekucheva,1 Mark Diekhans,2 Thomas H. Pringle,5 William J. Murphy,6 Arthur Lesk,1 George M. Weinstock,7 Kerstin Lindblad-Toh,8 Richard A. Gibbs,7 Eric S. Lander,8 Adam Siepel,9 David Haussler,2,10 and W. James Kent2. 1 Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, Pennsylvania 16802, USA; 2 Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA; 3 Courant Institute, New York University, New York, New York 10012, USA; 4 Antiviral Research Center, University of California at San Diego, San Diego, California 92103, USA; 5 Sperling Foundation, Eugene, Oregon 97405, USA; 6 Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, Texas 77843, USA; 7 Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA; 8 Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA; 9 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA; 10 Howard Hughes Medical Institute, Santa Cruz, California 95060, USA. This article describes a set of alignments of 28 vertebrate genome sequences that is provided by the UCSC Genome Browser. The alignments can be viewed on the Human Genome Browser (March 2006 assembly) at http://genome.ucsc.edu, downloaded in bulk by anonymous FTP from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz28way, or analyzed with the Galaxy server at http://g2.bx.psu.edu. This article illustrates the power of this resource for exploring vertebrate and mammalian evolution, using three examples. First, we present several vignettes involving insertions and deletions within protein-coding regions, including a look at some human-specific indels. Then we study the extent to which start codons and stop codons in the human sequence are conserved in other species, showing that start codons are in general more poorly conserved than stop codons. Finally, an […] Miller et al. 2007, Genome Research
  8. UCEs resolve mammal phylogeny. John E. McCormack,1,8 Brant C. Faircloth,2 Nicholas G. Crawford,3 Patricia Adair Gowaty,4,5 Robb T. Brumfield1,6 & Travis C. Glenn7. 1 Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803; 2 Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095; 3 Department of Biology, Boston University, Boston, MA 02215; 4 Smithsonian Tropical Research Institute, MRC 0580-11 Unit 9100, Box 0948, DPO, AA 34002-9998, USA; 5 Institute of the […] McCormack et al. 2011, Genome Research
  9. Buffer Short UCEs. Conserved regions ≤ 120 bp. [Diagram: a 60 bp region with 30 bp added on each side, giving 120 bp.]
  10. Tile Long UCEs. Conserved regions ≥ 180 bp. [Diagram: one region covered by 4 probes.]
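The buffering and tiling steps on slides 9 and 10 can be sketched as simple interval arithmetic. The 120 bp target length and the equal 30 bp flanks come from slide 9; everything else here is an assumption made for illustration: the 60 bp tiling step, the half-open coordinates, and the function names `buffer_short` and `tile_long`. The slides do not say how regions between 120 and 180 bp are handled, and this is not the authors' probe-design code.

```python
from typing import List, Tuple

PROBE_LEN = 120  # target length shown on slide 9

Interval = Tuple[int, int]  # half-open [start, end) coordinates (assumed convention)


def buffer_short(start: int, end: int, probe_len: int = PROBE_LEN) -> Interval:
    """Pad a short conserved region equally on both sides up to probe_len."""
    length = end - start
    if length > probe_len:
        raise ValueError("region is longer than one probe; tile it instead")
    pad = probe_len - length
    left = pad // 2
    # The caller should clip to the sequence bounds; that is omitted here.
    return (start - left, end + (pad - left))


def tile_long(start: int, end: int, probe_len: int = PROBE_LEN,
              step: int = 60) -> List[Interval]:
    """Cover a long conserved region with overlapping probes.

    `step` is a hypothetical tiling offset; the slides do not state one.
    """
    if end - start < probe_len:
        raise ValueError("region is shorter than one probe; buffer it instead")
    probes: List[Interval] = []
    pos = start
    while pos + probe_len < end:
        probes.append((pos, pos + probe_len))
        pos += step
    probes.append((end - probe_len, end))  # last probe sits flush with the region end
    return probes


if __name__ == "__main__":
    print(buffer_short(1000, 1060))  # 60 bp region padded to 120 bp
    print(tile_long(0, 300))         # 300 bp region tiled at the assumed 60 bp step
```

With the assumed 60 bp step, a 300 bp region yields four probes, which matches the count drawn on slide 10, though the slide does not confirm that this is how the four probes were placed.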
  11. Assemble reads → contigs. [Diagram: loci Loc1 to Loc7.] “Throw out” reads not matching expected locus.
  12. RH: ULTRACONSERVED ELEMENTS ANCHOR GENETIC MARKERS. Ultraconserved Elements Anchor Thousands of Genetic Markers Spanning Multiple Evolutionary Timescales. Brant C. Faircloth1*, John E. McCormack2, Nicholas G. Crawford3, Michael G. Harvey2,4, Robb T. Brumfield2,4, and Travis C. Glenn5. 1 Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095; 2 Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803; 3 […] Faircloth et al. 2012, Systematic Biology
  13. Variation. [Plot: frequency of variable positions (axis 0.02 to 0.10) against distance from center of alignment (axis −300 to 300).]
  14. Why not exons? [Bar chart of matches per taxon, axis 0 to 40,000. Taxa: hg19, chinese, korean, venter, chimpanzee, gorilla, orangutan, macaque, marmoset, tarsier, tree shrew, rat, mouse, kangaroo rat, guinea pig, rabbit, horse, dog, bat, alpaca, cow, hedgehog, sloth, elephant, tenrec, opossum, platypus, zebra finch, anole.]
  15. Why not exons? [Bar chart of UCEs per taxon, axis 0 to 1,500. Taxa: hg19, chinese, korean, venter, chimpanzee, gorilla, orangutan, macaque, marmoset, tarsier, tree shrew, rat, mouse, kangaroo rat, guinea pig, rabbit, horse, dog, bat, alpaca, cow, hedgehog, sloth, elephant, tenrec, opossum, platypus, zebra finch, anole.]
  16. Room to grow: UCE loci. [Chart. Sources shown: McCormack et al. 2011; Faircloth et al. 2012; Stephen et al. 2008 + Faircloth et al. 2012. Counts shown: 2,386; 5,599; 12,147.] All tetrapods?
  17. Other taxa. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Adam Siepel,1,6 Gill Bejerano,1 Jakob S. Pedersen,1 Angie S. Hinrichs,1 Minmei Hou,3 Kate Rosenbloom,1 Hiram Clawson,1 John Spieth,4 LaDeana W. Hillier,4 Stephen Richards,5 George M. Weinstock,5 Richard K. Wilson,4 Richard A. Gibbs,5 W. James Kent,1 Webb Miller,3 and David Haussler1,2. 1 Center for Biomolecular Science and Engineering, 2 Howard Hughes Medical Institute, University of California, Santa Cruz, Santa Cruz, California 95064, USA; 3 Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania 16802, USA; 4 Genome Sequencing Center, Washington University School of Medicine, St. Louis, Missouri 63108, USA; 5 Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA. We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%–8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%–53%), Caenorhabditis elegans (18%–37%), and Saccharomyces cerevisiae (47%–68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by […] Siepel et al. 2005, Genome Research
  18. Fish. Targeted sequencing of ultra-conserved elements untangles relationships of ancient teleost fish lineages. Michael E. Alfaro*§ 1, Brant C. Faircloth* 1, Laurie Sorenson 1, Francesco Santini 1, Jonathan Chang 1. Alfaro et al., In prep.
  19. MPS Output v. Costs. [Bar charts for the 3730xl, 454 (Ti), and HiSeq 2000. Panel 1: millions of reads per run (axis 0 to 1000; values shown: 1,000.0000, 26.0000, 0.0001). Panel 2: cost per megabase (axis 0 to 1500; values shown: $0.10, $15.00, $1,500.00).] Source: Glenn 2011, Mol. Ecol. Res.
  20. “Voucher” generation. [Diagram of items labeled A to N, &c.]
  21. Thousands of complex mixtures. [Diagram of items labeled A to K, &c.] No Cloning!
  22. Sequence Tagging. Example tags: CAGCAA, CAGCAT, CCGTAG. Sources: Binladen et al. 2007, PLoS One; Meyer et al. 2007, NAR; Hamady et al. 2008, Nature Methods.
  23. Not enough tags. [Chart on a Log(tags) axis from 1 to 10,000 comparing Tags Available with Tags Needed.]
  24. Approach 1: More tags. [Table not recoverable from the transcript.] Faircloth and Glenn 2011, unpublished. edittag: https://github.com/faircloth-lab/edittag/
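Sequence tags are only useful if they stay distinguishable after sequencing errors, and pairwise edit distance is one common way to measure that. The following is a generic sketch of the idea, not edittag's code or API: `edit_distance` is the standard Levenshtein recurrence, and `choose_tags` is a simple greedy filter substituted here for illustration. Both function names and the minimum-distance threshold are assumptions; the three example tags are the ones shown on slide 22.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def choose_tags(candidates: List[str], min_dist: int) -> List[str]:
    """Greedily keep each candidate that is at least min_dist from every kept tag."""
    chosen: List[str] = []
    for tag in candidates:
        if all(edit_distance(tag, kept) >= min_dist for kept in chosen):
            chosen.append(tag)
    return chosen


if __name__ == "__main__":
    tags = ["CAGCAA", "CAGCAT", "CCGTAG"]  # example tags from slide 22
    print(edit_distance("CAGCAA", "CAGCAT"))
    print(choose_tags(tags, min_dist=3))
```

A greedy filter depends on candidate order and will not in general find the largest possible tag set; it is used here only because it is short and easy to follow.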
  25. [Diagram: Plate A, Plate B, Plate C; Outer Tag 1, Outer Tag 2, Outer Tag 3.]
  26. [Diagram: Sample 1, Inner Tag 1; Sample 6, Inner Tag 6; Sample 12, Inner Tag 12.]
  27. Approach 1 + Approach 2. 164 outer × 7,198 inner = 1,180,472 samples; 164 outer × 7 inner = 1,148 samples. [Scale labels: Low, High.]
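The sample counts on slide 27 are the product of the outer and inner tag counts, and slides 25 and 26 pair outer tags with plates and inner tags with samples. A minimal sketch of that combinatorial scheme follows, assuming each read begins with the outer tag immediately followed by the inner tag. The read layout, the function names, and the tag-to-label dictionaries are all hypothetical; the slides do not specify them.

```python
from itertools import product
from typing import Dict, Optional, Tuple

Label = Tuple[str, str]  # (plate, sample)


def max_samples(n_outer: int, n_inner: int) -> int:
    """Number of distinct (outer, inner) tag pairs available."""
    return n_outer * n_inner


def build_lookup(outer: Dict[str, str],
                 inner: Dict[str, str]) -> Dict[Tuple[str, str], Label]:
    """Map every (outer tag, inner tag) pair to its (plate, sample) label."""
    return {
        (o_tag, i_tag): (plate, sample)
        for (o_tag, plate), (i_tag, sample) in product(outer.items(), inner.items())
    }


def assign_read(read: str, lookup: Dict[Tuple[str, str], Label],
                outer_len: int, inner_len: int) -> Optional[Label]:
    """Return the (plate, sample) for a read, or None if its tags are unknown.

    Assumes the layout outer tag + inner tag + insert, with exact tag matches.
    """
    o_tag = read[:outer_len]
    i_tag = read[outer_len:outer_len + inner_len]
    return lookup.get((o_tag, i_tag))


if __name__ == "__main__":
    print(max_samples(164, 7198))
    lookup = build_lookup({"CAGCAA": "Plate A"}, {"CCGTAG": "Sample 1"})
    print(assign_read("CAGCAACCGTAGACGTACGT", lookup, 6, 6))
```

Exact matching is the simplest choice; a pipeline that tolerates sequencing errors would compare tags by edit distance instead of using a direct dictionary lookup.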
  28. Plant DNA barcodes and a community phylogeny of a tropical forest dynamics plot in Panama. W. John Kress a,1, David L. Erickson a, F. Andrew Jones b,c, Nathan G. Swenson d, Rolando Perez b, Oris Sanjur b, and Eldredge Bermingham b. a Department of Botany, MRC-166, National Museum of Natural History, Smithsonian Institution, P.O. Box 37012, Washington, DC 20013-7012; b Smithsonian Tropical Research Institute, P.O. Box 0843-03092, Balboa Ancón, Republic of Panamá; c Imperial College London, Silwood Park Campus, Buckhurst Road, Ascot, Berkshire SL5 7PY, United Kingdom; and d Center for Tropical Forest Science - Asia Program, Harvard University Herbaria, 22 Divinity Avenue, Cambridge, MA 02138. Communicated by Daniel H. Janzen, University of Pennsylvania, Philadelphia, PA, September 3, 2009 (received for review May 13, 2009). The assembly of DNA barcode libraries is particularly relevant within species-rich natural communities for which accurate species identifications will enable detailed ecological forensic studies. In addition, well-resolved molecular phylogenies derived from these DNA barcode sequences have the potential to improve investigations of the mechanisms underlying community assembly and functional trait evolution. To date, no studies have effectively applied DNA barcodes sensu strictu in this manner. In this report, we demonstrate that a three-locus DNA barcode when applied to 296 species of woody trees, shrubs, and palms found within the 50-ha Forest Dynamics Plot on Barro Colorado Island (BCI), Panama, resulted in >98% correct identifications. These DNA barcode sequences are also used to reconstruct a robust community phylogeny employing a supermatrix method for 281 of the 296 plant […] the conserved coding locus will easily align over all taxa in a community sample to establish deep phylogenetic branches whereas the hypervariable region of the DNA barcode will align more easily within nested subsets of closely related species and permit relationships to be inferred among the terminal branches of the tree. In this respect a supermatrix design (8, 9) is ideal for using a mixture of coding genes and intergenic spacers for phylogenetic reconstruction across the broadest evolutionary distances, as in the construction of community phylogenies (10). We define a supermatrix as a phylogenetic matrix that may contain a high incidence of missing data and the data content for any one taxon is stochastic (11) (Fig. S1). Confidence of correct sequence alignment is critical in building such complex matrices and […] Kress et al. 2009, PNAS
  29. Subterranean Composition/Competition. 50 ha Barro Colorado Island plot. Soil samples from […] Funding: SI Scholarly Studies
  30. Use of MPS (meta-) barcoding • Determining root composition of soil cores by comparing reads against BCI plant barcode database
  31. [Diagram, 50 ha Barro Colorado Island plot. Labels: Rare, Common; Many Pathogenic Fungi, Few Pathogenic Fungi.] Funding: NSF Dimensions of Biodiversity
  32. Use of MPS (meta-) barcoding • Generating “voucher” barcode library of pathogenic fungi using MPS • Determining fungi infecting trees using MPS
  33. Caveats • Current reads on Illumina < 300 bp • Current reads on 454 < 500 bp • 600-800+ bp reads soon (FLX+) • Can’t always sequence entire amplicon • Ideally want short, informative barcodes • Trial and error to optimize read depth per sample within and across plates
  34. Collaborators • UCE Phylogenetics • John McCormack (Occidental College) • Travis Glenn (UGA) • Robb Brumfield (LSU) • Nick Crawford (Boston U.) • Mike Alfaro (UCLA)
  35. Collaborators • Root (meta-) barcoding • Andy Jones (Oregon State U.) • Steve Hubbell (UCLA/STRI) • Scott Mangan (STRI) • Ben Turner (STRI) • Jeff Wolf (UCLA)
  36. Collaborators • Pathogenic fungi (DimBio) • Steve Hubbell (UCLA/STRI) • Greg Gilbert (UCSC) • Travis Glenn (UGA) • Megan Saunders (UCSC)
  37. Thanks • Frontiers in Phylogenetics Program, NMNH • Biodiversity Genomics Initiative • SI Scholarly Studies Program (root funding) • Mike Braun • Noor White