Sequence Capture Overview, Biological Applications, &c.

Slide presentation supporting sequence capture workshop at Smithsonian MSC, January 16-20, 2012.


Brant Faircloth

January 29, 2012


  1. Biology by Seqcap. Brant C. Faircloth, University of California, Los Angeles

  2. The problem/value of MPS. You (7 days of 8 plates per day containing 5 multiplexed loci) vs. The Machine (Illumina HiSeq 2000 100 PE run, 100X depth for each amplicon)

  3. The value of MPS. 1.22 years of: 95 samples per plate; 5 multiplexed loci per plate; 8 plates per day; 7 days per week

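
A rough check of the scale implied by the 1.22-year figure can be sketched as below. It assumes the slide's numbers simply multiply together, that each sample-by-locus combination counts as one amplicon, and that a year has 365 working days (consistent with "7 days per week"). These are assumptions for illustration, not statements from the talk.

```python
# Bench throughput figures taken from the slide.
SAMPLES_PER_PLATE = 95
LOCI_PER_PLATE = 5       # multiplexed loci amplified on each plate
PLATES_PER_DAY = 8
YEARS = 1.22
DAYS_PER_YEAR = 365      # assumption: 7 days per week, every week of the year


def amplicons_per_day():
    """Sample-by-locus amplicons produced in one day at the bench."""
    return SAMPLES_PER_PLATE * LOCI_PER_PLATE * PLATES_PER_DAY


def amplicons_total(years=YEARS):
    """Amplicons produced over the given number of years."""
    return amplicons_per_day() * DAYS_PER_YEAR * years


if __name__ == "__main__":
    print("amplicons per day:", amplicons_per_day())
    print("amplicons in 1.22 years:", round(amplicons_total()))
```

Under these assumptions the bench total comes to roughly 1.7 million amplicons, which is the workload the slide sets against a single sequencing run.
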
  4. The value of MPS •Massively parallel ( = powerful) •Clonal

  5. MPS Costs [Bar charts for 3730xl, 454 (Ti), and HiSeq 2000: millions of reads per run (bar labels 1,000, 26, 0.0001) and cost per megabase (bar labels $0.10, $15.00, $1,500.00).] Source: Glenn 2011 Mol. Ecol. Res.

  6. Harnessing the Power • Sequence capture (today’s talk) • Genome sequencing • Amplicon generation and sequencing • RAD-tag and RAD-tag-like approaches • MIPS, EPIC, etc.

  7. Outline • Library Prep and Sequence Tagging • Sequence Capture • Uses of Seqcap [Overview] • Uses of Seqcap [Specific Approaches] • Computational Issues

  8. Anatomy of a Library. li•brar•y /ˈlīˌbrerē/: (randomly) sheared DNA having attached sequencing adapters

  9. Adapter: a synthetic DNA sequence, which differs by platform, for binding DNA to the sequencing substrate

  10. Adapter Binds to substrate DNA

  11. Read 1

  12. Read 2

  13. Preparing a library

  14. DNA Extraction

  15. DNA Extraction

  16. DNA Extraction

  17. Random Shearing Cut Sites

  18. End-repair P OH P OH

  19. +A +A (Adenine)

  20. Ligate Adapters A T A T T T

  21. Limited Cycle PCR T A 8-16 cycles, to minimize PCR bias

  22. Sequencing (Illumina) Substrate

  23. Sequence Tagging

  24. Distributing Output Genome Sequencing Biology

  25. The Problem Fred Harriet Griselda Joachín

  26. The Problem Fred Joachín Harriet Griselda ? ? ? ¿Qué?

  27. The Solution Fred Harriet Griselda Joachín FRED! FRED! GRISELDA! GRISELDA!

  28. Sequence Tagging CAGCAA CAGCAT CCGTAG Source: Binladen et al. 2007 PLoS One; Meyer et al. 2007 NAR; Hamady et al. 2008 Nature Methods

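
Sequence tagging lets pooled reads be sorted back to their source sample. A minimal sketch of that sorting step follows, assuming each read begins with a 6-nt tag and that only exact tag matches are accepted. The three tags are the ones shown on the slide; the pairing of tags with the sample names Fred, Harriet, and Griselda, and the function name `demultiplex`, are hypothetical.

```python
from collections import defaultdict

# Tags are from the slide; which sample carries which tag is hypothetical.
TAG_TO_SAMPLE = {
    "CAGCAA": "Fred",
    "CAGCAT": "Harriet",
    "CCGTAG": "Griselda",
}


def demultiplex(reads, tag_to_sample=TAG_TO_SAMPLE, tag_len=6):
    """Group reads by their leading tag (exact match only).

    Matched reads are stored with the tag trimmed off; reads whose tag is
    not recognized are kept whole under the key "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        tag = read[:tag_len]
        if tag in tag_to_sample:
            bins[tag_to_sample[tag]].append(read[tag_len:])
        else:
            bins["unassigned"].append(read)
    return dict(bins)


if __name__ == "__main__":
    pooled = ["CAGCAAACGTACGT", "CCGTAGTTTTGGGG", "CAGCATGGGGCCCC", "CAGCACAAAATTTT"]
    for sample, seqs in demultiplex(pooled).items():
        print(sample, seqs)
```

Exact matching is the simplest choice, but note that CAGCAA and CAGCAT differ at a single base, so one sequencing error can turn one tag into another.
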
  29. Problem with many tags C GCAA CCCGTAG Substitutions ✓ Insertion Deletion Source: Adey et al. 2010 Genome Biology; Faircloth and Glenn 2011 Unpublished

  30. Edit Distance Tags Insertions Deletions Substitutions ✓ ✓ ✓

  31. Edit tag distance: TATGCG vs. CGAGTT = 5. Required Edit Distance = 2 * (Errors) + 1; Correctable Errors = (Edit Distance - 1) / 2

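
The two formulas on this slide can be checked with a short sketch. It uses the standard Levenshtein (edit) distance, which counts insertions, deletions, and substitutions. The function names are illustrative, and integer division is assumed for the number of correctable errors.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def required_edit_distance(errors):
    """Minimum pairwise distance needed to correct the given number of errors."""
    return 2 * errors + 1


def correctable_errors(edit_distance):
    """Errors correctable at a given minimum distance (rounded down)."""
    return (edit_distance - 1) // 2


if __name__ == "__main__":
    d = levenshtein("TATGCG", "CGAGTT")
    print("distance:", d, "correctable errors:", correctable_errors(d))
```

For a whole tag set, the relevant figure is the smallest pairwise distance among all tags, since that pair limits how many errors can be corrected.
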
  32. Numbers of tags [Table of tag counts; cell values not recoverable from the slide text.] Source: Faircloth and Glenn 2011 Unpublished

  33. Sequence Capture [Overview]

  34. Sequence Capture Genome

  35. The Gist Biotin

  36. Probes Match Targets

  37. Hybridize Probe to Target

  38. Bind to Streptavidin Streptavidin

  39. Dissolve RNA Probe Streptavidin

  40. None
  41. Sequencing Capacity

  42. Uses of Seqcap [Overview]

  43. Genomic Expanses We can target one or more big regions (>10-15 kb)

  44. Exome Genome Exons We can target ALL transcribed genes

  45. Targeted Gene Sequencing Exons We can target “genes” coding for

  46. X X We can enrich cDNA for particular transcripts cDNA

  47. Efficient PCR Substitute: THOUSANDS of small regions

  48. Challenging (PCR) Templates Substitution Insertion Primer Primer

  49. Challenging (PCR) Templates Substitution Insertion

  50. Primordial Soup

  51. Primordial Soup

  52. Primordial Soup

  53. Off-target Capture + ≠

  54. Off-target Capture + =

  55. Reads → contigs +

  56. Off-target Capture What you want

  57. Uses of Seqcap [Specific Approaches]

  58. The problem/value of MPS. You (7 days of 8 plates per day containing 5 multiplexed loci) vs. The Machine (Illumina HiSeq 2000 100 PE run, 100X depth for each amplicon)

  59. What we can do... Machine Capacity Human Capacity

  60. What we want to do... Machine Capacity Human Capacity

  61. PCR is slow •Laborious •Expensive •“Universal” ? PCR-based

  62. Phylogen(omics|etics) • Want many loci • Meet sequencing output • Give representative sample of genome • Want alignable loci • Want “universal” loci

  63. Seqcap targets many loci: THOUSANDS of small regions

  64. PCR vs. Capture [Chart: Log(targets), axis from 1 to 1,000,000, for PCR, Biotinylated Oligos, MySelect 25k, SureSelect 55k, MySelect 200k, and SureSelect Exome 450k.]

  65. Many loci Alignable loci “Universal loci” ✓ ? ?

  66. Phylogen(omics|etics)

  67. UCE Discovery: Ultraconserved Elements in the Human Genome (Bejerano et al. 2004, Science)

    Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, David Haussler. There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
  68. human v. mouse v. rat

  69. Genome-Genome Alignment

  70. 100% conserved = UCEs Conserved Conserved ≥ 200 bp

  71. Many vertebrate alignments: 28-Way vertebrate alignment and conservation track in the UCSC Genome Browser (Miller et al. 2007, Genome Research)

    Webb Miller, Kate Rosenbloom, Ross C. Hardison, Minmei Hou, James Taylor, Brian Raney, Richard Burhans, David C. King, Robert Baertsch, Daniel Blankenberg, Sergei L. Kosakovsky Pond, Anton Nekrutenko, Belinda Giardine, Robert S. Harris, Svitlana Tyekucheva, Mark Diekhans, Thomas H. Pringle, William J. Murphy, Arthur Lesk, George M. Weinstock, Kerstin Lindblad-Toh, Richard A. Gibbs, Eric S. Lander, Adam Siepel, David Haussler, and W. James Kent. This article describes a set of alignments of 28 vertebrate genome sequences that is provided by the UCSC Genome Browser. The alignments can be viewed on the Human Genome Browser (March 2006 assembly), downloaded in bulk by anonymous FTP from hg18/multiz28way, or analyzed with the Galaxy server. This article illustrates the power of this resource for exploring vertebrate and mammalian evolution, using three examples. First, we present several vignettes involving insertions and deletions within protein-coding regions, including a look at some human-specific indels. Then we study the extent to which start codons and stop codons in the human sequence are conserved in …
  72. Genome-Genome Alignment

  73. Many conserved areas Conserved Conserved

  74. Can we capture? Conserved Conserved

  75. Interested in flanks Conserved ✓ ✓

  76. Primarily for SNPs Conserved A/C

  77. Secondarily for phylogeny Conserved

  78. How we started... Genome-Genome Alignment

  79. Define “UCEs” smaller Conserved Conserved ≥ 60 bp
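
The working definition here (a stretch of 100% conservation at least 60 bp long, versus at least 200 bp in the original UCE paper) can be illustrated with a toy scan over aligned sequences. This is a minimal sketch, assuming the input is a set of equal-length, already-aligned sequences. It stands in for neither the genome-genome alignment nor the BLAST step named on the slides, and the function name `conserved_runs` is hypothetical.

```python
def conserved_runs(aligned, min_len=60):
    """Return (start, end) half-open intervals where every aligned sequence
    has the same unambiguous base, keeping runs of at least min_len columns.
    """
    if not aligned:
        return []
    n = len(aligned[0])
    if any(len(seq) != n for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    runs = []
    run_start = None
    for i in range(n):
        column = {seq[i].upper() for seq in aligned}
        # Gaps and ambiguity codes break a run, as do mismatches.
        identical = len(column) == 1 and column <= set("ACGT")
        if identical:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    if run_start is not None and n - run_start >= min_len:
        runs.append((run_start, n))
    return runs


if __name__ == "__main__":
    toy = ["ACGTACGTAA", "ACGTTCGTAA", "ACGTACGTCA"]
    # A tiny min_len is used only so the toy alignment yields output.
    print(conserved_runs(toy, min_len=3))
```

Lowering `min_len` from 200 to 60 is the only change needed to move from the stricter definition to the looser one used in this part of the talk.
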

  80. Identifying UCEs Conserved Conserved BLAST ≥ 60 bp

  81. Buffer Short UCEs: Conserved Regions ≤ 120 bp. 60 bp + 30 bp + 30 bp = 120 bp

  82. Tile Long UCEs Conserved Regions ≥ 180 bp { 4
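
Buffering and tiling can be sketched on plain coordinates. The 120 bp length comes from the buffering slide; the slides do not state the spacing between tiled probes, so the 60 bp step used below is an assumption, as are the function names and the half-open `(start, end)` coordinate convention.

```python
def buffer_region(start, end, target_len=120):
    """Pad a short conserved region equally on both sides up to target_len.

    Regions already at least target_len long are returned unchanged.
    """
    length = end - start
    if length >= target_len:
        return (start, end)
    pad = target_len - length
    new_start = max(0, start - pad // 2)  # do not run off the sequence start
    return (new_start, new_start + target_len)


def tile_region(start, end, probe_len=120, step=60):
    """Cover a long region with overlapping probes of probe_len.

    step is an assumed spacing, not a value taken from the slides.
    """
    if end - start < probe_len:
        raise ValueError("region is shorter than one probe; buffer it instead")
    probes = []
    pos = start
    while pos + probe_len <= end:
        probes.append((pos, pos + probe_len))
        pos += step
    # Add a final probe flush with the region end if bases remain uncovered.
    if probes[-1][1] < end:
        probes.append((end - probe_len, end))
    return probes


if __name__ == "__main__":
    print(buffer_region(100, 160))   # a 60 bp region padded by 30 bp per side
    print(tile_region(0, 240))
```

A smaller `step` gives denser tiling and more probes per locus, which trades probe-set size against coverage of each target.
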

  83. Add Probes to DNA DNA Tuatara Probes

  84. Bind probes to DNA DNA Probe(s) bound to DNA

  85. Wash non-target DNA away DNA Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7

  86. Assemble reads → contigs Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7. “Throw out” reads not matching expected locus

  87. Align Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7

  88. Concatenated analysis Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7 +

  89. Gene tree → Species tree Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 +

  90. Not limited to UCEs • Target SNPs • Target genes • Target exons • Target exome • Target gene regulators • etc.

  91. Genotype by sequencing A/C Probe 1 Probe 2

  92. Computational Issues

  93. The problem/value of MPS. You vs. The Machine (Illumina HiSeq 2000 100 PE run, 100X depth for each amplicon)

  94. Storage: 250 GB per run, 16 GB per lane

  95. Everyone needs some programming skills; someone needs many programming skills

  96. Design probes: can be “easy” if a company does it, or can require skill to slice and design

  97. Group reads Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7: all custom software

  98. Assemble reads Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7: velvet, bowtie, bwa, etc.

  99. Align this many contigs Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7: all custom software

  100. Prepare large analyses Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7: custom software to build NEXUS. Loc1 Loc2 Loc3 Loc4: custom software to generate species, gene, and bootstrap trees

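
To give a sense of what "custom software to build NEXUS" involves, here is a minimal sketch that concatenates per-locus alignments and writes a NEXUS data block. It is not the software used for the talk. The input layout (a dict of locus name to a dict of taxon to aligned sequence), the function names, and the choice to fill absent taxa with `?` are all assumptions.

```python
def concatenate(loci):
    """Concatenate per-locus alignments into one sequence per taxon.

    loci maps locus name -> {taxon: aligned sequence}. Taxa missing from a
    locus are filled with '?' for the length of that locus.
    """
    taxa = sorted({taxon for aln in loci.values() for taxon in aln})
    matrix = {taxon: "" for taxon in taxa}
    for name in sorted(loci):
        aln = loci[name]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {name} are not the same length")
        n = lengths.pop()
        for taxon in taxa:
            matrix[taxon] += aln.get(taxon, "?" * n)
    return matrix


def to_nexus(matrix):
    """Format a {taxon: sequence} matrix as a NEXUS data block."""
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values())))
    width = max(len(taxon) for taxon in matrix)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={ntax} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    for taxon, seq in matrix.items():
        lines.append(f"  {taxon.ljust(width)}  {seq}")
    lines += ["  ;", "end;"]
    return "\n".join(lines)


if __name__ == "__main__":
    toy = {"Loc1": {"A": "ACGT", "B": "ACGA"}, "Loc2": {"A": "GG"}}
    print(to_nexus(concatenate(toy)))
```

With thousands of loci, the missing-data handling matters more than the formatting, because few taxa will be recovered at every locus.
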
  101. You also need computational power

  102. Machine Capacity Data Processing

  103. And the power should be flexible

  104. Assemble reads Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7: requires lots of RAM

  105. Gene tree → Species tree Loc1 Loc2 Loc3 Loc4: requires many cores

  106. Concatenated analysis Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7: long-running process

  107. Webtools/services help

  108. More goodies... Almost all “under-construction”

  109. More goodies... Faircloth et al. Syst Biol doi: 10.1093/sysbio/sys004 pmid: 22232343; McCormack et al. Genome Res doi: 10.1101/gr.125864.111 pmid: 22207614