Sequence Capture Overview, Biological Applications, &c.

Biology by Seqcap
Brant C. Faircloth
University of California - Los Angeles

The problem / The value of MPS
You / The Machine = 7 days of 8 plates per day containing 5 multiplexed loci
Illumina HiSeq 2000, 100 PE run, 100X depth for each amplicon

The value of MPS
1.22 years of:
• 95 samples per plate
• 5 multiplexed loci per plate
• 8 plates per day
• 7 days per week
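The workload on this slide can be turned into a rough amplicon count. The sketch below is only a back-of-the-envelope calculation: it assumes a 365-day year with no days off, and the function name `amplicons_by_hand` is made up for illustration.

```python
# Plate-based PCR throughput, using the numbers from the slide.
SAMPLES_PER_PLATE = 95
LOCI_PER_PLATE = 5
PLATES_PER_DAY = 8
YEARS = 1.22
DAYS_PER_YEAR = 365  # assumption: working 7 days per week, every week


def amplicons_by_hand(years=YEARS):
    """Return the number of sample-by-locus amplicons produced in `years`."""
    days = years * DAYS_PER_YEAR
    return days * PLATES_PER_DAY * LOCI_PER_PLATE * SAMPLES_PER_PLATE


if __name__ == "__main__":
    per_day = PLATES_PER_DAY * LOCI_PER_PLATE * SAMPLES_PER_PLATE
    print(f"{per_day} amplicons per day")
    print(f"{amplicons_by_hand():,.0f} amplicons in {YEARS} years")
```

Under these assumptions, that is 3,800 amplicons per day and roughly 1.7 million amplicons over 1.22 years.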

The value of MPS
• Massively parallel ( = powerful)
• Clonal
• Cheap

MPS Costs
[Figure: bar charts comparing the 3730xl, 454 (Ti), and HiSeq 2000 by millions of reads per run (data labels: 1,000; 26; 0.0001) and by cost per megabase (data labels: $0.10; $15.00; $1,500.00)]
Source: Glenn 2011 Mol. Ecol. Res.

Harnessing the Power
• Sequence capture (today’s talk)
• Genome sequencing
• Amplicon generation and sequencing
• RAD-tag and RAD-tag-like approaches
• MIPS, EPIC, etc.

Outline
• Library Prep and Sequence Tagging
• Sequence Capture
• Uses of Seqcap [Overview]
• Uses of Seqcap [Specific Approaches]
• Computational Issues

Anatomy of a Library li•brar•y /ˈlīˌbrerē/ (Randomly) sheared DNA having
attached sequencing adapters

Adapter A synthetic DNA sequence for binding DNA to the
sequencing substrate, which differs by platform

Adapter Binds to substrate DNA

Preparing a library

DNA Extraction

Random Shearing Cut Sites

End-repair P OH P OH

+A +A (Adenine)

Ligate Adapters A T A T T T

Limited Cycle PCR
8 - 16 cycles, to minimize PCR bias

Sequencing (Illumina) Substrate

Sequence Tagging

Distributing Output Genome Sequencing Biology

The Problem Fred Harriet Griselda Joachín

The Problem Fred Joachín Harriet Griselda ? ? ? ¿Qué?

The Solution Fred Harriet Griselda Joachín FRED! FRED! GRISELDA! GRISELDA!
HARRIET! HARRIET! JOACHÍN JOACHÍN

Sequence Tagging
CAGCAA CAGCAT CCGTAG
Source: Binladen et al. 2007 PLoS One; Meyer et al. 2007 NAR; Hamady et al. 2008 Nature Methods
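As a minimal sketch of how tagged reads get sorted back to their owners: the code below assumes each read starts with a 6 bp tag and requires an exact match. The three tags come from the slide; pairing them with these sample names, and the function name `demultiplex`, are hypothetical.

```python
from collections import defaultdict

# Hypothetical pairing of the slide's example tags with sample names.
TAG_TO_SAMPLE = {
    "CAGCAA": "Fred",
    "CAGCAT": "Harriet",
    "CCGTAG": "Griselda",
}


def demultiplex(reads, tag_to_sample, tag_length=6):
    """Group reads by their leading tag and strip the tag from each read.

    Reads whose tag is not an exact match go into the "unassigned" bin.
    """
    groups = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = tag_to_sample.get(tag, "unassigned")
        groups[sample].append(insert)
    return dict(groups)


if __name__ == "__main__":
    example_reads = ["CAGCAAACGT", "CCGTAGTTTT", "GGGGGGAAAA"]
    for sample, inserts in demultiplex(example_reads, TAG_TO_SAMPLE).items():
        print(sample, inserts)
```

Exact matching throws away any read with a sequencing error in its tag, which is one reason the distance between tags matters.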

Problem with many tags
C GCAA (Deletion), CCCGTAG (Insertion)
Substitutions ✓
Source: Adey et al. 2010 Genome Biology; Faircloth and Glenn 2011 Unpublished

Edit Distance Tags
Insertions ✓ Deletions ✓ Substitutions ✓

Edit tag distance
TATGCG vs. CGAGTT: edit distance = 5
Required Edit Distance = 2 * (Errors) + 1
Correctable Errors = (Edit Distance - 1) / 2
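The two formulas can be checked with a short sketch. It assumes "edit distance" here means the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1); the function names are illustrative and are not taken from any published tag-design software.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # deletion from a
                curr[j - 1] + 1,                    # insertion into a
                prev[j - 1] + (char_a != char_b),   # substitution or match
            ))
        prev = curr
    return prev[-1]


def correctable_errors(distance):
    """Correctable Errors = (Edit Distance - 1) / 2, rounded down."""
    return (distance - 1) // 2


def required_distance(errors):
    """Required Edit Distance = 2 * (Errors) + 1."""
    return 2 * errors + 1


def min_pairwise_distance(tags):
    """Smallest edit distance between any two tags in a set."""
    return min(edit_distance(a, b) for a, b in combinations(tags, 2))


if __name__ == "__main__":
    d = edit_distance("TATGCG", "CGAGTT")
    print(d, correctable_errors(d))
```

What limits error correction for a tag set is its minimum pairwise distance. The example tags CAGCAA and CAGCAT differ by a single substitution, so a set containing both cannot correct any errors.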

Numbers of tags
[Table: numbers of tags; cell values not recoverable from the extracted slide text]
Source: Faircloth and Glenn 2011 Unpublished

Sequence Capture [Overview]

Sequence Capture Genome

The Gist Biotin

Probes Match Targets

Hybridize Probe to Target

Bind to Streptavidin Streptavidin

Dissolve RNA Probe Streptavidin

Sequencing Capacity

Uses of Seqcap [Overview]

Genomic Expanses
We can target one or more big regions (>10-15 Kb)

Exome Genome Exons We can target ALL transcribed genes

Targeted Gene Sequencing Exons We can target “genes” coding for
RNAs

cDNA Enrichment
We can enrich cDNA for particular transcripts

Efficient PCR Substitute
THOUSANDS of small regions

Challenging (PCR) Templates Substitution Insertion Primer Primer

Challenging (PCR) Templates Substitution Insertion

Primordial Soup

Off-target Capture + ≠

Off-target Capture + =

Reads → contigs +

Off-target Capture What you want

Uses of Seqcap [Specific Approaches]

The problem / The value of MPS
You / The Machine = 7 days of 8 plates per day containing 5 multiplexed loci
Illumina HiSeq 2000, 100 PE run, 100X depth for each amplicon

What we can do... Machine Capacity Human Capacity

What we want to do... Machine Capacity Human Capacity

PCR is slow
• Laborious
• Expensive
• “Universal” ?
PCR-based

Phylogen(omics|etics)
• Want many loci
• Meet sequencing output
• Give representative sample of genome
• Want alignable loci
• Want “universal” loci

Seqcap targets many loci
THOUSANDS of small regions

PCR vs. Capture
[Figure: number of targets on a log scale (1 to 1,000,000) for PCR, Biotinylated Oligos, MySelect 25k, SureSelect 55k, MySelect 200k, and SureSelect Exome 450k]

Many loci ✓
Alignable loci ?
“Universal loci” ?

Phylogen(omics|etics)

UCE Discovery
Ultraconserved Elements in the Human Genome
Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, David Haussler
There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
Source: Bejerano et al. 2004, Science

human v. mouse v. rat

Genome-Genome Alignment

100% conserved = UCEs Conserved Conserved ≥ 200 bp
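The definition above (100% identity, no insertions or deletions, at least 200 bp) can be sketched as a scan over the columns of an alignment. This is an illustrative toy, not the method used in the cited paper: it assumes the genomes are already aligned into equal-length strings with "-" for gaps, and `find_conserved_runs` is a made-up name.

```python
def find_conserved_runs(aligned_seqs, min_length=200):
    """Return (start, end) half-open column intervals where every sequence
    has the same A/C/G/T base and no gaps, for at least `min_length` columns.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    runs = []
    run_start = None
    for col in range(length):
        column = {seq[col].upper() for seq in aligned_seqs}
        conserved = len(column) == 1 and column <= set("ACGT")
        if conserved and run_start is None:
            run_start = col                      # a conserved run begins
        elif not conserved and run_start is not None:
            if col - run_start >= min_length:    # the run ended; keep if long
                runs.append((run_start, col))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and length - run_start >= min_length:
        runs.append((run_start, length))
    return runs


if __name__ == "__main__":
    toy = ["ACGTACGTTT", "ACGTACGATT", "ACGTAC-TTT"]  # human, mouse, rat stand-ins
    print(find_conserved_runs(toy, min_length=4))
```

Lowering `min_length` mirrors the later step in this talk, where “UCEs” are redefined as shorter conserved regions.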

Many vertebrate alignments
28-Way vertebrate alignment and conservation track in the UCSC Genome Browser
Webb Miller, Kate Rosenbloom, Ross C. Hardison, Minmei Hou, James Taylor, Brian Raney, Richard Burhans, David C. King, Robert Baertsch, Daniel Blankenberg, Sergei L. Kosakovsky Pond, Anton Nekrutenko, Belinda Giardine, Robert S. Harris, Svitlana Tyekucheva, Mark Diekhans, Thomas H. Pringle, William J. Murphy, Arthur Lesk, George M. Weinstock, Kerstin Lindblad-Toh, Richard A. Gibbs, Eric S. Lander, Adam Siepel, David Haussler, and W. James Kent
This article describes a set of alignments of 28 vertebrate genome sequences that is provided by the UCSC Genome Browser. The alignments can be viewed on the Human Genome Browser (March 2006 assembly) at http://genome.ucsc.edu, downloaded in bulk by anonymous FTP from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz28way, or analyzed with the Galaxy server at http://g2.bx.psu.edu. This article illustrates the power of this resource for exploring vertebrate and mammalian evolution, using three examples. First, we present several vignettes involving insertions and deletions within protein-coding regions, including a look at some human-specific indels. Then we study the extent to which start codons and stop codons in the human sequence are conserved in ...
Source: Miller et al. 2007, Genome Research

Genome-Genome Alignment

Many conserved areas Conserved Conserved

Can we capture? Conserved Conserved

Interested in flanks Conserved ✓ ✓

Primarily for SNPs Conserved A/C

Secondarily for phylogeny Conserved

How we started... Genome-Genome Alignment

Define “UCEs” smaller Conserved Conserved ≥ 60 bp

Identifying UCEs Conserved Conserved BLAST ≥ 60 bp

Buffer Short UCEs
Conserved Regions ≤ 120 bp
30 bp + 60 bp + 30 bp = 120 bp

Tile Long UCEs
Conserved Regions ≥ 180 bp: 4 probes
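The buffer-and-tile idea can be sketched in a few lines. Several things here are assumptions rather than details from the slides: a 120 bp probe length (suggested by the buffering example), an equal split of the buffer between the two flanks, a 60 bp tiling step, and the function name `design_probes`. The sketch does not try to reproduce the "4 probes" example, whose tiling density the slides do not give.

```python
def design_probes(start, end, probe_length=120, step=60):
    """Return half-open (probe_start, probe_end) intervals covering the
    conserved region [start, end).

    Short regions are buffered with flanking sequence up to `probe_length`;
    longer regions are tiled with overlapping probes. Coordinates are not
    clipped to contig boundaries.
    """
    length = end - start
    if length <= 0:
        raise ValueError("end must be greater than start")

    if length <= probe_length:
        # Buffer: pad both sides so the probe spans exactly probe_length.
        pad = probe_length - length
        probe_start = start - pad // 2
        return [(probe_start, probe_start + probe_length)]

    # Tile: step across the region, then anchor a final probe at its end.
    probes = []
    position = start
    while position + probe_length < end:
        probes.append((position, position + probe_length))
        position += step
    probes.append((end - probe_length, end))
    return probes


if __name__ == "__main__":
    print(design_probes(1000, 1060))   # a 60 bp conserved region
    print(design_probes(0, 180))       # a 180 bp conserved region
```

For the 60 bp region the probe picks up 30 bp of flank on each side, matching the buffering example. A smaller `step` gives denser tiling and more probes per region.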

Add Probes to DNA DNA Tuatara Probes

Bind probes to DNA DNA Probe(s) bound to DNA

Wash non-target DNA away DNA Loc1 Loc2 Loc3 Loc4 Loc5
Loc6 Loc7

Assemble reads → contigs Loc1 Loc2 Loc3 Loc4 Loc5 Loc6
Loc7 “Throw out” reads not matching expected locus

Align Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7

Concatenated analysis Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7 +
MrBayes

Gene tree → Species tree Loc1 Loc2 Loc3 Loc5 Loc6
Loc4 +

Not limited to UCEs
• Target SNPs
• Target genes
• Target exons
• Target exome
• Target gene regulators
• etc.

Genotype by sequencing A/C Probe 1 Probe 2

Computational Issues

The problem / The value of MPS
You / The Machine
Illumina HiSeq 2000, 100 PE run, 100X depth for each amplicon

Storage 250 GB per run 16 GB per lane

Everyone needs some programming skills Someone needs many programming skills

Design probes
Can be “easy” if a company does it, or
can require skill to slice and design

Group reads Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7 All
custom software

Assemble reads Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7 velvet,
bowtie, bwa, etc.

Align this many contigs Loc1 Loc2 Loc3 Loc4 Loc5 Loc6
Loc7 All custom software

Prepare large analyses Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7
Loc1 Loc2 Loc3 Loc4 Custom software to build NEXUS Custom software to generate species, gene, and bootstrap trees
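As a rough idea of what the "build NEXUS" step involves, the sketch below concatenates per-locus alignments into one matrix and writes a NEXUS data block with one charset per locus. It is not the author's software. The input layout (a dict of locus name to a dict of taxon to aligned sequence), the "?" padding for taxa missing a locus, and both function names are assumptions for illustration.

```python
def concatenate_loci(loci):
    """Concatenate per-locus alignments.

    `loci` maps locus name -> {taxon: aligned sequence}. Taxa missing from a
    locus are padded with "?". Returns (matrix, partitions), where partitions
    holds (locus, first_column, last_column) with 1-based inclusive columns.
    """
    taxa = sorted({taxon for aln in loci.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    position = 1
    for name in sorted(loci):
        alignment = loci[name]
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {name} differ in length")
        n_columns = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(alignment.get(taxon, "?" * n_columns))
        partitions.append((name, position, position + n_columns - 1))
        position += n_columns
    matrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return matrix, partitions


def to_nexus(matrix, partitions):
    """Format a concatenated matrix and its partitions as NEXUS text."""
    n_char = len(next(iter(matrix.values())))
    width = max(len(taxon) for taxon in matrix)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(matrix)} nchar={n_char};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"  {taxon.ljust(width)}  {seq}" for taxon, seq in matrix.items()]
    lines += ["  ;", "end;", "begin sets;"]
    lines += [f"  charset {name} = {first}-{last};" for name, first, last in partitions]
    lines.append("end;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"Loc1": {"A": "ACGT", "B": "ACGA"}, "Loc2": {"A": "GG"}}
    print(to_nexus(*concatenate_loci(toy)))
```

Keeping the per-locus boundaries as charsets means the same file can feed a partitioned concatenated analysis, or be split back into loci for gene-tree estimation.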

You also need computational power

Machine Capacity Data Processing

And the power should be flexible

Assemble reads Loc1 Loc2 Loc3 Loc4 Loc5 Loc6 Loc7 requires
lots of RAM

Loc1 Loc2 Loc3 Loc4 requires many cores Gene tree →
Species tree

Concatenated analysis long running process Loc1 Loc2 Loc3 Loc4 Loc5
Loc6 Loc7

Webtools/services help

More goodies... http://bad-dna.org http://faircloth-lab.org http://github.com/faircloth-lab/ Almost all “under-construction”

More goodies...
Faircloth et al. Syst Biol doi: 10.1093/sysbio/sys004 pmid: 22232343
McCormack et al. Genome Res doi: 10.1101/gr.125864.111 pmid: 22207614
