a population - whether or not the organism is in an environment in which that mutation would be advantageous (environment does not cause adaptive mutations)
mutation leading to resistance of the virus to the drug is beneficial to the virus
E.g., synonymous change in DNA encoding a protein
Which mutations will be picked up by
- genetic drift?
- natural selection?
no aa change) missense (from one aa to another aa) nonsense (sense codon -> Stop) nonsynonymous (aa altering) Point mutations (nucleotide substitutions) MC1R haemoglobin
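The substitution classes listed above can be told apart by translating the codon before and after the change. A minimal Python sketch, assuming the standard genetic code; the name `classify_point_mutation` and the "stop-loss" label are illustrative choices, not taken from the slides:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_point_mutation(ref_codon, alt_codon):
    """Classify a single-nucleotide substitution within one codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be three bases long")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("codons must differ at exactly one position")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # no amino acid change
    if alt_aa == "*":
        return "nonsense"     # sense codon -> stop
    if ref_aa == "*":
        return "stop-loss"    # stop -> sense codon (label is an assumption)
    return "missense"         # one amino acid to another
```

For instance, `classify_point_mutation("GAG", "GTG")` (Glu to Val, the sickle-cell haemoglobin substitution) is classed as missense, and both missense and nonsense changes fall under nonsynonymous.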
affect a single nucleotide, though often relaxed to include other types of small scale variation • SNP discovery requires resolving a new sequence relative to an existing one, so almost certainly need sequencing of some kind • Need multiple observations on the same base
international SNP consortium, ssahaSNP, ... • The quality value of the SNP base should be >= 23 • The Q value for the 5 bases on either side of the SNP should be >= 15 • Only one mismatch is allowed in the flanking ten bases
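The three quality rules above translate directly into a filter on one aligned read. A minimal Python sketch, assuming the read and reference are already aligned without gaps and that `quals` holds Phred scores; the function name and the decision to reject candidates lacking a full flank are assumptions, not part of the slide:

```python
def passes_neighbourhood_quality(read, quals, ref, pos,
                                 min_snp_q=23, min_flank_q=15,
                                 flank=5, max_flank_mismatches=1):
    """Apply the neighbourhood quality rules to a candidate SNP at `pos`.

    read, ref: aligned, equal-length base strings (no gaps).
    quals:     Phred quality per read base.
    """
    # Assumption: a candidate needs a complete flank on both sides.
    if pos - flank < 0 or pos + flank >= len(read):
        return False
    # Only positions where the read differs from the reference are candidates.
    if read[pos] == ref[pos]:
        return False
    # Rule 1: quality of the SNP base itself.
    if quals[pos] < min_snp_q:
        return False
    flank_idx = [i for i in range(pos - flank, pos + flank + 1) if i != pos]
    # Rule 2: quality of the five bases on either side.
    if any(quals[i] < min_flank_q for i in flank_idx):
        return False
    # Rule 3: at most one mismatch among the ten flanking bases.
    mismatches = sum(read[i] != ref[i] for i in flank_idx)
    return mismatches <= max_flank_mismatches
```

The thresholds are keyword arguments so that other published cut-offs can be tried without changing the logic.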
to (draft) reference human genome • Discard paralogous alignments • At sites where variation is observed, use a probabilistic model to evaluate whether the site is likely to be a real SNP • Incorporates error probability in base calls • Confirm by further sequencing
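To make the "probabilistic model" bullet concrete, here is a toy Bayesian calculation in Python. It is a simplified sketch, not the published algorithm: it assumes independent reads, converts each Phred score Q to an error probability 10^(-Q/10), spreads errors evenly over the three other bases, and compares a "monomorphic" hypothesis with a "two alleles at equal frequency" hypothesis. The prior of 0.001 and the name `snp_posterior` are assumptions for illustration.

```python
from itertools import combinations
from math import prod


def snp_posterior(bases, quals, prior_snp=0.001):
    """Posterior probability that an alignment column is polymorphic.

    bases: observed base in each read at this site, e.g. "AAAGGG".
    quals: Phred quality of each observed base.
    """
    def lik(obs, q, true_base):
        err = 10 ** (-q / 10)          # Phred score -> error probability
        return 1 - err if obs == true_base else err / 3

    column = list(zip(bases, quals))
    # Monomorphic: every read derives from the same true base (uniform prior).
    mono = sum(prod(lik(o, q, b) for o, q in column) for b in "ACGT") / 4
    # Polymorphic: each read derives from one of two alleles with equal chance.
    pairs = list(combinations("ACGT", 2))
    poly = sum(prod(0.5 * lik(o, q, a) + 0.5 * lik(o, q, b) for o, q in column)
               for a, b in pairs) / len(pairs)
    return prior_snp * poly / (prior_snp * poly + (1 - prior_snp) * mono)
```

A column of six agreeing high-quality bases yields a posterior near zero, whereas three `A` and three `G` calls at Q40 yield a posterior near one; a single low-quality discordant base stays well below either extreme, which is why the slide's last bullet still asks for confirmation by further sequencing.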
²Division of Dermatology, St. Louis, Missouri, USA. Correspondence should be addressed to G.T.M. or P.-Y.K.

[…] finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.

[…] duplicated elsewhere in the genome may give rise to false SNP predictions, and the presence of such sequence paralogues points to difficulties during marker development. We devised a Bayesian (ref. 15) […]

Fig. 1 Application of the POLYBAYES procedure to EST data. a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.
(Figure labels: genomic anchor, ESTs, paralogues, native ESTs, candidate SNP, STS, trace from CHM1 DNA, trace from DNA pool, confirmed SNP.)
[…] short stretch of DNA […] [ver]sions of the same chromosome region in different people. Most of the DNA […] identical in these chromosomes, but three bases are shown where […] [oc]curs. Each SNP has two possible alleles; the first SNP in panel a has the […] T.
b, Haplotypes. A haplotype is made up of a particular combination of […] three SNPs that are shown in panel a. For this region, most of the chr[omosomes] […] population survey turn out to have haplotypes 1–4.
c, Tag SNPs. Gen[otyping] […] three tag SNPs out of the 20 SNPs is sufficient to identify these four h[aplotypes] uniquely. For instance, if a particular chromosome has the pattern A–[…] three tag SNPs, this pattern matches the pattern determined for haploty[pe] […]
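The tag SNP idea in the caption above (a few SNPs suffice to tell the common haplotypes apart) can be sketched as a small search. The Python below is a brute-force illustration, assuming haplotypes are given as equal-length strings of alleles at the variable sites only; the function name and the example haplotypes are hypothetical and are not the ones in the figure.

```python
from itertools import combinations


def minimal_tag_snps(haplotypes):
    """Return the smallest set of SNP column indexes whose alleles
    distinguish every haplotype from every other, or None if impossible.

    Brute force over column subsets: fine for tens of SNPs, not for thousands.
    """
    n_snps = len(haplotypes[0])
    for k in range(1, n_snps + 1):
        for cols in combinations(range(n_snps), k):
            patterns = {tuple(h[c] for c in cols) for h in haplotypes}
            if len(patterns) == len(haplotypes):
                return cols
    return None


# Hypothetical example: four haplotypes over five biallelic SNPs.
# Columns 0, 1 and 4 are perfectly correlated, as are columns 2 and 3.
example_haplotypes = ["ATTGC", "ATCAC", "GCTGT", "GCCAT"]
tags = minimal_tag_snps(example_haplotypes)
```

Here two SNPs (columns 0 and 2) are enough, because the remaining columns carry no extra information; this redundancy between nearby SNPs is what makes tagging economical.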
[…] sequencing
The 1000 Genomes Project Consortium*

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Understanding the relationship between genotype and phenotype is one of the central goals in biology and medicine.
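To give a feel for what a de novo substitution rate of roughly 1e-8 per base pair per generation means, the back-of-envelope calculation below multiplies it by the number of bases a child inherits. It is a sketch only: the haploid genome size of about 3.1e9 bp is an assumption not stated in the abstract, and the accessible portion of the genome in any real study is smaller.

```python
def expected_de_novo_substitutions(rate_per_bp, haploid_genome_bp, ploidy=2):
    """Expected number of new base substitutions in one child.

    rate_per_bp:       mutations per base pair per generation.
    haploid_genome_bp: size of one genome copy (assumed value, see above).
    ploidy:            copies inherited, one from each parent.
    """
    return rate_per_bp * haploid_genome_bp * ploidy


# Roughly 1e-8 per bp per generation over two copies of ~3.1 Gb.
expected = expected_de_novo_substitutions(1e-8, 3.1e9)
```

Under these assumptions the expectation is on the order of tens of new substitutions per child, which is why very deep trio sequencing is needed to separate true de novo events from sequencing errors.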
The reference human genome sequence [1] provides a foundation for the study of human genetics, but systematic investigation of human variation requires full knowledge of DNA sequence variation across the entire spectrum of allele frequencies and types of DNA differences. Substantial progress […] significantly to the genetic architecture of disease, but it has not yet been possible to study them systematically [7–9]. Meanwhile, advances in DNA sequencing technology have enabled the sequencing of individual genomes [10–13], illuminating the gaps in the first generation of databases that contain mostly common variant sites. A much more complete catalogue of human DNA variation is a prerequisite to understand fully
are witnessing today: A revolution in medical science whose implications far surpass even the discovery of antibiotics, the first great technological triumph of the 21st century.” (Tony Blair)
• “Having the genetic code is not a very important moment other than it's the beginning of what we can do with it.” (Craig Venter)
• the benefits of human genome mapping will include “a new understanding of genetic contributions to human disease” and “the development of rational strategies for minimizing or preventing disease phenotypes altogether.” (Francis Collins)
• “it is fair to say that the Human Genome Project has not yet directly affected the health care of most individuals.” (Francis Collins, more recently)
that over 10,000 SNPs could be genotyped in parallel via a padlock probe scheme requiring single base gap-fills at interrogated positions and four-color readout on microarrays (27). To adapt this approach for genomic partitioning, Shendure and colleagues explored a […]

Figure 6 Gapped molecular inversion probes. Panels: a Anneal; b Gap fill polymerization; c Gap fill ligation; d Exonuclease selection; e Probe release; f Amplification. (a) Probes are designed with a target-specific sequence at the ends, and an internal sequence that is common to all MIPs. Probes hybridize to single-stranded genomic DNA, leaving a gap over the target region. The gap can range from a single nucleotide for SNP genotyping, as in References 26, 27, to several hundred nucleotides for exon […]

[…] desired scale, the 55,000 requ[…] obtained as a complex mixtu[…] by synthesis on and release fr[…] of an Agilent microarray. Afte[…] via 15-bp universal sequences […] 100-mers were converted to […] through a series of restriction d[…] 70-mer MIP consisted of uniqu[…] ing sequences flanking a comm[…] The individual targets ranged […] 60 to 191 bp. With the amplifi[…] estimate that the yield of one […] array is sufficient to support […] independent capture reactio[…] hybridization to genomic D[…] and circularization, and exonuc[…] capture products were rolling […] converted into shotgun sequen[…] sequenced on the Illumina Ge[…] Analysis of the resulting dat[…] that: (a) specificity was high, as […] that could be confidently mapp[…] overlapping with one of the […] (b) completeness and specific[…] as only ∼10,000 of the 55,00[…] detectably captured, and the a[…] which individual targets were o[…] over several logs; and (c) geno[…] was high at homozygous pos[…] at heterozygous positions, like[…] stochastic effects with poor cap[…] We have subsequently obse[…] optimizations markedly impro[…] mance of this strategy (64a).

Annu. Rev. Genom. Human Genet. 2009.10:263-284. Downloaded from arjournals.annualreviews.org by EMORY UNIVERSITY on 09/29/09. For personal use only.
(Turner et al. ARGHG, 2009)
Shotgun library or PCR-amplified metagenomic library inserts + biotinylated probes → Hybridize in solution → Capture probes on streptavidin-coated beads → Wash, elute captured DNAs → Amplify by PCR with common primers → Sequence products

Figure 8 In solution hybrid selection. Target DNA is prepared as an in vitro shotgun library, with common adaptors flanking genomic DNA fragments. The library is hybridized in solution to a set of biotinylated probes. After hybridization, biotinylated probes are captured with streptavidin beads. Beads are washed to remove any nonspecific, unbound library molecules. Multitemplate PCR with primers directed at the common adaptors is used to amplify eluted target molecules before high-throughput sequencing. Adapted from Noonan et al. (50). Images reprinted with permission from AAAS.

Parallel capture of 29 of 35 human targets was demonstrated, with the caveat that these sequences were already known to be present in the Neanderthal library via sequencing of […]

[…] of this approach include the following: (a) because the RNA baits are single-stranded and present in only one orientation, a high concentration and molar excess can drive the kinet- […]

(Turner et al. ARGHG, 2009)
(55) and the RNA-DNA hybrid selection methods described above). However, in this section we focus on reports from several groups that apply the programmable microarray itself as a selective substrate for solid-phase capture-by-hybridization.

Genomic DNA → Fragmentation → Random library → Repair, adaptor ligation → Adaptor-ligated fragments → Target capture (custom MGS array) → Wash, elution → Selected target region → Amplification with single primer pair → Enriched target region

Figure 9 On array hybrid selection. In vitro shotgun libraries are generated from genomic DNA, with common adapters flanking each fragment. The library is hybridized to oligos tethered on a high-density programmable microarray. Unbound molecules are washed from the array, followed by heat-based elution of specifically hybridized material. Multitemplate PCR with primers directed at the common adaptors is used to amplify eluted target molecules before high-throughput sequencing.

[…] quences designed from the referen[…] genome to tile region(s) of interest at […] sity (i.e., 1 to 10 bp spacing) for i[…] hybridization, while excluding non[…] repetitive sequences from considerat[…] hybridization for ∼65 h at 42 °C, an[…] wash steps, heat-based elution at 95 °C […] out to recover specifically hybridized […] Universal primers corresponding to […] mon adaptors are used for PCR amp[…] after which the target-enriched shotg[…] can be sequenced. Albert et al. (2) designed and […] several capture arrays, one focused o[…] ing 6726 discontiguous exons and ad[…] quences from 660 genes (total targ[…] 5 Mb), and the remainder focused on […] ous intervals of varying sizes at the B[…] cus (200 kb, 500 kb, 1 Mb, 2 Mb, a[…] with the same array format but differ[…] ties of probe spacing.
With three re[…] the exon-focused array, sequencing da[…] to 115 Mb of sequence generated fro[…] richment libraries by 454 sequencin[…] relatively consistent performance, wi[…] 77% of reads mapping to targets, an[…] 96% of targets overlapped by at lea[…] For capture directed at a contiguo[…] (200 kb to 5 Mb), the fraction of re[…] ping to the target appeared to be […] with the size of the target, i.e., 14% […] kb target vs 64% for a 5-Mb target. […] given that the 200-kb target is 25-fo[…] than the 5-Mb target, the calcula[…]
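The comparison above (14% of reads on a 200-kb target versus 64% on a 5-Mb target) is easier to judge as fold enrichment over random sampling of the genome. The sketch below works this out in Python; the genome size of 3.1e9 bp and the function name are assumptions for illustration, and pairing the 14% figure with the 200-kb target is read from the surrounding fragment.

```python
def fold_enrichment(on_target_fraction, target_bp, genome_bp=3.1e9):
    """Ratio of the observed on-target read fraction to the fraction
    expected if reads were sampled uniformly from the genome.

    genome_bp is an assumed haploid genome size, not a value from the text.
    """
    expected_fraction = target_bp / genome_bp
    return on_target_fraction / expected_fraction


small_target = fold_enrichment(0.14, 200_000)     # about 2,170-fold
large_target = fold_enrichment(0.64, 5_000_000)   # about 400-fold
```

Although a smaller share of reads lands on the 200-kb target, its fold enrichment is higher under these assumptions, because the target is 25 times smaller than the 5-Mb one.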
attribu[…] exon targets […] single captur[…] from withou[…] regional cap[…] baits, that is […] contributing […] slightly long[…] bases compa[…] contributed […] including gra[…] instead of 1[…] Supplementa[…] Effects of b[…] Separating th[…]

[Plot: x-axis, base position (0 to 10,000); y-axis, sequence coverage (0 to 350)]
Figure 3 Sequence coverage along a contiguous target. Shown is base-by-base sequence coverage along a typical 11-kb segment (chr4:118635000–118646000) out of 1.7 Mb. Sequence corresponding to bait is marked in blue. Segments that had more than 40 repeat-masked bases per 170-base […]
[…] are detecting structural variation, and […]

UNLIMITED ACCESSIBILITY

FIGURE 3: ADDING THE SEQUENCE INDEX TO A LIBRARY
1. During sample preparation, adapters are ligated to the DNA fragments. One adapter contains the sequencing primer site for application read 1 (Rd1 SP).
2. Prepared samples are amplified via PCR using two universal primers. One primer contains an attachment site (P5) for the flow cell, while the other contains the sequencing primer sites for the index read (Index SP) and for application read 2 (Rd2 SP).
3. A third primer in the PCR adds the Index as well as a second flow cell attachment site (P7) to the PCR product shown in step 2.
4. The indexed library is ready for sequencing using the Genome Analyzer system.

Illumina Genome Analyzer System
Introducing index sequences onto DNA fragments enables sequencing of 96 different samples on a single flow cell. This greatly increases experimental scalability, while maintaining extremely low error rates and conserving read length.

HIGH-THROUGHPUT SEQUENCING
Using the industry’s leading next-generation sequencing technology, the Genome Analyzer system offers proven, exceptionally high data yields and the largest number of error-free reads. Harnessing this sequencing power in a multiplex fashion increases experimental throughput while reducing time and cost. This is especially useful when targeting genomic sub-regions or studying small genomes.

To make multiplexed sequencing on the Genome Analyzer available to any laboratory, Illumina offers the Multiplexing Sample Preparation Oligonucleotide Kit and the Multiplexing Sequencing Primers and PhiX Control Kit. In the multiplexed sequencing method, DNA libraries are “tagged” with a unique identifier, or index, during sample preparation.
Multiple samples are then pooled into a single lane on a flow cell and sequenced together in one Genome Analyzer […] for individual downstream analysis. Using this approach, sample identification is highly accurate.

APPLICATIONS
Multiplexed sequencing on the Genome Analyzer can be used in a wide range of applications. For […]

HIGHLIGHTS OF ILLUMINA MULTIPLEXED SEQUENCING
• Fast, High-Throughput Strategy: Automated sequencing of 96 samples per flow cell
• Cost-Effective Method: Multi-sample pooling improves productivity by reducing time and reagent use
• High-Quality Data: Accurate maintenance of read length for unknown sequences
• Simplified Analysis: Automated […]

FIGURE 1: MULTIPLEXED SEQUENCING PROCESS
Sample multiplexing involves a total of three sequencing reads, including a separate index read, which is generated automatically on the Genome Analyzer equipped with the Paired-End Module.
A (READ 1): Application read 1 (dotted line) is generated using the Read 1 Sequencing Primer (Rd1 SP).
B (INDEX READ): The read 1 product is removed and the Index Sequencing Primer (Index SP) is annealed to the same strand to produce the 6-bp index read (dotted line).
C (READ 2): If a paired-end read is required, the original template strand is used to regenerate the complementary strand. Then, the original strand is removed and the complementary strand acts as a template for application read 2 (dotted line), primed by the Read 2 Sequencing Primer (Rd2 SP).
Pipeline Analysis software identifies the index sequence from each cluster so that the application reads can be assigned to a single sample. Hatch marks represent the flow cell surface.
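The last step described above, assigning each cluster's application reads to a sample by its index read, can be sketched in a few lines of Python. This is an illustration rather than the vendor's pipeline: the function names, the optional mismatch tolerance, and the example index sequences are all assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, sample_indexes, max_mismatches=0):
    """Return the sample whose index matches, or None if no sample or
    more than one sample matches within the allowed mismatches."""
    hits = [sample for sample, idx in sample_indexes.items()
            if len(idx) == len(index_read)
            and hamming(index_read, idx) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


def demultiplex(clusters, sample_indexes, max_mismatches=0):
    """Group application reads by sample; unassigned reads go under None.

    clusters: iterable of (index_read, application_read) pairs.
    """
    bins = {sample: [] for sample in sample_indexes}
    bins[None] = []
    for index_read, read in clusters:
        sample = assign_sample(index_read, sample_indexes, max_mismatches)
        bins[sample].append(read)
    return bins


# Hypothetical 6-bp indexes and reads, for illustration only.
indexes = {"sample_1": "ATCACG", "sample_2": "CGATGT", "sample_3": "TTAGGC"}
clusters = [("ATCACG", "AAAA"), ("CGATGT", "CCCC"), ("ATCACT", "GGGG")]
strict = demultiplex(clusters, indexes)
tolerant = demultiplex(clusters, indexes, max_mismatches=1)
```

With exact matching the third cluster is left unassigned; allowing one mismatch rescues it for `sample_1`, at the cost of a higher risk of misassignment if indexes are too similar to each other.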
Method 1 (TRIO): Non-Mendelian alleles in trios
[…] ref, 12 x -AA
Child: 14 x ref, 3 x +A, 1 x +AAA
Choose parsimonious child alleles from mother and father:
- explain largest number of reads in M+F+C
- alleles supported by >= 2 reads in each of M,C / each of F,C
Remaining child alleles classified as errors
(Lunter 2013)
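One way to read the rule on this slide is as a small search over (maternal allele, paternal allele) pairs. The Python sketch below follows that reading: each transmitted allele must have at least two reads in the child and in the transmitting parent, the pair explaining the most reads across mother, father and child wins, and any other allele seen in the child is counted as an error. The function name is illustrative, and the parental read counts in the example are hypothetical because they are cut off in the slide; only the child counts come from it.

```python
from itertools import product


def classify_child_alleles(mother, father, child, min_reads=2):
    """Pick the most parsimonious inherited allele pair for the child.

    mother, father, child: dicts mapping allele label -> read count.
    Returns ((maternal_allele, paternal_allele) or None, error_alleles).
    """
    # Candidates need >= min_reads in the child and in the given parent.
    from_mother = [a for a in child
                   if child[a] >= min_reads and mother.get(a, 0) >= min_reads]
    from_father = [a for a in child
                   if child[a] >= min_reads and father.get(a, 0) >= min_reads]
    best, best_score = None, -1
    for a_m, a_f in product(from_mother, from_father):
        chosen = {a_m, a_f}
        # Reads explained in M+F+C by the chosen allele(s).
        score = sum(counts.get(a, 0)
                    for counts in (mother, father, child) for a in chosen)
        if score > best_score:
            best, best_score = (a_m, a_f), score
    inherited = set(best) if best else set()
    errors = {a: n for a, n in child.items() if a not in inherited}
    return best, errors


# Child counts from the slide; parental counts are hypothetical.
child = {"ref": 14, "+A": 3, "+AAA": 1}
mother = {"ref": 11, "+A": 5}
father = {"ref": 20}
alleles, errors = classify_child_alleles(mother, father, child)
```

In this example the child is explained as `+A` from the mother and `ref` from the father, leaving the single `+AAA` read classified as an error.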