BMMB554: Variant Calling

Anton Nekrutenko

March 21, 2016

Transcript

  1. Mutations ‣ Mutations = stable changes in the genetic material

    (DNA) transmitted from parent to offspring ‣ Ultimate origin of all genetic variation
  2. Mutations and the magnitude of their effect No detectable effect

    Small effect Drastic effect E.g., synonymous change in DNA encoding a protein
  3. Good, bad, and neutral mutations Advantageous Neutral Disadvantageous A mutation

    leading to resistance of the virus to the drug is beneficial to the virus E.g., synonymous change in DNA encoding a protein
  4. Mutations: important but weak force ‣ Initially occurs in just

    one individual ‣ Takes many generations to spread through population ‣ Other processes must cause it to increase in frequency within a population
  5. Mutations occur at random - with respect to organisms in

    a population - whether or not the organism is in an environment in which that mutation would be advantageous (environment does not cause adaptive mutations)
  6. Three possible outcomes of a mutation:

    1) Lost 2) Polymorphic in a population 3) Fixation
  7. How polymorphic is human DNA? • ~0.1% of nucleotides differ

    when a DNA sequence is compared between two humans • ~1-1.5% of nucleotides differ when a DNA sequence is compared between human and chimpanzee
  8. Mutations: important but weak force • Initially occur in just

    one individual • Take many generations to spread through population • Other processes must cause them to increase in frequency within a population
  9. Good, bad, and neutral mutations Advantageous Neutral Disadvantageous (deleterious) A

    mutation leading to resistance of the virus to the drug is beneficial to the virus E.g., synonymous change in DNA encoding a protein Which mutations will be picked up by - genetic drift? - natural selection?
  10. Classification of point mutations: effect on a protein synonymous (silent,

    no aa change) missense (from one aa to another aa) nonsense (sense codon -> Stop) nonsynonymous (aa altering) Point mutations (nucleotide substitutions) MC1R haemoglobin
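
The classification above can be sketched in a few lines of Python. This is only an illustration: the codon table holds just the four codons used here (not the full genetic code), the function name `classify_substitution` is hypothetical, and the reference codon is assumed to be a sense codon.

```python
# Minimal codon table: only the codons used in this example (an assumption,
# not the full genetic code).
CODON_TABLE = {
    "GAA": "Glu",
    "GAG": "Glu",
    "GTG": "Val",
    "TAG": "Stop",
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a point mutation by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # silent, no aa change
    if alt_aa == "Stop":
        return "nonsense"     # sense codon -> Stop
    return "missense"         # one aa replaced by another


print(classify_substitution("GAG", "GAA"))  # synonymous
print(classify_substitution("GAG", "GTG"))  # missense
print(classify_substitution("GAG", "TAG"))  # nonsense
```

Missense and nonsense changes together make up the nonsynonymous (aa altering) class.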
  11. Drift Kent Holsinger, UConn very small population N = 10

    small population N = 1,000 small populations are more affected by chance
  12. Random Genetic Drift: Consequences ‣ Since real populations are finite

    in size, the genetic make-up of offspring will by chance differ from that of the parents’ generation ‣ This effect is stronger in small populations
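
A minimal simulation sketch of drift, assuming a Wright-Fisher model (diploid population of constant size, non-overlapping generations, no selection or mutation). The model choice, the function names and the parameter values are assumptions made for illustration; the slides do not name a specific model.

```python
import random


def wright_fisher(n_individuals, generations, p0=0.5, seed=0):
    """Return the allele-frequency trajectory under pure drift."""
    rng = random.Random(seed)
    copies = 2 * n_individuals          # diploid: 2N gene copies
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each gene copy in the next generation is drawn at random
        # from the current allele frequency (binomial sampling).
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
        trajectory.append(p)
    return trajectory


def fraction_absorbed(n_individuals, generations=50, replicates=20):
    """Fraction of replicate populations where the allele was lost or fixed."""
    done = 0
    for seed in range(replicates):
        final = wright_fisher(n_individuals, generations, seed=seed)[-1]
        if final in (0.0, 1.0):
            done += 1
    return done / replicates


print("N = 10:  ", fraction_absorbed(10))
print("N = 1000:", fraction_absorbed(1000))
```

With these settings the allele should be lost or fixed in a much larger share of the N = 10 replicates than of the N = 1,000 replicates, which is the sense in which small populations are more affected by chance.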
  13. Forces of Evolution 1. Mutations 2. Natural selection 3. Genetic

    drift Changes in the genetic make-up of a population Evolution
  14. Discovering SNPs • Single nucleotide polymorphisms (SNPs) • Variants that

    affect a single nucleotide, though often relaxed to include other types of small scale variation • SNP discovery requires resolving a new sequence relative to an existing one, so almost certainly need sequencing of some kind • Need multiple observations on the same base
  15. SNP = mismatch. But … mismatches can be • True

    SNP • PCR error during library construction • Base calling error • Misalignment • Reference error
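
Because a single mismatch can come from any of the error sources above, candidate SNPs are usually required to have several independent observations. The sketch below is a toy illustration only: reads are assumed to be gap-free and already aligned (given as a 0-based start and a sequence), and `candidate_snps` and `min_obs` are hypothetical names, not part of any tool mentioned in the slides.

```python
from collections import Counter, defaultdict


def candidate_snps(reference, reads, min_obs=2):
    """List sites where a non-reference base is seen at least min_obs times.

    reads: iterable of (start, sequence) pairs, 0-based, gap-free alignments.
    Returns tuples (position, ref_base, alt_base, alt_count, depth).
    """
    pileup = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    candidates = []
    for pos in sorted(pileup):
        depth = sum(pileup[pos].values())
        for base, count in sorted(pileup[pos].items()):
            if base != reference[pos] and count >= min_obs:
                candidates.append((pos, reference[pos], base, count, depth))
    return candidates


reference = "ACGTACGTAC"
reads = [
    (0, "ACGTAC"),
    (2, "GTATGT"),   # T at position 5 (reference has C)
    (3, "TATGTA"),   # T at position 5 again
    (4, "ACGTAC"),
    (1, "CGAACG"),   # A at position 3, seen only once
]
print(candidate_snps(reference, reads))  # [(5, 'C', 'T', 2, 5)]
```

The single mismatch at position 3 is not reported, while the mismatch at position 5 is supported by two reads and passes the threshold.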
  16. SNPs, haplotypes and tag SNPs. a, SNPs. Shown is a

    short stretch of DNA [...] versions of the same chromosome region in different people. Most of the DNA [...] identical in these chromosomes, but three bases are shown where [...] occurs. Each SNP has two possible alleles; the first SNP in panel a has the [...] and T. b, Haplotypes. A haplotype is made up of a particular combination of [...] three SNPs that are shown in panel a. For this region, most of the chromosomes [...] population survey turn out to have haplotypes 1–4. c, Tag SNPs. Genotyping [...] three tag SNPs out of the 20 SNPs is sufficient to identify these four haplotypes uniquely. For instance, if a particular chromosome has the pattern A–[...] three tag SNPs, this pattern matches the pattern determined for haplotype [...]
  17. ARTICLE doi:10.1038/nature09534 A map of human genome variation from population-scale

    sequencing The 1000 Genomes Project Consortium* The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research. Understanding the relationship between genotype and phenotype is one of the central goals in biology and medicine.
The reference human genome sequence [1] provides a foundation for the study of human genetics, but systematic investigation of human variation requires full knowledge of DNA sequence variation across the entire spectrum of allele frequencies and types of DNA differences. Substantial progress [...] significantly to the genetic architecture of disease, but it has not yet been possible to study them systematically [7–9]. Meanwhile, advances in DNA sequencing technology have enabled the sequencing of individual genomes [10–13], illuminating the gaps in the first generation of databases that contain mostly common variant sites. A much more complete catalogue of human DNA variation is a prerequisite to understand fully
  18. 1000 Genomes project • Discover all common human variation: 95%

    of variation down to 1% MAF • In gene regions, down to 0.5% to 0.1% • Three pilot studies:
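
A back-of-the-envelope sketch of why sample size matters for finding variants at a given minor allele frequency (MAF). It assumes chromosomes are sampled independently and that an allele present in the sample is always detected, which low-coverage sequencing does not guarantee. The sample size of 179 is the low-coverage pilot from the abstract above; the function name is hypothetical.

```python
def prob_allele_sampled(freq: float, n_individuals: int) -> float:
    """P(an allele at frequency `freq` occurs at least once in 2n chromosomes)."""
    return 1.0 - (1.0 - freq) ** (2 * n_individuals)


for freq in (0.01, 0.005, 0.001):
    p = prob_allele_sampled(freq, 179)
    print(f"MAF {freq:.3f}: P(present among 179 individuals) = {p:.3f}")
```

The probability falls quickly as the allele gets rarer, so rarer variants need more individuals to be sampled at all.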
  19. 1000 Genomes • Excluding pilot data, the May 2011 data

    release contains sequence data for 1094 individuals! • Most data is exon targeted • Varying levels of coverage and number of datasets: • 93 NA19068 93 NA19000 80 NA19066 78 NA18986 77 NA18982 72 NA20502 72 NA19065 72 NA18988 72 NA18983 72 NA18950 72 NA12286 72 NA11933 69 NA20539 66 NA20519 66 NA20515 66 NA19190 66 NA12046 64 NA18984 64 NA12829 63 NA20756 63 NA20754... • http://browser.1000genomes.org
  20. Average individual variation • Average individual differs from reference at

    ~10k to ~11k non-synonymous sites (plus 10k to 12k synonymous sites) • 190–210 in-frame indels, 80–100 premature stop codons, 40–50 splice-site-disrupting variants and 220–250 deletions that shift reading frame • 50–100 variants classified by the Human Gene Mutation Database as causing inherited disorders
  21. Economist June 17 2010 First human genome: 10 years, ~$3 billion

    A genome every 2 days UK£ 3000
  22. • “Let us be in no doubt about what we are witnessing today: A revolution in medical science whose implications far surpass even the discovery of antibiotics, the first great technological triumph of the 21st century.” (Tony Blair)
    • “Having the genetic code is not a very important moment other than it's the beginning of what we can do with it.” (Craig Venter)
    • The benefits of human genome mapping will include “a new understanding of genetic contributions to human disease” and “the development of rational strategies for minimizing or preventing disease phenotypes altogether.” (Francis Collins)
    • “It is fair to say that the Human Genome Project has not yet directly affected the health care of most individuals.” (Francis Collins, more recently)
  23. • 15 month old child with severe disease • No

    diagnosis, no clinical management • Sequencing: mutation found • Suggests therapy, child cured March 2011:
  24. Genomic partitioning / capture • Goal: extract a particular region

    or regions of a large DNA molecule for deep sequencing • (Or at least, highly enrich for those regions)
  25. Gapped molecular inversion probes (Turner et al. ARGHG, 2009)

    [...] Hardenbol et al. demonstrated that over 10,000 SNPs could be genotyped in parallel via a padlock probe scheme requiring single base gap-fills at interrogated positions and four-color readout on microarrays (27). To adapt this approach for genomic partitioning, Shendure and colleagues explored a [...]
    Figure 6: Gapped molecular inversion probes. Panels: a Anneal; b Gap fill polymerization; c Gap fill ligation; d Exonuclease selection; e Probe release; f Amplification. (a) Probes are designed with a target-specific sequence at the ends, and an internal sequence that is common to all MIPs. Probes hybridize to single-stranded genomic DNA, leaving a gap over the target region. The gap can range from a single nucleotide for SNP genotyping, as in References 26, 27, to several hundred nucleotides for exon [...]
    Annu. Rev. Genom. Human Genet. 2009.10:263-284.
  26. Hybridization capture in solution (Turner et al. ARGHG, 2009)

    Figure labels: Shotgun library or PCR amplified metagenomic library inserts; Biotinylated probes; Hybridize in solution; Capture probes on streptavidin-coated beads; Wash, elute captured DNAs; Amplify by PCR with common primers; Sequence products.
    Figure 8: In solution hybrid selection. Target DNA is prepared as an in vitro shotgun library, with common adaptors flanking genomic DNA fragments. The library is hybridized in solution to a set of biotinylated probes. After hybridization, biotinylated probes are captured with streptavidin beads. Beads are washed to remove any nonspecific, unbound library molecules. Multitemplate PCR with primers directed at the common adaptors is used to amplify eluted target molecules before high-throughput sequencing. Adapted from Noonan et al. (50). Images reprinted with permission from AAAS.
    Parallel capture of 29 of 35 human targets was demonstrated, with the caveat that these sequences were already known to be present in the Neanderthal library via sequencing of [...] of this approach include the following: (a) because the RNA baits are single-stranded and present in only one orientation, a high concentration and molar excess can drive the kinet[...]
  27. Hybridization capture on array (Turner et al. ARGHG, 2009)

    [...] probe (55) and the RNA-DNA hybrid selection methods described above). However, in this section we focus on reports from several groups that apply the programmable microarray itself as a selective substrate for solid-phase capture-by-hybridization.
    Figure labels: Fragmentation; Genomic DNA; Random library; Repair, adaptor ligation; Target capture; Adaptor-ligated fragments; Custom MGS array; Wash, elution; Selected target region; Amplification with single primer pair; Enriched target region.
    Figure 9: On array hybrid selection. In vitro shotgun libraries are generated from genomic DNA, with common adapters flanking each fragment. The library is hybridized to oligos tethered on a high-density programmable microarray. Unbound molecules are washed from the array, followed by heat-based elution of specifically hybridized material. Multitemplate PCR with primers directed at the common adaptors is used to amplify eluted target molecules before high-[...]
  28. Solution hybridization enrichment

    Figure 3: Sequence coverage along a contiguous target (x-axis: base position; y-axis: sequence coverage). Shown is base-by-base sequence coverage along a typical 11-kb segment (chr4:118635000–118646000) out of 1.7 Mb. Sequence corresponding to bait is marked in blue. Segments that had more than 40 repeat-masked bases per 170-base [...]
  29. Multiplexing • Goal: allow multiple samples to be sequenced in

    a single run • Attach a unique identifier to each fragment of each sample, which can later be used to determine what sample a given read comes from
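
A minimal sketch of the read-assignment step, assuming each read comes with its index read and that an index may differ from a known sample index by at most one base. The index sequences, sample names, function names and the mismatch tolerance are all hypothetical choices for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_by_index, max_mismatches=1):
    """Assign each (index_read, insert_read) pair to a sample.

    A read is assigned only when exactly one known index is within
    max_mismatches; otherwise it goes to the 'undetermined' bin.
    """
    bins = {sample: [] for sample in sample_by_index.values()}
    bins["undetermined"] = []
    for index_read, insert_read in reads:
        hits = [
            sample
            for index, sample in sample_by_index.items()
            if len(index) == len(index_read)
            and hamming(index, index_read) <= max_mismatches
        ]
        target = hits[0] if len(hits) == 1 else "undetermined"
        bins[target].append(insert_read)
    return bins


sample_by_index = {"ATCACG": "sample_A", "CGATGT": "sample_B"}
reads = [
    ("ATCACG", "ACGTTGCA"),   # exact match to sample_A
    ("CGATGA", "TTGCAACG"),   # one mismatch from sample_B's index
    ("GGGGGG", "AAAACCCC"),   # matches nothing
]
print(demultiplex(reads, sample_by_index))
```

Allowing a mismatch recovers reads with a sequencing error in the index, at the cost of requiring indexes that differ from each other by more than that tolerance.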
  30. Illumina Genome Analyzer System

    Introducing index sequences onto DNA fragments enables sequencing of 96 different samples on a single flow cell. This greatly increases experimental scalability, while maintaining extremely low error rates and conserving read length.
    HIGH-THROUGHPUT SEQUENCING: Using the industry’s leading next-generation sequencing technology, the Genome Analyzer system offers proven, exceptionally high data yields and the largest number of error-free reads. Harnessing this sequencing power in a multiplex fashion increases experimental throughput while reducing time and cost. This is especially useful when targeting genomic sub-regions or studying small genomes. To make multiplexed sequencing on the Genome Analyzer available to any laboratory, Illumina offers the Multiplexing Sample Preparation Oligonucleotide Kit and the Multiplexing Sequencing Primers and PhiX Control Kit. In the multiplexed sequencing method, DNA libraries are “tagged” with a unique identifier, or index, during sample preparation. Multiple samples are then pooled into a single lane on a flow cell and sequenced together in one Genome Analyzer for individual downstream analysis. Using this approach, sample identification is highly accurate.
    APPLICATIONS: Multiplexed sequencing on the Genome Analyzer can be used in a wide range of applications. For [...]
    HIGHLIGHTS OF ILLUMINA MULTIPLEXED SEQUENCING: • Fast, High-Throughput Strategy: Automated sequencing of 96 samples per flow cell • Cost-Effective Method: Multi-sample pooling improves productivity by reducing time and reagent use • High-Quality Data: Accurate maintenance of read length for unknown sequences • Simplified Analysis: Automated [...]
    FIGURE 3: ADDING THE SEQUENCE INDEX TO A LIBRARY. 1. During sample preparation, adapters are ligated to the DNA fragments. One adapter contains the sequencing primer site for application read 1 (Rd1 SP). 2. Prepared samples are amplified via PCR using two universal primers. One primer contains an attachment site (P5) for the flow cell, while the other contains the sequencing primer sites for the index read (Index SP) and for application read 2 (Rd2 SP). 3. A third primer in the PCR adds the Index as well as a second flow cell attachment site (P7) to the PCR product shown in step 2. 4. The indexed library is ready for sequencing using the Genome Analyzer system.
    FIGURE 1: MULTIPLEXED SEQUENCING PROCESS. Sample multiplexing involves a total of three sequencing reads, including a separate index read, which is generated automatically on the Genome Analyzer equipped with the Paired-End Module. A: Application read 1 (dotted line) is generated using the Read 1 Sequencing Primer (Rd1 SP). B: The read 1 product is removed and the Index Sequencing Primer (Index SP) is annealed to the same strand to produce the 6-bp index read (dotted line). C: If a paired-end read is required, the original template strand is used to regenerate the complementary strand. Then, the original strand is removed and the complementary strand acts as a template for application read 2 (dotted line), primed by the Read 2 Sequencing Primer (Rd2 SP). Pipeline Analysis software identifies the index sequence from each cluster so that the application reads can be assigned to a single sample. Hatch marks represent the flow cell surface.
  31. Variant calling • Aim: produce variant calls (w.r.t. reference), and

    genotype calls • True variants usually easy to spot • But: SNPs easier than indels easier than SVs • And: Sufficient coverage required • Divergence / diversity often low (human: 0.1%) • False positives are an issue Lunter 2013
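
One way to see why sufficient coverage is required: at a heterozygous site each read samples one of the two alleles, so the number of reads carrying the variant is roughly binomial. The sketch below assumes unbiased sampling (probability 0.5 per read), no sequencing errors, and a hypothetical rule that at least two supporting reads are needed.

```python
from math import comb


def prob_at_least(k: int, depth: int, p: float = 0.5) -> float:
    """P(at least k of `depth` reads carry the variant allele)."""
    return sum(
        comb(depth, i) * p**i * (1 - p) ** (depth - i)
        for i in range(k, depth + 1)
    )


for depth in (4, 8, 15, 30):
    print(f"depth {depth:2d}: P(>= 2 variant reads) = {prob_at_least(2, depth):.4f}")
```

At depth 4 almost a third of heterozygous sites would fail this two-read rule by chance alone, while at depth 30 essentially none would.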
  32. Data issues • Primary data • PCR errors (base errors)

    • SOLiD: reference bias (reference-based color decoding) • Base quality calibration • Indel errors • Overlaps • Duplicates • Primers • Alignment • Base-level misalignments around indels • Reference / mapping • Unrepresented seg dups • Repetitive sequence Lunter 2013
  34. Method 1 (TRIO): Non-Mendelian alleles in trios

    Mother: 20 x ref, 2 x +A. Father: 12 x ref, 12 x -AA. Child: 14 x ref, 3 x +A, 1 x +AAA. Choose parsimonious child alleles from mother and father: - explain largest number of reads in M+F+C - alleles supported by >= 2 reads in each of M,C / each of F,C. Remaining child alleles classified as errors. Lunter 2013
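
The read-support rule on this slide can be sketched with the slide's own counts. This is a simplification: it applies only the ">= 2 reads in the child and in at least one parent" test and does not implement the parsimony criterion (explaining the largest number of reads). The function name is hypothetical.

```python
def classify_child_alleles(mother, father, child, min_reads=2):
    """Split child alleles into parent-explained alleles and likely errors.

    Each argument maps an allele label to its read count.
    """
    explained, errors = [], []
    for allele, n_child in child.items():
        via_mother = mother.get(allele, 0) >= min_reads
        via_father = father.get(allele, 0) >= min_reads
        if n_child >= min_reads and (via_mother or via_father):
            explained.append(allele)
        else:
            errors.append(allele)
    return explained, errors


mother = {"ref": 20, "+A": 2}
father = {"ref": 12, "-AA": 12}
child = {"ref": 14, "+A": 3, "+AAA": 1}

explained, errors = classify_child_alleles(mother, father, child)
print("explained by parents:", explained)  # ['ref', '+A']
print("classified as errors:", errors)     # ['+AAA']
```

Here the child's single +AAA read has no support in either parent and is counted as an error.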
  35. Variant detection: Many current methods use a Bayesian model which combines several sources of information

    • Sequencing provides observation quality estimates ◦ base quality = prob(observation | sequence) • Biology provides prior expectations about polymorphism ◦ θ = p(site polymorphic) = ~1e-3 for humans • And also population structure ◦ which matters if we call several samples together Garrison 2013
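
Base qualities are normally reported on the Phred scale, Q = -10 log10(P_error). The sketch below converts between the two, assuming FASTQ quality strings use the Phred+33 ASCII offset. The offset and the function names are assumptions for illustration; the slide does not specify an encoding.

```python
def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded quality string (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


quals = decode_qualities("II5!")
print(quals)                                # [40, 40, 20, 0]
print([phred_to_error(q) for q in quals])
```

A quality of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a quality of 40 to 1 in 10,000.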
  36. Polybayes (Marth et al. 1999) • Used EST databases aligned

    to (draft) reference human genome • Discard paralogous alignments • At sites where variation is observed, use a probabilistic model to evaluate whether the site is likely to be a real SNP • Incorporates error probability in base calls • Confirm by further sequencing
  37. Washington University 1Department of Genetics and Genome Sequencing Center and 2Division of Dermatology, St. Louis, Missouri, USA. Correspondence should be addressed to G.T.M. or P.-Y.K.

    [...] finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery. [...] duplicated elsewhere in the genome may give rise to false SNP predictions, and the presence of such sequence paralogues points to difficulties during marker development. We devised a Bayesian [15] [...]
    Fig. 1: Application of the POLYBAYES procedure to EST data. a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.
  38. The functional effect of variants depends on other nearby variants on the same haplotype

    AGG GAG CTG = Arg Glu Leu (reference); AGG TAG CTG = Arg Ter (apparent); AGG TTG CTG = Arg Leu Leu (actual). OTOF gene: mutations cause profound recessive deafness. Apparent nonsense variant, one YRI homozygote. Actually a block substitution that results in a missense substitution. Daniel MacArthur. Garrison 2013
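
The example on this slide can be reproduced by translating the sequence with one or both substitutions applied. The sketch uses the slide's nine bases with 0-based positions; the codon table contains only the codons that occur here, and the function names are hypothetical.

```python
# Only the codons appearing in this example (not the full genetic code).
CODON_TABLE = {"AGG": "Arg", "GAG": "Glu", "CTG": "Leu", "TAG": "Ter", "TTG": "Leu"}


def apply_snvs(sequence: str, snvs: dict) -> str:
    """Apply single-nucleotide substitutions given as {position: new_base}."""
    bases = list(sequence)
    for position, alt in snvs.items():
        bases[position] = alt
    return "".join(bases)


def translate(sequence: str) -> list:
    """Translate codon by codon, stopping after a termination codon."""
    peptide = []
    for i in range(0, len(sequence) - 2, 3):
        residue = CODON_TABLE[sequence[i:i + 3]]
        peptide.append(residue)
        if residue == "Ter":
            break
    return peptide


reference = "AGGGAGCTG"
apparent = apply_snvs(reference, {3: "T"})          # first variant on its own
actual = apply_snvs(reference, {3: "T", 4: "T"})    # both variants, same haplotype

print(translate(reference))  # ['Arg', 'Glu', 'Leu']
print(translate(apparent))   # ['Arg', 'Ter']
print(translate(actual))     # ['Arg', 'Leu', 'Leu']
```

Annotating each substitution independently reports a premature stop, while annotating the haplotype as a whole gives a missense change.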
  39. The model

    • Bayesian model estimates the probability of polymorphism at a locus given input data and the population mutation rate (~pairwise heterozygosity) and assumption of “neutrality” (random mating).
    • Following Bayes theorem, the probability of a specific set of genotypes over some number of samples is
      ◦ $P(G \mid R) = \frac{P(R \mid G)\,P(G)}{P(R)}$
    • Which in FreeBayes we extend to
      ◦ $P(G, S \mid R) = \frac{P(R \mid G, S)\,P(G)\,P(S)}{P(R)}$
      ◦ $G$ = genotypes, $R$ = reads, $S$ = locus is well-characterized/mapped
      ◦ $P(R \mid G, S)$ is our data likelihood, $P(G)$ is our prior estimate of the genotypes, $P(S)$ is our prior estimate of the mappability of the locus, $P(R)$ is a normalizer.
    Garrison 2013
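
A toy, single-sample version of the first formula, P(G|R) = P(R|G) P(G) / P(R), for one diploid biallelic site. This is not the FreeBayes implementation: the P(S) term is left out, reads are treated as independent, an erroneous base is assumed equally likely to be any of the three other bases, and the genotype prior built from θ = 1e-3 is a simple assumed form. All names are hypothetical.

```python
THETA = 1e-3  # assumed prior probability that a site is polymorphic


def base_likelihood(observed: str, allele: str, error: float) -> float:
    """P(observed base | true allele), given the base-call error probability."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_posteriors(reads, ref, alt, theta=THETA):
    """Posterior over RR, RA, AA for reads given as (base, error_prob) pairs."""
    genotypes = {"RR": (ref, ref), "RA": (ref, alt), "AA": (alt, alt)}
    # Assumed prior: heterozygote theta, homozygous alternate theta / 2.
    priors = {"RR": 1.0 - 1.5 * theta, "RA": theta, "AA": theta / 2.0}

    joint = {}
    for name, (a1, a2) in genotypes.items():
        likelihood = 1.0
        for base, error in reads:
            # Each read comes from either chromosome with probability 1/2.
            likelihood *= 0.5 * base_likelihood(base, a1, error) \
                        + 0.5 * base_likelihood(base, a2, error)
        joint[name] = likelihood * priors[name]      # P(R|G) * P(G)

    total = sum(joint.values())                       # P(R), the normalizer
    return {name: value / total for name, value in joint.items()}


balanced = [("A", 0.01)] * 5 + [("G", 0.01)] * 5
one_odd_read = [("A", 0.01)] * 10 + [("G", 0.01)]

print(genotype_posteriors(balanced, ref="A", alt="G"))
print(genotype_posteriors(one_odd_read, ref="A", alt="G"))
```

With five reads for each allele the heterozygous genotype dominates despite its small prior; with a single discordant read the homozygous reference genotype remains the most probable.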
  40. The process

    • Parse alleles (small haplotypes) from alignments using CIGAR strings • Pick suitable alleles (very weak input filters to improve runtime) • Build haplotypes across target locus • Generate genotype likelihoods • Sample a posterior space around the data likelihood maximum ◦ update genotype estimates and iterate (hill-climbing posterior search) until convergence on maximum a posteriori genotyping over all samples • Output a record and do it again Garrison 2013
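
The first step, parsing alleles from alignments using CIGAR strings, can be illustrated with a small stand-alone parser. It is a sketch rather than FreeBayes code: it uses 0-based reference positions, reports each mismatch, insertion and deletion separately instead of building haplotypes, and the function names are hypothetical. The operation letters (M, I, D, N, S, H, P, =, X) follow the SAM format.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]


def alleles_from_alignment(reference: str, start: int, cigar: str, read: str):
    """Return (ref_pos, ref_allele, alt_allele) for each difference."""
    alleles = []
    ref_i, read_i = start, 0
    for length, op in parse_cigar(cigar):
        if op in "M=X":                      # aligned bases: check for mismatches
            for k in range(length):
                ref_base, read_base = reference[ref_i + k], read[read_i + k]
                if ref_base != read_base:
                    alleles.append((ref_i + k, ref_base, read_base))
            ref_i += length
            read_i += length
        elif op == "I":                      # bases present only in the read
            alleles.append((ref_i, "", read[read_i:read_i + length]))
            read_i += length
        elif op == "D":                      # bases present only in the reference
            alleles.append((ref_i, reference[ref_i:ref_i + length], ""))
            ref_i += length
        elif op == "S":                      # soft clip: consumes read only
            read_i += length
        elif op == "N":                      # skipped region: consumes reference only
            ref_i += length
        # H (hard clip) and P (padding) consume neither sequence.
    return alleles


reference = "ACGTACGTACGT"
print(alleles_from_alignment(reference, 2, "3M1I2M2D2M", "GTATCACG"))
# [(5, '', 'T'), (6, 'G', 'A'), (7, 'TA', '')]
```

The example read carries an inserted T, a G to A mismatch, and a two-base deletion.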