
BMMB554: Variant Calling

Anton Nekrutenko

March 21, 2016

Transcript

  1. Re-sequencing

    View Slide

  2. Transitions and transversions
    A G
    T C
    transition
    transition
    transversions
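
The diagram above can be expressed as a small classifier. This is a minimal sketch: the function name `classify_substitution` is illustrative, and it assumes single-base substitutions written as A, C, G or T.

```python
# Purines (A, G) and pyrimidines (C, T): a change within one class is a
# transition, a change between classes is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= BASES:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("C", "T"), ("A", "T"), ("G", "C")]:
        print(ref, "->", alt, classify_substitution(ref, alt))
```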

    View Slide

  3. Mutations

    View Slide

  4. Mutations
    ‣ Mutations = stable changes in the genetic
    material (DNA) transmitted from parent to
    offspring
    ‣ Ultimate origin of all genetic variation

    View Slide

  5. Mutations and the
    magnitude of their effect
    No detectable effect Small effect Drastic effect
    E.g., synonymous change
    in DNA encoding a
    protein

    View Slide

  6. Good, bad, and neutral
    mutations
    Advantageous Neutral Disadvantageous
    A mutation
    leading to resistance of
    the virus to the drug is
    beneficial to the virus
    E.g., synonymous change
    in DNA encoding a
    protein

    View Slide

  7. Mutations: important but
    weak force
    ‣ Initially occurs in just one individual
    ‣ Takes many generations to spread through
    population
    ‣ Other processes must cause it to increase in
    frequency within a population

    View Slide

  8. Mutations occur at random
    - with respect to organisms in a population
    - whether or not the organism is in an
    environment in which that mutation would
    be advantageous (environment does not
    cause adaptive mutations)

    View Slide

  9. Outcomes of mutations

    View Slide

  10. Three possible outcomes of a
    mutation:


    1) Lost 

    2) Polymorphic in a population

    3) Fixation

    View Slide

  11. Mutation

    View Slide

  12. Lost
    Mutation

    View Slide

  13. Mutation

    View Slide

  14. Polymorphism
    (both mutant
    and wild type
    alleles are
    present in a
    population)
    Mutation

    View Slide

  15. Polymorphism
    Mutation

    View Slide

  16. Polymorphism
    Mutation

    View Slide

  17. Polymorphism
    Mutation

    View Slide

  18. Polymorphism
    Mutation

    View Slide

  19. Polymorphism
    Mutation

    View Slide

  20. Polymorphism
    Mutation

    View Slide

  21. Polymorphism
    Mutation

    View Slide

  22. Polymorphism
    Fixation
    (mutant allele
    replaced wild type
    allele completely)
    Mutation

    View Slide



  23. 1) Lost 

    2) Polymorphic in a population

    3) Fixation

    View Slide

  24. How polymorphic is human
    DNA?
    • ~0.1% of nucleotides differ when a DNA sequence is
    compared between two humans
    • ~1-1.5% of nucleotides differ when a DNA sequence is
    compared between human and chimpanzee
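
To put these percentages in absolute terms, here is a back-of-the-envelope sketch. The genome length of 3.2 billion bp is an assumption that is not stated on the slide, and 1.25% is simply the midpoint of the quoted 1-1.5% range.

```python
# Assumed haploid human genome length in base pairs (not from the slides).
GENOME_SIZE = 3.2e9


def expected_differences(fraction: float, length: float = GENOME_SIZE) -> float:
    """Expected number of differing sites for a given per-site divergence."""
    return fraction * length


if __name__ == "__main__":
    print(f"human vs human:      {expected_differences(0.001):,.0f} sites")
    print(f"human vs chimpanzee: {expected_differences(0.0125):,.0f} sites")
```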

    View Slide

  25. Mutations: important but weak
    force
    • Initially occur in just one individual
    • Take many generations to spread through population
    • Other processes must cause them to increase in frequency within
    a population

    View Slide

  26. Good, bad, and neutral
    mutations
    Advantageous Neutral Disadvantageous (deleterious)
    A mutation
    leading to resistance
    of the virus to the drug
    is beneficial to the
    virus
    E.g., synonymous
    change in DNA
    encoding a protein
    Which mutations will be picked up by
    - genetic drift?
    - natural selection?

    View Slide

  27. Classification of point
    mutations: effect on a protein
    synonymous
    (silent, no aa change)
    missense
    (from one aa to another aa)
    nonsense
    (sense codon -> Stop)
    nonsynonymous
    (aa altering)
    Point mutations
    (nucleotide substitutions)
    MC1R
    haemoglobin
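
The classification above can be sketched in code using the standard genetic code. The function name and the 0-based `pos` argument are illustrative choices. The sketch only handles substitutions inside a sense codon; changes to a stop codon (stop-loss) are outside the scheme on this slide and raise an error.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_point_mutation(codon: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based position `pos` of a sense codon."""
    codon, alt = codon.upper(), alt.upper()
    mutant = codon[:pos] + alt + codon[pos + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == "*":
        raise ValueError("only sense codons are handled in this sketch")
    if before == after:
        return "synonymous"
    return "nonsense" if after == "*" else "missense"


if __name__ == "__main__":
    print(classify_point_mutation("GAA", 2, "G"))  # Glu -> Glu
    print(classify_point_mutation("GAG", 1, "T"))  # Glu -> Val
    print(classify_point_mutation("GAG", 0, "T"))  # Glu -> stop
```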

    View Slide

  28. The big picture…
    Population
    Population
    Population
    Species
    Mutations
    Polymorphisms
    Fixations
    Fixations of alleles
    important for reproduction
    Speciation

    View Slide

  29. Drift
    Kent Holsinger, UConn
    very small population
    N = 10
    small population
    N = 1,000
    small populations are more affected by chance

    View Slide

  30. View Slide

  31. Random Genetic Drift:
    Consequences
    ‣ Since real populations are finite in size, the genetic
    make-up of offspring will by chance differ from that
    of the parents’ generation
    ‣ This effect is stronger in small populations
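
A minimal simulation sketch of this effect, assuming a Wright-Fisher model (diploid, constant population size, non-overlapping generations, no selection or mutation). The model choice and all parameter names are assumptions for illustration, not taken from the slides.

```python
import random


def wright_fisher(n_individuals: int, start_freq: float = 0.5,
                  generations: int = 200, seed: int = 0) -> list:
    """Return the allele-frequency trajectory under pure drift."""
    rng = random.Random(seed)
    copies = 2 * n_individuals  # diploid: two allele copies per individual
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Each copy in the next generation is drawn at random from the
        # current allele pool (binomial sampling).
        count = sum(rng.random() < freq for _ in range(copies))
        freq = count / copies
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    for n in (10, 1000):
        traj = wright_fisher(n, generations=50)
        print(f"N = {n}: frequency after 50 generations = {traj[-1]:.3f}")
```

With N = 10 the allele frequency typically wanders to loss (0) or fixation (1) within tens of generations, while with N = 1,000 it stays close to its starting value over the same period.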

    View Slide

  32. Forces of Evolution
    1. Mutations
    2. Natural selection
    3. Genetic drift
    Changes in the
    genetic make-up
    of a population
    Evolution

    View Slide

  33. Schlotterer, 2004

    View Slide

  34. Discovering SNPs
    • Single nucleotide polymorphisms (SNPs)
    • Variants that affect a single nucleotide, though
    often relaxed to include other types of small
    scale variation
    • SNP discovery requires resolving a new sequence
    relative to an existing one, so almost certainly
    need sequencing of some kind
    • Need multiple observations on the same base

    View Slide

  35. SNP = mismatch. But … mismatches
    can be
    • True SNP
    • PCR error during library construction
    • Base calling error
    • Misalignment
    • Reference error
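
Because a single mismatch can come from any of these sources, a simple first filter is to require several independent observations of the same non-reference base. The sketch below is illustrative only: the thresholds `min_count` and `min_fraction` are hypothetical, and real callers also weigh base and mapping quality.

```python
from collections import Counter


def candidate_alleles(ref_base: str, observed_bases, min_count: int = 2,
                      min_fraction: float = 0.2) -> dict:
    """Return non-reference bases seen often enough to be worth evaluating."""
    counts = Counter(b.upper() for b in observed_bases)
    depth = sum(counts.values())
    return {
        base: n
        for base, n in counts.items()
        if base != ref_base.upper() and n >= min_count and n / depth >= min_fraction
    }


if __name__ == "__main__":
    print(candidate_alleles("A", "AAAAGGGA"))  # G is seen three times
    print(candidate_alleles("A", "AAAAAAAT"))  # a lone T is ignored
```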

    View Slide

  36. NPs, haplotypes and tag SNPs. a, SNPs. Shown is a short stretch of DNA
    sions of the same chromosome region in different people. Most of the DNA
    identical in these chromosomes, but three bases are shown where
    curs. Each SNP has two possible alleles; the first SNP in panel a has the
    d T. b, Haplotypes. A haplotype is made up of a particular combination of
    three SNPs that are shown in panel a. For this region, most of the chr
    population survey turn out to have haplotypes 1–4. c, Tag SNPs. Gen
    three tag SNPs out of the 20 SNPs is sufficient to identify these four h
    uniquely. For instance, if a particular chromosome has the pattern A–
    three tag SNPs, this pattern matches the pattern determined for haploty

    View Slide

  37. ARTICLE
    doi:10.1038/nature09534
    A map of human genome variation from
    population-scale sequencing
    The 1000 Genomes Project Consortium*
    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation
    for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the
    project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput
    platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four
    populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697
    individuals from seven populations. We describe the location, allele frequency and local haplotype structure of
    approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000
    structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast
    majority of common variation, over 95% of the currently accessible variants found in any individual are present in this
    data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated
    genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used
    to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base
    substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to
    signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes,
    due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
    Understanding the relationship between genotype and phenotype is
    one of the central goals in biology and medicine. The reference human
    genome sequence [1] provides a foundation for the study of human
    genetics, but systematic investigation of human variation requires full
    knowledge of DNA sequence variation across the entire spectrum of
    allele frequencies and types of DNA differences. Substantial progress
    significantly to the genetic architecture of disease, but it has not yet been
    possible to study them systematically [7–9]. Meanwhile, advances in DNA
    sequencing technology have enabled the sequencing of individual
    genomes [10–13], illuminating the gaps in the first generation of databases
    that contain mostly common variant sites. A much more complete
    catalogue of human DNA variation is a prerequisite to understand fully

    View Slide

  38. 1000 Genomes project
    • Discover all common human variation: 95% of
    variation down to 1% MAF
    • In gene regions, down to 0.5% to 0.1%
    • Three pilot studies:

    View Slide

  39. 1000 Genomes
    • Excluding pilot data, the May 2011 data release
    contains sequence data for 1094 individuals!
    • Most data is exon targeted
    • Varying levels of coverage and number of
    datasets:
    • 93 NA19068 93 NA19000 80 NA19066 78 NA18986 77
    NA18982 72 NA20502 72 NA19065 72 NA18988 72 NA18983
    72 NA18950 72 NA12286 72 NA11933 69 NA20539 66
    NA20519 66 NA20515 66 NA19190 66 NA12046 64 NA18984
    64 NA12829 63 NA20756 63 NA20754...
    • http://browser.1000genomes.org

    View Slide

  40. Average individual variation
    • Average individual differs from reference at ~10k
    to ~11k non-synonymous sites (plus 10k to 12k
    synonymous sites)
    • 190–210 in-frame indels, 80–100 premature stop
    codons, 40–50 splice-site-disrupting variants and
    220–250 deletions that shift reading frame
    • 50–100 variants classified by the Human Gene
    Mutation Database as causing inherited disorders

    View Slide

  41. View Slide

  42. Variant calling steps

    View Slide

  43. Economist June 17 2010
    First human genome:

    10 years, ~ $3 billion
    A genome every 2 days
    UK£ 3000

    View Slide

  44. ● “Let us be in no doubt about what we are witnessing today: A revolution in medical science whose implications far surpass even the discovery of antibiotics, the first great technological triumph of the 21st century.” (Tony Blair)
    ● “Having the genetic code is not a very important moment other than it's the beginning of what we can do with it.” (Craig Venter)
    ● The benefits of human genome mapping will include “a new understanding of genetic contributions to human disease” and “the development of rational strategies for minimizing or preventing disease phenotypes altogether.” (Francis Collins)
    ● “It is fair to say that the Human Genome Project has not yet directly affected the health care of most individuals.” (Francis Collins, more recently)

    View Slide

  45. View Slide

  46. March 2011:
    ● 15 month old child with severe disease
    ● No diagnosis, no clinical management
    ● Sequencing: mutation found
    ● Suggests therapy, child cured

    View Slide

  47. View Slide

  48. Saving money:
    Going deep rather than wide

    View Slide

  49. Genomic partitioning / capture
    • Goal: extract a particular region or regions of a
    large DNA molecule for deep sequencing
    • (Or at least, highly enrich for those regions)

    View Slide

  50. Gapped molecular inversion probes
    ter concept, Hardenbol et al. demonstrated that over 10,000 SNPs could be genotyped in parallel via a padlock probe scheme requiring single base gap-fills at interrogated positions and four-color readout on microarrays (27).
    To adapt this approach for genomic partitioning, Shendure and colleagues explored
    Figure 6 panels: (a) Anneal, (b) Gap fill polymerization, (c) Gap fill ligation, (d) Exonuclease selection, (e) Probe release, (f) Amplification
    Figure 6
    Gapped molecular inversion probes. (a) Probes are designed with a target-specific sequence at the ends, and an internal sequence that is common to all MIPs. Probes hybridize to single-stranded genomic DNA, leaving a gap over the target region. The gap can range from a single nucleotide for SNP genotyping, as in References 26, 27, to several hundred nucleotides for exon
    desired scale, the 55,000 requ
    obtained as a complex mixtu
    by synthesis on and release fr
    of an Agilent microarray. Afte
    via 15-bp universal sequences
    100-mers were converted to
    through a series of restriction d
    70-mer MIP consisted of uniqu
    ing sequences flanking a comm
    The individual targets ranged
    60 to 191 bp. With the amplifi
    estimate that the yield of one
    array is sufficient to support
    independent capture reactio
    hybridization to genomic D
    and circularization, and exonuc
    capture products were rolling
    converted into shotgun sequen
    sequenced on the Illumina Ge
    Analysis of the resulting dat
    that: (a) specificity was high, as
    that could be confidently mapp
    overlapping with one of the
    (b) completeness and specific
    as only ∼10,000 of the 55,00
    detectably captured, and the a
    which individual targets were o
    over several logs; and (c) geno
    was high at homozygous pos
    at heterozygous positions, like
    stochastic effects with poor cap
    We have subsequently obse
    optimizations markedly impro
    mance of this strategy (64a).
    Annu. Rev. Genom. Human Genet. 2009.10:263-284.
    (Turner et al. ARGHG, 2009)

    View Slide

  51. Hybridization capture in solution
    Figure 8 workflow:
    Shotgun library or PCR amplified metagenomic library inserts
    Biotinylated probes
    Hybridize in solution
    Capture probes on streptavidin-coated beads
    Wash, elute captured DNAs
    Amplify by PCR with common primers
    Sequence products
    Figure 8
    In solution hybrid selection. Target DNA is prepared as an in vitro shotgun library, with common adaptors flanking genomic DNA fragments. The library is hybridized in solution to a set of biotinylated probes. After hybridization, biotinylated probes are captured with streptavidin beads. Beads are washed to remove any nonspecific, unbound library molecules. Multitemplate PCR with primers directed at the common adaptors is used to amplify eluted target molecules before high-throughput sequencing. Adapted from Noona
    et al. (50). Images reprinted with permission from AASS.
    Parallel capture of 29 of 35 human targets was demonstrated, with the caveat that these sequences were already known to be present in the Neanderthal library via sequencing of
    of this approach include the following: (a) because the RNA baits are single-stranded and present in only one orientation, a high concentration and molar excess can drive the kinet-
    (Turner et al. ARGHG, 2009)

    View Slide

  52. Hybridization capture on array
    (Turner et al. ARGHG, 2009)
    probe (55) and the RNA-DNA hybrid selection methods described above). However, in this section we focus on reports from several groups that apply the programmable microarray itself as a selective substrate for solid-phase capture-by-hybridization.
    Figure 9 workflow:
    Genomic DNA
    Fragmentation
    Random library
    Repair, adaptor ligation
    Adaptor-ligated fragments
    Target capture (Custom MGS array)
    Wash, elution
    Selected target region
    Amplification with single primer pair
    Enriched target region
    Figure 9
    On array hybrid selection. In vitro shotgun libraries are generated from genomic DNA, with common adapters flanking each fragment. The library is hybridized to oligos tethered on a high-density programmable microarray. Unbound molecules are washed from the array, followed by heat-based elution of specifically hybridized material. Multitemplate PCR with primers directed at the common adaptors is used to amplify eluted target molecules before high-
    quences designed from the referen
    genome to tile region(s) of interest at
    sity (i.e., 1 to 10 bp spacing) for i
    hybridization, while excluding non
    repetitive sequences from considerat
    hybridization for ∼65 h at 42◦C, an
    wash steps, heat-based elution at 95◦C
    out to recover specifically hybridized
    Universal primers corresponding to
    mon adaptors are used for PCR amp
    after which the target-enriched shotg
    can be sequenced.
    Albert et al. (2) designed and
    several capture arrays, one focused o
    ing 6726 discontiguous exons and ad
    quences from 660 genes (total targ
    5 Mb), and the remainder focused on
    ous intervals of varying sizes at the B
    cus (200 kb, 500 kb, 1 Mb, 2 Mb, a
    with the same array format but differ
    ties of probe spacing. With three re
    the exon-focused array, sequencing da
    to 115 Mb of sequence generated fro
    richment libraries by 454 sequencin
    relatively consistent performance, wi
    77% of reads mapping to targets, an
    96% of targets overlapped by at lea
    For capture directed at a contiguo
    (200 kb to 5 Mb), the fraction of re
    ping to the target appeared to be
    with the size of the target, i.e., 14%
    kb target vs 64% for a 5-Mb target.
    given that the 200-kb target is 25-fo
    than the 5-Mb target, the calcula

    View Slide

  53. Solution hybridization enrichment
    Figure 3 Sequence coverage along a contiguous target. Shown is base-by-base sequence coverage along a typical 11-kb segment (chr4:118635000–118646000) out of 1.7 Mb. Sequence corresponding to bait is marked in blue. Segments that had more than 40 repeat-masked bases per 170-base
    (Plot axes: sequence coverage, 0 to 350, against base position, 0 to 10,000.)

    View Slide

  54. Multiplexing

    View Slide

  55. Multiplexing
    • Goal: allow multiple samples to be sequenced in
    a single run
    • Attach a unique identifier to each fragment of
    each sample, which can later be used to
    determine what sample a given read comes from
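
The assignment step can be sketched as follows. This is a minimal illustration, not any vendor's demultiplexing software: the index sequences and sample names in the usage example are hypothetical, and tolerating one mismatch is an assumed policy.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_by_index: dict, max_mismatches: int = 1) -> dict:
    """Group insert reads by sample.

    `reads` is an iterable of (index_read, insert_read) pairs. A read is
    assigned only if exactly one known index lies within `max_mismatches`.
    """
    bins = defaultdict(list)
    for index_read, insert in reads:
        hits = [
            sample
            for index, sample in sample_by_index.items()
            if len(index) == len(index_read)
            and hamming(index, index_read) <= max_mismatches
        ]
        key = hits[0] if len(hits) == 1 else "undetermined"
        bins[key].append(insert)
    return dict(bins)


if __name__ == "__main__":
    indexes = {"ATCACG": "sampleA", "CGATGT": "sampleB"}  # hypothetical
    reads = [("ATCACG", "ACGT"), ("CGATGA", "TTTT"), ("GGGGGG", "CCCC")]
    print(demultiplex(reads, indexes))
```

Requiring a unique hit keeps ambiguous index reads out of every sample, which trades a little yield for fewer cross-sample misassignments.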

    View Slide

  56. Illumina Genome Analyzer System
    FIGURE 3: ADDING THE SEQUENCE INDEX TO A LIBRARY
    1. During sample preparation, adapters are ligated to the DNA fragments. One adapter contains the sequencing primer site for application read 1 (Rd1 SP).
    2. Prepared samples are amplified via PCR using two universal primers. One primer contains an attachment site (P5) for the flow cell, while the other contains the sequencing primer sites for the index read (Index SP) and for application read 2 (Rd2 SP).
    3. A third primer in the PCR adds the Index as well as a second flow cell attachment site (P7) to the PCR product shown in step 2.
    4. The indexed library is ready for sequencing using the Genome Analyzer system.
    Introducing index sequences onto DNA fragments enables sequencing of 96 different samples on a single flow cell. This greatly increases experimental scalability, while maintaining extremely low error rates and conserving read length.
    HIGH-THROUGHPUT SEQUENCING
    Using the industry’s leading next-generation sequencing technology, the Genome Analyzer system offers proven, exceptionally high data yields and the largest number of error-free reads. Harnessing this sequencing power in a multiplex fashion increases experimental throughput while reducing time and cost. This is especially useful when targeting genomic sub-regions or studying small genomes. To make multiplexed sequencing on the Genome Analyzer available to any laboratory, Illumina offers the Multiplexing Sample Preparation Oligonucleotide Kit and the Multiplexing Sequencing Primers and PhiX Control Kit.
    In the multiplexed sequencing method, DNA libraries are “tagged” with a unique identifier, or index, during sample preparation. Multiple samples are then pooled into a single lane on a flow cell and sequenced together in one Genome Analyzer
    for individual downstream analysis. Using this approach, sample identification is highly accurate.
    APPLICATIONS
    Multiplexed sequencing on the Genome Analyzer can be used in a wide range of applications. For
    HIGHLIGHTS OF ILLUMINA MULTIPLEXED SEQUENCING
    • Fast, High-Throughput Strategy: Automated sequencing of 96 samples per flow cell
    • Cost-Effective Method: Multi-sample pooling improves productivity by reducing time and reagent use
    • High-Quality Data: Accurate maintenance of read length for unknown sequences
    • Simplified Analysis: Automated
    FIGURE 1: MULTIPLEXED SEQUENCING PROCESS
    Panels: A. READ 1, B. INDEX READ, C. READ 2
    Sample multiplexing involves a total of three sequencing reads, including a separate index read, which is generated automatically on the Genome Analyzer equipped with the Paired-End Module. A: Application read 1 (dotted line) is generated using the Read 1 Sequencing Primer (Rd1 SP). B: The read 1 product is removed and the Index Sequencing Primer (Index SP) is annealed to the same strand to produce the 6-bp index read (dotted line). C: If a paired-end read is required, the original template strand is used to regenerate the complementary strand. Then, the original strand is removed and the complementary strand acts as a template for application read 2 (dotted line), primed by the Read 2 Sequencing Primer (Rd2 SP). Pipeline Analysis software identifies the index sequence from each cluster so that the application reads can be assigned to a single sample. Hatch marks represent the flow cell surface.

    View Slide

  57. Variant calling
    ● Aim: produce variant calls (w.r.t. reference), and genotype calls
    ● True variants usually easy to spot
    ● But: SNPs easier than indels easier than SVs
    ● And: Sufficient coverage required
    ● Divergence / diversity often low (human: 0.1%)
    ● False positives are an issue
    Lunter 2013
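
To see why sufficient coverage matters, consider a heterozygous site where each read shows the variant allele with probability 0.5. The sketch below computes the chance of seeing the variant in at least `k` reads at a given depth. It assumes independent, error-free reads and unbiased sampling of the two alleles, which are simplifications made here and not claims from the slide.

```python
from math import comb


def prob_at_least(k: int, depth: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(depth, p)."""
    below = sum(comb(depth, i) * p**i * (1 - p) ** (depth - i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    for depth in (4, 8, 15, 30):
        print(f"depth {depth:>2}: P(>= 2 variant reads) = {prob_at_least(2, depth):.4f}")
```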

    View Slide

  58. Lunter 2013

    View Slide

  59. Lunter 2013

    View Slide

  60. 4 reads:
    1 bp insertion
    Lunter 2013

    View Slide

  61. 5 reads:
    1 bp deletion
    Lunter 2013

    View Slide

  62. 4 reads:
    reference
    Lunter 2013

    View Slide

  63. Data issues
    ● Primary data
    ● PCR errors (base errors)
    ● SOLiD: reference bias (reference-based color decoding)
    ● Base quality calibration
    ● Indel errors
    ● Overlaps
    ● Duplicates
    ● Primers
    ● Alignment
    ● Base-level misalignments around indels
    ● Reference / mapping
    ● Unrepresented seg dups
    ● Repetitive sequence
    Lunter 2013

    View Slide

  64. Data issues
    ● Primary data
    ● PCR errors (base errors)
    ● SOLiD: reference bias (reference-based color decoding)
    ● Base quality calibration
    ● Indel errors
    ● Overlaps
    ● Duplicates
    ● Primers
    ● Alignment
    ● Base-level misalignments around indels
    ● Reference / mapping
    ● Unrepresented seg dups
    ● Repetitive sequence
    Lunter 2013

    View Slide

  65. Non-Mendelian alleles in trios
    Method 1 (TRIO):
    Mother: 20 x ref, 2 x +A
    Father: 12 x ref, 12 x -AA
    Child: 14 x ref, 3 x +A, 1 x +AAA
    Choose parsimonious child alleles from mother and father:
      - explain largest number of reads in M+F+C
      - alleles supported by >= 2 reads in each of M,C / each of F,C
    Remaining child alleles classified as errors
    Lunter 2013
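
The read-support rule on this slide can be sketched as below. This is a simplified illustration, not the method from the cited talk: it applies only the ">= 2 reads in the child and in at least one parent" criterion and omits the step that picks the allele pair explaining the most reads. The function name and the `"inherited"` / `"error"` labels are hypothetical.

```python
def classify_child_alleles(mother: dict, father: dict, child: dict,
                           min_reads: int = 2) -> dict:
    """Label each child allele using read support in child and parents.

    Each argument maps an allele label to its supporting read count.
    """
    calls = {}
    for allele, n_child in child.items():
        in_mother = mother.get(allele, 0) >= min_reads
        in_father = father.get(allele, 0) >= min_reads
        supported = n_child >= min_reads and (in_mother or in_father)
        calls[allele] = "inherited" if supported else "error"
    return calls


if __name__ == "__main__":
    # Read counts taken from the slide.
    mother = {"ref": 20, "+A": 2}
    father = {"ref": 12, "-AA": 12}
    child = {"ref": 14, "+A": 3, "+AAA": 1}
    print(classify_child_alleles(mother, father, child))
```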

    View Slide

  66. Working with sequences
    finding variations
    Genome (FASTA)
    Reads (FASTQ)
    alignment and variant calling
    Variation (VCF)
    Garrison 2013

    View Slide

  67. Variant detection
    Many current methods use a Bayesian model
    which combines several sources of information:
    ● Sequencing provides observation quality
    estimates
    ○ base quality = prob(observation | sequence)
    ● Biology provides prior expectations about
    polymorphism
    ○ θ = p(site polymorphic) ~ 1e-3 for humans
    ● And also population structure
    ○ which matters if we call several samples together
    Garrison 2013
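
Base qualities are conventionally reported on the Phred scale, Q = -10 log10(error probability). The sketch below converts a quality score to an error probability and to a per-base likelihood. Spreading the error evenly over the three other bases is a simplifying assumption made here, not something stated in the slides.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10)


def base_likelihood(observed: str, true_base: str, q: float) -> float:
    """P(observed base | true base), with errors spread over the 3 other bases."""
    err = phred_to_error_prob(q)
    return 1 - err if observed == true_base else err / 3


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```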

    View Slide

  68. Bayesian visual intuition
    Figures from oscarbonilla.com, "Visualizing Bayes theorem"
    We have a universe of individuals
    A = samples with a
    variant at some locus
    B = putative observations
    of variant at some locus
    Garrison 2013

    View Slide

  69. Polybayes (Marth et al. 1999)
    • Used EST databases aligned to (draft) reference
    human genome
    • Discard paralogous alignments
    • At sites where variation is observed, use a
    probabilistic model to evaluate whether the site
    is likely to be a real SNP
    • Incorporates error probability in base calls
    • Confirm by further sequencing

    View Slide

  70. Washington University, (1) Department of Genetics and Genome Sequencing Center and (2) Division of Dermatology, St. Louis, Missouri, USA. Correspondence should be addressed to G.T.M. or P.-Y.K.
    finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.
    duplicated elsewhere in the genome may give rise to false SNP predictions, and the presence of such sequence paralogues points to difficulties during marker development. We devised a Bayesian [15]
    Fig. 1 Application of the POLYBAYES procedure to EST data. a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.

    View Slide

  71. One position at a time
    Reference
    Reads
    Variant observations
    Garrison 2013

    View Slide

  72. One position at a time
    Reference
    Haplotype information is lost
    Garrison 2013

    View Slide

  73. Direct detection of haplotypes (FreeBayes)
    Detection window
    Reference
    Reads
    Direct detection of haplotypes
    from reads resolves
    differentially represented
    alleles (as the sequence is
    compared, not the alignment)
    Allele detection is still
    alignment-based
    Garrison 2013

    View Slide

  74. Garrison 2013

    View Slide

  75. Why haplotypes?
    ● Variants cluster
    ● This has functional significance
    ● Observing haplotypes lets us be more
    certain of the local structure of the genome
    ● We can improve the detection process itself
    by using haplotypes rather than point
    mutations
    Garrison 2013

    View Slide

  76. The functional effect of variants depends on
    other nearby variants on the same haplotype
    AGG GAG CTG
    Arg Glu Leu
    reference
    AGG TAG CTG
    Arg Ter
    apparent
    AGG TTG CTG
    Arg Leu Leu
    actual
    OTOF gene: mutations
    cause profound recessive
    deafness
    Apparent nonsense variant,
    one YRI homozygote
    Actually a block substitution
    that results in a missense
    substitution
    Daniel MacArthur
    Garrison 2013
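
The OTOF example can be reproduced with a few lines of code: annotating the first substitution on its own predicts a stop codon, while applying both substitutions on the same haplotype gives a missense change. This is a minimal sketch with 0-based positions and one-letter amino-acid codes; the helper names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons of an in-frame DNA string."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def apply_variants(seq: str, variants) -> str:
    """Apply (0-based position, alternate base) substitutions to a sequence."""
    bases = list(seq)
    for pos, alt in variants:
        bases[pos] = alt
    return "".join(bases)


if __name__ == "__main__":
    ref = "AGGGAGCTG"
    print("reference:", translate(ref))
    print("first SNP alone:", translate(apply_variants(ref, [(3, "T")])))
    print("both SNPs on one haplotype:",
          translate(apply_variants(ref, [(3, "T"), (4, "T")])))
```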

    View Slide

  77. Importance of haplotype effects:
    frame-restoring indels
    ●
    Two apparent frameshift deletions in the CASP8AP2
    gene (one … bp, one … bp) on the same haplotype
    ●
    Overall effect is in-frame deletion of six amino acids
    Daniel MacArthur
    Garrison 2013
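
Whether a set of indels on one haplotype shifts the reading frame depends on their net length, not on each indel separately. The sketch below uses illustrative deletion lengths chosen to sum to 18 bp (six codons); they are hypothetical values, not read from the slide.

```python
def net_frame_effect(indel_lengths) -> tuple:
    """Return (effect, codons changed) for indels on the same haplotype.

    Insertions are positive lengths, deletions negative.
    """
    net = sum(indel_lengths)
    effect = "in-frame" if net % 3 == 0 else "frameshift"
    return effect, abs(net) // 3


if __name__ == "__main__":
    # Hypothetical lengths for illustration only.
    print(net_frame_effect([-17]))      # one deletion considered alone
    print(net_frame_effect([-17, -1]))  # both deletions together
```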

    View Slide

  78. The model
    ● Bayesian model estimates the probability of polymorphism at
    a locus given input data and the population mutation rate
    (~pairwise heterozygosity) and assumption of “neutrality”
    (random mating).
    ● Following Bayes theorem, the probability of a specific set of
    genotypes over some number of samples is:
    ○ P(G|R) = P(R|G) P(G) / P(R)
    ● Which in FreeBayes we extend to:
    ○ P(G,S|R) = P(R|G,S) P(G) P(S) / P(R)
    ○ G = genotypes, R = reads, S = locus is well-
    characterized/mapped
    ○ P(R|G,S) is our data likelihood, P(G) is our prior estimate
    of the genotypes, P(S) is our prior estimate of the
    mappability of the locus, P(R) is a normalizer.
    Garrison 2013
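
As a worked illustration of P(G|R) = P(R|G) P(G) / P(R), the sketch below computes genotype posteriors for one diploid sample at one biallelic site. It is a toy model written for these notes, not FreeBayes code. It leaves out the mappability term S, assumes a sequencing error always yields the other allele, and uses a simple prior parameterised by `theta` (heterozygous = theta, homozygous alternate = theta / 2); all of these are assumptions.

```python
from math import prod


def genotype_posteriors(observations, theta: float = 1e-3) -> dict:
    """Posterior over genotypes RR, RA, AA.

    `observations` is a list of (is_alt, error_prob) pairs, one per read.
    """
    priors = {"RR": 1 - 1.5 * theta, "RA": theta, "AA": theta / 2}
    alt_fraction = {"RR": 0.0, "RA": 0.5, "AA": 1.0}
    unnormalized = {}
    for genotype, f in alt_fraction.items():
        # Data likelihood P(R|G): product over reads of the probability of
        # observing that read's allele given the genotype's alt fraction f.
        likelihood = prod(
            (f * (1 - e) + (1 - f) * e) if is_alt else ((1 - f) * (1 - e) + f * e)
            for is_alt, e in observations
        )
        unnormalized[genotype] = likelihood * priors[genotype]
    total = sum(unnormalized.values())  # P(R), the normalizer
    return {g: v / total for g, v in unnormalized.items()}


if __name__ == "__main__":
    reads = [(True, 0.01)] * 5 + [(False, 0.01)] * 5  # 5 alt, 5 ref at Q20
    for genotype, p in genotype_posteriors(reads).items():
        print(genotype, f"{p:.6f}")
```

Even with a prior that strongly favours the homozygous reference genotype, five variant reads out of ten are enough for the heterozygous genotype to dominate the posterior.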

    View Slide

  79. The process
    ● Parse alleles (small haplotypes) from alignments using
    CIGAR strings
    ● Pick suitable alleles (very weak input filters to improve
    runtime)
    ● Build haplotypes across target locus
    ● Generate genotype likelihoods
    ● Sample a posterior space around the data likelihood
    maximum
    ○ update genotype estimates and iterate (hill-climbing
    posterior search) until convergence on maximum a
    posteriori genotyping over all samples
    ● Output a record and do it again
    Garrison 2013
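
The first step, parsing alleles from alignments using CIGAR strings, can be sketched as below. This is an illustrative walk over the SAM CIGAR operations, not the FreeBayes implementation: coordinates are 0-based, insertions and deletions are reported without an anchoring base, and the function name is hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def alleles_from_alignment(ref: str, ref_start: int, read: str, cigar: str) -> list:
    """Return (ref_pos, ref_allele, alt_allele) for each difference."""
    r, q = ref_start, 0  # current positions in reference and read
    found = []
    for count, op in CIGAR_RE.findall(cigar):
        length = int(count)
        if op in "M=X":  # aligned bases: report mismatches
            for i in range(length):
                if ref[r + i] != read[q + i]:
                    found.append((r + i, ref[r + i], read[q + i]))
            r += length
            q += length
        elif op == "I":  # insertion: consumes read only
            found.append((r, "", read[q:q + length]))
            q += length
        elif op == "D":  # deletion: consumes reference only
            found.append((r, ref[r:r + length], ""))
            r += length
        elif op == "N":  # skipped reference region
            r += length
        elif op == "S":  # soft clip: bases present in read, not aligned
            q += length
        # H (hard clip) and P (padding) consume neither sequence
    return found


if __name__ == "__main__":
    print(alleles_from_alignment("ACGTACGTAC", 2, "GCATCGC", "3M1I2M2D1M"))
```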

    View Slide