Targeted Sequencing Overview

PacBio
August 01, 2013


Transcript

  1. FIND MEANING IN COMPLEXITY

    Targeted Sequencing - Experimental Design
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Applications of Targeted Sequencing & Experimental Design

    • Benefits from PacBio for characterizing genomic variations
    • Variant detection experimental design recommendations
    • Example project calculation
    • Sample preparation and sequencing recommendations
    • Analysis recommendations
    • Where to find additional information
  3. Targeted Sequencing: High-Resolution Insights

    Exquisite sensitivity and specificity to fully characterize genetic complexity
    – Multi-kilobase reads
    – 99.999% consensus accuracy
    – Linear variant detection to <0.1% frequency
    – Access to the entire genome

    • SNP Detection and Validation
    • Repeat Expansions
    • Compound Mutations and Haplotype Phasing
    • Minor Variants Detection
    • Full-Length Transcripts and Splice Variants

    www.pacb.com/target
  4. Benefits of PacBio for Targeted Sequencing Applications

    • SMRT® Sequencing can achieve greater than 99.999% (QV 50) accurate sequencing results for resequencing applications
      – Near perfect consensus accuracy
      – Best coverage uniformity with no amplification and minimal GC bias
      – Improved mappability with longer average read lengths
    • Flexibility in amplicon size allows for customized solutions for complex SNP detection applications
    • Sequencing of long amplicons with multi-kilobase read lengths provides direct strand-specific haplotype phasing
    • Single-molecule sequencing detects low-frequency minor variants at high resolution
  5. Targeted Sequencing Experimental Design Guidance

    • High-level overview of experimental design theory for variant detection
    • Examples of how to estimate number of SMRT® Cells per project
    • Guidance on sample preparation and analysis

    www.pacb.com/target
  6. Experimental Design

    • Coverage defined as:
      – Number of independent base calls generated for a particular position in a known reference alignment
      – Needs will vary based on experimental question and analysis approach

    Variables that impact coverage:
    • Type of variation queried
    • Analysis method used
    • Required confidence
    • Predicted variant frequency
    • QV or accuracy of the base calls at the variant position

    Every project is different: understanding goals and objectives is critical for proper experimental design.

    Workflow: Experimental Design, Target Enrichment, SMRTbell™ Library Preparation, Sequencing, Analysis
  7. Coverage Needs Vary by Base QV and Variant Frequency

    • Low frequency SNPs require more coverage
    • More coverage for higher confidence
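The relationship on this slide can be illustrated with a simple binomial sampling sketch. This is not the model behind the slide's plot: it assumes error-free base calls (so base QV is ignored) and asks only how much coverage is needed to observe the variant allele at least a chosen number of times. The function names and the "at least k variant reads" criterion are assumptions made for illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


def min_coverage(variant_freq: float, min_variant_reads: int,
                 confidence: float, max_coverage: int = 100_000) -> int:
    """Smallest coverage at which the variant allele is seen at least
    `min_variant_reads` times with probability >= `confidence`.

    Assumption: error-free base calls (base QV is ignored here).
    """
    for n in range(min_variant_reads, max_coverage + 1):
        if prob_at_least(min_variant_reads, n, variant_freq) >= confidence:
            return n
    raise ValueError("no coverage up to max_coverage meets the target")


if __name__ == "__main__":
    # Lower variant frequency and higher confidence both push coverage up.
    for freq in (0.5, 0.2, 0.05, 0.01):
        print(freq, min_coverage(freq, min_variant_reads=5, confidence=0.95))
```

Real base-call errors raise the required coverage further, which is why the slide ties coverage to base QV as well as variant frequency.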
  8. CCS Throughput vs Total SMRT Cell Throughput

    • Total SMRT Cell throughput is a function of read length and number of reads per SMRT Cell passing filtering criteria
    • CCS throughput will always be lower
      – CCS reads require a minimum of two passes
      – Usable throughput per SMRT® Cell depends on insert length
        − Amplicon size * # loaded ZMW = max throughput
      – ~50-60K reads per SMRT Cell
    • Number of reads per SMRT Cell will vary due to:
      – Instrument type: RS II vs RS
      – Insert length
      – Required # of passes to reach desired QV
      – Chemistry choice
      – Instrument run conditions (loading conc., movie times, etc.)
      – Sample quality

    [Figure: Trading single-molecule accuracy for # of reads (1 x 45 movie, RS II)]
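The "amplicon size * # loaded ZMW" relation above is a single multiplication; a minimal sketch follows. Treating the slide's ~50-60K reads per SMRT Cell as the number of productively loaded ZMWs is an assumption made here for illustration, and the function name is hypothetical.

```python
def max_throughput_bp(amplicon_size_bp: int, loaded_zmws: int) -> int:
    """Upper bound on CCS throughput in bases: one consensus read per loaded ZMW."""
    return amplicon_size_bp * loaded_zmws


if __name__ == "__main__":
    # Assumed inputs: 1 kb amplicon and the two ends of the ~50-60K range.
    for zmws in (50_000, 60_000):
        print(zmws, max_throughput_bp(1_000, zmws) / 1e6, "Mb")
```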
  9. Considerations for Choosing Between CLR and CCS

    • Multi-molecule consensus (or use of CLR reads) is sufficient for the majority of applications
      – Allows longer amplicons, more useful reads per SMRT Cell
      – Same final consensus accuracy in fewer SMRT Cells
      – Flexibility for shorter movies
    • CCS is useful for applications where high intra-molecular consensus accuracy is required (rare variants)
      – The larger the amplicon size, the fewer CCS reads per SMRT® Cell
  10. Estimated Number of Reads per SMRT® Cell: P4 – C2 Chemistry, RS II, No Stage Start, 1 x 45 movie

    Insert Size   Full Pass Subreads*   2 pass CCS   3 pass CCS   4 pass CCS   5 pass CCS
    0.5 kb        380,000               47,000       44,000       40,000       36,000
    1.0 kb        190,000               40,000       34,000       26,000       17,000
    1.5 kb        100,000               33,000       22,000        4,200           25
    2.0 kb         70,000               26,000        4,500            5            0
    2.5 kb         30,000               12,000           35            0            0

    Estimated number of reads by insert size and number of passes, derived from a 500 bp amplicon dataset
    Note: E. coli 2 kb library, SMRT Analysis 2.0.1
    Post-filter polymerase reads for this data set: ~60K

    [Figure: CCS Accuracy by Number of Passes (axes: Number of Passes, Accuracy)]
    [Figure: Estimated Reads by Insert Size and Passes (Estimated Number of Reads for full pass subreads and 2 to 5 pass CCS; 0.5 kb to 2.5 kb inserts, 45 min movie)]
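As a sketch of how the 45 minute table might be used when planning, the values can be held in a lookup and queried for the highest CCS pass count that still yields a required number of reads. The numbers are copied from the table on this slide; the dictionary layout and the function name are illustrative assumptions.

```python
# Estimated reads per SMRT Cell (P4-C2, RS II, no Stage Start, 1 x 45 movie),
# copied from the slide. Keys: insert size in kb -> {CCS passes: reads}.
CCS_READS_45_MIN = {
    0.5: {2: 47_000, 3: 44_000, 4: 40_000, 5: 36_000},
    1.0: {2: 40_000, 3: 34_000, 4: 26_000, 5: 17_000},
    1.5: {2: 33_000, 3: 22_000, 4: 4_200, 5: 25},
    2.0: {2: 26_000, 3: 4_500, 4: 5, 5: 0},
    2.5: {2: 12_000, 3: 35, 4: 0, 5: 0},
}


def max_passes_with_yield(insert_kb: float, min_reads: int):
    """Highest CCS pass count whose estimated yield is >= min_reads.

    Returns None when even 2-pass CCS falls short of min_reads.
    """
    row = CCS_READS_45_MIN[insert_kb]
    passing = [passes for passes, reads in row.items() if reads >= min_reads]
    return max(passing) if passing else None


if __name__ == "__main__":
    print(max_passes_with_yield(1.0, 30_000))
    print(max_passes_with_yield(2.5, 20_000))
```

The lookup only covers the insert sizes listed on the slide; intermediate sizes would need interpolation or a test SMRT Cell.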
  11. Estimated Number of Reads per SMRT® Cell: P4 – C2 Chemistry, RS II, No Stage Start, 1 x 90 movie

    Insert Size               Full Pass Subreads*   2 pass CCS   3 pass CCS   4 pass CCS   5 pass CCS
    0.5 kb                    650,000               60,000       57,000       53,000       49,000
    1.0 kb                    300,000               50,000       44,000       38,000       27,000
    1.5 kb                    200,000               38,000       29,000       21,000       14,000
    2.0 kb                    100,000               30,000       20,000       10,000        2,500
    2.5 kb                     50,000               20,000       10,000        1,400           30
    3.5 kb                     11,000                2,500          540           20            5
    3.5 kb (120 min movie)     20,000                5,500        2,300          600           25

    Estimated number of reads by insert size and number of passes, derived from a 1400 bp amplicon dataset and real data from a 3.5 kb data set
    Note: Post-filter polymerase reads ~60K
    Note: E. coli 2 kb library, SMRT Analysis 2.0.1

    [Figure: CCS Accuracy by Number of Passes (axes: Number of Passes, Accuracy)]
    [Figure: Estimated Reads by Insert Size and Passes (Estimated Number of Reads for full pass subreads and 2 to 5 pass CCS; 0.5 kb to 3.5 kb inserts at 90 min, plus 3.5 kb at 120 min)]
  12. How Many SMRT® Cells?

    • The number of SMRT Cells per project is contingent on the number of reads needed to reach target coverage
    • Per-base coverage is key
      – Amplification and pooling bias
      – Insert-size loading bias
    • If pooling amplicons:
      – The better the amplicon uniformity, the fewer SMRT® Cells required
      – Keep amplicons of similar size, +/- 10%
      – If possible, run a test SMRT Cell

    Number of SMRT Cells = (t * c * β) / r
      t = # targets
      c = required coverage per target
      β = sample bias
      r = number of usable reads per SMRT Cell
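The slide's variables combine as cells = (t * c * β) / r, which is the form the worked examples on slides 15 and 16 use. A minimal sketch, with a hypothetical function name and made-up inputs in the usage line:

```python
def smrt_cells_needed(targets: int, coverage: float, bias: float,
                      usable_reads_per_cell: float) -> float:
    """Estimate SMRT Cells as (t * c * beta) / r.

    targets               -- t, number of targets (amplicons)
    coverage              -- c, required coverage per target
    bias                  -- beta, fold range of sample/amplicon representation
    usable_reads_per_cell -- r, usable reads per SMRT Cell
    """
    if usable_reads_per_cell <= 0:
        raise ValueError("usable_reads_per_cell must be positive")
    reads_needed = targets * coverage * bias
    return reads_needed / usable_reads_per_cell


if __name__ == "__main__":
    # Hypothetical project: 100 targets, 50x coverage, 2-fold bias, 40,000 reads/cell.
    print(smrt_cells_needed(100, 50, 2, 40_000))
```

The function returns the raw ratio so the caller can decide how to round; rounding up is the conservative choice when a partial SMRT Cell cannot be run.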
  13. Calculation Caveat When Using Subreads – Sampling Depth

    • Short amplicons and long movies create an imbalance in the ratio of identical subreads to actual unique reads per SMRT® Cell

    Number of SMRT Cells = (t * c * β) / r
      t = # targets
      c = required coverage per target
      β = sample bias
      r = number of usable reads per SMRT Cell

    Estimated Number of Reads per SMRT® Cell: P4 – C2 Chemistry, RS II, No Stage Start, 1 x 90 movie

    Insert Size   Full Pass Subreads*   2 pass CCS
    0.5 kb        650,000               60,000

    Post-filter polymerase reads ~60K

    • If the majority of coverage comes from the same molecule, minor variant detection and accurate heterozygote detection are difficult
    • If using subreads, recommend adjusting "usable reads" to be within 2-3 fold of unique reads (reads of insert) per SMRT Cell
      – In the above case, this would be ~180,000 reads
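The sampling-depth adjustment can be sketched as a simple cap: usable subreads are limited to a fixed multiple of the unique reads per SMRT Cell. The function name is hypothetical and the default of 3 is taken from the upper end of the slide's "2-3 fold" recommendation.

```python
def usable_subreads(full_pass_subreads: int, unique_reads: int,
                    max_fold: float = 3) -> float:
    """Cap usable subreads at `max_fold` times the unique reads per SMRT Cell."""
    return min(full_pass_subreads, max_fold * unique_reads)


if __name__ == "__main__":
    # Slide values: 650,000 full pass subreads, ~60K post-filter polymerase reads.
    print(usable_subreads(650_000, 60_000))
```

When the subread count is already below the cap, the value is returned unchanged.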
  14. Experiment Design Example

    • How many SMRT® Cells per sample are needed for my project?
    • Should I use CCS or CLR?
    • Key assumptions:
      – 500 amplicons
      – Expected allele frequency: 20%
      – Range of amplicon representation: 3-fold
      – Amplicon size & distribution: 1000 bp +/- 100
      – No barcoding
  15. Example Estimates for Number of SMRT® Cells using CLR

    (# targets * minimum coverage * bias) / usable reads per SMRT Cell
      = (500 * 140 * 3) / 180,000
      = 210,000 / 180,000
      = ~1 SMRT Cell

    • Usable reads per SMRT Cell: 180,000
      – Ratio of subreads to actual reads per cell: 3:1, adjusted for sampling depth (3 * 60,000)
    • Estimated number of reads by insert size and # of passes derived from 550 bp amplicon, 45 min movie
    • Post-filter polymerase reads for this data set: ~60K

    Insert Size   Full Pass Subreads*   2 pass CCS   3 pass CCS   4 pass CCS   5 pass CCS
    0.5 kb        380,000               47,000       44,000       40,000       36,000
    1.0 kb        190,000               40,000       34,000       26,000       17,000
    1.5 kb        100,000               33,000       22,000        4,200           25
    2.0 kb         70,000               26,000        4,500            5            0
    2.5 kb         30,000               12,000           35            0            0
  16. Example Estimates for Number of SMRT® Cells using CCS

    (# targets * minimum coverage * bias) / usable reads per SMRT Cell
      = (500 * 55 * 3) / 40,000
      = 82,500 / 40,000
      = ~2 SMRT Cells

    • Usable reads per SMRT Cell: 40,000
    • Estimated number of reads by insert size and # of passes derived from 550 bp amplicon, 45 min movie
    • Post-filter polymerase reads for this data set: ~60K

    Insert Size   Full Pass Subreads*   2 pass CCS   3 pass CCS   4 pass CCS   5 pass CCS
    0.5 kb        380,000               47,000       44,000       40,000       36,000
    1.0 kb        190,000               40,000       34,000       26,000       17,000
    1.5 kb        100,000               33,000       22,000        4,200           25
    2.0 kb         70,000               26,000        4,500            5            0
    2.5 kb         30,000               12,000           35            0            0
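The CLR and CCS examples on slides 15 and 16 can be reproduced side by side. The inputs below are the slide values (500 targets, 3-fold bias, 140x at 180,000 usable reads for CLR, 55x at 40,000 usable reads for CCS); the function and variable names are illustrative only.

```python
import math


def cells(targets: int, coverage: int, bias: int, usable_reads: int) -> float:
    """(# targets * minimum coverage * bias) / usable reads per SMRT Cell."""
    return targets * coverage * bias / usable_reads


clr = cells(500, 140, 3, 180_000)  # 210,000 reads needed
ccs = cells(500, 55, 3, 40_000)    # 82,500 reads needed

if __name__ == "__main__":
    for name, value in (("CLR", clr), ("CCS", ccs)):
        # Show the raw ratio, the nearest whole cell, and the round-up value.
        print(name, round(value, 2), round(value), math.ceil(value))
```

The slides quote the nearest whole number (~1 and ~2 SMRT Cells); rounding up instead gives a more conservative plan.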
  17. Sample Preparation

    • Compatible with a variety of target enrichment methods
    • Equimolar amplicon pooling
    • Barcoding is useful when sample is limited and to simplify workflow
  18. Multiplex Samples with Barcodes

    Two options for barcoding samples, depending on project:
    1. Barcodes added to 5’ ends of PCR primers (16 bp barcodes)
      – Step 1. Append barcodes to inserts via PCR primer
      – Step 2. Standard SMRTbell™ template preparation
    2. Barcoded adapters (7 bp barcodes)
      – Use as standard primers in library prep
  19. Multiplexing Options

    • Number of barcodes designed by PacBio
      – Added to PCR primers: 48 pairs; can be combined in different ways
      – Barcoded adapters: 12 adapters
    • Length of barcode sequence
      – Added to PCR primers: 16 bases
      – Barcoded adapters: 7 bases
    • Applications
      – Added to PCR primers: amplicons for projects in design phase
      – Barcoded adapters: pre-existing amplicons; multiplexing BACs or fosmids
    • User needs to order
      – Added to PCR primers: barcoded PCR primers
      – Barcoded adapters: barcoded adapters
    • Library prep workflow (both use standard protocols)
      – Added to PCR primers: barcoded amplicons pooled and prepped together
      – Barcoded adapters: samples must be prepped separately until after ligation
    • Input requirements per sample
      – Added to PCR primers: reduced; input requirement split between multiplexed samples
      – Barcoded adapters: standard or slightly lower; pooling of samples post ligation

    • Protocols and sequences for preparing barcoded samples with either PCR primer tails or barcoded adapters are available on SampleNet: http://www.smrtcommunity.com/Share/Protocol/List
    • Recommend ordering barcodes using PacBio-supplied sequences
    • Barcoding analysis recommendations available on DevNet: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Barcoding
  20. Target Enrichment Strategies Compatible with SMRT® Sequencing

    • Traditional PCR
      – Enrichment method: PCR
      – Target insert size: up to 14 kb
      – Target region: flexible
      – Targets per sample: flexible
    • Fluidigm® Access Array™ System
      – Enrichment method: PCR
      – Target insert size: up to 10 kb
      – Target region: <10 Mb
      – Targets per sample: 48 samples x 48 primer pairs (480 primer pairs with multiplexing)
    • Agilent Technologies® SureSelect® Target Enrichment System
      – Enrichment method: hybridization
      – Target insert size: up to 2 kb
      – Target region: <50 Mb
      – Targets per sample: 100,000s of fragments by designing specific baits
    • Raindance Technologies® Rainstorm™ technology
      – Enrichment method: PCR
      – Target insert size: up to 1.5 kb
      – Target region: <50 Mb
      – Targets per sample: up to 20,000 primer pairs
  21. Target Enrichment Technical Notes Available

    • Suggestions for target enrichment methods using the PacBio® system with:
      – Agilent Technologies® SureSelect® Target Enrichment
      – Fluidigm® Access Array™ System
    • Available through customer portal

    www.pacb.com/target
  22. SMRTbell™ Library Preparation

    • For <3 kb amplicons:
      – DNA Template Prep Kit 2.0 (250 bp - <3 kb)
      – Procedure & Checklist – 2 kb Template Preparation and Sequencing
      – Minimum input: 500 ng
    • For >3 kb amplicons:
      – DNA Template Prep Kit 2.0 (3 kb - <10 kb)
      – Procedure & Checklist – Low-Input 10 kb Library Preparation and Sequencing (MagBead Station)
      – Minimum input: 1 µg
    • Additional suggestions to improve yield for amplicons:
      – DNA damage repair for amplicons >2 kb
      – Suggest three AMPure® bead purification steps instead of two for final purification
      – Separation of DNA damage repair and end repair steps may be useful
  23. Sequencing

    • DNA/Polymerase Binding Kit
      – Longer amplicons (>3 kb): DNA/Polymerase Binding Kit P4
      – Shorter amplicons (<3 kb): DNA/Polymerase Binding Kit P4
    • DNA Sequencing Kit
      – Longer amplicons (>3 kb): DNA Sequencing Kit 2.0 (C2)
      – Shorter amplicons (<3 kb): DNA Sequencing Kit 2.0 (C2)
    • Loading
      – Longer amplicons (>3 kb): MagBead loading
      – Shorter amplicons (<3 kb): MagBead loading or diffusion
    • Stage Start
      – Longer amplicons (>3 kb): Stage Start = yes
      – Shorter amplicons (<3 kb): Stage Start = no
    • Movie Time
      – Longer amplicons (>3 kb): 120 minutes
      – Shorter amplicons (<3 kb): 45 to 90 minutes (beware of running too long due to primary analysis / CCS bottleneck)
  24. Increase Samples per Day by Decreasing Movie Time

    Estimated run times for diffusion loaded, no Stage Start

    Movie time       8 SMRT® Cells   16 SMRT Cells   24 SMRT Cells
    30 min movies     8.6 hr         17.3 hr         26 hr
    45 min movies    10.6 hr         21.3 hr         32 hr
    60 min movies    12.6 hr         25.3 hr         38 hr
    90 min movies    16.6 hr         33.3 hr         50 hr
    120 min movies   20.6 hr         41.3 hr         62 hr

    Estimated run times for MagBead loaded, with Stage Start

    Movie time       8 SMRT Cells    16 SMRT Cells   24 SMRT Cells
    30 min movies    10.5 hr         21 hr           31.5 hr
    45 min movies    12.5 hr         25 hr           37.5 hr
    60 min movies    14.5 hr         29 hr           43.5 hr
    90 min movies    18.5 hr         37 hr           55.5 hr
    120 min movies   22.5 hr         45 hr           67.5 hr

    Note: 24 SMRT Cell runs must be set up separately as a 16-cell and an 8-cell run
    Note: Run times do not include primary analysis
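The run-time figures on this slide are consistent with a simple model: each SMRT Cell costs its movie time plus a fixed overhead. The overheads below (about 35 minutes per cell for diffusion loading without Stage Start, about 48.75 minutes for MagBead loading with Stage Start) are inferred by fitting the slide's numbers and are not PacBio-stated values; the function name and mode labels are hypothetical.

```python
# Per-cell overhead in minutes, inferred from the slide's run-time estimates
# (an assumption, not a vendor specification).
OVERHEAD_MIN = {
    "diffusion_no_stage_start": 35.0,
    "magbead_stage_start": 48.75,
}


def estimated_run_hours(n_cells: int, movie_min: float, mode: str) -> float:
    """Estimated run time in hours, excluding primary analysis."""
    return n_cells * (movie_min + OVERHEAD_MIN[mode]) / 60.0


if __name__ == "__main__":
    for mode in OVERHEAD_MIN:
        for movie in (30, 45, 60, 90, 120):
            hours = [round(estimated_run_hours(n, movie, mode), 1) for n in (8, 16, 24)]
            print(mode, movie, hours)
```

The model matches the slide's estimates to within about 0.1 hr; like the slide, it excludes primary analysis time.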
  25. Analysis

    SMRT® Analysis Software:
    • Genome Analysis Toolkit (GATK):
      – Diploid and haploid calling
      – Incorporates recalibration
    • Minor and Compound Variants
      – Detect rare variants using CCS alignments
    • Quiver
      – Consensus and variant calling (haploid only)
      – Uses QVs to choose the optimal consensus
      – Incorporated into the standard resequencing protocol
      – Genomic Consensus remains an option
    • Barcoding support
      – Group reads by barcodes and run GATK

    [Figure: SNP Detection Pipeline. Stages: Align, Base Quality Recalibration, Call Variants. Tools shown: BWA-SW, BLASR, GATK, Quiver, Genomic Consensus]
  26. SMRT® Portal Resequencing Protocols

    • RS_Resequencing_GATK: Diploid variant calls
    • RS_Resequencing: Resequencing against haploid genomes (uses Quiver)
    • RS_Resequencing_GATK_Barcode: Bucket reads by barcode and run GATK on each barcoded sample
    • RS_Minor_and_Compound_Variants: Call minor and compound variants using CCS alignments
    • RS_Resequencing_CCS and related protocols: Align CCS reads against reference, instead of using subreads as is normally the case
  27. Quiver: A New Consensus Caller for PacBio® Data

    • Takes multiple reads of a given DNA template, outputs best guess of template’s identity
    • QV-aware hidden Markov model to account for sequencing errors; a greedy algorithm to find the maximum likelihood template
    • Can achieve accuracy >Q50 (i.e. >99.999%) using pure PacBio raw reads
    • Same underlying algorithm currently used for CCS generation
    • Additional information at: https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst
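The ">Q50 (i.e. >99.999%)" equivalence follows from the standard Phred scale, where QV = -10 * log10(error probability). A short sketch of the conversion in both directions (function names are illustrative):

```python
from math import log10


def qv_to_accuracy(qv: float) -> float:
    """Accuracy implied by a Phred-scaled quality value."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Phred-scaled quality value implied by an accuracy below 1."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * log10(1.0 - accuracy)


if __name__ == "__main__":
    for qv in (20, 30, 40, 50):
        print(qv, qv_to_accuracy(qv))
```

Each additional 10 QV units corresponds to a tenfold reduction in expected errors.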
  28. GATK Analysis Recommendations

    • GATK is the Broad Institute’s unified genotyper for Bayesian diploid and haploid SNP calling
      – Available at: http://www.broadinstitute.org/gatk/
    • Supports base quality score recalibration through known SNP data
      – Shown to increase variant calling accuracy
      – Uses data from the experiment to correct for variation in quality scores between machine cycles or sequence context
    • GATK has been specifically modified for integration into SMRT Analysis for variation detection
      – Note: Variant confidence recalibration and other advanced operations are not currently supported
      – No special data processing is required for GATK analysis; simply use the appropriate GATK protocols within SMRT Portal and make any appropriate changes to the default settings for Filtering, Mapping, and Consensus
      – For more information, consult the SMRT Analysis and SMRT Pipe documentation (available at http://pacbiodevnet.com/)
  29. Barcoding in SMRT® Analysis

    • SMRT Analysis barcoding: allows for detection and identification of unique barcodes:
      – Splits reads into separate files per barcode
      – Calls variants on individual barcodes using GATK Unified Genotyper
    • Note: Recommend barcode analysis using multiple pass (CCS) inserts
    • Barcode sequence: a set of 48 pairs of 16 bp barcodes custom designed for the PacBio® system, recommended for use in SMRT barcoding protocols:
      – Uniquely distinguishable in both forward and reverse complement orders
      – Optimized for distinguishing between pairs
      – Numerically ordered starting with the most differentiable barcodes
    • Optional padding sequence:
      – A short (5 bp) constant padding sequence (GGTAG) between the barcodes and SMRTbell™ adapter
      – Standardizes ends for ligation
      – Current analysis software assumes padding sequence is used
    • Custom barcodes:
      – Allowed, but users should be aware of the PacBio error mode
      – Homopolymers should be avoided

    https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Barcoding
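For custom barcodes, two of the properties listed above (no homopolymers, distinguishable in forward and reverse complement orientation) can be screened with a short script. This is a sketch under stated assumptions and not PacBio's design procedure: it checks exact matches only, not edit distance, and the homopolymer threshold (runs longer than 2 bases are flagged) and all function names are hypothetical.

```python
from itertools import groupby

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def check_barcodes(barcodes: dict, max_homopolymer: int = 2) -> list:
    """Return human-readable problems found in a {name: sequence} mapping."""
    problems = []
    seen = {}  # sequence -> name of the barcode that introduced it
    for name, raw in barcodes.items():
        seq = raw.upper()
        rc = reverse_complement(seq)
        if longest_homopolymer(seq) > max_homopolymer:
            problems.append(f"{name}: homopolymer run longer than {max_homopolymer}")
        if seq == rc:
            problems.append(f"{name}: identical to its own reverse complement")
        for form in (seq, rc):
            if form in seen:
                problems.append(f"{name}: collides with {seen[form]}")
                break
        seen[seq] = name
    return problems


if __name__ == "__main__":
    # Made-up 6 bp sequences for illustration; real PacBio barcodes are 16 bp.
    print(check_barcodes({"bc1": "ACGTCA", "bc2": "TGACGT", "bc3": "AAAACG"}))
```

A production check would also score pairwise edit distance, since sequencing errors can make near-identical barcodes ambiguous.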
  30. Barcoding Analysis

    • Barcode FASTA file:
      – Default is pre-set to point to the 48 paired barcodes with padding in the reference directory
      – User-defined FASTA files of barcodes should be accessible by SMRT Portal
    • GATK maximum coverage:
      – This value is per barcode per amplicon
      – For large data sets with many barcodes it may be necessary to decrease this number
    • SMRT® Analysis barcode outputs:
      – Barcode FASTQs: a compressed file containing FASTQ files of subreads for each barcode
      – Variants (VCF, GFF): the output from GATK Unified Genotyper, with each barcode identified as a separate read group
    • Command line tools: users wishing to retrieve barcoded CCS reads at the command line can use the tool pbbarcode.py
      – Typing pbbarcode.py --help returns some general help for this script

    https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Barcoding
  31. Minor Variant Detection

    • SMRT® Portal v2.1 allows for detection of minor and compound variants:
      – RS_Minor_and_Compound_Variants protocol for samples with no barcoding
      – RS_Resequencing_GATK_Barcode protocol for barcoded samples
      – Note: SMRT® Portal variant detection analysis works only with multiple pass CCS reads, so these protocols are recommended for smaller insert sizes
    • Reference sequence:
      – A reference sequence must be provided for generating alignments before calling variants
      – If the reference is unknown, the user might want to try generating a consensus sequence by running Quiver (RS_Resequencing) first
      – Note: A reference-agnostic de novo method for detecting minor and compound variants is currently not supported by SMRT® Analysis, though this option may be available in the future
    • Output:
      – rare_variants.gff contains a list of minor variants
      – correlated_variants.csv contains a list of any compound mutations
      – Note: The output files will be empty if no variants were found
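Since the output files may be empty, a downstream script should treat "no records" as a valid result. The sketch below reads generic 9-column, tab-separated GFF lines; it assumes rare_variants.gff follows that standard layout and makes no assumption about which attribute keys the file carries. The function name is hypothetical.

```python
def parse_gff_lines(lines):
    """Parse GFF-style lines into dicts, skipping comments and blank lines.

    Returns an empty list when there are no feature records.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
        seqid, source, feature, start, end, score, strand, phase, attrs = fields
        attributes = {}
        for item in attrs.split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attributes[key.strip()] = value.strip()
        records.append({
            "seqid": seqid,
            "source": source,
            "type": feature,
            "start": int(start),
            "end": int(end),
            "score": score,
            "strand": strand,
            "phase": phase,
            "attributes": attributes,
        })
    return records
```

A typical call would be `with open("rare_variants.gff") as handle: variants = parse_gff_lines(handle)`, followed by a check such as `if not variants:` to report that no minor variants were called.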
  32. SNP Detection and Validation Experimental Design

    • Compatible with standard target enrichment methods
    • When pooling amplicons, aim for uniform amplicon size and concentration
    • Barcoding options available
    • Use PacBio 16 bp barcode if possible
    • Amplicon length <3 kb for best results
    • P4 Binding, C2 Sequencing Kit
    • MagBead loading
    • No Stage Start for shorter amplicons
    • 1 x 45 for shorter amplicons
    • Longer movie times for longer amplicons, especially if CCS needed
    • Variability in number of SMRT® Cells per experiment due primarily to:
      – Desired confidence
      – Amplicon read length
      – Expected variant frequency
      – Sample bias
    • CLR vs CCS tradeoffs
    • Ensure enough reads for sampling statistics
    • Beware of primary analysis bandwidth constraints
    • SMRT Analysis 2.0.1
      – GATK
      – Quiver
      – Minor and Compound Variant Detection
      – Barcoding Support
    • Additional information on DevNet

    Workflow stages shown: Sample Prep; Run Design; Sequencing on the PacBio® RS and primary analysis; Secondary Analysis; Tertiary Analysis
  33. Where to Find Additional Information

    • Customer examples in the form of publications, presentations, posters, and videos available through PacBio’s website (www.pacb.com/target)
    • Technical & Application Notes available through PacBio’s website and customer portal
    • SampleNet
    • DevNet
  34. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.