Track 2: Barcoding and Amplicon Sequencing

PacBio
October 15, 2014

Demultiplexing and Long Amplicon Analysis

Transcript

  1. Barcoding and Amplicon Sequencing Using SMRT® Analysis V2.3

    John Harting, Bioinformatics Scientist, Applications Lab. October 2014.
    FIND MEANING IN COMPLEXITY
    © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.
  2. Targeted Sequencing: High-Resolution Insights

    Exquisite sensitivity and specificity to fully characterize genetic complexity
    – Multi-kilobase reads
    – Achieves 99.999% consensus accuracy
    – Linear variant detection to <0.1% frequency
    – Access to the entire genome

    SNP Detection and Validation; Repeat Expansions; Compound Mutations and Haplotype Phasing; Minor-variant Detection; Iso-Seq™ Full-Length Transcript Sequencing
    www.pacb.com/target
  3. Barcoding Basics

  4. Barcoding Background

    (Diagram: an insert flanked by a barcode on each end.)
    • Short Insert: Polymerase will go around multiple times; multiple opportunities to view barcode.
    • Long Insert: Few polymerases may make >1 pass; many polymerases may not see first barcode (or second one).
  5. Barcoding Solution Works Well for Short Inserts (tested to 6 kb)

    • Barcode During Amplification: 450, 16-bp barcodes can be synthesized into primers (diagram labels: Forward Primer, Forward Barcode, Reverse Primer, Reverse Barcode, PCR).
    • Barcode After Amplification/Fragmentation: 12 adapters with 7-bp barcodes in the stems* (diagram label: Barcode Adapter).
  6. Barcode Scoring Modes

    Symmetric Mode
    • Barcode sequences are the same on both sides of the insert.
    • Recommended for all inserts, including inserts longer than 3 kb.

    Paired Mode (aka Asymmetric)
    • Different barcode sequences on either end of the insert.
    • Only recommended for high multiplex of sequences shorter than 3 kb.
  7. Barcoding – SMRT® Portal Protocols

    Optional Barcoding
    • RS_Subreads
    • RS_ReadsOfInsert
    • RS_Long_Amplicon_Analysis

    Default Barcoding
    • RS_Resequencing_Barcode
  8. Barcoding – Example Use Cases

    Use cases (in slide order):
    • Multiplexed Amplicon Variant Analysis (SNPs)
    • Multiplexed Amplicon Analysis (Indels, SV, SNPs)
    • Multiplexed Assembly
    • Multiplexed Amplicon Minor Variants

    Barcoding methods (in slide order):
    • PCR Append Barcodes (Paired or Symmetric)
    • PCR Append or Ligated Barcodes (Symmetric)
    • PCR Append Barcodes (Paired or Symmetric)
    • Ligated Barcoded Adapters (Symmetric)

    Protocols: RS_Subreads, RS_ReadsOfInsert, RS_Long_Amplicon_Analysis, RS_Resequencing_Barcode
    Analysis tools: Quiver, MinorVariants, WhiteListed HGAP
    Legend: SMRT® Portal, Command Line
  9. Long Amplicon Analysis

  10. Long Amplicon Analysis

    Generalized Amplicon Pipeline
    • Customizable pipeline for de novo analysis of pooled amplicon datasets

    Grouping Algorithms
    • Barcodes (sample prep)
    • Markov Graph Clustering (large-scale differences)
    • Phasing (small-scale differences)

    Polishing and Filtering
    • Barcode Filters (pre-process)
    • Length Filters (pre-process)
    • Quiver Polishing
    • UCHIME Chimera Filter (post-process)
    • Quality (Noise) Filter (post-process)
  11. Process for Long-Amplicon Analysis

    [Workflow diagram. Labels: Overlap; Cluster; Quiver; Phasing; Quiver; Post-Processing Filters; Optionally Separate by Barcode; Haplotype 1; Haplotype 2]
  12. Long Amplicon Analysis Use Cases

    LAA Mode                        | Cluster? | Phase? | Example
    Multiple Gene, Multiple Phases  | X        | X      | HLA
    Single Gene, Multiple Phases    |          | X      | Human Amplicon with Phasing
    Single Gene, Single Phase       |          |        | Clone Validation

    HLA Analysis Type               | Cluster? | Phase? | Note
    Just HLA Class I                | X        | X      |
    Just HLA Class II (single gene) |          | X      |
    Combined HLA Class I & II       | X        | X      | Supported in SMRT® Analysis 2.3
  13. Long Amplicon Analysis Pre-filtering Inputs

    Which ZMWs?
    • SMRT® Portal
    • SMRT Pipe
      – Barcode Module
      – Whitelist Filter: <param name="whiteList"> <value>/path/to/whitelist.txt</value> </param>
    • Command Line
      – barcodes: barcode fofn (from pbbarcode)
      – doBc: specify a subset of all barcodes
      – minBarcodeScore: minimum average barcode score
      – whiteList: a list of subreads to use, in TXT or FASTA format
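The whitelist input above is a list of subread names. As a minimal Python sketch of how such a file might be prepared, the snippet below keeps only subreads from chosen ZMWs. It assumes subread names follow a `movie/zmw/start_end` layout and that the TXT whitelist holds one name per line; both are assumptions, and `select_subreads` and `write_whitelist` are hypothetical helpers, not part of SMRT Analysis.

```python
def select_subreads(subread_names, wanted_zmws):
    """Keep subread names whose ZMW number (second '/'-separated field) is wanted."""
    wanted = set(wanted_zmws)
    kept = []
    for name in subread_names:
        parts = name.split("/")
        # Assumed name layout: <movie>/<zmw>/<start>_<end>
        if len(parts) == 3 and int(parts[1]) in wanted:
            kept.append(name)
    return kept


def write_whitelist(names, path):
    """Write one subread name per line (assumed TXT whitelist layout)."""
    with open(path, "w") as handle:
        for name in names:
            handle.write(name + "\n")


# Hypothetical subread names, for illustration only.
names = ["m1/10/0_1500", "m1/10/1550_3000", "m1/42/0_1400", "m1/7/0_900"]
kept = select_subreads(names, [10, 7])
```

Calling `write_whitelist(kept, "whitelist.txt")` would then produce a file whose path could be supplied as the `whiteList` value.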
  14. Long Amplicon Analysis Run Parameters – SMRT® Portal

    Minimum subread length:
    - Set to 75-95% of the length of the smallest target amplicon*
    Coarse Cluster Subreads by Gene Family:
    - Keep clicked for HLA
    - Unclick for amplicon consensus calling.
    Maximum number of subreads:
    - Set to ~700 reads per gene
    Phase Alleles:
    - Unclick for amplicons with a single allele.
    Ignore Primer When Clustering:
    - # bp to ignore on ends
    Trim Ends:
    - # bp to trim after consensus
    Split Results by Barcode:
    - Generate fasta/q file per barcode
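The minimum subread length guidance (75-95% of the smallest target amplicon) can be worked out with a few lines of Python. This is only a sketch: the amplicon lengths are made-up values, and `min_subread_length_range` is a hypothetical helper.

```python
def min_subread_length_range(amplicon_lengths, low_pct=75, high_pct=95):
    """Return (low, high) bounds in bp for the minimum subread length setting.

    Bounds are percentages of the shortest amplicon; integer arithmetic
    avoids floating-point rounding surprises.
    """
    shortest = min(amplicon_lengths)
    return shortest * low_pct // 100, shortest * high_pct // 100


# Hypothetical pooled amplicon sizes in bp.
low, high = min_subread_length_range([3200, 4100, 5600])
print(low, high)  # 2400 3040
```

Any value between the two bounds would follow the slide's recommendation; which end to prefer depends on how much truncated sequence you are willing to admit.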
  15. Long Amplicon Analysis Run Parameters – SMRT® Pipe

    • Barcoding Options: same
    • LAA Options: same, with the option of adding parameters from the command line tool
  16. Long Amplicon Analysis Run Parameters – SMRT® Analysis Command Line

    • All the above plus lots more!
    • Extra Subread Criteria – maxLength, minReadScore, minSnr
    • Extra Clustering Criteria – maxClusters, clusterInflation (Markov)
    • Extra Phasing Criteria – minSplitScore, minSplitFraction, minSplitReads
    • Extra Filtering Options – minPredictedAccuracy, noChimeraFilter, chimeraScoreThreshold, convergenceFilter
    • Process Control – numThreads & forced threading across barcodes*

    *May cause an out-of-memory error; only recommended for high-memory machines
  17. Long Amplicon Analysis Outputs – SMRT® Portal Tables

  18. Long Amplicon Analysis Outputs

    • Fasta/q: Consensus sequences passing all filters
    • Amplicon Analysis Summary (csv): Pass/Fail status of filters for all consensus sequences
    • Amplicon Analysis (csv): Per-base coverage and QV scores
    • Amplicon Analysis Zmws/Subreads (csv): Mapping of ZMWs/subreads to consensus sequences
    • Amplicon Analysis Chimeras Noise (fasta/q): Consensus sequences failing filters (not available directly from SMRT Portal)
  19. Long Amplicon Analysis Documentation – Coming Soon!

  20. Long Amplicon Analysis Examples – HLA

  21. Phase Information for Full-length HLA Class I and II Genes

    2013 ASHG Poster: Allele-Level Sequencing and Phasing of Full-length HLA Class I and II Genes

    Sample ID | HLA-A Allele1 | HLA-A Allele2 | HLA-B Allele1 | HLA-B Allele2 | HLA-C Allele1 | HLA-C Allele2 | HLA-DRB1 Allele1 | HLA-DRB1 Allele2
    TU01 | A*02:06:01 | A*11:01:01 | B*40:02:01 | B*55:02:01:02 | C*01:02:01 | C*03:03:01 | DRB*09:01:02:01/02 | DRB*15:01:01:03
    TU02 | A*02:01:01:01 | A*31:01:02 | B*51:02:01 | B*56:01:01:02 | C*01:02:01 | C*03:04:01:02 | DRB*09:01:02:02 | DRB*14:05:01:02
    TU03 | A*24:02:01:01 | A*31:01:02 | B*07:02:01 | B*35:01:01:02 | C*03:03:01 | C*07:02:01:03 | DRB*01:01:01 | DRB*14:05:01:02
    TU04 | A*02:06:01 | A*02:07:01 | B*40:02:01 | B*44:03:01 | C*03:03:01 | C*14:03 | DRB*04:10:03:01 | DRB*14:54:01:02
    TU05 | A*26:01:01 | A*31:01:02 | B*15:01:01:01 | B*35:01:01:02 | C*03:04:01:02 | C*07:02:01:04 | DRB*09:01:02:01 | DRB*13:02:01:02
    TU06 | A*26:03:01 | A*33:03:01 | B*15:11:01 | B*44:03:01 | C*03:03:01 | C*14:03 | DRB*04:05:01:01 | DRB*13:02:01:02
    TU07 | A*02:03:01 | A*24:02:01:01 | B*38:02:01 | B*54:01:01 | C*01:02:01 | C*07:02:01:05 | DRB*04:03:01:02 | DRB*08:03:02:02
    TU08 | A*24:02:01:01 | A*33:03:01 | B*44:03:01 | B*48:01:01 | C*08:03:01 | C*14:03 | DRB*13:02:01:02 | DRB*16:02:01:02
    TU09 | A*02:01:01:01 | A*02:06:01 | B*40:06:01:01 | B*48:01:01 | C*08:01:01 | C*15:02:01 | DRB*14:05:01:02 | -
    TU10 | A*11:01:01 | A*31:01:02 | B*40:01:02 | B*51:01:01 | C*07:02:01:01 | C*15:02:01 | DRB*09:01:02:01 | DRB*12:01:01:02
    TU21 | A*03:02:01 | A*24:02:01:01 | B*07:02:01 | B*13:02:01 | C*06:02:01:01 | C*07:02:01:03 | DRB*01:01:01 | DRB*07:01:01:01

    Example PacBio® Result - HLA class I (A, B and C) and class II (DRB1) genes showed:
    • 100% concordance with cDNA reference
    • One mismatch in intron 2 of TU04 versus SS-SBT generated reference
    • Resolved allele ambiguities from PCR-SSO typing when compared to Tokai University Reference Database
  22. Fully-phased, Allele-specific HLA Sequencing

  23. Allele-level HLA Typing Using SMRT® Sequencing: Multiplex Option

    48 x 3 Full-Length HLA Class I Genes
    • Each SMRT Cell generates ~50,000 barcoded sequences
    • Barcodes are identified; sequences are binned (48 bins)
    • Sequences from each bin are clustered by gene type & allele; consensus sequences are generated
    • ~100x coverage per allele
    • Fasta files per allele at ≥Q50
    • Sequence run time = 2 hours
  24. Process for Long-Amplicon Analysis

    [Workflow diagram. Labels: Overlap; Cluster; Quiver; Phasing; Quiver; Post-Processing Filters; Optionally Separate by Barcode; Haplotype 1; Haplotype 2]
  25. Long Amplicon Analysis Examples – Enzymology

  26. Barcoded Amplicon Use Case – Enzyme Engineering

    Design Cycle (stages as labeled on the slide):
    • Formulate design hypotheses
    • High-throughput cloning & purification
    • Screening & data interpretation
    • Detailed studies on interesting mutants
  27. Process for Long-Amplicon Analysis for Barcoded Samples

    [Workflow diagram. Labels: Overlap; Post-Processing Filters; Separate By Barcode; Quiver; Consensus Sequence]
  28. Sequencing & Informatics Workflow

    • Each SMRT® Cell will generate ~50,000 barcoded sequences
    • Barcodes are identified and sequences are binned (384 bins)
    • Sequences from each bin are aligned and bases are called
    • ~100x coverage per clone
    • Q50 accuracy at ~30x coverage
    • Single Fasta file at ≥Q50
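As a rough sanity check on the figures in this workflow, the average number of barcoded sequences per bin can be estimated by an even split. The Python sketch below assumes every sequence is assigned to a barcode and that pooling is perfectly uniform, neither of which the slide claims; `mean_reads_per_bin` is a hypothetical helper.

```python
def mean_reads_per_bin(total_reads, n_bins):
    """Average reads per barcode bin under an idealized even split."""
    return total_reads / n_bins


per_clone = mean_reads_per_bin(50_000, 384)
print(round(per_clone))  # 130

# Compare with the ~30x coverage the slide associates with Q50 accuracy.
print(per_clone >= 30)  # True
```

Under these assumptions the even-split estimate is on the order of the ~100x per clone quoted above, and well over the ~30x quoted for Q50 accuracy. Real pools are uneven, so individual bins will fall above or below this average.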
  29. Coverage and Sequencing Accuracy

    • Assemblies performed with subsets of data at differing levels of coverage
    • At 45X coverage, errors detected with a frequency of 10⁻⁵
    • Above 50X coverage, no errors detected in ~700 kb of sequence
    [Plot: Error Rates, with labeled values 6.4 × 10⁻⁵ and 2.1 × 10⁻⁵]
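The Q50 targets and the error frequencies quoted in this deck are related by the standard Phred scale, QV = -10·log10(p). A small Python sketch of the conversion follows; the function names are illustrative, not from any PacBio tool.

```python
import math


def qv_to_error(qv):
    """Phred quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_to_qv(p):
    """Per-base error probability -> Phred quality value."""
    return -10 * math.log10(p)


# Q50 corresponds to one error in 100,000 bases (an error rate of 10^-5).
print(math.isclose(qv_to_error(50), 1e-5))  # True

# The two error rates labeled on the plot, expressed as quality values.
print(round(error_to_qv(6.4e-5), 1))  # 41.9
print(round(error_to_qv(2.1e-5), 1))  # 46.8
```

On this scale, both labeled error rates sit below Q50, which is reached at an error rate of 10⁻⁵.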
  30. Coverage Levels in Dataset

    Simple pooling of PCR products produced >100X coverage for all 384 clones in a single run.
    [Plots: "Per-base Coverage by Barcode" (axes: Barcode Number, Coverage) and "Rank Sorted Coverage Levels"; reference levels at 50X, 100X, and 200X coverage; 385X mean coverage]
  31. Sanger vs PacBio® Sequencing

    Starting material: 384 plasmids
    • Sanger Sequencing: 384 x 5 sequencing reactions = 1,920 reactions; 700-850 bp (diagram: Vector, Gene, Vector)
    • PacBio Sequencing: 384 PCR reactions, 1 template prep & 1 SMRT® Cell; 1,700 bp (diagram: Vector, Gene, Vector)
  32. Long Amplicon Analysis Walk-Through

  33. Create New Long Amplicon Analysis Job in SMRT® Portal 2.3

  34. Create New Long Amplicon Analysis Job in SMRT® Portal 2.3

  35. Create New Long Amplicon Analysis Job in SMRT® Portal 2.3

  36. LAA Protocol Parameters in SMRT® Portal 2.3: Activate barcodes

  37. LAA Barcode Parameters in SMRT® Portal 2.3

    My library has DNA Barcodes that are:
    - Symmetric in most cases.
    Minimum Barcode score:
    - Maximum score is 2x(length barcode), which in this case is 2x16=32.
    - For 16 bp barcodes, a minimum score of 22 results in less than 1% false positive scores.
    Barcode FASTA file:
    - Enter the location of your barcode file here.
    - Default is PacBio set of 384 barcodes.
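The barcode score arithmetic on this slide can be sketched in Python. The maximum score follows the slide's 2 x (barcode length) rule; treating the minimum score as an inclusive cutoff is an assumption, and the ZMW names and scores are made up for illustration.

```python
def max_barcode_score(barcode_length):
    """Maximum score per the slide: 2 x barcode length."""
    return 2 * barcode_length


def passes_min_score(score, min_score=22):
    """True if a barcode score meets the cutoff (assumed inclusive)."""
    return score >= min_score


print(max_barcode_score(16))  # 32

# Hypothetical per-ZMW barcode scores.
scores = {"zmw_a": 30, "zmw_b": 21, "zmw_c": 22}
kept = sorted(name for name, s in scores.items() if passes_min_score(s))
print(kept)  # ['zmw_a', 'zmw_c']
```

Raising `min_score` trades yield for fewer false barcode assignments; the slide's value of 22 applies to 16 bp barcodes.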
  38. LAA Amplicon Parameters in SMRT® Portal 2.3

    Minimum subread length:
    - Set to 80% of your insert size
    Coarse Cluster Subreads by Gene Family:
    - Keep clicked for HLA
    - Unclick for amplicon consensus calling.
    Maximum number of subreads:
    - Set to 200
    Phase Alleles:
    - Keep clicked for HLA or other applications where you expect 2 alleles.
    - Unclick for amplicons with a single allele.
  39. LAA Output – Overview 39

  40. Bioinformatics Workflow

    Output is a multiFASTA file – One Consensus Sequence per Barcode
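Because the output is a multiFASTA file, each consensus sequence can be pulled out with a few lines of Python. This is a minimal sketch using only the standard library; the record names in the example are invented, and the actual header format written by Long Amplicon Analysis is not shown in this deck.

```python
def parse_fasta(text):
    """Parse multiFASTA text into {record_name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


# Hypothetical two-record file: one consensus per barcode.
example = ">barcode1_consensus\nACGT\nACGT\n>barcode2_consensus\nTTGA\n"
seqs = parse_fasta(example)
print(len(seqs))  # 2
```

With one consensus per barcode, the number of records gives a quick count of how many barcoded samples produced a result.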
  41. Command Line Walk-through – Quality Control Check

    Long Amplicon Analysis – Investigate Single Barcode
    • Goals
      – Use Long Amplicon Analysis from the command line
      – Learn about LAA outputs
      – Verify Barcode Minimum Score
    • SMRT® Analysis Tools
      – Long Amplicon Analysis
      – BLASR
      – cmph5tools
    • 3rd Party Tools
      – Basic Linux/Unix (awk, grep, text editor)
  42. Command Line Walk-through 2 – Barcoded Variant Analysis Prep

    Resequencing Reads of Insert and Splitting by Barcode
    • Goals
      – Prepare barcoded alignment files for Minor Variants analysis
      – Learn about cmph5tools
    • SMRT® Analysis Tools
      – SMRT Portal ReadsOfInsert
      – pbbarcode
      – cmph5tools
      – (MinorVariants)
  43. For Research Use Only. Not for use in diagnostic procedures.

    Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.