Slide 1

Slide 1 text

Barcoding and Amplicon Sequencing Using SMRT® Analysis V2.3
John Harting, Bioinformatics Scientist, Applications Lab
October 2014
FIND MEANING IN COMPLEXITY
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. For Research Use Only. Not for use in diagnostic procedures.

Slide 2

Slide 2 text

Targeted Sequencing: High-Resolution Insights
Exquisite sensitivity and specificity to fully characterize genetic complexity
– Multi-kilobase reads
– Achieves 99.999% consensus accuracy
– Linear variant detection to <0.1% frequency
– Access to the entire genome
• SNP Detection and Validation
• Repeat Expansions
• Compound Mutations and Haplotype Phasing
• Minor-variant Detection
• Iso-Seq™ Full-Length Transcript Sequencing
www.pacb.com/target

Slide 3

Slide 3 text

Barcoding Basics

Slide 4

Slide 4 text

Barcoding Background
[Diagram labels: Barcode, Insert, Barcode]
• Short Insert: Polymerase will go around multiple times; multiple opportunities to view barcode
• Long Insert: Few polymerases may make >1 pass; many polymerases may not see first barcode (or second one)
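The contrast between short and long inserts can be illustrated with a rough estimate. The sketch below assumes that one pass consumes about one insert length of polymerase read (adapter length is ignored), and the 8,000 bp polymerase read length is a hypothetical value chosen for illustration, not a figure from this presentation.

```python
def expected_passes(polymerase_read_length_bp, insert_length_bp):
    """Rough number of times the polymerase traverses the insert.

    Assumption: adapters are ignored, so passes = read length / insert length.
    """
    if insert_length_bp <= 0:
        raise ValueError("insert length must be positive")
    return polymerase_read_length_bp / insert_length_bp


# Hypothetical polymerase read length of 8,000 bp (illustrative only).
for insert_bp in (500, 2000, 6000, 10000):
    passes = expected_passes(8000, insert_bp)
    print(f"{insert_bp:>6} bp insert: ~{passes:.1f} passes")
```

A value below 1 means the polymerase is unlikely to reach the far end of the insert, which is why the second barcode may never be observed on long inserts.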

Slide 5

Slide 5 text

Barcoding Solution Works Well for Short Inserts (tested to 6 kb)
Barcode During Amplification
• 450 16-bp barcodes can be synthesized into primers
• [Diagram labels: Forward Primer, Forward Barcode, Reverse Primer, Reverse Barcode, PCR]
Barcode After Amplification/Fragmentation
• 12 adapters with 7-bp barcodes in the stems*
• [Diagram label: Barcode Adapter]

Slide 6

Slide 6 text

Barcode Scoring Modes
Symmetric Mode
• Barcode sequences are the same on both sides of the insert.
• Recommended for all inserts, including inserts longer than 3 kb.
Paired Mode (aka Asymmetric)
• Different barcode sequences on either end of the insert.
• Only recommended for high multiplex of sequences shorter than 3 kb.

Slide 7

Slide 7 text

Barcoding – SMRT® Portal Protocols
Optional Barcoding
• RS_Subreads
• RS_ReadsOfInsert
• RS_Long_Amplicon_Analysis
Default Barcoding
• RS_Resequencing_Barcode

Slide 8

Slide 8 text

Barcoding – Example Use Cases
[Diagram linking use cases to barcode types, protocols, and tools; items are listed in the order they appear]
Use cases:
• Multiplexed Amplicon Variant Analysis (SNPs)
• Multiplexed Amplicon Analysis (Indels, SV, SNPs)
• Multiplexed Assembly
• Multiplexed Amplicon Minor Variants
Barcode types:
• PCR Append Barcodes (Paired or Symmetric)
• PCR Append or Ligated Barcodes (Symmetric)
• PCR Append Barcodes (Paired or Symmetric)
• Ligated Barcoded Adapters (Symmetric)
SMRT® Portal:
• RS_Subreads
• RS_ReadsOfInsert
• RS_Long_Amplicon_Analysis
• RS_Resequencing_Barcode
Command Line:
• Quiver
• MinorVariants
• WhiteListed HGAP

Slide 9

Slide 9 text

Long Amplicon Analysis

Slide 10

Slide 10 text

Long Amplicon Analysis
Generalized Amplicon Pipeline
• Customizable pipeline for de novo analysis of pooled amplicon datasets
Grouping Algorithms
• Barcodes (sample prep)
• Markov Graph Clustering (large-scale differences)
• Phasing (small-scale differences)
Polishing and Filtering
• Barcode Filters (pre-process)
• Length Filters (pre-process)
• Quiver Polishing
• UCHIME Chimera Filter (post-process)
• Quality (Noise) Filter (post-process)

Slide 11

Slide 11 text

Process for Long-Amplicon Analysis
[Workflow diagram. Steps shown: Optionally Separate by Barcode; Overlap; Cluster; Quiver; Phasing; Quiver; Post-Processing Filters. Outputs: Haplotype 1, Haplotype 2]

Slide 12

Slide 12 text

Long Amplicon Analysis Use Cases

LAA Mode                         Cluster?  Phase?  Example
Multiple Gene, Multiple Phases   X         X       HLA
Single Gene, Multiple Phases               X       Human Amplicon with Phasing
Single Gene, Single Phase                          Clone Validation

HLA Analysis Type                Cluster?  Phase?  Note
Just HLA Class I                 X         X
Just HLA Class II (single gene)            X
Combined HLA Class I & II        X         X       Supported in SMRT® Analysis 2.3

Slide 13

Slide 13 text

Long Amplicon Analysis Pre-filtering Inputs
Which ZMWs?
• SMRT® Portal
• SMRT Pipe
• Barcode Module
• Whitelist Filter /path/to/whitelist.txt
• Command Line
  – barcodes - barcode fofn (from pbbarcode)
  – doBc - Specify a subset of all barcode
  – minBarcodeScore - Minimum average barcode
  – whiteList - A list of subreads to use in TXT or FASTA format.

Slide 14

Slide 14 text

Long Amplicon Analysis Run Parameters
SMRT® Portal
Minimum subread length:
- Set to 75-95% of the length of the smallest target amplicon*
Coarse Cluster Subreads by Gene Family:
- Keep clicked for HLA
- Unclick for amplicon consensus calling.
Maximum number of subreads:
- Set to ~700 reads per Gene
Phase Alleles:
- Unclick for amplicons with a single allele.
Ignore Primer When Clustering:
- # bp to ignore on ends
Trim Ends:
- # bp to trim after consensus
Split Results by Barcode:
- Generate fasta/q file per barcode
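The minimum subread length guidance (75-95% of the smallest target amplicon) is simple arithmetic, sketched below. The function name and the amplicon lengths are hypothetical and only illustrate the calculation; integer percentages are used to avoid floating-point rounding.

```python
def min_subread_length_range(amplicon_lengths_bp, low_pct=75, high_pct=95):
    """Return the (low, high) bounds for the minimum subread length setting.

    Bounds are low_pct and high_pct percent of the smallest target amplicon.
    """
    if not amplicon_lengths_bp:
        raise ValueError("at least one amplicon length is required")
    smallest = min(amplicon_lengths_bp)
    return smallest * low_pct // 100, smallest * high_pct // 100


# Hypothetical pool of three amplicons (lengths in bp).
low, high = min_subread_length_range([3000, 3400, 5200])
print(f"Set minimum subread length between {low} and {high} bp")
```

Choosing a value near the upper bound excludes more truncated subreads, while a value near the lower bound retains more data.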

Slide 15

Slide 15 text

Long Amplicon Analysis Run Parameters
SMRT® Pipe
• Barcoding Options: same
• LAA Options: same, with option of adding parameters from command line tool

Slide 16

Slide 16 text

Long Amplicon Analysis Run Parameters
SMRT® Analysis Command Line
• All the above plus lots more!
• Extra Subread Criteria
  – maxLength, minReadScore, minSnr
• Extra Clustering Criteria
  – maxClusters, clusterInflation (Markov)
• Extra Phasing Criteria
  – minSplitScore, minSplitFraction, minSplitReads
• Extra Filtering Options
  – minPredictedAccuracy, noChimeraFilter, chimeraScoreThreshold, convergenceFilter
• Process Control
  – numThreads & forced threading across barcodes*
*May cause out of memory error, only recommended for high memory machines

Slide 17

Slide 17 text

Long Amplicon Analysis Outputs
SMRT® Portal Tables

Slide 18

Slide 18 text

Long Amplicon Analysis Outputs
Fasta/q
• Consensus sequences passing all filters
Amplicon Analysis Summary (csv)
• Pass/Fail status of filters for all consensus sequences
Amplicon Analysis (csv)
• Per-base coverage and QV scores
Amplicon Analysis Zmws/Subreads (csv)
• Mapping of ZMWs/subreads to consensus sequences
Amplicon Analysis Chimeras Noise (fasta/q)
• Consensus sequences failing filters. (Not available directly from SMRT Portal)

Slide 19

Slide 19 text

Long Amplicon Analysis Documentation
Coming Soon!

Slide 20

Slide 20 text

Long Amplicon Analysis Examples – HLA

Slide 21

Slide 21 text

Phase Information for Full-length HLA Class I and II Genes
2013 ASHG Poster: Allele-Level Sequencing and Phasing of Full-length HLA Class I and II Genes

Sample ID  HLA-A Allele1  HLA-A Allele2  HLA-B Allele1  HLA-B Allele2  HLA-C Allele1  HLA-C Allele2  HLA-DRB1 Allele1    HLA-DRB1 Allele2
TU01       A*02:06:01     A*11:01:01     B*40:02:01     B*55:02:01:02  C*01:02:01     C*03:03:01     DRB*09:01:02:01/02  DRB*15:01:01:03
TU02       A*02:01:01:01  A*31:01:02     B*51:02:01     B*56:01:01:02  C*01:02:01     C*03:04:01:02  DRB*09:01:02:02     DRB*14:05:01:02
TU03       A*24:02:01:01  A*31:01:02     B*07:02:01     B*35:01:01:02  C*03:03:01     C*07:02:01:03  DRB*01:01:01        DRB*14:05:01:02
TU04       A*02:06:01     A*02:07:01     B*40:02:01     B*44:03:01     C*03:03:01     C*14:03        DRB*04:10:03:01     DRB*14:54:01:02
TU05       A*26:01:01     A*31:01:02     B*15:01:01:01  B*35:01:01:02  C*03:04:01:02  C*07:02:01:04  DRB*09:01:02:01     DRB*13:02:01:02
TU06       A*26:03:01     A*33:03:01     B*15:11:01     B*44:03:01     C*03:03:01     C*14:03        DRB*04:05:01:01     DRB*13:02:01:02
TU07       A*02:03:01     A*24:02:01:01  B*38:02:01     B*54:01:01     C*01:02:01     C*07:02:01:05  DRB*04:03:01:02     DRB*08:03:02:02
TU08       A*24:02:01:01  A*33:03:01     B*44:03:01     B*48:01:01     C*08:03:01     C*14:03        DRB*13:02:01:02     DRB*16:02:01:02
TU09       A*02:01:01:01  A*02:06:01     B*40:06:01:01  B*48:01:01     C*08:01:01     C*15:02:01     DRB*14:05:01:02     -
TU10       A*11:01:01     A*31:01:02     B*40:01:02     B*51:01:01     C*07:02:01:01  C*15:02:01     DRB*09:01:02:01     DRB*12:01:01:02
TU21       A*03:02:01     A*24:02:01:01  B*07:02:01     B*13:02:01     C*06:02:01:01  C*07:02:01:03  DRB*01:01:01        DRB*07:01:01:01

Example PacBio® Result - HLA class I (A, B and C) and class II (DRB1) genes showed:
• 100% concordance with cDNA reference
• One mismatch in intron 2 of TU04 versus SS-SBT generated reference
• Resolved allele ambiguities from PCR-SSO typing when compared to Tokai University Reference Database

Slide 22

Slide 22 text

Fully-phased, Allele-specific HLA Sequencing

Slide 23

Slide 23 text

Allele-level HLA Typing Using SMRT® Sequencing: Multiplex Option
48 x 3 Full-Length HLA Class I Genes
• Each SMRT Cell generates ~50,000 barcoded sequences
• Sequence run time = 2 hours
• Barcodes are identified; sequences are binned (48 Bins)
• Sequences from each bin are clustered by gene type & allele; consensus sequences are generated
• ~100x coverage per allele
• Fasta files per allele at ≥Q50
[Diagram labels: Barcode 1, Barcode 2, Barcode 3]

Slide 24

Slide 24 text

Process for Long-Amplicon Analysis
[Workflow diagram. Steps shown: Optionally Separate by Barcode; Overlap; Cluster; Quiver; Phasing; Quiver; Post-Processing Filters. Outputs: Haplotype 1, Haplotype 2]

Slide 25

Slide 25 text

Long Amplicon Analysis Examples – Enzymology

Slide 26

Slide 26 text

Barcoded Amplicon Use Case – Enzyme Engineering
Design Cycle
• Formulate design hypotheses
• High-throughput cloning & purification
• Screening & data interpretation
• Detailed studies on interesting mutants

Slide 27

Slide 27 text

Process for Long-Amplicon Analysis for Barcoded Samples
[Workflow diagram. Steps shown: Separate By Barcode; Overlap; Quiver; Post-Processing Filters. Output: Consensus Sequence]

Slide 28

Slide 28 text

Sequencing & Informatics Workflow
• Each SMRT® Cell Will Generate ~50,000 Barcoded Sequences
• Barcodes Are Identified and Sequences Are Binned (384 Bins)
• Sequences from Each Bin Are Aligned and Bases Are Called
• ~100x coverage per clone
  – Q50 accuracy at ~30x coverage
• Single Fasta file at ≥Q50
[Diagram labels: Barcode 1, Barcode 2, Barcode 3]
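As a back-of-the-envelope check on the figures above, the sketch below divides the ~50,000 barcoded sequences evenly across 384 bins. Even pooling is an assumption; real yields per barcode vary, and some reads fail barcode or length filters, so this is an upper-bound estimate rather than a measured value.

```python
def mean_reads_per_bin(total_barcoded_sequences, n_bins):
    """Average number of barcoded sequences per bin, assuming even pooling."""
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    return total_barcoded_sequences / n_bins


# Figures from the slide: ~50,000 barcoded sequences, 384 bins.
per_clone = mean_reads_per_bin(50_000, 384)
print(f"~{per_clone:.0f} sequences per clone before filtering")
```

Under this assumption the estimate comes out somewhat above the ~100x per clone quoted on the slide, which leaves room for uneven pooling and filtered reads.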

Slide 29

Slide 29 text

Coverage and Sequencing Accuracy
• Assemblies performed with subsets of data at differing levels of coverage
• At 45X coverage, errors detected with a frequency of 10^-5
• Above 50X coverage, no errors detected in ~700 kb of sequence
[Chart: Error Rates; labeled values 6.4 x 10^-5 and 2.1 x 10^-5]
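The error frequencies on this slide relate to the Q values quoted elsewhere in the deck through the standard Phred relation Q = -10 * log10(p). The sketch below applies that relation; the function names are illustrative and not part of SMRT Analysis.

```python
import math


def phred_from_error_rate(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def error_rate_from_phred(q):
    """Convert a Phred-scaled quality value back to an error rate."""
    return 10 ** (-q / 10)


# Error rates that appear on the slide.
for rate in (1e-5, 6.4e-5, 2.1e-5):
    print(f"error rate {rate:.1e} -> Q{phred_from_error_rate(rate):.1f}")
```

An error frequency of 10^-5 corresponds to Q50, matching the ≥Q50 consensus target mentioned for the Fasta output.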

Slide 30

Slide 30 text

Coverage Levels in Dataset
Simple pooling of PCR products produced >100X coverage for all 384 clones in a single run.
[Plots: Per-base Coverage by Barcode and Rank Sorted Coverage Levels; axis labels: Barcode Number, Coverage; marked levels: 50X coverage, 100X coverage, 200X coverage, 385X mean coverage]

Slide 31

Slide 31 text

Sanger vs PacBio® Sequencing
384 Plasmids
Sanger Sequencing
• 384 x 5 Sequencing Reactions
• 1,920 reactions
PacBio Sequencing
• 384 PCR Reactions
• 1 template prep & 1 SMRT® Cell
[Diagram labels: Vector, Gene, Vector; 1,700 bp; 700-850 bp]

Slide 32

Slide 32 text

Long Amplicon Analysis Walk-Through

Slide 33

Slide 33 text

Create New Long Amplicon Analysis Job in SMRT® Portal 2.3

Slide 34

Slide 34 text

Create New Long Amplicon Analysis Job in SMRT® Portal 2.3

Slide 35

Slide 35 text

Create New Long Amplicon Analysis Job in SMRT® Portal 2.3

Slide 36

Slide 36 text

LAA Protocol Parameters in SMRT® Portal 2.3
Activate barcodes

Slide 37

Slide 37 text

LAA Barcode Parameters in SMRT® Portal 2.3
My library has DNA Barcodes that are:
- Symmetric in most cases.
Minimum Barcode score:
- Maximum score is 2x(length barcode), which in this case is 2x16=32.
- For 16 bp barcodes, a minimum score of 22 results in less than 1% false positive scores.
Barcode FASTA file:
- Enter the location of your barcode file here.
- Default is PacBio set of 384 barcodes.
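The score arithmetic above can be sketched as follows. The ZMW names and scores are hypothetical, and treating the minimum score as inclusive (score >= threshold) is an assumption; this is not the pbbarcode implementation.

```python
def max_barcode_score(barcode_length_bp):
    """Maximum barcode score as stated on the slide: 2 x barcode length."""
    return 2 * barcode_length_bp


def passes_min_score(score, min_score=22):
    """Assumption: a read is kept when its score is at least min_score."""
    return score >= min_score


# Hypothetical per-ZMW barcode scores for 16-bp barcodes.
scores = {"zmw_a": 31, "zmw_b": 22, "zmw_c": 17}
kept = {name: s for name, s in scores.items() if passes_min_score(s)}

print("maximum score:", max_barcode_score(16))
print("kept:", sorted(kept))
```

With 16-bp barcodes, the suggested threshold of 22 sits at roughly 69% of the maximum score of 32.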

Slide 38

Slide 38 text

LAA Amplicon Parameters in SMRT® Portal 2.3
Minimum subread length:
- Set to 80% of your insert size
Coarse Cluster Subreads by Gene Family:
- Keep clicked for HLA
- Unclick for amplicon consensus calling.
Maximum number of subreads:
- Set to 200
Phase Alleles:
- Keep clicked for HLA or other applications where you expect 2 alleles.
- Unclick for amplicon with a single allele.

Slide 39

Slide 39 text

LAA Output – Overview

Slide 40

Slide 40 text

Bioinformatics Workflow
Output is a multiFASTA file – One Consensus Sequence per Barcode
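One way to sanity-check a multiFASTA output is to count consensus records per barcode. The sketch below uses a minimal FASTA reader; the record names and the rule for extracting a barcode from a header are hypothetical, since the actual LAA header format is not shown here.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def barcodes_with_multiple_consensus(records, barcode_of):
    """Return barcodes that have more than one consensus record."""
    counts = Counter(barcode_of(header) for header, _ in records)
    return sorted(b for b, n in counts.items() if n > 1)


# Hypothetical records; real LAA headers may be formatted differently.
example = io.StringIO(">bc1_seq1\nACGT\nACGT\n>bc2_seq1\nGGCC\n>bc2_seq2\nGGCA\n")
records = list(read_fasta(example))
flagged = barcodes_with_multiple_consensus(records, lambda h: h.split("_")[0])
print(len(records), "records; barcodes to review:", flagged)
```

For a real file, replace the `io.StringIO` example with `open(path)` and supply a `barcode_of` function that matches the header layout in your output.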

Slide 41

Slide 41 text

Command Line Walk-through – Quality Control Check
Long Amplicon Analysis – Investigate Single Barcode
• Goals
  – Use Long Amplicon Analysis from the command line
  – Learn about LAA outputs
  – Verify Barcode Minimum Score
• SMRT® Analysis Tools
  – Long Amplicon Analysis
  – BLASR
  – cmph5tools
• 3rd Party Tools
  – Basic Linux/Unix (awk, grep, text editor)

Slide 42

Slide 42 text

Command Line Walk-through 2 – Barcoded Variant Analysis Prep
Resequencing Reads of Insert and Splitting by Barcode
• Goals
  – Prepare barcoded alignment files for Minor Variants analysis
  – Learn about cmph5tools
• SMRT® Analysis Tools
  – SMRT Portal ReadsOfInsert
  – pbbarcode
  – cmph5tools
  – (MinorVariants)

Slide 43

Slide 43 text

For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.