Slide 1

Slide 1 text

FIND MEANING IN COMPLEXITY
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

Primary Analysis
© Copyright 2012 by Pacific Biosciences of California, Inc. All rights reserved.

Slide 2

Slide 2 text

Learning Objectives

Scientists and Bioinformaticians:
• Familiar with bioinformatics concepts
• Interested in learning basics of SMRT Bioinformatics workflow with the PacBio® RS
• SMRT® Technology
• PacBio® RS Workflow
• Bioinformatics Overview

After the training, you will be able to describe:
• Sequencing to Primary Analysis Workflow
• Signal Processing and Detection
• Single Molecule Real-time Basecalling
• Quality Values

Slide 3

Slide 3 text

Integration with the Detection System
[Figure: Trace File, Pipeline, and example base sequences (ACCGTTA… CCAGTAT… GGGACCTA)]

Slide 4

Slide 4 text

Primary Analysis Workflow
• Stages: Movie-to-Trace, Trace-to-Pulse, Pulse-to-Base, Circular Consensus
• Algorithms reduce photon counts vs. time to sequences in real time
[Figure: Data Volume, log (size/GB), plotted against Time. Labeled data volumes: ~4 TB, ~50 GB, ~10 GB. Two time intervals are labeled 30 min. Example output: GCAACGATCACCTAAA…]
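As a quick check on the figure's log axis, the three labeled volumes can be converted to log10(size/GB). This is a minimal sketch: it assumes 1 TB = 1000 GB, and it does not assign the volumes to specific pipeline stages because the slide text does not state that mapping.

```python
import math

# Data volumes labeled in the figure, expressed in GB (assumes 1 TB = 1000 GB).
volumes_gb = {"~4 TB": 4000.0, "~50 GB": 50.0, "~10 GB": 10.0}

# Position of each volume on a log10(size/GB) axis.
log_sizes = {label: round(math.log10(gb), 2) for label, gb in volumes_gb.items()}

# Ratio between the largest and smallest labeled volume.
overall_reduction = volumes_gb["~4 TB"] / volumes_gb["~10 GB"]

for label, value in log_sizes.items():
    print(f"{label}: log10(size/GB) = {value}")
print(f"Largest to smallest: {overall_reduction:.0f}x")
```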

Slide 5

Slide 5 text

System Architecture
[Figure: system block diagram. Components: Instrument Control, Acquisition and Signal Processing, Blade Center (Data Analysis), File Format and Delivery. Control: Acquisition Control, Pipeline Job Control. Pipeline stages shown: Idle, MovieToTrace, TraceToPulse, PulseToBase, Circular Consensus, Generate QC. Data: Sequencer, ZMW, Frames, Trace File, Pulse File, Base File, Circular Cons. File, Basecalls File, FASTA File, and Summary QC Report, delivered to the Customer URI.]

Slide 6

Slide 6 text

PulseToBase Inputs
• Receive observation list from TraceToPulse
• Each pulse has vector of associated measurements
  – Duration
  – Spectrum
  – Intensity
  – Spacing to neighbors
  – Local context
  – etc…
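The per-pulse measurement vector can be pictured as a small record type. The sketch below is illustrative only: the class name `Pulse`, the field names, the units, and the four-channel spectrum are assumptions, not the format of the pulse file, and local context is left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pulse:
    """One detected pulse and its measurements (names and units are illustrative)."""
    duration: float        # pulse duration, here in seconds
    spectrum: List[float]  # relative signal per detection channel
    intensity: float       # mean signal level, arbitrary units
    spacing_before: float  # gap to the previous pulse, seconds
    spacing_after: float   # gap to the next pulse, seconds

    def feature_vector(self) -> List[float]:
        """Flatten the measurements into one numeric vector."""
        return [self.duration, self.intensity,
                self.spacing_before, self.spacing_after, *self.spectrum]


# A hypothetical observation list, as a basecaller might receive it.
observations = [
    Pulse(0.12, [0.9, 0.05, 0.03, 0.02], 310.0, 0.40, 0.75),
    Pulse(0.08, [0.1, 0.80, 0.05, 0.05], 275.0, 0.75, 0.30),
]
vectors = [p.feature_vector() for p in observations]
print(vectors[0])
```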

Slide 7

Slide 7 text

Modeling of Insertions/Deletions
• Observation sequence can contain insertions and deletions with respect to the template sequence
• Need a model of the likelihood of insertions/deletions given pulse features and local trace neighborhood
[Figure: trace with candidate pulses labeled "Insertion?" and "Deletion?"]
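To make the terms concrete, the sketch below compares an observed base sequence against a known template and labels where the observation has extra bases (insertion) or missing bases (deletion). It uses Python's `difflib` as a stand-in for a real aligner. It is not the PulseToBase likelihood model, which works from pulse features rather than from a known template, and the function name `classify_indels` is hypothetical.

```python
from difflib import SequenceMatcher


def classify_indels(template: str, observed: str):
    """Return (kind, template_position, bases) for each indel in `observed`."""
    matcher = SequenceMatcher(None, template, observed, autojunk=False)
    events = []
    for tag, t0, t1, o0, o1 in matcher.get_opcodes():
        if tag == "insert":    # bases present only in the observation
            events.append(("insertion", t0, observed[o0:o1]))
        elif tag == "delete":  # template bases missing from the observation
            events.append(("deletion", t0, template[t0:t1]))
        # "replace" spans (substitutions or mixed edits) are ignored here
    return events


print(classify_indels("ACGTTA", "ACGGTTA"))  # observation has an extra base
print(classify_indels("ACGTTA", "ACGTA"))    # observation is missing a base
```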

Slide 8

Slide 8 text

PulseToBase Summary
• Single-molecule pulse events are sequential: no phasing problem, no Sanger limit
• The main kinetic information retained in the bas.h5 output files is the interpulse duration (IPD) and the pulse width (PW)
• Quality Values
  – Substitution
  – Insertion
  – Deletion
  – Merge
  – Sum of all error probabilities
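The last quality value, the sum of all error probabilities, can be worked through numerically. This is a sketch under one stated assumption: that each QV is Phred-scaled, QV = -10·log10(p). The function names are illustrative and are not part of any PacBio API.

```python
import math


def qv_to_prob(qv: float) -> float:
    """Phred-style QV to error probability (assumed scaling)."""
    return 10 ** (-qv / 10)


def prob_to_qv(p: float) -> float:
    """Error probability back to a Phred-style QV."""
    return -10 * math.log10(p)


def combined_qv(substitution: float, insertion: float,
                deletion: float, merge: float) -> float:
    """QV of the summed error probabilities of the four error types."""
    total = sum(qv_to_prob(q) for q in (substitution, insertion, deletion, merge))
    return prob_to_qv(min(total, 1.0))  # cap at 1.0 so the result stays a valid probability


print(round(combined_qv(20, 20, 20, 20), 2))
```

Because probabilities are added rather than QV numbers, the combined value is always lower than the smallest of the four inputs.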

Slide 9

Slide 9 text

Additional Primary Analysis Tasks
• Adapter and Insert Screening = Annotates adapter locations and insert DNA regions in the raw read. Used to break a read into subreads during secondary analysis mapping and Circular Consensus.
• Productivity Assignment = Assigns a productivity score of 0, 1 or 2 to each sequencing ZMW.
• High Quality Region Screening = Annotates the high quality sequencing regions of a read to be used during Raw Read Trimming.
• Read Quality Assignment = A trained prediction of a read’s mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, interpulse duration, and so on). Used during secondary analysis filtering.
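The way adapter and high quality region annotations are used to break a raw read into subreads can be sketched as follows. The interval convention (half-open, 0-based), the function name `split_subreads`, and the toy read are assumptions made for illustration.

```python
def split_subreads(read, hq_region, adapters):
    """Cut `read` into subreads: the parts of the HQ region that lie between adapters.

    hq_region is (start, end); adapters is a list of (start, end); all half-open.
    """
    hq_start, hq_end = hq_region
    subreads, cursor = [], hq_start
    for a_start, a_end in sorted(adapters):
        if a_end <= hq_start or a_start >= hq_end:
            continue  # adapter lies outside the high quality region
        if a_start > cursor:
            subreads.append(read[cursor:a_start])
        cursor = max(cursor, a_end)
    if cursor < hq_end:
        subreads.append(read[cursor:hq_end])
    return subreads


# Toy read: n = low quality flanks, x and y = adapters, capitals = insert DNA.
read = "nnACGTACxxxxTTGGCCyyyyGATnn"
subreads = split_subreads(read, hq_region=(2, 25), adapters=[(8, 12), (18, 22)])
print(subreads)
```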

Slide 10

Slide 10 text

From Raw Reads to Circular Consensus Sequences (CCS)
• Subreads (purple and gold) are separated by adapter sequences (green)
• ≥ 2 full polymerase passes required for CCS
• Individual subreads or CCS reads can be used for subsequent analysis
[Figure: Raw Read, Subreads, Circular Consensus Sequence (CCS)]
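The pass requirement can be expressed as a simple check. The sketch assumes that a full pass is a subread bounded by an adapter on both sides, so n adapters inside the high quality region give n - 1 full passes. That definition and the function names are assumptions, not taken from the slides.

```python
def count_full_passes(adapters, hq_region):
    """Count subreads bounded by an adapter on both sides (assumed definition)."""
    hq_start, hq_end = hq_region
    inside = [a for a in adapters if a[0] >= hq_start and a[1] <= hq_end]
    return max(len(inside) - 1, 0)


def ccs_eligible(adapters, hq_region, min_full_passes=2):
    """True when the read has at least `min_full_passes` full passes."""
    return count_full_passes(adapters, hq_region) >= min_full_passes


# Hypothetical adapter intervals (start, end) within a 2500-base HQ region.
three_adapters = [(100, 145), (1100, 1145), (2100, 2145)]
print(count_full_passes(three_adapters, (0, 2500)), ccs_eligible(three_adapters, (0, 2500)))
```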

Slide 11

Slide 11 text

Pipeline Summary
• Fast sequencing times with real time analysis = fast time to results
• Primary analysis results are erased after successful transfer to the secondary storage server
• Transfer of trace and pulse files to secondary storage is possible, but not suggested
• Proactive monitoring by PacBio Tech Support if RS Insight access is enabled
  – Sequencing QC metrics are retained on the Blade Center

Slide 12

Slide 12 text

Sequencing Trace/Pulse View

Slide 13

Slide 13 text

Pulse View 1: 4226 bp Trace

Slide 14

Slide 14 text

Pulse View 2: Close-up

Slide 15

Slide 15 text

Output Directories and Files

Slide 16

Slide 16 text

Primary Data Output Structure
• bas.h5: primary PacBio output file
• Raw sequencing read: FASTA & FASTQ (no HQ region, with adapters)
• Circular Consensus Sequence (CCS): FASTA & FASTQ
  – Without adapter
  – 2 full passes minimum
• sts.xml: Summary statistics and metrics (QC)
• sts.csv: Extensive per ZMW statistics

Slide 17

Slide 17 text

PacBio-Specific File Identifier
Primary Analysis Movie Associated Files
[Figure: annotated file name. Labeled fields: Time stamp, Instrument serial #, Part #, Set #, movie, SMRT® Cell barcode #]
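The slide names the fields of the identifier but the extracted text does not preserve their order. The sketch below assumes the layout `m<date>_<time>_<instrument serial>_c<cell barcode>_s<set>_p<part>`. Both the pattern and the example file name are assumptions for illustration and should be checked against real output files.

```python
import re

# Assumed layout; the order and the m/c/s/p prefixes are not confirmed by the slide text.
MOVIE_NAME = re.compile(
    r"^m(?P<date>\d{6})_(?P<time>\d{6})_(?P<instrument>[^_]+)"
    r"_c(?P<cell_barcode>[^_]+)_s(?P<set>\d+)_p(?P<part>\d+)"
)


def parse_movie_name(filename: str) -> dict:
    """Split a movie file name into its identifier fields."""
    match = MOVIE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized movie name: {filename}")
    return match.groupdict()


# Hypothetical file name with placeholder values.
fields = parse_movie_name(
    "m130101_120000_00000_c000000000000000000000000000000000_s1_p0.bas.h5"
)
print(fields)
```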

Slide 18

Slide 18 text

Primary Data Output Structure
• metadata.xml:
  – Run information for each movie
  – Retain for archiving
  – Required for import into SMRT® Portal
  – Designed for LIMS accessibility

Slide 19

Slide 19 text

bas.h5 Contents
• Archive together with matching metadata.xml
• HDF5 binary format (directory system in a file) for fast access
• All base calls indexed by ZMW with base quality values
• BaseQV = Sum(SubstitutionQV, InsertionQV, DeletionQV, MergeQV), where the sum is taken over the error probabilities these QVs encode (see "Sum of all error probabilities" in the PulseToBase Summary)
• Region Annotation:
  – High Quality Region start and end
  – Adapter start and end
• CCS consensus sequence with per base quality value
• Number of passes for CCS
• Kinetic data:
  – Pulse width (PW), Inter Pulse Duration (IPD)
• ZMW XY and ZMW classification
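"Indexed by ZMW" can be illustrated with a flat array of base calls plus a per-ZMW event count, from which each read is recovered by cumulative offsets. The layout, the argument names `hole_numbers` and `num_events`, and the toy values are assumptions made for this sketch. Consult the Base, Pulse, and Trace File Reference Guide for the actual dataset names.

```python
from itertools import accumulate


def reads_by_zmw(basecalls, hole_numbers, num_events):
    """Slice a flat basecall string into one read per ZMW using event counts."""
    ends = list(accumulate(num_events))  # exclusive end offset of each read
    starts = [0] + ends[:-1]             # start offset of each read
    return {hole: basecalls[s:e]
            for hole, s, e in zip(hole_numbers, starts, ends)}


# Toy data: three ZMWs, the second produced no base calls.
reads = reads_by_zmw("ACGTTGCAAG", hole_numbers=[7, 8, 9], num_events=[4, 0, 6])
print(reads)
```

The same offsets would apply to any per-base dataset stored alongside the base calls, such as quality values or kinetic measurements, under the same layout assumption.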

Slide 20

Slide 20 text

DevNet Tools Available (subset)
• PacBioFx (python)
  – Lightweight and modular downloads of SMRT® Pipe
• R-pbh5 (R)
  – An R package for interacting with data in HDF5 format from the PacBio® RS
  – Based on h5r
• R-kinetics (R)
  – Introduces users to PacBio data, specifically the kinetics data collected when performing a sequencing experiment
• Java File APIs
  – Reads base, trace, pulse and CCS basecall files
www.pacbiodevnet.com

Slide 21

Slide 21 text

Summary of Key Points

Key Points
• Fast sequencing times with real time signal processing and base calling = fast time to results
• Richer information available: Kinetic information and multiple quality values
• Circular Consensus Sequences available

Where to Find More Information
• HDF5 Java API User Guide, available on DevNet
• Base, Pulse, and Trace File Reference Guide, available on DevNet
• www.pacbiodevnet.com