
Primary Analysis

PacBio
April 02, 2013



Transcript

  1. FIND MEANING IN COMPLEXITY

    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
    Primary Analysis
    © Copyright 2012 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Learning Objectives

    Scientists and Bioinformaticians:
    • Familiar with bioinformatics concepts
    • Interested in learning the basics of the SMRT Bioinformatics workflow with the PacBio® RS

    • SMRT® Technology
    • PacBio® RS Workflow
    • Bioinformatics Overview

    After the training, you will be able to describe:
    • Sequencing to Primary Analysis Workflow
    • Signal Processing and Detection
    • Single Molecule Real-time Basecalling
    • Quality Values
  3. Primary Analysis Workflow

    Stages: Movie-to-Trace, Trace-to-Pulse, Pulse-to-Base, Circular Consensus
    [Figure: data volume, log(size/GB), plotted against time. Labels: ~4 TB, ~50 GB, ~10 GB; 30 min, 30 min. Example output: GCAACGATCACCTAAA…]
    Algorithms reduce photon counts vs. time to sequences in real time.
  4. System Architecture

    [Diagram labels: Sequencer; Instrument Control; Blade Center (Acquisition and Signal Processing, Data Analysis); Acquisition Control; Pipeline Job Control; File Format and Delivery; Idle; Frames; ZMW. Processing steps: MovieToTrace, TraceToPulse, PulseToBase, Circular Consensus, Generate QC. Files: Trace File, Pulse File, Base File, Circular Cons. File, Basecalls File, FASTA File, Summary QC Report. Destination: To Customer URI.]
  5. PulseToBase Inputs

    • Receive observation list from TraceToPulse
    • Each pulse has a vector of associated measurements
      – Duration
      – Spectrum
      – Intensity
      – Spacing to neighbors
      – Local context
      – etc…
    [Figure labels: Duration, Intensity, Spacing, Context, Spectrum]
  6. Modeling of Insertions/Deletions

    • Observation sequence can contain insertions and deletions with respect to the template sequence
    • Need a model of the likelihood of insertions/deletions given pulse features and local trace neighborhood
    [Figure labels: Insertion? Deletion?]
  7. PulseToBase Summary

    • Single-molecule pulse events are sequential: no phasing problem, no Sanger limit
    • The main kinetic information retained in the bas.h5 output files is the interpulse duration (IPD) and the pulse width (PW)
    • Quality Values
      – Substitution
      – Insertion
      – Deletion
      – Merge
      – Sum of all error probabilities
  8. Additional Primary Analysis Tasks

    • Adapter and Insert Screening = Annotates adapter locations and insert DNA regions in the raw read. Used to break a read into subreads during secondary analysis mapping and Circular Consensus.
    • Productivity Assignment = Assigns a productivity score of 0, 1 or 2 to each sequencing ZMW.
    • High Quality Region Screening = Annotates the high quality sequencing regions of a read to be used during Raw Read Trimming.
    • Read Quality Assignment = A trained prediction of a read’s mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, interpulse duration, and so on). Used during secondary analysis filtering.
  9. From Raw Reads to Circular Consensus Sequences (CCS)

    • Subreads (purple and gold) are separated by adapter sequences (green)
    • ≥ 2 full polymerase passes required for CCS
    • Individual subreads or CCS reads can be used for subsequent analysis
    [Figure labels: Raw Read, Subreads, Circular Consensus Sequence (CCS)]
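The relationship between adapters, subreads and the two-pass CCS requirement can be sketched in a few lines of Python. This is a minimal illustration, not PacBio's implementation: the function names, the half-open coordinate convention, and the definition of a "full pass" as a subread flanked by an adapter on both sides are all assumptions made for the example.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open [start, end) coordinates in the raw read


def split_subreads(read_len: int, adapters: List[Region]) -> List[Region]:
    """Return the insert regions that remain once adapter regions are cut out."""
    subreads = []
    cursor = 0
    for start, end in sorted(adapters):
        if start > cursor:
            subreads.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < read_len:
        subreads.append((cursor, read_len))
    return subreads


def full_passes(adapters: List[Region]) -> int:
    """Count subreads flanked by an adapter on both sides (assumed definition)."""
    ordered = sorted(adapters)
    return sum(
        1
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
        if right_start > left_end
    )


def ccs_eligible(adapters: List[Region], min_passes: int = 2) -> bool:
    """Apply the slide's rule: at least two full passes are needed for CCS."""
    return full_passes(adapters) >= min_passes


# Hypothetical 20-base raw read with three annotated adapter hits.
adapter_hits = [(4, 6), (10, 12), (16, 18)]
subread_regions = split_subreads(20, adapter_hits)
```

In this made-up read, the first and last subreads are partial passes, while the two in the middle are bounded by adapters on both sides, so the read would meet the two-pass threshold.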
  10. Pipeline Summary

    • Fast sequencing times with real time analysis = fast time to results
    • Primary analysis results are erased after successful transfer to the secondary storage server
    • Transfer of trace and pulse files to secondary storage is possible, but not suggested
    • Proactive monitoring by PacBio Tech Support if RS Insight access is enabled
      – Sequencing QC metrics are retained on the Blade Center
  11. Primary Data Output Structure

    • bas.h5: primary PacBio output file
    • Raw sequencing read: FASTA & FASTQ (no HQ region, with adapters)
    • Circular Consensus Sequence (CCS): FASTA & FASTQ
      – Without adapter
      – 2 full passes minimum
    • sts.xml: Summary statistics and metrics (QC)
    • sts.csv: Extensive per ZMW statistics
  12. PacBio-Specific File Identifier

    Primary Analysis Movie Associated Files
    [Figure: annotated movie file name. Labels: Time stamp, Instrument serial #, Part #, Set #, movie, SMRT® Cell barcode #]
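The slide names the fields encoded in a movie file identifier, but the transcript has lost the example name itself. The sketch below shows how such an identifier could be parsed, assuming a layout of `m<YYMMDD>_<HHMMSS>_<instrument serial>_c<SMRT Cell barcode>_s<set>_p<part>`. That layout, the regular expression, the function name and the sample name are assumptions for illustration only; check them against real file names before relying on them.

```python
import re
from datetime import datetime

# ASSUMED layout (not stated in the transcript):
#   m<YYMMDD>_<HHMMSS>_<instrument serial>_c<SMRT Cell barcode>_s<set>_p<part>
MOVIE_NAME = re.compile(
    r"^m(?P<date>\d{6})_(?P<time>\d{6})"
    r"_(?P<instrument>[^_]+)"
    r"_c(?P<cell_barcode>\d+)"
    r"_s(?P<set>\d+)"
    r"_p(?P<part>\d+)$"
)


def parse_movie_name(name: str) -> dict:
    """Split a movie identifier into the fields listed on the slide."""
    match = MOVIE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised movie name: {name!r}")
    fields = match.groupdict()
    return {
        "timestamp": datetime.strptime(
            fields["date"] + fields["time"], "%y%m%d%H%M%S"
        ),
        "instrument_serial": fields["instrument"],
        "cell_barcode": fields["cell_barcode"],
        "set": int(fields["set"]),
        "part": int(fields["part"]),
    }


# Made-up identifier, used only to exercise the parser.
example_name = "m130402_101530_00000_c000000000000000000000000000000_s1_p0"
example_fields = parse_movie_name(example_name)
```

Keeping the barcode and serial number as strings avoids losing leading zeros, which matters if the values are later used to look up files or LIMS records.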
  13. Primary Data Output Structure

    • metadata.xml:
      – Run information for each movie
      – Retain for archiving
      – Required for import into SMRT® Portal
      – Designed for LIMS accessibility
  14. bas.h5 Contents

    • Archive together with matching metadata.xml
    • HDF5 binary format (directory system in a file) for fast access
    • All base calls indexed by ZMW with base quality values
    • BaseQV = Sum(SubstitutionQV, InsertionQV, DeletionQV, MergeQV)
    • Region Annotation:
      – High Quality Region start and end
      – Adapter start and end
    • CCS consensus sequence with per base quality value
    • Number of passes for CCS
    • Kinetic data:
      – Pulse width (PW), interpulse duration (IPD)
    • ZMW XY and ZMW classification
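The BaseQV line above, read together with the earlier "sum of all error probabilities" bullet, suggests that the four error types are combined as probabilities and not by adding the QV numbers directly. A minimal sketch of that arithmetic follows, assuming the usual Phred convention QV = -10·log10(p). The function names are made up for the example, and the exact computation used by the instrument software may differ (for instance in rounding or capping).

```python
import math


def qv_to_prob(qv: float) -> float:
    """Phred-scaled quality value to error probability."""
    return 10 ** (-qv / 10)


def prob_to_qv(prob: float) -> float:
    """Error probability to Phred-scaled quality value."""
    return -10 * math.log10(prob)


def combined_base_qv(substitution_qv: float, insertion_qv: float,
                     deletion_qv: float, merge_qv: float) -> float:
    """Combine per-error-type QVs by summing their error probabilities."""
    total_error = sum(
        qv_to_prob(qv)
        for qv in (substitution_qv, insertion_qv, deletion_qv, merge_qv)
    )
    # A probability cannot exceed 1, so clamp before converting back.
    return prob_to_qv(min(total_error, 1.0))


# Four error types at QV 20 each (1% error apiece) give about 4% total error.
example_qv = combined_base_qv(20, 20, 20, 20)
```

Because probabilities add, the combined value is always lower than the lowest of the four inputs; adding the QV numbers themselves would wrongly make the base look more reliable than any single measure.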
  15. DevNet Tools Available (subset)

    • PacBioFx (python)
      – Lightweight and modular downloads of SMRT® Pipe
    • R-pbh5 (R)
      – An R package for interacting with data in HDF5 format from the PacBio® RS
      – Based on h5r
    • R-kinetics (R)
      – Introduces users to PacBio data, specifically the kinetics data collected when performing a sequencing experiment
    • Java File APIs
      – Read base, trace, pulse and CCS basecall files
    www.pacbiodevnet.com
  16. Summary of Key Points

    Key Points
    • Fast sequencing times with real time signal processing and base calling = fast time to results
    • Richer information available: kinetic information and multiple quality values
    • Circular Consensus Sequences available

    Where to Find More Information
    • HDF5 Java API User Guide, available on DevNet
    • Base, Pulse, and Trace File Reference Guide, available on DevNet
    • www.pacbiodevnet.com