
Secondary Analysis

PacBio
April 02, 2013



Transcript

  1. FIND MEANING IN COMPLEXITY
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
    Secondary Analysis


  2. Complete Informatics Workflow
    [Workflow diagram]
    Primary Analysis Pipeline (PacBio RS):
      Trace File -> TraceToPulse -> Pulse File -> PulseToBase -> Base File (bas.h5)
      Circular Consensus -> Circular Consensus File
      Generate QC -> QC Reports
    Secondary Analysis Pipeline (SMRT® Pipe):
      Inputs: Base File, Reference
      Stages (application specific): Filter -> Mapping -> Consensus -> Variants
      Outputs: Mapped File (cmp.h5), Consensus File (cmp.h5), Variants File (var.gff), QC Reports (summary.gff)


  3. SMRT® Portal Workflow Overview
    [Workflow diagram]
    Components: Web browser, SMRT® Portal, SMRT® Portal Services, Secondary Analysis DB,
    Job Management Service, SMRT® Pipe, Shared File System, SMRT® View
    Connection types shown: http, SQL, CLI
    Steps:
      1-3. Create job
      4. Submit job
      5. Run analysis pipelines (SMRT® Pipe)
      6a. Data & results are written (Shared File System)
      6b. Progress updated
      7b. Launch SMRT View to visualize results


  4. Secondary Analysis Job Submission
    [Software stack diagram, top to bottom]
    GUI: SMRT® Portal / SMRT View
    Web Services API: Tomcat
    Command line: SMRT Pipe
    SMRT Analysis Algorithms
    Linux OS
    Hardware


  5. Creating Jobs in SMRT® Portal


  6. Viewing Data in SMRT® Portal


  7. SMRT® Portal Protocols
    Resequencing
      RS_Resequencing_GATK: Align against reference and call variants using GATK
      RS_Resequencing: Align against reference and generate consensus
      RS_Resequencing_GATK_Barcode: Identify barcodes, align against reference and call variants using GATK
      RS_Modification_Detection: Align against reference and identify base modification positions
      RS_Modification_and_Motif_Analysis: Map bacterial modifications m6A, m4C and m5C and analyze motifs
      RS_Minor_and_Compound_Variants: Align CCS against a reference and call minor and compound variants
    Assembly
      RS_PreAssembler_Allora: Construct de novo assembly from a single long insert library using the HGAP method with ALLORA
      RS_PreAssembler: Generate high quality pre-assembled long reads as a first step for use in de novo assembly (HGAP method)
      RS_Allora_Assembly: De novo assembly using ALLORA
      RS_Allora_Assembly_EC: Hybrid assembly using P_ErrorCorrection and ALLORA
      RS_AHA_Scaffolding: Scaffolding assembly using AHA
      RS_Celera_Assembler: Use pacBioToCA and Celera® Assembler to combine PacBio® CLR and CCS or short reads
    Other
      RS_cDNA_Mapping: Align splice reads against genomic reference with GMAP
      RS_Filter: Filter to generate filtered_subreads.fastq


  8. Example Protocol: RS_Resequencing_GATK
    Step             Action performed                Module Name   Algorithm
    Filter           Remove adaptors; filter reads,  P_Filter
                     e.g. >0.75 RQ and >50 bp
    Mapping          Align subreads to reference     P_Mapping     BLASR
    Consensus        Generate consensus              P_Consensus   GenomicConsensus
    Variant calling  Make SNP calls                  P_GATKVC      GATK
    Outputs:
      filtered_subreads.fastq, filtered_subreads.fasta
      aligned_reads.sam, aligned_reads.bam, aligned_reads.cmp.h5
      consensus.fastq
      variants.gff, variants.vcf
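
The filter criteria quoted on this slide can be sketched in a few lines of Python. This is only an illustration of the thresholding idea, not the P_Filter implementation: the `Read` record, its field names, and the helper functions are hypothetical, and only the example thresholds (read quality above 0.75, length above 50 bp) come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical read record; the field names are assumptions for this sketch."""
    name: str
    sequence: str
    read_quality: float  # assumed to be a predicted accuracy between 0.0 and 1.0


MIN_READ_QUALITY = 0.75  # example threshold quoted on the slide
MIN_LENGTH_BP = 50       # example threshold quoted on the slide


def passes_filter(read, min_rq=MIN_READ_QUALITY, min_len=MIN_LENGTH_BP):
    """Keep a read only if it is strictly above both thresholds."""
    return read.read_quality > min_rq and len(read.sequence) > min_len


def filter_reads(reads):
    """Return the reads that pass the quality and length filter."""
    return [r for r in reads if passes_filter(r)]


reads = [
    Read("good", "ACGT" * 20, 0.85),    # 80 bp, RQ 0.85: kept
    Read("short", "ACGT" * 5, 0.90),    # 20 bp: dropped for length
    Read("low_rq", "ACGT" * 20, 0.60),  # RQ 0.60: dropped for quality
]
kept = filter_reads(reads)
print([r.name for r in kept])
```

Adaptor removal, the other half of the filter step, is not modelled here.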


  9. DAG Workflow Visualization
    In workflow directory
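
As a rough illustration of what a workflow DAG encodes, the sketch below orders the modules of the RS_Resequencing_GATK example so that each one runs after the modules it depends on. The module names come from the earlier protocol slide; the dependency edges are an assumption (a simple chain in the order the slide lists the steps), and this is not how SMRT Pipe itself represents or schedules its workflow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each module maps to the set of modules it depends on.
# The edges are assumed for illustration: a linear chain in slide order.
dependencies = {
    "P_Filter": set(),
    "P_Mapping": {"P_Filter"},
    "P_Consensus": {"P_Mapping"},
    "P_GATKVC": {"P_Consensus"},
}

# static_order() yields every node after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

With real pipelines the graph usually branches, and a topological order is what lets independent branches be scheduled in parallel.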


  10. SMRT® Analysis Algorithms – De Novo Assembly
    • HGAP – de novo assembly
    – De novo assembly from a single long insert library preparation
    • Celera® Assembler – de novo assembly
    – Combines PacBio® long reads with short reads or CCS
    – Scales to plant and mammalian-sized genomes
    • ALLORA (“A Long Read Assembler”) – de novo assembly
    – Tailored to PacBio long reads and error profile
    – Uses overlap-layout-consensus approach
    – Outputs contigs as FASTA sequence and HDF5 files
    • AHA (“A Hybrid Assembler”) – scaffolding of contigs
    – Combines PacBio sequence with high confidence contigs from an existing assembly, joining them into larger contigs
    – High confidence contigs can come from 2nd generation sequencing technologies or from Sanger sequencing


  11. SMRT® Analysis Algorithms – Targeted Sequencing
    • BLASR (“Basic Local Alignment with Successive Refinement”) – reference-based alignment
    – Maps reads to reference genomes and sequences
    – Designed to handle error profile of PacBio® reads
    • Quiver – consensus and variant caller
    – Uses PacBio’s rich QVs to choose the optimal consensus
    – Then calls haploid SNPs and indels
    – Can achieve Q50 for de novo assembly and resequencing
    • GATK (Genome Analysis Toolkit) – variant caller
    – Identifies haploid and diploid SNPs using the Broad’s Unified Genotyper
    • GMAP (Genomic Mapping and Alignment Program)
    – Aligns splice reads against genomic reference for full length cDNA discovery
    – Developed at Genentech
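
To put the Q50 figure in context: assuming it is a Phred-scaled quality value, Q = -10 · log10(p), where p is the per-base error probability. Q50 then corresponds to roughly one error per 100,000 bases. The two helper names below are hypothetical and exist only for this illustration.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


q50_error = phred_to_error_prob(50)
print(f"Q50 error probability: {q50_error:.0e}")
print(f"Expected bases per error at Q50: {1 / q50_error:,.0f}")
print(f"Quality for a 1-in-1000 error rate: Q{error_prob_to_phred(1e-3):.0f}")
```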


  12. Unique Computational Demands in Mapping SMRT® Sequences
    Mapping PacBio® data requires different tools than 2nd generation data:
    • Datasets are larger than what traditional database search methods such as BLAST are designed to handle
    • Read lengths are exponentially distributed and much longer than what short-read aligners are designed for
    • Error profile is different from what short-read aligners are designed for
    • Detailed alignment should benefit from rich quality values
    [Figures: read length histogram; accuracy by position]


  13. BLASR Combines Methods from Multiple Applications
    • BWT-FM index and suffix array search for rapid mapping
    • Sparse dynamic programming: rearrangements
    • Detailed banded dynamic programming alignments
    Chaisson and Tesler (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238.


  14. BLASR Alignment: Suffix Array/BWT Mapping + Refinement
    • Map short subsequences of a read to a reference genome using a suffix array or BWT-FM index (based on short-read mapping)
    • Find high-scoring sets of anchors using global chaining (based on whole-genome alignment)
    [Figure labels: 4, 3, 5*]
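
The two steps on this slide can be illustrated with a toy Python sketch: exact k-mer lookups in a suffix array produce anchors, and a chaining pass keeps the largest set of anchors that is increasing in both read and reference coordinates. This is not BLASR's code. The suffix array is built naively, the BWT-FM index option is not shown, the chaining uses a plain quadratic dynamic program as a stand-in for global chaining, and all function names and the example sequences are made up for illustration.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy sizes only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact_matches(reference, suffix_array, kmer):
    """Return every reference position where kmer occurs, via binary search."""
    k = len(kmer)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # find the first suffix whose k-prefix is >= kmer
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if reference[start:start + k] < kmer:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and reference.startswith(kmer, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def find_anchors(read, reference, suffix_array, k):
    """Anchor = (read offset, reference position) of an exact k-mer match."""
    anchors = []
    for offset in range(0, len(read) - k + 1, k):
        kmer = read[offset:offset + k]
        for pos in find_exact_matches(reference, suffix_array, kmer):
            anchors.append((offset, pos))
    return anchors


def chain_anchors(anchors):
    """Largest set of anchors increasing in both coordinates (quadratic DP)."""
    anchors = sorted(anchors)
    if not anchors:
        return []
    best = [1] * len(anchors)
    prev = [-1] * len(anchors)
    for i, (ri, gi) in enumerate(anchors):
        for j in range(i):
            rj, gj = anchors[j]
            if rj < ri and gj < gi and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


# Made-up reference with "GATT" repeated at positions 0 and 20.
reference = "GATTACAGGCTTAACCGGTTGATTACA"
read = reference[8:24]  # error-free toy read taken from the reference
suffix_array = build_suffix_array(reference)
anchors = find_anchors(read, reference, suffix_array, k=4)
chain = chain_anchors(anchors)
print("anchors:", anchors)
print("chain:  ", chain)
```

In this toy case the last k-mer of the read hits both copies of the repeat, and chaining discards the hit at reference position 0 because it is not colinear with the other anchors.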


  15. High Scoring Candidates Refined by Dynamic Programming
    • Score putative matches using Sparse Dynamic Programming
    [Figure labels: 4, 5*; score=372*, score=250]
    • Align matches using backbone-following banded dynamic programming
    GCAG-TCGTTAGCTAAC
    |||| ||||||||| ||
    GCAGGTCGTTAGCT-AC
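
Banded dynamic programming can be sketched as a global alignment that only fills cells within a fixed distance of the main diagonal. The version below is a simplification under stated assumptions: the band is centred on the diagonal rather than following a chained backbone as BLASR does, the scoring values (match +1, mismatch -1, gap -1) are arbitrary, and the function name is hypothetical. The two test sequences are the ones from the alignment on this slide with the gap characters removed.

```python
NEG_INF = float("-inf")


def banded_global_score(a, b, band=2, match=1, mismatch=-1, gap=-1):
    """Best global alignment score using only cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band is too narrow for this length difference")
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    # Cells outside the band keep -inf and are never filled.
    # (A full table is allocated for clarity; storing only the band saves memory.)
    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:  # align a[i-1] with b[j-1]
                step = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + step)
            if i > 0:  # gap in b
                best = max(best, score[i - 1][j] + gap)
            if j > 0:  # gap in a
                best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
    return score[n][m]


top = "GCAGTCGTTAGCTAAC"     # upper sequence on the slide, gap removed
bottom = "GCAGGTCGTTAGCTAC"  # lower sequence on the slide, gap removed
print(banded_global_score(top, bottom, band=2))
```

Under this scoring the alignment shown on the slide (15 matches, 2 gaps) scores 13, which is what the banded computation returns. Restricting the work to the band makes the cost proportional to read length times band width instead of the product of both sequence lengths.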


  16. Summary of Key Points
    • Secondary Analysis consists of multiple parts
    – SMRT® Portal is the GUI
    – SMRT® Pipe is the command-line script
    – Web Services API for automation
    • Protocols can be configured for multiple workflows


  17. Additional Resources Available on DevNet
    Topic                        Where to look
    Installing SMRT® Portal      SMRT Analysis Software Installation (v1.4)
                                 Running SMRT Analysis on Amazon
    SMRT Portal Administration   SMRT Portal Help
    SMRT Portal Network Setup    SMRT Analysis Software Installation (v1.4)
                                 PacBio® RS Network Diagram
                                 PacBio RS IT Site Prep Document
    Using SMRT Portal            SMRT Portal Help
    Setting Module Parameters    SMRT Pipe Reference Guide (v1.4)
    Troubleshoot SMRT Portal     DevNet Discussion Forums
                                 PacBio Technical Support


  18. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in
    the United States and/or other countries. All other trademarks are the sole property of their respective owners.
