
Secondary Analysis

PacBio
April 02, 2013



Transcript

  1. FIND MEANING IN COMPLEXITY

    Secondary Analysis
    © Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Complete Informatics Workflow

    Primary Analysis Pipeline (PacBio RS)
    • Steps: TraceToPulse, PulseToBase, Circular Consensus, Generate QC
    • Files: Trace File, Pulse File, Base File, Circular Consensus File, QC Reports

    Secondary Analysis Pipeline (SMRT® Pipe)
    • Steps: Filter, Mapping, Consensus, Variants, Application Specific
    • Files: Base File, Reference, Mapped File, Consensus File, Variants File, QC Reports
    • File formats shown: bas.h5, cmp.h5, cmp.h5, var.gff, summary.gff
  3. SMRT® Portal Workflow Overview

    Components: Web browser, SMRT® Portal, SMRT® Portal Services, Secondary Analysis DB, Job Management Service, SMRT® Pipe, Shared File System, SMRT® View, CLI
    Connection labels: http, SQL, http

    Steps as labeled in the diagram:
    • 1. Create job
    • 2. Create job
    • 3. Create job
    • 4. Submit job
    • 5. Run analysis pipelines
    • 6a. Data & results are written
    • 6b. Progress updated
    • 7b. Launch SMRT View to visualize results
  4. Secondary Analysis Job Submission

    Software stack: SMRT® Portal/SMRT View, Tomcat, SMRT Pipe, SMRT Analysis Algorithms, Linux OS, Hardware
    Interfaces: GUI, Web Services API, Command line
  5. SMRT® Portal Protocols

    Resequencing
    • RS_Resequencing_GATK – Align against reference and call variants using GATK
    • RS_Resequencing – Align against reference and generate consensus
    • RS_Resequencing_GATK_Barcode – Identify barcodes, align against reference and call variants using GATK
    • RS_Modification_Detection – Align against reference and identify base modification positions
    • RS_Modification_and_Motif_Analysis – Map bacterial modifications m6A, m4C and m5C and analyze motifs
    • RS_Minor_and_Compound_Variants – Align CCS against a reference and call minor and compound variants

    Assembly
    • RS_PreAssembler_Allora – Construct de novo assembly from single long insert library using HGAP method with ALLORA
    • RS_PreAssembler – Generate high quality pre-assembled long reads as a first step for use in de novo assembly (HGAP method)
    • RS_Allora_Assembly – De novo assembly using ALLORA
    • RS_Allora_Assembly_EC – Hybrid assembly using P_ErrorCorrection and ALLORA
    • RS_AHA_Scaffolding – Scaffolding assembly using AHA
    • RS_Celera_Assembler – Use pacBioToCA and Celera® Assembler to combine PacBio® CLR and CCS or short-reads

    Other
    • RS_cDNA_Mapping – Align splice reads against genomic reference with GMAP
    • RS_Filter – Filter to generate filtered_subreads.fastq
  6. Example Protocol: RS_Resequencing_GATK

    • Filter – Action performed: remove adaptors; filter reads, e.g. >0.75 RQ and >50 bp. Module name: P_Filter.
    • Mapping – Action performed: align subreads to reference. Module name: P_Mapping. Algorithm: BLASR.
    • Consensus – Action performed: generate consensus. Module name: P_Consensus. Algorithm: GenomicConsensus.
    • Variant calling – Action performed: make SNP calls. Module name: P_GATKVC. Algorithm: GATK.

    Outputs, in slide order: filtered_subreads.fastq, filtered_subreads.fasta, aligned_reads.sam, aligned_reads.bam, aligned_reads.cmp.h5, consensus.fastq, variants.gff, variants.vcf
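The filter step on the RS_Resequencing_GATK slide can be illustrated with a minimal Python sketch. It applies the example thresholds from the slide (read quality above 0.75, length above 50 bp) to in-memory records. The `Read` class, its field names, and the helper functions are hypothetical and exist only for this illustration; this is not the P_Filter implementation, and adaptor removal is not shown.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical in-memory read record (not a PacBio data structure)."""
    name: str
    read_quality: float  # predicted read quality (RQ), between 0 and 1
    sequence: str


def passes_filter(read, min_rq=0.75, min_length=50):
    """Return True if the read exceeds both example thresholds from the slide."""
    return read.read_quality > min_rq and len(read.sequence) > min_length


def filter_reads(reads, min_rq=0.75, min_length=50):
    """Keep only the reads that pass both thresholds."""
    return [r for r in reads if passes_filter(r, min_rq, min_length)]


reads = [
    Read("r1", 0.85, "ACGT" * 100),  # good quality, 400 bp
    Read("r2", 0.60, "ACGT" * 100),  # quality too low
    Read("r3", 0.90, "ACGT" * 5),    # only 20 bp, too short
]
kept = filter_reads(reads)
print([r.name for r in kept])
```

Both comparisons are strict, matching the ">" signs on the slide; whether the real module treats the boundary values as passing is not stated in the deck.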
  7. SMRT® Analysis Algorithms – De Novo Assembly

    • HGAP – de novo assembly
      – De novo assembly from a single long insert library preparation
    • Celera® Assembler – de novo assembly
      – Combines PacBio® long reads with short reads or CCS
      – Scales to plant and mammalian-sized genomes
    • ALLORA (“A Long Read Assembler”) – de novo assembly
      – Tailored to PacBio long reads and error profile
      – Uses overlap-layout-consensus approach
      – Outputs contigs as FASTA sequence and HDF5 files
    • AHA (“A Hybrid Assembler”) – scaffolding of contigs
      – Combines PacBio sequence with high confidence contigs from an existing assembly, joining them into larger contigs
      – Can generate high confidence contigs from 2nd generation sequencing technologies or Sanger sequencing contigs
  8. SMRT® Analysis Algorithms – Targeted Sequencing

    • BLASR (“Basic Local Alignment with Successive Refinement”) – reference-based alignment
      – Maps reads to reference genomes and sequences
      – Designed to handle error profile of PacBio® reads
    • Quiver – consensus and variant caller
      – Uses PacBio’s rich QVs to choose the optimal consensus
      – Then calls haploid SNPs and indels
      – Can achieve Q50 for de novo assembly and resequencing
    • GATK (Genome Analysis Tool Kit) variant caller
      – Identifies haploid and diploid SNPs using the Broad’s Unified Genotyper
    • GMAP (Genomic Mapping and Alignment Program)
      – Align splice reads against genomic reference for full length cDNA discovery
      – Developed at Genentech
  9. Unique Computational Demands in Mapping SMRT® Sequences

    Mapping PacBio® data requires different tools compared to 2nd Gen:
    • Datasets are larger than what traditional database search methods such as BLAST handle
    • Read lengths are exponentially distributed and much longer than what short-read aligners are designed for
    • Error profile is different than what short-read aligners are designed for
    • Detailed alignment should benefit from rich quality values

    [Figures: Read length histogram; Accuracy by position]
  10. BLASR Combines Methods from Multiple Applications

    • Sparse dynamic programming: rearrangements
    • BWT-FM Index, and suffix array search for rapid mapping
    • Detailed banded dynamic programming alignments

    Chaisson et al. (2012) Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Theory and Application. BMC Bioinformatics 13, 238.
  11. BLASR Alignment: Suffix Array/BWT Mapping + Refinement

    • Map short subsequences of a read to a reference genome using a suffix array, or BWT-FM index (based on short-read mapping)
    • Find high-scoring sets of anchors using global chaining (based on whole-genome alignment)

    [Figure labels: 4, 3, 5*]
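The two steps on the BLASR alignment slide, anchoring short subsequences and then chaining anchors, can be sketched in Python as follows. This is a toy under stated assumptions and not BLASR's code: a plain dictionary of k-mers stands in for the suffix array or BWT-FM index, the chaining is a simple quadratic dynamic program that scores a chain by its number of anchors with no gap penalties, and all function names and the example sequences are made up for illustration.

```python
from collections import defaultdict


def kmer_index(reference, k):
    """Map every k-mer in the reference to its start positions.
    (Stand-in for a suffix array / BWT-FM index; assumption for this sketch.)"""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_anchors(read, index, k):
    """Return (read_pos, ref_pos) pairs where a read k-mer matches exactly."""
    anchors = []
    for q in range(len(read) - k + 1):
        for r in index.get(read[q:q + k], ()):
            anchors.append((q, r))
    return anchors


def best_chain(anchors):
    """Longest chain of anchors that increases in both read and reference
    position (simplified global chaining, O(n^2), no gap costs)."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    score = [1] * n   # best chain length ending at each anchor
    prev = [-1] * n   # back-pointer to the previous anchor in that chain
    for i in range(n):
        qi, ri = anchors[i]
        for j in range(i):
            qj, rj = anchors[j]
            if qj < qi and rj < ri and score[j] + 1 > score[i]:
                score[i] = score[j] + 1
                prev[i] = j
    i = max(range(n), key=score.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


reference = "ACGTTGCAATGCCGTAGGCTTAAC"
read = "TGCAATGCAGTAGGCT"  # reference[4:20] with one substitution (C -> A)
k = 4
anchors = find_anchors(read, kmer_index(reference, k), k)
chain = best_chain(anchors)
print(chain)
```

In this example the read also produces one off-diagonal anchor, because the substitution creates a second copy of the k-mer TGCA; the chaining step discards it since it cannot be ordered consistently with the other anchors.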
  12. High Scoring Candidates Refined by Dynamic Programming

    • Score putative matches using Sparse Dynamic Programming
    • Align matches using backbone-following banded dynamic programming

    [Figure labels: 4, 5*; score=372*, score=250]

        GCAG-TCGTTAGCTAAC
        |||| ||||||||| ||
        GCAGGTCGTTAGCT-AC
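Banded dynamic programming, the refinement step named on this slide, can be illustrated with a short Python sketch. It computes a unit-cost edit distance while filling only the cells within `band` of the main diagonal. This is a simplification: the slide describes a backbone-following band (presumably centered on the chained anchors), whereas this sketch keeps the band on the main diagonal, allocates the full matrix for readability, and uses unit costs rather than BLASR's scoring. The function name and parameters are hypothetical.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b, restricted to cells with |i - j| <= band.
    Returns None when the end cell lies outside the band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Full matrix for clarity; a real implementation would store only the band.
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        lo = max(0, i - band)
        hi = min(m, i + band)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0 and j > 0:
                # match (cost 0) or mismatch (cost 1)
                best = dp[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            if i > 0:
                best = min(best, dp[i - 1][j] + 1)  # gap in b
            if j > 0:
                best = min(best, dp[i][j - 1] + 1)  # gap in a
            dp[i][j] = best
    return dp[n][m]


# The two sequences from the slide's example alignment, gaps removed.
top = "GCAGTCGTTAGCTAAC"
bottom = "GCAGGTCGTTAGCTAC"
print(banded_edit_distance(top, bottom, band=1))
print(banded_edit_distance(top, bottom, band=0))
```

With a band of 1 the two-gap alignment shown on the slide is reachable; with a band of 0 only substitutions are allowed, so the distance is much larger. That difference is the trade-off a band width controls: fewer cells to fill, at the risk of missing alignments that stray from the band.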
  13. Summary of Key Points

    • Secondary Analysis consists of multiple parts
      – SMRT® Portal is the GUI
      – SMRT® Pipe is the command-line script
      – Web Services API for automation
    • Protocols can be configured for multiple workflows
  14. Additional Resources Available on DevNet

    Topic – Where to look
    • Installing SMRT® Portal – SMRT Analysis Software Installation (v1.4); Running SMRT Analysis on Amazon
    • SMRT Portal Administration – SMRT Portal Help
    • SMRT Portal Network Setup – SMRT Analysis Software Installation (v1.4); PacBio® RS Network Diagram; PacBio RS IT Site Prep Document
    • Using SMRT Portal – SMRT Portal Help
    • Setting Module Parameters – SMRT Pipe Reference Guide (v1.4)
    • Troubleshoot SMRT Portal – DevNet Discussion Forums; PacBio Technical Support
  15. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.