Track 1: IsoSeq

PacBio
October 15, 2014


Transcript

  1. Iso-Seq™: Full-Length Transcript Analysis Using SMRT® Analysis v2.3

    Nicole Rapicavoli, Field Applications Scientist. October 2014. FIND MEANING IN COMPLEXITY. For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners. © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.
  2. Learning Objectives

    Scientists interested in genome annotation and full-length isoform detection using the PacBio® Iso-Seq™ method.
    After the training, you will be able to:
    • Choose the best protocol for your experimental design.
    • Understand how the Iso-Seq method works.
    • Run an Iso_Seq job and understand the reports generated in SMRT® Portal.
    • SMRT Technology • PacBio System Workflow • General understanding of SMRT Portal
  3. Iso-Seq™ Method: Resolving Transcript Diversity

    On average, 8 exons per human gene. Candidate space: 5.8 × 10^76.
    [Figure: gene models with exons numbered 1 to 5, repeated five times, followed by a question mark]
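The candidate-space figure above is consistent with the following count. This is an assumption, since the slide does not show its derivation: a gene with 8 exons has 2^8 − 1 = 255 non-empty exon combinations (possible isoforms), and any subset of those could be the set of isoforms actually expressed, giving 2^255 possibilities. A minimal Python check under that assumption:

```python
def candidate_space(n_exons):
    """Count possible isoform sets for a gene with n_exons exons.

    Assumption (not stated on the slide): an isoform is any non-empty
    subset of exons, and a transcriptome is any subset of those isoforms.
    """
    possible_isoforms = 2 ** n_exons - 1   # 255 for 8 exons
    return 2 ** possible_isoforms          # 2**255 for 8 exons


print(f"{candidate_space(8):.1e}")
```

Short reads cannot distinguish between most of these candidate sets because they rarely span more than one or two exon junctions.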
  4. Current State of Transcript Assembly

    “The way we do RNA-seq now is… you take the transcriptome, you blow it up into pieces and then you try to figure out how they all go back together again… If you think about it, it’s kind of a crazy way to do things.” Michael Snyder, Professor and Chair of Genetics, Stanford University
    Tal Nawy (2013) End-to-end RNA sequencing, Nature Methods 10(12):1144–1145.
    Ian Korf (2013) Genomics: the state of the art in RNA-seq analysis, Nature Methods 10(12):1165–1166. doi:10.1038/nmeth.2735.
  5. Sequencing Full-length cDNA in the Human Transcriptome

    “Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level.”
  6. Detailed Workflow for Conversion of cDNA into SMRTbell™ Libraries

    [Workflow figure] Total RNA → optional poly-A selection → polyA+ RNA → reverse transcription → full-length 1st-strand cDNA → PCR optimization → large-scale amplification → amplified cDNA → size selection by BluePippin™ system (<10 kb) or gel (<6 kb) into 1-2 kb, 2-3 kb, 3-6 kb and 5-10 kb fractions → re-amplification of each fraction → SMRTbell template preparation → optional size selection (BluePippin system), shown for the 3-6 kb fraction → SMRT® Sequencing
  7. Library Preparation

    Steps shown between the 3-6 kb BluePippin™ size selection and SMRT® Sequencing: • DNA damage repair • Repair ends • Purify templates • Ligate adapters • Primer annealing and bind polymerase
    [Figure: Bioanalyzer trace of SMRTbell™ templates]
  8. Size selection is necessary for loading longer isoforms

    Your ability to detect longer isoforms will be determined by the method that you use. Size selection is needed because shorter transcripts: • PCR amplify better • Load preferentially in ZMWs during sequencing
  9. 5-10 kb fraction supported

    • Now possible to directly sequence full-length transcripts up to 10 kb • Use of Kapa HiFi DNA Polymerase and size-selection bins
    Considerations: • Know your tissue and transcript content! • mRNA trace must show transcripts larger than 6 kb.
  10. Updated Iso-Seq™ Protocols

    Size-selection protocol: None
    • Transcript range: ONLY 1–1.5 kb
    • Advantages: least expensive; simplest protocol; better than short-read sequencing for isoform characterization
    • Disadvantages: limited transcript size range
    • When to use: if transcripts <1.5 kb are desired; time and resource limited

    Size-selection protocol: Manual agarose gel
    • Transcript range: 1 to 6 kb
    • Advantages: long transcripts; long insert-size range; uses generally accessible lab methods
    • Disadvantages: more time and resource intensive; unable to generate longest transcripts; size bins not as precise; chance of contamination between samples on the same gel
    • When to use: need transcripts >1.5 kb; cannot access a BluePippin system

    Size-selection protocol: BluePippin™ system
    • Transcript range: 1 to 10 kb
    • Advantages: longest transcripts; largest insert-size range; least chance of cross contamination between samples; best representation of isoforms in your sample
    • Disadvantages: most time and resource intensive; 4 libraries per sample; requires a BluePippin system
    • When to use: when the best understanding of the transcriptome is desired
  11. PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts

    [Figure, panel a, experimental pipeline: polyA mRNA → cDNA synthesis with adapters → size partitioning & PCR amplification → SMRTbell™ ligation → PacBio® RS II sequencing → PacBio raw sequence reads]
    [Figure, panel b, informatics pipeline: PacBio raw sequence reads → remove adapters → remove artifacts → clean sequence reads → reads clustering → isoform clusters → consensus calling → nonredundant transcript isoforms → quality filtering → final isoforms → map to reference genome → evidence-based gene models]
    [Inset: a Read of Insert consisting of 5’ primer, mRNA sequence, polyA tail (AAA)n and 3’ primer, flanked by SMRT® adapters]
    DevNet: Iso-Seq wiki page
  12. Iso-Seq™ Bioinformatics Workflow

    [Figure 1: the experimental and informatics pipelines from the previous slide, with the informatics stages labeled pbtranscript.py classify, pbtranscript.py cluster (ICE), and Quiver]
  13. Identify Full-length (FL) Reads – pbtranscript.py classify

    Full-length = 5’ primer seen, polyA tail seen, 3’ primer seen • Identify and remove primers and polyA/T tail • Identify read strandedness
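The full-length rule above can be sketched in a few lines of Python. This is a toy illustration, not the logic of the classify tool: it uses exact string matching where a real classifier would tolerate sequencing errors, and the primer sequences in the usage note are hypothetical placeholders.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_read(seq, five_primer, three_primer, min_polya=10):
    """Return (is_full_length, strand, insert).

    A read is called full-length when, in one of the two orientations,
    it starts with the 5' primer, ends with the 3' primer, and has a
    polyA run of at least min_polya bases just before the 3' primer.
    Primers and polyA tail are removed from the returned insert.
    Exact matching is an assumption made to keep the sketch short.
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        if not (s.startswith(five_primer) and s.endswith(three_primer)):
            continue
        body = s[len(five_primer):len(s) - len(three_primer)]
        tail = re.search(r"A{%d,}$" % min_polya, body)
        if tail:
            return True, strand, body[:tail.start()]
    return False, None, seq
```

For example, with hypothetical primers `GGTACCAG` and `CTAGTCGA`, a read built as primer + transcript + polyA + primer is reported as full-length on the `+` strand, and its reverse complement is reported on the `-` strand with the same trimmed insert.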
  14. Validate size selection by plotting FL read lengths

    [Figure: distribution of full-length read lengths for the 1–2 kb, 2–3 kb, and 3–6 kb size fractions]
  15. Expected FL% at Different Size Ranges

    Size selection and expected FL %*:
    • 1–2 kb: 50–60%
    • 2–3 kb: 30–45%
    • 3–6 kb (gel or 1 BP): 20–35%
    • 3–6 kb (2 BP): 15–20%
    *Based on in-house training samples.
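As a quick QC aid, the expected ranges above can be encoded and compared against a run. This is a minimal sketch; it assumes FL% is computed as full-length reads divided by Reads of Insert, which the slide does not state explicitly, and the function names are illustrative.

```python
# Expected full-length percentages from the slide, as (low, high) bounds.
EXPECTED_FL_PERCENT = {
    "1-2 kb": (50, 60),
    "2-3 kb": (30, 45),
    "3-6 kb (gel or 1 BP)": (20, 35),
    "3-6 kb (2 BP)": (15, 20),
}


def check_fl_percent(size_bin, n_full_length, n_reads_of_insert):
    """Return (fl_percent, status) for one size fraction.

    Assumption: FL% = 100 * full-length reads / Reads of Insert.
    """
    if n_reads_of_insert <= 0:
        raise ValueError("n_reads_of_insert must be positive")
    low, high = EXPECTED_FL_PERCENT[size_bin]
    pct = 100.0 * n_full_length / n_reads_of_insert
    if pct < low:
        status = "below expected range"
    elif pct > high:
        status = "above expected range"
    else:
        status = "within expected range"
    return pct, status
```

A value below the range for a given fraction is a prompt to revisit library preparation or size selection rather than a hard failure, since the ranges come from in-house training samples.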
  16. Artificial Chimeras (1): Artificial Concatemer

    [Figure: an artificial concatemer in which Transcript 1 and Transcript 2, each with its own 5’ and 3’ primers, are joined in a single read]
    Cause: low SMRT® adapter concentration. Outcome: primer-ligated cDNAs form concatemers. Detection: high incidence of artificial chimeras (identifiable cDNA primer in the middle of the read).
    Artificial chimeras by trainee (MCF7, Clontech, 1–2 kb): • A: 2415 (3.9%) • B: 79 (0.5%) • C: 304 (0.2%) • D: 235 (0.2%)
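The detection rule above (a cDNA primer in the middle of a read) can be illustrated with a short Python sketch. It uses exact matching and hypothetical primer sequences, so it is a simplification of what an error-tolerant classifier would do.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_artificial_concatemer(read, primers):
    """True if any primer (either orientation) occurs inside the read.

    The first and last len(primer) bases are skipped so that the
    expected terminal primers are not counted as internal hits.
    """
    for primer in primers:
        interior = read[len(primer):len(read) - len(primer)]
        if primer in interior or reverse_complement(primer) in interior:
            return True
    return False
```

A read of the form primer, transcript, polyA, primer returns `False`; two such units joined end to end return `True` because the second unit's 5’ primer sits in the interior.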
  17. Artificial Chimeras (2): PCR Chimera

    [Figure: a PCR chimera read containing the 5’ primer, Transcript 1 (partial, reversed) with (TTT)n, Transcript 2 (partial) with (AAA)n, and the 3’ primer]
    Cause: PCR amplification. Outcome: random fusion of ligated transcripts. Detection: a single read maps to different loci/genes. However, there are also biological chimeras!
    Multi-mapped reads by sample and size selection: • MCF7, 1–2 kb: 2.7% • Rat muscle, 1–2 kb: 3.2% • Mouse liver, 1–2 kb: 2.2% • Mouse liver, 2–3 kb: 1.6%
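Once reads are aligned to a reference genome, the multi-mapped percentage can be tallied as sketched below. The input format, a list of `(read_id, locus)` pairs with one pair per aligned segment, is a hypothetical simplification of real aligner output.

```python
from collections import defaultdict


def find_multi_mapped(alignments):
    """Return sorted IDs of reads whose segments hit more than one locus.

    alignments: iterable of (read_id, locus) pairs, one per aligned
    segment (a hypothetical, simplified view of aligner output).
    """
    loci_by_read = defaultdict(set)
    for read_id, locus in alignments:
        loci_by_read[read_id].add(locus)
    return sorted(r for r, loci in loci_by_read.items() if len(loci) > 1)


def multi_mapped_percent(alignments):
    """Percentage of aligned reads that map to more than one locus."""
    alignments = list(alignments)
    n_reads = len({read_id for read_id, _ in alignments})
    if n_reads == 0:
        return 0.0
    return 100.0 * len(find_multi_mapped(alignments)) / n_reads
```

Reads flagged this way are candidates only: a biological chimera such as a true fusion transcript produces the same signature, so flagged reads may deserve inspection before they are discarded.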
  18. Bioinformatics QC Summary

    • Identify full-length reads
      – The % of expected full-length reads/cell differs depending on transcript size range.
    • Detect and remove artificial chimeras
      – Artificial concatemers are rare (~0.2%) and avoidable by increasing SMRT® adapter concentration.
      – PCR chimeras are difficult to completely avoid (~3%) but can be detected computationally (if a reference genome is available).
      – However, there are also biological chimeras.
  19. Iso-Seq™ Bioinformatics Workflow

    [Figure 1: the experimental and informatics pipelines, with the informatics stages labeled pbtranscript.py classify, pbtranscript.py cluster (ICE), and Quiver]
  20. Clustering – pbtranscript.py cluster

    Runs (.bax.h5) → Reads of Insert (RoI) (.fasta, .ccs.h5) → full-length, non-chimeric RoI and non-full-length, non-chimeric RoI
    Next step: figure out which reads come from the same isoforms.
  21. ICE Isoform-level Clustering: Overview

    Runs (.bax.h5) → Reads of Insert (RoI) (.fasta, .ccs.h5) → pbtranscript.py classify → full-length, non-chimeric RoIs and non-full-length, non-chimeric RoIs
    Full-length, non-chimeric RoIs → pbtranscript.py cluster → cluster consensus
  22. Iterative Clustering for Error Correction

    Input: full-length, non-chimeric RoI. Steps shown: • Build similarity graph using BLASR • Clique finding • Fast consensus calling using DAGCon • Cluster reassignment • Merge clusters
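The first two steps above (similarity graph, then clique finding) can be sketched as follows. This is a toy stand-in, not the ICE implementation: `difflib.SequenceMatcher` replaces BLASR alignment, the clique search is a simple greedy heuristic, and the consensus, reassignment and merge steps are omitted. All function names and the 0.9 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def build_similarity_graph(reads, threshold=0.9):
    """Connect reads whose pairwise similarity ratio meets the threshold.

    SequenceMatcher is a stand-in for a real aligner such as BLASR.
    """
    graph = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        ratio = SequenceMatcher(None, reads[i], reads[j], autojunk=False).ratio()
        if ratio >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_cliques(graph):
    """Partition nodes into cliques, seeding from the best-connected node."""
    unassigned = set(graph)
    clusters = []
    while unassigned:
        # Highest remaining degree wins; ties go to the lowest index.
        seed = max(unassigned, key=lambda n: (len(graph[n] & unassigned), -n))
        clique = {seed}
        for node in sorted(graph[seed] & unassigned):
            if all(node in graph[member] for member in clique):
                clique.add(node)
        clusters.append(sorted(clique))
        unassigned -= clique
    return clusters
```

Each resulting cluster is a set of reads that are all mutually similar, which is the property that makes a per-cluster consensus meaningful. In the iterative scheme on the slide, reads would then be reassigned and clusters merged as the consensus sequences improve.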
  23. Why Isoform-level Clustering?

    Reads of Insert (RoI) are already multi-pass consensus sequences. Advantages of isoform-level clustering: • Remove redundancy • Increase accuracy
  24. Isoform-level Clustering: Background

    Multiple reads come from multiple copies of the same isoform:
    (AAA)n TGGGAGCCTATGCGACAATGAAACCTG…
    (AAA)n TGGAGCAATATGCGAACAATAAAACCTC…
    (AAA)n TGGAGCATATGCGAACAATAAAACGGG…
    Errors are randomly distributed and mostly indels. If we can cluster reads from the same isoform → higher-accuracy consensus sequence.
  25. Given a collection of isoform reads, we can use the same consensus calling algorithm used in PacBio’s de novo genome assemblies (DAGCon)

    Chin et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods.
    In de novo genome assembly, the longest reads are used as “backbone/seed” reads.
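Why clustering raises accuracy can be shown with a column-wise majority vote over reads that have already been aligned to each other. This is a deliberately simple stand-in: DAGCon builds its consensus from a directed acyclic graph of alignments, whereas the sketch below assumes the alignment (with `-` for gaps) is given.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length, pre-aligned reads.

    Columns where the gap symbol wins are dropped, so random insertions
    are removed and random deletions are filled in by the other reads.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)
```

Because errors are randomly distributed, different reads rarely carry the same error at the same position, so each error is outvoted once enough reads from the same isoform are grouped together.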
  26. Iso-Seq™ Bioinformatics Workflow

    [Figure 1: the experimental and informatics pipelines, with the informatics stages labeled pbtranscript.py classify, pbtranscript.py cluster (ICE), and Quiver]
  27. ICE Isoform-level Clustering: Overview

    Runs (.bax.h5) → Reads of Insert (RoI) (.fasta, .ccs.h5) → pbtranscript.py classify → full-length, non-chimeric RoI and non-full-length, non-chimeric RoI → pbtranscript.py cluster (ICE) → cluster consensus → Quiver → HQ, full-length, polished consensus
  28. Quiver for Final Consensus Polishing

    • Recruit non-full-length reads
      – Same “isoform hit” criteria
      – But does not require each read to be fully aligned
      – Each non-FL read can belong to multiple clusters
    Chin et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods.
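The recruitment rule above (partial alignment allowed, one read may join several clusters) can be sketched as follows. The slide does not define the “isoform hit” criteria, so this sketch substitutes a hypothetical one: a non-FL read hits a cluster when its longest exact match to the cluster consensus covers at least `min_fraction` of the read.

```python
from difflib import SequenceMatcher


def recruit_non_fl(non_fl_reads, consensus_by_cluster, min_fraction=0.8):
    """Map each non-FL read ID to the sorted list of clusters it hits.

    The hit rule (longest exact common substring covering at least
    min_fraction of the read) is an illustrative assumption.
    """
    assignments = {}
    for read_id, read in non_fl_reads.items():
        hits = []
        for cluster_id, consensus in consensus_by_cluster.items():
            matcher = SequenceMatcher(None, read, consensus, autojunk=False)
            match = matcher.find_longest_match(0, len(read), 0, len(consensus))
            if read and match.size / len(read) >= min_fraction:
                hits.append(cluster_id)
        assignments[read_id] = sorted(hits)
    return assignments
```

A read that falls entirely inside an exon shared by two isoforms is assigned to both clusters, which matches the slide's point that a non-FL read cannot always be attributed to a single isoform.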
  29. SMRT® Analysis 2.3 Iso-Seq™ Enhancements

    • Continued refinement of algorithms for isoform identification to improve accuracy
      – Better read classification (full-length vs. non-full-length, chimeric vs. non-chimeric)
      – Incorporation of non-full-length reads to predict consensus isoforms and for Quiver polishing
    • Usability enhancements within SMRT Portal UI supporting simplified analysis setup
      – Support for custom primers for targeted Iso-Seq projects
      – UI parameter tuning options for Quiver
      – Human-readable annotations for predicted consensus isoforms
      – Support for cDNA samples with no polyA tails
    • Performance enhancements
      – Faster job execution for analysis using ICE and polish (up to 3x)
      – Less memory usage (up to 8x)
    (*Note: Performance not guaranteed for custom primers)
    https://github.com/PacificBiosciences/cDNA_primer/wiki
  30. Sequencing Full-length cDNA in the Human Transcriptome without Size Selection

    Sharon et al. (2013) Nature Biotechnology. doi:10.1038/nbt.2705
  31. Improve Gene Annotation of Chicken Genome with Size Selection

    Thomas et al. (2014) Long-read sequencing of chicken transcripts and identification of new transcript isoforms. PLoS ONE. doi:10.1371/journal.pone.0094650
  32. Improve Gene Annotation of Chicken Genome

    Thomas et al. (2014) Long-read sequencing of chicken transcripts and identification of new transcript isoforms. PLoS ONE. doi:10.1371/journal.pone.0094650
  33. Characterization of Complex Human Splice Variants by Targeted Iso-Seq™ Analysis

    Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS. doi:10.1073/pnas.1403244111
    “We used un-biased long-read sequencing of full-length neurexin mRNA to systematically assess the alternative splicing of neurexins in prefrontal cortex.”
    “Our data suggests that thousands of neurexin isoforms are physiologically generated…”
  34. Splice Landscape of Neurexin 1α

    Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS. doi:10.1073/pnas.1403244111
    [Figure: Nrxn1α domain structure and exon usage per isoform (green: exon present; white: exon absent), with splice isoform abundance from 2,574 full-length Nrxn1α mRNA sequence reads]
    6 SMRT® Cells; 247 unique alternatively spliced isoforms
  35. Neurexin mRNA Isoform Diversity in the Brain

    “Complexity of Nrxn1α repertoires correlates with the cellular complexity of neuronal tissues, and a specific subset of isoforms is enriched in a purified cell type.”
    “Our analysis defines the molecular diversity of a critical synaptic receptor and provides evidence that neurexin diversity is linked to cellular diversity in the nervous system.”
    [Figure: numbers indicate the total number of isoforms identified by PacBio® sequencing]
  36. Iso_Seq Classify Parameters in SMRT® Portal 2.3

    • Minimum Sequence Length: will depend on the lower cutoff
    • Full-Length Reads Do Not Require PolyA Tails: check this box if the transcript sequence has no polyA tail (e.g., targeted sequencing)
    • Customized Primers: fill in here if NOT using Clontech primers
  37. Iso_Seq Clustering Parameters in SMRT® Portal 2.3

    • Predict Consensus Isoforms Using ICE Algorithm: click to generate consensus isoforms
    • Call Quiver To Polish Consensus Isoforms: click to run Quiver to polish isoforms with non-full-length reads
    • Estimated cDNA Size: select the size bin based on library size
    By default only Isoseq_classify is run!
  38. User Group Meeting Iso-Seq™ Talk

    A novel retroviral-derived human noncoding RNA acts competitively to regulate stem cell biology. Thursday, October 16, 3:45 p.m. – 4:10 p.m. Vittorio Sebastiano, Ph.D., Stem Cell Biology and Regenerative Medicine Institute, Stanford School of Medicine
  39. Mendel’s Pod

    “Today’s podcast is sponsored by Pacific Biosciences, providers of long read sequencing solutions …” Michael Snyder, Ph.D., Professor and Chair of Genetics; Director, Stanford Center for Genomics and Personalized Medicine, Stanford University
  40. ASHG Conference Activity

    PacBio Workshop: A new look at the human genome – novel insights with long read PacBio sequencing. Tuesday, October 21, 12:30 p.m. – 2:00 p.m. Iso-Seq presentation by Hagen Tilgner, Snyder Lab, Stanford.
    Poster presentations:
    • Monday (2:00 – 3:00 p.m.) 1627M: Full-length, single molecule whole transcriptome sequencing reveals alternative 5’-start sites, spliceoforms, and poly(A) addition signal sequences. David Munroe, NCI
    • Tuesday (3:00 – 4:00 p.m.) 552T: Complex alternative splicing patterns in human hematopoietic cell subpopulations revealed by third-generation long reads. Anne Deslattes Mays, Georgetown Univ, Lombardi Cancer Center