Track 1: IsoSeq

FIND MEANING IN COMPLEXITY
Iso-Seq™: Full-Length Transcript Analysis Using SMRT® Analysis V2.3
Nicole Rapicavoli, Field Applications Scientist
October 2014
For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners. © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.

Learning Objectives
Target audience: scientists
• Interested in genome annotation and full-length isoform detection using the PacBio® Iso-Seq™ method.
After the training, you will be able to
• Choose the best protocol for your experimental design.
• Understand how the Iso-Seq method works.
• Run an Iso_Seq job and understand the reports generated in SMRT® Portal.
Prerequisite knowledge
• SMRT Technology
• PacBio System Workflow
• General understanding of SMRT Portal

Iso-Seq™ Method: Resolving Transcript Diversity
On average, 8 exons in a human gene
Candidate space: 5.8 × 10^76
[Figure: repeated rows of exons 1 to 5, ending in "?"]

Transcript Diversity

Current State of Transcript Assembly
“The way we do RNA-seq now is… you take the transcriptome, you blow it up into pieces and then you try to figure out how they all go back together again… If you think about it, it’s kind of a crazy way to do things”
Michael Snyder, Professor and Chair of Genetics, Stanford University
Tal Nawy, End to end RNA Sequencing, Nature Methods, v10, n10, Dec. 2013, p1144–1145
Ian Korf (2013) Genomics: the state of the art in RNA-seq analysis, Nature Methods, Nov 26;10(12):1165-6. doi: 10.1038/nmeth.2735.

Sequencing Full-length cDNA in the Human Transcriptome
“Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level.”

Long-read Solutions: One PacBio® Read Spans Entire Transcript

Iso-Seq™ Experimental Design Recommendations

Detailed Workflow for Conversion of cDNA into SMRTbell™ Libraries
• Input: polyA+ RNA, or total RNA with optional poly-A selection
• Reverse transcription, giving full-length 1st-strand cDNA
• PCR optimization
• Large-scale amplification, giving amplified cDNA
• Size selection into 1-2 kb, 2-3 kb, 3-6 kb and 5-10 kb fractions: BluePippin™ size selection (<10 kb) or gel (<6 kb) methodology
• Re-amplification of each fraction
• SMRTbell template preparation of each fraction
• Optional size selection of the SMRTbell library (BluePippin System; labeled 3-6 kb on the slide)
• SMRT® Sequencing of each fraction

SMARTer Workflow PCR Optimization Agarose-based Size Selection

Library Preparation
Steps shown on the slide:
• DNA damage repair
• Repair ends
• Purify templates
• Ligate adapters
• Primer annealing and bind polymerase
• 3-6 kb BluePippin™ size selection
• SMRT® Sequencing
[Figure: Bioanalyzer trace of SMRTbell™ templates]

Size selection is necessary for loading longer isoforms
Your ability to detect longer isoforms will be determined by the method that you use.
Size selection is needed because shorter transcripts:
• PCR amplify better
• Load preferentially in ZMWs during sequencing

5-10 kb fraction supported
• Now possible to directly sequence full-length transcripts up to 10 kb
• Use of Kapa HiFi DNA Polymerase and size-selection bins
• Considerations
 – Know your tissue and transcript content!
 – mRNA trace must show transcripts larger than 6 kb.

New Officially Supported Iso-Seq™ Protocols http://www.pacificbiosciences.com/support/pubmap/

Updated Iso-Seq™ Protocols

Size-selection protocol: None
Transcript range: ONLY 1-1.5 kb
Advantages:
• Least expensive
• Simplest protocol
• Better than short-read seq. for isoform characterization
Disadvantages:
• Limited transcript size range
When to use:
• If transcripts <1.5 kb are desired
• Time & resource limited

Size-selection protocol: Manual Agarose Gel
Transcript range: 1 to 6 kb
Advantages:
• Long transcripts
• Long insert-size range
• Uses generally accessible lab methods
Disadvantages:
• More time & resource intensive
• Unable to generate longest transcripts
• Size bins not as precise
• Chance of contamination between samples on the same gel
When to use:
• Need transcripts >1.5 kb
• Cannot access BluePippin System

Size-selection protocol: BluePippin™ System
Transcript range: 1 to 10 kb
Advantages:
• Longest transcripts
• Largest insert-size range
• Least chance of cross contamination between samples
• Best representation of isoforms in your sample
Disadvantages:
• Most time & resource intensive
• 4 libraries per sample
• Requires BluePippin System
When to use:
• When the best understanding of the transcriptome is desired
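
The "When to use" guidance in the protocol comparison can be read as a small decision rule. The sketch below is one possible simplification, not an official PacBio recommendation; the function name and its arguments are hypothetical.

```python
def choose_protocol(longest_transcript_kb, has_bluepippin, resource_limited=False):
    """Suggest a size-selection protocol from the comparison on the slide.

    Hypothetical helper: the thresholds come from the slide, the decision
    order is an assumption made for illustration.
    """
    # No size selection covers only ~1-1.5 kb and is the cheapest option.
    if resource_limited or longest_transcript_kb <= 1.5:
        return "no size selection"
    # BluePippin gives the widest range (1 to 10 kb) when it is available.
    if has_bluepippin:
        return "BluePippin"
    # Otherwise a manual agarose gel reaches transcripts up to about 6 kb.
    return "manual agarose gel"
```

For example, `choose_protocol(8, has_bluepippin=True)` suggests the BluePippin protocol, while a lab without the instrument is directed to the gel protocol and should expect the longest transcripts to be missed.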

Iso-Seq™ Methodology

PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts
[Figure, panel a: experimental pipeline and informatics pipeline; panel b: read structure]
Experimental pipeline: polyA mRNA → cDNA synthesis with adapters → size partitioning & PCR amplification → SMRTbell™ ligation → PacBio® RS II sequencing → PacBio raw sequence reads
Informatics pipeline: remove adapters → remove artifacts → clean sequence reads → reads clustering → isoform clusters → consensus calling → nonredundant transcript isoforms → quality filtering → final isoforms → map to reference genome → evidence-based gene models
Figure labels (panel b): raw read, 5’ primer, mRNA sequence, polyA tail (AAA)n, 3’ primer, SMRT® adapter, Reads of Insert
DevNet: Iso-Seq wiki page

Iso-Seq™ Bioinformatics Workflow
[Figure 1: experimental and informatics pipelines, annotated with the tools used]
Informatics pipeline: PacBio raw sequence reads → Reads of Insert → remove adapters → remove artifacts → clean sequence reads → reads clustering → isoform clusters → consensus calling → nonredundant transcript isoforms → quality filtering → final isoforms → map to reference genome → evidence-based gene models
Tools shown alongside the pipeline:
• pbtranscript.py classify
• pbtranscript.py cluster (ICE)
• Quiver

Identify Full-length (FL) Reads – pbtranscript.py classify
Full-length = 5’ primer seen, polyA tail seen, 3’ primer seen
• Identify and remove primers and polyA/T tail
• Identify read strandedness
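
The full-length definition above is a three-way conjunction, which can be sketched as follows. This is only an illustration of the rule stated on the slide, not the classifier's actual code; the `ReadFlags` type and the `require_polya` switch are hypothetical names (the switch mirrors the "Full-Length Reads Do Not Require PolyA Tails" option shown later in the deck).

```python
from dataclasses import dataclass


@dataclass
class ReadFlags:
    """What was detected in one read of insert (hypothetical container)."""
    five_prime_seen: bool
    polya_seen: bool
    three_prime_seen: bool


def is_full_length(flags, require_polya=True):
    """Full-length = 5' primer seen, polyA tail seen, 3' primer seen."""
    # When polyA is not required (e.g. targeted sequencing), only the
    # two primers need to be present.
    polya_ok = flags.polya_seen or not require_polya
    return flags.five_prime_seen and flags.three_prime_seen and polya_ok
```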

Validate size selection by plotting FL read lengths
[Figure: distribution of full-length read lengths for the 1–2 kb, 2–3 kb and 3–6 kb size fractions]
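
Before plotting, it can help to tally how many full-length reads fall in each size bin. The sketch below assumes the read lengths have already been extracted into a list of integers (in bp); the function name and the "outside bins" label are hypothetical.

```python
from collections import Counter

# Size fractions named on the slide, as (low, high) bounds in kb.
SIZE_BINS_KB = [(1, 2), (2, 3), (3, 6)]


def bin_read_lengths(lengths_bp):
    """Count full-length reads per size bin; each bin is [low, high) in kb."""
    counts = Counter()
    for n in lengths_bp:
        kb = n / 1000
        for lo, hi in SIZE_BINS_KB:
            if lo <= kb < hi:
                counts[f"{lo}-{hi} kb"] += 1
                break
        else:
            # Read is shorter than 1 kb or at least 6 kb long.
            counts["outside bins"] += 1
    return counts
```

A library that was size-selected for 2–3 kb but shows most of its reads in the 1–2 kb bin would suggest that the size selection did not work as intended.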

Expected FL% at Different Size Ranges*

Size Selection          FL %
1–2 kb                  50–60%
2–3 kb                  30–45%
3–6 kb (gel or 1 BP)    20–35%
3–6 kb (2 BP)           15–20%

*Based on in-house training samples
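
The expected ranges can be used as a quick sanity check on a run. This is a minimal sketch: the ranges are copied from the slide (and are based on in-house training samples, so other samples may differ), while the dictionary keys and function names are hypothetical.

```python
# Expected full-length percentage per size selection, from the slide.
EXPECTED_FL_PCT = {
    "1-2 kb": (50, 60),
    "2-3 kb": (30, 45),
    "3-6 kb (gel or 1 BP)": (20, 35),
    "3-6 kb (2 BP)": (15, 20),
}


def fl_percent(n_full_length, n_reads):
    """Percentage of reads of insert classified as full-length."""
    return 100.0 * n_full_length / n_reads


def within_expected(size_bin, pct):
    """True if the observed FL% lies inside the slide's expected range."""
    lo, hi = EXPECTED_FL_PCT[size_bin]
    return lo <= pct <= hi
```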

Artificial Chimeras (1): Artificial Concatemer
[Figure: Transcript 1 and Transcript 2, each with its own 5’ primer and 3’ primer, joined in a single read; (AAA)n marked]

Cause: Low SMRT® adapter concentration
Outcome: Primer-ligated cDNAs form concatemers
Detection: High incidence of artificial chimeras (identifiable cDNA primer in the middle)

Sample: MCF-7, Clontech, 1–2 kb
Trainee    Artificial chimeras
A          2415 (3.9%)
B          79 (0.5%)
C          304 (0.2%)
D          235 (0.2%)
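
The detection criterion, a cDNA primer found in the middle of a read, can be illustrated with a toy check. This sketch uses exact string matching and ignores reverse complements and sequencing errors, so it is far simpler than a real classifier; the function name, the `margin` parameter and the primer sequence in the usage note are placeholders, not the Clontech primers.

```python
def has_internal_primer(read, primers, margin=30):
    """Return True if any primer occurs away from both ends of the read.

    `margin` bases at each end are ignored, since primers are expected there.
    Exact matching only: an assumption made to keep the sketch short.
    """
    interior = read[margin:len(read) - margin]
    return any(primer in interior for primer in primers)
```

With a placeholder primer `"ACGTACGT"`, a read of the form primer + insert + primer + insert is flagged, while primer + insert is not.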

Artificial Chimeras (2): PCR Chimera
[Figure: a single read joining Transcript 1 (partial, reversed) and Transcript 2 (partial); labels: 5’ primer, 3’ primer, (TTT)n, (AAA)n]

Cause: PCR amplification
Outcome: Random fusion of ligated transcripts
Detection: Single read maps to different loci/genes
However, there are also biological chimeras!

Sample         Size Selection    Multi-mapped
MCF-7          1–2 kb            2.7%
Rat Muscle     1–2 kb            3.2%
Mouse Liver    1–2 kb            2.2%
Mouse Liver    2–3 kb            1.6%

Bioinformatics QC Summary
• Identify Full-length Reads
 – % of expected full-length reads/cell differs depending on transcript size range.
• Detect and Remove Artificial Chimeras
 – Artificial concatemers are rare (~0.2%) and avoidable by increasing SMRT® adapter concentration.
 – PCR chimeras are difficult to completely avoid (~3%) but can be detected computationally (if reference genome available).
 – However, there are also biological chimeras.
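
The computational detection of PCR chimeras relies on a single read mapping to different loci. A minimal sketch of that test is shown below, assuming the read has already been aligned to a reference and split into aligned segments; the function name, the segment tuple layout and the `max_gap` threshold are all hypothetical choices. As the summary notes, a biological chimera would pass the same test, so a flagged read is only a candidate.

```python
def maps_to_multiple_loci(segments, max_gap=100_000):
    """Flag a read whose aligned pieces land on different loci.

    `segments` is a list of (chrom, start, end) tuples for one read.
    Pieces on different chromosomes, or further apart than `max_gap`
    (an arbitrary illustrative threshold), count as different loci.
    """
    ordered = sorted(segments)
    for (c1, _s1, e1), (c2, s2, _e2) in zip(ordered, ordered[1:]):
        if c1 != c2 or s2 - e1 > max_gap:
            return True
    return False
```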

Clustering – pbtranscript.py cluster
Next step: figure out which reads come from the same isoforms
[Figure: Runs (.bax.h5) → Reads Of Insert (RoI) (.fasta, .ccs.h5) → full-length, non-chimeric RoI and non-full-length, non-chimeric RoI]

Isoform-level Clustering: Overview
[Figure: Runs (.bax.h5) → Reads Of Insert (RoI) (.fasta, .ccs.h5) → pbtranscript.py classify → full-length, non-chimeric RoIs and non-full-length, non-chimeric RoIs; full-length, non-chimeric RoIs → ICE (pbtranscript.py cluster) → cluster consensus]

Iterative Clustering for Error Correction
Input: full-length, non-chimeric RoI
• Build similarity graph using BLASR
• Clique finding
• Fast consensus calling using DAGCon
• Cluster reassignment
• Merge clusters
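
The first two steps above (similarity graph, then clique finding) can be illustrated with a toy example. This is a sketch under loud assumptions: `difflib.SequenceMatcher` stands in for BLASR alignments, the 0.9 threshold is arbitrary, and the greedy clique search is a simplification rather than the algorithm ICE uses. Consensus calling, cluster reassignment and merging are not shown.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, min_ratio=0.9):
    # Stand-in for a BLASR-based "isoform hit"; NOT what ICE actually uses.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


def build_graph(reads):
    """Similarity graph: node = read index, edge = the two reads look alike."""
    graph = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if similar(reads[i], reads[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_cliques(graph):
    """Partition nodes into cliques, seeding from the best-connected node."""
    unassigned = set(graph)
    clusters = []
    while unassigned:
        seed = max(unassigned, key=lambda n: (len(graph[n] & unassigned), -n))
        clique = {seed}
        for n in sorted(graph[seed] & unassigned):
            # Add n only if it is adjacent to every current clique member.
            if clique <= graph[n]:
                clique.add(n)
        clusters.append(sorted(clique))
        unassigned -= clique
    return clusters
```

Each resulting cluster is a candidate isoform: every pair of reads in it is mutually similar, which is what makes a per-cluster consensus meaningful.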

Why Isoform-level Clustering?
Reads Of Insert (RoI) are already multi-pass consensus sequences.
Advantage of isoform-level clustering:
• Remove redundancy
• Increase accuracy

Isoform-level Clustering: Background
Multiple reads come from multiple copies of the same isoform:
(AAA)n TGGGAGCCTATGCGACAATGAAACCTG…
(AAA)n TGGAGCAATATGCGAACAATAAAACCTC…
(AAA)n TGGAGCATATGCGAACAATAAAACGGG…
Errors are randomly distributed and mostly indels.
If we can cluster reads from the same isoform → higher-accuracy consensus sequence
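
Because the errors are random, they tend to differ from read to read, so a vote across reads of the same isoform cancels most of them out. The sketch below shows the idea with a per-column majority vote over reads that are assumed to be already aligned and gap-padded with "-". This is a deliberate simplification for illustration; it is not how DAGCon or Quiver compute a consensus, and the function name is hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority vote per column over equal-length, gap-padded reads."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        # A winning gap means most reads have no base at this position.
        if base != "-":
            consensus.append(base)
    return "".join(consensus)
```

In the toy input `["ACG-TA", "AC-TTA", "ACGTTA"]`, two reads each carry a different single-base deletion, and the vote still recovers the third read's sequence.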

Given a collection of isoform reads, we can use the same consensus calling algorithm used in PacBio’s de novo genome assemblies (DAGCon).
In de novo genome assembly, the longest reads are used as “backbone/seed” reads.
Chin et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods (2013)

Isoform-level Clustering: Overview
[Figure: Runs (.bax.h5) → Reads Of Insert (RoI) (.fasta, .ccs.h5) → pbtranscript.py classify → full-length, non-chimeric RoI and non-full-length, non-chimeric RoI; full-length, non-chimeric RoI → ICE (pbtranscript.py cluster) → cluster consensus; cluster consensus and non-full-length, non-chimeric RoI → Quiver → HQ, full-length, polished consensus]

Quiver for Final Consensus Polishing
• Recruit non-full-length reads
 – Same “isoform hit” criteria
 – But does not require each read to be fully aligned
 – Each non-FL read can belong to multiple clusters
Chin et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods (2013)

2.3 SMRT® Analysis Iso-Seq™ Enhancements
• Continued refinement of algorithms for isoform identification to improve accuracy
 – Better read classification (full-length vs. non-full-length, chimeric vs. non-chimeric)
 – Incorporation of non-full-length reads to predict consensus isoforms and for Quiver polishing
• Usability enhancements within SMRT Portal UI supporting simplified analysis setup
 – Support for custom primers for targeted Iso-Seq projects
 – UI parameter tuning options for Quiver
 – Human-readable annotations for predicted consensus isoforms
 – Support for cDNA samples with no polyA tails
• Performance enhancements
 – Faster job execution for analysis using ICE and polish (up to 3x)
 – Less memory usage (up to 8x)
(*Note: Performance not guaranteed for custom primers)
https://github.com/PacificBiosciences/cDNA_primer/wiki

Iso-Seq™ Examples

Sequencing Full-length cDNA in the Human Transcriptome without Size Selection
Sharon et al. (2013) Nature Biotechnology. doi:10.1038/nbt.2705

Improve Gene Annotation of Chicken Genome with Size Selection
• Thomas et al. (2014) Long-read sequencing of chicken transcripts and identification of new transcript isoforms. PLoS ONE. doi: 10.1371/journal.pone.0094650

Improve Gene Annotation of Chicken Genome
• Thomas et al. (2014) Long-read sequencing of chicken transcripts and identification of new transcript isoforms. PLoS ONE. doi: 10.1371/journal.pone.0094650

Characterization of Complex Human Splice Variants by Targeted Iso-Seq™ Analysis
Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS. doi:10.1073/pnas.1403244111
“We used un-biased long-read sequencing of full-length neurexin mRNA to systematically assess the alternative splicing of neurexins in prefrontal cortex.”
“Our data suggests that thousands of neurexin isoforms are physiologically generated…”

Splice Landscape of Neurexin 1α
Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS. doi:10.1073/pnas.1403244111
[Figure: Nrxn1α domain structure and splice isoform abundance; exons shown in green (present) or white (absent)]
• 2,574 full-length Nrxn1α mRNA sequence reads
• 6 SMRT® Cells
• 247 unique alternatively spliced isoforms

Neurexin mRNA Isoform Diversity in the Brain
“Complexity of Nrxn1α repertoires correlates with the cellular complexity of neuronal tissues, and a specific subset of isoforms is enriched in a purified cell type.”
“Our analysis defines the molecular diversity of a critical synaptic receptor and provides evidence that neurexin diversity is linked to cellular diversity in the nervous system.”
[Figure: numbers indicate total number of isoforms identified by PacBio® sequencing]

Iso-Seq™ Implementation Walk-Through

Create New Iso_Seq Job in SMRT® Portal

Create New Iso_Seq Job in SMRT® Portal 2.3

Iso_Seq Classify Parameters in SMRT® Portal 2.3
• Minimum Sequence Length: will depend on the lower cutoff
• Full-Length Reads Do Not Require PolyA Tails: check this box if the transcript sequence has no polyA tail (e.g., targeted sequencing)
• Customized Primers: fill in here if NOT using Clontech primers

Iso_Seq Clustering Parameters in SMRT® Portal 2.3
• Predict Consensus Isoforms Using ICE Algorithm: click to generate consensus isoforms
• Call Quiver To Polish Consensus Isoforms: click to run Quiver to polish isoforms with non-full-length reads
• Estimated cDNA Size: select size bin based on library size
By default only Isoseq_classify is run!
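
The effect of the two checkboxes on what the job runs can be summarized in a few lines. This is only a sketch of the behavior described on the slide, not SMRT Portal code; the function and argument names are hypothetical, and the assumption that Quiver polishing only applies once ICE has produced consensus isoforms is an inference from the option's description.

```python
def steps_to_run(predict_with_ice=False, polish_with_quiver=False):
    """List the analysis stages implied by the clustering checkboxes."""
    # Classification always runs; it is the only step enabled by default.
    steps = ["classify"]
    if predict_with_ice:
        steps.append("cluster (ICE)")
        # Polishing needs consensus isoforms, so it is nested under ICE here.
        if polish_with_quiver:
            steps.append("polish (Quiver)")
    return steps
```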

Additional Information

New Iso-Seq™ Application Website http://www.pacificbiosciences.com/isoseq

Coming Soon: Sample Human Dataset Release and Blog Post

User Group Meeting Iso-Seq™ Talk
A novel retroviral-derived human noncoding RNA acts competitively to regulate stem cell biology
Thursday, October 16, 3:45 p.m. – 4:10 p.m.
Vittorio Sebastiano, Ph.D., Stem Cell Biology and Regenerative Medicine Institute, Stanford School of Medicine

Mendel’s Pod
“Today’s podcast is sponsored by Pacific Biosciences, providers of long read sequencing solutions …”
Michael Snyder, Ph.D., Professor and Chair of Genetics; Director, Stanford Center for Genomics and Personalized Medicine, Stanford University

ASHG Conference Activity
PacBio Workshop: A new look at the human genome – novel insights with long read PacBio sequencing
Tuesday, October 21, 12:30 p.m. – 2:00 p.m.
Iso-Seq presentation by: Hagen Tilgner, Snyder Lab, Stanford
Poster presentations:
• Monday (2:00 – 3:00 p.m.) 1627M: Full-length, single molecule whole transcriptome sequencing reveals alternative 5’-start sites, spliceoforms, and poly(A) addition signal sequences. David Munroe, NCI
• Tuesday (3:00 – 4:00 p.m.) 552T: Complex alternative splicing patterns in human hematopoietic cell subpopulations revealed by third-generation long reads. Anne Deslattes Mays, Georgetown Univ, Lombardi Cancer Center
