RNAseq1

BMMB554 | RNA seq 1 - mapping and transcript assembly

Anton Nekrutenko

March 30, 2016

Transcript

  1. RNAseq I
    Read mapping and transcript reconstruction

  2. the RNA world
    Transcriptome

  4. Why genes
    in pieces?
    Licatalosi and Darnell 2010

  6. RNA-Seq bioinformatics
    How RNA-seq data is generated
    ‣ Isolate transcript RNA (poly-A tailed)
    ‣ Reverse transcription
    ‣ Fragment cDNA
    ‣ Size selection
    ‣ Illumina sequencing of each end
    Based on Illumina approach
    *Strand-specific RNA-seq protocols exist for both Illumina and SOLiD
    Slide compliments of Andrew McPherson
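
As a rough illustration of the library steps on this slide, the Python sketch below fragments a cDNA string, size-selects the fragments, and reads a fixed number of bases from each end. The function names, the random break model, the size window, and the read length are all illustrative assumptions and do not come from the slide.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def fragment(cdna, n_breaks, rng):
    """Break a cDNA string at n_breaks distinct random positions."""
    cuts = sorted(rng.sample(range(1, len(cdna)), n_breaks))
    bounds = [0] + cuts + [len(cdna)]
    return [cdna[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls inside the selection window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def paired_end_reads(frag, read_len):
    """Read read_len bases inward from each end of a fragment.

    Read 2 comes from the opposite strand, so it is reverse-complemented.
    """
    return frag[:read_len], reverse_complement(frag[-read_len:])


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    cdna = "".join(rng.choice("ACGT") for _ in range(600))
    for frag in size_select(fragment(cdna, 5, rng), 50, 200):
        print(paired_end_reads(frag, 25))
```

Real fragmentation and size selection are not uniform, so this is only meant to show why each sequenced fragment yields two reads on opposite strands.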

  7. RNA isolation
    ‣ Treat your samples well
    ‣ Solubilization
    ‣ Recovery
    ‣ Normalization and Enrichment
    Guanidinium thiocyanate

  8. Normalization
    &
    Enrichment
    Adiconis et al. 2013

  9. Library preparation
    ‣ 1st strand synthesis
    ‣ 2nd strand synthesis
    ‣ Stranded libraries

  10. oligo-dT vs. random priming

  11. 2nd strand synthesis
    ‣ RNA displacement
    ‣ NuGen
    ‣ SMART: oligo-dG/strand switching

  12. Stranded libraries

  13. GATC Biotech

  14. DSN-normalization

  15. Single Cell RNAseq
    Saliba et al. 2014

  16. RNA-seq data analysis
    • Can be analyzed in many different ways
    depending on goals of the experiment, what
    other data is available, et cetera

  17. Align-then-assemble or de novo?
    RNA-Seq data enable de novo reconstruction of the transcriptome.
    [Figure: two routes from RNA-Seq reads to transcripts, shown for more abundant and less abundant transcripts]
    ‣ Align reads to genome, then assemble transcripts from spliced alignments
    ‣ Assemble transcripts de novo, then align transcripts to genome

  19. • Align-then-assemble: potentially more sensitive,
    but requires a reference genome, confounded by
    structural variation
    • de novo: likely to only capture highly expressed
    transcripts, but does not require a reference
    genome, robust to variation

  20. Aligning RNA-seq reads to a genome
    [Figure: reads in RNA-seq drawn from a transcript (Exon A + Exon B, Exon C + Exon D) and their unknown ("?") placement on the chromosome]

  21. Spliced mapping
    [Figure: panels a, b, c; panel text not recoverable]
    Exon-first is more efficient, likely more sensitive for shorter reads, but
    can produce erroneous alignments for duplicates and pseudogenes.

  22. BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 9 2009, pages 1105–1111
    doi:10.1093/bioinformatics/btp120
    Sequence analysis
    TopHat: discovering splice junctions with RNA-Seq
    Cole Trapnell (1,*), Lior Pachter (2) and Steven L. Salzberg (1)
    (1) Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742 and
    (2) Department of Mathematics, University of California, Berkeley, CA 94720, USA
    Received on October 23, 2008; revised on February 24, 2009; accepted on February 26, 2009
    Advance Access publication March 16, 2009
    Associate Editor: Ivo Hofacker
    ABSTRACT
    Motivation: A new protocol for sequencing the messenger RNA
    in a cell, known as RNA-Seq, generates millions of short sequence
    fragments in a single run. These fragments, or ‘reads’, can be used
    to measure levels of gene expression and to identify novel splice
    variants of genes. However, current software for aligning RNA-Seq
    data to a genome relies on known splice junctions and cannot identify
    novel ones. TopHat is an efficient read-mapping algorithm designed
    to align reads from an RNA-Seq experiment to a reference genome
    without relying on known splice sites.
    Results: We mapped the RNA-Seq reads from a recent mammalian
    RNA-Seq experiment and recovered more than 72% of the splice
    junctions reported by the annotation-based software from that study,
    along with nearly 20 000 previously unreported junctions. The TopHat
    pipeline is much faster than previous systems, mapping nearly 2.2
    million reads per CPU hour, which is sufficient to process an entire
    RNA-Seq experiment in less than a day on a standard desktop
    computer. We describe several challenges unique to ab initio splice
    site discovery from RNA-Seq reads that will require further algorithm [...]
    [...] measurements of expression at comparable cost (Marioni et al.,
    2008).
    The major drawback of RNA-Seq over conventional EST
    sequencing is that the sequences themselves are much shorter,
    typically 25–50 nt versus several hundred nucleotides with older
    technologies. One of the critical steps in an RNA-Seq experiment
    is that of mapping the NGS ‘reads’ to the reference transcriptome.
    However, because the transcriptomes are incomplete even for well-
    studied species such as human and mouse, RNA-Seq analyses
    are forced to map to the reference genome as a proxy for
    the transcriptome. Mapping to the genome achieves two major
    objectives of RNA-Seq experiments:
    (1) Identification of novel transcripts from the locations of
    regions covered in the mapping.
    (2) Estimation of the abundance of the transcripts from their depth
    of coverage in the mapping.
    Because RNA-Seq reads are short, the first task is challenging.
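
The two objectives quoted above (covered regions and depth of coverage) can be sketched in a few lines of Python. This is a minimal illustration on toy half-open intervals; the function names and the `min_depth` threshold are assumptions made here, not part of TopHat.

```python
def coverage(alignments, genome_length):
    """Per-base read depth from half-open (start, end) alignment intervals."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering at start
        diff[end] -= 1     # and stops covering at end
    depth, running = [], 0
    for d in diff[:genome_length]:
        running += d
        depth.append(running)
    return depth


def covered_regions(depth, min_depth=1):
    """Maximal runs with depth >= min_depth, as half-open intervals."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = i
        elif d < min_depth and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


def mean_depth(depth, start, end):
    """Average depth over a region, a crude proxy for abundance."""
    return sum(depth[start:end]) / (end - start)


if __name__ == "__main__":
    depth = coverage([(0, 4), (2, 6), (10, 14)], 16)
    for start, end in covered_regions(depth):
        print(start, end, mean_depth(depth, start, end))
```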

  23. [Cropped excerpt from the TopHat paper; text not recoverable]
    Bowtie mapping does not allow gaps, so reads spanning splice
    junctions won't map
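
A toy example of why an ungapped aligner misses junction reads. Plain substring search stands in for ungapped mapping here; it is not how Bowtie works internally, and the sequences are invented for illustration.

```python
# Toy gene: two exons separated by an intron with GT...AG boundaries.
EXON_A = "ATGGCCATTGTAATGGGCCGCTG"
INTRON = "GTAAGTATCCTTTTCCAG"
EXON_B = "AAAGGGTGCCCGATAGTTTAAAC"

GENOME = EXON_A + INTRON + EXON_B   # what the aligner searches
TRANSCRIPT = EXON_A + EXON_B        # what the reads are sampled from


def ungapped_hit(read, reference):
    """Return the offset of an exact, gap-free match, or -1 if none."""
    return reference.find(read)


if __name__ == "__main__":
    exonic_read = TRANSCRIPT[:12]
    junction_read = TRANSCRIPT[len(EXON_A) - 6:len(EXON_A) + 6]
    print("exonic read maps at", ungapped_hit(exonic_read, GENOME))
    print("junction read maps at", ungapped_hit(junction_read, GENOME))
```

The junction read is a perfect match to the transcript, but its two halves lie on either side of the intron in the genome, so no contiguous placement exists.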

  24. [Cropped excerpt from the TopHat paper; text not recoverable]
    Extend islands ~50bp and identify GT-AG pairing sites between
    neighboring islands (within 20kb)
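
The island-pairing step can be sketched as follows. This is a simplified illustration under stated assumptions: islands are given as half-open intervals on the forward strand, the defaults mirror the slide's "~50bp" and "20kb", and `candidate_introns` is a hypothetical helper name, not a TopHat function.

```python
def extend(island, flank, genome_length):
    """Widen an island by flank bases on each side, clipped to the genome."""
    start, end = island
    return max(0, start - flank), min(genome_length, end + flank)


def find_all(genome, motif, start, end):
    """Offsets of every occurrence of motif lying fully inside [start, end)."""
    hits = []
    i = genome.find(motif, start, end)
    while i != -1:
        hits.append(i)
        i = genome.find(motif, i + 1, end)
    return hits


def candidate_introns(genome, islands, flank=50, max_intron=20000):
    """Pair GT donors in one island with AG acceptors in the next island.

    Returns half-open (intron_start, intron_end) intervals, so that
    genome[intron_start:intron_end] begins with GT and ends with AG.
    """
    extended = [extend(i, flank, len(genome)) for i in sorted(islands)]
    candidates = []
    for (s1, e1), (s2, e2) in zip(extended, extended[1:]):
        donors = find_all(genome, "GT", s1, e1)
        acceptors = find_all(genome, "AG", s2, e2)
        for d in donors:
            for a in acceptors:
                if a > d and a + 2 - d <= max_intron:
                    candidates.append((d, a + 2))
    return candidates
```

Most of the pairs produced this way are false; they only become useful once reads are found that actually span them.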

  25. TopHat
    [Figure: genome browser view of chr9:26559200-26559500 with tracks for brain RNA coverage (0.04 to 2.34), STS markers, UCSC gene predictions, RefSeq genes (B3gat1), and mouse mRNAs from GenBank]
    Fig. 2. An intron entirely overlapped by the 5′-UTR of another transcript. Both isoforms are present in the brain tissue RNA sample. The top track is the
    normalized uniquely mappable read coverage reported by ERANGE for this region (Mortazavi et al., 2008). The lack of a large coverage gap causes TopHat
    to report a single island containing both exons. TopHat looks for introns within single islands in order to detect this junction.
    Use a coverage statistic to identify pairing sites within
    single islands

  27. [Cropped excerpt from the TopHat paper; body text not recoverable]
    Fig. 3. The seed and extend alignment used to match reads to possible splice
    sites. For each possible splice site, a seed is formed by combining a small [caption truncated on slide]
    Use putative splice junctions as seeds and search for
    matching unmapped reads
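
A minimal seed-and-extend sketch for this step, assuming a candidate junction is given as a half-open intron interval. The seed here is a few bases from each side of the junction; the parameter names, the exact-seed requirement, and the mismatch counting are illustrative assumptions, not TopHat's implementation.

```python
def junction_seed(genome, intron_start, intron_end, seed_half):
    """Join seed_half bases upstream of the intron with seed_half bases downstream."""
    return (genome[intron_start - seed_half:intron_start]
            + genome[intron_end:intron_end + seed_half])


def align_to_junction(read, genome, intron_start, intron_end,
                      seed_half=4, max_mismatches=0):
    """Try to place a read across one candidate junction.

    Returns (left_start, left_len, right_len) or None. Only the first
    occurrence of the seed in the read is considered.
    """
    seed = junction_seed(genome, intron_start, intron_end, seed_half)
    pos = read.find(seed)
    if pos == -1:
        return None
    left_len = pos + seed_half          # read bases upstream of the junction
    right_len = len(read) - left_len    # read bases downstream of it
    if left_len > intron_start or intron_end + right_len > len(genome):
        return None
    # Extend: rebuild the spliced reference under the read and compare.
    spliced = (genome[intron_start - left_len:intron_start]
               + genome[intron_end:intron_end + right_len])
    mismatches = sum(a != b for a, b in zip(read, spliced))
    if mismatches > max_mismatches:
        return None
    return intron_start - left_len, left_len, right_len
```

Running every initially unmapped read against every candidate junction this way is what turns the GT-AG guesses into junctions with read support.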

  30. Kim et al. 2015

  31. Kim et al. 2015

  34. Transcriptome reconstruction

  35. • We now have
    • Predicted exons expressed in the sample
    • Predicted splice junctions expressed in the
    sample
    • Does this tell us what isoforms are present?
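
One way to see why the answer is "not by itself": treat exons as nodes and junctions as edges, then list every isoform set that explains the observed junctions. The gene below (two independent cassette exons, E2 and E4) and the helper names are invented for illustration.

```python
from itertools import combinations


def enumerate_paths(junctions, source, sink):
    """All exon chains from source to sink that use only observed junctions."""
    graph = {}
    for a, b in junctions:
        graph.setdefault(a, []).append(b)
    paths = []

    def walk(node, path):
        if node == sink:
            paths.append(tuple(path))
            return
        for nxt in graph.get(node, []):
            walk(nxt, path + [nxt])

    walk(source, [source])
    return paths


def explains_all(isoforms, junctions):
    """True if the isoforms together use every observed junction."""
    used = {(a, b) for iso in isoforms for a, b in zip(iso, iso[1:])}
    return used == set(junctions)


def smallest_explanations(junctions, source, sink):
    """Every smallest isoform set that accounts for all junctions."""
    paths = enumerate_paths(junctions, source, sink)
    for k in range(1, len(paths) + 1):
        sets = [c for c in combinations(paths, k) if explains_all(c, junctions)]
        if sets:
            return sets
    return []


if __name__ == "__main__":
    junctions = [("E1", "E2"), ("E2", "E3"), ("E1", "E3"),
                 ("E3", "E4"), ("E4", "E5"), ("E3", "E5")]
    for isoform_set in smallest_explanations(junctions, "E1", "E5"):
        print(isoform_set)
```

For this toy gene, four isoforms are possible and two different pairs of isoforms each explain every junction, so exons and junctions alone do not say which isoforms are present.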

  37. LETTERS
    Transcript assembly and quantification by RNA-Seq
    reveals unannotated transcripts and isoform switching
    during cell differentiation
    Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren,
    Steven L Salzberg, Barbara J Wold & Lior Pachter
    High-throughput mRNA sequencing (RNA-Seq) promises
    simultaneous transcript discovery and abundance estimation [1–3].
    However, this would require algorithms that are not restricted
    by prior gene annotations and that account for alternative
    transcription and splicing. Here we introduce such algorithms
    in an open-source software program called Cufflinks. To test
    Cufflinks, we sequenced and analyzed >430 million paired
    75-bp RNA-Seq reads from a mouse myoblast cell line over
    a differentiation time series. We detected 13,692 known
    transcripts and 3,724 previously unannotated ones, 62% of
    which are supported by independent expression data or by
    homologous genes in other species. Over the time series, 330
    genes showed complete switches in the dominant transcription [...]
    [...] (75 bp in this work versus 25 bp in our previous work) and pairs of
    reads from both ends of each RNA fragment can reduce uncertainty
    in assigning reads to alternative splice variants [12]. To produce useful
    transcript-level abundance estimates from paired-end RNA-Seq
    data, we developed a new algorithm that can identify complete novel
    transcripts and probabilistically assign reads to isoforms.
    For our initial demonstration of Cufflinks, we performed a time
    course of paired-end 75-bp RNA-Seq on a well-studied model of
    skeletal muscle development, the C2C12 mouse myoblast cell line [13]
    (see Online Methods). Regulated RNA expression of key transcription
    factors drives myogenesis, and the execution of the differentiation
    process involves changes in expression of hundreds of genes [14,15].
    Previous studies have not measured global transcript isoform expres- [...]
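
To make "probabilistically assign reads to isoforms" concrete, here is a deliberately simplified expectation-maximization sketch. It assumes each read is summarized only by the set of isoforms it is compatible with and ignores transcript length and the fragment length distribution, both of which a full model such as Cufflinks' takes into account. The function name and inputs are hypothetical.

```python
def em_abundances(compatibility, n_isoforms, n_iter=200):
    """Estimate relative isoform abundances from read compatibility sets.

    compatibility: one collection of isoform indices per read.
    Returns a list of proportions that sums to 1.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms
        # in proportion to the current abundance estimates.
        for isoforms in compatibility:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                counts[i] += theta[i] / total
        # M-step: abundances are the normalized fractional read counts.
        theta = [c / len(compatibility) for c in counts]
    return theta


if __name__ == "__main__":
    # 30 reads unique to isoform 0, 10 unique to isoform 1, 60 shared.
    reads = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
    print(em_abundances(reads, 2))
```

In this toy input the shared reads end up divided in the same 3:1 ratio as the uniquely assignable reads.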

  38. [Figure: TopHat and Cufflinks workflow (panels a, b, d)]
    ‣ TopHat: map paired cDNA fragment sequences to genome, giving spliced fragment alignments
    ‣ Cufflinks: assembly (mutually incompatible fragments) and abundance estimation

  39. Pertea et al. 2015

  40. [Figure: Cufflinks algorithm overview (panels b to e)]
    ‣ Assembly: mutually incompatible fragments → overlap graph → minimum path cover → transcripts
    ‣ Abundance estimation: transcript coverage and compatibility, fragment length distribution → log-likelihood → maximum likelihood abundances
    Trapnell et al. 2010
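
The "minimum path cover" box can be illustrated with the textbook reduction to bipartite matching: in a directed acyclic graph with n nodes, a minimum set of vertex-disjoint paths covering all nodes has n minus (maximum matching size) paths. The sketch below implements that reduction on a small hand-made graph; it is a generic illustration, not Cufflinks' code, and node numbering and function names are assumptions.

```python
def max_bipartite_matching(adj, n):
    """Augmenting-path matching; left and right sides are both nodes 0..n-1.

    Returns (size, match_right) where match_right[v] is the left node
    matched to right node v, or -1.
    """
    match_right = [-1] * n

    def try_augment(u, seen):
        for v in adj.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    size = sum(try_augment(u, set()) for u in range(n))
    return size, match_right


def minimum_path_cover(n, edges):
    """Cover every node of a DAG with as few vertex-disjoint paths as possible."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    _, match_right = max_bipartite_matching(adj, n)
    # Each matched pair (u -> v) says v follows u on some path.
    successor = {match_right[v]: v for v in range(n) if match_right[v] != -1}
    starts = [v for v in range(n) if match_right[v] == -1]
    paths = []
    for s in starts:
        path = [s]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Nodes stand for fragments; an edge u -> v means v can follow u
    # in the same transcript.
    print(minimum_path_cover(5, [(0, 1), (1, 2), (0, 3), (3, 4)]))
```

Each resulting path corresponds to one candidate transcript, so fewer paths means a more parsimonious explanation of the fragments.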

  41. Pertea et al. 2015

  42. Long read RNAseq
    Tilgner et al. 2014
