Summarizing tens of thousands of RNA-seq samples: themes and lessons

Abstract: The Sequence Read Archive contains RNA-seq data for over 450K samples, including over 140K from humans. Large-scale projects like GTEx and ICGC are generating RNA-seq data on many thousands of samples. Such huge datasets are valuable, but unwieldy for typical researchers. I will describe work toward the goal of making it easy for researchers to use the archived RNA-seq data available today. I will highlight Rail-RNA (http://rail.bio), its dbGaP-protected version (http://docs.rail.bio/dbgap/), as well as the recount resource (https://jhubiostatistics.shinyapps.io/recount/) and Snaptron service/API (http://snaptron.cs.jhu.edu). Besides showcasing these tools and resources, I'll expound three themes: (a) public data is valuable but not easy to use and computationalists should attack this; (b) scalability is not just about scaling software to be distributed & multi-threaded, but is also about making the best use of many datasets at once; (c) "strategically unplugging" from gene annotations can lead to clearer statements about splicing and differential expression.

This is joint work with Abhinav Nellore, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Ben Langmead

March 30, 2017

Transcript

  1. Summarizing tens of thousands of RNA-seq samples: themes and lessons
    Ben Langmead
    Assistant Professor, JHU Computer Science
    langmead-lab.org, @BenLangmead
    Banff, March 30, 2017

  2. Themes
    • Public data is available & valuable but hard to use

  3. Themes
    • Public data is available & valuable but hard to use
    • Scalable software benefits from big resources & data

  4. Themes
    • Public data is available & valuable but hard to use
    • Scalable software benefits from big resources & data
    • Strategically ignoring gene annotations can yield clearer results

  5. Sequence Read Archive (SRA) growth
    [Figure: SRA size on a log scale (10 Tbp to 10 Pbp), with series "Total" and "Open access"]
    Plot annotation: 4.5 -> 9 Pbp in ~10 months
    https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

  6. Abhinav Nellore, OHSU
    Jeff Leek, JHU
    Website: http://rail.bio
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  7. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work

  8. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation

  9. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for

  10. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for
    • Runs easily on commercial AWS cloud, other clusters

  11. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence, coverage vectors; no alignments, unless asked for
    • Runs easily on commercial AWS cloud, other clusters

  12. Aggregating across samples adds a dimension to junction call confidence
    [Figure: Samples 1-5 shown against Candidate Junctions 1-3]

  13. Aggregating across samples adds a dimension to junction call confidence
    [Figure: Samples 1-5 shown against Candidate Junctions 1-3]
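
The idea on this slide, that a junction seen in many samples is more believable than one seen in a single sample, can be illustrated with a minimal Python sketch. This is not Rail-RNA's actual filter: the `calls` data, the function names, and the `min_samples` threshold are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical per-sample junction calls: sample -> {junction: supporting reads}.
# Junctions are (chrom, start, end) tuples; every value here is made up.
calls = {
    "S1": {("chr1", 100, 200): 12, ("chr1", 300, 400): 1},
    "S2": {("chr1", 100, 200): 7},
    "S3": {("chr1", 100, 200): 9, ("chr1", 500, 600): 2},
    "S4": {("chr1", 100, 200): 3, ("chr1", 500, 600): 1},
    "S5": {("chr1", 100, 200): 5},
}


def samples_per_junction(calls):
    """Count, for each candidate junction, how many samples call it."""
    support = defaultdict(int)
    for junctions in calls.values():
        for jx in junctions:
            support[jx] += 1
    return dict(support)


def filter_junctions(calls, min_samples):
    """Keep junctions called in at least `min_samples` samples (assumed rule)."""
    support = samples_per_junction(calls)
    return {jx for jx, n in support.items() if n >= min_samples}


support = samples_per_junction(calls)
kept = filter_junctions(calls, min_samples=2)
print(support)
print(sorted(kept))
```

With `min_samples=2`, the junction seen only once in one sample is dropped, while junctions recurring across samples survive even when per-sample read support is low.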

  14. Rail-RNA design
    [Figure: Rail-RNA workflow diagram; box labels as extracted]
    • Preprocess
    • Aggregate duplicate reads
    • Split into readlets
    • Aggregate duplicate readlets
    • Correlation clustering for readlet alignments
    • Call splice junction
    • Merge exon differentials
    • Compile sample coverages
    Alignment boxes (aligner labels shown: Bowtie 2, Bowtie, Bowtie):
    • Align reads end-to-end to genome
    • Align readlets to genome
    • Align readlets to junction co-occurrence index
    Output boxes:
    • Write bigWigs
    • Write normalization factors
    • Write spliced alignment BAMs
    • Write junction & indel BEDs
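
Two of the boxes above, aggregating duplicates and splitting reads into readlets, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Rail-RNA's implementation: the function names and the `length` and `stride` parameters are hypothetical, and the example reads are made up.

```python
from collections import Counter


def aggregate_duplicates(seqs):
    """Collapse identical sequences so each distinct one is processed once."""
    return Counter(seqs)


def split_into_readlets(read, length, stride):
    """Cut a read into overlapping fixed-length substrings (readlets)."""
    if len(read) <= length:
        return [read]
    starts = list(range(0, len(read) - length + 1, stride))
    # Make sure the final readlet reaches the end of the read.
    if starts[-1] != len(read) - length:
        starts.append(len(read) - length)
    return [read[s:s + length] for s in starts]


# Made-up reads; the first two are exact duplicates.
reads = ["ACGTACGTAC", "ACGTACGTAC", "TTTTACGT"]

read_counts = aggregate_duplicates(reads)

# Each distinct readlet carries the number of times it occurs across all reads,
# so it only needs to be aligned once.
readlet_counts = Counter()
for read, n in read_counts.items():
    for readlet in split_into_readlets(read, length=4, stride=2):
        readlet_counts[readlet] += n

print(read_counts)
print(readlet_counts)
```

In this toy input, 11 readlet occurrences collapse to 4 distinct readlets, which is the kind of redundancy that aggregation across many samples is meant to exploit.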

  15. [Figure: log coverage for Sample 1, Sample 2, and Sample 3]

  16. Better-than-linear scaling
    Marginal cost of analyzing 1 additional sample decreases as we add more samples
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT,
    Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  17. Intropolis
    Website: http://intropolis.rail.bio
    Paper: http://bit.ly/intropolis
    Abhi Nellore, OHSU
    Jeff Leek, JHU

  18. RNA-seq & annotation
    [Diagram boxes:]
    • Reads
    • Spliced alignment
    • Count overlaps w/ annotated features
    • Differential gene / exon expression

  19. RNA-seq & annotation
    [Diagram boxes:]
    • Reads
    • Spliced alignment
    • Isoform assembly
    • Isoform quantification
    • Count overlaps w/ annotated features
    • Differential gene / exon expression

  20. RNA-seq & annotation
    [Diagram boxes:]
    • Reads
    • Spliced alignment
    • (quasi-, pseudo-)
    • Isoform assembly
    • Isoform quantification
    • Count overlaps w/ annotated features
    • Differential gene / exon expression

  21. Good proxy?
    [Diagram boxes:]
    • Reads
    • Isoform quantification
    • Transcripts in sample
    • Transcripts in annotation

  22. Compiling Intropolis
    • Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
    • Repeatable: http://github.com/nellore/runs (exact commands we used to run on AWS)
    • ~$0.72 / sample (compare to sequencing costs)
    [Figure labels: jxs, samples]
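
As a back-of-the-envelope check on the figures above, the short Python sketch below multiplies them out. The inputs are the approximate values from the slide; the derived totals are only as precise as those inputs, and the variable names are my own.

```python
# Approximate figures quoted on the slide.
samples = 21_500             # human RNA-seq samples analyzed
cost_per_sample_usd = 0.72   # approximate compute cost per sample
total_bases_tbp = 62         # approximate total sequence analyzed, in Tbp

# Derived quantities (not stated on the slide).
total_cost_usd = samples * cost_per_sample_usd
cost_per_tbp_usd = total_cost_usd / total_bases_tbp

print(f"Total compute cost: ~${total_cost_usd:,.0f}")
print(f"Cost per Tbp:       ~${cost_per_tbp_usd:,.0f}")
```

Under these numbers the whole compilation comes to roughly $15,500 of compute, or about $250 per Tbp analyzed.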

  23. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which jx is called]
    Junction categories: novel, alternative donor/acceptor, exon skip, fully annotated
    Plot annotations: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%
    Annotation includes: UCSC, GENCODE v19 & v24, RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
    Nellore A, et al. Human splicing diversity and the extent of unannotated splice
    junctions across human RNA-seq samples on the Sequence Read Archive. Genome
    Biol. 2016 Dec 30;17(1):266.
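
The curve in this figure plots, for each threshold S, how many junctions are called in at least S samples. A minimal Python sketch of that computation follows; the junction IDs, per-junction sample counts, and annotated set are invented for illustration and do not come from Intropolis.

```python
def junction_count_curve(samples_per_jx, thresholds):
    """For each threshold S, count junctions called in at least S samples."""
    return {
        s: sum(1 for n in samples_per_jx.values() if n >= s)
        for s in thresholds
    }


def annotated_fraction(samples_per_jx, annotated, s):
    """Fraction of junctions called in >= s samples that are annotated."""
    kept = [jx for jx, n in samples_per_jx.items() if n >= s]
    if not kept:
        return None
    return sum(1 for jx in kept if jx in annotated) / len(kept)


# Hypothetical data: junction ID -> number of samples in which it is called.
samples_per_jx = {"jxA": 1200, "jxB": 950, "jxC": 40, "jxD": 3, "jxE": 1}
annotated = {"jxA", "jxB", "jxC"}

curve = junction_count_curve(samples_per_jx, thresholds=[1, 10, 1000])
print(curve)
print(annotated_fraction(samples_per_jx, annotated, 1))
print(annotated_fraction(samples_per_jx, annotated, 10))
```

Raising S shrinks the junction set and, in this toy data, raises the annotated fraction, which mirrors the qualitative trade-off the figure explores.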

  24. [Figure: % of called junctions that are annotated (GENCODE v19), across samples]
    For ~2.5% of samples, <50% of junction calls are annotated
    Median fraction of junction calls that are annotated: ~80%

  25. A third way
    Spliced alignment
    Rail-RNA: accurate, annotation-agnostic
    Differentially expressed region finder
    derfinder: region-based, annotation-agnostic
    Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B,
    Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for
    RNA-seq with derfinder. Nucleic Acids Res. 2017 Jan 25;45(2):e9.

  26. A third way
    Spliced alignment
    Rail-RNA: accurate, annotation-agnostic
    bigWigs
    Differentially expressed region finder
    derfinder: region-based, annotation-agnostic

  27. Boiler: RNA-seq alignment compression
    • As big as bigWigs & 1-2 orders of magnitude smaller than sorted BAMs
    • Usable with Cufflinks, StringTie
    [Figure: coverage, length tallies, co-occurrence patterns]
    Jacob Pritt
    Pritt J, Langmead B. Boiler: lossy compression of RNA-seq alignments
    using coverage vectors. Nucleic Acids Res. 2016 Sep 19;44(16):e133.
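
To make "coverage" and "length tallies" concrete, here is a small Python sketch that reduces a set of alignments to a coverage vector plus a tally of alignment lengths. It is only a toy illustration of the general idea of summarizing alignments this way, not Boiler's actual data format or algorithm; the intervals and function names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def coverage_vector(alignments, region_length):
    """Per-base coverage from half-open (start, end) alignment intervals."""
    diff = [0] * (region_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    cov, depth = [], 0
    for d in diff[:region_length]:
        depth += d
        cov.append(depth)
    return cov


def run_length_encode(values):
    """Compress a vector into (value, run length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]


# Made-up alignments over a 10-base region.
alignments = [(0, 4), (2, 6), (2, 6), (7, 9)]

cov = coverage_vector(alignments, region_length=10)
rle = run_length_encode(cov)
length_tally = Counter(end - start for start, end in alignments)

print(cov)
print(rle)
print(length_tally)
```

The summary is lossy in the sense that individual read identities are gone: only the coverage profile and how many alignments of each length existed are kept.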

  28. Snaptron & recount2
    • Where to start for our summaries of public human RNA-seq; both include:
      • ~50K SRA samples
      • ~10K GTEx samples
      • ~10K TCGA samples
    • Snaptron: pose sophisticated queries re: splicing, quick responses, no downloading

  29. Snaptron
    Query planner delegates to appropriate systems (sqlite, tabix, lucene) and indexes (R-tree, B-tree, inverted full text)
    Chris Wilks
    [Figure: Snaptron query planner architecture, with columns Query, Data Store/Index, and Output]
    Query inputs shown: region, sample filter
    Indexes shown: Tabix/R-tree index, SQLite/B-tree index, Lucene/inverted document index
    Records shown: junction records and sample metadata records, including region-limited and filtered variants
    Example inverted-index entries (terms -> samples): "Brain" -> 1,2,3,6; "Liver" -> 4,6,9,11
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing
    splicing across tens of thousands of RNA-seq samples. bioRxiv. 2017:097881.
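
The inverted-index entries in the figure ("Brain" -> 1,2,3,6; "Liver" -> 4,6,9,11) suggest how a metadata term can be turned into a sample filter. The Python sketch below illustrates that step with plain dictionaries and sets. It is not Snaptron's code (which the slide says relies on Lucene, Tabix, and SQLite); the junction records and function names are hypothetical.

```python
# Term -> sample IDs, using the two entries shown in the figure.
inverted_index = {
    "brain": {1, 2, 3, 6},
    "liver": {4, 6, 9, 11},
}

# Hypothetical junction records: junction ID -> {sample ID: read count}.
junction_records = {
    "jx1": {1: 10, 4: 2},
    "jx2": {6: 5, 9: 1},
    "jx3": {11: 7},
}


def samples_matching(index, term):
    """Look up the samples whose metadata mention `term` (case-insensitive)."""
    return index.get(term.lower(), set())


def filter_junctions_by_samples(records, sample_ids):
    """Keep only junctions, and per-sample counts, found in `sample_ids`."""
    out = {}
    for jx, per_sample in records.items():
        hits = {s: c for s, c in per_sample.items() if s in sample_ids}
        if hits:
            out[jx] = hits
    return out


brain = samples_matching(inverted_index, "Brain")
brain_junctions = filter_junctions_by_samples(junction_records, brain)
both = samples_matching(inverted_index, "Brain") & samples_matching(inverted_index, "Liver")

print(brain_junctions)
print(both)
```

Resolving the text query to a set of sample IDs first, then intersecting with junction records, is one simple way to combine a full-text index with a separate record store.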

  30. Snaptron
    Example queries:
    • Junctions in ALK gene
    • # times junction occurs in each of 50,000 SRA samples
    • Tissue specificity of junction in GTEx data
    • Samples ranked according to how overrepresented one splicing pattern is relative to another
    http://snaptron.cs.jhu.edu
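
A query such as "junctions in ALK gene" might be expressed as a web request along the following lines. This Python sketch only builds a URL and makes no network call. Apart from the base URL, which is on the slide, everything here is an assumption to be checked against the Snaptron documentation: the `srav2` compilation name, the `/snaptron` path, the `regions` and `rfilter` parameter names, and the `samples_count>:100` filter syntax.

```python
from urllib.parse import urlencode

BASE_URL = "http://snaptron.cs.jhu.edu"  # from the slide


def build_query_url(compilation, region, range_filters=()):
    """Build a query URL.

    ASSUMPTION: the path layout and the parameter names ('regions', 'rfilter')
    are not taken from the slides; verify them against the Snaptron docs.
    """
    params = [("regions", region)]
    params += [("rfilter", f) for f in range_filters]
    return f"{BASE_URL}/{compilation}/snaptron?{urlencode(params)}"


# Hypothetical query: junctions in ALK found in more than 100 samples.
url = build_query_url("srav2", "ALK", range_filters=["samples_count>:100"])
print(url)
```

Keeping URL construction in one small function makes it easy to correct the parameter names in a single place once they are confirmed.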

  31. Thank you:
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Alyssa Frazee, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub
    Also for DERfinder: Rafa Irizarry, Sarven Sabunciyan, Mike Love
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    SciServer, SciServer Compute
    langmead-lab.org, @BenLangmead