Marshaling public data for lean and powerful splicing studies

The Sequence Read Archive now contains over a million accessions, including over 500K RNA-seq run accessions for mouse and over 300K for human. Large-scale projects like GTEx, ICGC and TOPMed are major contributors, and huge projects on the horizon, such as the All of Us and Million Veteran programs, will add more fuel. Such archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.

I will describe our work on making large public RNA sequencing datasets easy to use. Our design is multi-layered, with one layer for scalable, uniform and secure analysis (Rail-RNA), another for forming easy-to-use summaries (recount2), and a third for indexing the summaries and making them queryable (Snaptron). The overall result is a system where scientists can pose questions that are scientific in nature, rather than simply about data retrieval. Finally, I will describe collaborations where these tools were applied to (a) evaluate hypotheses about prevalence or specificity of splicing patterns, (b) characterize completeness of the gene annotations we use to understand splicing patterns, and (c) reveal patterns in public data that ultimately changed the study design and allowed more targeted hypotheses to be tested with less new data generation. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Ben Langmead

March 07, 2019

Transcript

  1. Ben Langmead
    Assistant Professor, JHU Computer Science
    langmead-lab.org, @BenLangmead
    Vanderbilt Genetics Institute
    March 7, 2019
    Marshaling public data for lean
    and powerful splicing studies

  2. Lab goals
    To make high-throughput life science data as usable
    as possible for scientific labs, especially small ones
    Efficient
    Software: Bowtie 1&2, Dashing, Arioc
    Topics: applied algorithms, text indexing, sketching, thread scaling
    Scalable
    Software: Rail-RNA, recount2, Snaptron, Boiler
    Topics: parallel and high-performance computing, cloud computing, indexing
    Interpretable
    Software: Qtip, FORGe, r-index, ref. relaxation
    Topics: modeling mapping quality, graph-genome variants, addressing biases

  3. Sequence Read Archive
    Langmead B, Nellore A. Cloud computing for genomic data
    analysis and collaboration. Nat Rev Genet. 2018 May;19(5):325.
    Currently ~ 26 petabases

  4. "An index is a great leveler" (GB Shaw)
    "Summaries are good too" (not GB Shaw)

  5. Public summaries of sequencing data
    Langmead B, Nellore A. Cloud computing for genomic data
    analysis and collaboration. Nat Rev Genet. 2018 May;19(5):325.

  6. Search engine for RNA-seq
    Snaptron: index & query engine w/ REST API
    snaptron.cs.jhu.edu
    doi:10.1093/bioinformatics/btx547
    recount2: clean summaries of data and metadata, packaged as R objects
    jhubiostatistics.shinyapps.io/recount/
    doi:10.1038/nbt.3838
    Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets
    rail.bio
    doi:10.1093/bioinformatics/btw575

  7. Themes
    • Cloud computing is a natural fit for public data
    • Think outside the gene annotation
    • Much of the work is in the "last mile"

  8. Abhinav Nellore, OHSU
    Jeff Leek, JHU
    http://rail.bio
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J,
    Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA:
    scalable analysis of RNA-seq splicing and coverage.
    Bioinformatics. 2016 Sep 4.
    Image by Rgocs

  9. Spliced RNA-seq aligner for analyzing many samples at once
    • Group across samples to borrow strength and eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions & coverage vectors; no alignments, unless asked for
    • Runs easily on commercial AWS cloud
    http://rail.bio
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J,
    Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA:
    scalable analysis of RNA-seq splicing and coverage.
    Bioinformatics. 2016 Sep 4.
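
The idea of grouping across samples to avoid redundant alignment work can be illustrated with a small sketch. This is not Rail-RNA's code: the function names, the toy reads and the placeholder `toy_align` are all assumptions made for illustration. It only shows the general pattern of collapsing identical read sequences across samples so that the expensive step runs once per distinct sequence.

```python
from collections import defaultdict


def group_reads(samples):
    """Map each distinct read sequence to its per-sample occurrence counts."""
    groups = defaultdict(lambda: defaultdict(int))
    for sample_id, reads in samples.items():
        for seq in reads:
            groups[seq][sample_id] += 1
    return {seq: dict(counts) for seq, counts in groups.items()}


def align_once(groups, align):
    """Call `align` once per distinct sequence, then fan results back out."""
    per_sample = defaultdict(list)
    for seq, counts in groups.items():
        hit = align(seq)  # the expensive step runs once per unique sequence
        for sample_id, n in counts.items():
            per_sample[sample_id].append((seq, hit, n))
    return dict(per_sample)


# Toy input (hypothetical): five reads across two samples, three distinct sequences.
samples = {"s1": ["ACGT", "ACGT", "TTGA"], "s2": ["ACGT", "GGCC"]}
groups = group_reads(samples)

calls = []


def toy_align(seq):
    """Stand-in for a real aligner; records each call and returns a dummy result."""
    calls.append(seq)
    return len(seq)


result = align_once(groups, toy_align)
print(len(calls), "alignment calls for", sum(len(r) for r in samples.values()), "reads")
```

In this toy input the aligner is invoked three times for five reads; the saving grows with the number of samples that share identical reads.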

  10. First foray: Intropolis
    • Analyzed ~21,500 human RNA-seq samples with Rail-RNA; about 62 Tbp
    [Diagram: matrix of counts, exon-exon junctions (10s of millions) by samples (21.5K)]
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

  11. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which jx is called; categories: Novel, Alternative donor/acceptor, Exon skip, Fully annotated; callouts on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]
    Annotations: UCSC, GENCODE v19 & v24, RefSeq,
    CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
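
The curve on this slide plots, for each threshold S, how many junctions are called in at least S samples, split by annotation status. A minimal sketch of that computation follows. The junction ids, sample counts and the annotated set are made-up toy values, and the function names are hypothetical; none of this is taken from the Intropolis code.

```python
def junctions_in_at_least(sample_counts, thresholds):
    """For each threshold S, count junctions called in at least S samples."""
    return {
        s: sum(1 for n in sample_counts.values() if n >= s)
        for s in thresholds
    }


def novel_fraction(sample_counts, annotated, s):
    """Fraction of junctions called in at least `s` samples that are unannotated."""
    kept = [jx for jx, n in sample_counts.items() if n >= s]
    if not kept:
        return None  # no junction survives the threshold
    return sum(jx not in annotated for jx in kept) / len(kept)


# Toy data (hypothetical): junction id -> number of samples in which it is called.
sample_counts = {"jx1": 1, "jx2": 3, "jx3": 3, "jx4": 10}
annotated = {"jx2", "jx4"}

curve = junctions_in_at_least(sample_counts, [1, 2, 5, 20])
print(curve)
print(novel_fraction(sample_counts, annotated, 1))
print(novel_fraction(sample_counts, annotated, 5))
```

Raising S discards junctions seen in few samples, which is why the unannotated share can change as the threshold grows.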

  12. recount2
    • >50K human RNA-seq samples from SRA (open)
    • >10K human RNA-seq samples from The Cancer Genome Atlas (dbGaP)
    • >10K human RNA-seq samples from the Genotype-Tissue Expression (GTEx) project (dbGaP)
    • Total: ~4.4 trillion reads, 100s of terabases
    Image: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/
    Image: doi:10.1038/ng.2653
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE,
    Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature
    Biotechnology. 2017 Apr 11;35(4):319-321.

  13. recount2
    bit.ly/recount2 (jhubiostatistics.shinyapps.io/recount/)
    [Screenshot annotations: "Enter search"; "Study list is instantly filtered"; "Links to data objects"]
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE,
    Langmead B, Leek JT. Reproducible RNA-seq analysis using recount2. Nature
    Biotechnology. 2017 Apr 11;35(4):319-321.

  14. Search engine for RNA-seq
    Snaptron

  15. Snaptron
    Query planner breaks down queries, delegates
    to appropriate systems (sqlite, tabix, Lucene)
    and indexes (R-tree, B-tree, inverted index)
    Chris Wilks
    [Diagram, columns Query / Data Store/Index / Output: a query goes to the Snaptron query planner, which uses a Tabix/R-tree index, a Lucene/inverted document index and a SQLite/B-tree index, with a sample filter; outputs are junction records (region-limited, filtered, or both) and sample metadata records. Example inverted-index entries for sample metadata terms: "Brain": samples 1,2,3,6; "Liver": samples 4,6,9,11]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
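
To make the query-planner idea concrete, here is a toy sketch that splits one query into a region lookup, a metadata-term lookup and a sample filter. It is an illustration under stated assumptions, not Snaptron's implementation: the junction records are invented, a linear scan stands in for the Tabix/R-tree index, a plain dict stands in for the Lucene inverted index, and all function names are hypothetical. Only the two term entries ("Brain", "Liver") are copied from the slide.

```python
# Inverted index: metadata term -> sample ids (entries copied from the slide).
TERM_INDEX = {"brain": {1, 2, 3, 6}, "liver": {4, 6, 9, 11}}

# Toy junction records (hypothetical): (chrom, start, end, {sample_id: read_count}).
JUNCTIONS = [
    ("chr1", 100, 200, {1: 5, 4: 2}),
    ("chr1", 150, 400, {6: 7}),
    ("chr1", 900, 950, {2: 1, 9: 3}),
    ("chr2", 100, 200, {3: 4}),
]


def samples_for(term):
    """Metadata lookup: which samples carry this term?"""
    return TERM_INDEX.get(term.lower(), set())


def junctions_in(chrom, start, end):
    """Region lookup; a linear scan standing in for an interval index."""
    return [j for j in JUNCTIONS if j[0] == chrom and j[1] <= end and j[2] >= start]


def query(chrom, start, end, term=None, min_samples=1):
    """Plan: region lookup first, then restrict coverage to matching samples."""
    keep = samples_for(term) if term is not None else None
    out = []
    for c, s, e, cov in junctions_in(chrom, start, end):
        if keep is not None:
            cov = {sid: n for sid, n in cov.items() if sid in keep}
        if len(cov) >= min_samples:
            out.append((c, s, e, cov))
    return out


brain_hits = query("chr1", 1, 500, term="Brain")
liver_hits = query("chr1", 1, 500, term="Liver")
print(brain_hits)
print(liver_hits)
```

Each sub-question goes to the structure best suited to it, and the planner only has to intersect the partial results.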

  16. Snaptron
    Provides command-line tool and REST API for
    querying junctions (& more summaries coming soon)
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
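
A REST query is ultimately just a URL, so a small helper can show the shape of a request. Everything below except the host name is an assumption to be checked against the Snaptron documentation before use: the `/<compilation>/snaptron` path layout, the compilation name `srav2`, the parameter names `regions` and `rfilter`, and the filter syntax `samples_count:100`. The helper `build_query_url` is hypothetical and only builds the string; it makes no network request.

```python
from urllib.parse import urlencode

BASE_URL = "http://snaptron.cs.jhu.edu"  # host as given on the slides


def build_query_url(compilation, region, row_filter=None):
    """Build a junction-query URL (path and parameter names are assumptions)."""
    params = {"regions": region}
    if row_filter is not None:
        params["rfilter"] = row_filter
    # Keep ':' unescaped so coordinates stay readable in the URL.
    return f"{BASE_URL}/{compilation}/snaptron?{urlencode(params, safe=':')}"


url = build_query_url("srav2", "chr6:1-514015", "samples_count:100")
print(url)
```

The resulting string could then be passed to any HTTP client, once the endpoint and parameters have been confirmed.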

  17. Snaptron
    Example queries:
    • For each junction in a gene, what is its read support in each of 50K SRA samples?
    • What is a junction's tissue specificity in GTEx?
    • In which samples is splicing pattern A overrepresented relative to pattern B?
    http://snaptron.cs.jhu.edu
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
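
The third example query, ranking samples by how strongly pattern A is favored over pattern B, can be sketched with a simple per-sample score. The score used here, (A - B) / (A + B + 1), is an illustrative choice and is not claimed to be the formula Snaptron uses; the +1 is an assumed pseudocount that avoids division by zero. The counts and function names are hypothetical.

```python
def relative_usage(a_counts, b_counts):
    """Score each sample in [-1, 1); positive means pattern A dominates."""
    scores = {}
    for sid in set(a_counts) | set(b_counts):
        a = a_counts.get(sid, 0)
        b = b_counts.get(sid, 0)
        scores[sid] = (a - b) / (a + b + 1)  # +1 is an assumed pseudocount
    return scores


def rank_samples(scores):
    """Sample ids ordered from most A-favoring to most B-favoring."""
    return sorted(scores, key=lambda sid: scores[sid], reverse=True)


# Toy read counts (hypothetical) supporting each pattern, per sample.
a_counts = {"s1": 9, "s2": 1}
b_counts = {"s1": 0, "s2": 9, "s3": 4}

scores = relative_usage(a_counts, b_counts)
ranked = rank_samples(scores)
print(ranked)
```

Samples at the top of the ranking are candidates for follow-up, and their metadata can then be inspected for shared tissue or condition terms.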

  18. Case study
    • Goldstein et al. searched for novel cassette exons in Illumina BodyMap 2.0 RNA-seq
    • Identified 249 within known genes, not overlapping a RefSeq-annotated exon
    • Validated 216 out of 249 in an independent sample via RNA-seq

  19. Case study
    [Figure panels: A. ABCD3, B. KMT2E, C. ALKATI, each with numbered features]
    • Of the 249 novel exons, 236 (94.8%) occurred in GTEx (one shown above)
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
    [Figure: shared sample count (SSC), axis 0 to 20000, by data compilation (GTEx, SRAv2) and validation outcome (Failed, Passed)]
    • Shared sample count predicts how likely novel exons were to validate (right)
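
A shared sample count can be computed with a set intersection. The sketch below assumes one plausible definition, namely the number of samples in which both junctions flanking a candidate exon have read support; the slides do not spell out the definition, so treat it, the `min_reads` parameter and the toy counts as assumptions.

```python
def shared_sample_count(upstream, downstream, min_reads=1):
    """Count samples where both flanking junctions have >= min_reads reads."""
    up = {sid for sid, n in upstream.items() if n >= min_reads}
    down = {sid for sid, n in downstream.items() if n >= min_reads}
    return len(up & down)


# Toy per-sample read counts (hypothetical) for the two junctions around one exon.
upstream = {"a": 3, "b": 1, "c": 0}
downstream = {"b": 2, "c": 5, "d": 1}

ssc = shared_sample_count(upstream, downstream)
ssc_strict = shared_sample_count(upstream, downstream, min_reads=2)
print(ssc, ssc_strict)
```

Under this definition, an exon whose two junctions rarely co-occur in the same sample gets a low count even if each junction is individually common.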

  20. Case study
    Snaptron for hypothesis generation
    [Diagram labels: RNA-seq dataset, Discovery, Validation, Independent dataset, Snaptron]
    • Snaptron for validation: what discoveries have support in public data?
    • Snaptron for discovery: what exists? what's prevalent? what's specific?
    • Snaptron for prioritization of potential discoveries: what discoveries are best supported?

  21. Rod photoreceptors
    Jonathan Ling, Seth Blackshaw
    • Detect light & transduce signal to brain
    • Degeneration is main cause of hereditary blindness; treatments are few
    • Can we find rod-specific patterns and splicing factors, with the aim of creating a rod-like model from a human cell line?

  22. Rod photoreceptors
    1. Rods and retinal cells have characteristic exon-usage patterns
       Data: purified tissue (FACS/affinity)
    2. Certain exons are utilized only in rods
       Data: purified tissue
    3. Certain splicing factors work specifically in rods
       Data: GTEx, purified tissue, ENCODE
    4. Up-regulating those factors induces rod-like splicing in a human cell line
       Data: new data, HepG2 cell line

  23. Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman
    A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators
    of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.
    Rods have characteristic
    patterns of exon usage
    Rod photoreceptors

  24. Rod photoreceptors
    Exon usage is a useful cell-type signature;
    sometimes invisible at gene level
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman
    A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators
    of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.
    [Figure labels: Cochlear Hair Cells, Pyramidal Neurons]

  25. Certain exons are
    used only in rods
    Rod photoreceptors
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman
    A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators
    of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.

  26. Certain splicing factors are
    specific to rods -- could they
    drive rod-specific splicing?
    Rod photoreceptors
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman
    A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators
    of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.

  27. Rod photoreceptors
    Up-regulating those splicing factors yields rod-like splicing
    in HepG2 cells
    [Figure labels: "Unannotated", shown six times]
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman
    A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators
    of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.

  28. ASCOT
    • Visually explore alternative splicing events in the same datasets we used
    http://ascot.cs.jhu.edu
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman
    A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators
    of photoreceptor-specific splicing. bioRxiv doi:10.1101/501882.

  29. Future: public data
    Rod photoreceptor study involved >90K public datasets
    Used only public data up to the final HepG2 experiment
    Desire: querying public data as an everyday activity in bio research
    • "Leveler" in a field of haves & have nots
    [Article clipping: WORLD VIEW, A personal take on events. "Don't let useful data go to waste. Researchers must seek out others' deposited biological sequences in community databases, urges Franziska Denk." Credit shown in clipping: MEGHNA ABRAHAM]
    Excerpts visible in the clipping:
    "One of the best ways for a neuroscientist like me to keep up to date with what colleagues are working on is to attend conferences. But on recent trips I have noticed a problem. Too few researchers are consulting and using publicly available data, my own included. What is going on?"
    "Massive amounts of biological information are being accumu- [...] discrepancy, and propose a biologically valid reason for it."
    "Why are so many bench biologists overlooking this wealth of cell-type-specific expression data?"
    "My hunch is there are two reasons. First, researchers underestimate how many of these data have been published over the past few years because they are being generated across so many different fields."

  30. Future: data science
    [Spectrum: from one dataset to all of SRA]
    Public data quickly confronts us with technical confounders & missing/incorrect metadata
    What questions can we answer robustly? At what points on the spectrum?
    Is metadata fixable?
    Ellis SE, Collado-Torres L, Jaffe A, Leek JT. Improving the value of public RNA-seq
    expression data by phenotype prediction. Nucleic Acids Res. 2018 May 18;46(9):e54.

  31. Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado-Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw, Rone Charles
    • NIH R01GM118568 (Langmead)
    • NSF CAREER IIS-1349906 (Langmead)
    • NIH R01GM105705 (Leek)
    • NIH R01GM121459 (Hansen)
    • NIH Cloud Credits Model Pilot, CCREQ-2017-03-00086 (Langmead)
    • NSF XSEDE projects (TG-CIE170020, TG-DEB180021)
    IDIES Seed funding
    SciServer, SciServer Compute
    langmead-lab.org, @BenLangmead
