Using huge public sequencing datasets to answer scientific questions

The Sequence Read Archive now contains over a million accessions, including over 200K RNA-seq runs for mouse and over 160K for human. Large-scale projects like GTEx, ICGC and TOPMed are major contributors, and huge projects on the horizon, such as the All of Us and Million Veterans programs, will further accelerate this growth. These archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.

I will describe our progress toward the goal of making it easy for researchers to ask scientific questions about public RNA-seq datasets. I will highlight the Rail/recount2 system for ingesting and summarizing public data and the Snaptron service that makes it queryable. Finally, I will discuss scientific collaborations with neuroscientists and cancer researchers where we applied these tools to perform both targeted queries and large-scale screens. I will highlight ways in which we are learning to make our tools better suited to how scientists work. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Luigi Marchionni, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Ben Langmead

May 25, 2018

Transcript

  1. Ben Langmead
    Assistant Professor, JHU Computer Science
    [email protected], langmead-lab.org, @BenLangmead
    BME seminar, Oregon Health & Science University
    May 25, 2018
    Using huge public sequencing datasets to answer scientific questions

  4. Lab goals
    To make high-throughput life science data as usable as possible
    for scientific labs, especially small ones
    Efficient
    Software: Bowtie 1&2, Arioc, Flash-dans
    Topics: applied algorithms, text indexing, sketching, thread scaling
    Scalable
    Software: Myrna, Rail-RNA, recount2, Snaptron
    Topics: parallel and high-performance computing, cloud computing, indexing
    Interpretable
    Software: Qtip, FORGe
    Topics: modeling mapping quality, modeling graph-genome variants, addressing biases

  5. Sequence Read Archive (SRA) growth
    [Figure: terabases in the SRA over time, open access vs. total, with the 1 Pbp and 10 Pbp levels marked. Annotations: 4 -> 8 Pbp in ~12 months; 8 -> 16 Pbp in ~18 months.]

  6. Search engine for RNA-seq
    Snaptron: index & query engine w/ REST API
    snaptron.cs.jhu.edu
    doi:10.1093/bioinformatics/btx547
    recount2: clean summaries of data, metadata, packaged as R objects
    jhubiostatistics.shinyapps.io/recount/
    doi:10.1038/nbt.3838
    Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets
    rail.bio
    doi:10.1093/bioinformatics/btw575

  7. Themes
    • Cloud computing is a natural fit for public data
    • Scalable software benefits from big resources &
    many samples
    • Strategically ignoring gene annotations can yield
    clearer results
    • Queryability is in the eye of the beholder

  8. Search engine for RNA-seq
    Snaptron: index & query engine w/ REST API
    snaptron.cs.jhu.edu
    doi:10.1093/bioinformatics/btx547
    recount2: clean summaries of data, metadata, packaged as R objects
    jhubiostatistics.shinyapps.io/recount/
    doi:10.1038/nbt.3838
    Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets
    rail.bio
    doi:10.1093/bioinformatics/btw575

  9. Abhinav Nellore, OHSU
    Jeff Leek, JHU
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
    Image by Rgocs

  10. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  11. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  12. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence,
    coverage vectors; no alignments, unless asked for
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  13. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence,
    coverage vectors; no alignments, unless asked for
    • Runs easily on commercial AWS cloud, other clusters
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
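
The bullet "Let data prune false junction calls, not annotation" can be illustrated with a small sketch. It keeps a junction if enough samples support it, or if any single sample supports it strongly. The function name, the data layout and both thresholds are hypothetical choices for illustration and are not taken from the slides or claimed to be Rail-RNA's actual filter.

```python
def prune_junctions(junction_reads, total_samples,
                    min_sample_fraction=0.05, min_reads_in_one_sample=5):
    """Keep junctions supported by the data itself, without consulting annotation.

    junction_reads maps a junction key, e.g. (chrom, start, end), to a dict of
    {sample_id: supporting_read_count}. Both thresholds are hypothetical.
    """
    kept = {}
    for junction, reads_by_sample in junction_reads.items():
        # Only samples with at least one supporting read count as evidence.
        supported = {s: r for s, r in reads_by_sample.items() if r > 0}
        if not supported:
            continue
        fraction = len(supported) / total_samples
        strongest = max(supported.values())
        if fraction >= min_sample_fraction or strongest >= min_reads_in_one_sample:
            kept[junction] = supported
    return kept


# Made-up example: 100 samples, three candidate junctions.
jx_a = ("chr1", 1000, 2000)   # 1 read in each of 10 samples: widespread
jx_b = ("chr1", 5000, 7000)   # 8 reads in a single sample: strong locally
jx_c = ("chr2", 300, 900)     # 1 read in each of 2 samples: weak
candidates = {
    jx_a: {f"s{i}": 1 for i in range(10)},
    jx_b: {"s3": 8},
    jx_c: {"s1": 1, "s2": 1},
}
kept = prune_junctions(candidates, total_samples=100)
print(sorted(kept))
```

Pooling evidence across samples is what lets a junction with little support in any one sample survive, which is the "borrow strength" idea from the first bullet.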

  14. dbGaP
    http://docs.rail.bio/dbgap/
    Nellore A, Wilks C, Hansen KD, Leek JT, Langmead B. Rail-dbGaP:
    analyzing dbGaP-protected data in the cloud with Amazon Elastic
    MapReduce. Bioinformatics. 2016 Aug 15;32(16):2551-3.

  15. Working toward recount2
    • Analyzed ~21,500 human RNA-seq samples
    with Rail-RNA; about 62 Tbp
    • Repeatable: http://github.com/nellore/runs
    (Exact commands we used to run on AWS)
    • ~ $0.72 / sample
    (Compare to sequencing costs)
    [Figure: matrix of junctions (jxs) by samples]
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
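
As a rough back-of-the-envelope check on the figures above, the approximate per-sample cost and total volume imply a total compute bill and an average sample size. The inputs are the rounded values from the slide, so the outputs are only approximate; the variable and function names are my own.

```python
def total_cost_usd(samples, cost_per_sample_usd):
    """Approximate total cost of the run, given a per-sample cost."""
    return samples * cost_per_sample_usd


def gigabases_per_sample(total_terabases, samples):
    """Average sequence volume per sample, in gigabases (1 Tbp = 1000 Gbp)."""
    return total_terabases * 1000 / samples


samples = 21_500          # "~21,500 human RNA-seq samples"
cost_per_sample = 0.72    # "~ $0.72 / sample"
total_tbp = 62            # "about 62 Tbp"

cost = total_cost_usd(samples, cost_per_sample)
gbp = gigabases_per_sample(total_tbp, samples)
print(f"total cost: about ${cost:,.0f}")
print(f"average sample size: about {gbp:.2f} Gbp")
```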

  16. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which a jx is called, broken down into novel, alternative donor/acceptor, exon skip, and fully annotated junctions. Values labeled on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%.]
    Annotation includes: UCSC, GENCODE v19 & v24,
    RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

  17. recount2
    • >50K human RNA-seq samples from SRA (open)
    • >10K human RNA-seq samples spanning cancer
    types in The Cancer Genome Atlas (dbGaP)
    • >10K human RNA-seq samples from
    the Genotype-Tissue Expression (GTEx)
    project (dbGaP)
    • In total, ~4.4 trillion reads, 100s of terabases
    Image: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/
    Image: doi:10.1038/ng.2653
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  18. recount2
    Summarized at levels of genes, exons, junctions,
    and coverage vectors
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  19. recount2
    https://jhubiostatistics.shinyapps.io/recount/
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  20. recount2
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.
    http://bit.ly/recount_sciserver

  21. Search engine for RNA-seq
    Snaptron

  22. Snaptron
    Query planner delegates query components to
    appropriate systems (sqlite, tabix, lucene) and
    indexes (R-tree, B-tree, Lucene inverted text index)
    Chris Wilks
    [Figure: Snaptron query planner. Columns: Query, Data Store/Index, Output. Query components (region, filter, sample filter) are routed to a Tabix/R-tree index, a SQLite/B-tree index, and a Lucene inverted document index; outputs are filtered junction records and filtered sample metadata records. Example sample-metadata index entries: "Brain" -> samples 1,2,3,6; "Liver" -> samples 4,6,9,11.]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
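
The inverted text index used for sample metadata can be sketched in a few lines. This toy version maps each metadata term to the set of sample IDs that mention it, using the "Brain" and "Liver" entries from the figure as data. It is a plain-Python illustration of the data structure, not how Lucene or Snaptron implement it, and the function names are my own.

```python
from collections import defaultdict


def build_inverted_index(sample_terms):
    """Map each lower-cased metadata term to the set of sample IDs containing it."""
    index = defaultdict(set)
    for sample_id, terms in sample_terms.items():
        for term in terms:
            index[term.lower()].add(sample_id)
    return index


def samples_matching_all(index, terms):
    """Return the sample IDs whose metadata contains every query term."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*hits) if hits else set()


# Sample metadata consistent with the figure's example entries.
sample_terms = {
    1: ["Brain"], 2: ["Brain"], 3: ["Brain"],
    4: ["Liver"], 9: ["Liver"], 11: ["Liver"],
    6: ["Brain", "Liver"],
}
index = build_inverted_index(sample_terms)
print(sorted(index["brain"]))
print(sorted(samples_matching_all(index, ["Brain", "Liver"])))
```

A sample filter resolved this way yields a set of sample IDs, which can then be used to restrict the junction records returned by the region and filter components.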

  23. Snaptron
    Examples:
    • For each junction in gene ABCD3, how many reads
    supported it in each of the 50K SRA samples?
    • What is a particular junction's tissue specificity in
    the GTEx dataset?
    • In which samples is splicing pattern A
    overrepresented relative to splicing pattern B?
    • (A/B might relate to alt splicing, fusions, etc.)
    http://snaptron.cs.jhu.edu
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
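
The third example question, finding samples where pattern A is overrepresented relative to pattern B, can be sketched as a per-sample normalized difference of read counts followed by a ranking. The formula below, (A - B) / (A + B + 1), is an assumption chosen for illustration (the +1 avoids division by zero and damps low-coverage samples); the statistic Snaptron actually reports may be defined differently.

```python
def splicing_ratio(a_reads, b_reads):
    """Normalized difference in [-1, 1): positive when pattern A dominates.

    Assumed form for illustration only; not necessarily Snaptron's definition.
    """
    return (a_reads - b_reads) / (a_reads + b_reads + 1)


def rank_samples(counts):
    """Order sample IDs from most A-dominated to most B-dominated."""
    return sorted(counts, key=lambda s: splicing_ratio(*counts[s]), reverse=True)


# Made-up read counts per sample: (reads supporting A, reads supporting B).
counts = {"s1": (10, 0), "s2": (0, 10), "s3": (5, 5)}
ranking = rank_samples(counts)
print(ranking)
```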

  24. Snaptron vignette 1
    • Goldstein et al searched for novel cassette exons in
    Illumina BodyMap 2.0
    • Identified 249 cassette exons within known genes
    but not overlapping any annotated exon
    • Validated 216 out of 249 in independent sample via
    paired-end RNA-seq (2 x 250 bp)
    Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R. Prediction and
    Quantification of Splice Events from RNA-Seq Data. PLoS One. 2016 May 24;11(5):e0156132.

  25. Snaptron vignette 1
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
    [Figure panels: A. ABCD3; B. KMT2E; C. ALKATI; events marked by numbered arrows]
    • Snaptron immediately recapitulates ABCD3 exon (above)
    • Of the 249 novel exons, 236 (94.8%) occurred in GTEx
    • Used shared sample count (SSC) query to measure #
    samples the novel exons occurred in...
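
A shared sample count can be sketched as a set intersection. The sketch assumes that a cassette exon "occurs" in a sample when both junctions flanking it have at least one supporting read in that sample; that reading of SSC is my assumption, and the data are made up.

```python
def shared_sample_count(upstream_reads, downstream_reads):
    """Count samples in which both flanking junctions have >= 1 supporting read.

    Each argument maps sample_id -> read count for one junction.
    """
    upstream = {s for s, r in upstream_reads.items() if r > 0}
    downstream = {s for s, r in downstream_reads.items() if r > 0}
    return len(upstream & downstream)


# Made-up counts for the two junctions flanking one candidate exon.
upstream_reads = {1: 5, 2: 3, 6: 1, 7: 0}
downstream_reads = {2: 4, 6: 2, 9: 7}
ssc = shared_sample_count(upstream_reads, downstream_reads)
print(ssc)
```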

  26. Snaptron vignette 1
    [Figure: shared sample count (SSC) of the novel exons in each data compilation (GTEx, SRAv2), split by validation outcome (failed vs. passed)]
    • Exons validated by Goldstein et al. had higher SSC than exons that failed validation
    • An exon's SSC (prevalence) is related to how "real" it is
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.

  27. Snaptron vignette 2
    Darby MM, Leek JT, Langmead B, Yolken RH, Sabunciyan S. Widespread splicing of repetitive element
    loci into coding regions of gene transcripts. Hum Mol Genet. 2016 Nov 15;25(22):4962-4982.
    • Darby et al studied prevalence of repeat element (RE)
    expression in the human orbitofrontal cortex
    • Used RNA-seq to find junctions linking annotated exons
    to REs in annotated introns, indicating exonization
    • They supplied us 5 events where RE exon was
    unannotated; Snaptron SSC query confirmed all 5
    occurred at least 35 times in SRAv2 & GTEx
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing across tens
    of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.

  28. Snaptron vignette 2
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
    • Tissue specificity query showed all 5 events were expressed
    in a tissue-specific pattern in GTEx (Kruskal-Wallis P < 0.01)
    [Figure panels: A. ABCD3; B. KMT2E; C. ALKATI; events marked by numbered arrows]
    • One of the 5 shown here (arrow 2)
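
The Kruskal-Wallis test behind the tissue specificity query compares ranks of a quantity across groups. Below is a minimal pure-Python sketch of the H statistic for data without ties, with a p-value from the chi-square approximation for the special case of three groups (two degrees of freedom, where the survival function is exp(-H/2)). The per-tissue values are invented, and real analyses would normally use a statistics library that handles ties and any number of groups.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic; this sketch requires all values to be distinct."""
    pooled = sorted(value for group in groups for value in group)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle tied values")
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    weighted = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)


def p_value_three_groups(h):
    """Chi-square approximation with 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2.0)


# Invented junction-usage values for one event in three tissues.
tissue_values = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
h = kruskal_wallis_h(tissue_values)
p = p_value_three_groups(h)
print(f"H = {h:.2f}, approximate p = {p:.4f}")
```

With groups this small the chi-square approximation is crude; it is used here only to keep the sketch free of dependencies.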

  29. Snaptron vignette 3
    Collaborator Jonathan Ling studies how splicing factors affect
    splicing of certain cryptic cassette exons
    • cryptic: usually unannotated, usually unconserved
    Past work of Jonathan's showed that splicing factor protein
    TDP-43 suppresses splicing of non-conserved cryptic exons
    Implicated in ALS, frontotemporal dementia (FTD), Alzheimer’s
    Jonathan Ling
    Can we rapidly screen for regulatory
    relationships like those between TDP-43
    and its cryptic-exon targets?

  30. Snaptron vignette 3
    Jonathan Ling
    Let's look at mouse datasets because
    that's where you can get the really nice
    purified tissues
    Let's look at cassette-exon percent-
    spliced-in (PSI) as a summary measure
    Let's look at what patterns seem to
    define rod photoreceptors as a cell type
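
Percent-spliced-in for a cassette exon can be computed from junction read counts. The sketch below uses a common junction-based definition: inclusion evidence is the mean of the two junctions that join the exon to its neighbors, and exclusion evidence is the junction that skips it. The function name and the choice to return None when there are no reads are my own; the talk's exact PSI definition is not given in the slides.

```python
def cassette_psi(inclusion_upstream, inclusion_downstream, skipping):
    """PSI = inclusion / (inclusion + skipping), from junction read counts.

    Returns None when no junction has any reads, since PSI is undefined there.
    """
    inclusion = (inclusion_upstream + inclusion_downstream) / 2.0
    total = inclusion + skipping
    if total == 0:
        return None
    return inclusion / total


# Made-up counts: 30 reads on each inclusion junction, 10 on the skipping junction.
psi = cassette_psi(30, 30, 10)
print(psi)
```

Because PSI depends only on junction counts, it can be computed for unannotated exons as long as their junctions were called.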

  31. Snaptron vignette 3
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    "Supermouse"
    Rods have characteristic
    pattern of PSI levels

  32. Snaptron vignette 3
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    PSIs can reveal specific signatures for cell types that are not
    visible at the gene level

  33. Snaptron vignette 3
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    Certain alternative
    exons seem to have
    high PSI only in rods

  34. Snaptron vignette 3
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    Certain splicing factors are expressed
    specifically in rods -- could they drive
    rod-specific exon splicing?

  35. Snaptron vignette 3
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    Several of these exons involve at least one unannotated junction!

  36. Themes (redux)
    Cloud computing is a natural fit for public data
    Cloud computing for genomic data analysis and collaboration
    Ben Langmead and Abhinav Nellore
    Abstract | Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
    Next-generation sequencing (NGS) technologies have been improving rapidly and have become the workhorse technology for studying nucleic acids. NGS platforms work by collecting information on a large array of polymerase reactions working in parallel, up to billions at a time inside a single sequencer [1]. The speed and decreasing cost of NGS have led to the rapid accumulation of raw sequencing data (sequencing reads), used in published studies, in public archives [2] such as the Sequence Read Archive (SRA) [3,4], which is hosted by the US National Center for Biotechnology Information (NCBI), and the European Nucleotide Archive (ENA) [5], which is hosted by the European Molecular Biology Laboratory at the European Bioinformatics Institute (EMBL–EBI). The SRA now holds about 14 petabases (millions of billions of bases) and has been doubling in size every 10–20 months (FIG. 1). Genomics researchers [...]
    [...] programme [17], among others (TABLE 1). gnomAD now spans over 120,000 exomes and over 15,000 whole genomes. ICGC encompasses over 70 subprojects targeting distinct cancer types, which are conducted in more than a dozen countries and have already collected samples from more than 20,000 donors. Aligned sequencing reads for ICGC require over 1 petabyte (PB; that is, a million GB) of storage. The TOPMed programme, which plans to sequence more than 120,000 genomes [17], has already deposited more than 18,000 human whole-genome sequencing data sets in the SRA, comprising approximately 2.3 petabases or about 16.5% of the entire archive. Large observational studies currently in progress, such as the Precision Medicine Initiative [18] and Million Veterans Project [19], will drive up the totals yet more rapidly. While advances in NGS have increased opportunities [...]
    Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nature Reviews Genetics. 2018 Apr;19(4):208-219.

  37. Themes (redux)
    Scalable software benefits from big resources &
    many samples
    Strategically ignoring gene annotations can yield
    clearer results
    • Many of the splicing patterns described in the
    vignettes were unannotated. Crucial not to be
    biased against these during analysis.

  38. Themes (redux)
    Queryability is in the eye of the beholder
    • Beyond targeted queries, users want
    bulk screens
    • Beyond the boiling cauldron of
    10,000s of samples, users want specific
    subsets with key properties
    • Knocked-down splicing factor
    • Carefully purified tissue
    • Disease X

  39. Thank you:
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    • NIH R01GM105705 (Leek)
    SciServer, SciServer Compute
    langmead-lab.org, @BenLangmead