
Tales of scale

The Sequence Read Archive now contains over a million accessions, including over 200K RNA-seq runs for mouse and over 160K for human. Large-scale projects like GTEx, ICGC and TOPMed are major contributors, and huge projects on the horizon, such as the All of Us and Million Veterans programs, will further accelerate this growth. These archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use.

Using the archive as a motivation, I will convey some insights -- gleaned from both successes and failures -- about how we as computational researchers can work toward the goal of making large public datasets easy to use. I will discuss some challenges that come with (a) working at scale, (b) using commercial (and non-commercial) cloud computing as a platform for this work, (c) pooling and borrowing strength across datasets, and (d) making public data available for use by everyday researchers. I will highlight ways in which we have learned how to make our tools better suited to how scientists work. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Luigi Marchionni, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Ben Langmead

June 20, 2018

Transcript

  1. Ben Langmead
    Assistant Professor, JHU Computer Science
    [email protected], langmead-lab.org, @BenLangmead
    Rocky Mountain Hackcon Keynote
    June 20, 2018
    Tales of scale

  4. Outline
    • Trends in biotech vs. trends in computing
    • Public data
    • Summarizing, indexing, searching
    • Case study: rod photoreceptors
    • Wild speculation, pompous pontification,
    humiliating mea culpas

  5. "Between 2008 and 2013, the performance
    of a single DNA sequencer increased about
    three- to fivefold per year. Using Moore’s Law
    as a benchmark...Sequencers are improving
    at a faster rate than computers. Something
    must be done now, or else we’ll need to
    put vital research on hold while the
    necessary computational techniques
    catch up—or are invented."

  6. Moore's law & sequencing cost
    ½ every 24 months
    ½ every 18 months
    Source: https://www.genome.gov/27541954/dna-sequencing-costs-data/
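
To set the two halving times on this slide against the "three- to fivefold per year" figure in the preceding quote, each halving time can be converted to a per-year factor. This is a minimal sketch; the function name is illustrative, and the only inputs are the 24- and 18-month values from the slide.

```python
def yearly_factor(halving_months: float) -> float:
    """Fold change per year implied by a halving (or doubling) every `halving_months` months."""
    return 2 ** (12 / halving_months)


for months in (24, 18):
    # 24 months gives about 1.41x per year, 18 months about 1.59x per year
    print(f"every {months} months -> {yearly_factor(months):.2f}x per year")
```

Either rate works out to roughly 1.4x to 1.6x per year, far below a three- to fivefold yearly change.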

  7. Who said that?
    "Between 2008 and 2013, the performance of a single DNA
    sequencer increased about three- to fivefold per year. Using
    Moore’s Law as a benchmark...Sequencers are improving at
    a faster rate than computers. Something must be done
    now, or else we’ll need to put vital research on hold
    while the necessary computational techniques catch up
    —or are invented."
    Illustration: Carl DeTorres

  8. Pontification
    • Within 2nd gen era, there's no great disparity
    between sequencing tech & Moore's law
    • Computing trends just as worthy of attention
    and study as sequencing trends
    • Must import more computational expertise, e.g.
    in HPC and CPU architecture, into genomics
    • 2nd gen era has proceeded largely without cloud
    computing "in the loop," but clouds fill other
    roles nicely (more later)

  9. Sequence Read Archive (SRA) growth
    [Figure: SRA size in terabases, "Total" and "Open access" curves, with the 1 Pbp and 10 Pbp levels marked]
    • 8 -> 16 Pbp in ~18 months
    • 4 -> 8 Pbp in ~12 months

  10. "An index is a great leveler" (GB Shaw)
    "Even a summary would be an improvement" (not GB Shaw)

  11. Public summaries of sequencing data
    Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration.
    Nat Rev Genet. 2018 Apr;19(4):208-219. doi: 10.1038/nrg.2017.113.

  12. Indexing raw sequencing data
    Mantis. Ferdman, M., Johnson, R., & Patro, R. Mantis: A Fast,
    Small, and Exact Large-Scale Sequence-Search Index. In
    Research in Computational Molecular Biology (p. 271). Springer.
    BIGSI: Bradley, P., den Bakker, H., Rocha, E., McVean, G., &
    Iqbal, Z. (2017). Real-time search of all bacterial and viral
    genomic data. bioRxiv, 234955.
    Image from Mantis paper
    Image from Split SBT paper
    Sequence Bloom Trees. Solomon B, Kingsford C. Fast
    search of thousands of short-read sequencing
    experiments. Nat Biotechnol. 2016 Mar;34(3):300-2.
    Solomon B, Kingsford C. Improved Search of Large
    Transcriptomic Sequencing Databases Using Split
    Sequence Bloom Trees. J Comput Biol. 2018 Mar 12.
    Sun C, Harris RS, Chikhi R, Medvedev P. AllSome
    Sequence Bloom Trees. J Comput Biol. 2018 May;25(5):
    467-479.
    1000 Genomes FM Index: Dolle DD, Liu Z, Cotten M,
    Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM.
    Using reference-free compressed data structures to
    analyze sequencing reads from thousands of human
    genomes. Genome Res. 2017 Feb;27(2):300-309.

  13. Past work
    Crossbow: Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biol. 2009;10(11):R134.
    http://j.mp/crossbow_proj, http://j.mp/crossbow_repo
    Myrna: Langmead B, Hansen KD, Leek JT. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010;11(8):R83.
    http://j.mp/myrna_proj, http://j.mp/myrna_repo
    ReCount: Frazee AC, Langmead B, Leek JT. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics. 2011 Nov 16;12:449.
    http://j.mp/recount_proj

  14. Today: a search engine for RNA-seq
    • Snaptron: index & query engine w/ REST API
      snaptron.cs.jhu.edu, doi:10.1093/bioinformatics/btx547
    • recount2: clean summaries of data, metadata, packaged as R objects
      jhubiostatistics.shinyapps.io/recount/, doi:10.1038/nbt.3838
    • Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets
      rail.bio, doi:10.1093/bioinformatics/btw575

  15. Abhinav Nellore, OHSU
    Jeff Leek, JHU
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
    Image by Rgocs

  16. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  17. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  18. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence,
    coverage vectors; no alignments, unless asked for
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  19. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence,
    coverage vectors; no alignments, unless asked for
    • Runs easily on commercial AWS cloud, other clusters
    http://rail.bio Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  20. dbGaP
    http://docs.rail.bio/dbgap/
    Nellore A, Wilks C, Hansen KD, Leek JT, Langmead B. Rail-dbGaP:
    analyzing dbGaP-protected data in the cloud with Amazon Elastic
    MapReduce. Bioinformatics. 2016 Aug 15;32(16):2551-3.

  21. Working toward recount2
    • Analyzed ~21,500 human RNA-seq samples
    with Rail-RNA; about 62 Tbp
    • http://github.com/nellore/runs
    (commands we used to run on AWS)
    • ~ $0.72 / sample
    (compare to sequencing costs)
    [Figure: junctions (jxs) by samples]
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
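
The approximate figures on this slide imply a total compute bill that can be worked out directly. A back-of-the-envelope sketch; the constants are the rounded values from the slide, so the results are rough estimates rather than reported costs.

```python
SAMPLES = 21_500        # approximate number of samples analyzed (from the slide)
COST_PER_SAMPLE = 0.72  # approximate US dollars per sample (from the slide)
TOTAL_TBP = 62          # approximate terabases analyzed (from the slide)

total_cost = SAMPLES * COST_PER_SAMPLE
cost_per_tbp = total_cost / TOTAL_TBP

# prints: total: $15,480; per Tbp: $250
print(f"total: ${total_cost:,.0f}; per Tbp: ${cost_per_tbp:,.0f}")
```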

  22. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which the jx is called, with junctions classed as novel, alternative donor/acceptor, exon skip, or fully annotated. Values marked on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%.]
    Annotation includes: UCSC, GENCODE v19 & v24,
    RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

  23. recount2
    • >50K human RNA-seq samples from SRA (open)
    • >10K human RNA-seq samples spanning cancer
    types in The Cancer Genome Atlas (dbGaP)
    • >10K human RNA-seq samples from
    the Genotype-Tissue Expression (GTEx)
    project (dbGaP)
    • In total, ~4.4 trillion reads, 100s of terabases
    Images: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/, doi:10.1038/ng.2653
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  24. recount2
    Summarized at levels of genes, exons, junctions,
    and coverage vectors
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  25. recount2
    https://jhubiostatistics.shinyapps.io/recount/
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  26. Search engine for RNA-seq
    Snaptron

  27. Snaptron
    Query planner delegates query components to
    appropriate systems (SQLite, Tabix, Lucene) and
    indexes (R-tree, B-tree, Lucene inverted text index)
    Chris Wilks
    [Figure: Snaptron query planner, with columns Query, Data Store/Index, and Output. Query components (region, filter, sample filter) are routed to the Tabix/R-tree index, the SQLite/B-tree index, and the Lucene/inverted document index. Intermediate results are region-limited junction records, region-limited & filtered junction records, and sample metadata records; outputs are filtered junction records and filtered samples. Example inverted-index entries for sample metadata terms: "Brain" -> samples 1,2,3,6; "Liver" -> samples 4,6,9,11.]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
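
The delegation idea on this slide can be illustrated with a toy in-memory version. This is only a sketch of the general pattern, not Snaptron's implementation: the `Junction` class, the `query` function, and the junction records are hypothetical, a plain list scan stands in for the Tabix/R-tree region lookup, and a dictionary stands in for the Lucene inverted index. The "brain" and "liver" sample IDs are the example entries from the slide's figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    chrom: str
    start: int
    end: int
    samples: frozenset  # IDs of samples in which the junction was observed


# Inverted index over sample metadata: term -> sample IDs (example values from the slide).
METADATA_INDEX = {"brain": {1, 2, 3, 6}, "liver": {4, 6, 9, 11}}

# Hypothetical junction records; a real system would keep these behind an interval index.
JUNCTIONS = [
    Junction("chr1", 100, 200, frozenset({1, 2, 4})),
    Junction("chr1", 150, 400, frozenset({4, 9})),
    Junction("chr2", 100, 200, frozenset({3, 6})),
]


def query(chrom, start, end, term=None):
    """Return junctions contained in [start, end] on `chrom`, limited to samples matching `term`."""
    # Region component: a linear scan standing in for an interval-index lookup.
    hits = [j for j in JUNCTIONS if j.chrom == chrom and start <= j.start and j.end <= end]
    if term is None:
        return hits
    # Metadata component: look up matching sample IDs in the inverted index.
    wanted = METADATA_INDEX.get(term.lower(), set())
    # Combine: keep junctions seen in at least one matching sample, trimmed to those samples.
    result = []
    for j in hits:
        kept = j.samples & wanted
        if kept:
            result.append(Junction(j.chrom, j.start, j.end, frozenset(kept)))
    return result
```

For example, `query("chr1", 1, 500, "Liver")` returns both chr1 junctions, trimmed to samples {4} and {4, 9} respectively.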

  28. Snaptron
    Provides command-line tool and REST API for
    querying junctions (& more summaries coming soon)
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
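
As a hedged sketch of what composing a junction query against the REST API might look like: the host is taken from the slides, but the path layout (`/<compilation>/snaptron`), the compilation name `srav2`, and the parameter names `regions` and `rfilter` are assumptions that should be checked against the documentation at snaptron.cs.jhu.edu. The code only builds a URL string and does not contact the server.

```python
from urllib.parse import urlencode

BASE = "http://snaptron.cs.jhu.edu"  # host from the slides


def junction_query_url(compilation: str, region: str, min_samples: int) -> str:
    """Build a junction-query URL. Path layout and parameter names are assumptions."""
    params = {
        "regions": region,                             # gene name or coordinate range (assumed)
        "rfilter": f"samples_count>:{min_samples}",    # assumed filter syntax
    }
    # Leave ':' and '>' unescaped so the filter expression stays readable.
    return f"{BASE}/{compilation}/snaptron?{urlencode(params, safe=':>')}"


print(junction_query_url("srav2", "ABCD3", 100))
```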

  29. Snaptron
    Examples:
    • For each junction in gene ABCD3, how many reads
    supported it in each of the 50K SRA samples?
    • What is a particular junction's tissue specificity in
    the GTEx dataset?
    • In which samples is splicing pattern A
    overrepresented relative to splicing pattern B?
    (A/B might relate to alt splicing, fusions, etc.)
    http://snaptron.cs.jhu.edu
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.

  30. Mini Snaptron case study
    • Goldstein et al searched for novel cassette exons in
    Illumina BodyMap 2.0
    • Identified 249 cassette exons within known genes
    but not overlapping any annotated exon
    • Validated 216 out of 249 in independent sample via
    paired-end RNA-seq (2 x 250 bp)
    Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R. Prediction and
    Quantification of Splice Events from RNA-Seq Data. PLoS One. 2016 May 24;11(5):e0156132.

  31. Mini Snaptron case study
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
    [Figure panels: A. ABCD3, B. KMT2E, C. ALKATI]
    • Snaptron immediately recapitulates ABCD3 exon (above)
    • Of the 249 novel exons, 236 (94.8%) occurred in GTEx
    • Used shared sample count (SSC) query to measure #
    samples the novel exons occurred in...
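
The slides do not define the shared sample count. A minimal sketch, assuming SSC means the number of samples in which both junctions flanking a cassette exon are observed with at least one read; the function name and the per-sample read counts below are made up for illustration.

```python
def shared_sample_count(upstream: dict, downstream: dict) -> int:
    """Count samples with at least one read on both flanking junctions of an exon."""
    up = {sample for sample, reads in upstream.items() if reads > 0}
    down = {sample for sample, reads in downstream.items() if reads > 0}
    return len(up & down)


# Hypothetical read counts per sample for the two junctions flanking one exon.
upstream = {"s1": 5, "s2": 0, "s3": 2, "s4": 7}
downstream = {"s1": 1, "s3": 4, "s5": 9}

print(shared_sample_count(upstream, downstream))  # prints 2 (s1 and s3)
```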

  32. Mini Snaptron case study
    [Figure: shared sample count (SSC), axis 0 to 20,000, by data compilation (GTEx, SRAv2), split by validation status (Failed, Passed)]
    • Exons validated by
    Goldstein et al had
    higher SSC versus
    exons failing
    validation
    • SSC (prevalence) is
    related to how "real"
    the exons are
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.

  33. Snaptron case study: rod photoreceptors
    Collaborator Jonathan Ling studies how splicing factors affect
    splicing of certain cryptic cassette exons
    • cryptic: usually unannotated, usually unconserved
    Past work of Jonathan's showed that splicing factor protein
    TDP-43 suppresses splicing of non-conserved cryptic exons
    Implicated in ALS, frontotemporal dementia (FTD), Alzheimer’s
    Jonathan Ling
    Can we rapidly screen for regulatory
    relationships like those between TDP-43
    and its cryptic-exon targets?

  34. Rod photoreceptors
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    "Supermouse"
    Rods have characteristic
    pattern of PSI levels
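
PSI (percent spliced in) is not defined on the slides. A minimal sketch using one common junction-count definition for a cassette exon, which may differ from the one used in this study: the mean read count of the two inclusion junctions divided by that mean plus the exclusion (exon-skipping) junction count. The function and argument names are illustrative.

```python
def psi(incl_up: int, incl_down: int, excl: int) -> float:
    """Fraction spliced in for a cassette exon, computed from junction read counts."""
    inclusion = (incl_up + incl_down) / 2  # average support for the two inclusion junctions
    total = inclusion + excl
    if total == 0:
        return float("nan")  # no reads cover the event, so PSI is undefined
    return inclusion / total


print(psi(30, 50, 10))  # 40 / (40 + 10)
```

A PSI near 1 means the exon is almost always included in that sample; a PSI near 0 means it is almost always skipped.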

  35. Rod photoreceptors
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    PSIs can reveal specific signatures for cell types that are not
    visible at the gene level

  36. Rod photoreceptors
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    Certain cassettes have
    high PSI only in rods

  37. Rod photoreceptors
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    Certain splicing factors are expressed
    specifically in rods -- could they drive
    rod-specific exon splicing?

  38. Rod photoreceptors
    Ling J, Wilks C, Charles R, Blackshaw S, & Langmead, B. "Exploratory analysis of alternative
    splicing in tens of thousands of bulk and single-cell samples" in preparation
    Most of these are unannotated!

  39. Future: cloud computing
    Cloud computing may not be "in the loop" for most
    data-generating labs, but it's a natural fit for
    reanalyzing public data and for far-flung collaborations
    [Screenshot of the first page of the review cited below; section labels: REVIEWS, COMPUTATIONAL TOOLS]
    Cloud computing for genomic data analysis and collaboration
    Ben Langmead and Abhinav Nellore
    Abstract | Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
    Excerpts visible in the screenshot (truncated on the slide):
    "Next-generation sequencing (NGS) technologies have been improving rapidly and have become the workhorse technology for studying nucleic acids. NGS platforms work by collecting information on a large array of polymerase reactions working in parallel, up to billions at a time inside a single sequencer [1]. The speed and decreasing cost of NGS have led to the rapid accumulation of raw sequencing data (sequencing reads), used in published studies, in public archives [2] such as ... [3,4]"
    "... programme [17], among others (TABLE 1). gnomAD now spans over 120,000 exomes and over 15,000 whole genomes. ICGC encompasses over 70 subprojects targeting distinct cancer types, which are conducted in more than a dozen countries and have already collected samples from more than 20,000 donors. Aligned sequencing reads for ICGC require over 1 petabyte (PB; that is, a million GB) of storage. The TOPMed programme, which plans to sequence more than 120,000 genomes [17], ..."
    Langmead B, Nellore A. Cloud computing for
    genomic data analysis and collaboration. Nature
    Reviews Genetics. 2018 Apr;19(4):208-219.

  40. Future: public data
    "Queryability" means different things for
    different assays, scientific questions
    • Beyond targeted queries, users want bulk
    screens
    • Boiling cauldron of 10,000s of samples aside,
    users want subsets with trustworthy
    metadata and particular properties
    • E.g. knocked-down splicing factor,
    carefully purified tissue, disease X

  41. Future: public data
    Rod photoreceptor study involved >90K
    public run accessions
    3 out of the 4 figures I showed used only
    public data
    Desire: for querying and using public data
    to be everyday activity in bio research
    [Screenshot of a column headed "WORLD VIEW: A personal take on events"; credit line: MEGHNA ABRAHAM]
    Don’t let useful data go to waste
    Researchers must seek out others’ deposited biological sequences in community databases, urges Franziska Denk.
    Excerpts visible in the screenshot (truncated on the slide):
    "One of the best ways for a neuroscientist like me to keep up to date with what colleagues are working on is to attend conferences. But on recent trips I have noticed a problem. Too few researchers are consulting and using publicly available data - my own included. What is going on?"
    "Massive amounts of biological information are being accumulated using high-throughput sequencing techniques. Many scientists ..."
    "... discrepancy, and propose a biologically valid reason for it. Why are so many bench biologists overlooking this wealth of cell-type-specific expression data?"
    "My hunch is there are two reasons. First, researchers underestimate how many of these data have been published over the past few years because they are being generated across so many different fields. Second, they are wary of the data. Because you need bioinformatics ..."

  42. Future: data science
    Spectrum: single accession or study <-> all of SRA
    With public data we are quickly confronted by issues like
    technical confounding and missing/incorrect metadata
    What kinds of questions can be answered robustly at what
    points on this spectrum?
    Can we "fix" metadata?
    Ellis SE, Collado-Torres L, Jaffe A, Leek JT.
    Improving the value of public RNA-seq
    expression data by phenotype prediction.
    Nucleic Acids Res. 2018 May 18;46(9):e54.

  43. Thank you:
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    • NIH R01GM105705 (Leek)
    SciServer, SciServer Compute
    langmead-lab.org, @BenLangmead
