Making the Most of Petabases of Genomic Data

Ben Langmead
October 26, 2018

With the advent of modern DNA sequencing, life science is increasingly becoming a big-data science. The main public archive for sequencing data, the Sequence Read Archive (SRA), now contains over a million datasets and many petabytes of data. While large-scale projects like GTEx, ICGC and TOPMed have been major contributors, even larger projects are on the horizon, e.g. the All of Us Research Program and the Million Veteran Program. The SRA and similar archives are potential gold mines for researchers, but they are not organized for everyday use by scientists. The situation resembles the early days of the World Wide Web, before search engines made the web easy to use. I will describe our progress toward the goal of making it easy for researchers to ask scientific questions about public datasets, focusing on datasets that measure abundance of messenger RNA transcripts (RNA-seq). I will describe how we borrow from trends in big-data wrangling and cloud computing to make public data easier to use and query. I will motivate the work with examples of how we are applying it in research areas concerned with novel (e.g. cryptic) splicing patterns and the splicing factors that regulate them. This is work in progress, and I will highlight ways in which we are learning to make our tools better suited to how scientists work. This is joint work with Abhinav Nellore, Chris Wilks, Jonathan Ling, Luigi Marchionni, Jeff Leek, Kasper Hansen, Andrew Jaffe and others.

Transcript

  1. Ben Langmead
    Assistant Professor, JHU Computer Science
    [email protected], langmead-lab.org, @BenLangmead
    IBM Research, Almaden
    Making the Most of
    Petabases of Genomic Data
    October 25, 2018

  4. 2nd-gen sequencing: the (Lego) Movie
    bit.ly/2genseq_1 bit.ly/2genseq_2 bit.ly/2genseq_3
    [Animation frames: template sequence CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT, copies of it broken into fragments, individual bases, and DNA polymerase]

  5. CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
    Input DNA
    GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
    Reads

  6. CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
    GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
    TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC
    Input DNA
    Reads

  7. CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
    GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
    TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC
    TGTCTTTGATTCCTG CGCGATAGCATTGCG GCATTGCGAGACGCT CCTATGTCGCAGTAT
    Input DNA
    Reads

  8. CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG
    GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
    TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC
    TGTCTTTGATTCCTG CGCGATAGCATTGCG GCATTGCGAGACGCT CCTATGTCGCAGTAT
    GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
    TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
    TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG
    GTATGCACGCGATAG ACCTACGTTCAATAT TATTTATCGCACCTA CCACTCACGGGAGCT
    GCGAGACGCTGGAGC CTATCACCCTATTAA CTGTCTTTGATTCCT ACTCACGGGAGCTCT
    CCTACGTTCAATATT GCACCTACGTTCAAT GTCTGGGGGGTATGC AGCCGGAGCACCCTA
    GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
    TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
    CACGGGAGCTCTCCA TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG
    CACGGGAGCTCTCCA
    Input DNA
    Reads

  9. 100 nt
    100,000,000 nt
    Input DNA
    Reads

  10. Input DNA
    100 nt
    100,000,000 nt
    ?
    Reads

  11. Input DNA
    100,000,000 nt
    ?
    Reference genome
    +
    Reads

  12. Input DNA
    Reads Reference genome
    +
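
The idea on the preceding slides, placing short reads against a reference genome, can be sketched with a toy exact-match index. This is only an illustration: real aligners such as Bowtie use compressed full-text indexes and tolerate mismatches. The function names and the seed length `K` are assumptions made for this example; the reference string and the reads are the ones shown on the "Input DNA / Reads" slides.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every length-k substring of the reference to its start offsets."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k):
    """Return offsets where the read matches the reference exactly.

    The first k bases of the read (the "seed") are looked up in the index,
    then each candidate offset is verified against the full read.
    """
    hits = []
    for offset in index.get(read[:k], []):
        if reference[offset:offset + len(read)] == read:
            hits.append(offset)
    return hits

reference = ("CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAG"
             "TATCTGTCTTTGATTCCTG")
reads = ["GTATGCACGCGATAG", "TATGTCGCAGTATCT", "CACCCTATGTCGCAG", "GAGACGCTGGAGCCG"]
K = 8  # seed length, chosen arbitrarily for this toy example
index = build_index(reference, K)
for read in reads:
    print(read, map_read(read, reference, index, K))
```

The index is built once and reused for every read, which is the property that makes reference-based mapping practical when reads number in the millions.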

  13. Sequence Read Archive
    Langmead B, Nellore A. Cloud computing for genomic data
    analysis and collaboration. Nat Rev Genet. 2018 May;19(5):325.
    Currently ~ 20 petabases

  14. Lab goals
    To make high-throughput life science data as usable
    as possible for scientific labs, especially small ones
    Efficient
    • Software: Bowtie 1&2, Arioc, Dashing
    • Topics: applied algorithms, text indexing, sketching, thread scaling
    Scalable
    • Software: Rail-RNA, recount2, Snaptron, Boiler
    • Topics: parallel and high-performance computing, cloud computing, indexing
    Interpretable
    • Software: Qtip, FORGe
    • Topics: modeling mapping quality, graph-genome variants, addressing biases

  15. Themes
    • Cloud computing & supercomputing are poised
    to add big value to archived sequencing data
    • Archives can tell us how much we don't know
    about something
    • Archives can generate hypotheses, inform
    experimental design, even validate results
    • When one door opens, another one opens

  16. Sequence Read Archive
    Langmead B, Nellore A. Cloud computing for genomic data
    analysis and collaboration. Nat Rev Genet. 2018 May;19(5):325.
    Currently ~ 20 petabases

  17. An index is a great leveler
    GB Shaw
    Even a summary would
    be an improvement
    Not GB Shaw

  18. Public summaries
    Langmead B, Nellore A. Cloud computing for genomic data analysis and
    collaboration. Nat Rev Genet. 2018 Apr;19(4):208-219.

  19. Indexing raw sequencing data
    Mantis. Ferdman, M., Johnson, R., & Patro, R. Mantis: A
    Fast, Small, and Exact Large-Scale Sequence-Search
    Index. In Research in Computational Molecular Biology
    (p. 271). Springer.
    BIGSI: Bradley, P., den Bakker, H., Rocha, E., McVean,
    G., & Iqbal, Z. (2017). Real-time search of all bacterial
    and viral genomic data. bioRxiv, 234955.
    Image from Mantis paper
    Image from Split SBT paper
    Sequence Bloom Trees. Solomon B, Kingsford C.
    Fast search of thousands of short-read sequencing
    experiments. Nat Biotechnol. 2016 Mar;34(3):300-2.
    Solomon B, Kingsford C. Improved Search of
    Large Transcriptomic Sequencing Databases
    Using Split Sequence Bloom Trees. J Comput
    Biol. 2018 Mar 12.
    Sun C, Harris RS, Chikhi R, Medvedev P. AllSome
    Sequence Bloom Trees. J Comput Biol. 2018 May;
    25(5):467-479.
    1000 Genomes FM Index: Dolle DD, Liu Z, Cotten
    M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA,
    Keane TM. Using reference-free compressed data
    structures to analyze sequencing reads from
    thousands of human genomes. Genome Res. 2017
    Feb;27(2):300-309.

  20. A search engine for RNA-seq
    Snaptron: index & query engine w/ REST API
    • snaptron.cs.jhu.edu
    • doi:10.1093/bioinformatics/btx547
    recount2: summaries of data, metadata, packaged as R objects
    • jhubiostatistics.shinyapps.io/recount/
    • doi:10.1038/nbt.3838
    Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets
    • rail.bio
    • doi:10.1093/bioinformatics/btw575

  21. RNA-seq
    Picture from: Roy H, Ibba M. Molecular
    biology: sticky end in protein synthesis.
    Nature. 2006 Sep 7;443(7107):41-2.
    DNA
    RNA
    Protein
    Transcription
    Translation

  22. Splicing
    gene
    Intron Exon
    Exon

  23. Splicing
    AGGGCTGGGCATAAAAGTCAGGGCAGAGCCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCACTAGCAAC
    CTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTG
    GATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATAGAAACTG
    GGCATGTGGAGACAGAGAAGACTCTTGGGTTTCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACC
    CTTAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGT
    TATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACA
    ACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGGTG
    AGTCTATGGGACGCTTGATGTTTTCTTTCCCCTTCTTTTCTATGGTTAAGTTCATGTCATAGGAAGGGGATAAGTAA
    CAGGGTACAGTTTAGAATGGGAAACAGACGAATGATTGCATCAGTGTGGAAGTCTCAGGATCGTTTTAGTTTCTTTT
    ATTTGCTGTTCATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCTTTTTTTTTCTTCTCCGCAATTTTTACTAT
    TATACTTAATGCCTTAACATTGTGTATAACAAAAGGAAATATCTCTGAGATACATTAAGTAACTTAAAAAAAAACTT
    TACACAGTCTGCCTAGTACATTACTATTTGGAATATATGTGTGCTTATTTGCATATTCATAATCTCCCTACTTTATT
    TTCTTTTATTTTTAATTGATACATAATCATTATACATATTTATGGGTTAAAGTGTAATGTTTTAATATGTGTACACA
    TATTGACCAAATCAGGGTAATTTTGCATTTGTAATTTTAAAAAATGCTTTCTTCTTTTAATATACTTTTTTGTTTAT
    CTTATTTCTAATACTTTCCCTAATCTCTTTCTTTCAGGGCAATAATGATACAATGTATCATGCCTCTTTGCACCATT
    CTAAAGAATAACAGTGATAATTTCTGGGTTAAGGCAATAGCAATATCTCTGCATATAAATATTTCTGCATATAAATT
    GTAACTGATGTAAGAGGTTTCATATTGCTAATAGCAGCTACAATCCAGCTACCATTCTGCTTTTATTTTATGGTTGG
    GATAAGGCTGGATTATTCTGAGTCCAAGCTAGGCCCTTTTGCTAATCATGTTCATACCTCTTATCTTCCTCCCACAG
    CTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTA
    TCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAATTTC
    TATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTC
    intron 1
    intron 2
    exon 1
    exon 2
    exon 3
    ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGC
    TGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGG
    CAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTG
    CACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTG
    CCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAA
    exon 1 exon 2 exon 3
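
Splicing as pictured here removes the introns and joins the exons in order. A minimal sketch of that operation follows; the sequence and exon coordinates are a made-up toy example (not the gene shown on the slide), and the function name is hypothetical.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy transcript: uppercase = exon, lowercase = intron (illustrative only).
pre_mrna = "ATGGTG" + "gtaagtcag" + "CATCTG" + "gtgagtttag" + "ACTTAA"
exons = [(0, 6), (15, 21), (31, 37)]

mature = splice(pre_mrna, exons)
print(mature)
```

Passing a different list of intervals to `splice` yields a different transcript from the same gene, which is the basis of alternative splicing.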

  24. Alternative splicing
    Genes can have many isoforms
    Exons can be independently
    included/excluded; boundaries
    can shift
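
To see why genes can have many isoforms, consider only exon skipping: if every internal exon can be independently included or excluded, a gene with n exons already allows 2^(n-2) combinations. The sketch below enumerates them under that simplifying assumption (first and last exon always kept, no shifting boundaries); the exon labels and function name are hypothetical.

```python
from itertools import combinations

def exon_skipping_isoforms(exons):
    """Yield every exon list that keeps the first and last exon plus any
    subset of the internal exons, preserving their original order."""
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    for r in range(len(internal) + 1):
        for kept in combinations(internal, r):
            yield [first, *kept, last]

exons = ["E1", "E2", "E3", "E4"]
isoforms = list(exon_skipping_isoforms(exons))
for iso in isoforms:
    print("-".join(iso))
```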

  25. Gene annotation
    Gene annotation: curated collection of isoforms
    UCSC genome browser

  26. Abhinav Nellore, OHSU
    Jeff Leek, JHU
    Image by Rgocs
    http://rail.bio
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.

  27. Spliced RNA-seq aligner for analyzing many samples at once
    • Aggregate across samples to borrow strength and
    eliminate redundant alignment work
    • Let data prune false junction calls, not annotation
    • Concise outputs: junctions, junction evidence,
    coverage vectors; no alignments, unless asked for
    • Ready for commercial AWS cloud, other clusters
    http://rail.bio
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C,
    Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of
    RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
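
One way to picture "eliminate redundant alignment work" is to align each distinct read sequence once and share the result across every sample that contains it. The sketch below illustrates only that idea, with `str.find` standing in for an aligner. It is not Rail-RNA's implementation, and all names and data are hypothetical.

```python
from collections import defaultdict

def align_once(samples, align):
    """Align each distinct read sequence a single time.

    samples: dict mapping sample name -> list of read sequences.
    align:   function mapping a read sequence to an alignment result.
    Returns (per-sample results, number of calls made to align).
    """
    owners = defaultdict(list)  # read sequence -> [(sample name, read index)]
    for name, reads in samples.items():
        for i, read in enumerate(reads):
            owners[read].append((name, i))

    results = {name: [None] * len(reads) for name, reads in samples.items()}
    calls = 0
    for read, where in owners.items():
        hit = align(read)       # the expensive step, done once per sequence
        calls += 1
        for name, i in where:
            results[name][i] = hit
    return results, calls

reference = "CGTCTGGGGGGTATGCACGCGATAG"
samples = {
    "sample_A": ["GTATGCAC", "CGTCTGGG", "GTATGCAC"],
    "sample_B": ["GTATGCAC", "CGCGATAG"],
}
# str.find is a placeholder for a real aligner.
results, calls = align_once(samples, reference.find)
print(results, calls)
```

Here five reads across two samples require only three alignment calls; the saving grows with the number of samples that share highly expressed transcripts.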

  28. dbGaP
    http://docs.rail.bio/dbgap/
    Nellore A, Wilks C, Hansen KD, Leek JT, Langmead B. Rail-dbGaP:
    analyzing dbGaP-protected data in the cloud with Amazon Elastic
    MapReduce. Bioinformatics. 2016 Aug 15;32(16):2551-3.

  29. Toward recount2
    • Analyzed ~21,500 human RNA-seq samples
    with Rail-RNA; about 62 Tbp
    • http://github.com/nellore/runs
    (Commands we used to run on AWS)
    • ~ $0.72 / sample
    (Compare to sequencing costs)
    [Figure labels: jxs, samples]
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
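
As a rough check on the figures above, the per-sample cost implies a total compute bill on the order of $15,000 for the whole run. The arithmetic below uses only the approximate numbers quoted on the slide, so the results are approximate too.

```python
samples = 21_500          # approximate sample count from the slide
cost_per_sample = 0.72    # approximate USD per sample from the slide
total_bp = 62e12          # about 62 Tbp

total_cost = samples * cost_per_sample
bp_per_sample = total_bp / samples
cost_per_gbp = total_cost / (total_bp / 1e9)

print(f"total cost      ~ ${total_cost:,.0f}")
print(f"mean per sample ~ {bp_per_sample / 1e9:.2f} Gbp")
print(f"cost per Gbp    ~ ${cost_per_gbp:.2f}")
```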

  30. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which jx is called, with junctions classed as novel, alternative donor/acceptor, exon skip, or fully annotated. Values marked on the plot: 56,861 jx; 18.6%, 81.4%, 85.8%, 96.5%, 100%.]
    Annotation includes: UCSC, GENCODE v19 & v24, RefSeq,
    CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.
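
The plot on this slide counts how many junctions are called in at least S samples and what share of them is annotated. A sketch of that computation follows; the junction IDs, sample counts, and annotation set are invented for illustration and are not taken from intropolis.

```python
def junctions_at_threshold(sample_counts, annotated, s):
    """Return (J, annotated fraction) for junctions called in >= s samples.

    sample_counts: dict junction id -> number of samples in which it is called.
    annotated:     set of junction ids present in the gene annotation.
    """
    kept = [jx for jx, n in sample_counts.items() if n >= s]
    if not kept:
        return 0, 0.0
    frac = sum(jx in annotated for jx in kept) / len(kept)
    return len(kept), frac

# Made-up data for illustration.
sample_counts = {"jx1": 9000, "jx2": 1200, "jx3": 40, "jx4": 3, "jx5": 1}
annotated = {"jx1", "jx2", "jx4"}

for s in (1, 10, 1000):
    j, frac = junctions_at_threshold(sample_counts, annotated, s)
    print(f"S={s}: J={j}, annotated={frac:.0%}")
```

Raising S removes junctions seen in only a few samples, which is how the data itself, rather than the annotation, can be used to prune unreliable calls.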

  31. Toward recount2
    • Discovery of new splicing has leveled off
    • Time ripe for a more complete annotation?
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

  32. recount2
    • >50K human RNA-seq samples from SRA (open)
    • >10K human RNA-seq samples spanning cancer
    types in The Cancer Genome Atlas (dbGaP)
    Image: https://www.sevenbridges.com/welcome-to-the-cancer-genomics-cloud-2/
    • >10K human RNA-seq samples from
    Genotype-Tissue Expression (GTEx)
    project (dbGaP)
    Image: doi:10.1038/ng.2653
    • Total: ~4.4 trillion reads, 100s of terabases
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  33. recount2
    https://jhubiostatistics.shinyapps.io/recount/
    Leo Collado Torres, Abhinav Nellore
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  34. Search engine for RNA-seq
    Snaptron

  35. Snaptron
    Query planner delegates query components to
    appropriate systems (sqlite, tabix, lucene) and
    indexes (R-tree, B-tree, Lucene inverted text index)
    Chris Wilks
    [Diagram: the region, filter, and sample components of a query pass through the Snaptron query planner to three data stores/indexes (Tabix/R-tree index, SQLite/B-tree index, Lucene/inverted document index), yielding region-limited and filtered junction records and filtered sample metadata records. Example inverted-index entries for sample metadata: "Brain" → samples 1,2,3,6; "Liver" → samples 4,6,9,11.]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying splicing patterns across
    tens of thousands of RNA-seq samples. Bioinformatics. 2018 Jan 1;34(1):114-116.
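
The delegation described on this slide can be sketched with plain Python containers standing in for the real systems: a dictionary plays the role of the inverted index over sample metadata, and a list scan plays the role of the region lookup. The metadata terms and sample IDs are the ones from the slide's example; the junction records and function name are invented for illustration, and this is not Snaptron's code.

```python
# Inverted index over sample metadata: term -> set of sample IDs (from the slide).
metadata_index = {"Brain": {1, 2, 3, 6}, "Liver": {4, 6, 9, 11}}

# Junction records: (chrom, start, end, {sample ID: read count}); values invented.
junctions = [
    ("chr1", 100, 200, {1: 5, 4: 2}),
    ("chr1", 150, 400, {6: 7, 9: 1}),
    ("chr2", 100, 200, {2: 3}),
]

def query(chrom, start, end, term):
    """Return junctions overlapping a region, limited to samples matching a term."""
    samples = metadata_index.get(term, set())        # sample filter
    out = []
    for jchrom, jstart, jend, counts in junctions:   # region lookup (linear scan here)
        if jchrom != chrom or jend < start or jstart > end:
            continue
        kept = {s: c for s, c in counts.items() if s in samples}
        if kept:
            out.append((jchrom, jstart, jend, kept))
    return out

print(query("chr1", 1, 500, "Brain"))
print(query("chr1", 1, 500, "Liver"))
```

In a real deployment each step would be answered by the index suited to it, so neither the region lookup nor the metadata search needs a full scan.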

  36. Snaptron
    Provides command-line tool and REST API for querying
    junctions, gene & exon expression, coverage
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying splicing patterns across
    tens of thousands of RNA-seq samples. Bioinformatics. 2018 Jan 1;34(1):114-116.

  37. Snaptron
    Example queries
    • How prevalent is each junction in gene ABCD3
    in each of 50K public datasets?
    • What is a junction's tissue specificity in the
    GTEx dataset?
    • In which samples is splicing pattern A
    overrepresented relative to B?
    http://snaptron.cs.jhu.edu
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying splicing patterns across
    tens of thousands of RNA-seq samples. Bioinformatics. 2018 Jan 1;34(1):114-116.
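
The first example query might translate into a REST request along the following lines. The base URL is from the slide. The endpoint path, the `srav2` compilation name, and the `regions` and `rfilter` parameter names and filter syntax are assumptions based on recollection of the Snaptron documentation and should be checked there before use. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

BASE = "http://snaptron.cs.jhu.edu"  # from the slide

def junction_query_url(compilation, region, min_samples=None):
    """Build a junction query URL (path and parameter names are assumed)."""
    params = {"regions": region}
    if min_samples is not None:
        # Assumed filter syntax: junctions found in at least min_samples samples.
        params["rfilter"] = f"samples_count>:{min_samples}"
    return f"{BASE}/{compilation}/snaptron?{urlencode(params)}"

print(junction_query_url("srav2", "ABCD3"))
print(junction_query_url("srav2", "ABCD3", min_samples=100))
```

`urlencode` percent-encodes the filter operator, so the string is safe to pass to any HTTP client.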

  38. Snaptron case studies
    [Figure: shared sample count (SSC), axis 0 to 20,000, by data compilation (GTEx, SRAv2), with points marked as passed or failed validation. Panels: A. ABCD3, B. KMT2E, C. ALKATI.]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying splicing patterns across
    tens of thousands of RNA-seq samples. Bioinformatics. 2018 Jan 1;34(1):114-116.

  39. In the field: splicing factors
    Dr. Ling studies how splicing factors affect
    certain cryptic splicing patterns
    • cryptic: infrequent, not conserved,
    "shouldn't happen"
    Jonathan Ling, Seth Blackshaw
    [Image label: TDP-43]

  40. In the field: splicing factors
    Ling JP, Pletnikova O, Troncoso JC, Wong PC. TDP-43 repression of nonconserved
    cryptic exons is compromised in ALS-FTD. Science. 2015 Aug 7;349(6248):650-5.

  41. In the field: splicing factors
    splicing factors
    splicing patterns

  42. Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B,
    Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT
    identifies key regulators of photoreceptor-specific splicing. In preparation.
    Rods have characteristic
    patterns of exon usage
    Rod photoreceptors

  43. Rod photoreceptors
    Exon usage is a useful cell-type signature; often not
    visible at the gene level
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B,
    Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT
    identifies key regulators of photoreceptor-specific splicing. In preparation.

  44. Certain exons are
    used only in rods
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B,
    Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT
    identifies key regulators of photoreceptor-specific splicing. In preparation.
    Rod photoreceptors

  45. Certain splicing factors are
    specific to rods -- could they
    drive rod-specific splicing?
    Rod photoreceptors
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B,
    Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT
    identifies key regulators of photoreceptor-specific splicing. In preparation.

  46. Rod photoreceptors
    Ling JP, Wilks C, Charles R, Ghosh D, Jiang L, Santiago CP, Pang B,
    Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT
    identifies key regulators of photoreceptor-specific splicing. In preparation.
    Up-regulating those splicing factors yields rod-like splicing

  47. Future: public data
    Rod photoreceptor study involved >90K
    public datasets
    Most figures I showed used public data only
    Desire: querying public data = everyday
    activity in bio research
    • "Leveler" in a field of haves & have nots
    [Article image] WORLD VIEW: A personal take on events
    Don’t let useful data go to waste
    Researchers must seek out others’ deposited biological sequences in
    community databases, urges Franziska Denk.
    Excerpts visible on the slide:
    "One of the best ways for a neuroscientist like me to keep up to date with what colleagues are working on is to attend conferences. But on recent trips I have noticed a problem. Too few researchers are consulting and using publicly available data, my own included. What is going on?"
    "Massive amounts of biological information are being accumu[...]"
    "[...] discrepancy, and propose a biologically valid reason for it."
    "Why are so many bench biologists overlooking this wealth of cell-type-specific expression data?"
    "My hunch is there are two reasons. First, researchers underestimate how many of these data have been published over the past few years because they are being generated across so many different fields."
    Image credit: MEGHNA ABRAHAM

  48. Future: cloud computing
    Clouds are a natural fit for reanalyzing public data
    and for far-flung genomics collaborations
    • Elasticity, security, reproducibility, less copying
    [Article image] REVIEWS: COMPUTATIONAL TOOLS
    Cloud computing for genomic data analysis and collaboration
    Ben Langmead and Abhinav Nellore
    Abstract | Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
    Excerpts visible on the slide:
    "Next-generation sequencing (NGS) technologies have been improving rapidly and have become the workhorse technology for studying nucleic acids. NGS platforms work by collecting information on a large array of polymerase reactions working in parallel, up to billions at a time inside a single sequencer. The speed and decreasing cost of NGS have led to the rapid accumulation of raw sequencing data (sequencing reads), used in published studies, in public archives such as [...]"
    "[...] programme, among others (TABLE 1). gnomAD now spans over 120,000 exomes and over 15,000 whole genomes. ICGC encompasses over 70 subprojects targeting distinct cancer types, which are conducted in more than a dozen countries and have already collected samples from more than 20,000 donors. Aligned sequencing reads for ICGC require over 1 petabyte (PB; that is, a million GB) of storage. The TOPMed programme, which plans to sequence more than 120,000 genomes, [...]"
    Langmead B, Nellore A. Cloud
    computing for genomic data analysis
    and collaboration. Nature Reviews
    Genetics. 2018 Apr;19(4):208-219.

  49. Future: data science
    One
    dataset
    All of
    SRA
    Public data quickly confronts us with technical
    confounders & missing/incorrect metadata
    What questions can we answer robustly?
    At what points on the spectrum?
    Is metadata fixable?
    Ellis SE, Collado-Torres L, Jaffe A, Leek JT.
    Improving the value of public RNA-seq
    expression data by phenotype prediction.
    Nucleic Acids Res. 2018 May 18;46(9):e54.

  50. Thank you:
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado Torres, Chris Wilks, Andrew Jaffe, José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling, Seth Blackshaw
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    • NIH R01GM105705 (Leek)
    IDIES Seed funding, SciServer, SciServer Compute
    langmead-lab.org, @BenLangmead
