Using huge public sequencing datasets to answer scientific questions

Ben Langmead

July 30, 2018

Transcript

  1. Ben Langmead
    Assistant Professor, JHU Computer Science
    langmead-lab.org, @BenLangmead
    UCLA CGSI Tutorial, July 30, 2018
    Using huge public sequencing datasets
    to answer scientific questions

  2. Links
    Code & links: https://github.com/BenLangmead/cgsi18
    Slides: http://bit.ly/langmead-cgsi18

  3. Related readings
    • Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nat
    Rev Genet. 2018 Apr;19(4):208-219. https://doi.org/10.1038/nrg.2017.113
    • Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying splicing patterns across
    tens of thousands of RNA-seq samples. Bioinformatics. 2018 Jan 1;34(1):114-116.
    https://doi.org/10.1093/bioinformatics/btx547
    • Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B,
    Leek JT. Reproducible RNA-seq analysis using recount2. Nat Biotechnol. 2017 Apr 11;35(4):319-321.
    https://doi.org/10.1038/nbt.3838
    • Nellore A, Jaffe AE, Fortin JP, Alquicira-Hernández J, Collado-Torres L, Wang S, Phillips RA III,
    Karbhari N, Hansen KD, Langmead B, Leek JT. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on the Sequence Read
    Archive. Genome Biol. 2016 Dec 30;17(1):266. https://doi.org/10.1186/s13059-016-1118-6
    • Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT,
    Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics.
    2017 Dec 15;33(24):4033-4040. https://doi.org/10.1093/bioinformatics/btw575

  4. Sequence Read Archive (SRA) growth
    [Figure: SRA size in terabases over time; series: Open access, Total; reference levels: 1 Pbp, 10 Pbp]
    Doubling annotations on the plot: 8 -> 16 Pbp in ~18 months; 4 -> 8 Pbp in ~12 months
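
The doubling annotations on this slide can be converted into annualized growth factors with a line of arithmetic. A minimal sketch, assuming steady exponential growth between the two endpoints; the function name is illustrative and not from the talk:

```python
def annual_growth_factor(start_pbp, end_pbp, months):
    """Growth factor per 12 months, assuming exponential growth over the span."""
    return (end_pbp / start_pbp) ** (12.0 / months)

# Doubling figures quoted on the slide: (start Pbp, end Pbp, months)
for start, end, months in [(8, 16, 18), (4, 8, 12)]:
    factor = annual_growth_factor(start, end, months)
    print(f"{start} -> {end} Pbp in ~{months} months: x{factor:.2f} per year")
```

Under that assumption, doubling in 12 months is a factor of 2 per year, while doubling in 18 months is a factor of about 1.59 per year.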

  5. "An index is a great leveler" (GB Shaw)
    "Even a summary would be an improvement" (Not GB Shaw)

  6. Aside: Indexing raw sequencing data
    • Mantis: Ferdman, M., Johnson, R., & Patro, R. Mantis: A Fast, Small, and Exact Large-Scale
      Sequence-Search Index. In Research in Computational Molecular Biology (p. 271). Springer.
    • BIGSI: Bradley, P., den Bakker, H., Rocha, E., McVean, G., & Iqbal, Z. (2017). Real-time search
      of all bacterial and viral genomic data. bioRxiv, 234955.
    • Sequence Bloom Trees: Solomon B, Kingsford C. Fast search of thousands of short-read
      sequencing experiments. Nat Biotechnol. 2016 Mar;34(3):300-2.
    • Split Sequence Bloom Trees: Solomon B, Kingsford C. Improved Search of Large Transcriptomic
      Sequencing Databases Using Split Sequence Bloom Trees. J Comput Biol. 2018 Mar 12.
    • AllSome Sequence Bloom Trees: Sun C, Harris RS, Chikhi R, Medvedev P. AllSome Sequence
      Bloom Trees. J Comput Biol. 2018 May;25(5):467-479.
    • 1000 Genomes FM Index: Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA,
      Keane TM. Using reference-free compressed data structures to analyze sequencing reads
      from thousands of human genomes. Genome Res. 2017 Feb;27(2):300-309.
    [Images from the Mantis paper and the Split SBT paper]

  7. Public summaries of sequencing data
    Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet.
    2018 Apr;19(4):208-219. doi: 10.1038/nrg.2017.113.
    Name           Website                            Notes
    ExAC / gnomAD  http://gnomad.broadinstitute.org   Non-REF alleles in aligned exomes/genomes
    Cistrome       http://cistrome.org/db/#/          Summarized ChIP-seq and DNase-seq; human & mouse
    SRAdb          https://github.com/seandavi/SRAdb  Queryable SRA metadata, updated regularly

  8. Search engine for RNA-seq
    • Snaptron: index & query engine w/ REST API
      snaptron.cs.jhu.edu
      doi:10.1093/bioinformatics/btx547
    • recount2: clean summaries of data, metadata, packaged as R objects
      jhubiostatistics.shinyapps.io/recount/
      doi:10.1038/nbt.3838
    • Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets
      rail.bio
      doi:10.1093/bioinformatics/btw575

  9. Abhinav Nellore, OHSU
    Jeff Leek, JHU
    http://rail.bio
    Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT,
    Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics. 2016 Sep 4.
    Image by Rgocs

  10. Junction-level summaries
    • Analyzed >50K human RNA-seq samples from SRA;
    trillions of reads, 100s of terabases
    [Figure: Junctions 1-4 (J1-J4) observed across Samples 1-3 (S1-S3), summarized as a count matrix]
          S1  S2  S3
    J1     0   0   2
    J2     0   1   1
    J3     0   1   1
    J4     8   1   1
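
The toy matrix on this slide is small enough to hold as nested lists, which makes the per-junction summaries concrete. A minimal sketch in plain Python; the variable names are illustrative, and the entries are assumed to be counts of reads supporting each junction. Data at the scale described here (tens of thousands of samples) would need a sparse or indexed representation rather than nested lists:

```python
junctions = ["J1", "J2", "J3", "J4"]
samples = ["S1", "S2", "S3"]

# Rows are junctions, columns are samples (values copied from the slide)
counts = [
    [0, 0, 2],
    [0, 1, 1],
    [0, 1, 1],
    [8, 1, 1],
]

# Number of samples in which each junction has at least one supporting read
samples_per_junction = [sum(1 for c in row if c > 0) for row in counts]
# Total count for each junction across all samples
reads_per_junction = [sum(row) for row in counts]

for jx, n_samples, n_reads in zip(junctions, samples_per_junction, reads_per_junction):
    print(f"{jx}: present in {n_samples} of {len(samples)} samples, total count {n_reads}")
```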

  11. Exercise 1
    • How often are the exon-exon junctions we detect also present in annotations like GENCODE?
                  S1  S2  S3  ...  S50,000
    J1             0   0   2  ...        0
    J2             0   1   1  ...        3
    J3             0   1   1  ...       40
    J4             8   1   1  ...        2
    ...
    J81,066,376    0  10   0  ...        0
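
One way to frame the exercise: for each threshold S, keep the junctions called in at least S samples and ask what fraction of them also appear in an annotation. A minimal sketch on made-up data; the junction coordinates, the annotation set, and the function name are all hypothetical, and the tutorial's own materials are at the code link on slide 2:

```python
def annotated_fraction(sample_counts, annotated, min_samples):
    """Among junctions called in >= min_samples samples, return
    (number kept, fraction of kept junctions present in the annotation)."""
    kept = [jx for jx, n in sample_counts.items() if n >= min_samples]
    if not kept:
        return 0, None
    n_annotated = sum(1 for jx in kept if jx in annotated)
    return len(kept), n_annotated / len(kept)

# Hypothetical junctions keyed by (chromosome, start, end, strand),
# mapped to the number of samples in which each was called
sample_counts = {
    ("chr1", 100, 200, "+"): 1,
    ("chr1", 300, 450, "+"): 2,
    ("chr2", 50, 900, "-"): 2,
    ("chr2", 1000, 1500, "-"): 3,
}
# Hypothetical set of junctions found in an annotation such as GENCODE
annotated = {("chr1", 300, 450, "+"), ("chr2", 1000, 1500, "-")}

for s in (1, 2, 3):
    kept, frac = annotated_fraction(sample_counts, annotated, s)
    print(f"S >= {s}: {kept} junctions kept, {frac:.0%} annotated")
```

Keying junctions by coordinates and strand makes the annotation lookup a set-membership test, which stays cheap even when the junction list is very long.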

  12. [Figure, panels a-c: junction (jx) count J versus minimum number S of samples in which the jx is called,
    with junctions classed as Novel, Alternative donor/acceptor, Exon skip, or Fully annotated.
    Values labeled on the plot: 18.6%, 56,861 jx, 100%, 96.5%, 81.4%, 85.8%]
    Annotation includes: UCSC, GENCODE v19 & v24,
    RefSeq, CCDS, MGC, lincRNAs, SIB genes, AceView, Vega
    http://intropolis.rail.bio
    Nellore A, et al. Human splicing diversity and the extent of
    unannotated splice junctions across human RNA-seq samples on
    the Sequence Read Archive. Genome Biol. 2016 Dec 30;17(1):266.

  13. Search engine for RNA-seq
    • Snaptron: index & query engine w/ REST API
      snaptron.cs.jhu.edu
      doi:10.1093/bioinformatics/btx547
    • recount2: clean summaries of data, metadata, packaged as R objects
      jhubiostatistics.shinyapps.io/recount/
      doi:10.1038/nbt.3838
    • Rail-RNA: scalable, cloud-based spliced alignment of archived RNA-seq datasets
      rail.bio
      doi:10.1093/bioinformatics/btw575

  14. recount2
    https://jhubiostatistics.shinyapps.io/recount/
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  15. recount2
    Summarized at levels of genes, exons, junctions, and coverage vectors
    [Figure panels: Junctions, Genes, Coverage, Exons]
    Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, Jaffe AE, Langmead B, Leek JT.
    Reproducible RNA-seq analysis using recount2. Nature Biotechnology. 2017 Apr 11;35(4):319-321.

  16. Exercise 2
    Adapted from the recount2 quick-start guide by Leo Collado-Torres
    http://bioconductor.org/packages/devel/bioc/vignettes/recount/inst/doc/recount-quickstart.html

  17. Search engine for RNA-seq
    Snaptron

  18. Snaptron
    Query planner delegates query components to
    appropriate systems (sqlite, tabix, lucene) and
    indexes (R-tree, B-tree, Lucene inverted text index)
    Chris Wilks
    [Figure: Snaptron architecture in three columns (Query, Data Store/Index, Output), with numbered example records.
    Query side: Region, Sample Filter, Snaptron Query Planner.
    Data stores/indexes: Tabix/R-tree Index, SQLite/B-tree Index, Lucene/Inverted Document Index.
    Records and outputs: Junction Records, Sample Metadata Records, Region Limited, Region Limited & Filtered,
    Filtered Region, Filtered Samples.
    Example sample-metadata index entries (Terms -> Samples): "Brain" -> 1,2,3,6; "Liver" -> 4,6,9,11]
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
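
The inverted index in the diagram maps metadata terms to the samples that mention them, so a sample filter reduces to set operations. A minimal sketch of that idea in plain Python, reusing the two example terms from the slide; it illustrates an inverted index in general, not Snaptron's Lucene-backed implementation, and the function name is made up:

```python
# Term -> set of sample IDs, using the slide's example entries
inverted_index = {
    "brain": {1, 2, 3, 6},
    "liver": {4, 6, 9, 11},
}

def filter_samples(index, terms, match_all=True):
    """Return sorted sample IDs matching all (AND) or any (OR) of the terms."""
    hits = [index.get(term.lower(), set()) for term in terms]
    if not hits:
        return []
    combined = set.intersection(*hits) if match_all else set.union(*hits)
    return sorted(combined)

print(filter_samples(inverted_index, ["Brain"]))                   # [1, 2, 3, 6]
print(filter_samples(inverted_index, ["Brain", "Liver"]))          # [6]
print(filter_samples(inverted_index, ["Brain", "Liver"], False))   # [1, 2, 3, 4, 6, 9, 11]
```

The resulting sample IDs could then be used to restrict junction records returned by a separate region lookup, which is the division of labor the query planner slide describes.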

  19. Snaptron
    Provides command-line tool and REST API for
    querying junctions (& more summaries coming soon)
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
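
A REST query is just a URL, so it can be assembled and inspected before anything is sent. A minimal sketch that builds, but does not send, a junction query. The host comes from the slides; the `/<compilation>/snaptron` path, the `srav2` compilation name, the `regions`, `rfilter` and `sfilter` parameter names, and the `samples_count>:100` filter syntax are assumptions here and should be checked against the Snaptron documentation:

```python
from urllib.parse import urlencode

BASE_URL = "http://snaptron.cs.jhu.edu"  # host named on the slides

def snaptron_query_url(compilation, region, rfilter=None, sfilter=None):
    """Build (but do not send) a junction query URL.

    Assumed, not taken from the slides: the '/<compilation>/snaptron' path
    and the 'regions', 'rfilter', 'sfilter' parameter names.
    """
    params = {"regions": region}
    if rfilter:
        params["rfilter"] = rfilter
    if sfilter:
        params["sfilter"] = sfilter
    # Keep ':', '<' and '>' readable instead of percent-encoding them
    query = urlencode(params, safe=":<>")
    return f"{BASE_URL}/{compilation}/snaptron?{query}"

# Hypothetical query: junctions in gene ABCD3 seen in more than 100 samples
print(snaptron_query_url("srav2", "ABCD3", rfilter="samples_count>:100"))
```

Keeping URL construction separate from the HTTP call makes the query easy to log, test, or paste into a browser.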

  20. Snaptron
    Examples:
    • For each junction in gene ABCD3, how many reads supported it in each of the 50K SRA samples?
    • What is a particular junction's tissue specificity in the GTEx dataset?
    • In which samples is splicing pattern A overrepresented relative to splicing pattern B?
      (A/B might relate to alt splicing, fusions, etc.)
    http://snaptron.cs.jhu.edu
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
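
The third question compares read support for two splicing patterns sample by sample. A minimal sketch of one normalized score, (A - B) / (A + B + 1), where A and B are per-sample read counts for the two patterns. The score, the function name, and the sample data are illustrative assumptions, not necessarily what Snaptron computes:

```python
def relative_support(cov_a, cov_b):
    """Score in (-1, 1): positive when pattern A has more reads than B.
    The +1 in the denominator damps scores for low-coverage samples."""
    return (cov_a - cov_b) / (cov_a + cov_b + 1)

# Hypothetical per-sample read counts for patterns A and B
samples = {
    "sample_1": (30, 2),
    "sample_2": (5, 5),
    "sample_3": (0, 19),
}

# Samples where A is most overrepresented relative to B come first
ranked = sorted(samples, key=lambda s: relative_support(*samples[s]), reverse=True)
for name in ranked:
    a, b = samples[name]
    print(f"{name}: A={a} B={b} score={relative_support(a, b):+.2f}")
```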

  21. Mini Snaptron case study
    • Goldstein et al. searched for novel cassette exons in Illumina BodyMap 2.0
    • Identified 249 cassette exons within known genes but not overlapping any annotated exon
    • Validated 216 of the 249 in an independent sample via paired-end RNA-seq (2 x 250 bp)
    Goldstein LD, Cao Y, Pau G, Lawrence M, Wu TD, Seshagiri S, Gentleman R. Prediction and
    Quantification of Splice Events from RNA-Seq Data. PLoS One. 2016 May 24;11(5):e0156132.

  22. Mini Snaptron case study
    [Figure: shared sample count (SSC, axis 0 to 20,000) by data compilation (GTEx, SRAv2) and
    validation status (Failed, Passed)]
    • Exons validated by Goldstein et al. had higher SSC than exons that failed validation
    • SSC (prevalence) is related to how "real" the exons are
    Wilks C, Gaddipati P, Nellore A, Langmead B. Snaptron: querying and visualizing splicing
    across tens of thousands of RNA-seq samples. Bioinformatics. 2017 Sep 1. btx547.
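
Shared sample count can be computed with a set intersection. A minimal sketch, assuming SSC for a candidate cassette exon is the number of samples in which both of its flanking junctions are observed; that definition is an assumption here and should be checked against the Snaptron paper, and the exon names and sample IDs are made up:

```python
def shared_sample_count(upstream_samples, downstream_samples):
    """Number of samples containing both junctions that flank an exon."""
    return len(set(upstream_samples) & set(downstream_samples))

# Hypothetical sample IDs in which each flanking junction was observed:
# exon name -> (samples with upstream junction, samples with downstream junction)
exon_junction_samples = {
    "exon_x": ([1, 2, 3, 5, 8], [2, 3, 8, 13]),
    "exon_y": ([4], [7, 9]),
}

for exon, (up, down) in exon_junction_samples.items():
    print(exon, shared_sample_count(up, down))
```

Under this definition, an exon whose two junctions never co-occur in a sample gets an SSC of 0, however often each junction appears on its own.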

  23. Future: public data
    Desire: for querying and using public data to be an everyday activity in bio research
    Is training keeping up? If not, can we fix it?
    [Article excerpt shown on slide; gaps in the excerpt are marked [...]]
    WORLD VIEW: A personal take on events
    Don’t let useful data go to waste
    Researchers must seek out others’ deposited biological sequences in community databases, urges Franziska Denk.

    One of the best ways for a neuroscientist like me to keep up to date with what colleagues are working on is to attend conferences. But on recent trips I have noticed a problem. Too few researchers are consulting and using publicly available data — my own included. What is going on?

    Massive amounts of biological information are being accumulated using high-throughput sequencing techniques. Many scientists have used some of those resources, such as the Encyclopedia of DNA Elements (ENCODE) launched by the US National Human Genome Research Institute. But many more laboratories in neuroscience and other subdisciplines of cell and molecular biology generate their own data sets. These data are piling up in community databases and offer information on gene expression and regulation. Unless this information is used, it is wasted.

    For instance, I study brain cells thought to be important for the maintenance of chronic pain. Called microglia, these cells are also investigated by immunologists interested in the cells’ role in, say, multiple sclerosis. Together, these results provide a full profile of which genes these cells express. [...] discrepancy, and propose a biologically valid reason for it.

    Why are so many bench biologists overlooking this wealth of cell-type-specific expression data?

    My hunch is there are two reasons. First, researchers underestimate how many of these data have been published over the past few years because they are being generated across so many different fields. Second, they are wary of the data. Because you need bioinformatics knowledge to generate and analyse sequencing results, people assume that they also need such expertise to locate and interpret them.

    Not so. In the past five years, improvements in technology, together with stricter deposition guidelines, mean that simple Excel files commonly accompany papers. These can be downloaded in minutes from the Supplementary Information of a relevant paper, or from the ‘GEO Datasets’ tab on the NCBI website using search terms. It is like PubMed for spreadsheets. They require minimal knowledge to browse.

    It is often difficult to share big data in science. Sequencing data are fairly unusual, in that it is easy to standardize, display and judge them from the outside. This is not the case for many other kinds of scientific output. For instance, [...]

    Pull quote: "TAKING NO NOTICE OF DEPOSITED DATA IS AKIN TO [...]"
    Illustration credit: Meghna Abraham

  24. Future: public data
    Spectrum of scale: single accession or study <-> all of SRA
    With public data we are quickly confronted by issues like technical confounding and missing/incorrect metadata
    How do we know what questions can be answered robustly at what points on the spectrum?
    Can we "fix" metadata?
    Ellis SE, Collado-Torres L, Jaffe A, Leek JT.
    Improving the value of public RNA-seq
    expression data by phenotype prediction.
    Nucleic Acids Res. 2018 May 18;46(9):e54.

  25. Thank you:
    Jeff Leek, Jacob Pritt, Abhinav Nellore, Kasper Hansen, Leo Collado-Torres, Chris Wilks, Andrew Jaffe,
    José Alquicira-Hernández, Jamie Morton, Kai Kammers, Shannon Ellis, Margaret Taub, Jonathan Ling,
    Seth Blackshaw
    • NIH R01GM118568
    • NSF CAREER IIS-1349906
    • Sloan Research Fellowship
    • IDIES Seed Funding program
    • Amazon Web Services
    • NIH R01GM105705 (Leek)
    [Logos: IDIES Seed funding, SciServer, SciServer Compute]
    langmead-lab.org, @BenLangmead
