Human RNA-seq data from recount2 and related packages

Slides for the recountWorkshop2020 workshop at the #BioC2020 conference: https://bioc2020.bioconductor.org/

Leonardo Collado-Torres

July 27, 2020

Transcript

  1. Human RNA-seq data from recount2 and
    related packages
    Leonardo Collado-Torres
    @fellgernon @LieberInstitute
    #BioC2020

  2. https://jhubiostatistics.shinyapps.io/recount/

  3. http://rail.bio/
    Slide adapted from Ben Langmead
    by Abhinav Nellore

  4. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  5. GTEx TCGA
    slide adapted from Shannon Ellis

  6. SRA

  7. [Figure: reads and coverage along a gene (exons 1 to 4; junctions jx 1 to jx 6;
    isoforms 1 and 2; potential isoform 3; expressed region 1: potential exon 5)]
    doi.org/10.12688/f1000research.12223.1

  8. doi.org/10.12688/f1000research.12223.1

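  9. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')  # download the gene-level data
    > load(file.path('ERP001942', 'rse_gene.Rdata'))  # loads the object rse_gene
    > rse <- scale_counts(rse_gene)  # scale the coverage counts
    https://github.com/leekgroup/recount-analyses/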

  10. slide adapted from Jeff Leek

  11. related projects
    • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1
    • Shannon Ellis & Leek: phenotype prediction doi.org/10.1093/nar/gky102
    • Jack Fu & Taub: transcript estimations doi.org/10.1101/247346
    • Madugundu & Pandey (JHU):
    proteomics doi.org/10.1002/pmic.201800315
    • Luidi-Imada & Marchionni (JHU):
    FANTOM (non-coding) and cancer doi.org/10.1101/gr.254656.119
    • Kuri-Magaña & Martínez-Barnetche (INSP Mexico):
    immune expression doi.org/10.3389/fimmu.2018.02679
    • Ryten (UCL):
    Guelfi: validate expressed region (ER) eQTLs doi.org/10.1038/s41467-020-14483-x
    Zhang: improving the detection of ERs doi.org/10.1126/sciadv.aay8299

  12. related projects (same list as slide 11), annotated with:
    docs, other annotations, custom regions, methods

  13. NEW as of BioC 3.11 (2020):
    http://bioconductor.org/packages/snapcount/

  14. expression data for ~70,000 human samples
    samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848
    expression estimates: gene, exon, junctions, ERs
    phenotypes: ?
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis

  15. Category   Frequency
    F                 95
    female         2,036
    Female            51
    M                 77
    male           1,240
    Male             141
    Total          3,640
    Even when information is provided, it’s not always clear…
    sra_meta$Sex: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”,
    “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”,
    “Not available”, “not applicable”, “not collected”, “not determined”,
    “pooled male and female”, “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
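
    The frequency table on this slide shows why reported metadata needs
    harmonizing before it can be used. A minimal base R sketch, assuming a
    hypothetical helper `harmonize_sex()` (not part of recount); it only
    collapses spelling variants and is not the expression-based prediction
    described on the following slides:

    ```r
    # Hypothetical helper: collapse case and abbreviation variants of reported sex.
    # Anything ambiguous (e.g. "mixed", "N/A", "1 Male, 2 Female") becomes NA.
    harmonize_sex <- function(x) {
        clean <- tolower(trimws(x))
        out <- rep(NA_character_, length(x))
        out[clean %in% c("f", "female")] <- "female"
        out[clean %in% c("m", "male")] <- "male"
        out
    }

    # Frequencies transcribed from the slide's table
    slide_counts <- c(F = 95, female = 2036, Female = 51,
        M = 77, male = 1240, Male = 141)

    # Sum the frequencies that map to the same harmonized label
    collapsed <- sapply(split(slide_counts, harmonize_sex(names(slide_counts))), sum)
    collapsed

    # Ambiguous free-text values from the slide are left as NA
    ambiguous <- harmonize_sex(c("pooled male and female", "N/A", "unknown"))
    ambiguous
    ```

    Six spellings collapse into two labels, while values that cannot be
    resolved from the text alone stay missing.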

  16. Goal: to accurately predict critical phenotype information for all samples in recount
    (gene, exon, exon-exon junction and expressed region RNA-Seq data)
    Data sets:
    • SRA (Sequence Read Archive), N=49,848
    • GTEx (Genotype-Tissue Expression Project), N=9,662
    • TCGA (The Cancer Genome Atlas), N=11,284
    Workflow:
    • divide GTEx samples into a training set and a test set
    • build and optimize phenotype predictor (training set)
    • test accuracy of predictor (test set)
    • predict phenotypes across samples in TCGA
    • predict phenotypes across SRA samples
    slide adapted from Shannon Ellis

  17. Sex prediction is accurate across data sets
    Number of Regions: 20, 20, 20, 20
    Number of Samples (N): 4,769, 4,769, 11,245, 3,640
    Accuracy: 99.8%, 99.6%, 99.4%, 88.5%
    slide adapted from Shannon Ellis

  18. Prediction is more accurate in healthy tissue
    Number of Regions: 589, 589, 589, 589, 589
    Number of Samples (N): 4,769, 4,769, 613, 6,579, 8,951
    Percentages shown: 97.3%, 96.5%, 91.0%, 70.2%, 50.6%
    slide adapted from Shannon Ellis

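  19. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')  # download the gene-level data
    > load(file.path('ERP001942', 'rse_gene.Rdata'))  # loads the object rse_gene
    > rse <- scale_counts(rse_gene)  # scale the coverage counts
    > rse_with_pred <- add_predictions(rse_gene)  # add the predicted phenotypes
    https://github.com/leekgroup/recount-analyses/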

  20. [Figure: reported vs. predicted tissue (axes: reported, predicted)]
    Tissues: adipose tissue, adrenal gland, bladder, blood, blood vessel, bone,
    bone marrow, brain, breast, cervix, cervix uteri, colon, epithelium, esophagus,
    fallopian tube, heart, intestine, kidney, liver, lung, melanoma, monocytes,
    muscle, nerve, ovary, pancreas, penis, pituitary, placenta, prostate,
    salivary gland, skin, small intestine, spinal cord, spleen, stem cell, stomach,
    testis, thyroid, tonsil, umbilical cord, urinary bladder, uterus, vagina

  21. • 62 SRA studies
    • 4,431 rows by 48 columns
    Ashkaun Razmara, et al doi.org/10.1101/618025

  22. Sex: Female, Male
    Age/Development: Fetus, Child, Adolescent, Adult
    Race/Ethnicity: Asian, Black, Hispanic, White
    Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
    Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
    Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
    Hemisphere: Left, Right
    Brodmann Area: 1-52
    Disease Status: Disease, Neurological control
    Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
    Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
    Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
    Clinical Stage 2: Primary, Secondary, Recurrent
    Viability: Postmortem, Biopsy
    Preparation: Frozen, Thawed
    Ashkaun Razmara, et al doi.org/10.1101/618025

  24. The recount-brain team
    Hopkins: Ashkaun Razmara, Shannon E. Ellis, Jeff T. Leek
    University of Toronto: Dustin J. Sokolowski, Michael D. Wilson
    NIH: Sean Davis
    LIBD: Andrew E. Jaffe
    Funding: NIH R01 GM105705, NIH 1R21MH109956, NIH R01 GM121459, CIHR, NSERC,
    Ontario Ministry of Research
    Hosting recount2: IDIES SciServer
    github.com/LieberInstitute/recount-brain

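  26. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')  # download the gene-level data
    > load(file.path('ERP001942', 'rse_gene.Rdata'))  # loads the object rse_gene
    > rse <- scale_counts(rse_gene)  # scale the coverage counts
    https://github.com/leekgroup/recount-analyses/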

  27. Reference genome
    Reads

  29. [Figure: reads and coverage along a gene (exons 1 to 4; junctions jx 1 to jx 6;
    isoforms 1 and 2; potential isoform 3; expressed region 1: potential exon 5)]

  31. exon 1 exon 2
    exon 3

  32. disjoint exon 1
    disjoint exon 2 disjoint exon 3
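
    Slides 31 and 32 contrast overlapping exons with disjoint exons. A minimal
    base R sketch of that idea, assuming made-up coordinates and a hypothetical
    helper `disjoin_intervals()`; in Bioconductor the equivalent operation is
    provided by `GenomicRanges::disjoin()`:

    ```r
    # Hypothetical helper: split overlapping closed intervals [start, end] into
    # non-overlapping pieces, breaking at every interval start and end.
    disjoin_intervals <- function(start, end) {
        # Candidate piece starts: every start, and every position just past an end
        breaks <- sort(unique(c(start, end + 1)))
        piece_start <- breaks[-length(breaks)]
        piece_end <- breaks[-1] - 1
        # Keep only pieces that fall inside at least one original interval
        covered <- vapply(seq_along(piece_start), function(i) {
            any(start <= piece_start[i] & end >= piece_end[i])
        }, logical(1))
        data.frame(start = piece_start[covered], end = piece_end[covered])
    }

    # Made-up coordinates: the third exon overlaps the end of the first one
    exons <- data.frame(start = c(1, 21, 6), end = c(10, 30, 15))
    disjoint <- disjoin_intervals(exons$start, exons$end)
    disjoint
    ```

    Each base now belongs to exactly one piece, so coverage summed per piece is
    never counted twice.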

  36. [Figure: coverage along the genome (genome axis ticks at 5, 10, 15; coverage axis 0 to 5)]
    Coverage per base: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1
    AUC = area under coverage = 45
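
    The AUC on this slide is the sum of the per-base coverage values. A minimal
    base R sketch; the region (positions 3 to 7) and `target_size` are
    hypothetical values chosen for illustration, and the scaling step only
    illustrates the idea of scaling by AUC rather than reproducing
    `scale_counts()`:

    ```r
    # Per-base coverage values transcribed from the slide
    coverage <- c(3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1)

    # AUC (area under coverage): the sum of the per-base coverage
    auc <- sum(coverage)
    auc

    # Coverage count for a hypothetical region spanning positions 3 to 7
    region_count <- sum(coverage[3:7])

    # Scale the region count so that the whole sample would have an AUC equal
    # to a hypothetical target size
    target_size <- 90
    scaled_count <- region_count * target_size / auc
    scaled_count
    ```

    Scaling by AUC puts samples with different sequencing depths on a common
    scale before comparing counts.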

  41. [Figure: reads and coverage along a gene (exons 1 to 4; junctions jx 1 to jx 6;
    isoforms 1 and 2; potential isoform 3; expressed region 1: potential exon 5)]

  42. Collado-Torres et al, NAR, 2017

  43. Postmortem Human Brain Samples
    Discovery data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
    Replication data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
    Jaffe et al, Nat. Neuroscience, 2015

  44. Jaffe et al, Nat. Neuroscience, 2015

  45. BrainSpan data
    Jaffe et al, Nat. Neuroscience, 2015

  46. Since 2014

  47. NEW since BioC 3.10 (2019)

  48. Collaborators
    UCSD: Shannon Ellis
    Hopkins: Jeff Leek, Ben Langmead, Christopher Wilks, Kai Kammers, Kasper Hansen,
    Margaret Taub
    OHSU: Abhinav Nellore
    LIBD: Andrew Jaffe
    Funding: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education,
    Seven Bridges, IDIES SciServer

  49. expression data for ~70,000 human samples
    (multiple) positions available
    This project involves the Hansen, Leek, Langmead and Battle labs at JHU & the
    Nellore lab at OHSU & the Jaffe lab at LIBD
    Contact:
    • Kasper D. Hansen www.hansenlab.org
    • Jeff Leek jtleek.com/
    • Ben Langmead www.langmead-lab.org/
    • Alexis Battle battlelab.jhu.edu/
    • Abhinav Nellore nellore.bio/
    • Andrew Jaffe aejaffe.com/ + Leonardo Collado-Torres lcolladotor.github.io/

  50. help(package = recountWorkshop2020)
    vignette('recount-workshop', 'recountWorkshop2020')
    Leonardo Collado-Torres
    @fellgernon
    #BioC2020

  52. recount2 issues
    • Rail-RNA is hard to run outside a Hadoop cluster
    • No major updates & human-only: “where is my favorite dataset?”
    • Requires a lot of manual post-alignment work: hard to auto-update
    • Annotation choice (Gencode v25) is engrained in the R files

  53. Issue: overlapping genes
    doi.org/10.1101/gr.254656.119

  54. The future: recount3
    • Different aligner: still scalable
    • Mouse & human: studies + collections
    • Can be auto-updated (hopefully!)
    • Several annotation choices included
    • Faster tools for re-annotation quantification (faster than rtracklayer, bwtool, …)
    • R interface is more flexible: builds RangedSummarizedExperiment objects on
    the fly
    Coming to your nearest Bioconductor mirror in 2020!

  56. Questions?
    Got more?
    • Public questions >>>>> email (help others with similar questions to yours, thx!)
    • recount package: https://support.bioconductor.org/t/recount
    • recount2 feature requests: https://github.com/leekgroup/recount/issues
    Project    R package   Year
    ReCount    none        2011
    recount2   recount     2017
    recount3   recount3    2020
