Human RNA-seq data from recount2 and related packages

Slides for the recountWorkshop2020 for the #BioC2020 conference https://bioc2020.bioconductor.org/

Leonardo Collado-Torres

July 27, 2020

Transcript

  1. Human RNA-seq data from recount2 and related packages
    Leonardo Collado-Torres @fellgernon @LieberInstitute #BioC2020
  2. SRA

  3. [Figure: reads and coverage along a gene with isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; exon-exon junctions (jx) 1 to 6; expressed region 1: potential exon 5.]
    doi.org/10.12688/f1000research.12223.1
  4. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  5. related projects
    • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1
    • Shannon Ellis & Leek: phenotype prediction doi.org/10.1093/nar/gky102
    • Jack Fu & Taub: transcript estimations doi.org/10.1101/247346
    • Madugundu & Pandey (JHU): proteomics doi.org/10.1002/pmic.201800315
    • Luidi-Imada & Marchionni (JHU): FANTOM (non-coding) and cancer doi.org/10.1101/gr.254656.119
    • Kuri-Magaña & Martínez-Barnetche (INSP Mexico): immune expression doi.org/10.3389/fimmu.2018.02679
    • Ryten (UCL):
      Guelfi: validate expressed region (ER) eQTLs doi.org/10.1038/s41467-020-14483-x
      Zhang: improving the detection of ERs doi.org/10.1126/sciadv.aay8299
  6. related projects (same list as on slide 5), with added labels: docs Other annotations Custom regions methods
  7. expression data for ~70,000 human samples
    GTEx N=9,662, TCGA N=11,284, SRA N=49,848
    [Diagram: samples by expression estimates (gene, exon, junctions, ERs); samples by phenotypes (?).]
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis
  8. Category  Frequency
    F         95
    female    2036
    Female    51
    M         77
    male      1240
    Male      141
    Total     3640

    Even when information is provided, it’s not always clear…
    sra_meta$Sex
    “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
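The inconsistent spellings on this slide can be harmonized before comparing reported and predicted sex. The following is a minimal base R sketch, not code from the slides or from the recount package: `normalize_sex()` is a hypothetical helper, and mapping every pooled or ambiguous label to `NA` is an assumption made for illustration.

```r
# Hypothetical helper (not part of recount): collapse free-text sex labels
# into "female", "male", or NA for anything pooled, missing or ambiguous.
normalize_sex <- function(x) {
    cleaned <- tolower(trimws(x))
    out <- rep(NA_character_, length(cleaned))
    out[cleaned %in% c("f", "female")] <- "female"
    out[cleaned %in% c("m", "male")] <- "male"
    out
}

# Rebuild the label vector from the frequencies shown on the slide.
labels <- rep(
    c("F", "female", "Female", "M", "male", "Male"),
    times = c(95, 2036, 51, 77, 1240, 141)
)
harmonized <- normalize_sex(labels)
print(table(harmonized, useNA = "ifany"))

# Labels that cannot be assigned to a single sex stay NA.
print(normalize_sex(c("mixed", "N/A", "pooled male and female", "Unknown")))
```

With the frequencies from the table, the six spellings collapse to 2,182 female and 1,458 male samples, which still sum to 3,640.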
  9. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    [Diagram:]
    • GTEx (Genotype Tissue Expression Project, N=9,662): divide samples
      training set: build and optimize phenotype predictor
      test set: test accuracy of predictor
    • TCGA (The Cancer Genome Atlas, N=11,284): predict phenotypes across samples in TCGA
    • SRA (Sequence Read Archive, N=49,848): predict phenotypes across SRA samples
    slide adapted from Shannon Ellis
  10. Sex prediction is accurate across data sets
    [Figure: four data sets, in plot order.]
    Number of Regions: 20, 20, 20, 20
    Number of Samples (N): 4,769; 4,769; 11,245; 3,640
    Accuracy: 99.8%, 99.6%, 99.4%, 88.5%
    slide adapted from Shannon Ellis
  11. Prediction is more accurate in healthy tissue
    [Figure: five data sets, in plot order.]
    Number of Regions: 589, 589, 589, 589, 589
    Number of Samples (N): 4,769; 4,769; 613; 6,579; 8,951
    Accuracy: 97.3%, 96.5%, 91.0%, 70.2%, 50.6%
    slide adapted from Shannon Ellis
  12. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    > rse_with_pred <- add_predictions(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  13. [Figure: counts by tissue, reported vs. predicted. Tissues: adipose tissue, adrenal gland, bladder, blood, blood vessel, bone, bone marrow, brain, breast, cervix, cervix uteri, colon, epithelium, esophagus, fallopian tube, heart, intestine, kidney, liver, lung, melanoma, monocytes, muscle, nerve, ovary, pancreas, penis, pituitary, placenta, prostate, salivary gland, skin, small intestine, spinal cord, spleen, stem cell, stomach, testis, thyroid, tonsil, umbilical cord, urinary bladder, uterus, vagina.]
  14. • 62 SRA studies
    • 4,431 rows by 48 columns
    Ashkaun Razmara, et al doi.org/10.1101/618025
  15. Sex: Female, Male
    Age/Development: Fetus, Child, Adolescent, Adult
    Race/Ethnicity: Asian, Black, Hispanic, White
    Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
    Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
    Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
    Hemisphere: Left, Right
    Brodmann Area: 1-52
    Disease Status: Disease, Neurological control
    Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
    Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
    Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
    Clinical Stage 2: Primary, Secondary, Recurrent
    Viability: Postmortem, Biopsy
    Preparation: Frozen, Thawed
    Ashkaun Razmara, et al doi.org/10.1101/618025
  16. The recount-brain team
    Hopkins: Ashkaun Razmara, Shannon E. Ellis, Jeff T. Leek
    University of Toronto: Dustin J. Sokolowski, Michael D. Wilson
    NIH: Sean Davis
    LIBD: Andrew E. Jaffe
    Funding: NIH R01 GM105705; NIH 1R21MH109956; NIH R01 GM121459; CIHR, NSERC; Ontario Ministry of Research
    IDIES SciServer
    Hosting recount2
    github.com/LieberInstitute/recount-brain
  17. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  18. [Figure: reads and coverage along a gene with isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; exon-exon junctions (jx) 1 to 6; expressed region 1: potential exon 5.]
  19. [Figure: coverage (y-axis, 0 to 5) along the genome (x-axis, 16 bases). Per-base coverage: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1.]
    AUC = area under coverage = 45
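The AUC on this slide is the sum of the per-base coverage values. A minimal base R sketch follows. The coverage vector is copied from the slide, while `scale_by_auc()`, the five-base region and the target AUC of 90 are hypothetical and only illustrate the general idea of rescaling coverage-based counts to a common AUC. It is not the implementation of `scale_counts()` in recount.

```r
# Per-base coverage values from the slide (16 bases).
coverage <- c(3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1)

# AUC (area under coverage): coverage summed over all bases.
auc <- sum(coverage)
print(auc)

# Hypothetical helper (an assumption, not a recount function): rescale a
# coverage-based count as if the sample had an AUC equal to target_auc.
scale_by_auc <- function(count, auc, target_auc) {
    count * target_auc / auc
}

# Coverage-based count for a hypothetical region spanning bases 1 to 5.
region_count <- sum(coverage[1:5])

# Rescale to a hypothetical target AUC of 90 (twice the observed AUC).
scaled_count <- scale_by_auc(region_count, auc, target_auc = 90)
print(scaled_count)
```

Because the target AUC is twice the observed AUC of 45, the region count doubles from 19 to 38.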
  20. [Figure: reads and coverage along a gene with isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; exon-exon junctions (jx) 1 to 6; expressed region 1: potential exon 5.]
  21. Postmortem Human Brain Samples
    Discovery data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
    Replication data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
    Jaffe et al, Nat. Neuroscience, 2015
  22. Collaborators
    UCSD: Shannon Ellis
    Hopkins: Jeff Leek, Ben Langmead, Christopher Wilks, Kai Kammers, Kasper Hansen, Margaret Taub
    OHSU: Abhinav Nellore
    LIBD: Andrew Jaffe
    Funding: NIH R01 GM105705; NIH 1R21MH109956; CONACyT 351535; AWS in Education; Seven Bridges; IDIES SciServer
  23. expression data for ~70,000 human samples
    (multiple) positions available
    This project involves the Hansen, Leek, Langmead and Battle labs at JHU & the Nellore lab at OHSU & the Jaffe lab at LIBD
    Contact:
    • Kasper D. Hansen www.hansenlab.org
    • Jeff Leek jtleek.com/
    • Ben Langmead www.langmead-lab.org/
    • Alexis Battle battlelab.jhu.edu/
    • Abhinav Nellore nellore.bio/
    • Andrew Jaffe aejaffe.com/
    + Leonardo Collado-Torres lcolladotor.github.io/
  24. recount2 issues
    • Rail-RNA is hard to run outside a Hadoop cluster
    • No major updates & human-only: “where is my favorite dataset?”
    • Requires a lot of manual post-alignment work: hard to auto-update
    • Annotation choice (Gencode v25) is engrained in the R files
  25. The future: recount3
    • Different aligner: still scalable
    • Mouse & human: studies + collections
    • Can be auto-updated (hopefully!)
    • Several annotation choices included
    • Faster tools for re-annotation quantification (faster than rtracklayer, bwtool, …)
    • R interface is more flexible: builds RangedSummarizedExperiment objects on the fly
    Coming to your nearest Bioconductor mirror in 2020!
  26. Questions? Got more?
    • Public questions >>>>> email (help others with similar questions to yours, thx!)
    • recount package: https://support.bioconductor.org/t/recount
    • recount2 feature requests: https://github.com/leekgroup/recount/issues

    Project   R package  Year
    ReCount   none       2011
    recount2  recount    2017
    recount3  recount3   2020