Human RNA-seq data from recount2 and related packages

Slides for the recountWorkshop2020 workshop at the #BioC2020 conference: https://bioc2020.bioconductor.org/

Leonardo Collado-Torres

July 27, 2020

Transcript

  1. Human RNA-seq data from recount2 and related packages
    Leonardo Collado-Torres @fellgernon @LieberInstitute #BioC2020
  2. https://jhubiostatistics.shinyapps.io/recount/

  3. http://rail.bio/ Slide adapted from Ben Langmead by Abhinav Nellore

  4. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  5. GTEx TCGA slide adapted from Shannon Ellis

  6. SRA

  7. [Figure: reads and coverage along a gene with exons 1 to 4, isoform 1, isoform 2 and a potential isoform 3; exon-exon junctions jx 1 to jx 6; expressed region 1 marks a potential exon 5.] doi.org/10.12688/f1000research.12223.1
  8. doi.org/10.12688/f1000research.12223.1

  9. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  10. slide adapted from Jeff Leek

  11. related projects
    • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1
    • Shannon Ellis & Leek: phenotype prediction doi.org/10.1093/nar/gky102
    • Jack Fu & Taub: transcript estimations doi.org/10.1101/247346
    • Madugundu & Pandey (JHU): proteomics doi.org/10.1002/pmic.201800315
    • Luidi-Imada & Marchionni (JHU): FANTOM (non-coding) and cancer doi.org/10.1101/gr.254656.119
    • Kuri-Magaña & Martínez-Barnetche (INSP Mexico): immune expression doi.org/10.3389/fimmu.2018.02679
    • Ryten (UCL):
      Guelfi: validate expressed region (ER) eQTLs doi.org/10.1038/s41467-020-14483-x
      Zhang: improving the detection of ERs doi.org/10.1126/sciadv.aay8299
  12. related projects
    • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1
    • Shannon Ellis & Leek: phenotype prediction doi.org/10.1093/nar/gky102
    • Jack Fu & Taub: transcript estimations doi.org/10.1101/247346
    • Madugundu & Pandey (JHU): proteomics doi.org/10.1002/pmic.201800315
    • Luidi-Imada & Marchionni (JHU): FANTOM (non-coding) and cancer doi.org/10.1101/gr.254656.119
    • Kuri-Magaña & Martínez-Barnetche (INSP Mexico): immune expression doi.org/10.3389/fimmu.2018.02679
    • Ryten (UCL):
      Guelfi: validate expressed region (ER) eQTLs doi.org/10.1038/s41467-020-14483-x
      Zhang: improving the detection of ERs doi.org/10.1126/sciadv.aay8299
    Overlay labels: docs, Other annotations, Custom regions, methods
  13. NEW as of BioC 3.11 (2020): http://bioconductor.org/packages/snapcount/

  14. expression data for ~70,000 human samples
    • samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848
    • expression estimates: gene, exon, junctions, ERs
    • phenotypes: ?
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis
  15. Category / Frequency:
    F         95
    female    2036
    Female    51
    M         77
    male      1240
    Male      141
    Total     3640
    Even when information is provided, it’s not always clear… Values found in sra_meta$Sex: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
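The inconsistent labels in the frequency table on this slide can be collapsed with a small normalization step. The following is a minimal base R sketch, not code from recount: `normalize_sex` is a hypothetical helper and its mapping rules are an assumption. Ambiguous values are deliberately left as `NA`.

```r
# Reported sex labels and their frequencies, copied from the slide.
reported <- c(F = 95, female = 2036, Female = 51, M = 77, male = 1240, Male = 141)

# Hypothetical helper (not part of recount): map free-text labels to
# "female" or "male", and to NA when the label is ambiguous or missing.
normalize_sex <- function(x) {
    x <- tolower(trimws(x))
    out <- rep(NA_character_, length(x))
    out[x %in% c("f", "female")] <- "female"
    out[x %in% c("m", "male")] <- "male"
    out
}

# Sum the frequencies within each normalized category.
collapsed <- tapply(reported, normalize_sex(names(reported)), sum)
print(collapsed)

# Labels such as "mixed" or "DK" stay NA rather than being guessed.
print(normalize_sex(c("Male", "mixed", "DK", "U")))
```

With the slide's numbers this gives 2182 female and 1458 male samples, which add up to the reported total of 3640.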
  16. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data)
    • GTEx (Genotype Tissue Expression Project), N=9,662: divide samples into a training set (build and optimize phenotype predictor) and a test set (test accuracy of predictor)
    • SRA (Sequence Read Archive), N=49,848: predict phenotypes across SRA samples
    • TCGA (The Cancer Genome Atlas), N=11,284: predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis
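The divide / build / test workflow on this slide can be illustrated with a toy example. The sketch below uses simulated data and a single-threshold classifier; it is an assumption-laden illustration of a train/test split and is not the method used by the phenotype prediction project. All values, including the hypothetical Y-linked region, are made up for the example.

```r
set.seed(2020)

# Simulated data: 200 samples with a known sex label and the coverage of
# one hypothetical Y-linked region (near zero in females).
n <- 200
sex <- sample(c("female", "male"), n, replace = TRUE)
expr <- ifelse(sex == "male",
    rnorm(n, mean = 5, sd = 1),
    rnorm(n, mean = 0.5, sd = 0.5)
)

# Divide samples: half for training, half for testing.
train <- sample(seq_len(n), n / 2)
test <- setdiff(seq_len(n), train)

# Build the predictor on the training set only: a cutoff halfway between
# the mean coverage of the two classes.
cutoff <- mean(tapply(expr[train], sex[train], mean))
predict_sex <- function(x) ifelse(x > cutoff, "male", "female")

# Test the accuracy of the predictor on held-out samples.
accuracy <- mean(predict_sex(expr[test]) == sex[test])
print(accuracy)
```

The key design point is that the cutoff is chosen without looking at the test samples, so the reported accuracy estimates how the predictor behaves on new data.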
  17. Sex prediction is accurate across data sets
    Number of Regions:       20      20      20      20
    Number of Samples (N):   4,769   4,769   11,245  3,640
    Accuracy:                99.8%   99.6%   99.4%   88.5%
    slide adapted from Shannon Ellis
  18. Prediction is more accurate in healthy tissue
    Number of Regions:       589     589     589     589     589
    Number of Samples (N):   4,769   4,769   613     6,579   8,951
    Accuracy:                97.3%   96.5%   91.0%   70.2%   50.6%
    slide adapted from Shannon Ellis
  19. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    > rse_with_pred <- add_predictions(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  20. [Figure: number of samples per tissue, reported vs predicted. Tissues: adipose tissue, adrenal gland, bladder, blood, blood vessel, bone, bone marrow, brain, breast, cervix, cervix uteri, colon, epithelium, esophagus, fallopian tube, heart, intestine, kidney, liver, lung, melanoma, monocytes, muscle, nerve, ovary, pancreas, penis, pituitary, placenta, prostate, salivary gland, skin, small intestine, spinal cord, spleen, stem cell, stomach, testis, thyroid, tonsil, umbilical cord, urinary bladder, uterus, vagina.]
  21. • 62 SRA studies
    • 4,431 rows by 48 columns
    Ashkaun Razmara, et al doi.org/10.1101/618025
  22. • Sex: Female, Male
    • Age/Development: Fetus, Child, Adolescent, Adult
    • Race/Ethnicity: Asian, Black, Hispanic, White
    • Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
    • Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
    • Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
    • Hemisphere: Left, Right
    • Brodmann Area: 1-52
    • Disease Status: Disease, Neurological control
    • Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
    • Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
    • Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
    • Clinical Stage 2: Primary, Secondary, Recurrent
    • Viability: Postmortem, Biopsy
    • Preparation: Frozen, Thawed
    Ashkaun Razmara, et al doi.org/10.1101/618025
  24. The recount-brain team
    • Hopkins: Ashkaun Razmara, Shannon E. Ellis, Jeff T. Leek
    • University of Toronto: Dustin J. Sokolowski, Michael D. Wilson
    • NIH: Sean Davis
    • LIBD: Andrew E. Jaffe
    Funding: NIH R01 GM105705, NIH 1R21MH109956, NIH R01 GM121459, CIHR, NSERC, Ontario Ministry of Research
    Hosting recount2: IDIES SciServer
    github.com/LieberInstitute/recount-brain
  26. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  27. Reference genome Reads

  29. [Figure: reads and coverage along a gene with exons 1 to 4, isoform 1, isoform 2 and a potential isoform 3; exon-exon junctions jx 1 to jx 6; expressed region 1 marks a potential exon 5.]
  31. exon 1 exon 2 exon 3

  32. disjoint exon 1 disjoint exon 2 disjoint exon 3
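Slides 31 and 32 contrast annotated exons with disjoint exons, that is, non-overlapping pieces in which each base belongs to exactly one piece. In Bioconductor this is typically done with `disjoin()` from GenomicRanges; the base R sketch below is only a toy version of the idea. `disjoin_intervals` is a hypothetical helper and the exon coordinates are made up.

```r
# Three hypothetical exons; the first two overlap on positions 5 to 10.
exons <- data.frame(start = c(1, 5, 20), end = c(10, 10, 30))

# Hypothetical helper: split intervals into non-overlapping pieces.
disjoin_intervals <- function(start, end) {
    # Candidate breakpoints: every start, and the position after every end.
    breaks <- sort(unique(c(start, end + 1)))
    seg_start <- head(breaks, -1)
    seg_end <- tail(breaks, -1) - 1
    # Keep only segments that lie inside at least one input interval,
    # which drops the gaps between exons.
    covered <- vapply(seq_along(seg_start), function(i) {
        any(start <= seg_start[i] & end >= seg_end[i])
    }, logical(1))
    data.frame(start = seg_start[covered], end = seg_end[covered])
}

disjoint <- disjoin_intervals(exons$start, exons$end)
print(disjoint)
```

Here the overlapping exons 1-10 and 5-10 become the pieces 1-4 and 5-10, while exon 20-30 is unchanged, so coverage over a shared base is not counted twice when summing per piece.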

  36. [Figure: per-base coverage along the genome; x-axis: Genome, y-axis: Coverage.]
    Coverage per base: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1
    AUC = area under coverage = 45
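The AUC on this slide is the sum of the per-base coverage values. A minimal base R sketch of that arithmetic follows; the feature coordinates and the target AUC are hypothetical values chosen for illustration and are not the defaults of `scale_counts()`.

```r
# Per-base coverage for the 16 genome positions shown on the slide.
coverage <- c(3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1)

# AUC = area under coverage = sum of the coverage over all bases.
auc <- sum(coverage)
print(auc)

# Hypothetical feature spanning positions 3 to 5: its raw value is the
# coverage summed over its bases.
feature_cov <- sum(coverage[3:5])

# Scale to a common (hypothetical) target AUC so that samples with
# different sequencing depths become comparable.
target_auc <- 90
scaled <- feature_cov * target_auc / auc
print(scaled)
```

With these numbers the AUC is 45, the feature's summed coverage is 13, and the scaled value is 26, since the toy target is twice the observed AUC.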
  41. [Figure: reads and coverage along a gene with exons 1 to 4, isoform 1, isoform 2 and a potential isoform 3; exon-exon junctions jx 1 to jx 6; expressed region 1 marks a potential exon 5.]
  42. Collado-Torres et al, NAR, 2017

  43. Postmortem Human Brain Samples
    • Discovery data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
    • Replication data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
    Jaffe et al, Nat. Neuroscience, 2015
  44. Jaffe et al, Nat. Neuroscience, 2015

  45. BrainSpan data Jaffe et al, Nat. Neuroscience, 2015

  46. Since 2014

  47. NEW since BioC 3.10 (2019)

  48. Collaborators
    • UCSD: Shannon Ellis
    • Hopkins: Jeff Leek, Ben Langmead, Christopher Wilks, Kai Kammers, Kasper Hansen, Margaret Taub
    • OHSU: Abhinav Nellore
    • LIBD: Andrew Jaffe
    Funding: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer
  49. expression data for ~70,000 human samples
    (multiple) positions available
    This project involves the Hansen, Leek, Langmead and Battle labs at JHU & the Nellore lab at OHSU & the Jaffe lab at LIBD
    Contact:
    • Kasper D. Hansen www.hansenlab.org
    • Jeff Leek jtleek.com/
    • Ben Langmead www.langmead-lab.org/
    • Alexis Battle battlelab.jhu.edu/
    • Abhinav Nellore nellore.bio/
    • Andrew Jaffe aejaffe.com/
    + Leonardo Collado-Torres lcolladotor.github.io/
  50. help(package = recountWorkshop2020)
    vignette('recount-workshop', 'recountWorkshop2020')
    Leonardo Collado-Torres @fellgernon #BioC2020

  52. recount2 issues
    • Rail-RNA is hard to run outside a Hadoop cluster
    • No major updates & human-only: “where is my favorite dataset?”
    • Requires a lot of manual post-alignment work: hard to auto-update
    • Annotation choice (Gencode v25) is engrained in the R files
  53. Issue: overlapping genes doi.org/10.1101/gr.254656.11

  54. The future: recount3
    • Different aligner: still scalable
    • Mouse & human: studies + collections
    • Can be auto-updated (hopefully!)
    • Several annotation choices included
    • Faster tools for re-annotation quantification (faster than rtracklayer, bwtool, …)
    • R interface is more flexible: builds RangedSummarizedExperiment objects on the fly
    Coming to your nearest Bioconductor mirror in 2020!
  56. Questions? Got more?
    • Public questions >>>>> email (help others with similar questions to yours, thx!)
    • recount package: https://support.bioconductor.org/t/recount
    • recount2 feature requests: https://github.com/leekgroup/recount/issues
    Project     R package   Year
    ReCount     none        2011
    recount2    recount     2017
    recount3    recount3    2020